| 1 | +----------------+ |
| 2 | | Peers protocol | |
| 3 | |  version 1.8   | |
| 4 | +----------------+ |
| 5 | |
| 6 | |
| 7 | The peers protocol runs over TCP. Its aim is to transmit stick-table |
| 8 | entry information between several haproxy processes. |
| 9 | |
| 10 | This protocol is symmetrical. This means that at any time, each peer |
| 11 | may connect to the other peers it has been configured for, in order to send |
| 12 | its latest stick-table updates. There is no client or server role in this |
| 13 | protocol. As peers may connect to each other at the same time, the protocol |
| 14 | ensures that only one peer session stays open between a pair of peers |
| 15 | before they start sending their stick-table information, possibly in both |
| 16 | directions. |
| 17 | |
| 18 | |
| 19 | Handshake |
| 20 | +++++++++ |
| 21 | |
| 22 | Just after having connected to another one, a peer must identify itself |
| 23 | and identify the remote peer by sending a "hello" message. The remote peer |
| 24 | replies with a "status" message. |
| 25 | |
| 26 | A "hello" message is made of three lines, each terminated by a line feed |
| 27 | character, as follows: |
| 28 | |
| 29 | <protocol identifier> <version>\n |
| 30 | <remote peer identifier>\n |
| 31 | <local peer identifier> <process ID> <relative process ID>\n |
| 32 | |
| 33 | protocol identifier    : HAProxyS |
| 34 | version                : 2.1 |
| 35 | remote peer identifier : the peer name this "hello" message is sent to. |
| 36 | local peer identifier  : the name of the peer which sends this "hello" message. |
| 37 | process ID             : the ID of the process handling this peer session. |
| 38 | relative process ID    : the haproxy's relative process ID (0 if nbproc == 1). |
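
As an illustration, the "hello" message could be assembled as in the Python
sketch below. The function name build_hello and the use of ASCII for peer
names are assumptions made for this example only; the text above does not
specify a character encoding.

```python
def build_hello(remote_peer: str, local_peer: str, pid: int, relative_pid: int) -> bytes:
    """Build the three-line "hello" message (hypothetical helper)."""
    lines = [
        "HAProxyS 2.1",                        # <protocol identifier> <version>
        remote_peer,                           # <remote peer identifier>
        f"{local_peer} {pid} {relative_pid}",  # <local peer identifier> <process ID> <relative process ID>
    ]
    # Every line, including the last one, is terminated by a line feed.
    # ASCII is an assumption of this sketch.
    return "".join(line + "\n" for line in lines).encode("ascii")
```

For instance, build_hello("peerB", "peerA", 1234, 0) produces the three lines
"HAProxyS 2.1", "peerB" and "peerA 1234 0", each followed by a line feed.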
| 39 | |
| 40 | The "status" message is made of a single line terminated by a line feed |
| 41 | character as follows: |
| 42 | |
| 43 | <status code>\n |
| 44 | |
| 45 | where the status code is a three-digit number with one of these values: |
| 46 | |
| 47 | +-------------+---------------------------------+ |
| 48 | | status code | meaning                         | |
| 49 | +-------------+---------------------------------+ |
| 50 | | 200         | Handshake succeeded             | |
| 51 | +-------------+---------------------------------+ |
| 52 | | 300         | Try again later                 | |
| 53 | +-------------+---------------------------------+ |
| 54 | | 501         | Protocol error                  | |
| 55 | +-------------+---------------------------------+ |
| 56 | | 502         | Bad version                     | |
| 57 | +-------------+---------------------------------+ |
| 58 | | 503         | Local peer identifier mismatch  | |
| 59 | +-------------+---------------------------------+ |
| 60 | | 504         | Remote peer identifier mismatch | |
| 61 | +-------------+---------------------------------+ |
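
A receiver of the "status" line might validate and interpret it as sketched
below in Python. The names STATUS_MEANINGS and parse_status are hypothetical,
and treating codes outside the table as "unknown" is a choice of this sketch,
not a rule stated by the protocol.

```python
from typing import Tuple

# Status codes listed in the table above.
STATUS_MEANINGS = {
    200: "Handshake succeeded",
    300: "Try again later",
    501: "Protocol error",
    502: "Bad version",
    503: "Local peer identifier mismatch",
    504: "Remote peer identifier mismatch",
}

def parse_status(line: bytes) -> Tuple[int, str]:
    """Parse a "<status code>\\n" line into (code, meaning)."""
    if not line.endswith(b"\n"):
        raise ValueError("status line must end with a line feed")
    text = line[:-1].decode("ascii")
    if len(text) != 3 or not text.isdigit():
        raise ValueError("status code must be a three-digit number")
    code = int(text)
    # Codes missing from the table are reported as unknown (sketch choice).
    return code, STATUS_MEANINGS.get(code, "unknown status code")
```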
| 62 | |
| 63 | As the protocol is symmetrical, some peers may connect to each other at the |
| 64 | same time. For efficiency reasons, the protocol ensures that only one TCP |
| 65 | session remains open after the handshake has succeeded and before any |
| 66 | stick-table data is transmitted. For each pair of peers, the last peer to |
| 67 | connect wins. Each time a peer A receives a "hello" message from a peer B, |
| 68 | peer A checks whether it already managed to open a peer session with |
| 69 | peer B, that is, one with a successful handshake. If so, peer A closes |
| 70 | its own peer session, so the peer session opened by B is the one which |
| 71 | stays open. |
| 72 | |
| 73 | |
| 74 |   Peer A               Peer B |
| 75 |              hello |
| 76 |     ----------------------> |
| 77 |           status 200 |
| 78 |     <---------------------- |
| 79 |              hello |
| 80 |     <++++++++++++++++++++++ |
| 81 |           TCP/FIN-ACK |
| 82 |     ----------------------> |
| 83 |           TCP/FIN-ACK |
| 84 |     <---------------------- |
| 85 |           status 200 |
| 86 |     ++++++++++++++++++++++> |
| 87 |              data |
| 88 |     <++++++++++++++++++++++ |
| 89 |              data |
| 90 |     ++++++++++++++++++++++> |
| 91 |              data |
| 92 |     ++++++++++++++++++++++> |
| 93 |              data |
| 94 |     <++++++++++++++++++++++ |
| 95 |                . |
| 96 |                . |
| 97 |                . |
| 98 | |
| 99 | As it is still possible for a pair of peers to close both their peer |
| 100 | sessions at the same time, the protocol ensures that peers will not |
| 101 | reconnect at the same time by adding a random delay (50 up to 2050 ms) |
| 102 | before any reconnection. |
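
The "last connected peer wins" rule and the random reconnection delay can be
sketched as follows in Python. The PeerLink class, its attribute and the
returned action strings are hypothetical names invented for this example.
Treating both bounds of the delay as inclusive is also an assumption.

```python
import random

class PeerLink:
    """State kept by peer A about its own session towards one peer B (hypothetical)."""

    def __init__(self) -> None:
        # True once the session opened by A has completed its handshake.
        self.outgoing_established = False

    def on_hello_received(self) -> str:
        """React to a "hello" coming from B on a session opened by B."""
        had_session = self.outgoing_established
        # A gives up its own session so that only B's session stays open.
        self.outgoing_established = False
        return "close-outgoing-then-accept" if had_session else "accept"

def reconnect_delay_ms(rng=random) -> int:
    """Random delay before reconnecting, in milliseconds (50 up to 2050)."""
    return rng.randint(50, 2050)
```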
| 103 | |
| 104 | |
| 105 | Encoding |
| 106 | ++++++++ |
| 107 | |
| 108 | As some TCP data may be corrupted, for integrity reasons, some data fields |
| 109 | are encoded at peer session level. |
| 110 | |
| 111 | The following algorithms explain how to encode/decode the data. |
| 112 | |
| 113 | encode: |
| 114 |     input : val (64-bit integer) |
| 115 |     output: bitf (variable-length bitfield) |
| 116 | |
| 117 |     if val is less than 0xf0 |
| 118 |         set the next byte of bitf to the value of val |
| 119 |         return bitf |
| 120 | |
| 121 |     set the next byte of bitf to the value of val OR'ed with 0xf0 |
| 122 |     subtract 0xf0 from val |
| 123 |     right shift val by 4 |
| 124 | |
| 125 |     while val is greater than or equal to 0x80: |
| 126 |         set the next byte of bitf to the value of the byte made of the last |
| 127 |         7 bits of val OR'ed with 0x80 |
| 128 |         subtract 0x80 from val |
| 129 |         right shift val by 7 |
| 130 | |
| 131 |     set the next byte of bitf to the value of val |
| 132 |     return bitf |
| 133 | |
| 134 | decode: |
| 135 |     input : bitf (variable-length bitfield) |
| 136 |     output: val (64-bit integer) |
| 137 | |
| 138 |     set val to the value of the first byte of bitf |
| 139 |     if val is less than 0xf0 (bits 4 up to 7 are not all set) |
| 140 |         return val |
| 141 | |
| 142 |     set loop to 0 |
| 143 |     do |
| 144 |         add to val the value of the next byte of bitf left shifted by (4 + 7*loop) |
| 145 |         set loop to (loop + 1) |
| 146 |     while bit 7 of the byte just added is set |
| 147 |     return val |
| 148 | |
| 149 | Example: |
| 150 | |
| 151 | Let's say that we must encode 0x1234, which is not less than 0xf0. |
| 152 | |
| 153 | "set the next byte of bitf to the value of val OR'ed with 0xf0" |
| 154 | => bitf[0] = (0x1234 | 0xf0) & 0xff = 0xf4 |
| 155 | |
| 156 | "subtract 0xf0 from val" |
| 157 | => val = 0x1144 |
| 158 | |
| 159 | "right shift val by 4" |
| 160 | => val = 0x114 |
| 161 | |
| 162 | "set the next byte of bitf to the value of the byte made of the last |
| 163 | 7 bits of val OR'ed with 0x80" |
| 164 | => bitf[1] = (0x114 | 0x80) & 0xff = 0x94 |
| 165 | |
| 166 | "subtract 0x80 from val" |
| 167 | => val = 0x94 |
| 168 | |
| 169 | "right shift val by 7" |
| 170 | => val = 0x1 |
| 171 | |
| 172 | => bitf[2] = 0x1 |
| 173 | |
| 174 | So, the encoded value of 0x1234 is 0xf49401. |
| 175 | |
| 176 | To decode this value: |
| 177 | |
| 178 | "set val to the value of the first byte of bitf" |
| 179 | => val = 0xf4 |
| 180 | |
| 181 | "add to val the value of the next byte of bitf left shifted by 4" |
| 182 | => val = 0xf4 + (0x94 << 4) = 0xf4 + 0x940 = 0xa34 |
| 183 | |
| 184 | "add to val the value of the next byte of bitf left shifted by (4 + 7)" |
| 185 | => val = 0xa34 + (0x01 << 11) = 0xa34 + 0x800 = 0x1234 |
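
The two algorithms and the worked example above can be checked with a short
Python sketch. The function names encode and decode mirror the pseudo-code.
Returning the position of the next unread byte from decode is a convenience
added for this example.

```python
from typing import Tuple

def encode(val: int) -> bytes:
    """Encode a 64-bit unsigned integer into the variable-length form."""
    if not 0 <= val < (1 << 64):
        raise ValueError("val must fit in 64 bits")
    if val < 0xF0:
        return bytes([val])
    out = bytearray([(val | 0xF0) & 0xFF])
    val = (val - 0xF0) >> 4
    while val >= 0x80:
        out.append((val | 0x80) & 0xFF)
        val = (val - 0x80) >> 7
    out.append(val)
    return bytes(out)

def decode(bitf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one value starting at pos; return (val, next position)."""
    val = bitf[pos]
    pos += 1
    if val < 0xF0:
        return val, pos
    shift = 4
    while True:
        byte = bitf[pos]
        pos += 1
        val += byte << shift
        shift += 7
        if byte < 0x80:  # bit 7 clear: this was the last byte
            return val, pos

print(encode(0x1234).hex())                     # f49401
print(hex(decode(bytes.fromhex("f49401"))[0]))  # 0x1234
```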
| 186 | |
| 187 | |
| 188 | Messages |
| 189 | ++++++++ |
| 190 | |
| 191 | *** General *** |
| 192 | |
| 193 | After the handshake has successfully completed, peers are authorized to send |
| 194 | some messages to each other, possibly in both directions. |
| 195 | |
| 196 | All messages consist of at least a two-byte header. |
| 197 | |
| 198 | The first byte of this header identifies the class of the message. The next |
| 199 | byte identifies the type of message in the class. |
| 200 | |
| 201 | Some of these messages are variable-length. Others have a fixed size. |
| 202 | Variable-length messages are identified by the value of the message type |
| 203 | byte. For such messages, it is greater than or equal to 128. |
| 204 | |
| 205 | All variable-length message headers must be followed by the encoded length |
| 206 | of the remaining bytes (that is, the total length of the message minus the |
| 207 | 2 header bytes and minus the size of the encoded length field itself). |
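
As a sketch of these rules, the Python function below reads the two header
bytes and, for variable-length messages only, the encoded length that
follows. The names parse_header and decode_varint are hypothetical.
decode_varint implements the decoding algorithm of the "Encoding" section.

```python
from typing import Tuple

def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one encoded integer at pos; return (value, next position)."""
    val = buf[pos]
    pos += 1
    if val < 0xF0:
        return val, pos
    shift = 4
    while True:
        byte = buf[pos]
        pos += 1
        val += byte << shift
        shift += 7
        if byte < 0x80:
            return val, pos

def parse_header(buf: bytes, pos: int = 0) -> Tuple[int, int, int, int]:
    """Return (class, type, payload length, payload start position)."""
    msg_class, msg_type = buf[pos], buf[pos + 1]
    pos += 2
    length = 0
    if msg_type >= 128:
        # Variable-length message: the encoded length follows the header.
        length, pos = decode_varint(buf, pos)
    return msg_class, msg_type, length, pos
```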
| 208 | |
| 209 | There exist four classes of messages: |
| 210 | |
| 211 | +------------+---------------------+--------------+ |
| 212 | | class byte | meaning             | message size | |
| 213 | +------------+---------------------+--------------+ |
| 214 | | 0          | control             | fixed (2)    | |
| 215 | +------------+---------------------+--------------+ |
| 216 | | 1          | error               | fixed (2)    | |
| 217 | +------------+---------------------+--------------+ |
| 218 | | 10         | stick-table updates | variable     | |
| 219 | +------------+---------------------+--------------+ |
| 220 | | 255        | reserved            |              | |
| 221 | +------------+---------------------+--------------+ |
| 222 | |
| 223 | At the time of this writing, only control and error messages have a fixed |
| 224 | size of two bytes (header only). The stick-table update messages are all |
| 225 | variable-length (their message type bytes are greater than or equal to 128). |
| 226 | |
| 227 | |
| 228 | *** Control message class *** |
| 229 | |
| 230 | At the time of this writing, control messages are fixed-length messages used |
| 231 | only to control the synchronizations between local and/or remote processes. |
| 232 | |
| 233 | There exist four types of such control messages: |
| 234 | |
| 235 | +------------+--------------------------------------------------------+ |
| 236 | | type byte  | meaning                                                | |
| 237 | +------------+--------------------------------------------------------+ |
| 238 | | 0          | synchronization request: ask a remote peer for a full  | |
| 239 | |            | synchronization                                        | |
| 240 | +------------+--------------------------------------------------------+ |
| 241 | | 1          | synchronization finished: signal a remote peer that    | |
| 242 | |            | local updates have been pushed and local is considered | |
| 243 | |            | up to date.                                            | |
| 244 | +------------+--------------------------------------------------------+ |
| 245 | | 2          | synchronization partial: signal a remote peer that     | |
| 246 | |            | local updates have been pushed and local is not        | |
| 247 | |            | considered up to date.                                 | |
| 248 | +------------+--------------------------------------------------------+ |
| 249 | | 3          | synchronization confirmed: acknowledge a finished or   | |
| 250 | |            | partial synchronization message.                       | |
| 251 | +------------+--------------------------------------------------------+ |
| 252 | |
| 253 | |
| 254 | *** Error message class *** |
| 255 | |
| 256 | There exist two types of such error messages: |
| 257 | |
| 258 | +-----------+------------------+ |
| 259 | | type byte | meaning          | |
| 260 | +-----------+------------------+ |
| 261 | | 0         | protocol error   | |
| 262 | +-----------+------------------+ |
| 263 | | 1         | size limit error | |
| 264 | +-----------+------------------+ |
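
Since control and error messages consist of the two header bytes only, they
can be built directly from the class and type values listed in the tables
above. The enumeration and function names in this Python sketch are
hypothetical. Only the numeric values come from this document.

```python
from enum import IntEnum

class MsgClass(IntEnum):
    CONTROL = 0
    ERROR = 1
    STICK_TABLE_UPDATE = 10
    RESERVED = 255

class ControlType(IntEnum):
    SYNC_REQUEST = 0
    SYNC_FINISHED = 1
    SYNC_PARTIAL = 2
    SYNC_CONFIRMED = 3

class ErrorType(IntEnum):
    PROTOCOL = 0
    SIZE_LIMIT = 1

def fixed_message(msg_class: int, msg_type: int) -> bytes:
    """Build a fixed-size (two-byte) control or error message."""
    if msg_class not in (MsgClass.CONTROL, MsgClass.ERROR):
        raise ValueError("only control and error messages have a fixed size")
    if not 0 <= msg_type < 128:
        raise ValueError("fixed-size messages have a type byte below 128")
    return bytes([msg_class, msg_type])
```

For example, fixed_message(MsgClass.CONTROL, ControlType.SYNC_REQUEST) gives
the two bytes 0x00 0x00, which ask the remote peer for a full synchronization.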
| 265 | |
| 266 | |
| 267 | *** Stick-table update message class *** |
| 268 | |
| 269 | This class is the most important one because it deals with the handling |
| 270 | of stick-table entries between peers, which is at the core of the peers |
| 271 | protocol. |
| 272 | |
| 273 | All the messages of this class are variable-length. Their type bytes are |
| 274 | all greater than or equal to 128. |
| 275 | |
| 276 | There exist five types of such stick-table update messages: |
| 277 | |
| 278 | +-----------+--------------------------------+ |
| 279 | | type byte | meaning                        | |
| 280 | +-----------+--------------------------------+ |
| 281 | | 128       | Entry update                   | |
| 282 | +-----------+--------------------------------+ |
| 283 | | 129       | Incremental entry update       | |
| 284 | +-----------+--------------------------------+ |
| 285 | | 130       | Stick-table definition         | |
| 286 | +-----------+--------------------------------+ |
| 287 | | 131       | Stick-table switch (unused)    | |
| 288 | +-----------+--------------------------------+ |
| 289 | | 133       | Update message acknowledgement | |
| 290 | +-----------+--------------------------------+ |
| 291 | |
| 292 | Note that entry update messages may be multiplexed. This means that different |
| 293 | entry update messages for different stick-tables may be sent over the same |
| 294 | peer session. |
| 295 | |
| 296 | To do so, each time entry update messages have to be sent, they must be |
| 297 | preceded by a stick-table definition message. This remains true for |
| 298 | incremental entry update messages. |
| 299 | |
| 300 | As their name indicates, "Update message acknowledgement" messages are used |
| 301 | to acknowledge the entry update messages. |
| 302 | |
| 303 | The following paragraphs give some information about the format of each |
| 304 | stick-table update message. The simple legend below will help in |
| 305 | understanding it. The unit used is the octet. |
| 306 | |
| 307 |      XX |
| 308 | +-----------+ |
| 309 | |    foo    | Single fixed-size "foo" field, made of XX octets. |
| 310 | +-----------+ |
| 311 | |
| 312 | +===========+ |
| 313 | |    foo    | Variable-length "foo" field. |
| 314 | +===========+ |
| 315 | |
| 316 | +xxxxxxxxxxx+ |
| 317 | |    foo    | Encoded variable-length "foo" field. |
| 318 | +xxxxxxxxxxx+ |
| 319 | |
| 320 | +###########+ |
| 321 | |    foo    | "foo" field described below. |
| 322 | +###########+ |
| 323 | |
| 324 | |
| 325 | With this legend, all the stick-table update messages have the following header: |
| 326 | |
| 327 |           1                      1 |
| 328 | +--------------------+------------------------+xxxxxxxxxxxxxxxx+ |
| 329 | | Message Class (10) | Message type (128-133) | Message length | |
| 330 | +--------------------+------------------------+xxxxxxxxxxxxxxxx+ |
| 331 | |
| 332 | Note that, to help different versions of the peers protocol communicate, |
| 333 | such stick-table update messages may be extended by adding non-mandatory |
| 334 | fields at the end of such messages, announcing a total message length |
| 335 | which is greater than the message length of the previous versions of the |
| 336 | peers protocol. After having parsed the fields it knows, a peer skips the |
| 337 | remaining bytes of such a message to parse the next message. |
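
One way to obtain this behaviour is to always advance by the announced
message length, whatever part of the payload the parser understands. The
Python sketch below splits a byte stream into messages on that basis. The
names split_messages and decode_varint are hypothetical, and the stream is
assumed to hold complete messages only.

```python
from typing import List, Tuple

def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one encoded integer at pos; return (value, next position)."""
    val = buf[pos]
    pos += 1
    if val < 0xF0:
        return val, pos
    shift = 4
    while True:
        byte = buf[pos]
        pos += 1
        val += byte << shift
        shift += 7
        if byte < 0x80:
            return val, pos

def split_messages(buf: bytes) -> List[Tuple[int, int, bytes]]:
    """Split a stream into (class, type, payload) tuples."""
    messages = []
    pos = 0
    while pos < len(buf):
        msg_class, msg_type = buf[pos], buf[pos + 1]
        pos += 2
        length = 0
        if msg_type >= 128:
            length, pos = decode_varint(buf, pos)
        payload = buf[pos:pos + length]
        if len(payload) != length:
            raise ValueError("truncated message")
        messages.append((msg_class, msg_type, payload))
        # Jumping by the announced length skips any trailing field that
        # this parser does not know about.
        pos += length
    return messages
```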
| 338 | |
| 339 | - Definition message format: |
| 340 | |
| 341 | Before sending entry update messages, a peer must announce the configuration |
| 342 | of the stick-table these messages relate to by means of a |
| 343 | "Stick-table definition" message with the following format: |
| 344 | |
| 345 | +xxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxx+==================+ |
| 346 | | Stick-table ID | Stick-table name length | Stick-table name | |
| 347 | +xxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxx+==================+ |
| 348 | |
| 349 | +xxxxxxxxxxxx+xxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxx+ |
| 350 | | Key type | Key length | Data types bitfield | Expiry | |
| 351 | +xxxxxxxxxxxx+xxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxx+ |
| 352 | |
| 353 | +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+ |
| 354 | | Frequency counter #1 | Frequency counter #1 period | |
| 355 | +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+ |
| 356 | |
| 357 | +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+ |
| 358 | | Frequency counter #2 | Frequency counter #2 period | |
| 359 | +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+ |
| 360 | . |
| 361 | . |
| 362 | . |
| 363 | |
| 364 | Note that "Stick-table ID" field is an encoded integer which is used to |
| 365 | identify the stick-table without using its name (or "Stick-table name" |
| 366 | field). It is local to the process handling the stick-table. So we can have |
| 367 | two peers attached to processes which generate stick-table updates for |
| 368 | the same stick-table (same name) but with different stick-table IDs. |
| 369 | |
| 370 | Also note that the list of "Frequency counter #X" fields and their associated |
| 371 | period fields exists only if their underlying types are defined |
| 372 | in the "Data types bitfield" field. |
| 373 | |
| 374 | The "Expiry" field and the remaining ones are not used by all the existing |
| 375 | versions of haproxy peers. But they are MANDATORY, so that a stick-table |
| 376 | aggregator peer is able to autoconfigure itself. |
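
The payload of a "Stick-table definition" message (what follows the message
header and its encoded length) could be decoded as in the Python sketch
below. The function names, the dictionary keys and the ASCII decoding of the
table name are assumptions of this example. Numeric fields are returned as
plain integers because their values are not listed in this document, and
everything after "Expiry" is read as (frequency counter, period) pairs.

```python
from typing import Tuple

def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one encoded integer at pos; return (value, next position)."""
    val = buf[pos]
    pos += 1
    if val < 0xF0:
        return val, pos
    shift = 4
    while True:
        byte = buf[pos]
        pos += 1
        val += byte << shift
        shift += 7
        if byte < 0x80:
            return val, pos

def parse_definition(payload: bytes) -> dict:
    """Decode the fields of a stick-table definition payload."""
    pos = 0
    table_id, pos = decode_varint(payload, pos)
    name_len, pos = decode_varint(payload, pos)
    name = payload[pos:pos + name_len].decode("ascii")  # assumption: ASCII name
    pos += name_len
    key_type, pos = decode_varint(payload, pos)
    key_len, pos = decode_varint(payload, pos)
    data_types, pos = decode_varint(payload, pos)
    expiry, pos = decode_varint(payload, pos)
    counters = []
    while pos < len(payload):  # optional (counter, period) pairs
        counter, pos = decode_varint(payload, pos)
        period, pos = decode_varint(payload, pos)
        counters.append((counter, period))
    return {"table_id": table_id, "name": name, "key_type": key_type,
            "key_len": key_len, "data_types": data_types,
            "expiry": expiry, "counters": counters}
```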
| 377 | |
| 378 | |
| 379 | - Entry update message format: |
| 380 |          4 |
| 381 | +-----------------+###########+############+ |
| 382 | | Local update ID |    Key    |    Data    | |
| 383 | +-----------------+###########+############+ |
| 384 | |
| 385 | with "Key" described as follows: |
| 386 | |
| 387 | +xxxxxxxxxxx+=======+ |
| 388 | |  length   | value |  if key type is (non null terminated) "string", |
| 389 | +xxxxxxxxxxx+=======+ |
| 390 | |
| 391 |     4 |
| 392 | +-------+ |
| 393 | | value |  if key type is "integer", |
| 394 | +-------+ |
| 395 | |
| 396 | +=======+ |
| 397 | | value |  for other key types: the size is announced in |
| 398 | +=======+  the previous stick-table definition message. |
| 399 | |
| 400 | "Data" field is basically a list of encoded values for each type announced |
| 401 | by the "Data types bitfield" field of the previous "Stick-table definition" |
| 402 | message: |
| 403 | |
| 404 | +xxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxx+ +xxxxxxxxxxxxxxxxxxxx+ |
| 405 | | Data type #1 value | Data type #2 value | .... | Data type #n value | |
| 406 | +xxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxx+ +xxxxxxxxxxxxxxxxxxxx+ |
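
The layout above can be read with a Python sketch such as the one below. It
rests on several assumptions which are not stated in this document: the
4-octet "Local update ID" is taken to be in network byte order (big-endian),
the key kind is passed as one of the hypothetical labels "string", "integer"
or "other", and the "Data" field is taken to hold exactly one encoded value
per bit set in the "Data types bitfield", in increasing bit order.

```python
import struct
from typing import Dict, Tuple

def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one encoded integer at pos; return (value, next position)."""
    val = buf[pos]
    pos += 1
    if val < 0xF0:
        return val, pos
    shift = 4
    while True:
        byte = buf[pos]
        pos += 1
        val += byte << shift
        shift += 7
        if byte < 0x80:
            return val, pos

def parse_entry_update(payload: bytes, key_kind: str, key_len: int,
                       data_types: int) -> Tuple[int, bytes, Dict[int, int]]:
    """Return (local update ID, raw key bytes, {data type bit: value})."""
    (update_id,) = struct.unpack_from(">I", payload, 0)  # assumption: big-endian
    pos = 4
    if key_kind == "string":
        length, pos = decode_varint(payload, pos)
    elif key_kind == "integer":
        length = 4
    else:
        length = key_len  # announced by the stick-table definition message
    key = payload[pos:pos + length]
    pos += length
    values: Dict[int, int] = {}
    bit = 0
    while data_types >> bit:
        if (data_types >> bit) & 1:
            values[bit], pos = decode_varint(payload, pos)
        bit += 1
    return update_id, key, values
```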
| 407 | |
| 408 | |
| 409 | |
| 410 | - Update message acknowledgement format: |
| 411 | |
| 412 | These messages are responses to "Entry update" messages only. |
| 413 | |
| 414 | Their format is very basic for efficiency reasons: |
| 415 | |
| 416 |                        4 |
| 417 | +xxxxxxxxxxxxxxxx+-----------+ |
| 418 | | Stick-table ID | Update ID | |
| 419 | +xxxxxxxxxxxxxxxx+-----------+ |
| 420 | |
| 421 | |
| 422 | Note that the "Stick-table ID" field value refers to the one which has |
| 423 | been previously announced by a "Stick-table definition" message. |
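
A complete acknowledgement message, header included, might be built as in
this Python sketch. The function names are hypothetical, the class (10) and
type (133) bytes are the ones given in the tables of this document, and the
big-endian layout of the 4-octet "Update ID" is an assumption.

```python
import struct

def encode_varint(val: int) -> bytes:
    """Encode an unsigned integer as described in the "Encoding" section."""
    if val < 0xF0:
        return bytes([val])
    out = bytearray([(val | 0xF0) & 0xFF])
    val = (val - 0xF0) >> 4
    while val >= 0x80:
        out.append((val | 0x80) & 0xFF)
        val = (val - 0x80) >> 7
    out.append(val)
    return bytes(out)

def build_ack(table_id: int, update_id: int) -> bytes:
    """Build an "Update message acknowledgement" message."""
    payload = encode_varint(table_id) + struct.pack(">I", update_id)  # assumption: big-endian
    header = bytes([10, 133])  # message class, message type
    return header + encode_varint(len(payload)) + payload
```

With these assumptions, build_ack(1, 5) yields the eight bytes
0a 85 05 01 00 00 00 05.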
| 424 | |
| 425 | The following schema may help in understanding how to handle a stream of |
| 426 | stick-table update messages. The handshake step is not represented. |
| 427 | Stick-table IDs are preceded by a '#' character. |
| 428 | |
| 429 | |
| 430 |   Peer A               Peer B |
| 431 | |
| 432 |          stkt def. #1 |
| 433 |     ----------------------> |
| 434 |          updates (1-5) |
| 435 |     ----------------------> |
| 436 |          stkt def. #3 |
| 437 |     ----------------------> |
| 438 |       updates (1000-1005) |
| 439 |     ----------------------> |
| 440 | |
| 441 |          stkt def. #2 |
| 442 |     <---------------------- |
| 443 |         updates (10-15) |
| 444 |     <---------------------- |
| 445 |          ack 5 for #1 |
| 446 |     <---------------------- |
| 447 |         ack 1005 for #3 |
| 448 |     <---------------------- |
| 449 |          stkt def. #4 |
| 450 |     <---------------------- |
| 451 |        updates (100-105) |
| 452 |     <---------------------- |
| 453 | |
| 454 |          ack 10 for #2 |
| 455 |     ----------------------> |
| 456 |         ack 105 for #4 |
| 457 |     ----------------------> |
| 458 |     (from here, on both sides, all stick-table updates |
| 459 |      are considered as received) |
| 460 | |