| 1 | +--------------------+ |
| 2 | | Peers protocol 2.1 | |
| 3 | +--------------------+ |
| 4 | |
| 5 | |
| 6 | The peers protocol is implemented over TCP. Its aim is to transmit |
| 7 | stick-table entry information between several haproxy processes. |
| 8 | |
| 9 | This protocol is symmetrical. This means that at any time, each peer |
| 10 | may connect to the other peers it has been configured for, in order to send |
| 11 | its latest stick-table updates. There is no client or server role in this |
| 12 | protocol. As peers may connect to each other at the same time, the protocol |
| 13 | ensures that only one peer session stays open between a pair of peers |
| 14 | before they start sending their stick-table information, possibly in both |
| 15 | directions (or not). |
| 16 | |
| 17 | |
| 18 | Handshake |
| 19 | +++++++++ |
| 20 | |
| 21 | Just after having connected to another one, a peer must identify itself |
| 22 | and identify the remote peer, sending a "hello" message. The remote peer |
| 23 | replies with a "status" message. |
| 24 | |
| 25 | A "hello" message is made of three lines, each terminated by a line feed character, |
| 26 | as follows: |
| 27 | |
| 28 | <protocol identifier> <version>\n |
| 29 | <remote peer identifier>\n |
| 30 | <local peer identifier> <process ID> <relative process ID>\n |
| 31 | |
| 32 | protocol identifier : HAProxyS |
| 33 | version : 2.1 |
| 34 | remote peer identifier: the peer name this "hello" message is sent to. |
| 35 | local peer identifier : the name of the peer which sends this "hello" message. |
| 36 | process ID : the ID of the process handling this peer session. |
| 37 | relative process ID : the haproxy's relative process ID (0 if nbproc == 1). |
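The three lines above can be assembled as follows. This is only a minimal sketch: the function name `build_hello` and the peer names used in the usage note are hypothetical, and ASCII peer names are assumed.

```python
def build_hello(remote_peer: str, local_peer: str, pid: int, relative_pid: int) -> bytes:
    """Build the three-line "hello" message, each line ending with a line feed."""
    lines = [
        "HAProxyS 2.1",                        # <protocol identifier> <version>
        remote_peer,                           # <remote peer identifier>
        f"{local_peer} {pid} {relative_pid}",  # <local peer identifier> <process ID> <relative process ID>
    ]
    # Assumption: peer names are plain ASCII.
    return "".join(line + "\n" for line in lines).encode("ascii")
```

For instance, `build_hello("peerB", "peerA", 1234, 0)` yields the bytes of "HAProxyS 2.1", "peerB" and "peerA 1234 0", each followed by a line feed.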
| 38 | |
| 39 | The "status" message is made of a single line terminated by a line feed |
| 40 | character as follows: |
| 41 | |
| 42 | <status code>\n |
| 43 | |
| 44 | with these values as status code (a three-digit number): |
| 45 | |
| 46 | +-------------+---------------------------------+ |
| 47 | | status code | meaning | |
| 48 | +-------------+---------------------------------+ |
| 49 | | 200 | Handshake succeeded | |
| 50 | +-------------+---------------------------------+ |
| 51 | | 300 | Try again later | |
| 52 | +-------------+---------------------------------+ |
| 53 | | 501 | Protocol error | |
| 54 | +-------------+---------------------------------+ |
| 55 | | 502 | Bad version | |
| 56 | +-------------+---------------------------------+ |
| 57 | | 503 | Local peer identifier mismatch | |
| 58 | +-------------+---------------------------------+ |
| 59 | | 504 | Remote peer identifier mismatch | |
| 60 | +-------------+---------------------------------+ |
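A receiver could validate a "status" line along these lines. This is a sketch only: `parse_status` and `STATUS_MEANINGS` are hypothetical names, and the strictness of the checks is an assumption, not something this document specifies.

```python
# Status codes and meanings copied from the table above.
STATUS_MEANINGS = {
    200: "Handshake succeeded",
    300: "Try again later",
    501: "Protocol error",
    502: "Bad version",
    503: "Local peer identifier mismatch",
    504: "Remote peer identifier mismatch",
}

def parse_status(line: bytes) -> int:
    """Parse "<status code>\\n" and return the code as an integer."""
    if not line.endswith(b"\n"):
        raise ValueError("status line must end with a line feed")
    text = line[:-1].decode("ascii")
    if len(text) != 3 or not text.isdigit():
        raise ValueError("status code must be a three-digit number")
    return int(text)
```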
| 61 | |
| 62 | As the protocol is symmetrical, some peers may connect to each other at the |
| 63 | same time. For efficiency reasons, the protocol ensures there is only |
| 64 | one TCP session open after the handshake has succeeded and before transmitting |
| 65 | any stick-table data. In fact, for each pair of peers, it is |
| 66 | the last connected peer which wins. Each time a peer A receives a "hello" |
| 67 | message from a peer B, peer A checks whether it has already managed to open a peer |
| 68 | session with peer B, that is, with a successful handshake. If this is the case, |
| 69 | peer A closes its peer session. So it is the peer session opened by B |
| 70 | which stays open. |
| 71 | |
| 72 | |
| 73 | Peer A Peer B |
| 74 | hello |
| 75 | ----------------------> |
| 76 | status 200 |
| 77 | <---------------------- |
| 78 | hello |
| 79 | <++++++++++++++++++++++ |
| 80 | TCP/FIN-ACK |
| 81 | ----------------------> |
| 82 | TCP/FIN-ACK |
| 83 | <---------------------- |
| 84 | status 200 |
| 85 | ++++++++++++++++++++++> |
| 86 | data |
| 87 | <++++++++++++++++++++++ |
| 88 | data |
| 89 | ++++++++++++++++++++++> |
| 90 | data |
| 91 | ++++++++++++++++++++++> |
| 92 | data |
| 93 | <++++++++++++++++++++++ |
| 94 | . |
| 95 | . |
| 96 | . |
| 97 | |
| 98 | As it is still possible that a pair of peers decide to close both their |
| 99 | peer sessions at the same time, the protocol ensures peers will not reconnect |
| 100 | at the same time, adding a random delay (50 up to 2050 ms) before any |
| 101 | reconnection. |
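The reconnection delay can be drawn as in the sketch below. It assumes that both bounds are inclusive and that the distribution is uniform, neither of which this document states. `reconnect_delay_ms` is a hypothetical name.

```python
import random

def reconnect_delay_ms() -> int:
    """Random delay before reconnecting, in milliseconds (50 up to 2050)."""
    # Assumption: uniform distribution with both bounds included.
    return random.randint(50, 2050)
```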
| 102 | |
| 103 | |
| 104 | Encoding |
| 105 | ++++++++ |
| 106 | |
| 107 | As some TCP data may be corrupted, for integrity reasons, some data fields |
| 108 | are encoded at peer session level. |
| 109 | |
| 110 | The following algorithms explain how to encode/decode the data. |
| 111 | |
| 112 | encode: |
| 113 | input : val (64-bit integer) |
| 114 | output: bitf (variable-length bitfield) |
| 115 | |
| 116 | if val is less than 0xf0 |
| 117 | set the next byte of bitf to the value of val |
| 118 | return bitf |
| 119 | |
| 120 | set the next byte of bitf to the value of val OR'ed with 0xf0 |
| 121 | subtract 0xf0 from val |
| 122 | right shift val by 4 |
| 123 | |
| 124 | while val is greater than or equal to 0x80: |
| 125 | set the next byte of bitf to the value of the byte made of the last |
| 126 | 7 bits of val OR'ed with 0x80 |
| 127 | subtract 0x80 from val |
| 128 | right shift val by 7 |
| 129 | |
| 130 | set the next byte of bitf to the value of val |
| 131 | return bitf |
| 132 | |
| 133 | decode: |
| 134 | input : bitf (variable-length bitfield) |
| 135 | output: val (64-bit integer) |
| 136 | |
| 137 | set val to the value of the first byte of bitf |
| 138 | if val is less than 0xf0 (bits 4 up to 7 are not all set) |
| 139 | return val |
| 140 | |
| 141 | set loop to 0 |
| 142 | do |
| 143 | add to val the value of the next byte of bitf left shifted by (4 + 7*loop) |
| 144 | set loop to (loop + 1) |
| 145 | while bit 7 of the byte which has just been added is set |
| 146 | return val |
| 147 | |
| 148 | Example: |
| 149 | |
| 150 | let's say that we must encode 0x1234. |
| 151 | |
| 152 | "set the next byte of bitf to the value of val OR'ed with 0xf0" |
| 153 | => bitf[0] = (0x1234 | 0xf0) & 0xff = 0xf4 |
| 154 | |
| 155 | "subtract 0xf0 from val" |
| 156 | => val = 0x1144 |
| 157 | |
| 158 | "right shift val by 4" |
| 159 | => val = 0x114 |
| 160 | |
| 161 | "set the next byte of bitf to the value of the byte made of the last |
| 162 | 7 bits of val OR'ed with 0x80" |
| 163 | => bitf[1] = (0x114 | 0x80) & 0xff = 0x94 |
| 164 | |
| 165 | "subtract 0x80 from val" |
| 166 | => val = 0x94 |
| 167 | |
| 168 | "right shift val by 7" |
| 169 | => val = 0x1 |
| 170 | |
| 171 | => bitf[2] = 0x1 |
| 172 | |
| 173 | So, the encoded value of 0x1234 is 0xf49401. |
| 174 | |
| 175 | To decode this value: |
| 176 | |
| 177 | "set val to the value of the first byte of bitf" |
| 178 | => val = 0xf4 |
| 179 | |
| 180 | "add to val the value of the next byte of bitf left shifted by 4" |
| 181 | => val = 0xf4 + (0x94 << 4) = 0xf4 + 0x940 = 0xa34 |
| 182 | |
| 183 | "add to val the value of the next byte of bitf left shifted by (4 + 7)" |
| 184 | => val = 0xa34 + (0x01 << 11) = 0xa34 + 0x800 = 0x1234 |
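The encode and decode procedures can be written compactly as below. This is an illustrative sketch, not the haproxy source: the names `encode_int` and `decode_int` are hypothetical, and the range check on the input is an addition.

```python
from typing import Tuple

def encode_int(val: int) -> bytes:
    """Encode a 64-bit unsigned integer as a variable-length byte string."""
    if not 0 <= val < 1 << 64:
        raise ValueError("val must fit in 64 bits")
    if val < 0xF0:
        return bytes([val])
    out = bytearray([(val & 0xFF) | 0xF0])   # low byte of val OR'ed with 0xf0
    val = (val - 0xF0) >> 4
    while val >= 0x80:
        out.append((val & 0xFF) | 0x80)      # low 7 bits of val, bit 7 forced to 1
        val = (val - 0x80) >> 7
    out.append(val)                          # last byte, always below 0x80
    return bytes(out)

def decode_int(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one value at buf[pos]; return (value, position after it)."""
    val = buf[pos]
    pos += 1
    if val < 0xF0:
        return val, pos
    shift = 4
    while True:
        byte = buf[pos]
        pos += 1
        val += byte << shift                 # shifts are 4, 11, 18, ...
        shift += 7
        if byte < 0x80:                      # bit 7 clear: this was the last byte
            return val, pos
```

With these definitions, `encode_int(0x1234)` returns the three bytes f4 94 01 from the example, and decoding them gives back 0x1234.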
| 185 | |
| 186 | |
| 187 | Messages |
| 188 | ++++++++ |
| 189 | |
| 190 | *** General *** |
| 191 | |
| 192 | After the handshake has successfully completed, peers are authorized to send |
| 193 | some messages to each other, possibly in both directions. |
| 194 | |
| 195 | All messages start with a header which is at least two bytes long. |
| 196 | |
| 197 | The first byte of this header identifies the class of the message. The next |
| 198 | byte identifies the type of message in the class. |
| 199 | |
| 200 | Some of these messages are variable-length. Others have a fixed size. |
| 201 | Variable-length messages are identified by the value of the message type |
| 202 | byte. For such messages, it is greater than or equal to 128. |
| 203 | |
| 204 | All variable-length message headers must be followed by the encoded length |
| 205 | of the remaining bytes (so the encoded length of the message minus 2 bytes |
| 206 | for the header and minus the length of the encoded length). |
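The framing rule described above can be sketched as follows. The helper names are hypothetical. The integer encoding is the one from the "Encoding" section, repeated here so that the block is self-contained.

```python
from typing import Tuple

def encode_int(val: int) -> bytes:
    if val < 0xF0:
        return bytes([val])
    out = bytearray([(val & 0xFF) | 0xF0])
    val = (val - 0xF0) >> 4
    while val >= 0x80:
        out.append((val & 0xFF) | 0x80)
        val = (val - 0x80) >> 7
    out.append(val)
    return bytes(out)

def decode_int(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    val = buf[pos]
    pos += 1
    if val < 0xF0:
        return val, pos
    shift = 4
    while True:
        byte = buf[pos]
        pos += 1
        val += byte << shift
        shift += 7
        if byte < 0x80:
            return val, pos

def frame_message(msg_class: int, msg_type: int, payload: bytes = b"") -> bytes:
    """Prefix a payload with the class byte, the type byte and, if needed, its length."""
    header = bytes([msg_class, msg_type])
    if msg_type < 128:                       # fixed-size message: header only
        if payload:
            raise ValueError("fixed-size messages carry no payload")
        return header
    return header + encode_int(len(payload)) + payload

def read_message(buf: bytes, pos: int = 0) -> Tuple[int, int, bytes, int]:
    """Return (class, type, payload, position of the next message)."""
    msg_class, msg_type = buf[pos], buf[pos + 1]
    pos += 2
    if msg_type < 128:
        return msg_class, msg_type, b"", pos
    length, pos = decode_int(buf, pos)       # length of the remaining bytes
    return msg_class, msg_type, buf[pos:pos + length], pos + length
```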
| 207 | |
| 208 | There exist four classes of messages: |
| 209 | |
| 210 | +------------+---------------------+--------------+ |
| 211 | | class byte | meaning | message size | |
| 212 | +------------+---------------------+--------------+ |
| 213 | | 0 | control | fixed (2) | |
| 214 | +------------+---------------------+--------------+ |
| 215 | | 1 | error | fixed (2) | |
| 216 | +------------+---------------------+--------------+ |
| 217 | | 10 | stick-table updates | variable | |
| 218 | +------------+---------------------+--------------+ |
| 219 | | 255 | reserved | | |
| 220 | +------------+---------------------+--------------+ |
| 221 | |
| 222 | At the time of this writing, only control and error messages have a fixed |
| 223 | size of two bytes (header only). The stick-table update messages are all |
| 224 | variable-length (their message type bytes are greater than or equal to 128). |
| 225 | |
| 226 | |
| 227 | *** Control message class *** |
| 228 | |
| 229 | At the time of this writing, control messages are fixed-length messages used |
| 230 | only to control the synchronizations between local and/or remote processes |
| 231 | and to emit heartbeat messages. |
| 232 | |
| 233 | There exist five types of such control messages: |
| 234 | |
| 235 | +------------+--------------------------------------------------------+ |
| 236 | | type byte | meaning | |
| 237 | +------------+--------------------------------------------------------+ |
| 238 | | 0 | synchronization request: ask a remote peer for a full | |
| 239 | | | synchronization | |
| 240 | +------------+--------------------------------------------------------+ |
| 241 | | 1 | synchronization finished: signal a remote peer that | |
| 242 | | | local updates have been pushed and local is considered | |
| 243 | | | up to date. | |
| 244 | +------------+--------------------------------------------------------+ |
| 245 | | 2 | synchronization partial: signal a remote peer that | |
| 246 | | | local updates have been pushed and local is not | |
| 247 | | | considered up to date. | |
| 248 | +------------+--------------------------------------------------------+ |
| 249 | | 3 | synchronization confirmed: acknowledge a finished or | |
| 250 | | | partial synchronization message. | |
| 251 | +------------+--------------------------------------------------------+ |
| 252 | | 4 | Heartbeat message. | |
| 253 | +------------+--------------------------------------------------------+ |
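Since control messages are header-only, building and recognizing them takes two bytes. In the sketch below, the enumeration and function names are hypothetical. Only the numeric values come from the tables above.

```python
from enum import IntEnum

CLASS_CONTROL = 0  # class byte of control messages

class ControlType(IntEnum):
    SYNC_REQUEST = 0
    SYNC_FINISHED = 1
    SYNC_PARTIAL = 2
    SYNC_CONFIRMED = 3
    HEARTBEAT = 4

def control_message(msg_type: ControlType) -> bytes:
    """A control message is just the class byte followed by the type byte."""
    return bytes([CLASS_CONTROL, msg_type])

def parse_control(msg: bytes) -> ControlType:
    if len(msg) != 2 or msg[0] != CLASS_CONTROL:
        raise ValueError("not a control message")
    return ControlType(msg[1])  # raises ValueError on an unknown type byte
```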
| 254 | |
| 255 | About heartbeat messages: a peer sends heartbeat messages to peers it is |
| 256 | connected to after periods of 3s of inactivity (i.e. when there is no |
| 257 | stick-table to synchronize for 3s). After a successful peer protocol |
| 258 | handshake between two peers, if one of them does not send any other peer |
| 259 | protocol messages (i.e. no heartbeat and no stick-table update messages) |
| 260 | during a 5s period, it is considered as no longer alive by its remote peer |
| 261 | which closes the session and then tries to reconnect to the peer which |
| 262 | has just disappeared. |
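The two timers can be tracked per session as sketched below. This is one possible reading of the paragraph above: inactivity is taken to mean "nothing sent on this session", which is an assumption. The class and method names are hypothetical.

```python
HEARTBEAT_IDLE_S = 3.0      # send a heartbeat after 3s without sending anything
DEAD_PEER_TIMEOUT_S = 5.0   # remote peer considered dead after 5s of silence

class PeerLiveness:
    """Track the last send and receive times of one peer session."""

    def __init__(self, now: float) -> None:
        self.last_sent = now
        self.last_received = now

    def on_sent(self, now: float) -> None:
        self.last_sent = now

    def on_received(self, now: float) -> None:
        self.last_received = now

    def should_send_heartbeat(self, now: float) -> bool:
        return now - self.last_sent >= HEARTBEAT_IDLE_S

    def peer_is_dead(self, now: float) -> bool:
        return now - self.last_received >= DEAD_PEER_TIMEOUT_S
```

The caller supplies the clock value (for example from `time.monotonic()`), which keeps the logic easy to test.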
| 263 | |
| 264 | *** Error message class *** |
| 265 | |
| 266 | There exist two types of such error messages: |
| 267 | |
| 268 | +-----------+------------------+ |
| 269 | | type byte | meaning | |
| 270 | +-----------+------------------+ |
| 271 | | 0 | protocol error | |
| 272 | +-----------+------------------+ |
| 273 | | 1 | size limit error | |
| 274 | +-----------+------------------+ |
| 275 | |
| 276 | |
| 277 | *** Stick-table update message class *** |
| 278 | |
| 279 | This class is the most important one because it deals with the handling of |
| 280 | stick-table entries between peers, which is at the core of the peers |
| 281 | protocol. |
| 282 | |
| 283 | All the messages of this class are variable-length. Their type bytes are |
| 284 | all greater than or equal to 128. |
| 285 | |
| 286 | There exist five types of such stick-table update messages: |
| 287 | |
| 288 | +-----------+--------------------------------+ |
| 289 | | type byte | meaning | |
| 290 | +-----------+--------------------------------+ |
| 291 | | 128 | Entry update | |
| 292 | +-----------+--------------------------------+ |
| 293 | | 129 | Incremental entry update | |
| 294 | +-----------+--------------------------------+ |
| 295 | | 130 | Stick-table definition | |
| 296 | +-----------+--------------------------------+ |
| 297 | | 131 | Stick-table switch (unused) | |
| 298 | +-----------+--------------------------------+ |
| 299 | | 133 | Update message acknowledgement | |
| 300 | +-----------+--------------------------------+ |
| 301 | |
| 302 | Note that entry update messages may be multiplexed. This means that different |
| 303 | entry update messages for different stick-tables may be sent over the same |
| 304 | peer session. |
| 305 | |
| 306 | To do so, each time entry update messages have to be sent, they must be preceded |
| 307 | by a stick-table definition message. This remains true for incremental entry |
| 308 | update messages. |
| 309 | |
| 310 | As their name indicates, "Update message acknowledgement" messages are used to |
| 311 | acknowledge the entry update messages. |
| 312 | |
| 313 | In the following paragraphs, we give some information about the format of |
| 314 | each stick-table update message. The simple legend below will |
| 315 | help in understanding it. The unit used is the octet. |
| 316 | |
| 317 | XX |
| 318 | +-----------+ |
| 319 | | foo | Single fixed-size "foo" field, made of XX octets. |
| 320 | +-----------+ |
| 321 | |
| 322 | +===========+ |
| 323 | | foo | Variable-length "foo" field. |
| 324 | +===========+ |
| 325 | |
| 326 | +xxxxxxxxxxx+ |
| 327 | | foo | Encoded variable-length "foo" field. |
| 328 | +xxxxxxxxxxx+ |
| 329 | |
| 330 | +###########+ |
| 331 | | foo | "foo" field described below. |
| 332 | +###########+ |
| 333 | |
| 334 | |
| 335 | With this legend, all the stick-table update messages have the following header: |
| 336 | |
| 337 | 1 1 |
| 338 | +--------------------+------------------------+xxxxxxxxxxxxxxxx+ |
| 339 | | Message Class (10) | Message type (128-133) | Message length | |
| 340 | +--------------------+------------------------+xxxxxxxxxxxxxxxx+ |
| 341 | |
| 342 | Note that to help different versions of the peers protocol communicate, |
| 343 | such stick-table update messages may be extended by adding non-mandatory |
| 344 | fields at the end of such messages, announcing a total message length |
| 345 | which is greater than the message length of the previous versions of |
| 346 | the peers protocol. After having parsed the fields it knows, a peer skips |
| 347 | the remaining bytes to parse the next message. |
| 348 | |
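The skipping rule for extended messages can be illustrated as follows. In this sketch the parser only reads the first encoded field of each message body and then jumps to the end announced by the message length, whatever follows. The function names and the choice of reading a single field are assumptions made for the illustration.

```python
from typing import List, Tuple

def decode_int(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Variable-length integer decoding from the "Encoding" section."""
    val = buf[pos]
    pos += 1
    if val < 0xF0:
        return val, pos
    shift = 4
    while True:
        byte = buf[pos]
        pos += 1
        val += byte << shift
        shift += 7
        if byte < 0x80:
            return val, pos

def parse_update_stream(buf: bytes) -> List[Tuple[int, int]]:
    """Return (message type, first encoded field) for each message in buf."""
    messages = []
    pos = 0
    while pos < len(buf):
        msg_type = buf[pos + 1]              # buf[pos] is the class byte
        length, body = decode_int(buf, pos + 2)
        end = body + length                  # where the next message starts
        first_field, _ = decode_int(buf, body)
        messages.append((msg_type, first_field))
        pos = end                            # unknown trailing bytes are skipped
    return messages
```

A message carrying two extra unknown bytes is then parsed like any other, and the following message is still found at the right offset.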
| 349 | - Definition message format: |
| 350 | |
| 351 | Before sending entry update messages, a peer must announce the configuration |
| 352 | of the stick-table these messages relate to, by means of a |
| 353 | "Stick-table definition" message with the following format: |
| 354 | |
| 355 | +xxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxx+==================+ |
| 356 | | Stick-table ID | Stick-table name length | Stick-table name | |
| 357 | +xxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxx+==================+ |
| 358 | |
| 359 | +xxxxxxxxxxxx+xxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxx+ |
| 360 | | Key type | Key length | Data types bitfield | Expiry | |
| 361 | +xxxxxxxxxxxx+xxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxx+ |
| 362 | |
| 363 | +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+ |
| 364 | | Frequency counter #1 | Frequency counter #1 period | |
| 365 | +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+ |
| 366 | |
| 367 | +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+ |
| 368 | | Frequency counter #2 | Frequency counter #2 period | |
| 369 | +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+ |
| 370 | . |
| 371 | . |
| 372 | . |
| 373 | |
| 374 | Note that the "Stick-table ID" field is an encoded integer which is used to |
| 375 | identify the stick-table without using its name (or "Stick-table name" |
| 376 | field). It is local to the process handling the stick-table. So we can have |
| 377 | two peers attached to processes which generate stick-table updates for |
| 378 | the same stick-table (same name) but with different stick-table IDs. |
| 379 | |
| 380 | Also note that the "Frequency counter #X" fields and their associated |
| 381 | period fields exist only if their underlying types are defined |
| 382 | in the "Data types bitfield" field. |
| 383 | |
| 384 | The "Expiry" field and the remaining ones are not used by all the existing |
| 385 | versions of haproxy peers. But they are MANDATORY, so that a |
| 386 | stick-table aggregator peer is able to autoconfigure itself. |
| 387 | |
| 388 | |
| 389 | - Entry update message format: |
| 390 | 4 |
| 391 | +-----------------+###########+############+ |
| 392 | | Local update ID | Key | Data | |
| 393 | +-----------------+###########+############+ |
| 394 | |
| 395 | with "Key" described as follows: |
| 396 | |
| 397 | +xxxxxxxxxxx+=======+ |
| 398 | | length | value | if key type is (non null terminated) "string", |
| 399 | +xxxxxxxxxxx+=======+ |
| 400 | |
| 401 | 4 |
| 402 | +-------+ |
| 403 | | value | if key type is "integer", |
| 404 | +-------+ |
| 405 | |
| 406 | +=======+ |
| 407 | | value | for other key types: the size is announced in |
| 408 | +=======+ the previous stick-table definition message. |
| 409 | |
| 410 | "Data" field is basically a list of encoded values for each type announced |
| 411 | by the "Data types bitfield" field of the previous "Stick-table definition" |
| 412 | message: |
| 413 | |
| 414 | +xxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxx+ +xxxxxxxxxxxxxxxxxxxx+ |
| 415 | | Data type #1 value | Data type #2 value | .... | Data type #n value | |
| 416 | +xxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxx+ +xxxxxxxxxxxxxxxxxxxx+ |
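As an illustration, the body of an "Entry update" message with a "string" key could be assembled as below. Two points are assumptions not stated in this document: the 4-octet "Local update ID" is written in network byte order, and every data value is a plain integer. `build_entry_update_payload` is a hypothetical name.

```python
import struct
from typing import Iterable

def encode_int(val: int) -> bytes:
    """Variable-length integer encoding from the "Encoding" section."""
    if val < 0xF0:
        return bytes([val])
    out = bytearray([(val & 0xFF) | 0xF0])
    val = (val - 0xF0) >> 4
    while val >= 0x80:
        out.append((val & 0xFF) | 0x80)
        val = (val - 0x80) >> 7
    out.append(val)
    return bytes(out)

def build_entry_update_payload(update_id: int, key: bytes,
                               data_values: Iterable[int]) -> bytes:
    out = struct.pack("!I", update_id)       # assumption: big-endian update ID
    out += encode_int(len(key)) + key        # "string" key: encoded length + value
    for value in data_values:                # one encoded value per announced data type
        out += encode_int(value)
    return out
```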
| 417 | |
| 418 | |
| 419 | Most of these fields are internally stored as uint32_t (see STD_T_SINT, |
| 420 | STD_T_UINT, STD_T_ULL C enumerations) or structures made of several uint32_t |
| 421 | (see STD_T_FRQP C enumeration). The remaining one, STD_T_DICT, is internally |
| 422 | used to store entries of LRU caches for other literal dictionary entries |
| 423 | (pairs of IDs associated with strings). It is used to transmit these cache |
| 424 | entries as follows: |
| 425 | |
| 426 | +xxxxxxxxxxx+xxxx+xxxxxxxxxxxxxxx+========+ |
| 427 | | length | ID | string length | string | |
| 428 | +xxxxxxxxxxx+xxxx+xxxxxxxxxxxxxxx+========+ |
| 429 | |
| 430 | "length" is the length in bytes of the remaining data after this "length" field. |
| 431 | "string length" is the length of "string" field which follows. |
| 432 | |
| 433 | Here the cache is used to avoid sending an already sent string again and |
| 434 | again. Indeed, the second time we have to send the same dictionary entry, |
| 435 | if still cached, a peer sends only its ID: |
| 436 | |
| 437 | +xxxxxxxxxxx+xxxx+ |
| 438 | | length | ID | |
| 439 | +xxxxxxxxxxx+xxxx+ |
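The two forms of a dictionary entry can be produced as sketched below. For brevity, the LRU cache is replaced here by a plain set of already sent IDs, which is a simplification and not what the text above describes. `encode_dict_entry` is a hypothetical name.

```python
from typing import Set

def encode_int(val: int) -> bytes:
    """Variable-length integer encoding from the "Encoding" section."""
    if val < 0xF0:
        return bytes([val])
    out = bytearray([(val & 0xFF) | 0xF0])
    val = (val - 0xF0) >> 4
    while val >= 0x80:
        out.append((val & 0xFF) | 0x80)
        val = (val - 0x80) >> 7
    out.append(val)
    return bytes(out)

def encode_dict_entry(entry_id: int, value: bytes, already_sent: Set[int]) -> bytes:
    body = encode_int(entry_id)
    if entry_id not in already_sent:
        # First transmission: the ID is followed by the string length and the string.
        body += encode_int(len(value)) + value
        already_sent.add(entry_id)
    # "length" counts the bytes which follow the "length" field itself.
    return encode_int(len(body)) + body
```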
| 440 | |
| 441 | - Update message acknowledgement format: |
| 442 | |
| 443 | These messages are responses to "Entry update" messages only. |
| 444 | |
| 445 | Their format is very basic for efficiency reasons: |
| 446 | |
| 447 | 4 |
| 448 | +xxxxxxxxxxxxxxxx+-----------+ |
| 449 | | Stick-table ID | Update ID | |
| 450 | +xxxxxxxxxxxxxxxx+-----------+ |
| 451 | |
| 452 | |
| 453 | Note that the "Stick-table ID" field value refers to the one which |
| 454 | has been previously announced by a "Stick-table definition" message. |
| 455 | |
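The acknowledgement body can be built and parsed as below. As for the entry update sketches, the byte order of the 4-octet "Update ID" is assumed to be network byte order, which this document does not state. The function names are hypothetical.

```python
import struct
from typing import Tuple

def encode_int(val: int) -> bytes:
    """Variable-length integer encoding from the "Encoding" section."""
    if val < 0xF0:
        return bytes([val])
    out = bytearray([(val & 0xFF) | 0xF0])
    val = (val - 0xF0) >> 4
    while val >= 0x80:
        out.append((val & 0xFF) | 0x80)
        val = (val - 0x80) >> 7
    out.append(val)
    return bytes(out)

def decode_int(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    val = buf[pos]
    pos += 1
    if val < 0xF0:
        return val, pos
    shift = 4
    while True:
        byte = buf[pos]
        pos += 1
        val += byte << shift
        shift += 7
        if byte < 0x80:
            return val, pos

def build_ack_payload(table_id: int, update_id: int) -> bytes:
    # Encoded stick-table ID, then the 4-octet update ID (assumed big-endian).
    return encode_int(table_id) + struct.pack("!I", update_id)

def parse_ack_payload(payload: bytes) -> Tuple[int, int]:
    table_id, pos = decode_int(payload)
    (update_id,) = struct.unpack_from("!I", payload, pos)
    return table_id, update_id
```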
| 456 | The following diagram may help in understanding how to handle a stream of |
| 457 | stick-table update messages. The handshake step is not represented. |
| 458 | Stick-table IDs are preceded by a '#' character. |
| 459 | |
| 460 | |
| 461 | Peer A Peer B |
| 462 | |
| 463 | stkt def. #1 |
| 464 | ----------------------> |
| 465 | updates (1-5) |
| 466 | ----------------------> |
| 467 | stkt def. #3 |
| 468 | ----------------------> |
| 469 | updates (1000-1005) |
| 470 | ----------------------> |
| 471 | |
| 472 | stkt def. #2 |
| 473 | <---------------------- |
| 474 | updates (10-15) |
| 475 | <---------------------- |
| 476 | ack 5 for #1 |
| 477 | <---------------------- |
| 478 | ack 1005 for #3 |
| 479 | <---------------------- |
| 480 | stkt def. #4 |
| 481 | <---------------------- |
| 482 | updates (100-105) |
| 483 | <---------------------- |
| 484 | |
| 485 | ack 15 for #2 |
| 486 | ----------------------> |
| 487 | ack 105 for #4 |
| 488 | ----------------------> |
| 489 | (from here, on both sides, all stick-table updates |
| 490 | are considered as received) |
| 491 | |