                        +--------------------+
                        | Peers protocol 2.1 |
                        +--------------------+


  The peers protocol is implemented over TCP. Its aim is to transmit
  stick-table entry information between several haproxy processes.

  This protocol is symmetrical. This means that at any time, each peer
  may connect to the other peers it has been configured with, to send
  its latest stick-table updates. There is no client or server role in this
  protocol. As peers may connect to each other at the same time, the protocol
  ensures that only one peer session stays open between a pair of peers
  before they start sending their stick-table information, possibly in both
  directions (or not).


  Handshake
  +++++++++

  Just after having connected to another peer, a peer must identify itself
  and identify the remote peer by sending a "hello" message. The remote peer
  replies with a "status" message.

  A "hello" message is made of three lines, each terminated by a line feed
  character, as follows:

      <protocol identifier> <version>\n
      <remote peer identifier>\n
      <local peer identifier> <process ID> <relative process ID>\n

  protocol identifier   : HAProxyS
  version               : 2.1
  remote peer identifier: the peer name this "hello" message is sent to.
  local peer identifier : the name of the peer which sends this "hello" message.
  process ID            : the ID of the process handling this peer session.
  relative process ID   : the haproxy's relative process ID (0 if nbproc == 1).
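  As an illustration of the "hello" layout, here is a minimal Python sketch
  which builds and parses such a message. The function names, the ASCII
  encoding and the error handling are assumptions made for this example; they
  are not taken from the haproxy sources.

```python
PROTOCOL_ID = "HAProxyS"
PROTOCOL_VERSION = "2.1"


def build_hello(remote_peer, local_peer, pid, relative_pid=0):
    """Return the three-line "hello" message as bytes (ASCII is assumed)."""
    lines = (
        f"{PROTOCOL_ID} {PROTOCOL_VERSION}",
        remote_peer,
        f"{local_peer} {pid} {relative_pid}",
    )
    # Every line, including the last one, ends with a line feed.
    return "".join(line + "\n" for line in lines).encode("ascii")


def parse_hello(data):
    """Split a "hello" message into its fields; raise ValueError if malformed."""
    text = data.decode("ascii")
    if not text.endswith("\n"):
        raise ValueError("hello message must end with a line feed")
    lines = text[:-1].split("\n")
    if len(lines) != 3:
        raise ValueError("hello message must have exactly three lines")
    # Unpacking raises ValueError when a line has the wrong number of fields.
    protocol, version = lines[0].split(" ")
    local_peer, pid, relative_pid = lines[2].split(" ")
    return {
        "protocol": protocol,
        "version": version,
        "remote_peer": lines[1],
        "local_peer": local_peer,
        "pid": int(pid),
        "relative_pid": int(relative_pid),
    }
```

  For instance, build_hello("peerB", "peerA", 1234) yields the bytes
  "HAProxyS 2.1\npeerB\npeerA 1234 0\n".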

  The "status" message is made of a single line terminated by a line feed
  character, as follows:

      <status code>\n

  with one of these values as status code (a three-digit number):

    +-------------+---------------------------------+
    | status code | meaning                         |
    +-------------+---------------------------------+
    |     200     | Handshake succeeded             |
    +-------------+---------------------------------+
    |     300     | Try again later                 |
    +-------------+---------------------------------+
    |     501     | Protocol error                  |
    +-------------+---------------------------------+
    |     502     | Bad version                     |
    +-------------+---------------------------------+
    |     503     | Local peer identifier mismatch  |
    +-------------+---------------------------------+
    |     504     | Remote peer identifier mismatch |
    +-------------+---------------------------------+
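  The status codes can be handled with a small lookup table. The following
  Python sketch is only illustrative: the names STATUS_MEANINGS and
  parse_status are hypothetical, and rejecting unknown codes is a choice made
  for this example.

```python
# Status codes and their meaning, copied from the table of this section.
STATUS_MEANINGS = {
    200: "Handshake succeeded",
    300: "Try again later",
    501: "Protocol error",
    502: "Bad version",
    503: "Local peer identifier mismatch",
    504: "Remote peer identifier mismatch",
}


def parse_status(line):
    """Return (code, meaning) for a status line: three digits and a line feed."""
    if len(line) != 4 or not line.endswith(b"\n") or not line[:3].isdigit():
        raise ValueError("status must be three digits followed by a line feed")
    code = int(line[:3])
    if code not in STATUS_MEANINGS:
        raise ValueError("unknown status code: %d" % code)
    return code, STATUS_MEANINGS[code]


def handshake_succeeded(line):
    """True only for status 200; any other known code means a failure."""
    return parse_status(line)[0] == 200
```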

  As the protocol is symmetrical, two peers may connect to each other at the
  same time. For efficiency reasons, the protocol ensures that only one TCP
  session remains open once the handshake has succeeded and before any
  stick-table data is transmitted. For each pair of peers, the last peer to
  connect wins. Each time a peer A receives a "hello" message from a peer B,
  peer A checks whether it has already opened a peer session with peer B,
  that is, one with a successful handshake. If so, peer A closes its own peer
  session, so the peer session opened by B is the one which stays open.


               Peer A           Peer B
                        hello
               ---------------------->
                     status 200
               <----------------------
                        hello
               <++++++++++++++++++++++
                     TCP/FIN-ACK
               ---------------------->
                     TCP/FIN-ACK
               <----------------------
                     status 200
               ++++++++++++++++++++++>
                        data
               <++++++++++++++++++++++
                        data
               ++++++++++++++++++++++>
                        data
               ++++++++++++++++++++++>
                        data
               <++++++++++++++++++++++
                          .
                          .
                          .

  As it is still possible that a pair of peers decides to close both of their
  peer sessions at the same time, the protocol ensures that peers will not
  reconnect at the same time by adding a random delay (from 50 to 2050 ms)
  before any reconnection.
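  A reconnection delay of this kind could be drawn as in the following
  Python sketch. The uniform distribution and the inclusive bounds are
  assumptions; this document only gives the 50 to 2050 ms range.

```python
import random

MIN_RECONNECT_DELAY_MS = 50
MAX_RECONNECT_DELAY_MS = 2050


def reconnect_delay_ms(rng=random):
    """Pick a random delay in milliseconds to wait before reconnecting."""
    # randint() includes both bounds.
    return rng.randint(MIN_RECONNECT_DELAY_MS, MAX_RECONNECT_DELAY_MS)
```

  Two peers which closed their sessions at the same instant are then unlikely
  to reconnect to each other at the same instant again.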


  Encoding
  ++++++++

  As some TCP data may be corrupted, for integrity reasons, some data fields
  are encoded at peer session level.

  The following algorithms explain how to encode/decode the data.

  encode:
      input : val  (64-bit integer)
      output: bitf (variable-length bitfield)

      if val is less than 0xf0
          set the next byte of bitf to the value of val
          return bitf

      set the next byte of bitf to the value of val OR'ed with 0xf0
      subtract 0xf0 from val
      right shift val by 4

      while val is greater than or equal to 0x80:
          set the next byte of bitf to the value of the byte made of the last
          7 bits of val OR'ed with 0x80
          subtract 0x80 from val
          right shift val by 7

      set the next byte of bitf to the value of val
      return bitf

  decode:
      input : bitf (variable-length bitfield)
      output: val  (64-bit integer)

      set val to the value of the first byte of bitf
      if val is less than 0xf0 (bits 4 to 7 of val are not all set)
          return val

      set loop to 0
      do
          add to val the value of the next byte of bitf left shifted by
          (4 + 7*loop)
          set loop to (loop + 1)
      while bit 7 of the byte which has just been added is set
      return val

  Example:

      let's say that we must encode 0x1234.

      "set the next byte of bitf to the value of val OR'ed with 0xf0"
      => bitf[0] = (0x1234 | 0xf0) & 0xff = 0xf4

      "subtract 0xf0 from val"
      => val = 0x1144

      "right shift val by 4"
      => val = 0x114

      "set the next byte of bitf to the value of the byte made of the last
      7 bits of val OR'ed with 0x80"
      => bitf[1] = (0x114 | 0x80) & 0xff = 0x94

      "subtract 0x80 from val"
      => val = 0x94

      "right shift val by 7"
      => val = 0x1

      "set the next byte of bitf to the value of val"
      => bitf[2] = 0x1

      So, the encoded value of 0x1234 is 0xf49401.

      To decode this value:

      "set val to the value of the first byte of bitf"
      => val = 0xf4

      "add to val the value of the next byte of bitf left shifted by 4"
      => val = 0xf4 + (0x94 << 4) = 0xf4 + 0x940 = 0xa34

      "add to val the value of the next byte of bitf left shifted by (4 + 7)"
      => val = 0xa34 + (0x01 << 11) = 0xa34 + 0x800 = 0x1234
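  The two algorithms of this section can be transcribed almost line for line.
  The Python sketch below is such a transcription, not the haproxy
  implementation: the names encode_int and decode_int, the returned
  (value, next position) pair and the ValueError on truncated input are
  choices made for this example.

```python
def encode_int(val):
    """Encode an unsigned 64-bit integer as a variable-length byte string."""
    if not 0 <= val < 1 << 64:
        raise ValueError("value does not fit in 64 bits")
    if val < 0xF0:
        return bytes([val])              # small values take a single byte
    out = bytearray([(val | 0xF0) & 0xFF])
    val = (val - 0xF0) >> 4
    while val >= 0x80:
        out.append((val | 0x80) & 0xFF)  # low 7 bits, continuation bit set
        val = (val - 0x80) >> 7
    out.append(val)                      # last byte: bit 7 is clear
    return bytes(out)


def decode_int(buf, pos=0):
    """Decode one integer starting at buf[pos]; return (value, next_pos)."""
    if pos >= len(buf):
        raise ValueError("truncated input")
    val = buf[pos]
    pos += 1
    if val < 0xF0:
        return val, pos
    shift = 4
    while True:
        if pos >= len(buf):
            raise ValueError("truncated input")
        byte = buf[pos]
        pos += 1
        val += byte << shift
        shift += 7
        if byte < 0x80:                  # bit 7 clear: this was the last byte
            return val, pos
```

  With these definitions, encode_int(0x1234) returns the bytes f4 94 01 and
  decode_int() applied to those three bytes returns 0x1234, as in the worked
  example of this section.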


  Messages
  ++++++++

  *** General ***

  After the handshake has successfully completed, peers are authorized to send
  messages to each other, possibly in both directions.

  Every message starts with a two-byte header.

  The first byte of this header identifies the class of the message. The next
  byte identifies the type of the message within the class.

  Some of these messages are variable-length. Others have a fixed size.
  Variable-length messages are identified by the value of the message type
  byte: for such messages, it is greater than or equal to 128.

  All variable-length message headers must be followed by the encoded length
  of the remaining bytes (that is, the total length of the message minus
  2 bytes for the header and minus the size of the encoded length itself).

  There exist four classes of messages:

    +------------+---------------------+--------------+
    | class byte | meaning             | message size |
    +------------+---------------------+--------------+
    |     0      | control             |  fixed (2)   |
    +------------+---------------------+--------------+
    |     1      | error               |  fixed (2)   |
    +------------+---------------------+--------------+
    |     10     | stick-table updates |   variable   |
    +------------+---------------------+--------------+
    |    255     | reserved            |              |
    +------------+---------------------+--------------+

  At the time of this writing, only control and error messages have a fixed
  size of two bytes (header only). The stick-table update messages are all
  variable-length (their message type bytes are greater than or equal to 128).
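  The framing rules of this section can be sketched in Python as follows.
  This is an illustration only: parse_message and decode_length are
  hypothetical names, decode_length repeats the decoding algorithm of the
  "Encoding" section so that the example is self-contained, and every message
  whose type byte is below 128 is assumed to be made of its header only.

```python
def decode_length(buf, pos):
    """Decode one encoded integer at buf[pos]; return (value, next_pos)."""
    val = buf[pos]
    pos += 1
    if val >= 0xF0:
        shift = 4
        while True:
            byte = buf[pos]
            pos += 1
            val += byte << shift
            shift += 7
            if byte < 0x80:
                break
    return val, pos


def parse_message(buf, pos=0):
    """Return (class, type, payload, next_pos) for the message at buf[pos]."""
    if len(buf) - pos < 2:
        raise ValueError("truncated header")
    msg_class, msg_type = buf[pos], buf[pos + 1]
    pos += 2
    if msg_type < 128:
        # Fixed-size message: the two-byte header is the whole message.
        return msg_class, msg_type, b"", pos
    try:
        length, pos = decode_length(buf, pos)
    except IndexError:
        raise ValueError("truncated message length")
    if len(buf) - pos < length:
        raise ValueError("truncated payload")
    return msg_class, msg_type, bytes(buf[pos:pos + length]), pos + length
```

  Calling parse_message() again with the returned position walks through a
  buffer which holds several consecutive messages.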


  *** Control message class ***

  At the time of this writing, control messages are fixed-length messages used
  only to control the synchronizations between local and/or remote processes
  and to emit heartbeat messages.

  There exist five types of such control messages:

    +------------+--------------------------------------------------------+
    | type byte  | meaning                                                |
    +------------+--------------------------------------------------------+
    |     0      | synchronization request: ask a remote peer for a full  |
    |            | synchronization                                        |
    +------------+--------------------------------------------------------+
    |     1      | synchronization finished: signal a remote peer that    |
    |            | local updates have been pushed and local is considered |
    |            | up to date.                                            |
    +------------+--------------------------------------------------------+
    |     2      | synchronization partial: signal a remote peer that     |
    |            | local updates have been pushed and local is not        |
    |            | considered up to date.                                 |
    +------------+--------------------------------------------------------+
    |     3      | synchronization confirmed: acknowledge a finished or   |
    |            | partial synchronization message.                       |
    +------------+--------------------------------------------------------+
    |     4      | Heartbeat message.                                     |
    +------------+--------------------------------------------------------+

  About heartbeat messages: a peer sends heartbeat messages to the peers it is
  connected to after each 3s period of inactivity (i.e. when there has been no
  stick-table update to synchronize for 3s). After a successful peer protocol
  handshake between two peers, if one of them does not send any peer protocol
  message (i.e. neither heartbeat nor stick-table update messages) during a
  5s period, its remote peer considers it dead, closes the session and then
  tries to reconnect to the peer which has just disappeared.
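  The two timers involved can be sketched in Python as follows. The constant
  and function names are hypothetical, and timestamps are assumed to be
  expressed in seconds on a monotonic clock.

```python
HEARTBEAT_INTERVAL_S = 3.0         # inactivity period before a heartbeat
DEAD_PEER_TIMEOUT_S = 5.0          # silence after which a peer is given up
HEARTBEAT_MESSAGE = bytes([0, 4])  # control class (0), heartbeat type (4)


def heartbeat_due(now, last_sent):
    """True when nothing has been sent to the remote peer for 3s or more."""
    return now - last_sent >= HEARTBEAT_INTERVAL_S


def peer_timed_out(now, last_received):
    """True when nothing has been received from the remote peer for 5s."""
    return now - last_received >= DEAD_PEER_TIMEOUT_S
```

  As the heartbeat interval is shorter than the timeout, a peer which is alive
  but idle still gets a message through before its remote peer gives up.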

  *** Error message class ***

  There exist two types of such error messages:

    +-----------+------------------+
    | type byte | meaning          |
    +-----------+------------------+
    |     0     | protocol error   |
    +-----------+------------------+
    |     1     | size limit error |
    +-----------+------------------+


  *** Stick-table update message class ***

  This class is the most important one because it deals with the handling of
  stick-table entries between peers, which is at the core of the peers
  protocol.

  All the messages of this class are variable-length. Their type bytes are
  all greater than or equal to 128.

  There exist five types of such stick-table update messages:

    +-----------+--------------------------------+
    | type byte | meaning                        |
    +-----------+--------------------------------+
    |    128    | Entry update                   |
    +-----------+--------------------------------+
    |    129    | Incremental entry update       |
    +-----------+--------------------------------+
    |    130    | Stick-table definition         |
    +-----------+--------------------------------+
    |    131    | Stick-table switch (unused)    |
    +-----------+--------------------------------+
    |    133    | Update message acknowledgement |
    +-----------+--------------------------------+

  Note that entry update messages may be multiplexed. This means that
  entry update messages for different stick-tables may be sent over the same
  peer session.

  To do so, each time entry update messages have to be sent, they must be
  preceded by a stick-table definition message. This remains true for
  incremental entry update messages.

  As its name indicates, the "Update message acknowledgement" message is used
  to acknowledge entry update messages.

  The following paragraphs give some information about the format of each
  stick-table update message. The simple legend below will help in
  understanding it. The unit used is the octet.

         XX
    +-----------+
    |    foo    |  Unique fixed-size "foo" field, made of XX octets.
    +-----------+

    +===========+
    |    foo    |  Variable-length "foo" field.
    +===========+

    +xxxxxxxxxxx+
    |    foo    |  Encoded variable-length "foo" field.
    +xxxxxxxxxxx+

    +###########+
    |    foo    |  "foo" field described hereunder.
    +###########+


  With this legend, all the stick-table update messages have such a header:

              1                        1
    +--------------------+------------------------+xxxxxxxxxxxxxxxx+
    | Message Class (10) | Message type (128-133) | Message length |
    +--------------------+------------------------+xxxxxxxxxxxxxxxx+

  Note that, to help different versions of the peers protocol communicate
  with each other, such stick-table update messages may be extended by adding
  non-mandatory fields at their end, announcing a total message length
  which is greater than the message length of the previous versions of the
  peers protocol. After having parsed the fields it knows, a peer skips the
  remaining ones to parse the next message.

  - Definition message format:

  Before sending entry update messages, a peer must announce the configuration
  of the stick-table these messages relate to, by means of a
  "Stick-table definition" message with the following format:

    +xxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxx+==================+
    | Stick-table ID | Stick-table name length | Stick-table name |
    +xxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxx+==================+

    +xxxxxxxxxxxx+xxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxx+
    |  Key type  |  Key length  |  Data types bitfield  | Expiry  |
    +xxxxxxxxxxxx+xxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxx+

    +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+
    |   Frequency counter #1    |   Frequency counter #1 period   |
    +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+

    +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+
    |   Frequency counter #2    |   Frequency counter #2 period   |
    +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+
                                 .
                                 .
                                 .

  Note that the "Stick-table ID" field is an encoded integer which is used to
  identify the stick-table without using its name (the "Stick-table name"
  field). It is local to the process handling the stick-table. So two peers
  attached to different processes may generate stick-table updates for
  the same stick-table (same name) but with different stick-table IDs.

  Also note that the list of "Frequency counter #X" fields and their
  associated period fields exists only if their underlying types are
  announced in the "Data types bitfield" field.

  The "Expiry" field and the following ones are not used by all the existing
  versions of haproxy peers. But they are MANDATORY, so that a stick-table
  aggregator peer is able to autoconfigure itself.


  - Entry update message format:
             4
    +-----------------+###########+############+
    | Local update ID |    Key    |    Data    |
    +-----------------+###########+############+

  with "Key" described as follows:

    +xxxxxxxxxxx+=======+
    |  length   | value |  if key type is (non null-terminated) "string",
    +xxxxxxxxxxx+=======+

        4
    +-------+
    | value |  if key type is "integer",
    +-------+

    +=======+
    | value |  for other key types: the size is announced in
    +=======+  the previous stick-table definition message.

  The "Data" field is basically a list of encoded values, one for each type
  announced by the "Data types bitfield" field of the previous "Stick-table
  definition" message:

    +xxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxx+      +xxxxxxxxxxxxxxxxxxxx+
    | Data type #1 value | Data type #2 value | .... | Data type #n value |
    +xxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxx+      +xxxxxxxxxxxxxxxxxxxx+


  Most of these fields are internally stored as uint32_t (see the STD_T_SINT,
  STD_T_UINT and STD_T_ULL C enumerations) or as structures made of several
  uint32_t (see the STD_T_FRQP C enumeration). The remaining one, STD_T_DICT,
  is internally used to store entries of LRU caches for other literal
  dictionary entries (pairs of IDs associated with strings). It is used to
  transmit these cache entries as follows:

    +xxxxxxxxxxx+xxxx+xxxxxxxxxxxxxxx+========+
    |  length   | ID | string length | string |
    +xxxxxxxxxxx+xxxx+xxxxxxxxxxxxxxx+========+

  "length" is the length in bytes of the remaining data after this "length"
  field. "string length" is the length of the "string" field which follows.

  Here the cache is used to avoid sending an already sent string again and
  again. Indeed, the second time the same dictionary entry has to be sent,
  if it is still cached, a peer sends only its ID:

    +xxxxxxxxxxx+xxxx+
    |  length   | ID |
    +xxxxxxxxxxx+xxxx+

  - Update message acknowledgement format:

  These messages are responses to "Entry update" messages only.

  Their format is very basic for efficiency reasons:

                           4
    +xxxxxxxxxxxxxxxx+-----------+
    | Stick-table ID | Update ID |
    +xxxxxxxxxxxxxxxx+-----------+


  Note that the "Stick-table ID" field value relates to the one which
  has been previously announced by a "Stick-table definition" message.
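  Putting the message header and this format together, a complete
  acknowledgement message could be built as in the Python sketch below. The
  names are hypothetical, encode_int repeats the algorithm of the "Encoding"
  section so that the example is self-contained, and the 4-byte "Update ID"
  is assumed to be sent in network byte order (big-endian), which this
  document does not state.

```python
import struct

CLASS_STICK_TABLE_UPDATE = 10
TYPE_UPDATE_ACK = 133


def encode_int(val):
    """Encode an unsigned integer as described in the "Encoding" section."""
    if val < 0xF0:
        return bytes([val])
    out = bytearray([(val | 0xF0) & 0xFF])
    val = (val - 0xF0) >> 4
    while val >= 0x80:
        out.append((val | 0x80) & 0xFF)
        val = (val - 0x80) >> 7
    out.append(val)
    return bytes(out)


def build_ack(table_id, update_id):
    """Return a full "Update message acknowledgement" message as bytes."""
    # Payload: encoded stick-table ID, then a fixed 4-byte update ID
    # ("!I" = unsigned 32-bit, network byte order, an assumption here).
    payload = encode_int(table_id) + struct.pack("!I", update_id)
    header = bytes([CLASS_STICK_TABLE_UPDATE, TYPE_UPDATE_ACK])
    # The encoded length covers only what follows the length field.
    return header + encode_int(len(payload)) + payload
```

  Under these assumptions, build_ack(1, 5) is the 8-byte message
  0a 85 05 01 00 00 00 05.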

  The following schema may help in understanding how to handle a stream of
  stick-table update messages. The handshake step is not represented.
  Stick-table IDs are preceded by a '#' character.


               Peer A           Peer B

                    stkt def. #1
               ---------------------->
                    updates (1-5)
               ---------------------->
                    stkt def. #3
               ---------------------->
                 updates (1000-1005)
               ---------------------->

                    stkt def. #2
               <----------------------
                   updates (10-15)
               <----------------------
                    ack 5 for #1
               <----------------------
                   ack 1005 for #3
               <----------------------
                    stkt def. #4
               <----------------------
                  updates (100-105)
               <----------------------

                    ack 15 for #2
               ---------------------->
                   ack 105 for #4
               ---------------------->
          (from here, on both sides, all stick-table updates
           are considered as received)