                        +----------------+
                        | Peers protocol |
                        |  version 1.8   |
                        +----------------+


The peers protocol is implemented over TCP. Its purpose is to transmit
stick-table entry information between several haproxy processes.

This protocol is symmetrical: at any time, each peer may connect to the
other peers it has been configured with in order to send them its latest
stick-table updates. There is no client or server role in this protocol.
As peers may connect to each other at the same time, the protocol ensures
that only one peer session stays open between a pair of peers before they
start sending their stick-table information, possibly in both directions.


Handshake
+++++++++

Just after having connected to another peer, a peer must identify itself
and identify the remote peer by sending a "hello" message. The remote peer
replies with a "status" message.

A "hello" message is made of three lines, each terminated by a line feed
character, as follows:

  <protocol identifier> <version>\n
  <remote peer identifier>\n
  <local peer identifier> <process ID> <relative process ID>\n

  protocol identifier   : HAProxyS
  version               : 2.1
  remote peer identifier: the name of the peer this "hello" message is sent to.
  local peer identifier : the name of the peer which sends this "hello" message.
  process ID            : the ID of the process handling this peer session.
  relative process ID   : haproxy's relative process ID (0 if nbproc == 1).
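
The three-line layout above can be sketched as follows. This is only an
illustration: the function names build_hello() and parse_hello() are
hypothetical, and the ASCII encoding of peer names is an assumption which is
not stated in this document.

```python
def build_hello(remote_peer, local_peer, pid, relative_pid=0, version="2.1"):
    """Return the three-line "hello" message as bytes (ASCII assumed)."""
    return (
        f"HAProxyS {version}\n"
        f"{remote_peer}\n"
        f"{local_peer} {pid} {relative_pid}\n"
    ).encode("ascii")


def parse_hello(data):
    """Split a "hello" message into its fields; raise ValueError if malformed."""
    lines = data.decode("ascii").split("\n")
    # Three "\n"-terminated lines give four pieces, the last one being empty.
    if len(lines) != 4 or lines[3] != "":
        raise ValueError("a hello message is made of exactly three lines")
    protocol, version = lines[0].split(" ")
    local_peer, pid, relative_pid = lines[2].split(" ")
    return {
        "protocol": protocol,
        "version": version,
        "remote_peer": lines[1],
        "local_peer": local_peer,
        "pid": int(pid),
        "relative_pid": int(relative_pid),
    }
```

For instance, build_hello("peerB", "peerA", 1234) gives the bytes
"HAProxyS 2.1\npeerB\npeerA 1234 0\n".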

The "status" message is made of a single line terminated by a line feed
character, as follows:

  <status code>\n

with the following values as status code (a three-digit number):

  +-------------+---------------------------------+
  | status code | meaning                         |
  +-------------+---------------------------------+
  | 200         | Handshake succeeded             |
  +-------------+---------------------------------+
  | 300         | Try again later                 |
  +-------------+---------------------------------+
  | 501         | Protocol error                  |
  +-------------+---------------------------------+
  | 502         | Bad version                     |
  +-------------+---------------------------------+
  | 503         | Local peer identifier mismatch  |
  +-------------+---------------------------------+
  | 504         | Remote peer identifier mismatch |
  +-------------+---------------------------------+
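
A receiver might validate a "status" line along these lines. This is a
minimal sketch: the names STATUS_CODES and parse_status() are hypothetical,
and rejecting codes which are not listed in the table is a design choice of
this example, not something this document requires.

```python
# Status codes and meanings, copied from the table above.
STATUS_CODES = {
    200: "Handshake succeeded",
    300: "Try again later",
    501: "Protocol error",
    502: "Bad version",
    503: "Local peer identifier mismatch",
    504: "Remote peer identifier mismatch",
}


def parse_status(line):
    """Return the status code of a "<status code>\\n" line given as bytes."""
    if len(line) != 4 or not line.endswith(b"\n") or not line[:3].isdigit():
        raise ValueError("expected three digits followed by a line feed")
    code = int(line[:3])
    if code not in STATUS_CODES:
        raise ValueError(f"unknown status code {code}")
    return code


def handshake_succeeded(line):
    """True only for status 200."""
    return parse_status(line) == 200
```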

As the protocol is symmetrical, two peers may connect to each other at the
same time. For efficiency reasons, the protocol ensures that only one TCP
session remains open once the handshake has succeeded and before any
stick-table data is transmitted. For each pair of peers, the last peer to
connect wins. Each time a peer A receives a "hello" message from a peer B,
peer A checks whether it has already opened a peer session with peer B, that
is, one with a successful handshake. If so, peer A closes its own peer
session, so the peer session opened by B is the one which stays open.

In the diagram below, '-' arrows belong to the session opened by peer A and
'+' arrows to the session opened by peer B.


    Peer A                                  Peer B
                        hello
                ---------------------->
                      status 200
                <----------------------
                        hello
                <++++++++++++++++++++++
                     TCP/FIN-ACK
                ---------------------->
                     TCP/FIN-ACK
                <----------------------
                      status 200
                ++++++++++++++++++++++>
                         data
                <++++++++++++++++++++++
                         data
                ++++++++++++++++++++++>
                         data
                ++++++++++++++++++++++>
                         data
                <++++++++++++++++++++++
                           .
                           .
                           .

As a pair of peers may still decide to close both of their peer sessions at
the same time, the protocol prevents them from reconnecting simultaneously
by adding a random delay (50 up to 2050 ms) before any reconnection.
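
The random reconnection delay can be illustrated as below. This is only a
sketch: reconnect_delay_ms() is a hypothetical name, and treating both
bounds as inclusive is an assumption, since the text only says "50 up to
2050 ms".

```python
import random

MIN_DELAY_MS = 50     # lower bound quoted above
MAX_DELAY_MS = 2050   # upper bound quoted above (assumed inclusive)


def reconnect_delay_ms(rng=random):
    """Pick a random delay, in milliseconds, to wait before reconnecting."""
    return rng.randint(MIN_DELAY_MS, MAX_DELAY_MS)
```

Since each peer draws its own delay, two peers which closed their sessions
at the same time are unlikely to reconnect at the same time.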


Encoding
++++++++

As some TCP data may be corrupted, some data fields are encoded at the peer
session level for integrity reasons.

The following algorithms explain how to encode/decode the data.

  encode:
    input : val  (64-bit integer)
    output: bitf (variable-length bitfield)

    if val is less than 0xf0
      set the next byte of bitf to the value of val
      return bitf

    set the next byte of bitf to the value of val OR'ed with 0xf0
    subtract 0xf0 from val
    right shift val by 4

    while val is greater than or equal to 0x80:
      set the next byte of bitf to the value of the byte made of the last
      7 bits of val OR'ed with 0x80
      subtract 0x80 from val
      right shift val by 7

    set the next byte of bitf to the value of val
    return bitf

  decode:
    input : bitf (variable-length bitfield)
    output: val  (64-bit integer)

    set val to the value of the first byte of bitf
    if val is less than 0xf0 (bits 4 to 7 are not all set)
      return val

    set loop to 0
    do
      add to val the value of the next byte of bitf left shifted by (4 + 7*loop)
      set loop to (loop + 1)
    while bit 7 of the byte which has just been added is set
    return val

Example:

  let's say that we must encode 0x1234. As it is not less than 0xf0, it
  cannot be stored in a single byte.

  "set the next byte of bitf to the value of val OR'ed with 0xf0"
    => bitf[0] = (0x1234 | 0xf0) & 0xff = 0xf4

  "subtract 0xf0 from val"
    => val = 0x1144

  "right shift val by 4"
    => val = 0x114

  0x114 is greater than or equal to 0x80, so the loop is entered:

  "set the next byte of bitf to the value of the byte made of the last
  7 bits of val OR'ed with 0x80"
    => bitf[1] = (0x114 | 0x80) & 0xff = 0x94

  "subtract 0x80 from val"
    => val = 0x94

  "right shift val by 7"
    => val = 0x1

  0x1 is less than 0x80, so the loop ends:

  "set the next byte of bitf to the value of val"
    => bitf[2] = 0x1

  So, the encoded value of 0x1234 is 0xf49401.

  To decode this value:

  "set val to the value of the first byte of bitf"
    => val = 0xf4

  "add to val the value of the next byte of bitf left shifted by 4"
    => val = 0xf4 + (0x94 << 4) = 0xf4 + 0x940 = 0xa34

  "add to val the value of the next byte of bitf left shifted by (4 + 7)"
    => val = 0xa34 + (0x01 << 11) = 0xa34 + 0x800 = 0x1234
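
The encode and decode steps can be transcribed into Python as follows. This
is a sketch which follows the worked example, not haproxy's own code: the
names encode_varint() and decode_varint() are hypothetical.

```python
def encode_varint(val):
    """Encode a non-negative integer into a variable-length byte string."""
    if val < 0:
        raise ValueError("val must be non-negative")
    if val < 0xF0:
        return bytes([val])
    out = bytearray([(val | 0xF0) & 0xFF])
    val = (val - 0xF0) >> 4
    while val >= 0x80:
        out.append((val | 0x80) & 0xFF)
        val = (val - 0x80) >> 7
    out.append(val)
    return bytes(out)


def decode_varint(buf, pos=0):
    """Decode the integer starting at buf[pos]; return (value, next position)."""
    val = buf[pos]
    pos += 1
    if val < 0xF0:
        return val, pos
    shift = 4
    while True:
        byte = buf[pos]
        pos += 1
        val += byte << shift
        shift += 7
        if byte < 0x80:   # bit 7 clear: this was the last byte
            return val, pos
```

With these helpers, encode_varint(0x1234) gives the three bytes f4 94 01 of
the example, and decode_varint() applied to them gives back 0x1234 together
with the position of the next field.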


Messages
++++++++

*** General ***

After the handshake has completed successfully, peers are allowed to send
messages to each other, possibly in both directions.

Every message starts with a header of at least two bytes.

The first byte of this header identifies the class of the message. The next
byte identifies the type of the message within the class.

Some of these messages are variable-length. Others have a fixed size.
Variable-length messages are identified by the value of the message type
byte, which is greater than or equal to 128 for such messages.

The header of every variable-length message must be followed by the encoded
length of the remaining bytes (that is, the total length of the message
minus the 2 header bytes and minus the length of the encoded length itself).

There exist four classes of messages:

  +------------+---------------------+--------------+
  | class byte | meaning             | message size |
  +------------+---------------------+--------------+
  | 0          | control             | fixed (2)    |
  +------------+---------------------+--------------+
  | 1          | error               | fixed (2)    |
  +------------+---------------------+--------------+
  | 10         | stick-table updates | variable     |
  +------------+---------------------+--------------+
  | 255        | reserved            |              |
  +------------+---------------------+--------------+

At the time of this writing, only control and error messages have a fixed
size of two bytes (header only). The stick-table update messages are all
variable-length (their message type bytes are greater than or equal to 128).
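
A reader of the message stream could parse this header as sketched below.
The names parse_header() and decode_varint() are hypothetical, and the
decoder simply transcribes the "decode" steps of the Encoding section.

```python
# Class bytes and their meanings, copied from the table above.
CLASS_NAMES = {0: "control", 1: "error", 10: "stick-table updates",
               255: "reserved"}


def decode_varint(buf, pos=0):
    """Decode an encoded integer at buf[pos]; return (value, next position)."""
    val = buf[pos]
    pos += 1
    if val < 0xF0:
        return val, pos
    shift = 4
    while True:
        byte = buf[pos]
        pos += 1
        val += byte << shift
        shift += 7
        if byte < 0x80:
            return val, pos


def parse_header(buf, pos=0):
    """Return (class, type, remaining length, position of the first byte
    following the header and the encoded length)."""
    msg_class, msg_type = buf[pos], buf[pos + 1]
    pos += 2
    length = 0
    if msg_type >= 128:               # variable-length message
        length, pos = decode_varint(buf, pos)
    return msg_class, msg_type, length, pos
```

A fixed-size message yields a remaining length of 0, so the caller can skip
to the next message in both cases by advancing to the returned position plus
the returned length.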


*** Control message class ***

At the time of this writing, control messages are fixed-length messages used
only to control the synchronizations between local and/or remote processes.

There exist four types of such control messages:

  +------------+--------------------------------------------------------+
  | type byte  | meaning                                                |
  +------------+--------------------------------------------------------+
  | 0          | synchronization request: ask a remote peer for a full  |
  |            | synchronization                                        |
  +------------+--------------------------------------------------------+
  | 1          | synchronization finished: signal a remote peer that    |
  |            | local updates have been pushed and local is considered |
  |            | up to date.                                            |
  +------------+--------------------------------------------------------+
  | 2          | synchronization partial: signal a remote peer that     |
  |            | local updates have been pushed and local is not        |
  |            | considered up to date.                                 |
  +------------+--------------------------------------------------------+
  | 3          | synchronization confirmed: acknowledge a finished or   |
  |            | partial synchronization message.                       |
  +------------+--------------------------------------------------------+
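
Since control messages are made of the two header bytes only, building one
is straightforward. In the sketch below, the ControlType names and the
control_message() helper are hypothetical; only the numeric values come from
the tables of this document.

```python
from enum import IntEnum

CONTROL_CLASS = 0   # class byte of control messages


class ControlType(IntEnum):
    SYNC_REQUEST = 0     # ask a remote peer for a full synchronization
    SYNC_FINISHED = 1    # local updates pushed, local is up to date
    SYNC_PARTIAL = 2     # local updates pushed, local is not up to date
    SYNC_CONFIRMED = 3   # acknowledge a finished or partial synchronization


def control_message(msg_type):
    """Return the two bytes of a control message: class byte, then type byte."""
    return bytes([CONTROL_CLASS, ControlType(msg_type)])
```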


*** Error message class ***

There exist two types of such error messages:

  +-----------+------------------+
  | type byte | meaning          |
  +-----------+------------------+
  | 0         | protocol error   |
  +-----------+------------------+
  | 1         | size limit error |
  +-----------+------------------+


*** Stick-table update message class ***

This class is the most important one because it relates to the handling of
stick-table entries between peers, which is at the core of the peers
protocol.

All the messages of this class are variable-length. Their type bytes are
all greater than or equal to 128.

There exist five types of such stick-table update messages:

  +-----------+--------------------------------+
  | type byte | meaning                        |
  +-----------+--------------------------------+
  | 128       | Entry update                   |
  +-----------+--------------------------------+
  | 129       | Incremental entry update       |
  +-----------+--------------------------------+
  | 130       | Stick-table definition         |
  +-----------+--------------------------------+
  | 131       | Stick-table switch (unused)    |
  +-----------+--------------------------------+
  | 133       | Update message acknowledgement |
  +-----------+--------------------------------+

Note that entry update messages may be multiplexed. This means that entry
update messages for different stick-tables may be sent over the same peer
session.

To do so, each time entry update messages have to be sent, they must be
preceded by a stick-table definition message. This remains true for
incremental entry update messages.

As their name indicates, "Update message acknowledgement" messages are used
to acknowledge the entry update messages.

The following paragraphs give some information about the format of each
stick-table update message. The simple legend below helps in understanding
it. The unit used is the octet.

       XX
  +-----------+
  |    foo    |  Unique fixed sized "foo" field, made of XX octets.
  +-----------+

  +===========+
  |    foo    |  Variable-length "foo" field.
  +===========+

  +xxxxxxxxxxx+
  |    foo    |  Encoded variable-length "foo" field.
  +xxxxxxxxxxx+

  +###########+
  |    foo    |  "foo" field described hereunder.
  +###########+


With this legend, all the stick-table update messages have such a header:

            1                        1
  +--------------------+------------------------+xxxxxxxxxxxxxxxx+
  | Message Class (10) | Message type (128-133) | Message length |
  +--------------------+------------------------+xxxxxxxxxxxxxxxx+

Note that, to help different versions of the peers protocol communicate,
such stick-table update messages may be extended by adding optional fields
at their end and announcing a total message length which is greater than
the message length of the previous versions of the peers protocol. After
having parsed the fields it knows, a peer skips the remaining ones in order
to parse the next message.

- Definition message format:

Before sending entry update messages, a peer must announce the configuration
of the stick-table these messages relate to by sending a "Stick-table
definition" message with the following format:

  +xxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxx+==================+
  | Stick-table ID | Stick-table name length | Stick-table name |
  +xxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxx+==================+

  +xxxxxxxxxxxx+xxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxx+
  |  Key type  |  Key length  |  Data types bitfield  | Expiry  |
  +xxxxxxxxxxxx+xxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxx+

  +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+
  |   Frequency counter #1    |   Frequency counter #1 period   |
  +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+

  +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+
  |   Frequency counter #2    |   Frequency counter #2 period   |
  +xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+
                                .
                                .
                                .

Note that the "Stick-table ID" field is an encoded integer which is used to
identify the stick-table without using its name (or "Stick-table name"
field). It is local to the process handling the stick-table. So we can have
two peers attached to processes which generate stick-table updates for
the same stick-table (same name) but with different stick-table IDs.

Also note that the list of "Frequency counter #X" fields and their
associated period fields exists only if their underlying types are set in
the "Data types bitfield" field.

The "Expiry" field and the following ones are not used by all the existing
versions of haproxy peers, but they are MANDATORY so that a stick-table
aggregator peer is able to autoconfigure itself.
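
A definition message could be assembled as sketched below, taking the fields
in the order of the figures above. Several points are assumptions: the
function names are hypothetical, the numeric values used for the key type
and the data types bitfield are not listed in this document and are left to
the caller, and frequency counters are passed as (counter, period) pairs of
integers.

```python
def encode_varint(val):
    """Encode an integer as described in the Encoding section."""
    if val < 0xF0:
        return bytes([val])
    out = bytearray([(val | 0xF0) & 0xFF])
    val = (val - 0xF0) >> 4
    while val >= 0x80:
        out.append((val | 0x80) & 0xFF)
        val = (val - 0x80) >> 7
    out.append(val)
    return bytes(out)


def build_definition(table_id, name, key_type, key_len, data_types,
                     expiry, freq_counters=()):
    """Build a "Stick-table definition" message (class 10, type 130)."""
    raw_name = name.encode("ascii")        # name encoding is an assumption
    body = (encode_varint(table_id) + encode_varint(len(raw_name)) + raw_name
            + encode_varint(key_type) + encode_varint(key_len)
            + encode_varint(data_types) + encode_varint(expiry))
    for counter, period in freq_counters:
        body += encode_varint(counter) + encode_varint(period)
    # Header: class byte, type byte, then the encoded length of the body.
    return bytes([10, 130]) + encode_varint(len(body)) + body
```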


- Entry update message format:

           4
  +-----------------+###########+############+
  | Local update ID |    Key    |    Data    |
  +-----------------+###########+############+

with "Key" described as follows:

  +xxxxxxxxxxx+=======+
  |  length   | value |  if key type is (non null-terminated) "string",
  +xxxxxxxxxxx+=======+

      4
  +-------+
  | value |  if key type is "integer",
  +-------+

  +=======+
  | value |  for other key types: the size is announced in
  +=======+  the previous stick-table definition message.

The "Data" field is basically a list of encoded values, one for each type
announced by the "Data types bitfield" field of the previous "Stick-table
definition" message:

  +xxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxx+      +xxxxxxxxxxxxxxxxxxxx+
  | Data type #1 value | Data type #2 value | .... | Data type #n value |
  +xxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxx+      +xxxxxxxxxxxxxxxxxxxx+
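
For a "string" key, an entry update message might be built as in the sketch
below. The function names are hypothetical, and the network (big-endian)
byte order used for the 4-octet "Local update ID" is an assumption, since
this document does not state the byte order.

```python
import struct


def encode_varint(val):
    """Encode an integer as described in the Encoding section."""
    if val < 0xF0:
        return bytes([val])
    out = bytearray([(val | 0xF0) & 0xFF])
    val = (val - 0xF0) >> 4
    while val >= 0x80:
        out.append((val | 0x80) & 0xFF)
        val = (val - 0x80) >> 7
    out.append(val)
    return bytes(out)


def build_string_entry_update(update_id, key, data_values):
    """Build an "Entry update" message (class 10, type 128), string key."""
    body = struct.pack("!I", update_id)     # 4 octets, byte order assumed
    body += encode_varint(len(key)) + key   # encoded length, then raw key
    for value in data_values:               # one encoded value per data type
        body += encode_varint(value)
    return bytes([10, 128]) + encode_varint(len(body)) + body
```

For example, update 5 for the key "abc" with a single data value of 7 gives
a 9-octet body, so the message is 12 octets long in total.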



- Update message acknowledgement format:

These messages are responses to "Entry update" messages only.

Their format is kept very simple for efficiency reasons:

                         4
  +xxxxxxxxxxxxxxxx+-----------+
  | Stick-table ID | Update ID |
  +xxxxxxxxxxxxxxxx+-----------+


Note that the "Stick-table ID" field value refers to the one which has been
previously announced by a "Stick-table definition" message.
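
An acknowledgement could be built and parsed as sketched below. The helper
names are hypothetical, the type byte 133 is taken from the table of
stick-table update message types above, and the network (big-endian) byte
order of the 4-octet "Update ID" is an assumption. The stick-table ID is
kept below 0xf0 here so that it fits in a single encoded byte.

```python
import struct

ACK_TYPE = 133   # "Update message acknowledgement", from the table above


def build_ack(table_id, update_id):
    """Build an acknowledgement for update_id of the stick-table table_id."""
    if not 0 <= table_id < 0xF0:
        raise ValueError("this sketch only handles one-byte stick-table IDs")
    body = bytes([table_id]) + struct.pack("!I", update_id)
    # The body is 5 octets long, so its encoded length is one byte as well.
    return bytes([10, ACK_TYPE, len(body)]) + body


def parse_ack(msg):
    """Return (table_id, update_id) from a message made by build_ack()."""
    if msg[0] != 10 or msg[1] != ACK_TYPE or msg[2] != 5 or len(msg) != 8:
        raise ValueError("not an acknowledgement built by this sketch")
    (update_id,) = struct.unpack("!I", msg[4:8])
    return msg[3], update_id
```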

The following diagram may help in understanding how to handle a stream of
stick-table update messages. The handshake step is not represented.
Stick-table IDs are preceded by a '#' character.


    Peer A                                  Peer B

                     stkt def. #1
                ---------------------->
                     updates (1-5)
                ---------------------->
                     stkt def. #3
                ---------------------->
                  updates (1000-1005)
                ---------------------->

                     stkt def. #2
                <----------------------
                    updates (10-15)
                <----------------------
                     ack 5 for #1
                <----------------------
                    ack 1005 for #3
                <----------------------
                     stkt def. #4
                <----------------------
                   updates (100-105)
                <----------------------

                     ack 15 for #2
                ---------------------->
                    ack 105 for #4
                ---------------------->
            (from here, on both sides, all stick-table updates
             are considered as received)
460