+--------------------+
| Peers protocol 2.1 |
+--------------------+
The peers protocol is implemented over TCP. Its aim is to transmit
stick-table entry information between several haproxy processes.
This protocol is symmetrical. This means that at any time, each peer
may connect to the other peers it has been configured for, in order to send
its latest stick-table updates. There is no client or server role in this
protocol. As peers may connect to each other at the same time, the protocol
ensures that only one peer session stays open between a pair of peers
before they start sending their stick-table information, possibly in both
directions.
Handshake | |
+++++++++ | |
Just after having connected to another one, a peer must identify itself
and identify the remote peer by sending a "hello" message. The remote peer
replies with a "status" message.
A "hello" message is made of three lines, each terminated by a line feed
character, as follows:
<protocol identifier> <version>\n | |
<remote peer identifier>\n | |
<local peer identifier> <process ID> <relative process ID>\n | |
protocol identifier : HAProxyS | |
version : 2.1 | |
remote peer identifier: the peer name this "hello" message is sent to. | |
local peer identifier : the name of the peer which sends this "hello" message. | |
process ID : the ID of the process handling this peer session. | |
relative process ID : the haproxy's relative process ID (0 if nbproc == 1). | |
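As an illustration of the three-line layout above, here is a minimal Python sketch that builds a "hello" message. The function name build_hello and the choice of ASCII encoding are assumptions made for this example; they are not taken from the haproxy sources.

```python
def build_hello(remote_peer: str, local_peer: str, pid: int,
                relative_pid: int = 0) -> bytes:
    """Build a "hello" message: three lines, each ending with a line feed.

    Hypothetical helper; ASCII encoding is an assumption of this sketch.
    """
    lines = [
        "HAProxyS 2.1",                        # protocol identifier and version
        remote_peer,                           # peer this message is sent to
        f"{local_peer} {pid} {relative_pid}",  # sender name and process IDs
    ]
    return "".join(line + "\n" for line in lines).encode("ascii")
```

For instance, build_hello("peerB", "peerA", 1234) yields the bytes of "HAProxyS 2.1\npeerB\npeerA 1234 0\n".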
The "status" message is made of a unique line terminated by a line feed | |
character as follows: | |
<status code>\n | |
with these values as status code (a three-digit number): | |
+-------------+---------------------------------+ | |
| status code | signification | | |
+-------------+---------------------------------+ | |
| 200 | Handshake succeeded | | |
+-------------+---------------------------------+ | |
| 300 | Try again later | | |
+-------------+---------------------------------+ | |
| 501 | Protocol error | | |
+-------------+---------------------------------+ | |
| 502 | Bad version | | |
+-------------+---------------------------------+ | |
| 503 | Local peer identifier mismatch | | |
+-------------+---------------------------------+ | |
| 504 | Remote peer identifier mismatch | | |
+-------------+---------------------------------+ | |
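The status line can be checked with a few lines of Python. This is only a sketch: parse_status and STATUS_MEANINGS are hypothetical names, and treating any malformed line as an error is a choice made for the example.

```python
from typing import Tuple

# Status codes listed in the table above.
STATUS_MEANINGS = {
    200: "Handshake succeeded",
    300: "Try again later",
    501: "Protocol error",
    502: "Bad version",
    503: "Local peer identifier mismatch",
    504: "Remote peer identifier mismatch",
}


def parse_status(line: bytes) -> Tuple[int, str]:
    """Parse a "status" message: a three-digit code followed by a line feed."""
    if not line.endswith(b"\n"):
        raise ValueError("status line is not terminated by a line feed")
    text = line[:-1].decode("ascii")
    if len(text) != 3 or not text.isdigit():
        raise ValueError("status code must be a three-digit number")
    code = int(text)
    return code, STATUS_MEANINGS.get(code, "unknown status code")
```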
As the protocol is symmetrical, some peers may connect to each other at the
same time. For efficiency reasons, the protocol ensures there is only
one TCP session left open after the handshake has succeeded and before any
stick-table data is transmitted. For each pair of peers, the last connected
peer wins. Each time a peer A receives a "hello"
message from a peer B, peer A checks whether it has already opened a peer
session with peer B, that is, with a successful handshake. If so,
peer A closes its own peer session, so the peer session opened by B
is the one which stays open.
Peer A Peer B | |
hello | |
----------------------> | |
status 200 | |
<---------------------- | |
hello | |
<++++++++++++++++++++++ | |
TCP/FIN-ACK | |
----------------------> | |
TCP/FIN-ACK | |
<---------------------- | |
status 200 | |
++++++++++++++++++++++> | |
data | |
<++++++++++++++++++++++ | |
data | |
++++++++++++++++++++++> | |
data | |
++++++++++++++++++++++> | |
data | |
<++++++++++++++++++++++ | |
. | |
. | |
. | |
As it is still possible that a pair of peers decide to close both their
peer sessions at the same time, the protocol ensures peers will not reconnect
at the same time by adding a random delay (50 up to 2050 ms) before any
reconnection.
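The random reconnection delay can be sketched as follows. The function name is hypothetical, and treating both bounds (50 and 2050 ms) as inclusive is an assumption, since the text does not say whether the upper bound can be reached.

```python
import random


def reconnect_delay_ms(rng=None) -> int:
    """Pick a random delay in milliseconds before reconnecting to a peer.

    Assumption of this sketch: both bounds, 50 and 2050, are inclusive.
    """
    rng = rng or random
    return rng.randint(50, 2050)
```

Passing a seeded random.Random instance as rng makes the delay reproducible in tests.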
Encoding | |
++++++++ | |
As some TCP data may be corrupted, for integrity reasons, some data fields
are encoded at peer session level.
The following algorithms explain how to encode/decode the data. | |
encode:
    input : val (64-bit integer)
    output: bitf (variable-length bitfield)

    if val is less than 0xf0
        set the next byte of bitf to the value of val
        return bitf
    set the next byte of bitf to the value of val OR'ed with 0xf0
    subtract 0xf0 from val
    right shift val by 4
    while val is greater than or equal to 0x80:
        set the next byte of bitf to the value of the byte made of the last
        7 bits of val OR'ed with 0x80
        subtract 0x80 from val
        right shift val by 7
    set the next byte of bitf to the value of val
    return bitf

decode:
    input : bitf (variable-length bitfield)
    output: val (64-bit integer)

    set val to the value of the first byte of bitf
    if val is less than 0xf0 (bits 4 up to 7 are not all set)
        return val
    set loop to 0
    do
        add to val the value of the next byte of bitf left shifted by (4 + 7*loop)
        set loop to (loop + 1)
    while bit 7 of the byte which has just been added is set
    return val
Example:

    let's say that we must encode 0x1234.

    "set the next byte of bitf to the value of val OR'ed with 0xf0"
        => bitf[0] = (0x1234 | 0xf0) & 0xff = 0xf4
    "subtract 0xf0 from val"
        => val = 0x1144
    "right shift val by 4"
        => val = 0x114
    "set the next byte of bitf to the value of the byte made of the last
     7 bits of val OR'ed with 0x80"
        => bitf[1] = (0x114 | 0x80) & 0xff = 0x94
    "subtract 0x80 from val"
        => val = 0x94
    "right shift val by 7"
        => val = 0x1
    "set the next byte of bitf to the value of val"
        => bitf[2] = 0x1

    So, the encoded value of 0x1234 is 0xf49401.

    To decode this value:

    "set val to the value of the first byte of bitf"
        => val = 0xf4
    "add to val the value of the next byte of bitf left shifted by 4"
        => val = 0xf4 + (0x94 << 4) = 0xf4 + 0x940 = 0xa34
    "add to val the value of the next byte of bitf left shifted by (4 + 7)"
        => val = 0xa34 + (0x01 << 11) = 0xa34 + 0x800 = 0x1234
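The two algorithms can be transcribed into Python as a sketch. The names encode_varint and decode_varint are chosen for this example only, and the bounds checks on truncated input are an addition that the pseudocode does not describe.

```python
from typing import Tuple


def encode_varint(val: int) -> bytes:
    """Encode a 64-bit unsigned integer as described by the "encode" algorithm."""
    if not 0 <= val < (1 << 64):
        raise ValueError("val must fit in 64 bits")
    if val < 0xF0:
        return bytes([val])
    out = bytearray([(val & 0xFF) | 0xF0])   # low byte OR'ed with 0xf0
    val = (val - 0xF0) >> 4
    while val >= 0x80:
        out.append((val & 0x7F) | 0x80)      # last 7 bits OR'ed with 0x80
        val = (val - 0x80) >> 7
    out.append(val)
    return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one encoded integer from buf at pos; return (value, next_pos)."""
    if pos >= len(buf):
        raise ValueError("truncated encoded integer")
    val = buf[pos]
    pos += 1
    if val < 0xF0:
        return val, pos
    shift = 4
    while True:
        if pos >= len(buf):
            raise ValueError("truncated encoded integer")
        byte = buf[pos]
        pos += 1
        val += byte << shift
        shift += 7
        if byte < 0x80:                      # bit 7 clear: this was the last byte
            return val, pos
```

With these helpers, encode_varint(0x1234) gives the bytes f4 94 01, and decoding those bytes gives back 0x1234, as in the worked example.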
Messages | |
++++++++ | |
*** General *** | |
After the handshake has successfully completed, peers are authorized to send
messages to each other, possibly in both directions.
All messages start with a header of at least two bytes.
The first byte of this header identifies the class of the message. The next
byte identifies the type of message in the class.
Some of these messages are variable-length. Others have a fixed size.
Variable-length messages are identified by the value of the message type
byte. For such messages, it is greater than or equal to 128.
All variable-length message headers must be followed by the encoded length
of the remaining bytes (that is, the total length of the message minus 2 bytes
for the header and minus the length of the encoded length itself).
There exist four classes of messages: | |
+------------+---------------------+--------------+
| class byte | signification       | message size |
+------------+---------------------+--------------+
| 0          | control             | fixed (2)    |
+------------+---------------------+--------------+
| 1          | error               | fixed (2)    |
+------------+---------------------+--------------+
| 10         | stick-table updates | variable     |
+------------+---------------------+--------------+
| 255        | reserved            |              |
+------------+---------------------+--------------+
At the time of this writing, only control and error messages have a fixed
size of two bytes (header only). The stick-table update messages are all
variable-length (their message type bytes are greater than or equal to 128).
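The framing rules above (two-byte header, then an encoded length when the type byte is 128 or more) can be sketched in Python as follows. read_message and decode_varint are hypothetical names for this example, and raising ValueError on truncated input is a choice of the sketch.

```python
from typing import Tuple


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one encoded integer (see "Encoding"); return (value, next_pos)."""
    if pos >= len(buf):
        raise ValueError("truncated encoded integer")
    val = buf[pos]
    pos += 1
    if val < 0xF0:
        return val, pos
    shift = 4
    while True:
        if pos >= len(buf):
            raise ValueError("truncated encoded integer")
        byte = buf[pos]
        pos += 1
        val += byte << shift
        shift += 7
        if byte < 0x80:
            return val, pos


def read_message(buf: bytes, pos: int = 0) -> Tuple[int, int, bytes, int]:
    """Read one message; return (class, type, payload, next_pos)."""
    if len(buf) - pos < 2:
        raise ValueError("truncated message header")
    msg_class, msg_type = buf[pos], buf[pos + 1]
    pos += 2
    if msg_type < 128:
        # Fixed-size message: the two-byte header is the whole message.
        return msg_class, msg_type, b"", pos
    length, pos = decode_varint(buf, pos)
    if len(buf) - pos < length:
        raise ValueError("truncated message payload")
    return msg_class, msg_type, buf[pos:pos + length], pos + length
```

Because the returned next_pos always points past the announced length, a caller can move on to the next message even when the payload holds fields it does not understand.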
*** Control message class *** | |
At the time of this writing, control messages are fixed-length messages used
only to control the synchronization between local and/or remote processes
and to emit heartbeat messages.
There exist five types of such control messages:
+------------+--------------------------------------------------------+
| type byte  | signification                                          |
+------------+--------------------------------------------------------+
| 0          | synchronization request: ask a remote peer for a full  |
|            | synchronization                                        |
+------------+--------------------------------------------------------+
| 1          | synchronization finished: signal a remote peer that    |
|            | local updates have been pushed and local is considered |
|            | up to date.                                            |
+------------+--------------------------------------------------------+
| 2          | synchronization partial: signal a remote peer that     |
|            | local updates have been pushed and local is not        |
|            | considered up to date.                                 |
+------------+--------------------------------------------------------+
| 3          | synchronization confirmed: acknowledge a finished or   |
|            | partial synchronization message.                       |
+------------+--------------------------------------------------------+
| 4          | Heartbeat message.                                     |
+------------+--------------------------------------------------------+
About heartbeat messages: a peer sends heartbeat messages to the peers it is
connected to after periods of 3s of inactivity (i.e. when there has been no
stick-table update to synchronize for 3s). After a successful peer protocol
handshake between two peers, if one of them does not send any peer
protocol message (i.e. no heartbeat and no stick-table update message)
during a 5s period, it is considered no longer alive by its remote peer,
which closes the session and then tries to reconnect to the peer which
has just disappeared.
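The two timers can be modelled with a small Python class. This is only a sketch: SessionTimers and its method names are hypothetical, time is passed in explicitly to keep the example testable, and treating the 3s and 5s limits as reached when the elapsed time is equal to them is an assumption.

```python
HEARTBEAT_IDLE_S = 3.0           # idle period before sending a heartbeat
DEAD_PEER_S = 5.0                # silence period before the peer is considered dead
HEARTBEAT_MSG = bytes([0, 4])    # control class (0), heartbeat type (4)


class SessionTimers:
    """Track the last send and receive times of one peer session."""

    def __init__(self, now: float) -> None:
        self.last_sent = now
        self.last_received = now

    def on_sent(self, now: float) -> None:
        self.last_sent = now

    def on_received(self, now: float) -> None:
        self.last_received = now

    def must_send_heartbeat(self, now: float) -> bool:
        # Nothing was sent for 3s: emit a heartbeat to stay visible.
        return now - self.last_sent >= HEARTBEAT_IDLE_S

    def peer_is_dead(self, now: float) -> bool:
        # Nothing was received for 5s: close the session and reconnect.
        return now - self.last_received >= DEAD_PEER_S
```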
*** Error message class *** | |
There exist two types of such error messages:
+-----------+------------------+ | |
| type byte | signification | | |
+-----------+------------------+ | |
| 0 | protocol error | | |
+-----------+------------------+ | |
| 1 | size limit error | | |
+-----------+------------------+ | |
*** Stick-table update message class *** | |
This class is the most important one because it relates to the handling of
stick-table entries between peers, which is at the core of the peers
protocol.
All the messages of this class are variable-length. Their type bytes are
all greater than or equal to 128.
There exist five types of such stick-table update messages:
+-----------+--------------------------------+ | |
| type byte | signification | | |
+-----------+--------------------------------+ | |
| 128 | Entry update | | |
+-----------+--------------------------------+ | |
| 129 | Incremental entry update | | |
+-----------+--------------------------------+ | |
| 130 | Stick-table definition | | |
+-----------+--------------------------------+ | |
| 131 | Stick-table switch (unused) | | |
+-----------+--------------------------------+ | |
| 133 | Update message acknowledgement | | |
+-----------+--------------------------------+ | |
Note that entry update messages may be multiplexed. This means that
entry update messages for different stick-tables may be sent over the same
peer session.
To do so, each time entry update messages have to be sent, they must be preceded
by a stick-table definition message. This remains true for incremental entry
update messages.
As their name indicates, "Update message acknowledgement" messages are used to
acknowledge the entry update messages.
The following paragraphs give some information about the format of
each stick-table update message. The simple legend below helps in
understanding it. The unit used is the octet.
XX | |
+-----------+ | |
| foo | Unique fixed sized "foo" field, made of XX octets. | |
+-----------+ | |
+===========+ | |
| foo | Variable-length "foo" field. | |
+===========+ | |
+xxxxxxxxxxx+ | |
| foo | Encoded variable-length "foo" field. | |
+xxxxxxxxxxx+ | |
+###########+ | |
| foo | hereunder described "foo" field. | |
+###########+ | |
With this legend, all the stick-table update messages have such a header: | |
          1                      1
+--------------------+------------------------+xxxxxxxxxxxxxxxx+
| Message Class (10) | Message type (128-133) | Message length |
+--------------------+------------------------+xxxxxxxxxxxxxxxx+
Note that to help different versions of the peers protocol communicate,
such stick-table update messages may be extended by adding non-mandatory
fields at the end of such messages, announcing a total message length
which is greater than the message length of the previous versions of the
peers protocol. After having parsed the fields it knows, a peer skips the
remaining bytes of the message in order to parse the next message.
- Definition message format: | |
Before sending entry update messages, a peer must announce the configuration
of the stick-table these messages relate to, by means of a
"Stick-table definition" message with the following format:
+xxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxx+==================+ | |
| Stick-table ID | Stick-table name length | Stick-table name | | |
+xxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxx+==================+ | |
+xxxxxxxxxxxx+xxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxx+ | |
| Key type | Key length | Data types bitfield | Expiry | | |
+xxxxxxxxxxxx+xxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxx+ | |
+xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+ | |
| Frequency counter #1 | Frequency counter #1 period | | |
+xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+ | |
+xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+ | |
| Frequency counter #2 | Frequency counter #2 period | | |
+xxxxxxxxxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx+ | |
. | |
. | |
. | |
Note that the "Stick-table ID" field is an encoded integer which is used to
identify the stick-table without using its name (or "Stick-table name"
field). It is local to the process handling the stick-table. So we can have
two peers attached to processes which generate stick-table updates for
the same stick-table (same name) but with different stick-table IDs.
Also note that the list of "Frequency counter #X" fields and their associated
period fields exists only if their underlying types are defined
in the "Data types bitfield" field.
The "Expiry" field and the remaining ones are not used by all the existing
versions of haproxy peers. But they are MANDATORY, so that a
stick-table aggregator peer is able to autoconfigure itself.
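As a sketch of how the leading fields of a definition message payload could be read, the Python below decodes the fields up to "Expiry" and returns the rest untouched. parse_table_definition and decode_varint are hypothetical names; the frequency counter fields are deliberately left undecoded because their presence depends on the data types bitfield, whose bit assignments are not described here.

```python
from typing import Tuple


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one encoded integer (see "Encoding"); return (value, next_pos)."""
    if pos >= len(buf):
        raise ValueError("truncated encoded integer")
    val = buf[pos]
    pos += 1
    if val < 0xF0:
        return val, pos
    shift = 4
    while True:
        if pos >= len(buf):
            raise ValueError("truncated encoded integer")
        byte = buf[pos]
        pos += 1
        val += byte << shift
        shift += 7
        if byte < 0x80:
            return val, pos


def parse_table_definition(payload: bytes) -> dict:
    """Decode the leading fields of a "Stick-table definition" payload."""
    table_id, pos = decode_varint(payload, 0)
    name_len, pos = decode_varint(payload, pos)
    if len(payload) - pos < name_len:
        raise ValueError("truncated stick-table name")
    name = payload[pos:pos + name_len]
    pos += name_len
    key_type, pos = decode_varint(payload, pos)
    key_len, pos = decode_varint(payload, pos)
    data_types, pos = decode_varint(payload, pos)
    expiry, pos = decode_varint(payload, pos)
    return {
        "table_id": table_id,
        "name": name,
        "key_type": key_type,
        "key_length": key_len,
        "data_types": data_types,
        "expiry": expiry,
        "rest": payload[pos:],   # frequency counters and any later fields
    }
```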
- Entry update message format: | |
4 | |
+-----------------+###########+############+ | |
| Local update ID | Key | Data | | |
+-----------------+###########+############+ | |
with "Key" described as follows: | |
+xxxxxxxxxxx+=======+ | |
| length | value | if key type is (non null terminated) "string", | |
+xxxxxxxxxxx+=======+ | |
4 | |
+-------+ | |
| value | if key type is "integer", | |
+-------+ | |
+=======+ | |
| value | for other key types: the size is announced in | |
+=======+ the previous stick-table definition message. | |
"Data" field is basically a list of encoded values for each type announced | |
by the "Data types bitfield" field of the previous "Stick-table definition" | |
message: | |
+xxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxx+ +xxxxxxxxxxxxxxxxxxxx+ | |
| Data type #1 value | Data type #2 value | .... | Data type #n value | | |
+xxxxxxxxxxxxxxxxxxxx+xxxxxxxxxxxxxxxxxxxx+ +xxxxxxxxxxxxxxxxxxxx+ | |
Most of these fields are internally stored as uint32_t (see STD_T_SINT,
STD_T_UINT, STD_T_ULL C enumerations) or structures made of several uint32_t
(see STD_T_FRQP C enumeration). The remaining one, STD_T_DICT, is internally
used to store entries of LRU caches for other literal dictionary entries
(pairs of IDs associated with strings). It is used to transmit these cache
entries as follows:
+xxxxxxxxxxx+xxxx+xxxxxxxxxxxxxxx+========+ | |
| length | ID | string length | string | | |
+xxxxxxxxxxx+xxxx+xxxxxxxxxxxxxxx+========+ | |
"length" is the length in bytes of the remaining data after this "length" field. | |
"string length" is the length of "string" field which follows. | |
Here the cache is used to avoid sending an already sent string again and
again. Indeed, the second time the same dictionary entry has to be sent,
if still cached, a peer sends only its ID:
+xxxxxxxxxxx+xxxx+ | |
| length | ID | | |
+xxxxxxxxxxx+xxxx+ | |
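Both dictionary entry forms can be told apart using the "length" field alone: if nothing remains after the ID, the entry is the ID-only form. The Python below is a sketch of that reading; parse_dict_entry and decode_varint are hypothetical names, and returning None for the string of an ID-only entry is a convention of this example.

```python
from typing import Optional, Tuple


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one encoded integer (see "Encoding"); return (value, next_pos)."""
    if pos >= len(buf):
        raise ValueError("truncated encoded integer")
    val = buf[pos]
    pos += 1
    if val < 0xF0:
        return val, pos
    shift = 4
    while True:
        if pos >= len(buf):
            raise ValueError("truncated encoded integer")
        byte = buf[pos]
        pos += 1
        val += byte << shift
        shift += 7
        if byte < 0x80:
            return val, pos


def parse_dict_entry(buf: bytes, pos: int = 0) -> Tuple[int, Optional[bytes], int]:
    """Return (ID, string or None, next_pos) for one dictionary entry."""
    length, pos = decode_varint(buf, pos)
    end = pos + length
    if end > len(buf):
        raise ValueError("truncated dictionary entry")
    body = buf[pos:end]                 # never read past the announced length
    entry_id, off = decode_varint(body, 0)
    if off == len(body):
        return entry_id, None, end      # ID only: the string is already cached
    str_len, off = decode_varint(body, off)
    if off + str_len > len(body):
        raise ValueError("string longer than dictionary entry")
    return entry_id, body[off:off + str_len], end
```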
- Update message acknowledgement format: | |
These messages are responses to "Entry update" messages only.
Their format is very basic for efficiency reasons:
4 | |
+xxxxxxxxxxxxxxxx+-----------+ | |
| Stick-table ID | Update ID | | |
+xxxxxxxxxxxxxxxx+-----------+ | |
Note that the "Stick-table ID" field value relates to the one which
has been previously announced by a "Stick-table definition" message.
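A complete acknowledgement message (header, encoded length, then the two fields above) could be assembled as in the following Python sketch. build_ack and encode_varint are hypothetical names, and sending the 4-octet "Update ID" in network byte order (big-endian) is an assumption, since the byte order is not stated here.

```python
import struct


def encode_varint(val: int) -> bytes:
    """Encode an unsigned integer as described in the "Encoding" section."""
    if val < 0xF0:
        return bytes([val])
    out = bytearray([(val & 0xFF) | 0xF0])
    val = (val - 0xF0) >> 4
    while val >= 0x80:
        out.append((val & 0x7F) | 0x80)
        val = (val - 0x80) >> 7
    out.append(val)
    return bytes(out)


def build_ack(table_id: int, update_id: int) -> bytes:
    """Build an "Update message acknowledgement" message (class 10, type 133)."""
    # Assumption: the 4-octet update ID is sent in network byte order.
    payload = encode_varint(table_id) + struct.pack("!I", update_id)
    return bytes([10, 133]) + encode_varint(len(payload)) + payload
```

For example, build_ack(3, 1005) corresponds to the "ack 1005 for #3" step of the schema that follows.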
The following schema may help in understanding how to handle a stream of | |
stick-table update messages. The handshake step is not represented. | |
Stick-table IDs are preceded by a '#' character. | |
Peer A Peer B | |
stkt def. #1 | |
----------------------> | |
updates (1-5) | |
----------------------> | |
stkt def. #3 | |
----------------------> | |
updates (1000-1005) | |
----------------------> | |
stkt def. #2 | |
<---------------------- | |
updates (10-15) | |
<---------------------- | |
ack 5 for #1 | |
<---------------------- | |
ack 1005 for #3 | |
<---------------------- | |
stkt def. #4 | |
<---------------------- | |
updates (100-105) | |
<---------------------- | |
ack 15 for #2
----------------------> | |
ack 105 for #4 | |
----------------------> | |
(from here, on both sides, all stick-table updates | |
are considered as received) | |