2018-02-21 - Layering in haproxy 1.9
------------------------------------

2 main zones :
  - application : reads from conn_streams, writes to conn_streams, often uses
    streams

  - connection : receives data from the network, presented into buffers
    available via conn_streams, sends data to the network


The connection zone contains multiple layers which behave independently in each
direction. The Rx direction is activated upon callbacks from the lower layers.
The Tx direction is activated recursively from the upper layers. Between every
two layers there may be a buffer, in each direction. When a buffer is full
either in Tx or Rx direction, this direction is paused between the network
layer and the location where the congestion is encountered. Upon end of
congestion (cs_recv() from the upper layer, or sendto() at the lower layers), a
tasklet_wakeup() is performed on the blocked layer so that suspended operations
can be resumed. In this case, the Rx side restarts propagating data upwards
from the lowest blocked level, while the Tx side restarts propagating data
downwards from the highest blocked level. Proceeding like this ensures that
information known to the producer may always be used to tailor the buffer sizes
or decide on a strategy to best aggregate data. Additionally, each time a layer
is crossed without transformation, it becomes possible to send without copying.

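As a minimal sketch of this pause/resume mechanism (the structures and field
names below are invented for illustration; only tasklet_wakeup() is the actual
haproxy primitive) :

    /* Hypothetical sketch of the pause/resume mechanism described above. */
    struct layer {
        struct buffer *buf;         /* buffer between this layer and the next */
        struct tasklet *wait_room;  /* producer paused because <buf> is full  */
        struct tasklet *wait_data;  /* consumer paused because <buf> is empty */
    };

    /* called by the consumer once it has freed some room in <l>'s buffer */
    static inline void layer_notify_room(struct layer *l)
    {
        if (l->wait_room) {
            tasklet_wakeup(l->wait_room);   /* resume the paused Tx side */
            l->wait_room = NULL;
        }
    }

    /* called by the producer once it has added data into <l>'s buffer */
    static inline void layer_notify_data(struct layer *l)
    {
        if (l->wait_data) {
            tasklet_wakeup(l->wait_data);   /* resume the paused Rx side */
            l->wait_data = NULL;
        }
    }
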
The Rx side notifies the application of data readiness using a wakeup or a
callback. The Tx side notifies the application of room availability once data
have been moved resulting in the uppermost buffer having some free space.

When crossing a mux downwards, it is possible that the sender is not allowed to
access the buffer because it is not yet its turn. This is not a problem: the
data remain in the conn_stream's buffer (or the stream's one) and the send will
be restarted once the mux is ready to consume these data.


  cs_recv()          -------.                              cs_send()
      ^    +--------> ||||||  -------------+                   ^
      |    |          -------'             |                   |        stream
  ----|----|-------------------------------|-------------------|--------------
      |    |                               V                   |    connection
   data    .---.                           |                   |  room
  ready!   |---|                         |---|                   available!
           |---|                         |---|
           |---|                         |---|
           |   |                         '---'
      ^    +------------+-------+          |
      |    |            ^       |         /
     /     V            |       V        /
    /   recvfrom()      |    sendto()   |
  -------------|--------|---------------|-------------------------------------
               |        |  poll!        V                               kernel


The cs_recv() function should act on pointers to buffer pointers, so that the
callee may decide to pass its own buffer directly by simply swapping pointers.
Similarly for cs_send() it is desirable to let the callee steal the buffer by
swapping the pointers. This way it remains possible to implement zero-copy
forwarding.

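A trivial sketch of what working on pointers to buffer pointers enables
(illustrative only, the function name is not haproxy's) :

    /* Illustration only: by receiving the address of the caller's buffer
     * pointer, the callee can swap buffers instead of copying bytes.
     */
    static inline void pass_buffer_zero_copy(struct buffer **dst, struct buffer **src)
    {
        struct buffer *tmp = *dst;

        *dst = *src;   /* hand the full source buffer to the destination */
        *src = tmp;    /* give the (empty) destination buffer back       */
    }
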
Some operation flags will be needed on cs_recv() :
  - RECV_ZERO_COPY : refuse to merge new data into the current buffer if it
    will result in a data copy (i.e. the buffer is not empty), unless no more
    than XXX bytes have to be copied (e.g. copying 2 cache lines may be cheaper
    than waiting and playing with pointers)

  - RECV_AT_ONCE : only perform the operation if it will result in the source
    buffer becoming empty at the end of the operation so that no two buffers
    remain allocated at the end. It will most of the time result in either a
    small read or a zero-copy operation.

  - RECV_PEEK : retrieve a copy of pending data without removing these data
    from the source buffer. Maybe an alternate solution could consist in
    finding the pointer to the source buffer and accessing these data directly,
    except that it might be less interesting for the long term, thread-wise.

  - RECV_MIN : receive a minimum of X bytes (or less with a shutdown), or fail.
    This should help various protocol parsers which need to receive a complete
    frame before proceeding.

  - RECV_ENOUGH : no more data expected after this read if it's of the
    requested size, thus no need to re-enable receiving on the lower layers.

  - RECV_ONE_SHOT : perform a single read without re-enabling reading on the
    lower layers, like we currently do when receiving an HTTP/1 request. Like
    RECV_ENOUGH where any size is enough. The two could probably be merged
    (e.g. by having a MIN argument like RECV_MIN).


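These would naturally be independent bits in a flags argument; a hypothetical
encoding of the list above (names reused, values invented, not actual haproxy
defines) :

    #define RECV_ZERO_COPY  0x0001  /* only accept data if no copy is needed        */
    #define RECV_AT_ONCE    0x0002  /* only if the source buffer ends up empty      */
    #define RECV_PEEK       0x0004  /* copy pending data without consuming them     */
    #define RECV_MIN        0x0008  /* fail unless at least <min> bytes (or shut)   */
    #define RECV_ENOUGH     0x0010  /* no need to re-enable Rx once size is reached */
    #define RECV_ONE_SHOT   0x0020  /* single read, do not re-enable lower Rx       */
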
Some operation flags will be needed on cs_send() :
  - SEND_ZERO_COPY : refuse to merge the presented data with existing data and
    prefer to wait for current data to leave and try again, unless the consumer
    considers the amount of data acceptable for a copy.

  - SEND_AT_ONCE : only perform the operation if it will result in the source
    buffer becoming empty at the end of the operation so that no two buffers
    remain allocated at the end. It will most of the time result in either a
    small write or a zero-copy operation.


Both operations should return a composite status :
  - number of bytes transferred
  - status flags (shutr, shutw, reset, empty, full, ...)

2018-07-23 - Update after merging rxbuf
---------------------------------------

It becomes visible that the mux will not always be welcome to decode incoming
data because it will sometimes imply extra memory copies and/or usage for no
benefit.

Ideally, when a stream is instantiated based on incoming data, these incoming
data should be passed and the upper layers called, but it should then be up to
these upper layers to peek more data in certain circumstances. Typically if the
pending connection data are larger than what is expected to be passed above,
it means some data may cause head-of-line blocking (HOL) to other streams, and
need to be pushed up through the layers to let other streams continue to work.
Similarly very large H2 data frames after header frames should probably not be
passed as they may require copies that could be avoided if passed later.
However if the decoded frame fits into the conn_stream's buffer, there is an
opportunity to use a single buffer for the conn_stream and the channel. The H2
demux could set a blocking flag indicating it's waiting for the upper stream to
take over demuxing. This flag would be purged once the upper stream would start
reading, or when extra data come and change the conditions.

Forcing structured headers and raw data to coexist within a single buffer is
quite challenging for many code parts. For example it's perfectly possible to
see a fragmented buffer containing series of headers, then a small data chunk
that was received at the same time, then a few other headers added by request
processing, then another data block received afterwards, then possibly yet
another header added by option http-send-name-header, and yet another data
block. This causes some pain for compression which still needs to know where
compressed and uncompressed data start/stop. It also makes it very difficult
to account for the exact bytes to pass through the various layers.

One solution consists in thinking about buffers using 3 representations :

  - a structured message, which is used for the internal HTTP representation.
    This message may only be atomically processed. It has no clear byte count,
    it's a message.

  - a raw stream, consisting in sequences of bytes. That's typically what
    happens in data sequences or in tunnel.

  - a pipe, which contains data to be forwarded, and that haproxy cannot have
    access to.

The processing efficiency decreases with the higher complexity above, but the
capabilities increase. The structured message can contain anything including
serialized data blocks to be processed or forwarded. The raw stream contains
data blocks to be processed or forwarded. The pipe only contains data blocks
to be forwarded. The latter ones are only an optimization of the former ones.

Thus ideally a channel should have access to all such 3 storage areas at once,
depending on the use case :
  (1) a structured message,
  (2) a raw stream,
  (3) a pipe

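In code terms, this long-term view would roughly be a channel holding the three
storage forms side by side; a sketch with hypothetical type and field names
(only struct buffer and struct pipe exist as such in haproxy) :

    struct channel_storage {
        struct structured_msg *msg; /* (1) structured message: atomic, no byte count */
        struct buffer *raw;         /* (2) raw stream of bytes                       */
        struct pipe *pipe;          /* (3) kernel pipe, contents not visible to us   */
    };
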
Right now a channel only has (2) and (3) but after the native HTTP rework, it
will only have (1) and (3). Placing a raw stream exclusively in (1) comes with
some performance drawbacks which are not easily recovered, and with some quite
difficult management still involving the reserve to ensure that a data block
doesn't prevent headers from being appended. But during header processing, the
payload may be necessary so we cannot decide to drop this option.

A long-term approach would consist in ensuring that a single channel may have
access to all 3 representations at once, and to enumerate priority rules to
define how they interact together. That's exactly what is currently being done
with the pipe and the raw buffer right now. Doing so would also save the need
for storing payload in the structured message and void the requirement for the
reserve. But it would cost more memory to process POST data and server
responses. Thus an intermediary step consists in keeping this model in mind but
not implementing everything yet.

Short term proposal : a channel has access to a buffer and a pipe. A non-empty
buffer is either in structured message format OR raw stream format. Only the
channel knows. However a structured buffer MAY contain raw data in a properly
formatted way (using the envelope defined by the structured message format).

By default, when a demux writes to a CS rxbuf, it will try to use the lowest
possible level for what is being done (i.e. splice if possible, otherwise raw
stream, otherwise structured message). If the buffer already contains a
structured message, then this format is exclusive. From this point the MUX has
two options : either encode the incoming data to match the structured message
format, or refrain from receiving into the CS's rxbuf and wait until the upper
layer requests those data.

This opens a simplified option which could be suited even for the long term :
  - cs_recv() will take one or two flags to indicate if a buffer already
    contains a structured message or not ; the upper layer knows it.

  - cs_recv() will take two flags to indicate what the upper layer is willing
    to take :
      - structured message only
      - raw stream only
      - any of them

    From this point the mux can decide to either pass anything or refrain from
    doing so.

  - the demux stores the knowledge it has from the contents into some CS flags
    to indicate whether or not some structured messages are still available,
    and whether or not some raw data are still available. Thus the caller knows
    whether or not extra data are available.

  - when the demux works on its own, it refrains from passing structured data
    to a non-empty buffer, unless these data are causing trouble to other
    streams (HOL).

  - when a demux has to encapsulate raw data into a structured message, it will
    always have to respect a configured reserve so that extra header processing
    can be done on the structured message inside the buffer, regardless of the
    supposed available room. In addition, the upper layer may indicate using an
    extra recv() flag whether it wants the demux to defragment serialized data
    (for example by moving trailing headers apart) or if it's not necessary.
    This flag will be set by the stream interface if compression is required or
    if the http-buffer-request option is set for example. Using to_forward==0
    is probably a stronger indication that the reserve must be respected.

  - cs_recv() and cs_send(), when fed with a message, should not return byte
    counts but message counts (i.e. 0 or 1). This implies that a single call to
    either of these functions cannot mix raw data and structured messages at
    the same time.

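A hypothetical shape for these indications (names invented for illustration,
not actual haproxy API) :

    /* what the caller of cs_recv() is willing to accept */
    #define RECV_STRUCTURED   0x0100   /* accept a structured message */
    #define RECV_RAW          0x0200   /* accept a raw byte stream    */
    #define RECV_ANY          (RECV_STRUCTURED | RECV_RAW)

    /* what the demux reports as still pending behind the CS */
    #define CS_FL_PENDING_MSG 0x01     /* structured message(s) still available */
    #define CS_FL_PENDING_RAW 0x02     /* raw data still available              */
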
At this point it looks like the conn_stream will have some encapsulation work
to do for the payload if it needs to be encapsulated into a message. This
further magnifies the importance of *not* decoding DATA frames into the CS's
rxbuf until really needed.

The CS will probably need to hold an indication of what is available at the mux
level, not only in the CS. E.g. we know that payload is still available.

Using these elements, it should be possible to ensure that full header frames
may be received without enforcing any reserve, that too large frames that do
not fit will be detected because they return 0 messages and indicate that such
a message is still pending, and that data availability is correctly detected
(later we may expect that the stream-interface allocates a larger or second
buffer to place the payload).

Regarding the ability for the channel to forward data, it looks like having a
new function "cs_xfer(src_cs, dst_cs, count)" could be very productive in
optimizing the forwarding to make use of splicing when available. It is not yet
totally clear whether it will be split into "cs_xfer_in(src_cs, pipe, count)"
followed by "cs_xfer_out(dst_cs, pipe, count)" or anything different, and it
still needs to be studied. The general idea seems to be that the receiver might
have to call the sender directly once they agree on how to transfer data (pipe
or buffer). If the transfer is incomplete, the cs_xfer() return value and/or
flags will indicate the current situation (src empty, dst full, etc) so that
the caller may register for notifications on the appropriate event and wait to
be called again to continue.

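The prototypes under discussion could look roughly like this (hypothetical
sketch; the exact split is explicitly left undecided above) :

    /* The return value would report how much was actually moved, and <flags>
     * the resulting situation (src empty, dst full, ...).
     */
    size_t cs_xfer(struct conn_stream *src_cs, struct conn_stream *dst_cs,
                   size_t count, unsigned int *flags);

    /* possible split around a kernel pipe, to allow splicing */
    size_t cs_xfer_in(struct conn_stream *src_cs, struct pipe *pipe, size_t count);
    size_t cs_xfer_out(struct conn_stream *dst_cs, struct pipe *pipe, size_t count);
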
Short term implementation :
  1) add new CS flags to qualify what the buffer contains and what we expect
     to read into it;

  2) set these flags to pretend we have a structured message when receiving
     headers (after all, H1 is an atomic header as well) and see what it
     implies for the code; for H1 it's unclear whether it makes sense to try
     to set it without the H1 mux.

  3) use these flags to refrain from sending DATA frames after HEADERS frames
     in H2.

  4) flush the flags at the stream interface layer when performing a cs_send().

  5) use the flags to enforce receipt of data only when necessary

We should be able to end up with sequential receipt in H2 modelling what is
needed for other protocols without interfering with the native H1 devs.


2018-08-17 - Considerations after killing cs_recv()
---------------------------------------------------

With the ongoing reorganisation of the I/O layers, it's visible that cs_recv()
will have to transfer data between the cs' rxbuf and the channel's buffer while
not being aware of the data format. Moreover, in case there's no data there, it
needs to recursively call the mux's rcv_buf() to trigger a decoding, while this
function is sometimes replaced with cs_recv(). All this shows that cs_recv() is
in fact needed while data are pushed upstream from the lower layers, and is not
suitable for the "pull" mode. Thus it was decided to remove this function and
put its code back into h2_rcv_buf(). The H1 mux's rcv_buf() already couldn't be
replaced with cs_recv() since it is the only one knowing about the buffer's
format.

This opportunity simplified something : if the cs's rxbuf is only read by the
mux's rcv_buf() method, then it doesn't need to be located into the CS and is
well placed into the mux's representation of the stream. This has an important
impact for H2 as it offers more freedom to the mux to allocate/free/reallocate
this buffer, and it ensures the mux always has access to it.

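For H2 this means the per-stream rxbuf belongs to the mux's representation of
the stream and h2_rcv_buf() merely drains it. A very simplified sketch of the
principle (not the actual code; b_data()/b_xfer() are assumed buffer helpers
and the structure name is made up) :

    struct h2s_sketch {
        struct buffer rxbuf;    /* frames already decoded by the demux */
    };

    static size_t h2s_rcv_buf_sketch(struct h2s_sketch *h2s, struct buffer *buf,
                                     size_t count)
    {
        size_t ret = b_data(&h2s->rxbuf);

        if (ret > count)
            ret = count;
        b_xfer(buf, &h2s->rxbuf, ret);  /* move <ret> bytes, even one at a time */
        return ret;
    }
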
Furthermore, the conn_stream's txbuf experienced the same fate. Indeed, the H1
mux has already uncovered the difficulty related to the channel shutting down
on output, with data stuck into the CS's txbuf. Since the CS is tightly coupled
to the stream and the stream can close immediately once its buffers are empty,
it required a way to support orphaned CSes with pending data in their txbuf.
This is something that the H2 mux already has to deal with, by carefully
leaving the data in the channel's buffer. But since the snd_buf() call is
top-down, it is always possible to push the stream's data via the mux's
snd_buf() call without requiring a CS txbuf anymore. Thus the txbuf (when
needed) is only implemented in the mux and attached to the mux's representation
of the stream, and doing so allows the channel to be released immediately once
the data are safe in the mux's buffer.

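On the send path the same idea applies in reverse; a simplified illustration of
"take everything, park the excess in a mux-owned txbuf" (hypothetical names,
b_xfer() assumed as above) :

    /* snd_buf() moves as much as it can from the stream's buffer into a
     * mux-owned txbuf so the stream may disappear once its own buffers are
     * empty; whatever could not reach the wire waits in <mux_txbuf> for the
     * next I/O event.
     */
    static size_t mux_snd_buf_sketch(struct buffer *mux_txbuf, struct buffer *buf,
                                     size_t count)
    {
        return b_xfer(mux_txbuf, buf, count);  /* returns what was consumed */
    }
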
This is an important change which clarifies the roles and responsibilities of
each layer in the chain : when receiving data from a mux, it's the mux's
responsibility to make sure it can correctly decode the incoming data and to
buffer the possible excess of data it cannot pass to the requester. This means
that decoding an H2 frame, which is not retryable since it has an impact on the
HPACK decompression context, and which cannot be reordered for the same reason,
simply needs to be performed into the H2 stream's rxbuf which will then be
passed to the stream when this one calls h2_rcv_buf(), even if it reads one
byte at a time. Similarly when calling h2_snd_buf(), it's the mux's
responsibility to read as much as it needs to be able to restart later,
possibly by buffering some data into a local buffer. And it's only once all the
output data have been consumed by snd_buf() that the stream is free to
disappear.

This model presents the nice benefit of being infinitely stackable and solving
the last identified showstoppers to move towards a structured message internal
representation, as it will give full power to the rcv_buf() and snd_buf() to
process what they need.

For now the conn_stream's flags indicating whether a shutdown has been seen in
any direction or if an end of stream was seen will remain in the conn_stream,
though it's likely that some of them will move to the mux's representation of
the stream after structured messages are implemented.