2018-02-21 - Layering in haproxy 1.9
------------------------------------
(Willy Tarreau, commit 9382cdd, 2018-02-21 18:07:26 +0100)

2 main zones :
  - application : reads from conn_streams, writes to conn_streams, often uses
    streams

  - connection : receives data from the network, presented into buffers
    available via conn_streams, sends data to the network


The connection zone contains multiple layers which behave independently in each
direction. The Rx direction is activated upon callbacks from the lower layers.
The Tx direction is activated recursively from the upper layers. Between every
two layers there may be a buffer, in each direction. When a buffer is full
either in Tx or Rx direction, this direction is paused between the network
layer and the location where the congestion is encountered. Upon end of
congestion (cs_recv() from the upper layer, or sendto() at the lower layers), a
tasklet_wakeup() is performed on the blocked layer so that suspended operations
can be resumed. In this case, the Rx side restarts propagating data upwards
from the lowest blocked level, while the Tx side restarts propagating data
downwards from the highest blocked level. Proceeding like this ensures that
information known to the producer may always be used to tailor the buffer sizes
or decide on a strategy to best aggregate data. Additionally, each time a layer
is crossed without transformation, it becomes possible to send without copying.

The Rx side notifies the application of data readiness using a wakeup or a
callback. The Tx side notifies the application of room availability once data
have been moved resulting in the uppermost buffer having some free space.

When crossing a mux downwards, it is possible that the sender is not allowed to
access the buffer because it is not yet its turn. This is not a problem: the
data remain in the conn_stream's buffer (or the stream's one) and the transfer
will be restarted once the mux is ready to consume these data.


           cs_recv()      -------.         cs_send()
    ^          +-------->  |||||| -------------+       ^
    |          |          -------'             |       |      stream
  --|----------|-------------------------------|-------|-------------------
    |          |                               V       |      connection
  data       .---.                           |   |   room
 ready!      |---|                           |---| available!
             |---|                           |---|
             |---|                           |---|
             |   |                           '---'
               ^ +------------+-------+        |
               | |            ^       |        /
              /  V            |       V       /
             / recvfrom()     |   sendto()   |
-------------|----------------|--------------|---------------------------
             |                | poll!        V       kernel


The cs_recv() function should act on pointers to buffer pointers, so that the
callee may decide to pass its own buffer directly by simply swapping pointers.
Similarly for cs_send() it is desirable to let the callee steal the buffer by
swapping the pointers. This way it remains possible to implement zero-copy
forwarding.

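A minimal sketch of the pointer-swapping idea, in C. The sketch_buf type and
the sketch_recv() function are hypothetical stand-ins invented for this
illustration, not haproxy's buffer API. When the destination is empty the two
buffer pointers are exchanged and nothing is copied, otherwise the data are
appended by copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical minimal buffer, NOT haproxy's real struct buffer. */
struct sketch_buf {
    char   area[64];
    size_t data;     /* number of bytes currently stored */
};

/* Receive from *src into *dst. When the destination is empty, the two
 * pointers are swapped so that no byte is copied. Otherwise as much as
 * fits is copied and the remainder stays in the source.
 * Returns the number of bytes that changed hands.
 */
static size_t sketch_recv(struct sketch_buf **dst, struct sketch_buf **src)
{
    struct sketch_buf *tmp;
    size_t room, len;

    assert(dst && *dst && src && *src);

    if ((*dst)->data == 0) {
        /* zero-copy path: steal the producer's buffer */
        len  = (*src)->data;
        tmp  = *dst;
        *dst = *src;
        *src = tmp;
        return len;
    }

    /* copy path: append what fits, keep the rest in the source */
    room = sizeof((*dst)->area) - (*dst)->data;
    len  = (*src)->data < room ? (*src)->data : room;
    memcpy((*dst)->area + (*dst)->data, (*src)->area, len);
    memmove((*src)->area, (*src)->area + len, (*src)->data - len);
    (*dst)->data += len;
    (*src)->data -= len;
    return len;
}
```

With an empty destination, the caller ends up owning the producer's buffer and
the producer gets the empty one back, which is what keeps zero-copy forwarding
possible.
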
Some operation flags will be needed on cs_recv() :
  - RECV_ZERO_COPY : refuse to merge new data into the current buffer if it
    will result in a data copy (ie the buffer is not empty), unless no more
    than XXX bytes have to be copied (eg: copying 2 cache lines may be cheaper
    than waiting and playing with pointers)

  - RECV_AT_ONCE : only perform the operation if it will result in the source
    buffer becoming empty at the end of the operation so that no two buffers
    remain allocated at the end. It will most of the time result in either a
    small read or a zero-copy operation.

  - RECV_PEEK : retrieve a copy of pending data without removing these data
    from the source buffer. Maybe an alternate solution could consist in
    finding the pointer to the source buffer and accessing these data directly,
    except that it might be less interesting for the long term, thread-wise.

  - RECV_MIN : receive minimum X bytes (or less with a shutdown), or fail.
    This should help various protocol parsers which need to receive a complete
    frame before proceeding.

  - RECV_ENOUGH : no more data expected after this read if it's of the
    requested size, thus no need to re-enable receiving on the lower layers.

  - RECV_ONE_SHOT : perform a single read without re-enabling reading on the
    lower layers, like we currently do when receiving an HTTP/1 request. Like
    RECV_ENOUGH where any size is enough. The two could probably be merged
    (eg: by having a MIN argument like RECV_MIN).

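To make the semantics of some of these flags more concrete, here is a small
sketch in C covering only RECV_PEEK, RECV_AT_ONCE and RECV_MIN. The flag
values, the sketch_src type and the sketch_recv_flags() signature are
assumptions made up for this example, nothing here is an existing haproxy
API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Flag names come from the notes above, the values are arbitrary. */
enum {
    RECV_PEEK    = 0x01, /* copy pending data without consuming them */
    RECV_AT_ONCE = 0x02, /* only proceed if the source ends up empty */
    RECV_MIN     = 0x04, /* need at least <min> bytes unless shut down */
};

/* Hypothetical source side (e.g. a lower layer's buffer). */
struct sketch_src {
    char   area[64];
    size_t data;   /* pending bytes */
    int    shut;   /* non-zero once the peer has shut down */
};

/* Copies up to <room> bytes from <src> into <dst> according to <flags>.
 * Returns the number of bytes delivered, or 0 when a flag's condition
 * cannot be satisfied yet.
 */
static size_t sketch_recv_flags(char *dst, size_t room, struct sketch_src *src,
                                unsigned int flags, size_t min)
{
    size_t len;

    assert(dst && src);
    len = src->data < room ? src->data : room;

    /* RECV_MIN: fail on short data, except after a shutdown */
    if ((flags & RECV_MIN) && len < min && !src->shut)
        return 0;

    /* RECV_AT_ONCE: refuse if something would remain in the source */
    if ((flags & RECV_AT_ONCE) && len < src->data)
        return 0;

    memcpy(dst, src->area, len);

    /* RECV_PEEK: leave the source untouched */
    if (!(flags & RECV_PEEK)) {
        memmove(src->area, src->area + len, src->data - len);
        src->data -= len;
    }
    return len;
}
```

A frame parser would typically pass RECV_MIN with the frame size and simply
retry later when 0 is returned.
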

Some operation flags will be needed on cs_send() :
  - SEND_ZERO_COPY : refuse to merge the presented data with existing data and
    prefer to wait for current data to leave and try again, unless the consumer
    considers the amount of data acceptable for a copy.

  - SEND_AT_ONCE : only perform the operation if it will result in the source
    buffer becoming empty at the end of the operation so that no two buffers
    remain allocated at the end. It will most of the time result in either a
    small write or a zero-copy operation.


Both operations should return a composite status :
  - number of bytes transferred
  - status flags (shutr, shutw, reset, empty, full, ...)

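One possible shape for such a composite status, sketched in C. The struct
layout, the ST_* names and the sketch_send() helper are hypothetical and only
model a buffer by its fill level.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status flags mirroring the list above. */
enum {
    ST_SHUTR = 0x01,
    ST_SHUTW = 0x02,
    ST_RESET = 0x04,
    ST_EMPTY = 0x08,
    ST_FULL  = 0x10,
};

/* Composite status: a byte count plus flags, returned by value. */
struct sketch_status {
    size_t       bytes;
    unsigned int flags;
};

/* Pretends to send <count> bytes into a buffer of <size> bytes of which
 * <*used> are already occupied. Only the accounting is modelled here.
 */
static struct sketch_status sketch_send(size_t *used, size_t size, size_t count)
{
    struct sketch_status st = { 0, 0 };
    size_t room;

    assert(used && *used <= size);
    room = size - *used;

    st.bytes = count < room ? count : room;
    *used += st.bytes;

    if (*used == size)
        st.flags |= ST_FULL;
    if (*used == 0)
        st.flags |= ST_EMPTY;
    return st;
}
```

Returning both parts at once lets the caller tell "0 bytes because full" apart
from "0 bytes because nothing was asked" without a second call.
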

2018-07-23 - Update after merging rxbuf
---------------------------------------
(Willy Tarreau, commit 7cc040c, 2018-07-23 17:29:37 +0200)

It becomes visible that the mux will not always be welcome to decode incoming
data because it will sometimes imply extra memory copies and/or usage for no
benefit.

Ideally, when a stream is instantiated based on incoming data, these incoming
data should be passed and the upper layers called, but it should then be up to
these upper layers to peek more data in certain circumstances. Typically if
the pending connection data are larger than what is expected to be passed
above, it means some data may cause head-of-line blocking (HOL) to other
streams, and need to be pushed up through the layers to let other streams
continue to work. Similarly very large H2 data frames after header frames
should probably not be passed as they may require copies that could be avoided
if passed later. However if the decoded frame fits into the conn_stream's
buffer, there is an opportunity to use a single buffer for the conn_stream
and the channel. The H2 demux could set a blocking flag indicating it's waiting
for the upper stream to take over demuxing. This flag would be purged once the
upper stream would start reading, or when extra data come and change the
conditions.

Forcing structured headers and raw data to coexist within a single buffer is
quite challenging for many code parts. For example it's perfectly possible to
see a fragmented buffer containing series of headers, then a small data chunk
that was received at the same time, then a few other headers added by request
processing, then another data block received afterwards, then possibly yet
another header added by option http-send-name-header, and yet another data
block. This causes some pain for compression which still needs to know where
compressed and uncompressed data start/stop. It also makes it very difficult
to account for the exact bytes to pass through the various layers.

One solution consists in thinking about buffers using 3 representations :

  - a structured message, which is used for the internal HTTP representation.
    This message may only be atomically processed. It has no clear byte count,
    it's a message.

  - a raw stream, consisting of sequences of bytes. That's typically what
    happens in data sequences or in tunnel.

  - a pipe, which contains data to be forwarded, and that haproxy cannot have
    access to.

Going up the list above, complexity and capabilities increase while processing
efficiency decreases. The structured message can contain anything including
serialized data blocks to be processed or forwarded. The raw stream contains
data blocks to be processed or forwarded. The pipe only contains data blocks
to be forwarded. The latter ones are only an optimization of the former
ones.

Thus ideally a channel should have access to all 3 such storage areas at once,
depending on the use case :
  (1) a structured message,
  (2) a raw stream,
  (3) a pipe

Right now a channel only has (2) and (3) but after the native HTTP rework, it
will only have (1) and (3). Placing a raw stream exclusively in (1) comes with
some performance drawbacks which are not easily recovered, and with some quite
difficult management still involving the reserve to ensure that a data block
doesn't prevent headers from being appended. But during header processing, the
payload may be necessary so we cannot decide to drop this option.

A long-term approach would consist in ensuring that a single channel may have
access to all 3 representations at once, and to enumerate priority rules to
define how they interact together. That's exactly what is currently being done
with the pipe and the raw buffer. Doing so would also save the need for storing
payload in the structured message and void the requirement for the reserve. But
it would cost more memory to process POST data and server responses. Thus an
intermediary step consists in keeping this model in mind but not implementing
everything yet.

Short term proposal : a channel has access to a buffer and a pipe. A non-empty
buffer is either in structured message format OR raw stream format. Only the
channel knows. However a structured buffer MAY contain raw data in a properly
formatted way (using the envelope defined by the structured message format).

By default, when a demux writes to a CS rxbuf, it will try to use the lowest
possible level for what is being done (i.e. splice if possible, otherwise raw
stream, otherwise structured message). If the buffer already contains a
structured message, then this format is exclusive. From this point the MUX has
two options : either encode the incoming data to match the structured message
format, or refrain from receiving into the CS's rxbuf and wait until the upper
layer requests those data.

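The decision described above can be sketched as a small C function. All names
here (sketch_channel, sketch_demux_choose(), the FMT_* and HOW_* values) are
hypothetical, and the rule that incoming structured data wait for an empty
buffer is an assumption of this sketch rather than something decided above.

```c
#include <assert.h>
#include <stddef.h>

/* Format of what the channel's buffer currently holds. */
enum sketch_fmt { FMT_EMPTY, FMT_RAW, FMT_MSG };

/* What the demux decides to do with incoming data. */
enum sketch_how { HOW_SPLICE, HOW_RAW, HOW_MSG, HOW_WAIT };

/* Hypothetical view of a channel: one tagged buffer and maybe a pipe. */
struct sketch_channel {
    enum sketch_fmt buf_fmt;    /* only the channel knows the format */
    int             can_splice; /* non-zero if a pipe may be used */
};

/* Picks the lowest possible level for the incoming data:
 * splice, then raw stream, then structured message.
 */
static enum sketch_how sketch_demux_choose(const struct sketch_channel *chn,
                                           int data_is_raw, int can_encode)
{
    assert(chn);

    /* structured data (e.g. headers): assumed to need an empty buffer */
    if (!data_is_raw)
        return chn->buf_fmt == FMT_EMPTY ? HOW_MSG : HOW_WAIT;

    /* a structured message is already there: this format is exclusive,
     * so either wrap the raw data into it or refrain from receiving.
     */
    if (chn->buf_fmt == FMT_MSG)
        return can_encode ? HOW_MSG : HOW_WAIT;

    return chn->can_splice ? HOW_SPLICE : HOW_RAW;
}
```

HOW_WAIT stands for the second option of the mux: keep the data below and wait
until the upper layer asks for them.
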
This opens a simplified option which could be suited even for the long term :
  - cs_recv() will take one or two flags to indicate if a buffer already
    contains a structured message or not ; the upper layer knows it.

  - cs_recv() will take two flags to indicate what the upper layer is willing
    to take :
      - structured message only
      - raw stream only
      - any of them

    From this point the mux can decide to either pass anything or refrain from
    doing so.

  - the demux stores the knowledge it has from the contents into some CS flags
    to indicate whether or not some structured messages are still available,
    and whether or not some raw data are still available. Thus the caller knows
    whether or not extra data are available.

  - when the demux works on its own, it refrains from passing structured data
    to a non-empty buffer, unless these data are causing trouble to other
    streams (HOL).

  - when a demux has to encapsulate raw data into a structured message, it will
    always have to respect a configured reserve so that extra header processing
    can be done on the structured message inside the buffer, regardless of the
    supposed available room. In addition, the upper layer may indicate using an
    extra recv() flag whether it wants the demux to defragment serialized data
    (for example by moving trailing headers apart) or if it's not necessary.
    This flag will be set by the stream interface if compression is required or
    if the http-buffer-request option is set for example. Using to_forward==0
    is probably a stronger indication that the reserve must be respected.

  - cs_recv() and cs_send() when fed with a message, should not return byte
    counts but message counts (i.e. 0 or 1). This implies that a single call to
    either of these functions cannot mix raw data and structured messages at
    the same time.

At this point it looks like the conn_stream will have some encapsulation work
to do for the payload if it needs to be encapsulated into a message. This
further magnifies the importance of *not* decoding DATA frames into the CS's
rxbuf until really needed.

The CS will probably need to hold an indication of what is available at the mux
level, not only in the CS. Eg: we know that payload is still available.

Using these elements, it should be possible to ensure that full header frames
may be received without enforcing any reserve, that too large frames that do
not fit will be detected because they return 0 messages and indicate that such
a message is still pending, and that data availability is correctly detected
(later we may expect that the stream-interface allocates a larger or second
buffer to place the payload).

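A rough C sketch of how a receive call could report message counts and pending
indications. The WANT_* and PEND_* flags, the sketch_demux state and
sketch_rcv() are all made-up names, and the demux is reduced to "one pending
message of a given size followed by raw payload".

```c
#include <assert.h>
#include <stddef.h>

enum { WANT_MSG = 0x1, WANT_RAW = 0x2 }; /* what the upper layer accepts */
enum { PEND_MSG = 0x1, PEND_RAW = 0x2 }; /* what the demux still holds  */

/* Hypothetical demux state: one message, then raw payload after it. */
struct sketch_demux {
    size_t msg_size; /* size of the pending message, 0 if none */
    size_t raw_len;  /* raw bytes pending behind the message */
};

struct sketch_res {
    size_t       count;   /* messages (0 or 1) if is_msg, else bytes */
    int          is_msg;  /* non-zero when <count> counts messages */
    unsigned int pending; /* PEND_* flags describing what is left */
};

/* A single call never mixes a message and raw data. */
static struct sketch_res sketch_rcv(struct sketch_demux *dx, size_t room,
                                    unsigned int want)
{
    struct sketch_res r = { 0, 0, 0 };

    assert(dx);
    if (dx->msg_size) {
        /* the message goes first and is delivered whole or not at all */
        if ((want & WANT_MSG) && dx->msg_size <= room) {
            dx->msg_size = 0;
            r.count  = 1;
            r.is_msg = 1;
        }
    } else if ((want & WANT_RAW) && dx->raw_len) {
        r.count = dx->raw_len < room ? dx->raw_len : room;
        dx->raw_len -= r.count;
    }

    if (dx->msg_size)
        r.pending |= PEND_MSG;
    if (dx->raw_len)
        r.pending |= PEND_RAW;
    return r;
}
```

A count of 0 together with PEND_MSG is how a too large message gets detected:
the caller knows something is waiting and may come back with more room.
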
Regarding the ability for the channel to forward data, it looks like having a
new function "cs_xfer(src_cs, dst_cs, count)" could be very productive in
optimizing the forwarding to make use of splicing when available. It is not yet
totally clear whether it will split into "cs_xfer_in(src_cs, pipe, count)"
followed by "cs_xfer_out(dst_cs, pipe, count)" or anything different, and it
still needs to be studied. The general idea seems to be that the receiver might
have to call the sender directly once they agree on how to transfer data (pipe
or buffer). If the transfer is incomplete, the cs_xfer() return value and/or
flags will indicate the current situation (src empty, dst full, etc) so that
the caller may register for notifications on the appropriate event and wait to
be called again to continue.

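Purely as an illustration of the in/out split, here is a C sketch where both
ends and the pipe are reduced to byte counters. The sketch_* names, the extra
pipe and flags arguments and the XFER_* flags are assumptions, since the real
prototype is still to be studied.

```c
#include <assert.h>
#include <stddef.h>

enum { XFER_SRC_EMPTY = 0x1, XFER_DST_FULL = 0x2 };

/* Hypothetical endpoints and pipe, modelled by their fill level only. */
struct sketch_end  { size_t data, size; };
struct sketch_pipe { size_t data, size; };

static size_t sketch_min3(size_t a, size_t b, size_t c)
{
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Moves up to <count> bytes from <src> into the pipe. */
static size_t sketch_xfer_in(struct sketch_end *src, struct sketch_pipe *p,
                             size_t count)
{
    size_t len = sketch_min3(count, src->data, p->size - p->data);

    src->data -= len;
    p->data   += len;
    return len;
}

/* Moves up to <count> bytes from the pipe into <dst>. */
static size_t sketch_xfer_out(struct sketch_end *dst, struct sketch_pipe *p,
                              size_t count)
{
    size_t len = sketch_min3(count, p->data, dst->size - dst->data);

    p->data   -= len;
    dst->data += len;
    return len;
}

/* Forwards up to <count> bytes and reports why it stopped in <*flags>. */
static size_t sketch_xfer(struct sketch_end *src, struct sketch_end *dst,
                          struct sketch_pipe *p, size_t count,
                          unsigned int *flags)
{
    size_t done = 0, want, out;

    assert(src && dst && p && flags);
    while (done < count) {
        want = count - done;
        if (p->data < want)
            sketch_xfer_in(src, p, want - p->data);
        out = sketch_xfer_out(dst, p, want);
        if (!out)
            break;
        done += out;
    }

    *flags = 0;
    if (!src->data && !p->data)
        *flags |= XFER_SRC_EMPTY;
    if (dst->data == dst->size)
        *flags |= XFER_DST_FULL;
    return done;
}
```

On a short transfer the caller would look at the flags to decide whether to
wait for data on the source side or for room on the destination side.
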
Short term implementation :
  1) add new CS flags to qualify what the buffer contains and what we expect
     to read into it;

  2) set these flags to pretend we have a structured message when receiving
     headers (after all, H1 is an atomic header as well) and see what it
     implies for the code; for H1 it's unclear whether it makes sense to try
     to set it without the H1 mux.

  3) use these flags to refrain from sending DATA frames after HEADERS frames
     in H2.

  4) flush the flags at the stream interface layer when performing a cs_send().

  5) use the flags to enforce receipt of data only when necessary.

We should be able to end up with sequential receipt in H2 modelling what is
needed for other protocols without interfering with the native H1 devs.


2018-08-17 - Considerations after killing cs_recv()
---------------------------------------------------
(Willy Tarreau, commit f7e3955, 2018-08-17 09:58:29 +0200)

With the ongoing reorganisation of the I/O layers, it's visible that cs_recv()
will have to transfer data between the CS's rxbuf and the channel's buffer while
not being aware of the data format. Moreover, in case there's no data there, it
needs to recursively call the mux's rcv_buf() to trigger a decoding, while this
function is sometimes replaced with cs_recv(). All this shows that cs_recv() is
in fact needed while data are pushed upstream from the lower layers, and is not
suitable for the "pull" mode. Thus it was decided to remove this function and
put its code back into h2_rcv_buf(). The H1 mux's rcv_buf() already couldn't be
replaced with cs_recv() since it is the only one knowing about the buffer's
format.

This opportunity simplified something : if the CS's rxbuf is only read by the
mux's rcv_buf() method, then it doesn't need to be located in the CS and is
well placed in the mux's representation of the stream. This has an important
impact for H2 as it offers more freedom to the mux to allocate/free/reallocate
this buffer, and it ensures the mux always has access to it.

Furthermore, the conn_stream's txbuf experienced the same fate. Indeed, the H1
mux has already uncovered the difficulty related to the channel shutting down
on output, with data stuck in the CS's txbuf. Since the CS is tightly coupled
to the stream and the stream can close immediately once its buffers are empty,
it required a way to support orphaned CS with pending data in their txbuf. This
is something that the H2 mux already has to deal with, by carefully leaving the
data in the channel's buffer. But due to the snd_buf() call being top-down, it
is always possible to push the stream's data via the mux's snd_buf() call
without requiring a CS txbuf anymore. Thus the txbuf (when needed) is only
implemented in the mux and attached to the mux's representation of the stream,
and doing so allows the channel to be released immediately once the data are
safe in the mux's buffer.

This is an important change which clarifies the roles and responsibilities of
each layer in the chain : when receiving data from a mux, it's the mux's
responsibility to make sure it can correctly decode the incoming data and to
buffer the possible excess of data it cannot pass to the requester. This means
that decoding an H2 frame, which is not retryable since it has an impact on the
HPACK decompression context, and which cannot be reordered for the same reason,
simply needs to be performed into the H2 stream's rxbuf which will then be
passed to the stream when this one calls h2_rcv_buf(), even if it reads one
byte at a time. Similarly when calling h2_snd_buf(), it's the mux's
responsibility to read as much as it needs to be able to restart later,
possibly by buffering some data into a local buffer. And it's only once all the
output data has been consumed by snd_buf() that the stream is free to
disappear.

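The receive half of this split can be sketched in C as follows. The sketch_h2s
type and both functions are hypothetical simplifications (no real frame
parsing, no HPACK). They only show that decoding happens once into the mux's
per-stream rxbuf while the stream may read back any amount at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mux-side representation of a stream. */
struct sketch_h2s {
    char         rxbuf[64];
    size_t       head;     /* next byte to hand over to the stream */
    size_t       tail;     /* end of the decoded data */
    unsigned int decoded;  /* number of frames decoded so far */
};

/* Stand-in for frame decoding: runs exactly once per frame and stores the
 * result in the stream's rxbuf. Returns 0 without decoding anything when
 * the result would not fit, since a decode cannot be retried.
 */
static int sketch_decode_frame(struct sketch_h2s *s, const char *payload,
                               size_t len)
{
    assert(s && payload);
    if (len > sizeof(s->rxbuf) - s->tail)
        return 0;
    memcpy(s->rxbuf + s->tail, payload, len);
    s->tail += len;
    s->decoded++;
    return 1;
}

/* Stand-in for the mux's rcv_buf(): hands over up to <count> already
 * decoded bytes, never decoding again.
 */
static size_t sketch_rcv_buf(struct sketch_h2s *s, char *out, size_t count)
{
    size_t len;

    assert(s && out);
    len = s->tail - s->head;
    if (len > count)
        len = count;
    memcpy(out, s->rxbuf + s->head, len);
    s->head += len;
    if (s->head == s->tail)
        s->head = s->tail = 0; /* buffer drained, reuse it from the start */
    return len;
}
```

Reading one byte at a time then costs several calls but never touches the
decoder's state again.
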
This model presents the nice benefit of being infinitely stackable and solving
the last identified showstoppers to move towards a structured message internal
representation, as it will give full power to the rcv_buf() and snd_buf() to
process what they need.

For now the conn_stream's flags indicating whether a shutdown has been seen in
any direction or if an end of stream was seen will remain in the conn_stream,
though it's likely that some of them will move to the mux's representation of
the stream after structured messages are implemented.