| 2018-02-21 - Layering in haproxy 1.9 |
| ------------------------------------ |
| |
| 2 main zones : |
| - application : reads from conn_streams, writes to conn_streams, often uses |
| streams |
| |
| - connection : receives data from the network, presented into buffers |
| available via conn_streams, sends data to the network |
| |
| |
| The connection zone contains multiple layers which behave independently in each |
| direction. The Rx direction is activated upon callbacks from the lower layers. |
| The Tx direction is activated recursively from the upper layers. Between every |
| two layers there may be a buffer, in each direction. When a buffer is full |
| either in Tx or Rx direction, this direction is paused from the network layer |
| and the location where the congestion is encountered. Upon end of congestion |
| (cs_recv() from the upper layer, or sendto() at the lower layers), a |
| tasklet_wakeup() is performed on the blocked layer so that suspended operations |
| can be resumed. In this case, the Rx side restarts propagating data upwards |
| from the lowest blocked level, while the Tx side restarts propagating data |
| downwards from the highest blocked level. Proceeding like this ensures that |
| information known to the producer may always be used to tailor the buffer sizes |
| or decide on a strategy to best aggregate data. Additionally, each time a layer |
| is crossed without transformation, it becomes possible to send without copying. |
| |
| The Rx side notifies the application of data readiness using a wakeup or a |
| callback. The Tx side notifies the application of room availability once data |
| have been moved resulting in the uppermost buffer having some free space. |
| |
| When crossing a mux downwards, it is possible that the sender is not allowed to |
| access the buffer because it is not yet its turn. This is not a problem : the |
| data remain in the conn_stream's buffer (or the stream's one) and the transfer |
| will be restarted once the mux is ready to consume these data. |
| |
| |
| cs_recv() -------. cs_send() |
| ^ +--------> |||||| -------------+ ^ |
| | | -------' | | stream |
| --|----------|-------------------------------|-------|------------------- |
| | | V | connection |
| data .---. | | room |
| ready! |---| |---| available! |
| |---| |---| |
| |---| |---| |
| | | '---' |
| ^ +------------+-------+ | |
| | | ^ | / |
| / V | V / |
| / recvfrom() | sendto() | |
| -------------|----------------|--------------|--------------------------- |
| | | poll! V kernel |
| |
| |
| The cs_recv() function should act on pointers to buffer pointers, so that the |
| callee may decide to pass its own buffer directly by simply swapping pointers. |
| Similarly for cs_send() it is desirable to let the callee steal the buffer by |
| swapping the pointers. This way it remains possible to implement zero-copy |
| forwarding. |
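| |
| As an illustration only, the pointer swap described above could look like the |
| sketch below. The struct buf type and the buf_xfer() name are hypothetical and |
| much simpler than haproxy's real buffers; the fallback to a copy when the |
| destination is not empty is an assumption as well : |
| |

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical minimal buffer, not haproxy's real struct buffer. */
struct buf {
    char   *area;  /* storage area */
    size_t  size;  /* allocated size */
    size_t  data;  /* number of bytes currently stored */
};

/* Moves data from *src to *dst. When *dst is empty, the two buffer pointers
 * are simply swapped (zero copy). Otherwise as many bytes as fit are copied.
 * Returns the number of bytes transferred.
 */
static size_t buf_xfer(struct buf **dst, struct buf **src)
{
    struct buf *tmp;
    size_t len;

    assert(dst && *dst && src && *src);

    if ((*dst)->data == 0) {
        /* zero copy: the caller now owns the producer's buffer */
        len  = (*src)->data;
        tmp  = *dst;
        *dst = *src;
        *src = tmp;
        return len;
    }

    /* destination not empty: copy what fits after the existing data */
    len = (*dst)->size - (*dst)->data;
    if (len > (*src)->data)
        len = (*src)->data;

    memcpy((*dst)->area + (*dst)->data, (*src)->area, len);
    (*dst)->data += len;

    /* realign what remains in the source */
    memmove((*src)->area, (*src)->area + len, (*src)->data - len);
    (*src)->data -= len;
    return len;
}
```

| |
| Because the function receives pointers to buffer pointers, either side may end |
| up holding the other's buffer after the call, which is what makes the swap |
| possible without touching the data. |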
| |
| Some operation flags will be needed on cs_recv() : |
| - RECV_ZERO_COPY : refuse to merge new data into the current buffer if it |
| will result in a data copy (ie the buffer is not empty), unless no more |
| than XXX bytes have to be copied (eg: copying 2 cache lines may be cheaper |
| than waiting and playing with pointers) |
| |
| - RECV_AT_ONCE : only perform the operation if it will result in the source |
| buffer becoming empty at the end of the operation so that no two buffers |
| remain allocated at the end. It will most of the time result in either a |
| small read or a zero-copy operation. |
| |
| - RECV_PEEK : retrieve a copy of pending data without removing these data |
| from the source buffer. Maybe an alternate solution could consist in |
| finding the pointer to the source buffer and accessing these data directly, |
| except that it might be less interesting for the long term, thread-wise. |
| |
| - RECV_MIN : receive minimum X bytes (or less with a shutdown), or fail. |
| This should help various protocol parsers which need to receive a complete |
| frame before proceeding. |
| |
| - RECV_ENOUGH : no more data expected after this read if it's of the |
| requested size, thus no need to re-enable receiving on the lower layers. |
| |
| - RECV_ONE_SHOT : perform a single read without re-enabling reading on the |
| lower layers, like we currently do when receiving an HTTP/1 request. Like |
| RECV_ENOUGH where any size is enough. The two could probably be merged |
| (eg: by having a MIN argument like RECV_MIN). |
| |
| |
| Some operation flags will be needed on cs_send() : |
| - SEND_ZERO_COPY : refuse to merge the presented data with existing data and |
| prefer to wait for current data to leave and try again, unless the consumer |
| considers the amount of data acceptable for a copy. |
| |
| - SEND_AT_ONCE : only perform the operation if it will result in the source |
| buffer becoming empty at the end of the operation so that no two buffers |
| remain allocated at the end. It will most of the time result in either a |
| small write or a zero-copy operation. |
| |
| |
| Both operations should return a composite status : |
| - number of bytes transferred |
| - status flags (shutr, shutw, reset, empty, full, ...) |
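| |
| The flags and the composite status above could be sketched as follows. The |
| flag names come from the list above, but the bit values, the XFER_* status |
| names, struct xfer_result and recv_allowed() are all hypothetical. The copy |
| threshold left as "XXX" above is deliberately kept as a parameter (max_copy) : |
| |

```c
#include <assert.h>
#include <stddef.h>

/* Flag names come from the note; the bit values are arbitrary. */
enum recv_flags {
    RECV_ZERO_COPY = 1 << 0,
    RECV_AT_ONCE   = 1 << 1,
    RECV_PEEK      = 1 << 2,
    RECV_MIN       = 1 << 3,
    RECV_ENOUGH    = 1 << 4,
    RECV_ONE_SHOT  = 1 << 5,
};

/* Hypothetical status bits for "shutr, shutw, reset, empty, full". */
enum xfer_status {
    XFER_SHUTR = 1 << 0,
    XFER_SHUTW = 1 << 1,
    XFER_RESET = 1 << 2,
    XFER_EMPTY = 1 << 3,
    XFER_FULL  = 1 << 4,
};

/* Composite status returned by both operations. */
struct xfer_result {
    size_t       bytes;   /* number of bytes transferred */
    unsigned int status;  /* OR of enum xfer_status bits */
};

/* Tells whether a receive may proceed.
 *   present  : bytes already in the destination buffer
 *   room     : free space left in the destination buffer
 *   pending  : bytes waiting in the source buffer
 *   max_copy : largest copy still considered cheaper than waiting
 * Returns 1 if the operation may be performed, otherwise 0.
 */
static int recv_allowed(unsigned int flags, size_t present, size_t room,
                        size_t pending, size_t max_copy)
{
    /* An empty destination can always take the source by swapping. */
    if (present == 0)
        return 1;

    /* RECV_AT_ONCE: the source must be fully drained by this call. */
    if ((flags & RECV_AT_ONCE) && pending > room)
        return 0;

    /* RECV_ZERO_COPY: refuse a merge unless the copy is small enough. */
    if ((flags & RECV_ZERO_COPY) && pending > max_copy)
        return 0;

    return 1;
}
```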
| |
| |
| 2018-07-23 - Update after merging rxbuf |
| --------------------------------------- |
| |
| It becomes visible that the mux will not always be welcome to decode incoming |
| data because it will sometimes imply extra memory copies and/or usage for no |
| benefit. |
| |
| Ideally, when a stream is instantiated based on incoming data, these incoming |
| data should be passed and the upper layers called, but it should then be up to |
| these upper layers to peek more data in certain circumstances. Typically |
| if the pending connection data are larger than what is expected to be passed |
| above, it means some data may cause head-of-line blocking (HOL) to other |
| streams, and needs to be pushed up through the layers to let other streams |
| continue to work. Similarly very large H2 data frames after header frames |
| should probably not be passed as they may require copies that could be avoided |
| if passed later. However if the decoded frame fits into the conn_stream's |
| buffer, there is an opportunity to use a single buffer for the conn_stream |
| and the channel. The H2 demux could set a blocking flag indicating it's waiting |
| for the upper stream to take over demuxing. This flag would be purged once the |
| upper stream would start reading, or when extra data come and change the |
| conditions. |
| |
| Forcing structured headers and raw data to coexist within a single buffer is |
| quite challenging for many code parts. For example it's perfectly possible to |
| see a fragmented buffer containing series of headers, then a small data chunk |
| that was received at the same time, then a few other headers added by request |
| processing, then another data block received afterwards, then possibly yet |
| another header added by option http-send-name-header, and yet another data |
| block. This causes some pain for compression which still needs to know where |
| compressed and uncompressed data start/stop. It also makes it very difficult |
| to account the exact bytes to pass through the various layers. |
| |
| One solution consists in thinking about buffers using 3 representations : |
| |
| - a structured message, which is used for the internal HTTP representation. |
| This message may only be atomically processed. It has no clear byte count, |
| it's a message. |
| |
| - a raw stream, consisting of sequences of bytes. That's typically what |
| happens in data sequences or in tunnel mode. |
| |
| - a pipe, which contains data to be forwarded, and that haproxy cannot have |
| access to. |
| |
| Processing efficiency decreases from the pipe up to the structured message, |
| while capabilities increase. The structured message can contain anything |
| including serialized data blocks to be processed or forwarded. The raw stream |
| contains data blocks to be processed or forwarded. The pipe only contains data |
| blocks to be forwarded. The latter ones are only an optimization of the former |
| ones. |
| |
| Thus ideally a channel should have access to all such 3 storage areas at once, |
| depending on the use case : |
| (1) a structured message, |
| (2) a raw stream, |
| (3) a pipe |
| |
| Right now a channel only has (2) and (3) but after the native HTTP rework, it |
| will only have (1) and (3). Placing a raw stream exclusively in (1) comes with |
| some performance drawbacks which are not easily recovered, and with some quite |
| difficult management still involving the reserve to ensure that a data block |
| doesn't prevent headers from being appended. But during header processing, the |
| payload may be necessary so we cannot decide to drop this option. |
| |
| A long-term approach would consist in ensuring that a single channel may have |
| access to all 3 representations at once, and to enumerate priority rules to |
| define how they interact together. That's exactly what is currently being done |
| with the pipe and the raw buffer right now. Doing so would also save the need |
| for storing payload in the structured message and void the requirement for the |
| reserve. But it would cost more memory to process POST data and server |
| responses. Thus an intermediary step consists in keeping this model in mind but |
| not implementing everything yet. |
| |
| Short term proposal : a channel has access to a buffer and a pipe. A non-empty |
| buffer is either in structured message format OR raw stream format. Only the |
| channel knows. However a structured buffer MAY contain raw data in a properly |
| formatted way (using the envelope defined by the structured message format). |
| |
| By default, when a demux writes to a CS rxbuf, it will try to use the lowest |
| possible level for what is being done (i.e. splice if possible, otherwise raw |
| stream, otherwise structured message). If the buffer already contains a |
| structured message, then this format is exclusive. From this point the MUX has |
| two options : either encode the incoming data to match the structured message |
| format, or refrain from receiving into the CS's rxbuf and wait until the upper |
| layer requests those data. |
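| |
| The selection rule above may be sketched as a small decision function. All |
| names below (buf_fmt, rx_mode, demux_pick_mode) are hypothetical. Making |
| non-payload data wait when the buffer already holds a raw stream is an |
| assumption derived from the "either structured OR raw" rule, not something |
| stated explicitly : |
| |

```c
#include <assert.h>

/* What the CS rxbuf currently holds (hypothetical names). */
enum buf_fmt { FMT_EMPTY, FMT_RAW, FMT_MSG };

/* How the demux delivers the next block (hypothetical names). */
enum rx_mode { RX_WAIT, RX_SPLICE, RX_RAW, RX_MSG };

/* Picks the lowest possible level for the next block.
 *   cur        : current format of the rxbuf
 *   is_payload : non-zero for payload, zero for headers
 *   can_splice : non-zero if the data may go through a pipe
 *   may_encode : non-zero if the mux accepts to wrap the data into the
 *                structured message format
 */
static enum rx_mode demux_pick_mode(enum buf_fmt cur, int is_payload,
                                    int can_splice, int may_encode)
{
    assert(cur == FMT_EMPTY || cur == FMT_RAW || cur == FMT_MSG);

    /* A structured message is exclusive: encode into it or refrain. */
    if (cur == FMT_MSG)
        return may_encode ? RX_MSG : RX_WAIT;

    /* Headers only exist as structured messages (assumption: they
     * cannot be appended to a buffer already holding raw data).
     */
    if (!is_payload)
        return (cur == FMT_EMPTY) ? RX_MSG : RX_WAIT;

    /* Payload: splice if possible, otherwise raw stream. */
    return can_splice ? RX_SPLICE : RX_RAW;
}
```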
| |
| This opens a simplified option which could be suited even for the long term : |
| - cs_recv() will take one or two flags to indicate if a buffer already |
| contains a structured message or not ; the upper layer knows it. |
| |
| - cs_recv() will take two flags to indicate what the upper layer is willing |
| to take : |
| - structured message only |
| - raw stream only |
| - any of them |
| |
| From this point the mux can decide to either pass anything or refrain from |
| doing so. |
| |
| - the demux stores the knowledge it has from the contents into some CS flags |
| to indicate whether or not some structured messages are still available, and |
| whether or not some raw data are still available. Thus the caller knows |
| whether or not extra data are available. |
| |
| - when the demux works on its own, it refrains from passing structured data |
| to a non-empty buffer, unless these data are causing trouble to other |
| streams (HOL). |
| |
| - when a demux has to encapsulate raw data into a structured message, it will |
| always have to respect a configured reserve so that extra header processing |
| can be done on the structured message inside the buffer, regardless of the |
| supposed available room. In addition, the upper layer may indicate using an |
| extra recv() flag whether it wants the demux to defragment serialized data |
| (for example by moving trailing headers apart) or if it's not necessary. |
| This flag will be set by the stream interface if compression is required or |
| if the http-buffer-request option is set for example. Using to_forward==0 is |
| probably a stronger indication that the reserve must be respected. |
| |
| - cs_recv() and cs_send() when fed with a message, should not return byte |
| counts but message counts (i.e. 0 or 1). This implies that a single call to |
| either of these functions cannot mix raw data and structured messages at |
| the same time. |
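| |
| A rough model of the points above, assuming hypothetical names throughout |
| (WANT_*, AVAIL_*, struct demux, struct rx_result, demux_recv). It only tracks |
| sizes, not real data, and delivering the pending message before the raw data |
| is an assumption : |
| |

```c
#include <assert.h>
#include <stddef.h>

/* What the upper layer is willing to take (hypothetical flags). */
enum { WANT_MSG = 1, WANT_RAW = 2, WANT_ANY = 3 };

/* What the demux still holds after the call (hypothetical flags). */
enum { AVAIL_MSG = 1, AVAIL_RAW = 2 };

/* Size-only model of what is pending in the demux. */
struct demux {
    size_t msg_len;  /* size of the pending structured message, 0 if none */
    size_t raw_len;  /* pending raw bytes */
};

struct rx_result {
    int          is_msg;  /* 1: count is in messages, 0: count is in bytes */
    size_t       count;   /* 0 or 1 message, or a number of bytes */
    unsigned int avail;   /* OR of AVAIL_* still pending in the demux */
};

/* Delivers either one whole message or some raw bytes, never both. */
static struct rx_result demux_recv(struct demux *d, unsigned int want,
                                   size_t room)
{
    struct rx_result r = { 0, 0, 0 };

    assert(d);

    if (d->msg_len) {
        /* A message is atomic: it is passed whole or not at all. */
        r.is_msg = 1;
        if ((want & WANT_MSG) && d->msg_len <= room) {
            d->msg_len = 0;
            r.count = 1;
        }
    } else if (d->raw_len && (want & WANT_RAW)) {
        r.count = (d->raw_len < room) ? d->raw_len : room;
        d->raw_len -= r.count;
    }

    /* Report what remains so that the caller knows what to ask for. */
    if (d->msg_len)
        r.avail |= AVAIL_MSG;
    if (d->raw_len)
        r.avail |= AVAIL_RAW;
    return r;
}
```

| |
| In this model a message that does not fit returns 0 with AVAIL_MSG still set, |
| which lets the caller tell "too large for now" apart from "nothing to read". |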
| |
| At this point it looks like the conn_stream will have some encapsulation work |
| to do for the payload if it needs to be encapsulated into a message. This |
| further magnifies the importance of *not* decoding DATA frames into the CS's |
| rxbuf until really needed. |
| |
| The CS will probably need to hold indication of what is available at the mux |
| level, not only in the CS. Eg: we know that payload is still available. |
| |
| Using these elements, it should be possible to ensure that full header frames |
| may be received without enforcing any reserve, that too large frames that do |
| not fit will be detected because they return 0 messages and indicate that such |
| a message is still pending, and that data availability is correctly detected |
| (later we may expect that the stream-interface allocates a larger or second |
| buffer to place the payload). |
| |
| Regarding the ability for the channel to forward data, it looks like having a |
| new function "cs_xfer(src_cs, dst_cs, count)" could be very productive in |
| optimizing the forwarding to make use of splicing when available. It is not yet |
| totally clear whether it will split into "cs_xfer_in(src_cs, pipe, count)" |
| followed by "cs_xfer_out(dst_cs, pipe, count)" or anything different, and it |
| still needs to be studied. The general idea seems to be that the receiver might |
| have to call the sender directly once they agree on how to transfer data (pipe |
| or buffer). If the transfer is incomplete, the cs_xfer() return value and/or |
| flags will indicate the current situation (src empty, dst full, etc) so that |
| the caller may register for notifications on the appropriate event and wait to |
| be called again to continue. |
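| |
| Since the exact shape of cs_xfer() is still open, the following is only a |
| counter-based model of its return value and flags. The struct endpoint type, |
| the XFER_SRC_EMPTY / XFER_DST_FULL names and the xfer_sketch() signature are |
| all assumptions made for the example : |
| |

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reasons for an incomplete transfer. */
enum { XFER_SRC_EMPTY = 1, XFER_DST_FULL = 2 };

/* Size-only model of one side of the transfer. */
struct endpoint {
    size_t pending;  /* bytes waiting to be read from this side */
    size_t room;     /* bytes this side can still accept */
};

/* Transfers up to <count> bytes from <src> to <dst>. Returns the number of
 * bytes moved. When fewer than <count> bytes were moved, *flags tells which
 * side the caller should wait for before calling again.
 */
static size_t xfer_sketch(struct endpoint *src, struct endpoint *dst,
                          size_t count, unsigned int *flags)
{
    size_t n = count;

    assert(src && dst && flags);

    if (n > src->pending)
        n = src->pending;
    if (n > dst->room)
        n = dst->room;

    src->pending -= n;
    dst->room    -= n;

    *flags = 0;
    if (n < count) {
        if (src->pending == 0)
            *flags |= XFER_SRC_EMPTY;
        if (dst->room == 0)
            *flags |= XFER_DST_FULL;
    }
    return n;
}
```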
| |
| Short term implementation : |
| 1) add new CS flags to qualify what the buffer contains and what we expect |
| to read into it; |
| |
| 2) set these flags to pretend we have a structured message when receiving |
| headers (after all, H1 is an atomic header as well) and see what it |
| implies for the code; for H1 it's unclear whether it makes sense to try |
| to set it without the H1 mux. |
| |
| 3) use these flags to refrain from sending DATA frames after HEADERS frames |
| in H2. |
| |
| 4) flush the flags at the stream interface layer when performing a cs_send(). |
| |
| 5) use the flags to enforce receipt of data only when necessary |
| |
| We should be able to end up with sequential receipt in H2 modelling what is |
| needed for other protocols without interfering with the native H1 devs. |
| |
| |
| 2018-08-17 - Considerations after killing cs_recv() |
| --------------------------------------------------- |
| |
| With the ongoing reorganisation of the I/O layers, it's visible that cs_recv() |
| will have to transfer data between the cs' rxbuf and the channel's buffer while |
| not being aware of the data format. Moreover, in case there's no data there, it |
| needs to recursively call the mux's rcv_buf() to trigger a decoding, while this |
| function is sometimes replaced with cs_recv(). All this shows that cs_recv() is |
| in fact needed while data are pushed upstream from the lower layers, and is not |
| suitable for the "pull" mode. Thus it was decided to remove this function and |
| put its code back into h2_rcv_buf(). The H1 mux's rcv_buf() already couldn't be |
| replaced with cs_recv() since it is the only one knowing about the buffer's |
| format. |
| |
| This opportunity simplified something : if the cs's rxbuf is only read by the |
| mux's rcv_buf() method, then it doesn't need to be located into the CS and is |
| well placed into the mux's representation of the stream. This has an important |
| impact for H2 as it offers more freedom to the mux to allocate/free/reallocate |
| this buffer, and it ensures the mux always has access to it. |
| |
| Furthermore, the conn_stream's txbuf experienced the same fate. Indeed, the H1 |
| mux has already uncovered the difficulty related to the channel shutting down |
| on output, with data stuck into the CS's txbuf. Since the CS is tightly coupled |
| to the stream and the stream can close immediately once its buffers are empty, |
| it required a way to support orphaned CS with pending data in their txbuf. This |
| is something that the H2 mux already has to deal with, by carefully leaving the |
| data in the channel's buffer. But due to the snd_buf() call being top-down, it |
| is always possible to push the stream's data via the mux's snd_buf() call |
| without requiring a CS txbuf anymore. Thus the txbuf (when needed) is only |
| implemented in the mux and attached to the mux's representation of the stream, |
| and doing so allows the channel to be released immediately once the data are |
| safe in the mux's buffer. |
| |
| This is an important change which clarifies the roles and responsibilities of |
| each layer in the chain : when receiving data from a mux, it's the mux's |
| responsibility to make sure it can correctly decode the incoming data and to |
| buffer the possible excess of data it cannot pass to the requester. This means |
| that decoding an H2 frame, which is not retryable since it has an impact on the |
| HPACK decompression context, and which cannot be reordered for the same reason, |
| simply needs to be performed to the H2 stream's rxbuf which will then be passed |
| to the stream when this one calls h2_rcv_buf(), even if it reads one byte at a |
| time. Similarly when calling h2_snd_buf(), it's the mux's responsibility to |
| read as much as it needs to be able to restart later, possibly by buffering |
| some data into a local buffer. And it's only once all the output data has been |
| consumed by snd_buf() that the stream is free to disappear. |
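| |
| The receive side of this model can be illustrated with a toy mux-side stream |
| that owns its rxbuf. Names (struct mux_stream, mux_demux, mux_rcv_buf) and the |
| fixed 64-byte buffer are hypothetical, and no real H2 or HPACK decoding is |
| performed; the point is only that decoding happens once into the mux's buffer |
| while the reader may then pull at any granularity : |
| |

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mux-side representation of a stream owning its rxbuf. */
struct mux_stream {
    char   rxbuf[64];
    size_t head;  /* offset of the first unread byte */
    size_t len;   /* number of unread bytes */
};

/* Stores one decoded payload into the stream's rxbuf. The room is checked
 * first so that the (non-retryable) decoding is only done when its output
 * is guaranteed to fit. Returns 1 if stored, 0 if the caller must wait.
 */
static int mux_demux(struct mux_stream *s, const char *payload, size_t len)
{
    assert(s && payload);

    if (s->len == 0)
        s->head = 0;  /* buffer empty: restart from the beginning */

    if (len > sizeof(s->rxbuf) - s->head - s->len)
        return 0;

    memcpy(s->rxbuf + s->head + s->len, payload, len);
    s->len += len;
    return 1;
}

/* Called by the upper layer; copies up to <count> bytes into <out> and
 * returns the number of bytes delivered.
 */
static size_t mux_rcv_buf(struct mux_stream *s, char *out, size_t count)
{
    assert(s && out);

    if (count > s->len)
        count = s->len;

    memcpy(out, s->rxbuf + s->head, count);
    s->head += count;
    s->len  -= count;
    return count;
}
```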
| |
| This model presents the nice benefit of being infinitely stackable and solving |
| the last identified showstoppers to move towards a structured message internal |
| representation, as it will give full power to the rcv_buf() and snd_buf() to |
| process what they need. |
| |
| For now the conn_stream's flags indicating whether a shutdown has been seen in |
| any direction or if an end of stream was seen will remain in the conn_stream, |
| though it's likely that some of them will move to the mux's representation of |
| the stream after structured messages are implemented. |