| 2018-05-18 - Buffer rework |
| |
| 1. Summary |
| |
| The situation with the current buffer structure is becoming problematic in |
| the newly introduced muxes and causes serious difficulties preventing muxes |
| from being used on both sides, unless requiring that all code is duplicated |
| to use buf->i on the Rx path and buf->o on the Tx path. |
| |
| |
| 2. History |
| |
| A very long time ago, buffers were used to receive data using the recv() call, |
| to parse them, forward them, and send them over the network using the send() |
| call. Buffers were split into (buffer,channel) when some layers started to be |
| introduced, and were reorganized a few times to ease content processing and |
| rewriting. The current definition of the buffer structure is the following : |
| |
| struct buffer { |
| char *p; |
| uint size; |
| uint i; |
| uint o; |
| char data[0]; |
| }; |
| |
| data p |
| | | |
| V V |
| +-----------+--------------------+------------+-------------+ |
| | |////////////////////|////////////| | |
| +-----------+--------------------+------------+-------------+ |
| <---------------------------------------------------------> size |
| <------------------> <----------> |
| o i |
| |
| Pointer (p) is initialized to (data) when the buffer is empty. Data are |
| received after (p+i), increasing (i) by the number of bytes read. Data are sent |
| from (p-o) for up to (o) bytes, decreasing (o) by the number of bytes sent. |
| Forwarding data in the channel consists in advancing (p) by the number of bytes |
| to forward, increasing (o) and decreasing (i) by the same amount. |
| |
| This representation is convenient for channel operations because most of them |
| require to parse input data between (p) and (p+i), and to have a simple way to |
| forward data. Additionally, it's always easy to know if some data are scheduled |
| for departure (o), or if the buffer has some room available (size-i-o). |
| |
| |
| 3. Problems |
| |
| When applets were introduced, the initial code that was made to write data into |
| the output part was modified to send it into the input part since we had to |
| rely on the stream code to forward these data via the channel. This explains |
| the flood of bi_* functions that were introduced to perform the same operations |
| as the initial bo_*, to write into an input buffer from an applet. |
| |
| Health checks however continue to use output because checks do not use streams |
| nor channels. Thus the check buffers use buf->o for requests and buf->i for |
| responses. |
| |
| The introduction of muxes has changed this again by requiring that most request |
| code was able to write to buf->i, pretending to be the equivalent of a socket |
| recv() call. New bi_* functions had to be created to write headers and chunks |
| from the HTTP/2 mux. Conversely, it was made necessary to parse HTTP traffic |
| from buf->o while all the original code was made to parse this from buf->i. |
| |
| Furthermore, implementing an outgoing mux (eg: HTTP/2) will require to |
| duplicate a lot of the code to use buf->i instead of buf->o and conversely, |
| just because the mux will not be placed on the same side of the buffer. Not |
| only it complicates code maintenance but it also emphasizes the risk to use |
| the wrong function at any moment. |
| |
| From a performance perspective, applets have to suffer a useless copy most of |
| the time, only due to API limitatoins : it is not possible to write directly to |
| an input buffer, one has to write to a chunk and then copy it into a buffer. A |
| compatible structure could allow to share the same data between the chunk and |
| the buffer without having to perform an extra copy. |
| |
| |
| 4. Proposal |
| |
| In checks and muxes, it is obvious that a single "side" of the buffer is used, |
| and it generally is the one associated with the I/O to be performed. Only the |
| channel requires the distinction between (i) and (o). |
| |
| The proposal is to remove this distinction from the buffer and move ->o into |
| the channel. |
| |
| A buffer will then become only a linear (possibly wrapping) storage area with |
| a beginning, and an end. |
| |
| Given the experience gathered from past buffer API updates, we know that the |
| buffer's end is not as much important as its data length. This will give the |
| current representation : |
| |
| |
| struct buffer { |
| void *area; // start of the storage area |
| uint head; // start offset of remaining data relative to area |
| uint len; // contents length after head |
| uint size; // size of the storage area (wrapping point) |
| }; |
| |
| area |
| | |
| V |
| +-----------+---------------------------------+-------------+ |
| | |/////////////////////////////////| | |
| +-----------+---------------------------------+-------------+ |
| <---------------------------------------------------------> size |
| <---------> <-------------------------------> |
| head len |
| |
| The channel will contain an "output" field corresponding to the current buf->o, |
| indicating how many bytes of the current buffer are actually scheduled for |
| being forwarded and must not be considered anymore. It means that a stream |
| parser will simply start to parse from (buf->area + buf->head + chn->output) |
| and stop at (buf->area + buf->head + buf->len). |
| |
| For esnding data, the caller of cs_send() or whatever function will have to |
| pass the desired number of bytes to send, and one will not expect anymore that |
| all the buffer's contents have to be sent. In general the caller will have |
| access to chn->output if it needs to use this (typically from the stream |
| interface code at the moment). |
| |
| |
| 5. First implementation step |
| |
| The first step will consist in limiting the changes to the current buffers. The |
| buffer structure will still contain both a descriptor and the storage area. A |
| buffer will first be declared this way : |
| |
| struct buffer { |
| uint head; // start offset of remaining data relative to area |
| uint len; // contents length after head |
| uint size; // size of the storage area (wrapping point) |
| void area[0]; // start of the storage area |
| }; |
| |
| Thanks to this, no changes will have to be performed on memory management, and |
| buffers will continue to be allocated from a pool of size (sizeof(buffer) + |
| tune.bufsize). |
| |
| The following translations will have to be performed on the code : |
| - occurrences of (buf->i + buf->o) will have to be replaced with (buf->len) |
| - bi_ptr() -> ci_ptr() ; bi_end() -> b_head()+b_size() ; bi_del() -> b_del() |
| - bo_ptr() -> b_head() ; bo_end() -> co_end() |
| - b_adv() -> c_adv() ; b_rew() -> c_rew() |
| - buf->o will have to be replaced with either chn->output or a function |
| argument containing a copy of chn->output. These ones should cancel out |
| at the end of the operation. |
| - buf->i -> (b_len(buf) - chn->output) |
| |
| Temporary difficulties : |
| - compression makes use of both (i) and (o), by taking care of only touching |
| (i) and never (o). The filters know how not to touch (o), and the internal |
| compression API needs a small update so that this previous ->o value is |
| passed as an argument that the filter will collect from the channel. If it |
| is simpler (it probably isn't), a temporary "struct oldbuf" could be |
| created to emulate the old behaviour and be fed/used by the filters code. |
| |
| - buffer_slow_realign() distinguishes input data from output data so that the |
| output data is always placed at the end, leaving a clean contigous buffer |
| once forwarded. Instead, a "split" argument will have to be added so that |
| the caller may decide where to split the contents. Muxes will pass zero |
| here while channels will pass chn->output. |
| |
| |
| 6. Second implementation step |
| |
| The second step will consist in making "struct buffer" only hold the descriptor |
| and not the data anymore. It will then look like this : |
| |
| struct buffer { |
| void *area; // start of the storage area |
| uint head; // start offset of remaining data relative to area |
| uint len; // contents length after head |
| uint size; // size of the storage area (wrapping point) |
| }; |
| |
| Each buffer allocation will have to atomically allocate a struct buffer and an |
| area. Buffer copies will consist in exchanging the "struct buffer" contents |
| only. |
| |
| The chunk API must then be updated so that some new versions of chunk_putblk(), |
| chunk_printf() etc can write to a storage area, and so that bi_putchk() and |
| bo_putchk() instead can swap the storage areas when possible. |
| |
| At this point b_size() will be used to know where to release the allocated |
| area. The storage will simply consist in (start,len) which is perfectly suited |
| to have slabs. Just like chunks, b_size()==0 can be used to mention that no |
| free() must be done on the area. Doing so will make it much simpler to send |
| pre-formated messages (eg: error messages) into a buffer because such messages |
| will then be stored into a "struct ist" and sending such a message will be as |
| simple as doing : |
| |
| b->area = ist.str; |
| b->len = ist.len; |
| b->head = 0; |
| b->size = 0; |
| |
| The chunk struct can then be removed and replaced everywhere with a struct |
| buffer. Only the functions will remain, though they will likely have to be |
| renamed. Maybe the buffer manipulation functions will have to be split between |
| those which support wrapping and those which don't (chunks don't support |
| wrapping). |
| |
| The buf_empty structure will then disappear since a small 20-bytes structure |
| will be enough to represent an empty buffer. |
| |
| |
| 7. Third implementation step |
| |
| The third step will consist in placing a struct buffer into the struct channel. |
| This way no allocation is needed at all, and any storage can be used to deliver |
| contents. This allows to trivially upgrade a buffer on the fly by picking from |
| a different slab. It also allows to deliver error messages without ever having |
| to perform a buffer allocation. Doing so removes the need for the early buffer |
| allocation for the response in process_stream(), as it is only needed to have a |
| reliable place to send an error message to. This will ensure the buffer |
| allocator can be simplified and made more robust against the risk of deadlock |
| on memory shortage. |
| |
| |
| 8. Caveats |
| |
| The following points require extra care : |
| - there will be some subtracts to figure the buffer "input" length (formerly |
| buf->i). In the past it always used to be an unsigned value. Extreme care |
| will have to be taken to always use an inline function to compute this so |
| that it doesn't accidently become signed. |
| |
| - supporting buf->size==0 to point to a special string may require some extra |
| checks to avoid causing an integer underflow when calculating (size-len) or |
| (size-len-head) to figure the available room. |
| |
| - it is very likely that some further changes will be tempting to do in the |
| channel to better integrate the buffer (which becomes very similar to the |
| pipe), but we must not go as far as removing the visibility of the "struct |
| buffer" because it will be used as entry point for many functions. |
| |
| - it is likely that a number of the chunk_*, bi_* and bo_* variants have very |
| minor variations like return codes or error checking that can make their |
| replacement very tricky. Each set of such functions must be studied in |
| advance, and their users as well. |