Willy Tarreau | fa9f9cc | 2018-05-18 16:18:17 +0200 | [diff] [blame] | 1 | 2018-05-18 - Buffer rework |
| 2 | |
| 3 | 1. Summary |
| 4 | |
| 5 | The situation with the current buffer structure is becoming problematic in |
| 6 | the newly introduced muxes and causes serious difficulties preventing muxes |
| 7 | from being used on both sides, unless requiring that all code is duplicated |
| 8 | to use buf->i on the Rx path and buf->o on the Tx path. |
| 9 | |
| 10 | |
| 11 | 2. History |
| 12 | |
| 13 | A very long time ago, buffers were used to receive data using the recv() call, |
| 14 | to parse them, forward them, and send them over the network using the send() |
| 15 | call. Buffers were split into (buffer,channel) when some layers started to be |
| 16 | introduced, and were reorganized a few times to ease content processing and |
| 17 | rewriting. The current definition of the buffer structure is the following : |
| 18 | |
| 19 | struct buffer { |
| 20 | char *p; |
| 21 | uint size; |
| 22 | uint i; |
| 23 | uint o; |
| 24 | char data[0]; |
| 25 | }; |
| 26 | |
| 27 | data p |
| 28 | | | |
| 29 | V V |
| 30 | +-----------+--------------------+------------+-------------+ |
| 31 | | |////////////////////|////////////| | |
| 32 | +-----------+--------------------+------------+-------------+ |
| 33 | <---------------------------------------------------------> size |
| 34 | <------------------> <----------> |
| 35 | o i |
| 36 | |
| 37 | Pointer (p) is initialized to (data) when the buffer is empty. Data are |
| 38 | received after (p+i), increasing (i) by the number of bytes read. Data are sent |
| 39 | from (p-o) for up to (o) bytes, decreasing (o) by the number of bytes sent. |
| 40 | Forwarding data in the channel consists in advancing (p) by the number of bytes |
| 41 | to forward, increasing (o) and decreasing (i) by the same amount. |
| 42 | |
| 43 | This representation is convenient for channel operations because most of them |
| 44 | require to parse input data between (p) and (p+i), and to have a simple way to |
| 45 | forward data. Additionally, it's always easy to know if some data are scheduled |
| 46 | for departure (o), or if the buffer has some room available (size-i-o). |
| 47 | |
| 48 | |
| 49 | 3. Problems |
| 50 | |
| 51 | When applets were introduced, the initial code that was made to write data into |
| 52 | the output part was modified to send it into the input part since we had to |
| 53 | rely on the stream code to forward these data via the channel. This explains |
| 54 | the flood of bi_* functions that were introduced to perform the same operations |
| 55 | as the initial bo_*, to write into an input buffer from an applet. |
| 56 | |
| 57 | Health checks however continue to use output because checks do not use streams |
| 58 | nor channels. Thus the check buffers use buf->o for requests and buf->i for |
| 59 | responses. |
| 60 | |
| 61 | The introduction of muxes has changed this again by requiring that most request |
| 62 | code was able to write to buf->i, pretending to be the equivalent of a socket |
| 63 | recv() call. New bi_* functions had to be created to write headers and chunks |
| 64 | from the HTTP/2 mux. Conversely, it was made necessary to parse HTTP traffic |
| 65 | from buf->o while all the original code was made to parse this from buf->i. |
| 66 | |
| 67 | Furthermore, implementing an outgoing mux (eg: HTTP/2) will require to |
| 68 | duplicate a lot of the code to use buf->i instead of buf->o and conversely, |
| 69 | just because the mux will not be placed on the same side of the buffer. Not |
| 70 | only it complicates code maintenance but it also emphasizes the risk to use |
| 71 | the wrong function at any moment. |
| 72 | |
| 73 | From a performance perspective, applets have to suffer a useless copy most of |
| 74 | the time, only due to API limitatoins : it is not possible to write directly to |
| 75 | an input buffer, one has to write to a chunk and then copy it into a buffer. A |
| 76 | compatible structure could allow to share the same data between the chunk and |
| 77 | the buffer without having to perform an extra copy. |
| 78 | |
| 79 | |
| 80 | 4. Proposal |
| 81 | |
| 82 | In checks and muxes, it is obvious that a single "side" of the buffer is used, |
| 83 | and it generally is the one associated with the I/O to be performed. Only the |
| 84 | channel requires the distinction between (i) and (o). |
| 85 | |
| 86 | The proposal is to remove this distinction from the buffer and move ->o into |
| 87 | the channel. |
| 88 | |
| 89 | A buffer will then become only a linear (possibly wrapping) storage area with |
| 90 | a beginning, and an end. |
| 91 | |
| 92 | Given the experience gathered from past buffer API updates, we know that the |
| 93 | buffer's end is not as much important as its data length. This will give the |
| 94 | current representation : |
| 95 | |
| 96 | |
| 97 | struct buffer { |
| 98 | void *area; // start of the storage area |
| 99 | uint head; // start offset of remaining data relative to area |
| 100 | uint len; // contents length after head |
| 101 | uint size; // size of the storage area (wrapping point) |
| 102 | }; |
| 103 | |
| 104 | area |
| 105 | | |
| 106 | V |
| 107 | +-----------+---------------------------------+-------------+ |
| 108 | | |/////////////////////////////////| | |
| 109 | +-----------+---------------------------------+-------------+ |
| 110 | <---------------------------------------------------------> size |
| 111 | <---------> <-------------------------------> |
| 112 | head len |
| 113 | |
| 114 | The channel will contain an "output" field corresponding to the current buf->o, |
| 115 | indicating how many bytes of the current buffer are actually scheduled for |
| 116 | being forwarded and must not be considered anymore. It means that a stream |
| 117 | parser will simply start to parse from (buf->area + buf->head + chn->output) |
| 118 | and stop at (buf->area + buf->head + buf->len). |
| 119 | |
| 120 | For esnding data, the caller of cs_send() or whatever function will have to |
| 121 | pass the desired number of bytes to send, and one will not expect anymore that |
| 122 | all the buffer's contents have to be sent. In general the caller will have |
| 123 | access to chn->output if it needs to use this (typically from the stream |
| 124 | interface code at the moment). |
| 125 | |
| 126 | |
| 127 | 5. First implementation step |
| 128 | |
| 129 | The first step will consist in limiting the changes to the current buffers. The |
| 130 | buffer structure will still contain both a descriptor and the storage area. A |
| 131 | buffer will first be declared this way : |
| 132 | |
| 133 | struct buffer { |
| 134 | uint head; // start offset of remaining data relative to area |
| 135 | uint len; // contents length after head |
| 136 | uint size; // size of the storage area (wrapping point) |
| 137 | void area[0]; // start of the storage area |
| 138 | }; |
| 139 | |
| 140 | Thanks to this, no changes will have to be performed on memory management, and |
| 141 | buffers will continue to be allocated from a pool of size (sizeof(buffer) + |
| 142 | tune.bufsize). |
| 143 | |
| 144 | The following translations will have to be performed on the code : |
| 145 | - occurrences of (buf->i + buf->o) will have to be replaced with (buf->len) |
| 146 | - bi_ptr() -> ci_ptr() ; bi_end() -> b_head()+b_size() ; bi_del() -> b_del() |
| 147 | - bo_ptr() -> b_head() ; bo_end() -> co_end() |
| 148 | - b_adv() -> c_adv() ; b_rew() -> c_rew() |
| 149 | - buf->o will have to be replaced with either chn->output or a function |
| 150 | argument containing a copy of chn->output. These ones should cancel out |
| 151 | at the end of the operation. |
| 152 | - buf->i -> (b_len(buf) - chn->output) |
| 153 | |
| 154 | Temporary difficulties : |
| 155 | - compression makes use of both (i) and (o), by taking care of only touching |
| 156 | (i) and never (o). The filters know how not to touch (o), and the internal |
| 157 | compression API needs a small update so that this previous ->o value is |
| 158 | passed as an argument that the filter will collect from the channel. If it |
| 159 | is simpler (it probably isn't), a temporary "struct oldbuf" could be |
| 160 | created to emulate the old behaviour and be fed/used by the filters code. |
| 161 | |
| 162 | - buffer_slow_realign() distinguishes input data from output data so that the |
| 163 | output data is always placed at the end, leaving a clean contigous buffer |
| 164 | once forwarded. Instead, a "split" argument will have to be added so that |
| 165 | the caller may decide where to split the contents. Muxes will pass zero |
| 166 | here while channels will pass chn->output. |
| 167 | |
| 168 | |
| 169 | 6. Second implementation step |
| 170 | |
| 171 | The second step will consist in making "struct buffer" only hold the descriptor |
| 172 | and not the data anymore. It will then look like this : |
| 173 | |
| 174 | struct buffer { |
| 175 | void *area; // start of the storage area |
| 176 | uint head; // start offset of remaining data relative to area |
| 177 | uint len; // contents length after head |
| 178 | uint size; // size of the storage area (wrapping point) |
| 179 | }; |
| 180 | |
| 181 | Each buffer allocation will have to atomically allocate a struct buffer and an |
| 182 | area. Buffer copies will consist in exchanging the "struct buffer" contents |
| 183 | only. |
| 184 | |
| 185 | The chunk API must then be updated so that some new versions of chunk_putblk(), |
| 186 | chunk_printf() etc can write to a storage area, and so that bi_putchk() and |
| 187 | bo_putchk() instead can swap the storage areas when possible. |
| 188 | |
| 189 | At this point b_size() will be used to know where to release the allocated |
| 190 | area. The storage will simply consist in (start,len) which is perfectly suited |
| 191 | to have slabs. Just like chunks, b_size()==0 can be used to mention that no |
| 192 | free() must be done on the area. Doing so will make it much simpler to send |
| 193 | pre-formated messages (eg: error messages) into a buffer because such messages |
| 194 | will then be stored into a "struct ist" and sending such a message will be as |
| 195 | simple as doing : |
| 196 | |
| 197 | b->area = ist.str; |
| 198 | b->len = ist.len; |
| 199 | b->head = 0; |
| 200 | b->size = 0; |
| 201 | |
| 202 | The chunk struct can then be removed and replaced everywhere with a struct |
| 203 | buffer. Only the functions will remain, though they will likely have to be |
| 204 | renamed. Maybe the buffer manipulation functions will have to be split between |
| 205 | those which support wrapping and those which don't (chunks don't support |
| 206 | wrapping). |
| 207 | |
| 208 | The buf_empty structure will then disappear since a small 20-bytes structure |
| 209 | will be enough to represent an empty buffer. |
| 210 | |
| 211 | |
| 212 | 7. Third implementation step |
| 213 | |
| 214 | The third step will consist in placing a struct buffer into the struct channel. |
| 215 | This way no allocation is needed at all, and any storage can be used to deliver |
| 216 | contents. This allows to trivially upgrade a buffer on the fly by picking from |
| 217 | a different slab. It also allows to deliver error messages without ever having |
| 218 | to perform a buffer allocation. Doing so removes the need for the early buffer |
| 219 | allocation for the response in process_stream(), as it is only needed to have a |
| 220 | reliable place to send an error message to. This will ensure the buffer |
| 221 | allocator can be simplified and made more robust against the risk of deadlock |
| 222 | on memory shortage. |
| 223 | |
| 224 | |
| 225 | 8. Caveats |
| 226 | |
| 227 | The following points require extra care : |
| 228 | - there will be some subtracts to figure the buffer "input" length (formerly |
| 229 | buf->i). In the past it always used to be an unsigned value. Extreme care |
| 230 | will have to be taken to always use an inline function to compute this so |
| 231 | that it doesn't accidently become signed. |
| 232 | |
| 233 | - supporting buf->size==0 to point to a special string may require some extra |
| 234 | checks to avoid causing an integer underflow when calculating (size-len) or |
| 235 | (size-len-head) to figure the available room. |
| 236 | |
| 237 | - it is very likely that some further changes will be tempting to do in the |
| 238 | channel to better integrate the buffer (which becomes very similar to the |
| 239 | pipe), but we must not go as far as removing the visibility of the "struct |
| 240 | buffer" because it will be used as entry point for many functions. |
| 241 | |
| 242 | - it is likely that a number of the chunk_*, bi_* and bo_* variants have very |
| 243 | minor variations like return codes or error checking that can make their |
| 244 | replacement very tricky. Each set of such functions must be studied in |
| 245 | advance, and their users as well. |