DOC: add some description of the pending rework of the buffer structure

The "struct buffer" needs to be reworked, this new doc lists the changes
and steps to do this.
diff --git a/doc/internals/buffer-rework.txt b/doc/internals/buffer-rework.txt
new file mode 100644
index 0000000..136377f
--- /dev/null
+++ b/doc/internals/buffer-rework.txt
@@ -0,0 +1,245 @@
+2018-05-18 - Buffer rework
+
+1. Summary
+
+The situation with the current buffer structure is becoming problematic in
+the newly introduced muxes and causes serious difficulties preventing muxes
+from being used on both sides, unless requiring that all code is duplicated
+to use buf->i on the Rx path and buf->o on the Tx path.
+
+
+2. History
+
+A very long time ago, buffers were used to receive data using the recv() call,
+to parse them, forward them, and send them over the network using the send()
+call. Buffers were split into (buffer,channel) when some layers started to be
+introduced, and were reorganized a few times to ease content processing and
+rewriting. The current definition of the buffer structure is the following :
+
+    struct buffer {
+        char *p;
+        uint size;
+        uint i;
+        uint o;
+        char data[0];
+    };
+
+    data                               p
+      |                                |
+      V                                V
+      +-----------+--------------------+------------+-------------+
+      |           |////////////////////|////////////|             |
+      +-----------+--------------------+------------+-------------+
+       <---------------------------------------------------------> size
+                   <------------------> <---------->
+                            o                i
+
+Pointer (p) is initialized to (data) when the buffer is empty. Data are
+received after (p+i), increasing (i) by the number of bytes read. Data are sent
+from (p-o) for up to (o) bytes, decreasing (o) by the number of bytes sent.
+Forwarding data in the channel consists in advancing (p) by the number of bytes
+to forward, increasing (o) and decreasing (i) by the same amount.
+
+This representation is convenient for channel operations because most of them
+require to parse input data between (p) and (p+i), and to have a simple way to
+forward data. Additionally, it's always easy to know if some data are scheduled
+for departure (o), or if the buffer has some room available (size-i-o).
+
+
+3. Problems
+
+When applets were introduced, the initial code that was made to write data into
+the output part was modified to send it into the input part since we had to
+rely on the stream code to forward these data via the channel. This explains
+the flood of bi_* functions that were introduced to perform the same operations
+as the initial bo_*, to write into an input buffer from an applet.
+
+Health checks however continue to use output because checks do not use streams
+nor channels. Thus the check buffers use buf->o for requests and buf->i for
+responses.
+
+The introduction of muxes has changed this again by requiring that most request
+code was able to write to buf->i, pretending to be the equivalent of a socket
+recv() call. New bi_* functions had to be created to write headers and chunks
+from the HTTP/2 mux. Conversely, it was made necessary to parse HTTP traffic
+from buf->o while all the original code was made to parse this from buf->i.
+
+Furthermore, implementing an outgoing mux (eg: HTTP/2) will require to
+duplicate a lot of the code to use buf->i instead of buf->o and conversely,
+just because the mux will not be placed on the same side of the buffer. Not
+only it complicates code maintenance but it also emphasizes the risk to use
+the wrong function at any moment.
+
+From a performance perspective, applets have to suffer a useless copy most of
+the time, only due to API limitatoins : it is not possible to write directly to
+an input buffer, one has to write to a chunk and then copy it into a buffer. A
+compatible structure could allow to share the same data between the chunk and
+the buffer without having to perform an extra copy.
+
+
+4. Proposal
+
+In checks and muxes, it is obvious that a single "side" of the buffer is used,
+and it generally is the one associated with the I/O to be performed. Only the
+channel requires the distinction between (i) and (o).
+
+The proposal is to remove this distinction from the buffer and move ->o into
+the channel.
+
+A buffer will then become only a linear (possibly wrapping) storage area with
+a beginning, and an end.
+
+Given the experience gathered from past buffer API updates, we know that the
+buffer's end is not as much important as its data length. This will give the
+current representation :
+
+
+    struct buffer {
+        void *area;    // start of the storage area
+        uint head;     // start offset of remaining data relative to area
+        uint len;      // contents length after head
+        uint size;     // size of the storage area (wrapping point)
+    };
+
+    area
+      |
+      V
+      +-----------+---------------------------------+-------------+
+      |           |/////////////////////////////////|             |
+      +-----------+---------------------------------+-------------+
+       <---------------------------------------------------------> size
+       <---------> <------------------------------->
+           head                 len
+
+The channel will contain an "output" field corresponding to the current buf->o,
+indicating how many bytes of the current buffer are actually scheduled for
+being forwarded and must not be considered anymore. It means that a stream
+parser will simply start to parse from (buf->area + buf->head + chn->output)
+and stop at (buf->area + buf->head + buf->len).
+
+For esnding data, the caller of cs_send() or whatever function will have to
+pass the desired number of bytes to send, and one will not expect anymore that
+all the buffer's contents have to be sent. In general the caller will have
+access to chn->output if it needs to use this (typically from the stream
+interface code at the moment).
+
+
+5. First implementation step
+
+The first step will consist in limiting the changes to the current buffers. The
+buffer structure will still contain both a descriptor and the storage area. A
+buffer will first be declared this way :
+
+    struct buffer {
+        uint head;     // start offset of remaining data relative to area
+        uint len;      // contents length after head
+        uint size;     // size of the storage area (wrapping point)
+        void area[0];  // start of the storage area
+    };
+
+Thanks to this, no changes will have to be performed on memory management, and
+buffers will continue to be allocated from a pool of size (sizeof(buffer) +
+tune.bufsize).
+
+The following translations will have to be performed on the code :
+  - occurrences of (buf->i + buf->o) will have to be replaced with (buf->len)
+  - bi_ptr() -> ci_ptr() ; bi_end() -> b_head()+b_size() ; bi_del() -> b_del()
+  - bo_ptr() -> b_head() ; bo_end() -> co_end()
+  - b_adv() -> c_adv() ; b_rew() -> c_rew()
+  - buf->o will have to be replaced with either chn->output or a function
+    argument containing a copy of chn->output. These ones should cancel out
+    at the end of the operation.
+  - buf->i -> (b_len(buf) - chn->output)
+
+Temporary difficulties :
+  - compression makes use of both (i) and (o), by taking care of only touching
+    (i) and never (o). The filters know how not to touch (o), and the internal
+    compression API needs a small update so that this previous ->o value is
+    passed as an argument that the filter will collect from the channel. If it
+    is simpler (it probably isn't), a temporary "struct oldbuf" could be
+    created to emulate the old behaviour and be fed/used by the filters code.
+
+  - buffer_slow_realign() distinguishes input data from output data so that the
+    output data is always placed at the end, leaving a clean contigous buffer
+    once forwarded. Instead, a "split" argument will have to be added so that
+    the caller may decide where to split the contents. Muxes will pass zero
+    here while channels will pass chn->output.
+
+
+6. Second implementation step
+
+The second step will consist in making "struct buffer" only hold the descriptor
+and not the data anymore. It will then look like this :
+
+    struct buffer {
+        void *area;    // start of the storage area
+        uint head;     // start offset of remaining data relative to area
+        uint len;      // contents length after head
+        uint size;     // size of the storage area (wrapping point)
+    };
+
+Each buffer allocation will have to atomically allocate a struct buffer and an
+area. Buffer copies will consist in exchanging the "struct buffer" contents
+only.
+
+The chunk API must then be updated so that some new versions of chunk_putblk(),
+chunk_printf() etc can write to a storage area, and so that bi_putchk() and
+bo_putchk() instead can swap the storage areas when possible.
+
+At this point b_size() will be used to know where to release the allocated
+area. The storage will simply consist in (start,len) which is perfectly suited
+to have slabs. Just like chunks, b_size()==0 can be used to mention that no
+free() must be done on the area. Doing so will make it much simpler to send
+pre-formated messages (eg: error messages) into a buffer because such messages
+will then be stored into a "struct ist" and sending such a message will be as
+simple as doing :
+
+      b->area = ist.str;
+      b->len  = ist.len;
+      b->head = 0;
+      b->size = 0;
+
+The chunk struct can then be removed and replaced everywhere with a struct
+buffer. Only the functions will remain, though they will likely have to be
+renamed. Maybe the buffer manipulation functions will have to be split between
+those which support wrapping and those which don't (chunks don't support
+wrapping).
+
+The buf_empty structure will then disappear since a small 20-bytes structure
+will be enough to represent an empty buffer.
+
+
+7. Third implementation step
+
+The third step will consist in placing a struct buffer into the struct channel.
+This way no allocation is needed at all, and any storage can be used to deliver
+contents. This allows to trivially upgrade a buffer on the fly by picking from
+a different slab. It also allows to deliver error messages without ever having
+to perform a buffer allocation. Doing so removes the need for the early buffer
+allocation for the response in process_stream(), as it is only needed to have a
+reliable place to send an error message to. This will ensure the buffer
+allocator can be simplified and made more robust against the risk of deadlock
+on memory shortage.
+
+
+8. Caveats
+
+The following points require extra care :
+  - there will be some subtracts to figure the buffer "input" length (formerly
+    buf->i). In the past it always used to be an unsigned value. Extreme care
+    will have to be taken to always use an inline function to compute this so
+    that it doesn't accidently become signed.
+
+  - supporting buf->size==0 to point to a special string may require some extra
+    checks to avoid causing an integer underflow when calculating (size-len) or
+    (size-len-head) to figure the available room.
+
+  - it is very likely that some further changes will be tempting to do in the
+    channel to better integrate the buffer (which becomes very similar to the
+    pipe), but we must not go as far as removing the visibility of the "struct
+    buffer" because it will be used as entry point for many functions.
+
+  - it is likely that a number of the chunk_*, bi_* and bo_* variants have very
+    minor variations like return codes or error checking that can make their
+    replacement very tricky. Each set of such functions must be studied in
+    advance, and their users as well.