DOC: add design thoughts on HTTP/2

This is doc/design-thoughts/http2.txt.
diff --git a/doc/design-thoughts/http2.txt b/doc/design-thoughts/http2.txt
new file mode 100644
index 0000000..eadaaeb
--- /dev/null
+++ b/doc/design-thoughts/http2.txt
@@ -0,0 +1,277 @@
+2014/10/23 - design thoughts for HTTP/2
+
+- connections : HTTP/2 depends a lot more on a connection than HTTP/1 because a
+  connection holds a compression context (headers table, etc...). We probably
+  need to have an h2_conn struct.
+
+- multiple transactions will be handled in parallel for a given h2_conn. They
+  are called streams in HTTP/2 terminology.
+
+- multiplexing : for a given client-side h2 connection, we can have multiple
+  server-side h2 connections. And for a server-side h2 connection, we can have
+  multiple client-side h2 connections. Streams circulate in N-to-N fashion.
+
+- flow control : flow control will be applied between multiple streams. Special
+  care must be taken so that an H2 client cannot block some H2 servers by
+  sending requests spread over multiple servers to the point where one server
+  response is blocked and prevents other responses from the same server from
+  reaching their clients. H2 connection buffers must always be empty or nearly
+  empty. The per-stream flow control needs to be respected as well as the
+  connection's buffers. It is important to implement some fairness between all
+  the streams so that it's not always the same one that gets the bandwidth
+  when the connection is congested.
+
+- some clients can be H1 with an H2 server (is this really needed ?). Most of
+  the initial use case will be H2 clients to H1 servers. It is important to keep
+  in mind that H1 servers do not do flow control and that we don't want them to
+  block transfers (eg: post upload).
+
+- internal tasks : some H2 clients will be internal tasks (eg: health checks).
+  Some H2 servers will be internal tasks (eg: stats, cache). The model must be
+  compatible with this use case.
+
+- header indexing : headers are transported compressed, with a reference to a
+  static or a dynamic header, or a literal, possibly huffman-encoded. Indexing
+  is specific to the H2 connection. This means there is no way any binary data
+  can flow between both sides, headers will have to be decoded according to the
+  incoming connection's context and re-encoded according to the outgoing
+  connection's context, which can significantly differ. In order to avoid the
+  parsing trouble we currently face, headers will have to be clearly split
+  between name and value. It is worth noting that neither the incoming nor the
+  outgoing connections' contexts will be of any use while processing the
+  headers. At best we can have some shortcuts for well-known names that map
+  well to the static ones (eg: use the first static entry with same name), and
+  maybe have a few special cases for static name+value as well. Probably we can
+  classify headers in such categories :
+
+    - static name + value
+    - static name + other value
+    - dynamic name + other value
+
+  This will allow for better processing in some specific cases. Headers
+  supporting a single value (:method, :status, :path, ...) should probably
+  be stored in a single location with a direct access. That would allow us
+  to retrieve a method using hdr[METHOD]. All such indexing must be performed
+  while parsing. That also means that HTTP/1 will have to be converted to this
+  representation very early in the parser and possibly converted back to H/1
+  after processing.
+
+  Header names/values will have to be placed in a small memory area that will
+  inevitably get fragmented as headers are rewritten. An automatic packing
+  mechanism must be implemented so that when there's no more room, headers are
+  simply defragmented/packed to a new table and the old one is released. Just
+  like for the static chunks, we need to have a few such tables pre-allocated
+  and ready to be swapped at any moment. Repacking must not change any index
+  nor affect the way headers are compressed so that it can happen late after a
+  retry (send-name-header for example).
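+
+  As an illustration only, the direct-access idea can be sketched as follows.
+  This is a minimal Python sketch, not a proposed implementation: the Slot
+  enum, the HeaderList class and its methods are hypothetical names, and only
+  a handful of single-valued pseudo-headers are given a slot.
+
+```python
+from enum import IntEnum
+
+
+class Slot(IntEnum):
+    """Hypothetical direct-access slots for single-valued headers."""
+    METHOD = 0
+    SCHEME = 1
+    AUTHORITY = 2
+    PATH = 3
+    STATUS = 4
+
+
+# Maps pseudo-header names (":method", ":path", ...) to their slot.
+SLOT_BY_NAME = {":" + s.name.lower(): s for s in Slot}
+
+
+class HeaderList:
+    """Headers split into (name, value); single-valued ones get a slot."""
+
+    def __init__(self):
+        self.slots = [None] * len(Slot)   # one cell per single-valued header
+        self.others = []                  # ordered (name, value) pairs
+
+    def add(self, name, value):
+        # Indexing is performed while parsing, one header at a time.
+        name = name.lower()
+        slot = SLOT_BY_NAME.get(name)
+        if slot is None:
+            self.others.append((name, value))
+        else:
+            self.slots[slot] = value
+
+    def __getitem__(self, slot):
+        return self.slots[slot]
+
+
+hdr = HeaderList()
+hdr.add(":method", "GET")
+hdr.add(":path", "/index.html")
+hdr.add("accept", "*/*")
+```
+
+  With such a layout, hdr[Slot.METHOD] is a plain array access, and all the
+  other headers keep their arrival order in a separate list.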
+
+- header processing : can still happen on a (header, value) basis. Reqrep/
+  rsprep completely disappear and will have to be replaced with something else
+  to support renaming headers and rewriting url/path/...
+
+- push_promise : servers can push dummy requests+responses. They advertise
+  the promised stream ID in a push_promise frame sent on the associated stream.
+  This means that it is possible to initiate a client-server stream from the
+  information coming from the server and make the data flow as if the client
+  had made it. It's likely that we'll have to support two types of server
+  connections: those which support push and those which do not. That way client
+  streams will be distributed to existing server connections based on their
+  capabilities. It's important to keep in mind that PUSH will not be rewritten
+  in responses.
+
+- stream ID mapping : since the stream ID is per H2 connection, stream IDs will
+  have to be mapped. Thus a given stream is an entity with two IDs (one per
+  side). Or more precisely a stream has two end points, each one carrying an ID
+  when it ends on an HTTP2 connection. Also, for each stream ID we need to
+  quickly find the associated transaction in progress. Using a small quick
+  unique tree seems indicated considering the wide range of valid values.
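+
+  A rough sketch of this mapping is given below, under loud assumptions: the
+  Stream and H2Conn classes are hypothetical stand-ins, and a Python dict
+  replaces the lookup tree mentioned above purely to keep the example short.
+
+```python
+class Stream:
+    """A stream with two end points, each carrying its own ID."""
+
+    def __init__(self):
+        self.ids = {}   # connection object -> stream ID on that side
+
+
+class H2Conn:
+    """Hypothetical stand-in for h2_conn; a dict replaces the tree."""
+
+    def __init__(self, first_id):
+        self.next_id = first_id   # next ID we may allocate on this side
+        self.streams = {}         # stream ID -> Stream
+
+    def attach(self, stream, sid=None):
+        # Use the peer-chosen ID if given, else allocate a local one.
+        if sid is None:
+            sid = self.next_id
+            self.next_id += 2     # IDs of one initiator share a parity
+        self.streams[sid] = stream
+        stream.ids[self] = sid
+        return sid
+
+    def lookup(self, sid):
+        return self.streams.get(sid)
+
+
+front = H2Conn(first_id=2)   # we are the server here: our IDs are even
+back = H2Conn(first_id=1)    # we are the client here: our IDs are odd
+s = Stream()
+front.attach(s, sid=5)       # ID chosen by the client
+back_id = back.attach(s)     # ID we allocate towards the server
+```
+
+  The same Stream object is then reachable from either connection using that
+  connection's own stream ID.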
+
+- frame sizes : frames have to be remapped between both sides as multiplexed
+  connections won't always have the same characteristics. Thus some frames
+  might be spliced and others will be sliced.
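+
+  The slicing case can be illustrated with the small sketch below. The
+  slice_payload() helper is a hypothetical name, and the example only deals
+  with a contiguous payload, ignoring frame headers and flow control.
+
+```python
+def slice_payload(payload, max_frame_size):
+    """Split a payload into chunks that fit the outgoing side."""
+    if max_frame_size <= 0:
+        raise ValueError("max_frame_size must be positive")
+    return [payload[i:i + max_frame_size]
+            for i in range(0, len(payload), max_frame_size)]
+
+
+# A 40000-byte payload going to a side limited to 16384-byte frames.
+chunks = slice_payload(b"x" * 40000, 16384)
+```
+
+  Here the payload ends up in two full chunks and a shorter last one.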
+
+- error processing : care must be taken to never break a connection unless it
+  is dead or corrupt at the protocol level. Stats counters must exist to observe
+  the causes. Timeouts are a major problem because silent connections might
+  die out of inactivity. Ping frames should probably be scheduled a few seconds
+  before the connection timeout so that an unused connection is verified before
+  being killed. Abnormal requests must be dealt with using RST_STREAM.
+
+- ALPN : ALPN must be observed on the client side, and transmitted to the
+  server side.
+
+- proxy protocol : proxy protocol makes little to no sense in a multiplexed
+  protocol. A per-stream equivalent will surely be needed if implementations
+  do not quickly generalize the use of the "Forwarded" header.
+
+- simplified protocol for local devices (eg: haproxy->varnish in clear and
+  without handshake, and possibly even with splicing if the connection's
+  settings are shared)
+
+- logging : logging must report some extra information such as the
+  stream ID, and whether the transaction was initiated by the client or by the
+  server (which can be deduced from the stream ID's parity). In case of push,
+  the number of the associated stream must also be reported.
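+
+  The parity rule is simple enough to be sketched. In HTTP/2, streams opened
+  by the client carry odd IDs and those opened by the server carry even IDs.
+  The function and field names below are illustrative assumptions only.
+
+```python
+def stream_initiator(stream_id):
+    """Return who opened the stream, deduced from the ID's parity."""
+    if stream_id <= 0:
+        raise ValueError("stream 0 designates the connection itself")
+    return "client" if stream_id % 2 else "server"
+
+
+def log_fields(stream_id, associated_id=None):
+    # Field names ("sid", "init", "assoc") are hypothetical.
+    fields = {"sid": stream_id, "init": stream_initiator(stream_id)}
+    if associated_id is not None:   # only meaningful for pushed streams
+        fields["assoc"] = associated_id
+    return fields
+```
+
+  For a pushed stream, log_fields(4, associated_id=1) would thus report both
+  the pushed stream and the client stream it relates to.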
+
+- memory usage : H2 increases memory usage by mandating support for frames of
+  at least 16384 bytes. That means slightly more than 16kB of buffer in each
+  direction to process any frame. It will definitely have an impact on the
+  deployed maxconn setting in places using less than this (4..8kB are common).
+  Also, the header list is persistent per connection, so if we reach the same
+  size as the request, that's another 16kB in each direction, resulting in
+  about 48kB of memory where 8 were previously used. A more careful encoder
+  can work with a much smaller set even if that implies evicting entries
+  between multiple headers of the same message.
+
+- HTTP/1.0 should be transported over H2 very carefully. Since there's no way
+  to pass version information in the protocol, the server could use some
+  features of HTTP/1.1 that are unsafe in HTTP/1.0 (compression, trailers,
+  ...).
+
+- host / :authority : ":authority" is the norm, and "host" will be absent when
+  H2 clients generate :authority. This probably means that a dummy Host header
+  will have to be produced internally from :authority and removed when passing
+  to H2 behind. This can cause some trouble when passing H2 requests to H1
+  proxies, because there's no way to know if the request should contain scheme
+  and authority in H1 or not based on the H2 request. Thus a "proxy" option
+  will have to be explicitly mentioned on HTTP/1 server lines. One of the
+  problems this creates is that it's no longer possible to pass H/1 requests
+  to H/1 proxies without an explicit configuration. Maybe a table of the
+  various combinations is needed.
+
+                           :scheme   :authority   host
+       HTTP/2 request      present   present      absent
+       HTTP/1 server req   absent    absent       present
+       HTTP/1 proxy req    present   present      present
+
+  So in the end the issue is only with H/2 requests passed to H/1 proxies.
+
+- ping frames : they don't indicate any stream ID so by definition they cannot
+  be forwarded to any server. The H2 connection should deal with them only.
+
+There's a layering problem with H2. The framing layer has to be aware of the
+upper layer semantics. We can't simply re-encode HTTP/1 to HTTP/2 then pass
+it over a framing layer to mux the streams; the frame type must be passed below
+so that frames are properly arranged. Header encoding is connection-based and
+all streams using the same connection will interact in the way their headers
+are encoded. Thus the encoder *has* to be placed in the h2_conn entity, and
+this entity has to know for each stream what its headers are.
+
+We should probably remove *all* headers from transported data and move
+them on the fly to a parallel structure that can be shared between H1 and H2
+and consumed at the appropriate level. That means buffers only transport data.
+Trailers have to be dealt with differently.
+
+So if we consider an H1 request being forwarded between a client and a server,
+it would look approximately like this :
+
+  - request header + body land into a stream's receive buffer
+  - headers are indexed and stripped out so that only the body and whatever
+    follows remain in the buffer
+  - both the header index and the buffer with the body stay attached to the
+    stream
+  - the sender can rebuild the whole headers. Since they're found in a table
+    supposed to be stable, it can rebuild them as many times as desired and
+    will always get the same result, so it's safe to build them into the trash
+    buffer for immediate sending, just as we do for the PROXY protocol.
+  - the upper protocol should probably provide a build_hdr() callback which
+    when called by the socket layer, builds this header block based on the
+    current stream's header list, ready to be sent.
+  - the socket layer has to know how many bytes from the headers are left to be
+    forwarded prior to processing the body.
+  - the socket layer needs to consume only the acceptable part of the body and
+    must not release the buffer if any data remains in it (eg: pipelining over
+    H1). This is already handled by channel->o and channel->to_forward.
+  - we could possibly have another optional callback to send a preamble before
+    data, that could be used to send chunk sizes in H1. The danger is that it
+    absolutely needs to be stable if it has to be retried. But it could
+    considerably simplify de-chunking.
+
+When the request is sent to an H2 server, an H2 stream request must be made
+to the server: we look for an existing connection whose settings are compatible
+with our needs (eg: tls/clear, push/no-push), and with a spare stream ID. If
+none is found, a new connection must be established, unless maxconn is reached.
+
+Servers must have a maxstream setting just like they have a maxconn. The same
+queue may be used for that.
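+
+The selection logic of the last two paragraphs could look roughly like the
+sketch below. It is only a sketch under stated assumptions: ServerConn and
+pick_conn() are hypothetical names, "a spare stream ID" is approximated by a
+count of active streams below maxstream, and the value 100 is arbitrary.
+
+```python
+class ServerConn:
+    """Hypothetical server-side H2 connection descriptor."""
+
+    def __init__(self, tls, push, maxstream):
+        self.tls, self.push = tls, push
+        self.maxstream = maxstream
+        self.active = 0            # streams currently in use
+
+
+def pick_conn(conns, tls, push, maxconn):
+    """Return a connection for a new stream, or None to queue it."""
+    for c in conns:
+        if c.tls == tls and c.push == push and c.active < c.maxstream:
+            c.active += 1
+            return c
+    if len(conns) >= maxconn:
+        return None                # maxconn reached: caller queues
+    c = ServerConn(tls, push, maxstream=100)   # arbitrary maxstream
+    c.active = 1
+    conns.append(c)
+    return c
+
+
+conns = []
+a = pick_conn(conns, tls=True, push=False, maxconn=1)    # new connection
+b = pick_conn(conns, tls=True, push=False, maxconn=1)    # reuses it
+c = pick_conn(conns, tls=False, push=False, maxconn=1)   # must be queued
+```
+
+The third request has incompatible settings and maxconn is already reached,
+so it gets no connection and would have to wait in the queue.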
+
+The "tcp-request content" ruleset must apply to the TCP layer. But with HTTP/2
+that becomes impossible (and useless). We still need something like the
+"tcp-request session" hook to apply just after the SSL handshake is done.
+
+It is impossible to defragment the body on the fly in HTTP/2. Since multiple
+messages are interleaved, we cannot wait for all of them and block the head of
+line. Thus if body analysis is required, it will have to use the stream's
+buffer, which necessarily implies a copy. That means that with each H2 end we
+necessarily have at least one copy. Sometimes we might be able to "splice" some
+bytes from one side to the other without copying into the stream buffer (same
+rules as for TCP splicing).
+
+In theory, only data should flow through the channel buffer, so each side's
+connector is responsible for encoding data (H1: linear/chunks, H2: frames).
+Maybe the same mechanism could be extrapolated to tunnels / TCP.
+
+Since we'd use buffers only for data (and for receipt of headers), we need to
+have dynamic buffer allocation.
+
+Thus :
+- Tx buffers do not exist. We allocate a buffer on the fly when we're ready to
+  send something that we need to build and that needs to be persistent in case
+  of partial send. H1 headers are built on the fly from the header table to a
+  temporary buffer that is immediately sent and whose amount of sent bytes is
+  the only information kept (like for PROXY protocol). H2 headers are more
+  complex since the encoding depends on what was successfully sent. Thus we
+  need to build them and put them into a temporary buffer that remains
+  persistent in case send() fails. It is possible to have a limited pool of
+  Tx buffers and refrain from sending if there is no more buffer available in
+  the pool. In that case we need a wake-up mechanism once a buffer is
+  available. Once the data are sent, the Tx buffer is then immediately recycled
+  in its pool. Note that no tx buffer being used (eg: for hdr or control) means
+  that we have to be able to serialize access to the connection and retry with
+  the same stream. It also means that a stream that times out while waiting for
+  the connector to read the second half of its request has to stay there, or at
+  least needs to be handled gracefully. However if the connector cannot read
+  the data to be sent, it means that the buffer is congested and the connection
+  is dead, so that probably means it can be killed.
+
+- Rx buffers have to be pre-allocated just before calling recv(). A connection
+  will first try to pick a buffer and disable reception if it fails, then
+  subscribe to the list of tasks waiting for an Rx buffer.
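+
+  A minimal sketch of such a pool follows. The RxPool class and its get() and
+  put() methods are hypothetical, connections are represented by plain
+  strings, and the actual wake-up of the waiting task is left to the caller.
+
+```python
+from collections import deque
+
+
+class RxPool:
+    """Fixed pool of Rx buffers with a FIFO of waiting connections."""
+
+    def __init__(self, count, size=16384):
+        self.free = [bytearray(size) for _ in range(count)]
+        self.waiters = deque()
+
+    def get(self, conn):
+        # Called just before recv(). On failure the caller is expected
+        # to disable reception; the connection is queued as a waiter.
+        if self.free:
+            return self.free.pop()
+        if conn not in self.waiters:
+            self.waiters.append(conn)
+        return None
+
+    def put(self, buf):
+        # Release a buffer and return the connection to wake up, if any.
+        self.free.append(buf)
+        return self.waiters.popleft() if self.waiters else None
+
+
+pool = RxPool(count=1)
+buf = pool.get("conn-a")
+```
+
+  A second connection calling get() at this point receives nothing and is
+  queued, and it is the one returned by put() once the buffer is released.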
+
+- full Rx buffers might sometimes be moved around to the next buffer instead of
+  experiencing a copy. That means that channels and connectors must use the
+  same format of buffer, and that only the channel will have to see its
+  pointers adjusted.
+
+- Tx of data should be made as much as possible without copying. That possibly
+  means by directly looking into the connection buffer on the other side if
+  the local Tx buffer does not exist and the stream buffer is not allocated, or
+  even performing a splice() call between the two sides. One of the problems in
+  doing this is that it requires proper ordering of the operations (eg: when
+  multiple readers are attached to one buffer). If the splitting occurs upon
+  receipt, there's no problem. If we expect to retrieve data directly from the
+  original buffer, it's harder since it contains various things in an order
+  which does not even indicate what belongs to whom. Thus possibly the only
+  mechanism to implement is the buffer permutation which guarantees zero-copy
+  and only in the 100% safe case. Also it's atomic and does not cause HOL
+  blocking.
+
+It makes sense to choose the frontend_accept() function right after the
+handshake has ended. It is then possible to check the ALPN, the SNI, the ciphers
+and to switch to the h2_conn_accept handler only if everything is OK.
+The h2_conn_accept handler will have to deal with the connection setup,
+initialization of the header table, exchange of the settings frames and
+preparing whatever is needed to fire new streams upon receipt of unknown
+stream IDs. Note: most of the time it will not be possible to splice() because
+we need to know in advance the amount of bytes to write the header, and here it
+will not be possible.
+
+H2 health checks must be seen as regular transactions/streams. The check runs a
+normal client which seeks an available stream from a server. The server then
+finds one on an existing connection or initiates a new H2 connection. The H2
+checks will have to be configurable for sharing streams or not. Another option
+could be to specify how many requests can be made over existing connections
+before insisting on getting a separate connection. Note that such separate
+connections might end up stacking up once released. So they probably need
+to be recycled very quickly (eg: fix how many unused ones can exist max).
+