2014/10/23 - design thoughts for HTTP/2 (Willy Tarreau)

- connections : HTTP/2 depends a lot more on a connection than HTTP/1 because
  a connection holds a compression context (headers table, etc.). We probably
  need to have an h2_conn struct.

- multiple transactions will be handled in parallel for a given h2_conn. They
  are called streams in HTTP/2 terminology.

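As a rough illustration, the two entities could look like this (a hypothetical
sketch only; all field names are invented for the example):

```c
#include <stdint.h>

/* Hypothetical sketch: one h2_conn per HTTP/2 connection, holding the
 * compression context shared by all of its streams. Names illustrative.
 */
struct h2_conn {
    uint32_t last_stream_id;   /* highest stream ID seen so far */
    int32_t  conn_window;      /* connection-level flow control window */
    /* HPACK dynamic table, settings, list of active streams, ... */
};

struct h2_stream {
    struct h2_conn *conn;      /* the connection this stream runs on */
    uint32_t id;               /* per-connection stream identifier */
    int32_t  window;           /* per-stream flow control window */
};
```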
- multiplexing : for a given client-side h2 connection, we can have multiple
  server-side h2 connections. And for a server-side h2 connection, we can
  have multiple client-side h2 connections. Streams circulate in an N-to-N
  fashion.

- flow control : flow control will be applied between multiple streams.
  Special care must be taken so that an H2 client cannot block some H2
  servers by sending requests spread over multiple servers to the point
  where one server response is blocked and prevents other responses from the
  same server from reaching their clients. H2 connection buffers must always
  be empty or nearly empty. The per-stream flow control needs to be respected
  as well as the connection's buffers. It is important to implement some
  fairness between all the streams so that it's not always the same stream
  that gets the bandwidth when the connection is congested.

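The two-level accounting implied above can be sketched as follows (a minimal
model, not HAProxy code; names are made up):

```c
#include <stdint.h>

/* Minimal two-level flow control model: a DATA frame may only carry as
 * many bytes as both the stream window and the connection window allow.
 */
struct fc_window { int32_t conn_wnd, strm_wnd; };

/* Return how many bytes of <len> may be sent now. */
static int32_t fc_sendable(const struct fc_window *w, int32_t len)
{
    int32_t max = w->conn_wnd < w->strm_wnd ? w->conn_wnd : w->strm_wnd;
    if (max < 0)
        max = 0;
    return len < max ? len : max;
}

/* Account for <len> bytes actually sent on this stream. */
static void fc_consume(struct fc_window *w, int32_t len)
{
    w->conn_wnd -= len;
    w->strm_wnd -= len;
}
```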
- some clients can be H1 with an H2 server (is this really needed?). Most of
  the initial use case will be H2 clients to H1 servers. It is important to
  keep in mind that H1 servers do not do flow control and that we don't want
  them to block transfers (eg: post upload).

- internal tasks : some H2 clients will be internal tasks (eg: health
  checks). Some H2 servers will be internal tasks (eg: stats, cache). The
  model must be compatible with this use case.

- header indexing : headers are transported compressed, with a reference to a
  static or a dynamic header, or a literal, possibly Huffman-encoded.
  Indexing is specific to the H2 connection. This means there is no way any
  binary data can flow between both sides; headers will have to be decoded
  according to the incoming connection's context and re-encoded according to
  the outgoing connection's context, which can significantly differ. In order
  to avoid the parsing trouble we currently face, headers will have to be
  clearly split between name and value. It is worth noting that neither the
  incoming nor the outgoing connections' contexts will be of any use while
  processing the headers. At best we can have some shortcuts for well-known
  names that map well to the static ones (eg: use the first static entry with
  the same name), and maybe have a few special cases for static name+value as
  well. We can probably classify headers into these categories :

    - static name + value
    - static name + other value
    - dynamic name + other value

  This will allow for better processing in some specific cases. Headers
  supporting a single value (:method, :status, :path, ...) should probably
  be stored in a single location with direct access. That would allow us
  to retrieve a method using hdr[METHOD]. All such indexing must be performed
  while parsing. That also means that HTTP/1 will have to be converted to
  this representation very early in the parser and possibly converted back
  to H/1 after processing.

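The hdr[METHOD] direct-access idea could look like this (an illustrative
sketch; the enum and struct names are invented):

```c
#include <string.h>

/* Hypothetical direct-access slots for single-valued (pseudo-)headers,
 * filled in while parsing so that lookups cost a simple array access.
 */
enum h2_wkhdr { HDR_METHOD, HDR_SCHEME, HDR_AUTHORITY, HDR_PATH, HDR_STATUS,
                HDR_ENTRIES };

struct h2_hdr_idx {
    const char *wk[HDR_ENTRIES];   /* well-known single-valued headers */
};

static const char *h2_get_method(const struct h2_hdr_idx *idx)
{
    return idx->wk[HDR_METHOD];    /* direct access, no list walk */
}
```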
  Header names/values will have to be placed in a small memory area that will
  inevitably get fragmented as headers are rewritten. An automatic packing
  mechanism must be implemented so that when there's no more room, headers
  are simply defragmented/packed into a new table and the old one is
  released. Just like for the static chunks, we need to have a few such
  tables pre-allocated and ready to be swapped at any moment. Repacking must
  not change any index nor affect the way headers are compressed so that it
  can happen late after a retry (send-name-header for example).

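The index-preserving repack can be sketched like this (a toy model under
assumed sizes; each slot keeps its position in the entry array so that the
indexes headers refer to never change, only the backing storage is compacted):

```c
#include <string.h>
#include <stddef.h>

#define NB_SLOTS 8
#define AREA_SZ  256

struct hdr_tbl {
    char area[AREA_SZ];          /* fragmented storage for name/value data */
    size_t used;                 /* bytes consumed in <area> */
    char *slot[NB_SLOTS];        /* slot i -> data, NULL when unused */
};

/* Copy live entries into <dst>, keeping the same slot indexes. */
static void hdr_tbl_repack(struct hdr_tbl *dst, const struct hdr_tbl *src)
{
    memset(dst, 0, sizeof(*dst));
    for (int i = 0; i < NB_SLOTS; i++) {
        if (!src->slot[i])
            continue;
        size_t len = strlen(src->slot[i]) + 1;
        memcpy(dst->area + dst->used, src->slot[i], len);
        dst->slot[i] = dst->area + dst->used;
        dst->used += len;
    }
}
```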
- header processing : can still happen on a (header, value) basis. Reqrep/
  rsprep completely disappear and will have to be replaced with something
  else to support renaming headers and rewriting the url/path/...

- push_promise : servers can push dummy requests+responses. They advertise
  the associated stream ID in the PUSH_PROMISE frame. This means that it is
  possible to initiate a client-server stream from the information coming
  from the server and make the data flow as if the client had initiated it.
  It's likely that we'll have to support two types of server connections:
  those which support push and those which do not, so that client streams
  can be distributed to existing server connections based on their
  capabilities. It's important to keep in mind that PUSH will not be
  rewritten in responses.

- stream ID mapping : since the stream ID is per H2 connection, stream IDs
  will have to be mapped. Thus a given stream is an entity with two IDs (one
  per side). Or more precisely, a stream has two endpoints, each one carrying
  an ID when it ends on an HTTP/2 connection. Also, for each stream ID we
  need to quickly find the associated transaction in progress. Using a small,
  fast lookup tree seems indicated considering the wide range of valid
  values.

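A toy illustration of the per-connection ID -> stream lookup (HAProxy would
use one of its own tree structures; this plain unbalanced BST only shows the
idea):

```c
#include <stdint.h>
#include <stdlib.h>

struct strm_node {
    uint32_t id;                   /* stream ID on this connection */
    void *stream;                  /* the transaction in progress */
    struct strm_node *l, *r;
};

static struct strm_node *strm_insert(struct strm_node *root, uint32_t id,
                                     void *stream)
{
    if (!root) {
        struct strm_node *n = calloc(1, sizeof(*n));
        n->id = id;
        n->stream = stream;
        return n;
    }
    if (id < root->id)
        root->l = strm_insert(root->l, id, stream);
    else if (id > root->id)
        root->r = strm_insert(root->r, id, stream);
    return root;
}

static void *strm_lookup(const struct strm_node *root, uint32_t id)
{
    while (root && root->id != id)
        root = id < root->id ? root->l : root->r;
    return root ? root->stream : NULL;
}
```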
- frame sizes : frames have to be remapped between both sides as multiplexed
  connections won't always have the same characteristics. Thus some frames
  might be spliced and others will be sliced.

- error processing : care must be taken to never break a connection unless
  it is dead or corrupt at the protocol level. Stats counters must exist to
  observe the causes. Timeouts are a great problem because silent
  connections might die of inactivity. Ping frames should probably be
  scheduled a few seconds before the connection timeout so that an unused
  connection is verified before being killed. Abnormal requests must be
  dealt with using RST_STREAM.

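The "ping a few seconds before the timeout" idea boils down to a small date
computation (purely illustrative, times in seconds, margin is an assumption):

```c
#include <stdint.h>

#define PING_MARGIN 3   /* assumed margin before the timeout fires */

/* Given the last activity date and the configured connection timeout,
 * compute when a PING frame should be emitted to verify the connection.
 */
static int64_t h2_next_ping_date(int64_t last_activity, int64_t conn_timeout)
{
    int64_t deadline = last_activity + conn_timeout;
    int64_t ping = deadline - PING_MARGIN;
    /* never schedule the ping before the last activity itself */
    return ping > last_activity ? ping : last_activity;
}
```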
- ALPN : ALPN must be observed on the client side, and transmitted to the
  server side.

- proxy protocol : the proxy protocol makes little to no sense in a
  multiplexed protocol. A per-stream equivalent will surely be needed if
  implementations do not quickly generalize the use of Forwarded.

- simplified protocol for local devices (eg: haproxy->varnish in clear and
  without handshake, and possibly even with splicing if the connection's
  settings are shared)

- logging : logging must report a number of extra pieces of information such
  as the stream ID, and whether the transaction was initiated by the client
  or by the server (which can be deduced from the stream ID's parity). In
  case of push, the number of the associated stream must also be reported.

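The parity rule (client-initiated streams carry odd IDs, server-initiated
ones even IDs) makes the deduction a one-liner:

```c
#include <stdint.h>
#include <string.h>

/* Client-initiated HTTP/2 streams use odd IDs, server-initiated ones
 * (eg: pushes) use even IDs, so the initiator can be deduced from the
 * low-order bit of the stream ID.
 */
static const char *h2_stream_initiator(uint32_t stream_id)
{
    return (stream_id & 1) ? "client" : "server";
}
```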
- memory usage : H2 increases memory usage by mandating a minimum frame size
  of 16384 bytes. That means slightly more than 16kB of buffer in each
  direction to process any frame. It will definitely have an impact on the
  deployed maxconn setting in places using less than this (4..8kB are
  common). Also, the header list is persistent per connection, so if we
  reach the same size as the request, that's another 16kB in each direction,
  resulting in about 48kB of memory where 8kB were previously used. A more
  careful encoder can work with a much smaller set even if that implies
  evicting entries between multiple headers of the same message.

- HTTP/1.0 must be transported over H2 very carefully. Since there's no way
  to pass version information in the protocol, the server could use some
  features of HTTP/1.1 that are unsafe in HTTP/1.0 (compression, trailers,
  ...).

- host / :authority : ":authority" is the norm, and "host" will be absent
  when H2 clients generate :authority. This probably means that a dummy Host
  header will have to be produced internally from :authority and removed
  when passing to H2 behind. This can cause some trouble when passing H2
  requests to H1 proxies, because there's no way to know from the H2 request
  whether the H1 request should carry the scheme and authority or not. Thus
  a "proxy" option will have to be explicitly mentioned on HTTP/1 server
  lines. One problem this creates is that it's no longer possible to pass
  H/1 requests to H/1 proxies without an explicit configuration. Maybe a
  table of the various combinations is needed.

                          :scheme    :authority    host
    HTTP/2 request        present    present       absent
    HTTP/1 server req     absent     absent        present
    HTTP/1 proxy req      present    present       present

  So in the end the issue is only with H/2 requests passed to H/1 proxies.

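The dummy Host production mentioned above is essentially this (a simplified
sketch; real buffer handling would go through the usual chunk helpers):

```c
#include <stdio.h>
#include <string.h>

/* When an H2 request only carries ":authority", synthesize a "Host"
 * header from it for the H1 side. Returns the header length, or -1 if
 * the destination buffer is too small.
 */
static int h2_make_host_hdr(const char *authority, char *out, size_t outsz)
{
    int ret = snprintf(out, outsz, "Host: %s", authority);
    return (ret > 0 && (size_t)ret < outsz) ? ret : -1;
}
```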
- ping frames : they don't indicate any stream ID so by definition they
  cannot be forwarded to any server. Only the H2 connection itself should
  deal with them.

There's a layering problem with H2. The framing layer has to be aware of the
upper layer semantics. We can't simply re-encode HTTP/1 to HTTP/2 then pass
it over a framing layer to mux the streams; the frame type must be passed
below so that frames are properly arranged. Header encoding is
connection-based and all streams using the same connection interact in the
way their headers are encoded. Thus the encoder *has* to be placed in the
h2_conn entity, and this entity has to know for each stream what its headers
are.

We should probably remove *all* headers from the transported data and move
them on the fly to a parallel structure that can be shared between H1 and H2
and consumed at the appropriate level. That means buffers only transport
data. Trailers have to be dealt with differently.

So if we consider an H1 request being forwarded between a client and a
server, it would look approximately like this :

  - request headers + body land into a stream's receive buffer
  - headers are indexed and stripped out so that only the body and whatever
    follows remain in the buffer
  - both the header index and the buffer with the body stay attached to the
    stream
  - the sender can rebuild the whole headers. Since they're found in a table
    supposed to be stable, it can rebuild them as many times as desired and
    will always get the same result, so it's safe to build them into the
    trash buffer for immediate sending, just as we do for the PROXY protocol.
  - the upper protocol should probably provide a build_hdr() callback which,
    when called by the socket layer, builds this header block based on the
    current stream's header list, ready to be sent.
  - the socket layer has to know how many bytes from the headers are left to
    be forwarded prior to processing the body.
  - the socket layer needs to consume only the acceptable part of the body
    and must not release the buffer if any data remains in it (eg:
    pipelining over H1). This is already handled by channel->o and
    channel->to_forward.
  - we could possibly have another optional callback to send a preamble
    before the data, which could be used to send chunk sizes in H1. The
    danger is that it absolutely needs to be stable if it has to be retried.
    But it could considerably simplify de-chunking.

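The build_hdr() callback evoked in the steps above might have roughly this
shape (all names are invented for illustration; the toy builder serializes a
fixed block just to show that rebuilding is repeatable and retry-safe):

```c
#include <string.h>
#include <sys/types.h>

struct hdr_list;                        /* opaque per-stream header list */

/* The socket layer hands a destination buffer; the upper protocol
 * serializes the stream's current header list into it. Since the table
 * is stable, the block can be rebuilt identically on every retry.
 */
typedef ssize_t (*build_hdr_cb)(const struct hdr_list *hdrs,
                                char *dst, size_t dstsz);

/* Toy H1 builder producing a fixed header block for demonstration. */
static ssize_t h1_build_hdr(const struct hdr_list *hdrs,
                            char *dst, size_t dstsz)
{
    (void)hdrs;
    static const char blk[] = "GET / HTTP/1.1\r\nHost: example.org\r\n\r\n";
    if (sizeof(blk) - 1 > dstsz)
        return -1;                      /* retry later, same input */
    memcpy(dst, blk, sizeof(blk) - 1);
    return sizeof(blk) - 1;
}
```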
When the request is sent to an H2 server, an H2 stream request must be made
to the server: we find an existing connection whose settings are compatible
with our needs (eg: tls/clear, push/no-push) and which has a spare stream
ID. If none is found, a new connection must be established, unless maxconn
is reached.

Servers must have a maxstream setting just like they have a maxconn. The
same queue may be used for that.

The "tcp-request content" ruleset must apply to the TCP layer. But with
HTTP/2 that becomes impossible (and useless). We still need something like
the "tcp-request session" hook to apply just after the SSL handshake is
done.

It is impossible to defragment the body on the fly in HTTP/2. Since multiple
messages are interleaved, we cannot wait for all of them without blocking
the head of line. Thus if body analysis is required, it will have to use the
stream's buffer, which necessarily implies a copy. That means that with each
H2 end we necessarily have at least one copy. Sometimes we might be able to
"splice" some bytes from one side to the other without copying into the
stream buffer (same rules as for TCP splicing).

In theory, only data should flow through the channel buffer, so each side's
connector is responsible for encoding the data (H1: linear/chunks, H2:
frames). Maybe the same mechanism could be extrapolated to tunnels / TCP.

Since we'd use buffers only for data (and for receipt of headers), we need
to have dynamic buffer allocation.

Thus :
  - Tx buffers do not exist. We allocate a buffer on the fly when we're
    ready to send something that we need to build and that needs to be
    persistent in case of a partial send. H1 headers are built on the fly
    from the header table into a temporary buffer that is immediately sent,
    and whose amount of sent bytes is the only information kept (like for
    the PROXY protocol). H2 headers are more complex since the encoding
    depends on what was successfully sent. Thus we need to build them and
    put them into a temporary buffer that remains persistent in case send()
    fails. It is possible to have a limited pool of Tx buffers and to
    refrain from sending if there is no more buffer available in the pool.
    In that case we need a wake-up mechanism once a buffer is available.
    Once the data are sent, the Tx buffer is immediately recycled into its
    pool. Note that having no Tx buffer available (eg: for hdr or control)
    means that we have to be able to serialize access to the connection and
    retry with the same stream. It also means that a stream that times out
    while waiting for the connector to read the second half of its request
    has to stay there, or at least needs to be handled gracefully. However
    if the connector cannot read the data to be sent, it means that the
    buffer is congested and the connection is dead, so that probably means
    it can be killed.

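The limited Tx pool can be sketched like this (a minimal model with assumed
sizes; a sender that gets NULL must subscribe to a wake-up list and retry
with the same stream once a buffer is released):

```c
#include <stddef.h>

#define TX_POOL_SZ 4
#define TX_BUF_SZ  16384    /* H2 minimum frame size */

static char tx_area[TX_POOL_SZ][TX_BUF_SZ];
static char *tx_free[TX_POOL_SZ] = {
    tx_area[0], tx_area[1], tx_area[2], tx_area[3]
};
static int tx_free_cnt = TX_POOL_SZ;

/* Pick a Tx buffer, or NULL when the pool is exhausted (caller must
 * subscribe for a wake-up and retry later).
 */
static char *tx_buf_get(void)
{
    return tx_free_cnt ? tx_free[--tx_free_cnt] : NULL;
}

/* Recycle a buffer into the pool immediately after a successful send. */
static void tx_buf_put(char *buf)
{
    tx_free[tx_free_cnt++] = buf;
}
```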
  - Rx buffers have to be pre-allocated just before calling recv(). A
    connection will first try to pick a buffer and disable reception if it
    fails, then subscribe to the list of tasks waiting for an Rx buffer.

  - full Rx buffers might sometimes be moved around to the next buffer
    instead of undergoing a copy. That means that channels and connectors
    must use the same buffer format, and that only the channel will have to
    see its pointers adjusted.

  - Tx of data should be made as much as possible without copying. That
    possibly means directly looking into the connection buffer on the other
    side if the local Tx buffer does not exist and the stream buffer is not
    allocated, or even performing a splice() call between the two sides.
    One of the problems in doing this is that it requires proper ordering
    of the operations (eg: when multiple readers are attached to the same
    buffer). If the splitting occurs upon receipt, there's no problem. If
    we expect to retrieve data directly from the original buffer, it's
    harder since it contains various things in an order which does not even
    indicate what belongs to whom. Thus possibly the only mechanism to
    implement is the buffer permutation which guarantees zero-copy, and
    only in the 100% safe case. Also it's atomic and does not cause HOL
    blocking.

It makes sense to choose the frontend_accept() function right after the
handshake has ended. It is then possible to check the ALPN, the SNI, the
ciphers, and to accept to switch to the h2_conn_accept handler only if
everything is OK. The h2_conn_accept handler will have to deal with the
connection setup, initialization of the header table, the exchange of
settings frames, and preparing whatever is needed to fire new streams upon
receipt of unknown stream IDs. Note: most of the time it will not be
possible to splice() because we need to know the amount of bytes in advance
in order to write the header, and here it will not be possible.

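The post-handshake dispatch amounts to inspecting the negotiated ALPN token
and picking the handler (a sketch; the handler names are invented):

```c
#include <string.h>
#include <stddef.h>

enum conn_handler { H1_CONN_ACCEPT, H2_CONN_ACCEPT };

/* "h2" is the ALPN token registered for HTTP/2 over TLS; anything else
 * (or no ALPN at all) falls back to the HTTP/1 handler.
 */
static enum conn_handler pick_conn_handler(const char *alpn)
{
    if (alpn && strcmp(alpn, "h2") == 0)
        return H2_CONN_ACCEPT;
    return H1_CONN_ACCEPT;
}
```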
H2 health checks must be seen as regular transactions/streams. The check
runs a normal client which seeks an available stream from a server. The
server then finds one on an existing connection or initiates a new H2
connection. The H2 checks will have to be configurable for sharing streams
or not. Another option could be to specify how many requests can be made
over existing connections before insisting on getting a separate connection.
Note that such separate connections might end up stacking up once released,
so they probably need to be recycled very quickly (eg: cap the maximum
number of unused ones).