2014/10/23 - design thoughts for HTTP/2 (Willy Tarreau)

- connections : HTTP/2 depends a lot more on a connection than HTTP/1 because
  a connection holds a compression context (headers table, etc.). We probably
  need to have an h2_conn struct.

- multiple transactions will be handled in parallel for a given h2_conn. They
  are called streams in HTTP/2 terminology.

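As a rough illustration, the two entities could look like this (a hypothetical
sketch only; all field names are invented for the example):

```c
#include <stdint.h>

/* Hypothetical sketch: one h2_conn per HTTP/2 connection, holding the
 * compression context shared by all of its streams. Names illustrative.
 */
struct h2_conn {
    uint32_t last_stream_id;   /* highest stream ID seen so far */
    int32_t  conn_window;      /* connection-level flow control window */
    /* HPACK dynamic table, settings, list of active streams, ... */
};

struct h2_stream {
    struct h2_conn *conn;      /* the connection this stream runs on */
    uint32_t id;               /* per-connection stream identifier */
    int32_t  window;           /* per-stream flow control window */
};
```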
- multiplexing : for a given client-side h2 connection, we can have multiple
  server-side h2 connections. And for a server-side h2 connection, we can
  have multiple client-side h2 connections. Streams circulate in an N-to-N
  fashion.

- flow control : flow control will be applied between multiple streams.
  Special care must be taken so that an H2 client cannot block some H2
  servers by sending requests spread over multiple servers to the point
  where one server response is blocked and prevents other responses from the
  same server from reaching their clients. H2 connection buffers must always
  be empty or nearly empty. The per-stream flow control needs to be respected
  as well as the connection's buffers. It is important to implement some
  fairness between all the streams so that it's not always the same stream
  that gets the bandwidth when the connection is congested.

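The two-level accounting implied above can be sketched as follows (a minimal
model, not HAProxy code; names are made up):

```c
#include <stdint.h>

/* Minimal two-level flow control model: a DATA frame may only carry as
 * many bytes as both the stream window and the connection window allow.
 */
struct fc_window { int32_t conn_wnd, strm_wnd; };

/* Return how many bytes of <len> may be sent now. */
static int32_t fc_sendable(const struct fc_window *w, int32_t len)
{
    int32_t max = w->conn_wnd < w->strm_wnd ? w->conn_wnd : w->strm_wnd;
    if (max < 0)
        max = 0;
    return len < max ? len : max;
}

/* Account for <len> bytes actually sent on this stream. */
static void fc_consume(struct fc_window *w, int32_t len)
{
    w->conn_wnd -= len;
    w->strm_wnd -= len;
}
```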
- some clients can be H1 with an H2 server (is this really needed?). Most of
  the initial use case will be H2 clients to H1 servers. It is important to
  keep in mind that H1 servers do not do flow control and that we don't want
  them to block transfers (eg: post upload).

- internal tasks : some H2 clients will be internal tasks (eg: health
  checks). Some H2 servers will be internal tasks (eg: stats, cache). The
  model must be compatible with this use case.

- header indexing : headers are transported compressed, with a reference to a
  static or a dynamic header, or a literal, possibly Huffman-encoded.
  Indexing is specific to the H2 connection. This means there is no way any
  binary data can flow between both sides; headers will have to be decoded
  according to the incoming connection's context and re-encoded according to
  the outgoing connection's context, which can significantly differ. In order
  to avoid the parsing trouble we currently face, headers will have to be
  clearly split between name and value. It is worth noting that neither the
  incoming nor the outgoing connections' contexts will be of any use while
  processing the headers. At best we can have some shortcuts for well-known
  names that map well to the static ones (eg: use the first static entry with
  the same name), and maybe have a few special cases for static name+value as
  well. We can probably classify headers into these categories :

    - static name + value
    - static name + other value
    - dynamic name + other value

  This will allow for better processing in some specific cases. Headers
  supporting a single value (:method, :status, :path, ...) should probably
  be stored in a single location with direct access. That would allow us
  to retrieve a method using hdr[METHOD]. All such indexing must be performed
  while parsing. That also means that HTTP/1 will have to be converted to
  this representation very early in the parser and possibly converted back
  to H/1 after processing.

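The hdr[METHOD] direct-access idea could look like this (an illustrative
sketch; the enum and struct names are invented):

```c
#include <string.h>

/* Hypothetical direct-access slots for single-valued (pseudo-)headers,
 * filled in while parsing so that lookups cost a simple array access.
 */
enum h2_wkhdr { HDR_METHOD, HDR_SCHEME, HDR_AUTHORITY, HDR_PATH, HDR_STATUS,
                HDR_ENTRIES };

struct h2_hdr_idx {
    const char *wk[HDR_ENTRIES];   /* well-known single-valued headers */
};

static const char *h2_get_method(const struct h2_hdr_idx *idx)
{
    return idx->wk[HDR_METHOD];    /* direct access, no list walk */
}
```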
  Header names/values will have to be placed in a small memory area that will
  inevitably get fragmented as headers are rewritten. An automatic packing
  mechanism must be implemented so that when there's no more room, headers
  are simply defragmented/packed into a new table and the old one is
  released. Just like for the static chunks, we need to have a few such
  tables pre-allocated and ready to be swapped at any moment. Repacking must
  not change any index nor affect the way headers are compressed so that it
  can happen late after a retry (send-name-header for example).

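The index-preserving repack can be sketched like this (a toy model under
assumed sizes; each slot keeps its position in the entry array so that the
indexes headers refer to never change, only the backing storage is compacted):

```c
#include <string.h>
#include <stddef.h>

#define NB_SLOTS 8
#define AREA_SZ  256

struct hdr_tbl {
    char area[AREA_SZ];          /* fragmented storage for name/value data */
    size_t used;                 /* bytes consumed in <area> */
    char *slot[NB_SLOTS];        /* slot i -> data, NULL when unused */
};

/* Copy live entries into <dst>, keeping the same slot indexes. */
static void hdr_tbl_repack(struct hdr_tbl *dst, const struct hdr_tbl *src)
{
    memset(dst, 0, sizeof(*dst));
    for (int i = 0; i < NB_SLOTS; i++) {
        if (!src->slot[i])
            continue;
        size_t len = strlen(src->slot[i]) + 1;
        memcpy(dst->area + dst->used, src->slot[i], len);
        dst->slot[i] = dst->area + dst->used;
        dst->used += len;
    }
}
```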
- header processing : can still happen on a (header, value) basis. Reqrep/
  rsprep completely disappear and will have to be replaced with something
  else to support renaming headers and rewriting the url/path/...

- push_promise : servers can push dummy requests+responses. They advertise
  the associated stream ID in the PUSH_PROMISE frame. This means that it is
  possible to initiate a client-server stream from the information coming
  from the server and make the data flow as if the client had initiated it.
  It's likely that we'll have to support two types of server connections:
  those which support push and those which do not, so that client streams
  can be distributed to existing server connections based on their
  capabilities. It's important to keep in mind that PUSH will not be
  rewritten in responses.

- stream ID mapping : since the stream ID is per H2 connection, stream IDs
  will have to be mapped. Thus a given stream is an entity with two IDs (one
  per side). Or more precisely, a stream has two endpoints, each one carrying
  an ID when it ends on an HTTP/2 connection. Also, for each stream ID we
  need to quickly find the associated transaction in progress. Using a small,
  fast lookup tree seems indicated considering the wide range of valid
  values.

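A toy illustration of the per-connection ID -> stream lookup (HAProxy would
use one of its own tree structures; this plain unbalanced BST only shows the
idea):

```c
#include <stdint.h>
#include <stdlib.h>

struct strm_node {
    uint32_t id;                   /* stream ID on this connection */
    void *stream;                  /* the transaction in progress */
    struct strm_node *l, *r;
};

static struct strm_node *strm_insert(struct strm_node *root, uint32_t id,
                                     void *stream)
{
    if (!root) {
        struct strm_node *n = calloc(1, sizeof(*n));
        n->id = id;
        n->stream = stream;
        return n;
    }
    if (id < root->id)
        root->l = strm_insert(root->l, id, stream);
    else if (id > root->id)
        root->r = strm_insert(root->r, id, stream);
    return root;
}

static void *strm_lookup(const struct strm_node *root, uint32_t id)
{
    while (root && root->id != id)
        root = id < root->id ? root->l : root->r;
    return root ? root->stream : NULL;
}
```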
- frame sizes : frames have to be remapped between both sides as multiplexed
  connections won't always have the same characteristics. Thus some frames
  might be spliced and others will be sliced.

- error processing : care must be taken to never break a connection unless
  it is dead or corrupt at the protocol level. Stats counters must exist to
  observe the causes. Timeouts are a great problem because silent
  connections might die of inactivity. Ping frames should probably be
  scheduled a few seconds before the connection timeout so that an unused
  connection is verified before being killed. Abnormal requests must be
  dealt with using RST_STREAM.

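The "ping a few seconds before the timeout" idea boils down to a small date
computation (purely illustrative, times in seconds, margin is an assumption):

```c
#include <stdint.h>

#define PING_MARGIN 3   /* assumed margin before the timeout fires */

/* Given the last activity date and the configured connection timeout,
 * compute when a PING frame should be emitted to verify the connection.
 */
static int64_t h2_next_ping_date(int64_t last_activity, int64_t conn_timeout)
{
    int64_t deadline = last_activity + conn_timeout;
    int64_t ping = deadline - PING_MARGIN;
    /* never schedule the ping before the last activity itself */
    return ping > last_activity ? ping : last_activity;
}
```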
- ALPN : ALPN must be observed on the client side, and transmitted to the
  server side.

- proxy protocol : the proxy protocol makes little to no sense in a
  multiplexed protocol. A per-stream equivalent will surely be needed if
  implementations do not quickly generalize the use of Forwarded.

- simplified protocol for local devices (eg: haproxy->varnish in clear and
  without handshake, and possibly even with splicing if the connection's
  settings are shared)

- logging : logging must report a number of extra pieces of information such
  as the stream ID, and whether the transaction was initiated by the client
  or by the server (which can be deduced from the stream ID's parity). In
  case of push, the number of the associated stream must also be reported.

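The parity rule (client-initiated streams carry odd IDs, server-initiated
ones even IDs) makes the deduction a one-liner:

```c
#include <stdint.h>
#include <string.h>

/* Client-initiated HTTP/2 streams use odd IDs, server-initiated ones
 * (eg: pushes) use even IDs, so the initiator can be deduced from the
 * low-order bit of the stream ID.
 */
static const char *h2_stream_initiator(uint32_t stream_id)
{
    return (stream_id & 1) ? "client" : "server";
}
```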
- memory usage : H2 increases memory usage by mandating a minimum frame size
  of 16384 bytes. That means slightly more than 16kB of buffer in each
  direction to process any frame. It will definitely have an impact on the
  deployed maxconn setting in places using less than this (4..8kB are
  common). Also, the header list is persistent per connection, so if we
  reach the same size as the request, that's another 16kB in each direction,
  resulting in about 48kB of memory where 8kB were previously used. A more
  careful encoder can work with a much smaller set even if that implies
  evicting entries between multiple headers of the same message.

- HTTP/1.0 must be transported over H2 very carefully. Since there's no way
  to pass version information in the protocol, the server could use some
  features of HTTP/1.1 that are unsafe in HTTP/1.0 (compression, trailers,
  ...).

- host / :authority : ":authority" is the norm, and "host" will be absent
  when H2 clients generate :authority. This probably means that a dummy Host
  header will have to be produced internally from :authority and removed
  when passing to H2 behind. This can cause some trouble when passing H2
  requests to H1 proxies, because there's no way to know from the H2 request
  whether the H1 request should carry the scheme and authority or not. Thus
  a "proxy" option will have to be explicitly mentioned on HTTP/1 server
  lines. One problem this creates is that it's no longer possible to pass
  H/1 requests to H/1 proxies without an explicit configuration. Maybe a
  table of the various combinations is needed.

                          :scheme    :authority    host
    HTTP/2 request        present    present       absent
    HTTP/1 server req     absent     absent        present
    HTTP/1 proxy req      present    present       present

  So in the end the issue is only with H/2 requests passed to H/1 proxies.

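The dummy Host production mentioned above is essentially this (a simplified
sketch; real buffer handling would go through the usual chunk helpers):

```c
#include <stdio.h>
#include <string.h>

/* When an H2 request only carries ":authority", synthesize a "Host"
 * header from it for the H1 side. Returns the header length, or -1 if
 * the destination buffer is too small.
 */
static int h2_make_host_hdr(const char *authority, char *out, size_t outsz)
{
    int ret = snprintf(out, outsz, "Host: %s", authority);
    return (ret > 0 && (size_t)ret < outsz) ? ret : -1;
}
```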
- ping frames : they don't indicate any stream ID so by definition they
  cannot be forwarded to any server. Only the H2 connection itself should
  deal with them.

There's a layering problem with H2. The framing layer has to be aware of the
upper layer semantics. We can't simply re-encode HTTP/1 to HTTP/2 then pass
it over a framing layer to mux the streams; the frame type must be passed
below so that frames are properly arranged. Header encoding is
connection-based and all streams using the same connection interact in the
way their headers are encoded. Thus the encoder *has* to be placed in the
h2_conn entity, and this entity has to know for each stream what its headers
are.

We should probably remove *all* headers from the transported data and move
them on the fly to a parallel structure that can be shared between H1 and H2
and consumed at the appropriate level. That means buffers only transport
data. Trailers have to be dealt with differently.

So if we consider an H1 request being forwarded between a client and a
server, it would look approximately like this :

  - request headers + body land into a stream's receive buffer
  - headers are indexed and stripped out so that only the body and whatever
    follows remain in the buffer
  - both the header index and the buffer with the body stay attached to the
    stream
  - the sender can rebuild the whole headers. Since they're found in a table
    supposed to be stable, it can rebuild them as many times as desired and
    will always get the same result, so it's safe to build them into the
    trash buffer for immediate sending, just as we do for the PROXY protocol.
  - the upper protocol should probably provide a build_hdr() callback which,
    when called by the socket layer, builds this header block based on the
    current stream's header list, ready to be sent.
  - the socket layer has to know how many bytes from the headers are left to
    be forwarded prior to processing the body.
  - the socket layer needs to consume only the acceptable part of the body
    and must not release the buffer if any data remains in it (eg:
    pipelining over H1). This is already handled by channel->o and
    channel->to_forward.
  - we could possibly have another optional callback to send a preamble
    before the data, which could be used to send chunk sizes in H1. The
    danger is that it absolutely needs to be stable if it has to be retried.
    But it could considerably simplify de-chunking.

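The build_hdr() callback evoked in the steps above might have roughly this
shape (all names are invented for illustration; the toy builder serializes a
fixed block just to show that rebuilding is repeatable and retry-safe):

```c
#include <string.h>
#include <sys/types.h>

struct hdr_list;                        /* opaque per-stream header list */

/* The socket layer hands a destination buffer; the upper protocol
 * serializes the stream's current header list into it. Since the table
 * is stable, the block can be rebuilt identically on every retry.
 */
typedef ssize_t (*build_hdr_cb)(const struct hdr_list *hdrs,
                                char *dst, size_t dstsz);

/* Toy H1 builder producing a fixed header block for demonstration. */
static ssize_t h1_build_hdr(const struct hdr_list *hdrs,
                            char *dst, size_t dstsz)
{
    (void)hdrs;
    static const char blk[] = "GET / HTTP/1.1\r\nHost: example.org\r\n\r\n";
    if (sizeof(blk) - 1 > dstsz)
        return -1;                      /* retry later, same input */
    memcpy(dst, blk, sizeof(blk) - 1);
    return sizeof(blk) - 1;
}
```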
When the request is sent to an H2 server, an H2 stream request must be made
to the server: we find an existing connection whose settings are compatible
with our needs (eg: tls/clear, push/no-push) and which has a spare stream
ID. If none is found, a new connection must be established, unless maxconn
is reached.

Servers must have a maxstream setting just like they have a maxconn. The
same queue may be used for that.

The "tcp-request content" ruleset must apply to the TCP layer. But with
HTTP/2 that becomes impossible (and useless). We still need something like
the "tcp-request session" hook to apply just after the SSL handshake is
done.

It is impossible to defragment the body on the fly in HTTP/2. Since multiple
messages are interleaved, we cannot wait for all of them without blocking
the head of line. Thus if body analysis is required, it will have to use the
stream's buffer, which necessarily implies a copy. That means that with each
H2 end we necessarily have at least one copy. Sometimes we might be able to
"splice" some bytes from one side to the other without copying into the
stream buffer (same rules as for TCP splicing).

In theory, only data should flow through the channel buffer, so each side's
connector is responsible for encoding the data (H1: linear/chunks, H2:
frames). Maybe the same mechanism could be extrapolated to tunnels / TCP.

Since we'd use buffers only for data (and for receipt of headers), we need
to have dynamic buffer allocation.

Thus :
  - Tx buffers do not exist. We allocate a buffer on the fly when we're
    ready to send something that we need to build and that needs to be
    persistent in case of a partial send. H1 headers are built on the fly
    from the header table into a temporary buffer that is immediately sent,
    and whose amount of sent bytes is the only information kept (like for
    the PROXY protocol). H2 headers are more complex since the encoding
    depends on what was successfully sent. Thus we need to build them and
    put them into a temporary buffer that remains persistent in case send()
    fails. It is possible to have a limited pool of Tx buffers and to
    refrain from sending if there is no more buffer available in the pool.
    In that case we need a wake-up mechanism once a buffer is available.
    Once the data are sent, the Tx buffer is immediately recycled into its
    pool. Note that having no Tx buffer available (eg: for hdr or control)
    means that we have to be able to serialize access to the connection and
    retry with the same stream. It also means that a stream that times out
    while waiting for the connector to read the second half of its request
    has to stay there, or at least needs to be handled gracefully. However
    if the connector cannot read the data to be sent, it means that the
    buffer is congested and the connection is dead, so that probably means
    it can be killed.

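The limited Tx pool can be sketched like this (a minimal model with assumed
sizes; a sender that gets NULL must subscribe to a wake-up list and retry
with the same stream once a buffer is released):

```c
#include <stddef.h>

#define TX_POOL_SZ 4
#define TX_BUF_SZ  16384    /* H2 minimum frame size */

static char tx_area[TX_POOL_SZ][TX_BUF_SZ];
static char *tx_free[TX_POOL_SZ] = {
    tx_area[0], tx_area[1], tx_area[2], tx_area[3]
};
static int tx_free_cnt = TX_POOL_SZ;

/* Pick a Tx buffer, or NULL when the pool is exhausted (caller must
 * subscribe for a wake-up and retry later).
 */
static char *tx_buf_get(void)
{
    return tx_free_cnt ? tx_free[--tx_free_cnt] : NULL;
}

/* Recycle a buffer into the pool immediately after a successful send. */
static void tx_buf_put(char *buf)
{
    tx_free[tx_free_cnt++] = buf;
}
```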
  - Rx buffers have to be pre-allocated just before calling recv(). A
    connection will first try to pick a buffer and disable reception if it
    fails, then subscribe to the list of tasks waiting for an Rx buffer.

  - full Rx buffers might sometimes be moved around to the next buffer
    instead of undergoing a copy. That means that channels and connectors
    must use the same buffer format, and that only the channel will have to
    see its pointers adjusted.

  - Tx of data should be made as much as possible without copying. That
    possibly means directly looking into the connection buffer on the other
    side if the local Tx buffer does not exist and the stream buffer is not
    allocated, or even performing a splice() call between the two sides.
    One of the problems in doing this is that it requires proper ordering
    of the operations (eg: when multiple readers are attached to the same
    buffer). If the splitting occurs upon receipt, there's no problem. If
    we expect to retrieve data directly from the original buffer, it's
    harder since it contains various things in an order which does not even
    indicate what belongs to whom. Thus possibly the only mechanism to
    implement is the buffer permutation which guarantees zero-copy, and
    only in the 100% safe case. Also it's atomic and does not cause HOL
    blocking.

It makes sense to choose the frontend_accept() function right after the
handshake has ended. It is then possible to check the ALPN, the SNI, the
ciphers, and to accept to switch to the h2_conn_accept handler only if
everything is OK. The h2_conn_accept handler will have to deal with the
connection setup, initialization of the header table, the exchange of
settings frames, and preparing whatever is needed to fire new streams upon
receipt of unknown stream IDs. Note: most of the time it will not be
possible to splice() because we need to know the amount of bytes in advance
in order to write the header, and here it will not be possible.

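The post-handshake dispatch amounts to inspecting the negotiated ALPN token
and picking the handler (a sketch; the handler names are invented):

```c
#include <string.h>
#include <stddef.h>

enum conn_handler { H1_CONN_ACCEPT, H2_CONN_ACCEPT };

/* "h2" is the ALPN token registered for HTTP/2 over TLS; anything else
 * (or no ALPN at all) falls back to the HTTP/1 handler.
 */
static enum conn_handler pick_conn_handler(const char *alpn)
{
    if (alpn && strcmp(alpn, "h2") == 0)
        return H2_CONN_ACCEPT;
    return H1_CONN_ACCEPT;
}
```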
H2 health checks must be seen as regular transactions/streams. The check
runs a normal client which seeks an available stream from a server. The
server then finds one on an existing connection or initiates a new H2
connection. The H2 checks will have to be configurable for sharing streams
or not. Another option could be to specify how many requests can be made
over existing connections before insisting on getting a separate connection.
Note that such separate connections might end up stacking up once released,
so they probably need to be recycled very quickly (eg: cap the maximum
number of unused ones).