Blame - doc/internals/notes-layers.txt - haproxy

2018-02-21 - Layering in haproxy 1.9
------------------------------------

2 main zones :
  - application : reads from conn_streams, writes to conn_streams, often uses
    streams

  - connection : receives data from the network, presented into buffers
    available via conn_streams, sends data to the network


The connection zone contains multiple layers which behave independently in each
direction. The Rx direction is activated upon callbacks from the lower layers.
The Tx direction is activated recursively from the upper layers. Between every
two layers there may be a buffer, in each direction. When a buffer is full in
either the Tx or the Rx direction, this direction is paused between the network
layer and the location where the congestion is encountered. Upon end of
congestion (cs_recv() from the upper layer, or sendto() at the lower layers), a
tasklet_wakeup() is performed on the blocked layer so that suspended operations
can be resumed. In this case, the Rx side restarts propagating data upwards
from the lowest blocked level, while the Tx side restarts propagating data
downwards from the highest blocked level. Proceeding like this ensures that
information known to the producer may always be used to tailor the buffer sizes
or decide on a strategy to best aggregate data. Additionally, each time a layer
is crossed without transformation, it becomes possible to send without copying.
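
The pause/resume behaviour on the Tx side can be illustrated with a minimal
sketch. Everything here is hypothetical (struct demo_txbuf, demo_tx_push(),
demo_tx_drain()); the wakeups counter merely stands in for a tasklet_wakeup()
on the blocked layer and this is not haproxy's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bounded Tx buffer sitting between two layers. */
struct demo_txbuf {
    size_t size;      /* capacity in bytes */
    size_t used;      /* bytes currently queued */
    int blocked;      /* the producer was paused on a full buffer */
    unsigned wakeups; /* stands in for tasklet_wakeup() calls */
};

/* Upper layer pushes <len> bytes downwards. Returns the number of bytes
 * accepted; a short count means congestion and marks the layer as blocked.
 */
static size_t demo_tx_push(struct demo_txbuf *b, size_t len)
{
    size_t room;

    assert(b->used <= b->size);
    room = b->size - b->used;
    if (len > room) {
        len = room;
        b->blocked = 1;
    }
    b->used += len;
    return len;
}

/* Lower layer consumes up to <len> bytes (what a successful sendto() would
 * do). If the producer was blocked and room was released, it is woken up.
 */
static size_t demo_tx_drain(struct demo_txbuf *b, size_t len)
{
    if (len > b->used)
        len = b->used;
    b->used -= len;
    if (len && b->blocked) {
        b->blocked = 0;
        b->wakeups++;
    }
    return len;
}
```

In this sketch the producer is only woken when it had actually been paused,
which avoids useless wakeups when the buffer never filled up.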

The Rx side notifies the application of data readiness using a wakeup or a
callback. The Tx side notifies the application of room availability once data
have been moved, resulting in the uppermost buffer having some free space.

When crossing a mux downwards, it is possible that the sender is not allowed to
access the buffer because it is not yet its turn. This is not a problem: the
data remain in the conn_stream's buffer (or the stream's one) and the transfer
will be restarted once the mux is ready to consume these data.


         cs_recv()     -------.              cs_send()
  ^          +--------> ||||||  -------------+       ^
  |          |         -------'              |       |     stream
--|----------|-------------------------------|-------|-------------------
  |          |                               V       |     connection
 data      .---.                            |   |    room
ready!     |---|                            |---| available!
           |---|                            |---|
           |---|                            |---|
           |   |                            '---'
  ^          +----------------+-------+         |
  |          |                ^       |        /
 /           V                |       V       /
/       recvfrom()            |   sendto()   |
-------------|----------------|--------------|---------------------------
             |                | poll!        V             kernel


The cs_recv() function should act on pointers to buffer pointers, so that the
callee may decide to pass its own buffer directly by simply swapping pointers.
Similarly for cs_send() it is desirable to let the callee steal the buffer by
swapping the pointers. This way it remains possible to implement zero-copy
forwarding.
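
A minimal sketch of the pointer-swapping idea follows. The names (struct
demo_buf, demo_recv()) and the fixed 64-byte storage are assumptions made for
illustration only; the real buffer type and the cs_recv() prototype are not
described in these notes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical linear buffer. */
struct demo_buf {
    char data[64];
    size_t len;
};

/* Moves pending data from *src to *dst and returns the number of bytes
 * transferred. When *dst is empty, the two pointers are simply swapped and
 * no byte is copied; otherwise data are appended by copy, limited by the
 * room left in *dst. <copied> reports whether a copy was needed.
 */
static size_t demo_recv(struct demo_buf **dst, struct demo_buf **src,
                        int *copied)
{
    size_t n = (*src)->len;
    size_t room;

    assert((*dst)->len <= sizeof((*dst)->data));
    *copied = 0;
    if ((*dst)->len == 0) {
        struct demo_buf *tmp = *dst;

        *dst = *src;   /* the caller now owns the callee's buffer */
        *src = tmp;    /* the callee gets the caller's empty one  */
        return n;
    }

    room = sizeof((*dst)->data) - (*dst)->len;
    if (n > room)
        n = room;
    memcpy((*dst)->data + (*dst)->len, (*src)->data, n);
    memmove((*src)->data, (*src)->data + n, (*src)->len - n);
    (*dst)->len += n;
    (*src)->len -= n;
    *copied = n > 0;
    return n;
}
```

Because the function receives pointers to the buffer pointers, the caller
does not need to know whether a swap or a copy took place: it simply keeps
using *dst afterwards.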

Some operation flags will be needed on cs_recv() :
  - RECV_ZERO_COPY : refuse to merge new data into the current buffer if it
    will result in a data copy (i.e. the buffer is not empty), unless no more
    than XXX bytes have to be copied (e.g. copying 2 cache lines may be cheaper
    than waiting and playing with pointers)

  - RECV_AT_ONCE : only perform the operation if it leaves the source buffer
    empty at the end of the operation, so that no two buffers remain allocated
    at the end. Most of the time it will result in either a small read or a
    zero-copy operation.

  - RECV_PEEK : retrieve a copy of pending data without removing these data
    from the source buffer. An alternate solution could consist in finding the
    pointer to the source buffer and accessing these data directly, except
    that it might be less interesting for the long term, thread-wise.

  - RECV_MIN : receive minimum X bytes (or less with a shutdown), or fail.
    This should help various protocol parsers which need to receive a complete
    frame before proceeding.

  - RECV_ENOUGH : no more data expected after this read if it's of the
    requested size, thus no need to re-enable receiving on the lower layers.

  - RECV_ONE_SHOT : perform a single read without re-enabling reading on the
    lower layers, like we currently do when receiving an HTTP/1 request. Like
    RECV_ENOUGH where any size is enough. The two could probably be merged
    (e.g. by having a MIN argument like RECV_MIN).
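
As an illustration, the decision implied by three of these flags could look
like the sketch below. The flag names come from the list above, but their
bit values, the DEMO_MAX_COPY threshold (the "XXX bytes" above, set to an
arbitrary value here), struct demo_recv_ctx and demo_recv_allowed() are all
assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Flag names are from the notes; the bit values are arbitrary. */
enum demo_recv_flags {
    RECV_ZERO_COPY = 1 << 0,
    RECV_AT_ONCE   = 1 << 1,
    RECV_PEEK      = 1 << 2,
    RECV_MIN       = 1 << 3,
    RECV_ENOUGH    = 1 << 4,
    RECV_ONE_SHOT  = 1 << 5,
};

/* Arbitrary stand-in for the unspecified copy threshold. */
#define DEMO_MAX_COPY 128

/* Hypothetical snapshot of both ends at the time of the call. */
struct demo_recv_ctx {
    size_t dst_len;   /* bytes already present in the destination */
    size_t dst_room;  /* free space in the destination */
    size_t src_len;   /* bytes pending in the source */
    size_t min;       /* minimum wanted, only used with RECV_MIN */
    int src_shut;     /* the source will not produce anything more */
};

/* Returns 1 if the receive may be performed now, 0 if it must be refused. */
static int demo_recv_allowed(unsigned flags, const struct demo_recv_ctx *c)
{
    int needs_copy;

    assert(c != NULL);
    /* an empty destination can be served by swapping buffers */
    needs_copy = c->dst_len != 0;

    if ((flags & RECV_MIN) && c->src_len < c->min && !c->src_shut)
        return 0;  /* not enough data yet and more may still come */
    if ((flags & RECV_AT_ONCE) && needs_copy && c->src_len > c->dst_room)
        return 0;  /* the source would not end up empty */
    if ((flags & RECV_ZERO_COPY) && needs_copy && c->src_len > DEMO_MAX_COPY)
        return 0;  /* too much to copy, better wait for an empty buffer */
    return 1;
}
```

RECV_PEEK, RECV_ENOUGH and RECV_ONE_SHOT do not affect whether the transfer
is allowed, so they are only declared in this sketch.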


Some operation flags will be needed on cs_send() :
  - SEND_ZERO_COPY : refuse to merge the presented data with existing data and
    prefer to wait for current data to leave and try again, unless the consumer
    considers the amount of data acceptable for a copy.

  - SEND_AT_ONCE : only perform the operation if it leaves the source buffer
    empty at the end of the operation, so that no two buffers remain allocated
    at the end. Most of the time it will result in either a small write or a
    zero-copy operation.


Both operations should return a composite status :
  - number of bytes transferred
  - status flags (shutr, shutw, reset, empty, full, ...)
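
One possible shape for such a composite status is sketched below, assuming a
small structure returned by value. The type, flag and function names
(struct demo_status, ST_*, demo_recv_status()) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Status flags named after the list above; bit values are arbitrary. */
enum demo_status_flags {
    ST_SHUTR = 1 << 0,
    ST_SHUTW = 1 << 1,
    ST_RESET = 1 << 2,
    ST_EMPTY = 1 << 3,
    ST_FULL  = 1 << 4,
};

/* Composite result: a byte count plus status flags. */
struct demo_status {
    size_t bytes;
    unsigned flags;
};

/* Simulated receive: takes up to <want> bytes out of the <*src_len> pending
 * bytes and reports, in the same result, whether the source is now empty
 * and whether a read shutdown was reached.
 */
static struct demo_status demo_recv_status(size_t *src_len, int src_shut,
                                           size_t want)
{
    struct demo_status st = { 0, 0 };

    assert(src_len != NULL);
    st.bytes = *src_len < want ? *src_len : want;
    *src_len -= st.bytes;
    if (*src_len == 0) {
        st.flags |= ST_EMPTY;
        if (src_shut)
            st.flags |= ST_SHUTR;
    }
    return st;
}
```

Returning both pieces at once lets the caller learn about the last bytes and
the shutdown in a single call, instead of needing one more call that returns
zero bytes.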


2018-07-23 - Update after merging rxbuf
---------------------------------------

It becomes visible that the mux will not always be welcome to decode incoming
data because it will sometimes imply extra memory copies and/or usage for no
benefit.

Ideally, when a stream is instantiated based on incoming data, these incoming
data should be passed and the upper layers called, but it should then be up to
these upper layers to peek more data in certain circumstances. Typically
if the pending connection data are larger than what is expected to be passed
above, it means some data may cause head-of-line blocking (HOL) to other
streams, and need to be pushed up through the layers to let other streams
continue to work. Similarly very large H2 data frames after header frames
should probably not be passed as they may require copies that could be avoided
if passed later. However if the decoded frame fits into the conn_stream's
buffer, there is an opportunity to use a single buffer for the conn_stream
and the channel. The H2 demux could set a blocking flag indicating it's waiting
for the upper stream to take over demuxing. This flag would be cleared once the
upper stream starts reading, or when extra data come and change the
conditions.

Forcing structured headers and raw data to coexist within a single buffer is
quite challenging for many code parts. For example it's perfectly possible to
see a fragmented buffer containing a series of headers, then a small data chunk
that was received at the same time, then a few other headers added by request
processing, then another data block received afterwards, then possibly yet
another header added by option http-send-name-header, and yet another data
block. This causes some pain for compression which still needs to know where
compressed and uncompressed data start/stop. It also makes it very difficult
to account for the exact bytes to pass through the various layers.

One solution consists in thinking about buffers using 3 representations :

  - a structured message, which is used for the internal HTTP representation.
    This message may only be atomically processed. It has no clear byte count,
    it's a message.

  - a raw stream, consisting in sequences of bytes. That's typically what
    happens in data sequences or in tunnel.

  - a pipe, which contains data to be forwarded, and that haproxy cannot have
    access to.

The processing efficiency decreases as the complexity of the representation
increases (pipe, then raw stream, then structured message), but the
capabilities increase. The structured message can contain anything including
serialized data blocks to be processed or forwarded. The raw stream contains
data blocks to be processed or forwarded. The pipe only contains data blocks
to be forwarded. The latter ones are only an optimization of the former
ones.

Thus ideally a channel should have access to all such 3 storage areas at once,
depending on the use case :
  (1) a structured message,
  (2) a raw stream,
  (3) a pipe

Right now a channel only has (2) and (3) but after the native HTTP rework, it
will only have (1) and (3). Placing a raw stream exclusively in (1) comes with
some performance drawbacks which are not easily recovered, and with some quite
difficult management still involving the reserve to ensure that a data block
doesn't prevent headers from being appended. But during header processing, the
payload may be necessary so we cannot decide to drop this option.

A long-term approach would consist in ensuring that a single channel may have
access to all 3 representations at once, and to enumerate priority rules to
define how they interact together. That's exactly what is currently being done
with the pipe and the raw buffer. Doing so would also save the need for storing
payload in the structured message and void the requirement for the reserve.
But it would cost more memory to process POST data and server responses. Thus
an intermediary step consists in keeping this model in mind but not
implementing everything yet.

Short term proposal : a channel has access to a buffer and a pipe. A non-empty
buffer is either in structured message format OR raw stream format. Only the
channel knows. However a structured buffer MAY contain raw data in a properly
formatted way (using the envelope defined by the structured message format).

By default, when a demux writes to a CS rxbuf, it will try to use the lowest
possible level for what is being done (i.e. splice if possible, otherwise raw
stream, otherwise structured message). If the buffer already contains a
structured message, then this format is exclusive. From this point the MUX has
two options : either encode the incoming data to match the structured message
format, or refrain from receiving into the CS's rxbuf and wait until the upper
layer requests those data.
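
The selection rule above can be sketched as a small decision function. All
names are hypothetical (demo_pick_level() and both enums). Two points are
assumptions not stated in the notes: the check for an existing structured
message is done before considering splicing, and headers are refused when
the buffer already holds raw data.

```c
#include <assert.h>

/* What the CS rxbuf currently holds. */
enum demo_buf_fmt { FMT_EMPTY, FMT_RAW, FMT_MSG };

/* Level chosen by the demux for the incoming data. */
enum demo_level { LVL_NONE, LVL_PIPE, LVL_RAW, LVL_MSG };

/* <is_payload> is non-zero for raw payload and zero for headers,
 * <can_splice> tells whether a pipe may be used, and <may_encode> tells
 * whether the mux accepts to wrap data into the structured format.
 */
static enum demo_level demo_pick_level(enum demo_buf_fmt fmt, int is_payload,
                                       int can_splice, int may_encode)
{
    assert(fmt == FMT_EMPTY || fmt == FMT_RAW || fmt == FMT_MSG);

    /* a structured message is exclusive: encode to match it or refrain */
    if (fmt == FMT_MSG)
        return may_encode ? LVL_MSG : LVL_NONE;

    /* payload uses the lowest possible level */
    if (is_payload)
        return can_splice ? LVL_PIPE : LVL_RAW;

    /* headers need the structured format, hence an empty buffer */
    return fmt == FMT_EMPTY ? LVL_MSG : LVL_NONE;
}
```

LVL_NONE corresponds to the "refrain from receiving" option: the data stay
in the mux until the upper layer asks for them.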

This opens a simplified option which could be suited even for the long term :
  - cs_recv() will take one or two flags to indicate if a buffer already
    contains a structured message or not ; the upper layer knows it.

  - cs_recv() will take two flags to indicate what the upper layer is willing
    to take :
      - structured message only
      - raw stream only
      - any of them

    From this point the mux can decide to either pass anything or refrain from
    doing so.

  - the demux stores the knowledge it has from the contents into some CS flags
    to indicate whether or not some structured messages are still available,
    and whether or not some raw data are still available. Thus the caller knows
    whether or not extra data are available.

  - when the demux works on its own, it refrains from passing structured data
    to a non-empty buffer, unless these data are causing trouble to other
    streams (HOL).

  - when a demux has to encapsulate raw data into a structured message, it will
    always have to respect a configured reserve so that extra header processing
    can be done on the structured message inside the buffer, regardless of the
    supposed available room. In addition, the upper layer may indicate using an
    extra recv() flag whether it wants the demux to defragment serialized data
    (for example by moving trailing headers apart) or if it's not necessary.
    This flag will be set by the stream interface if compression is required or
    if the http-buffer-request option is set for example. Using to_forward==0
    is probably a stronger indication that the reserve must be respected.

  - cs_recv() and cs_send() when fed with a message, should not return byte
    counts but message counts (i.e. 0 or 1). This implies that a single call to
    either of these functions cannot mix raw data and structured messages at
    the same time.
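
The combination of "what the upper layer is willing to take" and "message
counts versus byte counts" may be sketched as follows. The WANT_* flags,
struct demo_pending (standing in for the CS flags describing what the demux
still holds) and demo_rcv() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* What the upper layer accepts: two flags, "any" being both of them. */
enum demo_want {
    WANT_MSG = 1 << 0,
    WANT_RAW = 1 << 1,
    WANT_ANY = WANT_MSG | WANT_RAW,
};

/* What a call delivered; a single call never mixes both kinds. */
enum demo_kind { KIND_NONE, KIND_MSG, KIND_RAW };

/* What the demux still holds, a pending message always coming first. */
struct demo_pending {
    int has_msg;
    size_t raw_bytes;
};

/* Returns a message count (0 or 1) when *kind is KIND_MSG, a byte count
 * when *kind is KIND_RAW, and 0 with KIND_NONE when the mux refrains.
 */
static size_t demo_rcv(struct demo_pending *p, unsigned want, size_t max,
                       enum demo_kind *kind)
{
    assert(p != NULL && kind != NULL);
    *kind = KIND_NONE;

    if (p->has_msg) {
        if (!(want & WANT_MSG))
            return 0;  /* refrain: raw data may not overtake the message */
        p->has_msg = 0;
        *kind = KIND_MSG;
        return 1;
    }
    if (p->raw_bytes && (want & WANT_RAW)) {
        size_t n = p->raw_bytes < max ? p->raw_bytes : max;

        p->raw_bytes -= n;
        *kind = KIND_RAW;
        return n;
    }
    return 0;
}
```

After each call the caller can inspect has_msg and raw_bytes to know whether
extra data remain available, which is the role given to the CS flags above.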

At this point it looks like the conn_stream will have some encapsulation work
to do for the payload if it needs to be encapsulated into a message. This
further magnifies the importance of not decoding DATA frames into the CS's
rxbuf until really needed.

The CS will probably need to hold an indication of what is available at the mux
level, not only in the CS. E.g. we know that payload is still available.

Using these elements, it should be possible to ensure that full header frames
may be received without enforcing any reserve, that too large frames that do
not fit will be detected because they return 0 messages and indicate that such
a message is still pending, and that data availability is correctly detected
(later we may expect that the stream-interface allocates a larger or second
buffer to place the payload).

Regarding the ability for the channel to forward data, it looks like having a
new function "cs_xfer(src_cs, dst_cs, count)" could be very productive in
optimizing the forwarding to make use of splicing when available. It is not yet
totally clear whether it will split into "cs_xfer_in(src_cs, pipe, count)"
followed by "cs_xfer_out(dst_cs, pipe, count)" or anything different, and it
still needs to be studied. The general idea seems to be that the receiver might
have to call the sender directly once they agree on how to transfer data (pipe
or buffer). If the transfer is incomplete, the cs_xfer() return value and/or
flags will indicate the current situation (src empty, dst full, etc) so that
the caller may register for notifications on the appropriate event and wait to
be called again to continue.
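
The reporting part of this idea is sketched below on plain buffers only (no
pipe, no splicing). demo_xfer(), struct demo_end and the XFER_* flags are
hypothetical, and the real cs_xfer() prototype is explicitly left open by
the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BUFSZ 16

/* Hypothetical endpoint holding a small linear buffer. */
struct demo_end {
    char data[DEMO_BUFSZ];
    size_t len;
};

/* Reasons why a transfer stopped early; bit values are arbitrary. */
enum demo_xfer_flags {
    XFER_SRC_EMPTY = 1 << 0,
    XFER_DST_FULL  = 1 << 1,
};

struct demo_xfer_ret {
    size_t done;     /* bytes actually moved */
    unsigned flags;  /* only set when fewer than <count> bytes were moved */
};

/* Moves up to <count> bytes from <src> to <dst> and tells the caller which
 * side it should wait for when the transfer is incomplete.
 */
static struct demo_xfer_ret demo_xfer(struct demo_end *src,
                                      struct demo_end *dst, size_t count)
{
    struct demo_xfer_ret ret = { 0, 0 };
    size_t room;

    assert(src->len <= DEMO_BUFSZ && dst->len <= DEMO_BUFSZ);
    room = DEMO_BUFSZ - dst->len;
    ret.done = count;
    if (ret.done > src->len)
        ret.done = src->len;
    if (ret.done > room)
        ret.done = room;

    memcpy(dst->data + dst->len, src->data, ret.done);
    memmove(src->data, src->data + ret.done, src->len - ret.done);
    dst->len += ret.done;
    src->len -= ret.done;

    if (ret.done < count) {
        if (src->len == 0)
            ret.flags |= XFER_SRC_EMPTY;
        if (dst->len == DEMO_BUFSZ)
            ret.flags |= XFER_DST_FULL;
    }
    return ret;
}
```

With XFER_DST_FULL the caller would register for a "room available" event on
the destination, and with XFER_SRC_EMPTY for a "data ready" event on the
source, before calling again with the remaining count.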

Short term implementation :
  1) add new CS flags to qualify what the buffer contains and what we expect
     to read into it;

  2) set these flags to pretend we have a structured message when receiving
     headers (after all, H1 is an atomic header as well) and see what it
     implies for the code; for H1 it's unclear whether it makes sense to try
     to set it without the H1 mux.

  3) use these flags to refrain from sending DATA frames after HEADERS frames
     in H2.

  4) flush the flags at the stream interface layer when performing a cs_send().

  5) use the flags to enforce receipt of data only when necessary

We should be able to end up with sequential receipt in H2 modelling what is
needed for other protocols without interfering with the native H1 devs.


2018-08-17 - Considerations after killing cs_recv()
---------------------------------------------------

With the ongoing reorganisation of the I/O layers, it's visible that cs_recv()
will have to transfer data between the cs' rxbuf and the channel's buffer while
not being aware of the data format. Moreover, in case there's no data there, it
needs to recursively call the mux's rcv_buf() to trigger a decoding, while this
function is sometimes replaced with cs_recv(). All this shows that cs_recv() is
in fact needed while data are pushed upstream from the lower layers, and is not
suitable for the "pull" mode. Thus it was decided to remove this function and
put its code back into h2_rcv_buf(). The H1 mux's rcv_buf() already couldn't be
replaced with cs_recv() since it is the only one knowing about the buffer's
format.

This opportunity simplified something : if the cs's rxbuf is only read by the
mux's rcv_buf() method, then it doesn't need to be located into the CS and is
well placed into the mux's representation of the stream. This has an important
impact for H2 as it offers more freedom to the mux to allocate/free/reallocate
this buffer, and it ensures the mux always has access to it.

Furthermore, the conn_stream's txbuf experienced the same fate. Indeed, the H1
mux has already uncovered the difficulty related to the channel shutting down
on output, with data stuck into the CS's txbuf. Since the CS is tightly coupled
to the stream and the stream can close immediately once its buffers are empty,
it required a way to support orphaned CS with pending data in their txbuf. This
is something that the H2 mux already has to deal with, by carefully leaving the
data in the channel's buffer. But due to the snd_buf() call being top-down, it
is always possible to push the stream's data via the mux's snd_buf() call
without requiring a CS txbuf anymore. Thus the txbuf (when needed) is only
implemented in the mux and attached to the mux's representation of the stream,
and doing so allows the channel to be released immediately once the data are
safe in the mux's buffer.

This is an important change which clarifies the roles and responsibilities of
each layer in the chain : when receiving data from a mux, it's the mux's
responsibility to make sure it can correctly decode the incoming data and to
buffer the possible excess of data it cannot pass to the requester. This means
that decoding an H2 frame, which is not retryable since it has an impact on the
HPACK decompression context, and which cannot be reordered for the same reason,
simply needs to be performed to the H2 stream's rxbuf which will then be passed
to the stream when this one calls h2_rcv_buf(), even if it reads one byte at a
time. Similarly when calling h2_snd_buf(), it's the mux's responsibility to
read as much as it needs to be able to restart later, possibly by buffering
some data into a local buffer. And it's only once all the output data has been
consumed by snd_buf() that the stream is free to disappear.
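
The receive half of this split can be sketched as follows: a frame is decoded
exactly once into a buffer owned by the mux's view of the stream, then the
reader pulls from it at its own pace. struct demo_mux_stream,
demo_decode_frame() and demo_rcv_buf() are hypothetical, and "decoding" is
reduced to a plain copy since HPACK is out of scope here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mux-side representation of a stream, owning its rxbuf. */
struct demo_mux_stream {
    char rxbuf[64];
    size_t head;       /* offset of the first unread byte */
    size_t len;        /* number of unread bytes */
    unsigned decoded;  /* number of frames decoded so far */
};

/* Decodes one frame into the stream's rxbuf. This is done at most once per
 * frame: if there is no room, nothing is consumed and 0 is returned so that
 * the frame stays pending in the mux.
 */
static int demo_decode_frame(struct demo_mux_stream *s, const char *payload,
                             size_t n)
{
    if (s->head + s->len + n > sizeof(s->rxbuf))
        return 0;
    memcpy(s->rxbuf + s->head + s->len, payload, n);
    s->len += n;
    s->decoded++;
    return 1;
}

/* Called by the upper layer: hands over up to <max> already decoded bytes,
 * possibly a single one, without ever triggering a new decoding pass.
 */
static size_t demo_rcv_buf(struct demo_mux_stream *s, char *out, size_t max)
{
    size_t n = s->len < max ? s->len : max;

    assert(s->head + s->len <= sizeof(s->rxbuf));
    memcpy(out, s->rxbuf + s->head, n);
    s->head += n;
    s->len -= n;
    if (!s->len)
        s->head = 0;  /* buffer drained, restart from the beginning */
    return n;
}
```

Reading the decoded data one byte at a time leaves the decoded counter at 1,
which is the property the paragraph above relies on.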

This model presents the nice benefit of being infinitely stackable and solving
the last identified showstoppers to move towards a structured message internal
representation, as it will give full power to the rcv_buf() and snd_buf() to
process what they need.

For now the conn_stream's flags indicating whether a shutdown has been seen in
any direction or if an end of stream was seen will remain in the conn_stream,
though it's likely that some of them will move to the mux's representation of
the stream after structured messages are implemented.