Blame - doc/internals/buffer-rework.txt - haproxy

blob: 136377fd003d0d0f055c915f205ebeea801448dd [file] [log] [blame]

Willy Tarreau	fa9f9cc	2018-05-18 16:18:17 +0200	[diff] [blame]	1	2018-05-18 - Buffer rework
				2
				3	1. Summary
				4
				5	The situation with the current buffer structure is becoming problematic in
				6	the newly introduced muxes and causes serious difficulties preventing muxes
				7	from being used on both sides, unless requiring that all code is duplicated
				8	to use buf->i on the Rx path and buf->o on the Tx path.
				9
				10
				11	2. History
				12
				13	A very long time ago, buffers were used to receive data using the recv() call,
				14	to parse them, forward them, and send them over the network using the send()
				15	call. Buffers were split into (buffer,channel) when some layers started to be
				16	introduced, and were reorganized a few times to ease content processing and
				17	rewriting. The current definition of the buffer structure is the following :
				18
				19	struct buffer {
				20	char *p;
				21	uint size;
				22	uint i;
				23	uint o;
				24	char data[0];
				25	};
				26
				27	data p
				28	\| \|
				29	V V
				30	+-----------+--------------------+------------+-------------+
				31	\| \|////////////////////\|////////////\| \|
				32	+-----------+--------------------+------------+-------------+
				33	<---------------------------------------------------------> size
				34	<------------------> <---------->
				35	o i
				36
				37	Pointer (p) is initialized to (data) when the buffer is empty. Data are
				38	received after (p+i), increasing (i) by the number of bytes read. Data are sent
				39	from (p-o) for up to (o) bytes, decreasing (o) by the number of bytes sent.
				40	Forwarding data in the channel consists in advancing (p) by the number of bytes
				41	to forward, increasing (o) and decreasing (i) by the same amount.
				42
				43	This representation is convenient for channel operations because most of them
				44	require to parse input data between (p) and (p+i), and to have a simple way to
				45	forward data. Additionally, it's always easy to know if some data are scheduled
				46	for departure (o), or if the buffer has some room available (size-i-o).
				47
				48
				49	3. Problems
				50
				51	When applets were introduced, the initial code that was made to write data into
				52	the output part was modified to send it into the input part since we had to
				53	rely on the stream code to forward these data via the channel. This explains
				54	the flood of bi_* functions that were introduced to perform the same operations
				55	as the initial bo_*, to write into an input buffer from an applet.
				56
				57	Health checks however continue to use output because checks do not use streams
				58	nor channels. Thus the check buffers use buf->o for requests and buf->i for
				59	responses.
				60
				61	The introduction of muxes has changed this again by requiring that most request
				62	code was able to write to buf->i, pretending to be the equivalent of a socket
				63	recv() call. New bi_* functions had to be created to write headers and chunks
				64	from the HTTP/2 mux. Conversely, it was made necessary to parse HTTP traffic
				65	from buf->o while all the original code was made to parse this from buf->i.
				66
				67	Furthermore, implementing an outgoing mux (eg: HTTP/2) will require to
				68	duplicate a lot of the code to use buf->i instead of buf->o and conversely,
				69	just because the mux will not be placed on the same side of the buffer. Not
				70	only it complicates code maintenance but it also emphasizes the risk to use
				71	the wrong function at any moment.
				72
				73	From a performance perspective, applets have to suffer a useless copy most of
				74	the time, only due to API limitatoins : it is not possible to write directly to
				75	an input buffer, one has to write to a chunk and then copy it into a buffer. A
				76	compatible structure could allow to share the same data between the chunk and
				77	the buffer without having to perform an extra copy.
				78
				79
				80	4. Proposal
				81
				82	In checks and muxes, it is obvious that a single "side" of the buffer is used,
				83	and it generally is the one associated with the I/O to be performed. Only the
				84	channel requires the distinction between (i) and (o).
				85
				86	The proposal is to remove this distinction from the buffer and move ->o into
				87	the channel.
				88
				89	A buffer will then become only a linear (possibly wrapping) storage area with
				90	a beginning, and an end.
				91
				92	Given the experience gathered from past buffer API updates, we know that the
				93	buffer's end is not as much important as its data length. This will give the
				94	current representation :
				95
				96
				97	struct buffer {
				98	void *area; // start of the storage area
				99	uint head; // start offset of remaining data relative to area
				100	uint len; // contents length after head
				101	uint size; // size of the storage area (wrapping point)
				102	};
				103
				104	area
				105	\|
				106	V
				107	+-----------+---------------------------------+-------------+
				108	\| \|/////////////////////////////////\| \|
				109	+-----------+---------------------------------+-------------+
				110	<---------------------------------------------------------> size
				111	<---------> <------------------------------->
				112	head len
				113
				114	The channel will contain an "output" field corresponding to the current buf->o,
				115	indicating how many bytes of the current buffer are actually scheduled for
				116	being forwarded and must not be considered anymore. It means that a stream
				117	parser will simply start to parse from (buf->area + buf->head + chn->output)
				118	and stop at (buf->area + buf->head + buf->len).
				119
				120	For esnding data, the caller of cs_send() or whatever function will have to
				121	pass the desired number of bytes to send, and one will not expect anymore that
				122	all the buffer's contents have to be sent. In general the caller will have
				123	access to chn->output if it needs to use this (typically from the stream
				124	interface code at the moment).
				125
				126
				127	5. First implementation step
				128
				129	The first step will consist in limiting the changes to the current buffers. The
				130	buffer structure will still contain both a descriptor and the storage area. A
				131	buffer will first be declared this way :
				132
				133	struct buffer {
				134	uint head; // start offset of remaining data relative to area
				135	uint len; // contents length after head
				136	uint size; // size of the storage area (wrapping point)
				137	void area[0]; // start of the storage area
				138	};
				139
				140	Thanks to this, no changes will have to be performed on memory management, and
				141	buffers will continue to be allocated from a pool of size (sizeof(buffer) +
				142	tune.bufsize).
				143
				144	The following translations will have to be performed on the code :
				145	- occurrences of (buf->i + buf->o) will have to be replaced with (buf->len)
				146	- bi_ptr() -> ci_ptr() ; bi_end() -> b_head()+b_size() ; bi_del() -> b_del()
				147	- bo_ptr() -> b_head() ; bo_end() -> co_end()
				148	- b_adv() -> c_adv() ; b_rew() -> c_rew()
				149	- buf->o will have to be replaced with either chn->output or a function
				150	argument containing a copy of chn->output. These ones should cancel out
				151	at the end of the operation.
				152	- buf->i -> (b_len(buf) - chn->output)
				153
				154	Temporary difficulties :
				155	- compression makes use of both (i) and (o), by taking care of only touching
				156	(i) and never (o). The filters know how not to touch (o), and the internal
				157	compression API needs a small update so that this previous ->o value is
				158	passed as an argument that the filter will collect from the channel. If it
				159	is simpler (it probably isn't), a temporary "struct oldbuf" could be
				160	created to emulate the old behaviour and be fed/used by the filters code.
				161
				162	- buffer_slow_realign() distinguishes input data from output data so that the
				163	output data is always placed at the end, leaving a clean contigous buffer
				164	once forwarded. Instead, a "split" argument will have to be added so that
				165	the caller may decide where to split the contents. Muxes will pass zero
				166	here while channels will pass chn->output.
				167
				168
				169	6. Second implementation step
				170
				171	The second step will consist in making "struct buffer" only hold the descriptor
				172	and not the data anymore. It will then look like this :
				173
				174	struct buffer {
				175	void *area; // start of the storage area
				176	uint head; // start offset of remaining data relative to area
				177	uint len; // contents length after head
				178	uint size; // size of the storage area (wrapping point)
				179	};
				180
				181	Each buffer allocation will have to atomically allocate a struct buffer and an
				182	area. Buffer copies will consist in exchanging the "struct buffer" contents
				183	only.
				184
				185	The chunk API must then be updated so that some new versions of chunk_putblk(),
				186	chunk_printf() etc can write to a storage area, and so that bi_putchk() and
				187	bo_putchk() instead can swap the storage areas when possible.
				188
				189	At this point b_size() will be used to know where to release the allocated
				190	area. The storage will simply consist in (start,len) which is perfectly suited
				191	to have slabs. Just like chunks, b_size()==0 can be used to mention that no
				192	free() must be done on the area. Doing so will make it much simpler to send
				193	pre-formated messages (eg: error messages) into a buffer because such messages
				194	will then be stored into a "struct ist" and sending such a message will be as
				195	simple as doing :
				196
				197	b->area = ist.str;
				198	b->len = ist.len;
				199	b->head = 0;
				200	b->size = 0;
				201
				202	The chunk struct can then be removed and replaced everywhere with a struct
				203	buffer. Only the functions will remain, though they will likely have to be
				204	renamed. Maybe the buffer manipulation functions will have to be split between
				205	those which support wrapping and those which don't (chunks don't support
				206	wrapping).
				207
				208	The buf_empty structure will then disappear since a small 20-bytes structure
				209	will be enough to represent an empty buffer.
				210
				211
				212	7. Third implementation step
				213
				214	The third step will consist in placing a struct buffer into the struct channel.
				215	This way no allocation is needed at all, and any storage can be used to deliver
				216	contents. This allows to trivially upgrade a buffer on the fly by picking from
				217	a different slab. It also allows to deliver error messages without ever having
				218	to perform a buffer allocation. Doing so removes the need for the early buffer
				219	allocation for the response in process_stream(), as it is only needed to have a
				220	reliable place to send an error message to. This will ensure the buffer
				221	allocator can be simplified and made more robust against the risk of deadlock
				222	on memory shortage.
				223
				224
				225	8. Caveats
				226
				227	The following points require extra care :
				228	- there will be some subtracts to figure the buffer "input" length (formerly
				229	buf->i). In the past it always used to be an unsigned value. Extreme care
				230	will have to be taken to always use an inline function to compute this so
				231	that it doesn't accidently become signed.
				232
				233	- supporting buf->size==0 to point to a special string may require some extra
				234	checks to avoid causing an integer underflow when calculating (size-len) or
				235	(size-len-head) to figure the available room.
				236
				237	- it is very likely that some further changes will be tempting to do in the
				238	channel to better integrate the buffer (which becomes very similar to the
				239	pipe), but we must not go as far as removing the visibility of the "struct
				240	buffer" because it will be used as entry point for many functions.
				241
				242	- it is likely that a number of the chunk_, bi_ and bo_* variants have very
				243	minor variations like return codes or error checking that can make their
				244	replacement very tricky. Each set of such functions must be studied in
				245	advance, and their users as well.