2021-07-30 - File descriptor migration between threads

An FD migration may happen on any idle connection that experiences a takeover()
operation by another thread. In this case the acting thread becomes the owner
of the connection (and FD) while previous one(s) need to forget about it.

File descriptor migration between threads is a fairly complex operation because
it is required to maintain a durable consistency between the pollers' states
and haproxy's desired state. Indeed, very often the FD is registered within one
thread's poller and that thread might be waiting in the system, so there is no
way to synchronously update it. This is where thread_mask, polled_mask and
per-thread updates come into play (a simplified sketch of these fields follows
the list below):

  - a thread knows if it's allowed to manipulate an FD by looking at its bit
    in the FD's thread_mask;

  - each thread knows if it was polling an FD by looking at its bit in the
    polled_mask field; a recent migration is usually indicated by a bit being
    present in polled_mask and absent from thread_mask.

  - other threads know whether it's safe to take over an FD by looking at the
    running mask: if it contains any other thread's bit, then other threads
    are using it and it's not safe to take it over.

  - sleeping threads are notified about the need to update their polling via
    local or global updates to the FD. Each thread has its own local update
    list and its own bit in the update_mask to know whether there are pending
    updates for it. This allows polling to reconverge with the desired state
    at the last instant before polling.

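As a reference, here is a simplified sketch of the per-FD state these masks
live in. The real definitions (struct fdtab and struct polled_mask, in
include/haproxy/fd-t.h) carry more fields and split the polled mask per
direction; only the parts relevant to migration are shown:

    /* simplified per-FD state (sketch only, not the real layout) */
    struct fd_state {
        unsigned long thread_mask;  /* threads allowed to use the FD */
        unsigned long running_mask; /* threads currently using the FD */
        unsigned long polled_mask;  /* threads whose poller knows the FD */
        unsigned long update_mask;  /* threads with a pending update */
    };
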
While the description above could be seen as "progressive" (it technically is)
in that there is always a transition and convergence period in a migrated FD's
life, functionally speaking it's perfectly atomic thanks to the running bit and
to the per-thread idle connections lock: no takeover is permitted without
holding the idle_conns lock, and takeover may only happen by atomically picking
a connection from the list that is also protected by this lock. In practice, an
FD is never taken over on its own, but always in the context of a connection,
and by atomically removing a connection from an idle list, it is possible to
guarantee that this connection will not be picked, hence that its FD will not
be taken over.

Note that a connection always sits in the idle list of the thread that owns
it, so the list a connection is found in and its FD's thread_mask designate
the same thread.

The possible entry points into a race on a file descriptor are the following,
with their respective sequences (each sequence is followed by a condensed
pseudo-code sketch):

 1) takeover: requested by conn_backend_get() on behalf of connect_server()
    - take the idle_conns_lock, protecting against a parallel access from the
      I/O tasklet or timeout task
    - pick the first connection from the list
    - attempt an fd_takeover() on this connection's fd. Usually it works,
      unless a late wakeup of the owning thread shows up in the FD's running
      mask. The operation is performed in fd_takeover() using a DWCAS which
      tries to switch both running and thread_mask to the caller's tid_bit. A
      concurrent bit in running is enough to make it fail. This guarantees
      that another thread does not wake up from I/O in the middle of the
      takeover. In case of conflict, this FD is skipped and the attempt is
      retried with the next connection.
    - reset the task/tasklet contexts to NULL, as a signal that they are not
      allowed to run anymore. The tasks receive their execution context from
      the scheduler as an argument, but will check the context stored in the
      structure under the lock to detect this possible change, and abort.
    - at this point the takeover succeeded, the idle_conns_lock is released
      and the connection and its FD are now owned by the caller

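In condensed pseudo-code (same conventions as the blocks at the end of this
file; names and list handling are simplified, not the exact functions):

    takeover():                               /* on the taking thread */
        lock(idle_conns_lock);
        foreach conn in the target idle list:
            /* DWCAS over {running_mask, thread_mask}: any concurrent
             * running bit (late wakeup in progress) makes it fail;
             * old_thr is the thread_mask read before the attempt */
            if (!DWCAS(&{running_mask, thread_mask},
                       {0, old_thr}, {tid_bit, tid_bit}))
                continue;                     /* conflict: try next conn */
            atomic_and(&running_mask, ~tid_bit);
            conn->task->context = NULL;       /* old tasks must abort */
            conn->tasklet->context = NULL;
            remove conn from the idle list;
            break;
        unlock(idle_conns_lock);
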
 2) poll report: happens on late rx, shutdown or error on idle conns
    - fd_set_running() is called to atomically set the running_mask and check
      that the caller's tid_bit is still present in the thread_mask. Upon
      failure the caller arranges to stop reporting that FD (e.g. by immediate
      removal or by an asynchronous update). Upon success, it's guaranteed
      that any concurrent fd_takeover() will fail the DWCAS, and that another
      connection will need to be picked instead.
    - the FD's state is possibly updated
    - the iocb is called if needed (almost always)
    - if the iocb didn't kill the connection, the bit is released from
      running_mask, making the connection available again to a subsequent
      fd_takeover().

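In condensed pseudo-code (same conventions as above):

    poll_report(fd):                          /* on a possibly late owner */
        if (!fd_set_running(fd))              /* sets our running bit only if
                                                 our bit is in thread_mask */
            return stop_reporting(fd);        /* removal or async update */
        possibly update the FD's state;
        iocb(fd);                             /* may kill the connection */
        if (the connection survived)
            atomic_and(&running_mask, ~tid_bit); /* re-allows fd_takeover() */
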
 3) I/O tasklet, timeout task: timeout or subscribed wakeup
    - start by taking the idle_conns_lock, ensuring no takeover() will pick
      the same connection from this point on.
    - check the task/tasklet's context to verify that no recently completed
      takeover() stole the connection. If it's NULL, the connection was lost,
      the lock is released and the task/tasklet is killed. Otherwise it is
      guaranteed that no other thread may use that connection (current
      takeover candidates are waiting on the lock, previous owners waking
      from poll() lost their bit in the thread_mask and will not touch the
      FD).
    - the connection is removed from the idle conns list. From this point on,
      no other thread will find it there nor try an fd_takeover() on it.
    - the idle_conns_lock is now released, the connection is protected and
      its FD is not reachable by other threads anymore.
    - the task does what it has to do
    - if the connection is still usable (i.e. not upon timeout), it's
      inserted again into the idle conns list, meaning it may instantly be
      taken over by a competing thread.

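In condensed pseudo-code (same conventions as above):

    io_or_timeout(t):                         /* I/O tasklet or timeout task */
        lock(idle_conns_lock);
        conn = t->context;                    /* re-checked under the lock */
        if (!conn) {                          /* a takeover stole it */
            unlock(idle_conns_lock);
            kill the task/tasklet;
            return;
        }
        remove conn from the idle list;       /* unreachable by takeover now */
        unlock(idle_conns_lock);
        handle the I/O or timeout;
        if (conn is still usable)
            put conn back into the idle list; /* under the lock again */
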
 4) wake() callback: happens on last user after xfers (may free() the conn)
    - the connection is still owned by the caller and still subscribed to
      polling, but it is idle, thus inactive. Errors or shutdowns may be
      reported late, via sock_conn_iocb() and conn_notify_mux(), thus the
      running bit is set (i.e. a concurrent fd_takeover() will fail).
    - if the connection is in the list, the idle_conns_lock is grabbed, the
      connection is removed from the list, and the lock is released.
    - mux->wake() is called
    - if the connection was previously in the list, it's reinserted under
      the idle_conns_lock.

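In condensed pseudo-code (same conventions as above; the list membership
check and the wake() return convention are simplified assumptions):

    wake(conn):                               /* last user, after transfers */
        was_in_list = conn is in an idle list;
        if (was_in_list) {
            lock(idle_conns_lock);
            remove conn from the idle list;
            unlock(idle_conns_lock);
        }
        ret = mux->wake(conn);                /* may free the connection */
        if (ret >= 0 && was_in_list) {        /* assumes ret < 0 means the
                                                 connection was freed */
            lock(idle_conns_lock);
            put conn back into the idle list;
            unlock(idle_conns_lock);
        }
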
With the DWCAS removal between running_mask & thread_mask:

fd_takeover:
   1  if (!CAS(&running_mask, 0, tid_bit))
   2      return fail;
   3  atomic_store(&thread_mask, tid_bit);
   4  atomic_and(&running_mask, ~tid_bit);

poller:
   1  do {
   2      /* read consistent running_mask & thread_mask */
   3      do {
   4          run = atomic_load(&running_mask);
   5          thr = atomic_load(&thread_mask);
   6      } while (run & ~thr);
   7
   8      if (!(thr & tid_bit)) {
   9          /* takeover has started */
  10          goto disable_fd;
  11      }
  12  } while (!CAS(&running_mask, run, run | tid_bit));

fd_delete:
   1  atomic_or(&running_mask, tid_bit);
   2  atomic_store(&thread_mask, 0);
   3  atomic_and(&running_mask, ~tid_bit);

The loop in poller:3-6 is used to make sure the thread_mask we read is
consistent with the last updated running_mask, i.e. that no takeover is in
progress between the two loads. If nobody can give up on fd_takeover(), it
might even be possible to spin on thread_mask only. With this scheme, late
pollers will no longer set their running bit on an FD they do not own.
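
As an illustration, here is the race the consistency loop protects against,
with T1 polling and T2 taking over (numbers refer to the pseudo-code above).
Without poller:3-6, T1 could load running_mask right after T2's
fd_takeover:1 and thread_mask while it still contains T1's bit, pass the
ownership test at poller:8, then succeed the CAS at poller:12 while T2 sits
between fd_takeover:1 and fd_takeover:3: T1 would end up with its running
bit set on an FD whose thread_mask is about to become T2 only, i.e. keep
using an FD it no longer owns. With the loop, T1 observes run & ~thr != 0
during that window (T2's bit is in running_mask but not yet in thread_mask)
and spins until the takeover completes, after which the test at poller:8
sends it to disable_fd.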