Thread groups
#############

2021-07-13 - first draft
==========

Objective
---------
- support multi-socket systems with limited cache-line bouncing between
  physical CPUs and/or L3 caches

- overcome the 64-thread limitation

- Support a reasonable number of groups. I.e. if modern CPUs arrive with
  core complexes made of 8 cores, with 8 CC per chip and 2 chips in a
  system, it makes sense to support 16 groups.


Non-objective
-------------
- no need to optimize to the last possible cycle. I.e. some algos like
  leastconn will remain shared across all threads, servers will keep a
  single queue, etc. Global information remains global.

- no stubborn enforcement of FD sharing. Per-server idle connection lists
  can become per-group; listeners can (and should probably) be per-group.
  Other mechanisms (like SO_REUSEADDR) can already overcome this.

- no need to go beyond 64 threads per group.


Identified tasks
================

General
-------
Everywhere tid_bit is used we absolutely need to find a complement using
either the current group or a specific one. Thread debugging will need to
be extended as masks are extensively used.


Scheduler
---------
The global run queue and global wait queue must become per-group. This
means that a task may only be queued into one of them at a time. It
sounds like tasks may only belong to a given group, but doing so would
bring back the original issue that it's impossible to perform remote
wake-ups.

We could probably ignore the group if we don't need to set the thread mask
in the task. The task's thread_mask is never manipulated using atomics, so
it's safe to complement it with a group.
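
As a minimal illustration of the idea above, the task could carry its owning
group next to its thread mask, with neither field requiring atomic updates.
All names below are assumptions made for the sketch, not the final API:

    #include <stdint.h>

    /* minimal stand-in for a scheduler task: the group ID complements the
     * thread mask, and both are only written by the owner, so plain stores
     * are enough (no atomics needed).
     */
    struct task_sketch {
            uint16_t tgid;              /* owning thread group, 0 = unset */
            unsigned long thread_mask;  /* local mask within that group */
    };

    /* assign a task to local thread <ltid> of group <tgid> */
    static inline void task_set_affinity_sketch(struct task_sketch *t,
                                                uint16_t tgid, int ltid)
    {
            t->tgid = tgid;
            t->thread_mask = 1UL << ltid;
    }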

The sleeping_thread_mask should become per-group. A wakeup may then possibly
only be performed on the assigned group, meaning that either the task is not
assigned, in which case it will be self-assigned (like today), or the tg to
be woken up will be retrieved from the task itself.

Task creation currently takes a thread mask of either tid_bit, a specific
mask, or MAX_THREADS_MASK. How to create a task able to run anywhere
(checks, Lua, ...) ?

Profiling -> completed
---------
There should be one task_profiling_mask per thread group. Enabling or
disabling profiling should be made per group (possibly by iterating).
-> not needed anymore, one flag per thread in each thread's context.

Thread isolation
----------------
Thread isolation is difficult as we solely rely on atomic ops to figure out
who can complete. Such an operation is rare; maybe we could have a global
read_mostly flag containing a mask of the groups that require isolation.
Then the threads_want_rdv_mask etc. can become per-group. However, setting
and clearing the bits will become problematic as this will happen in two
steps, hence will require careful ordering.
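
A rough sketch of what the per-group rendez-vous masks plus a global
"groups needing isolation" flag could look like (every name here is an
assumption made for the illustration):

    #include <stdatomic.h>

    #define MAX_TGROUPS 16

    /* one per thread group */
    struct tg_rdv_sketch {
            _Atomic unsigned long threads_want_rdv;  /* local threads asking for isolation */
            _Atomic unsigned long threads_harmless;  /* local threads currently harmless */
    };

    static struct tg_rdv_sketch tg_rdv[MAX_TGROUPS];

    /* read-mostly global mask of groups having at least one isolation
     * request; set/cleared in two steps (group-local bit first, then the
     * global bit), hence the ordering concern mentioned above.
     */
    static _Atomic unsigned long grps_want_isolation;

    /* fast-path check: does any group currently require isolation ? */
    static inline int isolation_requested(void)
    {
            return atomic_load(&grps_want_isolation) != 0;
    }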

FD
--
tid_bit is used in a number of atomic ops on the running_mask. If we have
one fdtab[] per group, the mask implies that it's within the group.
Theoretically we should never face a situation where an FD is reported or
manipulated for a remote group.

There will still be one poller per thread, except that this time all
operations will be related to the current thread_group. No FD may appear
in two thread_groups at once, but we probably cannot prevent that (e.g.
delayed close and reopen). Should we instead have a single shared fdtab[]
(less memory usage also) ? Maybe adding the group in the fdtab entry would
work, but when does a thread know it can leave it ? Currently this is
solved by running_mask and by update_mask. Having two tables could help
with this (each table sees the FD in a different group with a different
mask) but this looks overkill.

There's polled_mask[], which needs to be decided upon. It should probably
be doubled as well. Note, polled_mask left fdtab[] for cacheline
alignment reasons in commit cb92f5cae4.

If we have one fdtab[] per group, what *really* prevents us from using the
same FD in multiple groups ? _fd_delete_orphan() and fd_update_events()
need to check for no-thread usage before closing the FD. This could be
a limiting factor. Enabling could require waking every poller.

Shouldn't we remerge fdinfo[] with fdtab[] (one pointer + one int/short,
used only during creation and close) ?

Another problem: if we have one fdtab[] per TG, disabling/enabling an FD
(e.g. pause/resume on a listener) can become a problem if it's not necessarily
on the current TG. We'll then need a way to figure that one out. It sounds like
FDs from listeners and receivers are very specific and suffer from problems
that all other ones under high load do not suffer from. Maybe something specific
ought to be done for them, if we can guarantee there is no risk of accidental
reuse (e.g. locate the TG info in the receiver and have an "MT" bit in the
FD's flags). The risk is always that a close() can result in an instant pop-up
of the same FD on any other thread of the same process.

Observations: right now fdtab[].thread_mask more or less corresponds to a
declaration of interest; it's very close to meaning "active per thread". It is
in fact located in the FD while it ought to do nothing there, as it should be
where the FD is used, since it rules accesses to a shared resource that is not
the FD but what uses it. Indeed, if neither polled_mask nor running_mask have
a thread's bit, the FD is unknown to that thread and the element using it may
only be reached from above and not from the FD. As such we ought to have a
thread_mask on a listener and another one on connections. These ones will
indicate who uses them. A takeover could then be simplified (atomically set
exclusivity on the FD's running_mask; upon success, take over the connection,
then clear the running mask). The change should probably be performed at
the connection level first, not the FD level, by the way. But running and
polled are the two relevant elements: one indicates userland knowledge,
the other one kernel knowledge. For listeners there's no exclusivity so it's
a bit different, but the rule remains the same: we don't have to know
what threads are *interested* in the FD, only its holder.

This is not exact in fact, see the FD notes below.

activity
--------
There should be one activity array per thread group. The dump should
simply scan them all since the cumulated values are not very important
anyway.

applets
-------
They use tid_bit only for the task. It looks like the appctx's thread_mask
is never used (now removed). Furthermore, it looks like the argument is
*always* tid_bit.

CPU binding
-----------
This is going to be tough. We will need to detect that threads overlap
and are not bound (i.e. all threads on the same mask). In this case, if the
number of threads is higher than the number of threads per physical socket,
one must try hard to evenly spread them among physical sockets (e.g. one
thread group per physical socket) and start as many threads as needed on
each, bound to all threads/cores of each socket. If there is a single socket,
the same job may be done based on L3 caches. Maybe it could always be done
based on L3 caches. The difficulty behind this is the number of sockets to be
bound: it is not possible to bind several FDs per listener. Maybe with a new
bind keyword we could imagine automatically duplicating listeners ? In any
case, the initially bound cpumap (via taskset) must always be respected, and
everything should probably start from there.

Frontend binding
----------------
We'll have to define a list of threads and thread-groups per frontend.
Having a group mask and the same thread-mask for each group would probably
suffice.

Threads should have two numbers:
  - the per-process number (e.g. 1..256)
  - the per-group number (1..64)

The "bind-thread" lines ought to use the following syntax:
  - bind 45        ## bind to process' thread 45
  - bind 1/45      ## bind to group 1's thread 45
  - bind all/45    ## bind to thread 45 in each group
  - bind 1/all     ## bind to all threads in group 1
  - bind all       ## bind to all threads
  - bind all/all   ## bind to all threads in all groups (=all)
  - bind 1/65      ## rejected
  - bind 65        ## OK if there are enough threads
  - bind 35-45     ## depends. Rejected if it crosses a group boundary.

The global directive "nbthread 28" means 28 total threads for the process. The
number of groups will sub-divide this. E.g. 4 groups will very likely imply 7
threads per group. At the beginning, the number of groups should be set
manually since it implies config adjustments to bind lines.

There should be a trivial way to map a global thread to a group and local ID
and to do the opposite.
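
For example, with a hypothetical per-group descriptor exposing the first
global thread ID of the group ("base") and the group's thread count, the
mapping stays trivial in both directions (the tg_base[]/tg_count[] arrays
below are assumptions made for the sketch):

    #define MAX_TGROUPS 16

    static int tg_base[MAX_TGROUPS];   /* first global thread ID of each group */
    static int tg_count[MAX_TGROUPS];  /* number of threads in each group */

    /* global thread ID (0-based) -> group index and local ID (both 0-based);
     * the config-visible numbers are simply these plus one.
     */
    static inline void thr_to_grp(int gthr, int *grp, int *ltid)
    {
            int g = 0;

            while (g < MAX_TGROUPS - 1 && gthr >= tg_base[g] + tg_count[g])
                    g++;
            *grp  = g;
            *ltid = gthr - tg_base[g];
    }

    /* group index and local ID -> global thread ID */
    static inline int grp_to_thr(int grp, int ltid)
    {
            return tg_base[grp] + ltid;
    }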


Panic handler + watchdog
------------------------
Will probably depend on what's done for thread_isolate.

Per-thread arrays inside structures
-----------------------------------
- listeners have a thr_conn[] array, currently limited to MAX_THREADS. Should
  we simply bump the limit ?
- same for servers with idle connections.
  => doesn't seem very practical.
- another solution might be to point to dynamically allocated arrays of
  arrays (e.g. nbthread * nbgroup) or a first level per group and a second
  per thread.
  => dynamic allocation based on the global number (see the sketch below)
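
As a sketch of that last option, allocating based on the global thread count
rather than a fixed MAX_THREADS bound (structure and names are purely
illustrative):

    #include <stdlib.h>

    struct per_thr_sketch {
            unsigned int nb_conn;     /* e.g. a per-thread connection count */
    };

    /* one slot per global thread instead of a static MAX_THREADS array;
     * <total_threads> would be the process-wide thread count known after
     * config parsing.
     */
    static struct per_thr_sketch *alloc_per_thread(unsigned int total_threads)
    {
            return calloc(total_threads, sizeof(struct per_thr_sketch));
    }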

Other
-----
- what about dynamic thread start/stop (e.g. for containers/VMs) ?
  E.g. if we decide to start $MANY threads in 4 groups, and only use
  one, in the end it will not be possible to use less than one thread
  per group, and at most 64 will be present in each group.


FD Notes
--------
  - updt_fd_polling() uses thread_mask to figure where to send the update,
    the local list or a shared list, and which bits to set in update_mask.
    This could be changed so that it takes the update mask as an argument.
    The call from the poller's fork would just have to broadcast everywhere.

  - pollers use it to figure whether they're concerned or not by the activity
    update. This looks important as otherwise we could re-enable polling on
    an FD that changed to another thread.

  - thread_mask being a per-thread active mask looks more exact and is
    precisely used this way by _update_fd(). In this case using it instead
    of running_mask to gauge a change or temporarily lock it during a
    removal could make sense.

  - running should be conditioned by thread. Polled not (since deferred
    or migrated). In this case testing thread_mask can be enough most of
    the time, but this requires synchronization that will have to be
    extended to tgid. But migration seems a different beast that we shouldn't
    care about here: if first performed at the higher level it ought to
    be safe.

In practice the update_mask can be dropped to zero by the first fd_delete()
as the only authority allowed to fd_delete() is *the* owner, and as soon as
all running_mask bits are gone, the FD will be closed, hence removed from all
pollers. This will be the only way to make sure that update_mask always
refers to the current tgid.

However, it may happen that a takeover within the same group causes a thread
to read the update_mask late, while the FD is being wiped by another thread.
That other thread may close it, causing another thread in another group to
catch it, change the tgid and start to update the update_mask. This means
that it would be possible for a thread entering do_poll() to see the correct
tgid, then the FD would be closed, reopened and reassigned to another tgid,
and the thread would see its bit in the update_mask and be confused. Right
now this should already happen when the update_mask is not cleared, except
that upon wakeup a migration would be detected and that would be all.

Thus we might need to set the running bit to prevent the FD from migrating
before reading update_mask, which also implies closing on fd_clr_running() == 0 :-(

Also even fd_update_events() leaves a risk of updating update_mask after
clearing running, thus affecting the wrong one. update_mask should probably
be updated before clearing running_mask there. Also, how about not
creating an update on a close ? Not trivial if done before running, unless
thread_mask==0.
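
A minimal sketch of the ordering suggested above (illustrative atomics only,
not the real fd_update_events() code):

    #include <stdatomic.h>

    struct fd_sketch {
            _Atomic unsigned long update_mask;
            _Atomic unsigned long running_mask;
    };

    /* end of processing for thread bit <tbit>: publish the update_mask
     * change before dropping the running bit, so that a concurrent taker
     * never inherits a stale update bit for an FD it just acquired.
     */
    static void fd_done_sketch(struct fd_sketch *fd, unsigned long tbit)
    {
            atomic_fetch_and(&fd->update_mask, ~tbit);
            atomic_fetch_and(&fd->running_mask, ~tbit); /* must come last */
    }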

Note that one situation that is currently visible is that a thread closes a
file descriptor that it's the last one to own and to have an update for. In
fd_delete_orphan() it does call poller.clo() but this one is not sufficient
as it doesn't drop the update_mask nor does it clear the polled_mask. The
typical problem that arises is that the close() happens before processing
the last update (e.g. a close() just after a partial read), thus it still
has *at least* one bit set for the current thread in both update_mask and
polled_mask, and it is present in the update_list. Not handling it would
mean that the event is lost on update() from the concerned threads and
that some resource might leak. Handling it means zeroing the update_mask
and polled_mask, and deleting the update entry from the update_list, thus
losing the update event. And as indicated above, if the FD switches twice
between 2 groups, the thread finally called does not necessarily know that
the FD isn't the same anymore, thus it's difficult to decide whether to
delete it or not, because deleting the event might in fact mean deleting
something that was just re-added for the same thread with the same FD but
a different usage.

Also it really seems unrealistic to scan a single shared update_list like
this using write operations. There should likely be one per thread-group.
But in this case there is no other choice than deleting the update event
upon fd_delete_orphan(). This also means that poller->clo() must do the
job for all of the group's threads at once. This would mean a synchronous
removal before the close(), which doesn't seem ridiculously expensive. It
just requires that any thread of a group may manipulate any other thread's
status for an FD and a poller.

Note about our currently supported pollers:

  - epoll: our current code base relies on the modern version which
    automatically removes closed FDs, so we don't have anything to do
    when closing and we don't need the update.

  - kqueue: according to https://www.freebsd.org/cgi/man.cgi?query=kqueue, just
    like epoll, a close() implies a removal. Our poller doesn't perform
    any bookkeeping either so it's OK to directly close.

  - evports: https://docs.oracle.com/cd/E86824_01/html/E54766/port-dissociate-3c.html
    says the same, i.e. close() implies a removal of all events. No local
    processing nor bookkeeping either, we can close.

  - poll: the fd_evts[] array is global, thus shared by all threads. As such,
    a single removal is needed to flush it for all threads at once. The
    operation is already performed like this.

  - select: works exactly like poll() above, hence already handled.

As a preliminary conclusion, it's safe to delete the event and reset
update_mask just after calling poller->clo() (a small sketch follows the
list below). If extremely unlucky (changing thread mask due to takeover ?),
the same FD may appear at the same time:

  - in one or several thread-local fd_updt[] arrays. These ones are just work
    queues, there's nothing to do to ignore them, just leave the holes with an
    outdated FD which will be ignored once met. As a bonus, poller->clo() could
    check if the last fd_updt[] points to this specific FD and decide to kill
    it.

  - in the global update_list. In this case, fd_rm_from_fd_list() already
    performs an attachment check, so it's safe to always call it before closing
    (since no one else may be in the process of changing anything).
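
An illustrative sequence of the close path following the conclusion above
(all names below are stand-ins used for the sketch, not the actual API):

    struct fd_sketch {
            unsigned long update_mask;
            unsigned long polled_mask;
    };

    void poller_clo(int fd);        /* stand-in for poller->clo()        */
    void updt_list_remove(int fd);  /* stand-in for fd_rm_from_fd_list() */

    static void close_and_flush(struct fd_sketch *fdtab, int fd)
    {
            poller_clo(fd);          /* kernel-side cleanup; a no-op for
                                      * epoll/kqueue/evports as noted above */
            updt_list_remove(fd);    /* safe: attachment is checked first */
            fdtab[fd].update_mask = 0;
            fdtab[fd].polled_mask = 0;
    }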


###########################################################

Current state:


  Mux / takeover / fd_delete() code                 |||     poller code
----------------------------------------------------|||---------------------------------------------------
                                                    \|/
mux_takeover():                                      | fd_set_running():
    if (fd_takeover()<0)                             |     old = {running, thread};
        return fail;                                 |     new = {tid_bit, tid_bit};
    ...                                              |
fd_takeover():                                       |     do {
    atomic_or(running, tid_bit);                     |         if (!(old.thread & tid_bit))
    old = {running, thread};                         |             return -1;
    new = {tid_bit, tid_bit};                        |         new = { running | tid_bit, old.thread }
    if (owner != expected) {                         |     } while (!dwcas({running, thread}, &old, &new));
        atomic_and(running, ~tid_bit);               |
        return -1; // fail                           | fd_clr_running():
    }                                                |     return atomic_and_fetch(running, ~tid_bit);
                                                     |
    while (old == {tid_bit, !=0 })                   | poll():
        if (dwcas({running, thread}, &old, &new)) {  |     if (!owner)
            atomic_and(running, ~tid_bit);           |         continue;
            return 0; // success                     |
        }                                            |     if (!(thread_mask & tid_bit)) {
    }                                                |         epoll_ctl_del();
                                                     |         continue;
    atomic_and(running, ~tid_bit);                   |     }
    return -1; // fail                               |
                                                     |     // via fd_update_events()
fd_delete():                                         |     if (fd_set_running() != -1) {
    atomic_or(running, tid_bit);                     |         iocb();
    atomic_store(thread, 0);                         |         if (fd_clr_running() == 0 && !thread_mask)
    if (fd_clr_running(fd) == 0)                     |             fd_delete_orphan();
        fd_delete_orphan();                          |     }


The idle_conns_lock prevents the connection from being *picked* and released
while someone else is reading it. What it does is guarantee that on idle
connections, the caller of the IOCB will not dereference the task's context
while the connection is still in the idle list, since it might be picked then
freed at the same instant by another thread. As soon as the IOCB manages to
get that lock, it removes the connection from the list so that it cannot be
taken over anymore. Conversely, the mux's takeover() code runs under that
lock so that if it frees the connection and task, this will appear atomic
to the IOCB. The timeout task (which is another entry point for connection
deletion) does the same. Thus, when coming from the low level (I/O or timeout):
  - the task always exists, but its ctx, checked under the lock, validates it;
    removing the conn from the list prevents takeover().
  - t->context is stable, except during changes under the takeover lock. So
    h2_timeout_task may well run on a different thread than h2_io_cb().

Coming from the top:
  - takeover() done under lock() clears the task's ctx and possibly closes the
    FD (unless some running bit remains present).

Unlikely but currently possible situations:
  - multiple pollers (up to N) may have an idle connection's FD being
    polled, if the connection was passed from thread to thread. The first
    event on the connection would wake all of them. Most of them would
    see fdtab[].owner set (the late ones might miss it). All but one would
    see that their bit is missing from fdtab[].thread_mask and give up.
    However, just after this test, others might take over the connection,
    so in practice if terribly unlucky, all but one could see their bit in
    thread_mask just before it gets removed, all of them set their bit
    in running_mask, and all of them call iocb() (sock_conn_iocb()).
    Thus all of them dereference the connection and touch the subscriber
    with no protection, then end up in conn_notify_mux() that will call
    the mux's wake().

  - multiple pollers (up to N-1) might still be in fd_update_events()
    manipulating fdtab[].state. The cause is that the "locked" variable
    is determined by atleast2(thread_mask) but that thread_mask is read
    at a random instant (i.e. it may be stolen by another one during a
    takeover) since we don't yet hold running to prevent this from being
    done. Thus we can arrive here with thread_mask==something_else (1 bit),
    locked==0 and fdtab[].state assigned non-atomically.

  - it looks like nothing prevents h2_release() from being called on a
    thread (e.g. from the top or task timeout) while sock_conn_iocb()
    dereferences the connection on another thread. Those killing the
    connection don't yet consider the fact that it's an FD that others
    might currently be waking up on.

###################

Problem with the counter:

The users count doesn't say who's using the FD, and two users can do the same
close in turn. The thread_mask should define who's responsible for closing
the FD, and all those with a bit in it ought to do it.


2021-08-25 - update with minimal locking on tgid value
==========

  - tgid + refcount at once using CAS
  - idle_conns lock during updates
  - update:
      if tgid differs => close happened, thus drop update
      otherwise normal stuff. Lock tgid until running if needed.
  - poll report:
      if tgid differs => closed
      if thread differs => stop polling (migrated)
      keep tgid lock until running
  - test on thread_id (see the C sketch below):
      if (xadd(&tgid,65536) != my_tgid) {
         // was closed
         sub(&tgid, 65536)
         return -1
      }
      if !(thread_id & tidbit) => migrated/closed
      set_running()
      sub(tgid,65536)
  - note: either fd_insert() or the final close() ought to set
      polled and update to 0.
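
The "xadd by 65536" trick above packs the group ID and a reference count into
a single 32-bit word (low 16 bits: tgid, high 16 bits: refcount). A purely
illustrative, self-contained rendering of that pseudo-code could be:

    #include <stdatomic.h>
    #include <stdint.h>

    /* try to take a reference on the FD's group; returns 1 and holds a
     * reference if the FD still belongs to <my_tgid>, otherwise releases
     * the tentative reference and returns 0 (closed or reassigned).
     */
    static inline int fd_grab_tgid_sketch(_Atomic uint32_t *tgid_refc, uint32_t my_tgid)
    {
            uint32_t old = atomic_fetch_add(tgid_refc, 0x10000);

            if ((old & 0xffff) != my_tgid) {
                    atomic_fetch_sub(tgid_refc, 0x10000);
                    return 0;
            }
            return 1;
    }

    static inline void fd_drop_tgid_sketch(_Atomic uint32_t *tgid_refc)
    {
            atomic_fetch_sub(tgid_refc, 0x10000);
    }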

2021-09-13 - tid / tgroups etc.
==========

* tid currently is the thread's global ID. It's essentially used as an index
  for arrays. It must be clearly stated that it works this way.

* tasklets use the global thread ID, and __tasklet_wakeup_on() must use a
  global ID as well. It's essential that tinfo[] provides instant access to
  local/global bits/indexes/arrays.

- tid_bit makes no sense process-wide, so it must be redefined to represent
  the thread's tid within its group. The name is not very welcome though, but
  there are 286 occurrences of it that are not going to be changed that fast.
  => now we have ltid and ltid_bit in thread_info. The thread-local tid_bit is
     still not changed though. If renamed we must make sure the older one
     vanishes. Why not rename "ptid, ptid_bit" for the process-wide tid and
     "gtid, gtid_bit" for the group-wide ones ? This removes the ambiguity on
     "tid", which is half the time not the one we expect.

* just like "ti" is the thread_info, we need to have "tg" pointing to the
  thread_group.

- other less commonly used elements should be retrieved from ti->xxx. E.g.
  the thread's local ID.

- lock debugging must reproduce the tgid.

* task profiling must be made per-group (annoying), unless we want to add a
  per-thread TH_FL_* flag and have the rare places where the bit is changed
  iterate over all threads if needed. Sounds preferable overall.

* an offset might be placed in the tgroup so that even with 64 threads max
  we could have completely separate tid_bits over several groups.
  => base and count now

2021-09-15 - bind + listen() + rx
==========

- thread_mask (in bind_conf->rx_settings) should become an array of
  MAX_TGROUP longs.
- when parsing "thread 123" or "thread 2/37", the proper bit is set,
  assuming the array is either a contiguous bitfield or a tgroup array.
  An option RX_O_THR_PER_GRP or RX_O_THR_PER_PROC is set depending on
  how the thread num was parsed, so that we reject mixes.
- end of parsing: entries translated to the cleanest form (to be determined).
- binding: for each socket()/bind()/listen()... just perform one extra dup()
  for each tgroup and store the multiple FDs into an FD array indexed on
  MAX_TGROUP. => allows using one FD per tgroup for the same socket, hence
  having multiple entries in all tgroup pollers without requiring the user
  to duplicate the bind line (see the sketch below).
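
A sketch of that binding step, duplicating the listening FD once per thread
group (the structure and function names are invented for the illustration):

    #include <unistd.h>

    #define MAX_TGROUPS 16

    struct rx_sketch {
            int fd[MAX_TGROUPS];   /* one FD per thread group for this socket */
    };

    /* after socket()/bind()/listen() returned <lfd>, hand one dup() to each
     * extra group so that every group's pollers get their own entry for the
     * same socket.
     */
    static int rx_dup_per_group_sketch(struct rx_sketch *rx, int lfd, int nbtgroups)
    {
            int grp;

            rx->fd[0] = lfd;
            for (grp = 1; grp < nbtgroups; grp++) {
                    rx->fd[grp] = dup(lfd);
                    if (rx->fd[grp] < 0)
                            return -1;
            }
            return 0;
    }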

2021-09-15 - global thread masks
==========

Some global variables currently expect to know about thread IDs and it's
uncertain what must be done with them:
- global_tasks_mask  /* Mask of threads with tasks in the global runqueue */
  => touched under the rq lock. Change it per-group ? What exact use is made ?

- sleeping_thread_mask  /* Threads that are about to sleep in poll() */
  => seems that it can be made per group

- all_threads_mask: a bit complicated, derived from nbthread and used with
  masks and with my_ffsl() to wake threads up. Should probably be per-group
  but we might miss something for global.

- stopping_thread_mask: used in combination with all_threads_mask, should
  move per-group.

- threads_harmless_mask: indicates all threads that are currently harmless in
  that they promise not to access a shared resource. Must be made per-group
  but then we'll likely need a second stage to have the harmless groups mask.
  threads_idle_mask, threads_sync_mask, threads_want_rdv_mask go with the one
  above. Maybe the right approach will be to request harmless on a group mask
  so that we can detect collisions and arbitrate them like today, but on top
  of this it becomes possible to request harmless only on the local group if
  desired. The subtlety is that requesting harmless at the group level does
  not mean it's achieved since the requester cannot vouch for the other ones
  in the same group.

In addition, some variables are related to the global run queue:
    __decl_aligned_spinlock(rq_lock);        /* spin lock related to run queue */
    struct eb_root rqueue;                   /* tree constituting the global run queue, accessed under rq_lock */
    unsigned int grq_total;                  /* total number of entries in the global run queue, atomic */
    static unsigned int global_rqueue_ticks; /* insertion count in the grq, use rq_lock */

And others to the global wait queue:
    struct eb_root timers;                   /* sorted timers tree, global, accessed under wq_lock */
    __decl_aligned_rwlock(wq_lock);          /* RW lock related to the wait queue */


2022-06-14 - progress on task affinity
==========

The particularity of the current global run queue is to be usable for remote
wakeups because it's protected by a lock. There is no need for a global run
queue beyond this, and there could already be a locked queue per thread for
remote wakeups, with a random selection at wakeup time. It's just that picking
a pending task in a run queue among a number of them is convenient (though it
introduces some excessive locking). A task will either be tied to a single
group or will be allowed to run on any group. As such it's pretty clear that
we don't need a global run queue. When a run-anywhere task expires, either it
runs on the current group's run queue with any thread, or a target thread is
selected during the wakeup and it's directly assigned.
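
As an illustration of the "locked queue per thread for remote wakeups" idea,
here is a standalone sketch (not the scheduler's real structures; the target
thread would still have to be notified afterwards by the existing wake-up
path, and the lock is assumed to be initialized with pthread_mutex_init()):

    #include <pthread.h>

    struct remote_task_sketch {
            struct remote_task_sketch *next;
    };

    /* one small locked queue per thread, used only for cross-thread wakeups;
     * the thread's local, lock-free run queue remains untouched.
     */
    struct thr_shared_rq_sketch {
            pthread_mutex_t lock;
            struct remote_task_sketch *head;
    };

    static void remote_wakeup_sketch(struct thr_shared_rq_sketch *rq,
                                     struct remote_task_sketch *t)
    {
            pthread_mutex_lock(&rq->lock);
            t->next  = rq->head;
            rq->head = t;
            pthread_mutex_unlock(&rq->lock);
    }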

A global wait queue seems important for scheduled repetitive tasks however. But
maybe it's more a task for a cron-like job and there's no need for the task
itself to wake up anywhere, because once the task wakes up, it must be tied to
one (or a set of) thread(s). One difficulty if the task is temporarily assigned
a thread group is that it's impossible to know where it's running when trying
to perform a second wakeup or when trying to kill it. Maybe we'll need to have
two tgids for a task (desired, effective). Or maybe we can restrict the ability
of such a task to stay in the wait queue in case of wakeup, though that sounds
difficult. Other approaches would be to set the GID to the current one when
waking up the task, and to have a flag (or a sign on the GID) indicating that
the task is still queued in the global timers queue. We already have
TASK_SHARED_WQ so it seems that another similar flag such as TASK_WAKE_ANYWHERE
could make sense. But when is TASK_SHARED_WQ really used, except for the
"anywhere" case ? All calls to task_new() use either 1<<thr, tid_bit,
all_threads_mask, or come from appctx_new which does exactly the same. The only
real user of a non-global, non-unique task_new() call is
debug_parse_cli_sched(), which purposely allows using an arbitrary mask.

  +----------------------------------------------------------------------------+
  | => we don't need one WQ per group, only a global and N local ones, hence    |
  |    the TASK_SHARED_WQ flag can continue to be used for this purpose.        |
  +----------------------------------------------------------------------------+

Having TASK_SHARED_WQ should indicate that a task will always be queued to the
shared queue and will always have a temporary gid and thread mask in the run
queue.

Going further, as we don't have any single case of a task bound to a small set
of threads, we could decide to wake up only expired tasks for ourselves by
looking them up using eb32sc and adopting them. Thus, there's no more need for
a shared runqueue nor a global_runqueue_ticks counter, and we can simply have
the ability to wake up a remote task. The task's thread_mask will then change
so that it's only a thread ID, except when the task has TASK_SHARED_WQ, in
which case it corresponds to the running thread. That's very close to what is
already done with tasklets in fact.


2021-09-29 - group designation and masks
==========

Neither FDs nor tasks will belong to incomplete subsets of threads spanning
over multiple thread groups. In addition there may be a difference between
configuration and operation (for FDs). This allows us to fix the following
rules:

  group  mask   description
   0      0     bind_conf: groups & thread not set. bind to any/all
                task: it would be nice to mean "run on the same as the caller".

   0     xxx    bind_conf: thread set but not group: thread IDs are global
                FD/task: group 0, mask xxx

  G>0     0     bind_conf: only group is set: bind to all threads of group G
                FD/task: mask 0 not permitted (= not owned). May be used to
                mention "any thread of this group", though already covered by
                G/xxx like today.

  G>0    xxx    bind_conf: bind to these threads of this group
                FD/task: group G, mask xxx

It looks like keeping groups starting at zero internally complicates everything
though. But forcing it to start at 1 might also require that we rescan all tasks
to replace 0 with 1 upon startup. This would also allow group 0 to be special and
be used as the default group for any new thread creation, so that group0.count
would keep the number of unassigned threads. Let's try:

  group  mask   description
   0      0     bind_conf: groups & thread not set. bind to any/all
                task: "run on the same group & thread as the caller".

   0     xxx    bind_conf: thread set but not group: thread IDs are global
                FD/task: invalid. Or maybe for a task we could use this to
                mean "run on current group, thread XXX", which would cover
                the need for health checks (g/t 0/0 while sleeping, 0/xxx
                while running) and have wake_expired_tasks() detect 0/0 and
                wake them up on a random group.

  G>0     0     bind_conf: only group is set: bind to all threads of group G
                FD/task: mask 0 not permitted (= not owned). May be used to
                mention "any thread of this group", though already covered by
                G/xxx like today.

  G>0    xxx    bind_conf: bind to these threads of this group
                FD/task: group G, mask xxx

With a single group declared in the config, group 0 would implicitly find the
first one.


The problem with the approach above is that a task queued in one group+thread's
wait queue could very well receive a signal from another thread and/or group,
and that there is no indication about where the task is queued, nor how to
dequeue it. Thus it seems that it's up to the application itself to unbind/
rebind a task. This contradicts the principle of leaving a task waiting in a
wait queue and waking it anywhere.

Another possibility might be to decide that a task having a defined group but
a mask of zero is shared and will always be queued into its group's wait queue.
However, upon expiry, the scheduler would notice the thread-mask 0 and would
broadcast it to any group.

Right now in the code we have:
  - 18 calls of task_new(tid_bit)
  - 17 calls of task_new_anywhere()
  - 2 calls with a single bit

Thus it looks like "task_new_anywhere()", "task_new_on()" and
"task_new_here()" would be sufficient.
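
These three helpers could be thin wrappers around a single internal
constructor taking an explicit group and mask. A sketch under that assumption
(the underlying allocator and its signature are invented for the illustration;
the exact (tgid, mask) encoding of the "anywhere" case is left open, as
discussed above):

    struct task_sk;

    /* assumed internal allocator taking an explicit group and local mask */
    struct task_sk *__task_new_sketch(int tgid, unsigned long mask);

    /* run on one specific thread of one specific group */
    static inline struct task_sk *task_new_on_sketch(int tgid, int ltid)
    {
            return __task_new_sketch(tgid, 1UL << ltid);
    }

    /* run on the calling thread only */
    static inline struct task_sk *task_new_here_sketch(int cur_tgid,
                                                       unsigned long cur_tid_bit)
    {
            return __task_new_sketch(cur_tgid, cur_tid_bit);
    }

    /* run anywhere; encoding to be decided (e.g. group 0 / mask 0) */
    static inline struct task_sk *task_new_anywhere_sketch(void)
    {
            return __task_new_sketch(0, 0);
    }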