Thread groups
#############

2021-07-13 - first draft
==========

Objective
---------
- support multi-socket systems with limited cache-line bouncing between
  physical CPUs and/or L3 caches

- overcome the 64-thread limitation

- support a reasonable number of groups. For example, if modern CPUs arrive
  with core complexes made of 8 cores, with 8 CCs per chip and 2 chips in a
  system, it makes sense to support 16 groups.


Non-objective
-------------
- no need to optimize down to the last possible cycle: some algorithms like
  leastconn will remain shared across all threads, servers will keep a
  single queue, etc. Global information remains global.

- no stubborn enforcement of FD sharing. Per-server idle connection lists
  can become per-group; listeners can (and probably should) be per-group.
  Other mechanisms (like SO_REUSEPORT) can already overcome this.

- no need to go beyond 64 threads per group.


Identified tasks
================

General
-------
Everywhere tid_bit is used we absolutely need to find a complement using
either the current group or a specific one. Thread debugging will need to
be extended as masks are extensively used.


Scheduler
---------
The global run queue and global wait queue must become per-group. This
means that a task may only be queued into one of them at a time. It
sounds like tasks may only belong to a given group, but doing so would
bring back the original issue that it's impossible to perform remote
wakeups.

We could probably ignore the group if we don't need to set the thread mask
in the task. The task's thread_mask is never manipulated using atomics, so
it's safe to complement it with a group.

The sleeping_thread_mask should become per-group. A wakeup would then only
be performed on the assigned group: either the task is not yet assigned, in
which case it self-assigns (like today), or the thread group to be woken up
is retrieved from the task itself.

Task creation currently takes a thread mask of either tid_bit, a specific
mask, or MAX_THREADS_MASK. How do we create a task able to run anywhere
(checks, Lua, ...) ?
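As a rough illustration of the "complement with a group" idea above, the task
could carry a small group ID next to its (now group-local) thread mask, both
updated together by their owner without atomics. Names and layout here are
assumptions, not a final design:

    struct task {
        /* ... existing fields ... */
        unsigned short tgid;        /* thread group the task is bound to, 0 = any */
        unsigned long  thread_mask; /* mask of allowed threads within <tgid> */
    };

    static inline void task_set_affinity2(struct task *t,
                                          unsigned short tgid,
                                          unsigned long mask)
    {
        /* only the task's owner updates these, so no atomics are needed */
        t->tgid        = tgid;
        t->thread_mask = mask;
    }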

Profiling
---------
There should be one task_profiling_mask per thread group. Enabling or
disabling profiling should be done per group (possibly by iterating over
the groups).

Thread isolation
----------------
Thread isolation is difficult as we rely solely on atomic ops to figure out
who may complete. Such an operation is rare; maybe we could have a global
read_mostly flag containing a mask of the groups that require isolation.
Then the threads_want_rdv_mask etc. can become per-group. However, setting
and clearing the bits will become problematic as this will happen in two
steps, hence will require careful ordering.
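A hedged sketch of that direction, assuming the usual HA_ATOMIC_* helpers;
the per-group array and global mask names are assumptions. Publishing the
per-group bit before the group's bit in the global mask ensures that a
reader seeing the global bit always finds the per-group request:

    /* per-group rendezvous requests, plus a global read_mostly mask with
     * one bit per group having at least one isolation request in progress.
     */
    extern volatile unsigned long tgrp_threads_want_rdv[MAX_TGROUP];
    extern volatile unsigned long tgroups_want_rdv;

    static inline void thread_isolate_request(int tgid, unsigned long ltid_bit)
    {
        /* two steps: first the request within the group, then the group flag */
        HA_ATOMIC_OR(&tgrp_threads_want_rdv[tgid - 1], ltid_bit);
        HA_ATOMIC_OR(&tgroups_want_rdv, 1UL << (tgid - 1));
    }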

FD
--
tid_bit is used in a number of atomic ops on the running_mask. If we have
one fdtab[] per group, the mask implies that it's within the group.
Theoretically we should never face a situation where an FD is reported or
manipulated for a remote group.

There will still be one poller per thread, except that this time all
operations will be related to the current thread_group. No FD may appear
in two thread_groups at once, but we probably cannot prevent that (e.g.
delayed close and reopen). Should we instead have a single shared fdtab[]
(less memory usage also) ? Maybe adding the group in the fdtab entry would
work, but when does a thread know it can leave it ? Currently this is
solved by running_mask and by update_mask. Having two tables could help
with this (each table sees the FD in a different group with a different
mask) but this looks overkill.

There's polled_mask[] which needs to be decided upon. Probably it should
be doubled as well. Note, polled_mask left fdtab[] for cacheline alignment
reasons in commit cb92f5cae4.

If we have one fdtab[] per group, what *really* prevents us from using the
same FD in multiple groups ? _fd_delete_orphan() and fd_update_events()
need to check for no-thread usage before closing the FD. This could be
a limiting factor. Enabling could require waking every poller.

Shouldn't we remerge fdinfo[] with fdtab[] (one pointer + one int/short,
used only during creation and close) ?

Another problem: if we have one fdtab[] per TG, disabling/enabling an FD
(e.g. pause/resume on a listener) becomes difficult if the FD is not
necessarily on the current TG. We'll then need a way to figure that out.
It sounds like FDs from listeners and receivers are very specific and
suffer from problems that all the other ones under high load do not.
Maybe something specific ought to be done for them, if we can guarantee
there is no risk of accidental reuse (e.g. locate the TG info in the
receiver and have a "MT" bit in the FD's flags). The risk is always that
a close() can result in an instant pop-up of the same FD on any other
thread of the same process.

Observations: right now fdtab[].thread_mask more or less corresponds to a
declaration of interest; it's very close to meaning "active per thread". It
is in fact located in the FD while it ought to do nothing there: it should
live where the FD is used, since it rules accesses to a shared resource that
is not the FD itself but what uses it. Indeed, if neither polled_mask nor
running_mask has a thread's bit, the FD is unknown to that thread and the
element using it may only be reached from above, not from the FD. As such we
ought to have a thread_mask on a listener and another one on connections;
these would indicate who uses them. A takeover could then be simplified
(atomically set exclusivity on the FD's running_mask; upon success, take
over the connection, then clear the running mask). The change probably ought
to be performed at the connection level first, not the FD level, by the way.
But running and polled are the two relevant elements: one indicates userland
knowledge, the other one kernel knowledge. For listeners there's no
exclusivity so it's a bit different, but the rule remains the same: we don't
have to know which threads are *interested* in the FD, only its holder.

Not exact in fact; see the FD notes below.
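Setting aside the caveats from the FD notes below, a hedged sketch of that
simplified takeover sequence, assuming the usual HA_ATOMIC_* helpers and the
existing fdtab[].running_mask field (the helper name is illustrative):

    /* try to take exclusive ownership of <fd>: returns 0 on success,
     * -1 if another thread is currently running on it.
     */
    static inline int fd_takeover_excl(int fd, unsigned long tid_bit)
    {
        unsigned long expected = 0;

        if (!HA_ATOMIC_CAS(&fdtab[fd].running_mask, &expected, tid_bit))
            return -1;      /* someone else is running on this FD */

        /* ... take over the connection attached to this FD here ... */

        HA_ATOMIC_AND(&fdtab[fd].running_mask, ~tid_bit);
        return 0;
    }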

activity
--------
There should be one activity array per thread group. The dump should
simply scan them all since the cumulated values are not very important
anyway.

applets
-------
They use tid_bit only for the task. It looks like the appctx's thread_mask
is never used (now removed). Furthermore, it looks like the argument is
*always* tid_bit.

CPU binding
-----------
This is going to be tough. We will need to detect that threads overlap
and are not bound (i.e. all threads on the same mask). In this case, if the
number of threads is higher than the number of threads per physical socket,
one must try hard to evenly spread them among physical sockets (e.g. one
thread group per physical socket) and start as many threads as needed on
each, bound to all threads/cores of each socket. If there is a single
socket, the same job may be done based on L3 caches. Maybe it could always
be done based on L3 caches. The difficulty behind this is the number of
sockets to be bound: it is not possible to bind several FDs per listener.
Maybe with a new bind keyword we could automatically duplicate listeners ?
In any case, the initially bound cpumap (via taskset) must always be
respected, and everything should probably start from there.

Frontend binding
----------------
We'll have to define a list of threads and thread-groups per frontend.
Probably having a group mask and the same thread-mask for each group would
suffice.

Threads should have two numbers:
  - the per-process number (e.g. 1..256)
  - the per-group number (1..64)

The "bind-thread" lines ought to use the following syntax:
  - bind 45           ## bind to process' thread 45
  - bind 1/45         ## bind to group 1's thread 45
  - bind all/45       ## bind to thread 45 in each group
  - bind 1/all        ## bind to all threads in group 1
  - bind all          ## bind to all threads
  - bind all/all      ## bind to all threads in all groups (=all)
  - bind 1/65         ## rejected
  - bind 65           ## OK if there are enough threads
  - bind 35-45        ## depends. Rejected if it crosses a group boundary.

The global directive "nbthread 28" means 28 total threads for the process.
The number of groups will sub-divide this. E.g. 4 groups will very likely
imply 7 threads per group. At the beginning, the number of groups should be
set manually since it implies config adjustments to bind lines.

There should be a trivial way to map a global thread to a group and local ID
and to do the opposite.
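For example, assuming an even split of the process-wide threads into groups
(the helper names below are illustrative, not an existing API), the mapping
could be as trivial as:

    /* map a global thread ID to (group, local ID) and back, assuming
     * <nbthread> is evenly divided into <nbtgroups> groups.
     */
    static inline void thr_to_grp(int thr, int nbthread, int nbtgroups,
                                  int *tgid, int *ltid)
    {
        int per_grp = nbthread / nbtgroups;

        *tgid = thr / per_grp + 1;   /* groups are numbered from 1 */
        *ltid = thr % per_grp;       /* local ID within the group  */
    }

    static inline int grp_to_thr(int tgid, int ltid, int nbthread, int nbtgroups)
    {
        return (tgid - 1) * (nbthread / nbtgroups) + ltid;
    }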

Panic handler + watchdog
------------------------
Will probably depend on what's done for thread_isolate().

Per-thread arrays inside structures
-----------------------------------
- listeners have a thr_conn[] array, currently limited to MAX_THREADS. Should
  we simply bump the limit ?
- same for servers with idle connections.
  => doesn't seem very practical.
- another solution might be to point to dynamically allocated arrays of
  arrays (e.g. nbthread * nbgroup) or a first level per group and a second
  per thread.
  => dynamic allocation based on the global number (sketched below)
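A minimal sketch of that conclusion, using a listener's per-thread connection
counters as an example (the structure and helper names are assumptions):

    #include <stdlib.h>

    struct li_per_thr {
        unsigned int nb_conn;   /* connections accepted by this thread */
    };

    /* allocate one slot per configured thread instead of MAX_THREADS */
    static struct li_per_thr *li_alloc_per_thr(int global_nbthread)
    {
        return calloc(global_nbthread, sizeof(struct li_per_thr));
    }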

Other
-----
- what about dynamic thread start/stop (e.g. for containers/VMs) ?
  E.g. if we decide to start $MANY threads in 4 groups, and only use
  one, in the end it will not be possible to use less than one thread
  per group, and at most 64 will be present in each group.


FD Notes
--------
  - updt_fd_polling() uses thread_mask to figure out where to send the
    update, the local list or a shared list, and which bits to set in
    update_mask. This could be changed so that it takes the update mask as
    an argument. The call from the poller's fork would just have to
    broadcast everywhere.

  - pollers use it to figure out whether or not they're concerned by the
    activity update. This looks important as otherwise we could re-enable
    polling on an FD that changed to another thread.

  - thread_mask being a per-thread active mask looks more exact and is
    precisely used this way by _update_fd(). In this case using it instead
    of running_mask to gauge a change, or temporarily locking it during a
    removal, could make sense.

  - running should be conditioned by the thread; polled should not (since it
    may be deferred or migrated). In this case testing thread_mask can be
    enough most of the time, but this requires synchronization that will
    have to be extended to the tgid. But migration seems a different beast
    that we shouldn't care about here: if first performed at the higher
    level it ought to be safe.

In practice the update_mask can be dropped to zero by the first fd_delete()
as the only authority allowed to fd_delete() is *the* owner, and as soon as
all running_mask bits are gone, the FD will be closed, hence removed from
all pollers. This will be the only way to make sure that update_mask always
refers to the current tgid.

However, it may happen that a takeover within the same group causes a thread
to read the update_mask late, while the FD is being wiped by another thread.
That other thread may close it, causing another thread in another group to
catch it, change the tgid and start to update the update_mask. This means
that it would be possible for a thread entering do_poll() to see the correct
tgid, then the FD would be closed, reopened and reassigned to another tgid,
and the thread would see its bit in the update_mask and be confused. Right
now this should already happen when the update_mask is not cleared, except
that upon wakeup a migration would be detected and that would be all.

Thus we might need to set the running bit to prevent the FD from migrating
before reading update_mask, which also implies closing on
fd_clr_running() == 0 :-(

Also, even fd_update_events() leaves a risk of updating update_mask after
clearing running, thus affecting the wrong one. Probably update_mask should
be updated before clearing running_mask there. Also, how about not creating
an update on a close ? Not trivial if done before running, unless
thread_mask==0.

###########################################################

Current state:


Mux / takeover / fd_delete() code                |||          poller code
-------------------------------------------------|||---------------------------------------------------
                                                 \|/
mux_takeover():                                   | fd_set_running():
   if (fd_takeover()<0)                           |    old = {running, thread};
      return fail;                                |    new = {tid_bit, tid_bit};
   ...                                            |
fd_takeover():                                    |    do {
   atomic_or(running, tid_bit);                   |       if (!(old.thread & tid_bit))
   old = {running, thread};                       |          return -1;
   new = {tid_bit, tid_bit};                      |       new = { running | tid_bit, old.thread }
   if (owner != expected) {                       |    } while (!dwcas({running, thread}, &old, &new));
      atomic_and(running, ~tid_bit);              |
      return -1; // fail                          | fd_clr_running():
   }                                              |    return atomic_and_fetch(running, ~tid_bit);
                                                  |
   while (old == {tid_bit, !=0 })                 | poll():
      if (dwcas({running, thread}, &old, &new)) { |    if (!owner)
         atomic_and(running, ~tid_bit);           |       continue;
         return 0; // success                     |
      }                                           |    if (!(thread_mask & tid_bit)) {
   }                                              |       epoll_ctl_del();
                                                  |       continue;
   atomic_and(running, ~tid_bit);                 |    }
   return -1; // fail                             |
                                                  |    // via fd_update_events()
fd_delete():                                      | if (fd_set_running() != -1) {
   atomic_or(running, tid_bit);                   |    iocb();
   atomic_store(thread, 0);                       |    if (fd_clr_running() == 0 && !thread_mask)
   if (fd_clr_running(fd) == 0)                   |       fd_delete_orphan();
      fd_delete_orphan();                         | }


The idle_conns_lock prevents the connection from being *picked* and released
while someone else is reading it. What it does is guarantee that on idle
connections, the caller of the IOCB will not dereference the task's context
while the connection is still in the idle list, since it might be picked and
then freed at the same instant by another thread. As soon as the IOCB manages
to get that lock, it removes the connection from the list so that it cannot
be taken over anymore. Conversely, the mux's takeover() code runs under that
lock so that if it frees the connection and task, this will appear atomic
to the IOCB. The timeout task (which is another entry point for connection
deletion) does the same. Thus, when coming from the low level (I/O or
timeout):
  - the task always exists, but the ctx checked under the lock validates it;
    removing the conn from the list prevents takeover().
  - t->context is stable, except during changes under the takeover lock. So
    h2_timeout_task may well run on a different thread than h2_io_cb().

Coming from the top:
  - takeover() done under the lock clears the task's ctx and possibly closes
    the FD (unless some running bit remains present).

Unlikely but currently possible situations:
  - multiple pollers (up to N) may have an idle connection's FD being
    polled, if the connection was passed from thread to thread. The first
    event on the connection would wake all of them. Most of them would
    see fdtab[].owner set (the late ones might miss it). All but one would
    see that their bit is missing from fdtab[].thread_mask and give up.
    However, just after this test, others might take over the connection,
    so in practice, if terribly unlucky, all but one could see their bit in
    thread_mask just before it gets removed, all of them set their bit
    in running_mask, and all of them call iocb() (sock_conn_iocb()).
    Thus all of them dereference the connection and touch the subscriber
    with no protection, then end up in conn_notify_mux(), which will call
    the mux's wake().

  - multiple pollers (up to N-1) might still be in fd_update_events()
    manipulating fdtab[].state. The cause is that the "locked" variable
    is determined by atleast2(thread_mask) but that thread_mask is read
    at a random instant (i.e. it may be stolen by another one during a
    takeover) since we don't yet hold running to prevent this from being
    done. Thus we can arrive here with thread_mask==something_else (1 bit),
    locked==0 and fdtab[].state assigned non-atomically.

  - it looks like nothing prevents h2_release() from being called on a
    thread (e.g. from the top or a task timeout) while sock_conn_iocb()
    dereferences the connection on another thread. Those killing the
    connection don't yet consider the fact that it's an FD that others
    might currently be waking up on.

###################

Problem with a counter:

A users count doesn't say who's using the FD, and two users can do the same
close in turn. The thread_mask should define who's responsible for closing
the FD, and all those with a bit in it ought to do it.


2021-08-25 - update with minimal locking on tgid value
==========

  - tgid + refcount at once using CAS
  - idle_conns lock during updates
  - update:
      if tgid differs => close happened, thus drop update
      otherwise normal stuff. Lock tgid until running if needed.
  - poll report:
      if tgid differs => closed
      if thread differs => stop polling (migrated)
      keep tgid lock until running
  - test on thread_id (see the C sketch below):
      if ((xadd(&tgid,65536) & 0xffff) != my_tgid) {
         // was closed
         sub(&tgid, 65536)
         return -1
      }
      if !(thread_id & tid_bit) => migrated/closed
      set_running()
      sub(tgid,65536)
  - note: either fd_insert() or the final close() ought to set
    polled and update to 0.
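A rough C rendering of the "test on thread_id" step above, assuming the FD's
tgid word keeps the group ID in its lower 16 bits and a reference count in
its upper 16 bits, and assuming the usual HA_ATOMIC_* helpers (field and
helper names are illustrative):

    /* take a reference on the FD's tgid if it still matches <my_tgid>:
     * returns 1 on success (reference held), 0 if the FD was closed or
     * moved to another group in the meantime.
     */
    static inline int fd_take_tgid_if(volatile unsigned int *tgid,
                                      unsigned int my_tgid)
    {
        unsigned int old = HA_ATOMIC_FETCH_ADD(tgid, 0x10000);

        if ((old & 0xffff) != my_tgid) {
            HA_ATOMIC_SUB(tgid, 0x10000);   /* was closed: drop the reference */
            return 0;
        }
        return 1;
    }

    static inline void fd_drop_tgid(volatile unsigned int *tgid)
    {
        HA_ATOMIC_SUB(tgid, 0x10000);
    }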

2021-09-13 - tid / tgroups etc.
==========

  - tid currently is the thread's global ID. It's essentially used as an
    index for arrays. It must be clearly stated that it works this way.

  - tid_bit makes no sense process-wide, so it must be redefined to represent
    the thread's tid within its group. The name is not very welcome though,
    but there are 286 uses of it that are not going to be changed that fast.

  - just like "ti" is the thread_info, we need to have "tg" pointing to the
    thread_group (a sketch follows after this list).

  - other less commonly used elements should be retrieved from ti->xxx, e.g.
    the thread's local ID.

  - lock debugging must reproduce the tgid.

  - an offset might be placed in the tgroup so that even with 64 threads max
    we could have completely separate tid_bits over several groups.

2021-09-15 - bind + listen() + rx
==========

  - thread_mask (in bind_conf->rx_settings) should become an array of
    MAX_TGROUP longs.
  - when parsing "thread 123" or "thread 2/37", the proper bit is set,
    assuming the array is either a contiguous bitfield or a tgroup array.
    An option RX_O_THR_PER_GRP or RX_O_THR_PER_PROC is set depending on
    how the thread number was parsed, so that we reject mixes.
  - end of parsing: entries translated to the cleanest form (to be determined)
  - binding: for each socket()/bind()/listen()... just perform one extra dup()
    for each tgroup and store the multiple FDs into an FD array indexed on
    MAX_TGROUP (sketched below). => allows using one FD per tgroup for the
    same socket, hence having multiple entries in all tgroup pollers without
    requiring the user to duplicate the bind line.
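A rough sketch of the per-group dup() idea above; the function and array
names are illustrative only, and error handling is reduced to the minimum:

    #include <sys/socket.h>
    #include <unistd.h>

    /* create one listening socket, then duplicate it so that each thread
     * group polls its own FD for the same socket; <fds> is assumed to hold
     * MAX_TGROUP entries stored in the receiver.
     */
    static int listener_bind_per_group(const struct sockaddr *addr, socklen_t len,
                                       int fds[], int nbtgroups)
    {
        int fd = socket(addr->sa_family, SOCK_STREAM, 0);

        if (fd < 0 || bind(fd, addr, len) < 0 || listen(fd, 1024) < 0)
            return -1;

        fds[0] = fd;
        for (int grp = 1; grp < nbtgroups; grp++) {
            /* same socket, one distinct FD per tgroup, so each group's
             * pollers get their own fdtab entry without the user having to
             * duplicate the bind line.
             */
            fds[grp] = dup(fd);
            if (fds[grp] < 0)
                return -1;
        }
        return 0;
    }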

2021-09-15 - global thread masks
==========

Some global variables currently expect to know about thread IDs and it's
uncertain what must be done with them:
  - global_tasks_mask /* Mask of threads with tasks in the global runqueue */
    => touched under the rq lock. Change it per-group ? What exact use is
       made of it ?

  - sleeping_thread_mask /* Threads that are about to sleep in poll() */
    => seems that it can be made per group

  - all_threads_mask: a bit complicated, derived from nbthread and used with
    masks and with my_ffsl() to wake threads up. Should probably be per-group
    but we might miss something for global.

  - stopping_thread_mask: used in combination with all_threads_mask, should
    move per-group.

  - threads_harmless_mask: indicates all threads that are currently harmless
    in that they promise not to access a shared resource. Must be made
    per-group, but then we'll likely need a second stage to have the harmless
    groups mask (a sketch follows after this list). threads_idle_mask,
    threads_sync_mask, threads_want_rdv_mask go with the one above. Maybe the
    right approach will be to request harmless on a group mask so that we can
    detect collisions and arbitrate them like today, but on top of this it
    becomes possible to request harmless only on the local group if desired.
    The subtlety is that requesting harmless at the group level does not mean
    it's achieved, since the requester cannot vouch for the other ones in the
    same group.
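A possible shape for that second stage; all names below are assumptions, and
maintaining the aggregate correctly is precisely the subtlety mentioned
above. Each group would keep its own harmless mask while a global read_mostly
mask keeps one bit per fully-harmless group, so the isolation requester only
scans a single word on the fast path:

    extern volatile unsigned long tgrp_threads_harmless[MAX_TGROUP]; /* per group */
    extern volatile unsigned long tgroups_harmless;  /* one bit per all-harmless group */

    /* isolation requester's fast path: are all groups entirely harmless ? */
    static inline int all_tgroups_harmless(int nbtgroups)
    {
        unsigned long all = (nbtgroups >= LONGBITS) ? ~0UL : (1UL << nbtgroups) - 1;

        return (HA_ATOMIC_LOAD(&tgroups_harmless) & all) == all;
    }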

In addition, some variables are related to the global runqueue:
  __decl_aligned_spinlock(rq_lock);   /* spin lock related to run queue */
  struct eb_root rqueue;              /* tree constituting the global run queue, accessed under rq_lock */
  unsigned int grq_total;             /* total number of entries in the global run queue, atomic */
  static unsigned int global_rqueue_ticks;  /* insertion count in the grq, use rq_lock */

And others to the global wait queue:
  __decl_aligned_rwlock(wq_lock);     /* RW lock related to the wait queue */
  struct eb_root timers;              /* sorted timers tree, global, accessed under wq_lock */


2021-09-29 - group designation and masks
==========

Neither FDs nor tasks will belong to incomplete subsets of threads spanning
over multiple thread groups. In addition there may be a difference between
configuration and operation (for FDs). This allows us to fix the following
rules:

  group  mask   description
    0      0    bind_conf: groups & thread not set. bind to any/all
                task: it would be nice to mean "run on the same as the caller".

    0     xxx   bind_conf: thread set but not group: thread IDs are global
                FD/task: group 0, mask xxx

   G>0     0    bind_conf: only group is set: bind to all threads of group G
                FD/task: mask 0 not permitted (= not owned). May be used to
                mention "any thread of this group", though already covered by
                G/xxx like today.

   G>0    xxx   bind_conf: Bind to these threads of this group
                FD/task: group G, mask xxx

It looks like keeping groups starting at zero internally complicates
everything though. But forcing them to start at 1 might also require that we
rescan all tasks to replace 0 with 1 upon startup. This would also allow
group 0 to be special and be used as the default group for any new thread
creation, so that group0.count would keep the number of unassigned threads.
Let's try:

  group  mask   description
    0      0    bind_conf: groups & thread not set. bind to any/all
                task: "run on the same group & thread as the caller".

    0     xxx   bind_conf: thread set but not group: thread IDs are global
                FD/task: invalid. Or maybe for a task we could use this to
                mean "run on current group, thread XXX", which would cover
                the need for health checks (g/t 0/0 while sleeping, 0/xxx
                while running) and have wake_expired_tasks() detect 0/0 and
                wake them up to a random group.

   G>0     0    bind_conf: only group is set: bind to all threads of group G
                FD/task: mask 0 not permitted (= not owned). May be used to
                mention "any thread of this group", though already covered by
                G/xxx like today.

   G>0    xxx   bind_conf: Bind to these threads of this group
                FD/task: group G, mask xxx

With a single group declared in the config, group 0 would implicitly find the
first one.


The problem with the approach above is that a task queued in one group+thread's
wait queue could very well receive a signal from another thread and/or group,
and that there is no indication about where the task is queued, nor how to
dequeue it. Thus it seems that it's up to the application itself to unbind/
rebind a task. This contradicts the principle of leaving a task waiting in a
wait queue and waking it anywhere.

Another possibility might be to decide that a task having a defined group but
a mask of zero is shared and will always be queued into its group's wait queue.
However, upon expiry, the scheduler would notice the thread-mask 0 and would
broadcast it to any group.

Right now in the code we have:
  - 18 calls to task_new(tid_bit)
  - 18 calls to task_new(MAX_THREADS_MASK)
  - 2 calls with a single bit

Thus it looks like "task_new_anywhere()", "task_new_on()" and
"task_new_here()" would be sufficient.