Thread groups
#############

2021-07-13 - first draft
==========

Objective
---------
- support multi-socket systems with limited cache-line bouncing between
  physical CPUs and/or L3 caches

- overcome the 64-thread limitation

- support a reasonable number of groups. E.g. if modern CPUs arrive with
  core complexes made of 8 cores, with 8 CC per chip and 2 chips in a
  system, it makes sense to support 16 groups.


Non-objective
-------------
- no need to optimize to the last possible cycle. E.g. some algos like
  leastconn will remain shared across all threads, servers will keep a
  single queue, etc. Global information remains global.

- no stubborn enforcement of FD sharing. Per-server idle connection lists
  can become per-group; listeners can (and should probably) be per-group.
  Other mechanisms (like SO_REUSEADDR) can already overcome this.

- no need to go beyond 64 threads per group.


Identified tasks
================

General
-------
Everywhere tid_bit is used we absolutely need to find a complement using
either the current group or a specific one. Thread debugging will need to
be extended as masks are extensively used.


Scheduler
---------
The global run queue and global wait queue must become per-group. This
means that a task may only be queued into one of them at a time. It
sounds like tasks may only belong to a given group, but doing so would
bring back the original issue that it's impossible to perform remote
wake-ups.

We could probably ignore the group if we don't need to set the thread mask
in the task. The task's thread_mask is never manipulated using atomics so
it's safe to complement it with a group.

The sleeping_thread_mask should become per-group. Thus a wakeup may
possibly only be performed on the assigned group, meaning that either
a task is not assigned, in which case it will be self-assigned (like today),
or the tg to be woken up will be retrieved from the task itself.

Task creation currently takes a thread mask of either tid_bit, a specific
mask, or MAX_THREADS_MASK. How to create a task able to run anywhere
(checks, Lua, ...) ?

Profiling -> completed
---------
There should be one task_profiling_mask per thread group. Enabling or
disabling profiling should be made per group (possibly by iterating).
-> not needed anymore, one flag per thread in each thread's context.

Thread isolation
----------------
Thread isolation is difficult as we solely rely on atomic ops to figure
out who can complete. Such an operation is rare, so maybe we could have a
global read_mostly flag containing a mask of the groups that require
isolation. Then the threads_want_rdv_mask etc can become per-group. However
setting and clearing the bits will become problematic as this will happen
in two steps, hence will require careful ordering.

FD
--
tid_bit is used in a number of atomic ops on the running_mask. If we have
one fdtab[] per group, the mask implies that it's within the group.
Theoretically we should never face a situation where an FD is reported or
manipulated for a remote group.

There will still be one poller per thread, except that this time all
operations will be related to the current thread_group. No fd may appear
in two thread_groups at once, but we can probably not prevent that (e.g.
delayed close and reopen). Should we instead have a single shared fdtab[]
(less memory usage also) ? Maybe adding the group in the fdtab entry would
work, but when does a thread know it can leave it ? Currently this is
solved by running_mask and by update_mask. Having two tables could help
with this (each table sees the FD in a different group with a different
mask) but this looks overkill.

There's polled_mask[] which needs to be decided upon. It should probably
be doubled as well. Note, polled_mask left fdtab[] for cacheline
alignment reasons in commit cb92f5cae4.

If we have one fdtab[] per group, what *really* prevents the same FD from
being used in multiple groups ? _fd_delete_orphan() and fd_update_events()
need to check for no-thread usage before closing the FD. This could be
a limiting factor. Enabling could require waking every poller.

Shouldn't we remerge fdinfo[] with fdtab[] (one pointer + one int/short,
used only during creation and close) ?

Other problem, if we have one fdtab[] per TG, disabling/enabling an FD
(e.g. pause/resume on listener) can become a problem if it's not necessarily
on the current TG. We'll then need a way to figure that one out. It sounds
like FDs from listeners and receivers are very specific and suffer from
problems that all other ones do not suffer from under high load. Maybe
something specific ought to be done for them, if we can guarantee there is
no risk of accidental reuse (e.g. locate the TG info in the receiver and
have a "MT" bit in the FD's flags). The risk is always that a close() can
result in instant pop-up of the same FD on any other thread of the same
process.

Observations: right now fdtab[].thread_mask more or less corresponds to a
declaration of interest, it's very close to meaning "active per thread". It is
in fact located in the FD while it has nothing to do there, as it should be
where the FD is used, since it rules accesses to a shared resource that is not
the FD but what uses it. Indeed, if neither polled_mask nor running_mask have
a thread's bit, the FD is unknown to that thread and the element using it may
only be reached from above and not from the FD. As such we ought to have a
thread_mask on a listener and another one on connections. These ones will
indicate who uses them. A takeover could then be simplified (atomically set
exclusivity on the FD's running_mask, upon success, take over the connection,
clear the running mask). The change probably ought to be performed at
the connection level first, not the FD level by the way. But running and
polled are the two relevant elements, one indicates userland knowledge,
the other one kernel knowledge. For listeners there's no exclusivity so it's
a bit different but the rule remains the same that we don't have to know
what threads are *interested* in the FD, only its holder.

Not exact in fact, see FD notes below.

activity
--------
There should be one activity array per thread group. The dump should
simply scan them all since the cumulated values are not very important
anyway.

applets
-------
They use tid_bit only for the task. It looks like the appctx's thread_mask
is never used (now removed). Furthermore, it looks like the argument is
*always* tid_bit.

CPU binding
-----------
This is going to be tough. It will be necessary to detect that threads overlap
and are not bound (i.e. all threads on same mask). In this case, if the number
of threads is higher than the number of threads per physical socket, one must
try hard to evenly spread them among physical sockets (e.g. one thread group
per physical socket) and start as many threads as needed on each, bound to
all threads/cores of each socket. If there is a single socket, the same job
may be done based on L3 caches. Maybe it could always be done based on L3
caches. The difficulty behind this is the number of sockets to be bound: it
is not possible to bind several FDs per listener. Maybe with a new bind
keyword we can imagine automatically duplicating listeners ? In any case,
the initially bound cpumap (via taskset) must always be respected, and
everything should probably start from there.

Frontend binding
----------------
We'll have to define a list of threads and thread-groups per frontend.
Having a group mask and the same thread-mask for each group would probably
suffice.

Threads should have two numbers:
  - the per-process number (e.g. 1..256)
  - the per-group number (1..64)

The "bind-thread" lines ought to use the following syntax:
  - bind 45      ## bind to process' thread 45
  - bind 1/45    ## bind to group 1's thread 45
  - bind all/45  ## bind to thread 45 in each group
  - bind 1/all   ## bind to all threads in group 1
  - bind all     ## bind to all threads
  - bind all/all ## bind to all threads in all groups (=all)
  - bind 1/65    ## rejected
  - bind 65      ## OK if there are enough
  - bind 35-45   ## depends. Rejected if it crosses a group boundary.
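
The forms listed above can be told apart with very little code. What follows
is only a minimal sketch and not the actual parser: struct thr_spec,
parse_thr_spec() and the SK_* limits are hypothetical names introduced for
illustration, the 16-group bound is an assumption taken from the objectives,
and ranges such as "35-45" are deliberately left out (they are reported as
errors here).

```c
#include <stdlib.h>
#include <string.h>

#define SK_MAX_TGROUPS        16   /* assumption: the 16 groups from the objectives */
#define SK_MAX_THR_PER_GROUP  64   /* from the notes: at most 64 threads per group */
#define SK_MAX_THREADS        (SK_MAX_TGROUPS * SK_MAX_THR_PER_GROUP)

/* Hypothetical result of parsing one thread designation. */
struct thr_spec {
    int grp;  /* 0: no group given (thread number is process-wide), -1: "all", >0: group */
    int thr;  /* -1: "all", >0: thread number */
};

/* Parses <arg> ("45", "1/45", "all/45", "1/all", "all", "all/all") into <out>.
 * Returns 0 on success, -1 if the designation must be rejected.
 */
static int parse_thr_spec(const char *arg, struct thr_spec *out)
{
    const char *slash = strchr(arg, '/');
    const char *thr = slash ? slash + 1 : arg;
    char *end;
    long v;

    out->grp = 0;
    if (slash) {
        if (slash - arg == 3 && strncmp(arg, "all", 3) == 0) {
            out->grp = -1;
        } else {
            v = strtol(arg, &end, 10);
            if (end != slash || v < 1 || v > SK_MAX_TGROUPS)
                return -1;
            out->grp = (int)v;
        }
    }

    if (strcmp(thr, "all") == 0) {
        out->thr = -1;
        return 0;
    }

    v = strtol(thr, &end, 10);
    if (end == thr || *end != '\0' || v < 1)
        return -1;
    /* "1/65" is rejected: a group-relative number cannot exceed 64 */
    if (v > (slash ? SK_MAX_THR_PER_GROUP : SK_MAX_THREADS))
        return -1;
    out->thr = (int)v;
    return 0;
}
```

Whether a process-wide number such as 65 is finally accepted still depends on
the configured number of threads, which this sketch does not know about.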

The global directive "nbthread 28" means 28 total threads for the process. The
number of groups will sub-divide this. E.g. 4 groups will very likely imply 7
threads per group. At the beginning, the nbgroup should be manual since it
implies config adjustments to bind lines.

There should be a trivial way to map a global thread to a group and local ID
and to do the opposite.
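
One possible way to get such a mapping is to describe each group by the global
ID of its first thread and its number of threads. The sketch below assumes
exactly that and nothing more: struct tgroup_sketch and the three functions are
hypothetical names, IDs are 0-based internally (the user-visible numbers above
start at 1), and the even split is only one plausible policy.

```c
#define SK_MAX_TGROUPS 16  /* assumption: upper bound taken from the objectives */

/* Hypothetical per-group description: first global thread ID and thread count. */
struct tgroup_sketch {
    unsigned int base;   /* global ID (0-based) of the group's first thread */
    unsigned int count;  /* number of threads in this group */
};

/* Splits <nbthread> threads over <nbtgroups> groups as evenly as possible;
 * the first (nbthread % nbtgroups) groups receive one extra thread.
 */
static void split_threads(struct tgroup_sketch *tg, unsigned int nbthread,
                          unsigned int nbtgroups)
{
    unsigned int g, base = 0;

    for (g = 0; g < nbtgroups; g++) {
        tg[g].base  = base;
        tg[g].count = nbthread / nbtgroups + (g < nbthread % nbtgroups);
        base += tg[g].count;
    }
}

/* Global thread ID -> (group index, local ID). Returns 0, or -1 if out of range. */
static int thr_global_to_local(const struct tgroup_sketch *tg, unsigned int nbtgroups,
                               unsigned int gtid, unsigned int *grp, unsigned int *ltid)
{
    unsigned int g;

    for (g = 0; g < nbtgroups; g++) {
        /* unsigned wrap-around turns this into a single range check */
        if (gtid - tg[g].base < tg[g].count) {
            *grp  = g;
            *ltid = gtid - tg[g].base;
            return 0;
        }
    }
    return -1;
}

/* (group index, local ID) -> global thread ID. */
static unsigned int thr_local_to_global(const struct tgroup_sketch *tg,
                                        unsigned int grp, unsigned int ltid)
{
    return tg[grp].base + ltid;
}
```

With "nbthread 28" and 4 groups, split_threads() gives 7 threads per group with
bases 0, 7, 14 and 21, so global thread 15 is local thread 1 of the third group.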


Panic handler + watchdog
------------------------
Will probably depend on what's done for thread_isolate

Per-thread arrays inside structures
-----------------------------------
- listeners have a thr_conn[] array, currently limited to MAX_THREADS. Should
  we simply bump the limit ?
- same for servers with idle connections.
=> doesn't seem very practical.
- another solution might be to point to dynamically allocated arrays of
  arrays (e.g. nbthread * nbgroup) or a first level per group and a second
  per thread.
=> dynamic allocation based on the global number
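
As a rough illustration of that last option, the fixed-size array can be
replaced by one allocated at startup once the total number of threads is
known. This is only a sketch: struct listener_sketch and both helpers are
hypothetical, and the array is assumed to be indexed by the global thread ID.

```c
#include <stdlib.h>

/* Hypothetical, heavily reduced listener: only the per-thread counters are kept. */
struct listener_sketch {
    unsigned int *thr_conn;  /* one connection counter per thread, indexed by global thread ID */
    unsigned int nbthread;   /* number of entries in thr_conn[] */
};

/* Allocates zeroed counters for <nbthread> threads. Returns 0 on success, -1 on error. */
static int listener_alloc_thr_conn(struct listener_sketch *l, unsigned int nbthread)
{
    if (!nbthread)
        return -1;
    l->thr_conn = calloc(nbthread, sizeof(*l->thr_conn));
    if (!l->thr_conn)
        return -1;
    l->nbthread = nbthread;
    return 0;
}

/* Releases the counters and leaves the structure in a reusable state. */
static void listener_free_thr_conn(struct listener_sketch *l)
{
    free(l->thr_conn);
    l->thr_conn = NULL;
    l->nbthread = 0;
}
```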

Other
-----
- what about dynamic thread start/stop (e.g. for containers/VMs) ?
  E.g. if we decide to start $MANY threads in 4 groups, and only use
  one, in the end it will not be possible to use less than one thread
  per group, and at most 64 will be present in each group.


FD Notes
--------
 - updt_fd_polling() uses thread_mask to figure where to send the update,
   the local list or a shared list, and which bits to set in update_mask.
   This could be changed so that it takes the update mask in argument. The
   call from the poller's fork would just have to broadcast everywhere.

 - pollers use it to figure whether they're concerned or not by the activity
   update. This looks important as otherwise we could re-enable polling on
   an FD that changed to another thread.

 - thread_mask being a per-thread active mask looks more exact and is
   precisely used this way by _update_fd(). In this case using it instead
   of running_mask to gauge a change or temporarily lock it during a
   removal could make sense.

 - running should be conditioned by thread. Polled not (since deferred
   or migrated). In this case testing thread_mask can be enough most of
   the time, but this requires synchronization that will have to be
   extended to tgid. But migration seems a different beast that we shouldn't
   care about here: if first performed at the higher level it ought to
   be safe.

In practice the update_mask can be dropped to zero by the first fd_delete()
as the only authority allowed to fd_delete() is *the* owner, and as soon as
all running_mask are gone, the FD will be closed, hence removed from all
pollers. This will be the only way to make sure that update_mask always
refers to the current tgid.

However, it may happen that a takeover within the same group causes a thread
to read the update_mask late, while the FD is being wiped by another thread.
That other thread may close it, causing another thread in another group to
catch it, and change the tgid and start to update the update_mask. This means
that it would be possible for a thread entering do_poll() to see the correct
tgid, then the fd would be closed, reopened and reassigned to another tgid,
and the thread would see its bit in the update_mask, being confused. Right
now this should already happen when the update_mask is not cleared, except
that upon wakeup a migration would be detected and that would be all.

Thus we might need to set the running bit to prevent the FD from migrating
before reading update_mask, which also implies closing on fd_clr_running() == 0 :-(

Also even fd_update_events() leaves a risk of updating update_mask after
clearing running, thus affecting the wrong one. update_mask should probably
be updated before clearing running_mask there. Also, how about not
creating an update on a close ? Not trivial if done before running, unless
thread_mask==0.

Note that one situation that is currently visible is that a thread closes a
file descriptor that it's the last one to own and to have an update for. In
fd_delete_orphan() it does call poller.clo() but this one is not sufficient
as it doesn't drop the update_mask nor does it clear the polled_mask. The
typical problem that arises is that the close() happens before processing
the last update (e.g. a close() just after a partial read), thus it still
has *at least* one bit set for the current thread in both update_mask and
polled_mask, and it is present in the update_list. Not handling it would
mean that the event is lost on update() from the concerned threads and
that some resource might leak. Handling it means zeroing the update_mask
and polled_mask, and deleting the update entry from the update_list, thus
losing the update event. And as indicated above, if the FD switches twice
between 2 groups, the finally called thread does not necessarily know that
the FD isn't the same anymore, thus it's difficult to decide whether to
delete it or not, because deleting the event might in fact mean deleting
something that was just re-added for the same thread with the same FD but
a different usage.

Also it really seems unrealistic to scan a single shared update_list like
this using write operations. There should likely be one per thread-group.
But in this case there is no other choice than deleting the update event
upon fd_delete_orphan(). This also means that poller->clo() must do the
job for all of the group's threads at once. This would mean a synchronous
removal before the close(), which doesn't seem ridiculously expensive. It
just requires that any thread of a group may manipulate any other thread's
status for an FD and a poller.

Note about our currently supported pollers:

  - epoll: our current code base relies on the modern version which
    automatically removes closed FDs, so we don't have anything to do
    when closing and we don't need the update.

  - kqueue: according to https://www.freebsd.org/cgi/man.cgi?query=kqueue, just
    like epoll, a close() implies a removal. Our poller doesn't perform
    any bookkeeping either so it's OK to directly close.

  - evports: https://docs.oracle.com/cd/E86824_01/html/E54766/port-dissociate-3c.html
    says the same, i.e. close() implies a removal of all events. No local
    processing nor bookkeeping either, we can close.

  - poll: the fd_evts[] array is global, thus shared by all threads. As such,
    a single removal is needed to flush it for all threads at once. The
    operation is already performed like this.

  - select: works exactly like poll() above, hence already handled.

As a preliminary conclusion, it's safe to delete the event and reset
update_mask just after calling poller->clo(). If extremely unlucky (changing
thread mask due to takeover ?), the same FD may appear at the same time:
  - in one or several thread-local fd_updt[] arrays. These ones are just work
    queues, there's nothing to do to ignore them, just leave the holes with an
    outdated FD which will be ignored once met. As a bonus, poller->clo() could
    check if the last fd_updt[] points to this specific FD and decide to kill
    it.

  - in the global update_list. In this case, fd_rm_from_fd_list() already
    performs an attachment check, so it's safe to always call it before closing
    (since no one else may be in the process of changing anything).


###########################################################

Current state:


Mux / takeover / fd_delete() code                |||  poller code
-------------------------------------------------|||---------------------------------------------------
                                                 \|/
mux_takeover():                                   | fd_set_running():
   if (fd_takeover()<0)                           |    old = {running, thread};
     return fail;                                 |    new = {tid_bit, tid_bit};
   ...                                            |
fd_takeover():                                    |    do {
   atomic_or(running, tid_bit);                   |       if (!(old.thread & tid_bit))
   old = {running, thread};                       |          return -1;
   new = {tid_bit, tid_bit};                      |       new = { running | tid_bit, old.thread };
   if (owner != expected) {                       |    } while (!dwcas({running, thread}, &old, &new));
      atomic_and(running, ~tid_bit);              |
      return -1; // fail                          | fd_clr_running():
   }                                              |    return atomic_and_fetch(running, ~tid_bit);
                                                  |
   while (old == {tid_bit, !=0 }) {               | poll():
      if (dwcas({running, thread}, &old, &new)) { |    if (!owner)
         atomic_and(running, ~tid_bit);           |       continue;
         return 0; // success                     |
      }                                           |    if (!(thread_mask & tid_bit)) {
   }                                              |       epoll_ctl_del();
                                                  |       continue;
   atomic_and(running, ~tid_bit);                 |    }
   return -1; // fail                             |
                                                  |    // via fd_update_events()
fd_delete():                                      |    if (fd_set_running() != -1) {
   atomic_or(running, tid_bit);                   |       iocb();
   atomic_store(thread, 0);                       |       if (fd_clr_running() == 0 && !thread_mask)
   if (fd_clr_running(fd) == 0)                   |          fd_delete_orphan();
        fd_delete_orphan();                       |    }


The idle_conns_lock prevents the connection from being *picked* and released
while someone else is reading it. What it does is guarantee that on idle
connections, the caller of the IOCB will not dereference the task's context
while the connection is still in the idle list, since it might be picked then
freed at the same instant by another thread. As soon as the IOCB manages to
get that lock, it removes the connection from the list so that it cannot be
taken over anymore. Conversely, the mux's takeover() code runs under that
lock so that if it frees the connection and task, this will appear atomic
to the IOCB. The timeout task (which is another entry point for connection
deletion) does the same. Thus, when coming from the low-level (I/O or timeout):
  - task always exists, but ctx checked under lock validates; conn removal
    from list prevents takeover().
  - t->context is stable, except during changes under takeover lock. So
    h2_timeout_task may well run on a different thread than h2_io_cb().

Coming from the top:
  - takeover() done under lock() clears task's ctx and possibly closes the FD
    (unless some running remains present).

Unlikely but currently possible situations:
  - multiple pollers (up to N) may have an idle connection's FD being
    polled, if the connection was passed from thread to thread. The first
    event on the connection would wake all of them. Most of them would
    see fdtab[].owner set (the late ones might miss it). All but one would
    see that their bit is missing from fdtab[].thread_mask and give up.
    However, just after this test, others might take over the connection,
    so in practice if terribly unlucky, all but 1 could see their bit in
    thread_mask just before it gets removed, all of them set their bit
    in running_mask, and all of them call iocb() (sock_conn_iocb()).
    Thus all of them dereference the connection and touch the subscriber
    with no protection, then end up in conn_notify_mux() that will call
    the mux's wake().

  - multiple pollers (up to N-1) might still be in fd_update_events()
    manipulating fdtab[].state. The cause is that the "locked" variable
    is determined by atleast2(thread_mask) but that thread_mask is read
    at a random instant (i.e. it may be stolen by another one during a
    takeover) since we don't yet hold running to prevent this from being
    done. Thus we can arrive here with thread_mask==something_else (1bit),
    locked==0 and fdtab[].state assigned non-atomically.

  - it looks like nothing prevents h2_release() from being called on a
    thread (e.g. from the top or task timeout) while sock_conn_iocb()
    dereferences the connection on another thread. Those killing the
    connection don't yet consider the fact that it's an FD that others
    might currently be waking up on.

###################

pb with counter:

users count doesn't say who's using the FD and two users can do the same
close in turn. The thread_mask should define who's responsible for closing
the FD, and all those with a bit in it ought to do it.


2021-08-25 - update with minimal locking on tgid value
==========

 - tgid + refcount at once using CAS
 - idle_conns lock during updates
 - update:
     if tgid differs => close happened, thus drop update
     otherwise normal stuff. Lock tgid until running if needed.
 - poll report:
     if tgid differs => closed
     if thread differs => stop polling (migrated)
     keep tgid lock until running
 - test on thread_id:
     if (xadd(&tgid,65536) != my_tgid) {
        // was closed
        sub(&tgid, 65536)
        return -1
     }
     if !(thread_id & tid_bit) => migrated/closed
     set_running()
     sub(tgid,65536)
 - note: either fd_insert() or the final close() ought to set
   polled and update to 0.
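
The "test on thread_id" sequence above can be written with C11 atomics roughly
as follows. This is a sketch under assumptions, not the real fdtab code:
struct fd_sketch and fd_grab_running() are hypothetical, the tgid is assumed
to sit in the low 16 bits with the reference count above it (hence the 65536
increment), and only those low bits are compared so that references held by
other threads do not make the test fail.

```c
#include <stdatomic.h>
#include <stdint.h>

#define SK_TGID_REF_ONE 0x10000u  /* one reference, i.e. the 65536 used in the notes */
#define SK_TGID_MASK    0xFFFFu   /* assumption: tgid stored in the low 16 bits */

/* Hypothetical, reduced FD entry. */
struct fd_sketch {
    _Atomic uint32_t refc_tgid;        /* refcount (high bits) + tgid (low 16 bits) */
    _Atomic unsigned long thread_mask;  /* threads of the group allowed to use the FD */
    _Atomic unsigned long running_mask; /* threads currently running on the FD */
};

/* Tries to set the caller's running bit. Returns 0 on success, or -1 if the FD
 * was closed/reassigned to another group or migrated to another thread.
 */
static int fd_grab_running(struct fd_sketch *fd, uint32_t my_tgid, unsigned long tid_bit)
{
    /* take a reference: the tgid must not change while we hold it */
    uint32_t old = atomic_fetch_add(&fd->refc_tgid, SK_TGID_REF_ONE);

    if ((old & SK_TGID_MASK) != my_tgid) {
        /* was closed (and possibly reopened in another group) */
        atomic_fetch_sub(&fd->refc_tgid, SK_TGID_REF_ONE);
        return -1;
    }

    if (!(atomic_load(&fd->thread_mask) & tid_bit)) {
        /* migrated or closed */
        atomic_fetch_sub(&fd->refc_tgid, SK_TGID_REF_ONE);
        return -1;
    }

    /* the running bit now protects the FD, the tgid reference may go */
    atomic_fetch_or(&fd->running_mask, tid_bit);
    atomic_fetch_sub(&fd->refc_tgid, SK_TGID_REF_ONE);
    return 0;
}
```

The sketch leaves out the part that changes the tgid: that side would have to
wait for the reference count to drop to zero before switching groups.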

2021-09-13 - tid / tgroups etc.
==========

  * tid currently is the thread's global ID. It's essentially used as an index
    for arrays. It must be clearly stated that it works this way.

  * tasklets use the global thread id, and __tasklet_wakeup_on() must use a
    global ID as well. It's essential that tinfo[] provides instant access to
    local/global bits/indexes/arrays

  - tid_bit makes no sense process-wide, so it must be redefined to represent
    the thread's tid within its group. The name is not very welcome though, but
    there are 286 occurrences of it that are not going to be changed that fast.
    => now we have ltid and ltid_bit in thread_info. thread-local tid_bit still
       not changed though. If renamed we must make sure the older one vanishes.
       Why not rename "ptid, ptid_bit" for the process-wide tid and "gtid,
       gtid_bit" for the group-wide ones ? This removes the ambiguity on "tid"
       which is half the time not the one we expect.

  * just like "ti" is the thread_info, we need to have "tg" pointing to the
    thread_group.

  - other less commonly used elements should be retrieved from ti->xxx. E.g.
    the thread's local ID.

  - lock debugging must reproduce tgid

  * task profiling must be made per-group (annoying), unless we want to add a
    per-thread TH_FL_* flag and have the rare places where the bit is changed
    iterate over all threads if needed. Sounds preferable overall.

  * an offset might be placed in the tgroup so that even with 64 threads max
    we could have completely separate tid_bits over several groups.
    => base and count now

2021-09-15 - bind + listen() + rx
==========

  - thread_mask (in bind_conf->rx_settings) should become an array of
    MAX_TGROUP longs.
  - when parsing "thread 123" or "thread 2/37", the proper bit is set,
    assuming the array is either a contiguous bitfield or a tgroup array.
    An option RX_O_THR_PER_GRP or RX_O_THR_PER_PROC is set depending on
    how the thread num was parsed, so that we reject mixes.
  - end of parsing: entries translated to the cleanest form (to be determined)
  - binding: for each socket()/bind()/listen()... just perform one extra dup()
    for each tgroup and store the multiple FDs into an FD array indexed on
    MAX_TGROUP. => allows to use one FD per tgroup for the same socket, hence
    to have multiple entries in all tgroup pollers without requiring the user
    to duplicate the bind line.
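
The dup()-per-group idea from the last item can be sketched with plain POSIX
calls. Everything specific here is an assumption made for the example:
bind_per_tgroup() is a hypothetical helper, SK_MAX_TGROUPS stands in for
MAX_TGROUP, and the socket is bound to the IPv4 loopback address only to keep
the example short.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define SK_MAX_TGROUPS 16  /* stand-in for MAX_TGROUP */

/* Creates one listening TCP socket on 127.0.0.1:<port> and stores one FD per
 * thread group into <fds>: fds[0] is the original, the others are dup()s of it.
 * Returns 0 on success, -1 on error (nothing is left open in that case).
 */
static int bind_per_tgroup(unsigned short port, int nbtgroups, int fds[SK_MAX_TGROUPS])
{
    struct sockaddr_in addr;
    int g;

    if (nbtgroups < 1 || nbtgroups > SK_MAX_TGROUPS)
        return -1;

    fds[0] = socket(AF_INET, SOCK_STREAM, 0);
    if (fds[0] < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port        = htons(port);

    if (bind(fds[0], (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fds[0], 128) < 0) {
        close(fds[0]);
        return -1;
    }

    /* one extra FD per additional group, all referring to the same socket */
    for (g = 1; g < nbtgroups; g++) {
        fds[g] = dup(fds[0]);
        if (fds[g] < 0) {
            while (--g >= 0)
                close(fds[g]);
            return -1;
        }
    }
    return 0;
}
```

All the duplicated FDs refer to the same kernel socket, hence to the same
accept queue, so each group's pollers can register their own FD number without
the user having to duplicate the bind line.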

2021-09-15 - global thread masks
==========

Some global variables currently expect to know about thread IDs and it's
uncertain what must be done with them:
  - global_tasks_mask  /* Mask of threads with tasks in the global runqueue */
    => touched under the rq lock. Change it per-group ? What exact use is made ?

  - sleeping_thread_mask /* Threads that are about to sleep in poll() */
    => seems that it can be made per group

  - all_threads_mask: a bit complicated, derived from nbthread and used with
    masks and with my_ffsl() to wake threads up. Should probably be per-group
    but we might miss something for global.

  - stopping_thread_mask: used in combination with all_threads_mask, should
    move per-group.

  - threads_harmless_mask: indicates all threads that are currently harmless in
    that they promise not to access a shared resource. Must be made per-group
    but then we'll likely need a second stage to have the harmless groups mask.
    threads_idle_mask, threads_sync_mask, threads_want_rdv_mask go with the one
    above. Maybe the right approach will be to request harmless on a group mask
    so that we can detect collisions and arbitrate them like today, but on top of
    this it becomes possible to request harmless only on the local group if
    desired. The subtlety is that requesting harmless at the group level does
    not mean it's achieved since the requester cannot vouch for the other ones
    in the same group.

In addition, some variables are related to the global runqueue:
  __decl_aligned_spinlock(rq_lock); /* spin lock related to run queue */
  struct eb_root rqueue;            /* tree constituting the global run queue, accessed under rq_lock */
  unsigned int grq_total;           /* total number of entries in the global run queue, atomic */
  static unsigned int global_rqueue_ticks; /* insertion count in the grq, use rq_lock */

And others to the global wait queue:
  struct eb_root timers;            /* sorted timers tree, global, accessed under wq_lock */
  __decl_aligned_rwlock(wq_lock);   /* RW lock related to the wait queue */


2022-06-14 - progress on task affinity
==========

The particularity of the current global run queue is to be usable for remote
wakeups because it's protected by a lock. There is no need for a global run
queue beyond this, and there could already be a locked queue per thread for
remote wakeups, with a random selection at wakeup time. It's just that picking
a pending task in a run queue among a number is convenient (though it
introduces some excessive locking). A task will either be tied to a single
group or will be allowed to run on any group. As such it's pretty clear that we
don't need a global run queue. When a run-anywhere task expires, either it runs
on the current group's runqueue with any thread, or a target thread is selected
during the wakeup and it's directly assigned.

A global wait queue seems important for scheduled repetitive tasks however. But
maybe it's more a task for a cron-like job and there's no need for the task
itself to wake up anywhere, because once the task wakes up, it must be tied to
one (or a set of) thread(s). One difficulty if the task is temporarily assigned
a thread group is that it's impossible to know where it's running when trying
to perform a second wakeup or when trying to kill it. Maybe we'll need to have
two tgid for a task (desired, effective). Or maybe we can restrict the ability
of such a task to stay in wait queue in case of wakeup, though that sounds
difficult. Other approaches would be to set the GID to the current one when
waking up the task, and to have a flag (or sign on the GID) indicating that the
task is still queued in the global timers queue. We already have TASK_SHARED_WQ
so it seems that another similar flag such as TASK_WAKE_ANYWHERE could make
sense. But when is TASK_SHARED_WQ really used, except for the "anywhere" case ?
All calls to task_new() use either 1<<thr, tid_bit, all_threads_mask, or come
from appctx_new which does exactly the same. The only real user of non-global,
non-unique task_new() call is debug_parse_cli_sched() which purposely allows to
use an arbitrary mask.

 +----------------------------------------------------------------------------+
 | => we don't need one WQ per group, only a global and N local ones, hence   |
 |    the TASK_SHARED_WQ flag can continue to be used for this purpose.       |
 +----------------------------------------------------------------------------+

Having TASK_SHARED_WQ should indicate that a task will always be queued to the
shared queue and will always have a temporary gid and thread mask in the run
queue.

Going further, as we don't have any single case of a task bound to a small set
of threads, we could decide to wake up only expired tasks for ourselves by
looking them up using eb32sc and adopting them. Thus, there's no more need for
a shared runqueue nor a global_rqueue_ticks counter, and we can simply have
the ability to wake up a remote task. The task's thread_mask will then change
so that it's only a thread ID, except when the task has TASK_SHARED_WQ, in
which case it corresponds to the running thread. That's very close to what is
already done with tasklets in fact.


2021-09-29 - group designation and masks
==========

Neither FDs nor tasks will belong to incomplete subsets of threads spanning
over multiple thread groups. In addition there may be a difference between
configuration and operation (for FDs). This allows the following rules to be
set:

  group  mask   description
    0     0     bind_conf: groups & thread not set. bind to any/all
                task: it would be nice to mean "run on the same as the caller".

    0    xxx    bind_conf: thread set but not group: thread IDs are global
                FD/task: group 0, mask xxx

   G>0    0     bind_conf: only group is set: bind to all threads of group G
                FD/task: mask 0 not permitted (= not owned). May be used to
                mention "any thread of this group", though already covered by
                G/xxx like today.

   G>0   xxx    bind_conf: Bind to these threads of this group
                FD/task: group G, mask xxx

It looks like keeping groups starting at zero internally complicates everything
though. But forcing it to start at 1 might also require that we rescan all tasks
to replace 0 with 1 upon startup. This would also allow group 0 to be special and
be used as the default group for any new thread creation, so that group0.count
would keep the number of unassigned threads. Let's try:

  group  mask   description
    0     0     bind_conf: groups & thread not set. bind to any/all
                task: "run on the same group & thread as the caller".

    0    xxx    bind_conf: thread set but not group: thread IDs are global
                FD/task: invalid. Or maybe for a task we could use this to
                mean "run on current group, thread XXX", which would cover
                the need for health checks (g/t 0/0 while sleeping, 0/xxx
                while running) and have wake_expired_tasks() detect 0/0 and
                wake them up to a random group.

   G>0    0     bind_conf: only group is set: bind to all threads of group G
                FD/task: mask 0 not permitted (= not owned). May be used to
                mention "any thread of this group", though already covered by
                G/xxx like today.

   G>0   xxx    bind_conf: Bind to these threads of this group
                FD/task: group G, mask xxx

With a single group declared in the config, group 0 would implicitly find the
first one.
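
Read from the task's point of view, the second table boils down to a four-way
decision on the (group, mask) pair. The sketch below only restates it: enum
tm_kind and task_tm_kind() are hypothetical names, and the 0/xxx case follows
the tentative "run on current group" reading proposed above rather than the
"invalid" one.

```c
/* Hypothetical interpretation of a task's (group, mask) pair. */
enum tm_kind {
    TM_SAME_AS_CALLER,  /* 0/0:   run on the same group & thread as the caller */
    TM_CUR_GROUP,       /* 0/xxx: threads xxx of the current group (tentative) */
    TM_INVALID,         /* G/0:   mask 0 not permitted (= not owned) */
    TM_GROUP_THREADS    /* G/xxx: threads xxx of group G */
};

/* Classifies <grp>/<mask> according to the table; groups start at 1, 0 = unset. */
static enum tm_kind task_tm_kind(unsigned int grp, unsigned long mask)
{
    if (!grp)
        return mask ? TM_CUR_GROUP : TM_SAME_AS_CALLER;
    return mask ? TM_GROUP_THREADS : TM_INVALID;
}
```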


The problem with the approach above is that a task queued in one group+thread's
wait queue could very well receive a signal from another thread and/or group,
and that there is no indication about where the task is queued, nor how to
dequeue it. Thus it seems that it's up to the application itself to unbind/
rebind a task. This contradicts the principle of leaving a task waiting in a
wait queue and waking it anywhere.

Another possibility might be to decide that a task having a defined group but
a mask of zero is shared and will always be queued into its group's wait queue.
However, upon expiry, the scheduler would notice the thread-mask 0 and would
broadcast it to any group.

Right now in the code we have:
  - 18 calls of task_new(tid_bit)
  - 17 calls of task_new_anywhere()
  - 2 calls with a single bit

Thus it looks like "task_new_anywhere()", "task_new_on()" and
"task_new_here()" would be sufficient.
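
To show how little these three entry points need to differ, here is a reduced
sketch in which "here" and "anywhere" are thin wrappers around "on". The names
are prefixed with sk_ because everything in it is hypothetical (the structure,
the signatures, the SK_TID_ANY marker and the sk_tid variable standing in for
the caller's thread ID); it only illustrates the shape of the API.

```c
#include <stdlib.h>

#define SK_TID_ANY (-1)  /* hypothetical marker: the task may run on any thread */

/* Hypothetical, heavily reduced task: only the thread affinity is kept. */
struct task_sketch {
    int tid;  /* global thread ID, or SK_TID_ANY */
};

/* Stand-in for the calling thread's global ID. */
static _Thread_local int sk_tid = 0;

/* Creates a task bound to global thread <thr> (or SK_TID_ANY). NULL on failure. */
static struct task_sketch *sk_task_new_on(int thr)
{
    struct task_sketch *t = malloc(sizeof(*t));

    if (t)
        t->tid = thr;
    return t;
}

/* Creates a task bound to the calling thread. */
static struct task_sketch *sk_task_new_here(void)
{
    return sk_task_new_on(sk_tid);
}

/* Creates a task allowed to run on any thread. */
static struct task_sketch *sk_task_new_anywhere(void)
{
    return sk_task_new_on(SK_TID_ANY);
}
```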