Thread groups
#############

2021-07-13 - first draft
==========

Objective
---------
- support multi-socket systems with limited cache-line bouncing between
  physical CPUs and/or L3 caches

- overcome the 64-thread limitation

- support a reasonable number of groups. For example, if modern CPUs arrive
  with core complexes made of 8 cores, with 8 CCs per chip and 2 chips in a
  system, it makes sense to support 16 groups.


Non-objective
-------------
- no need to optimize down to the last possible cycle: some algorithms like
  leastconn will remain shared across all threads, servers will keep a
  single queue, etc. Global information remains global.

- no stubborn enforcement of FD sharing. Per-server idle connection lists
  can become per-group; listeners can (and probably should) be per-group.
  Other mechanisms (like SO_REUSEPORT) can already overcome this.

- no need to go beyond 64 threads per group.


Identified tasks
================

General
-------
Everywhere tid_bit is used we absolutely need to find a complement using
either the current group or a specific one. Thread debugging will need to
be extended as masks are extensively used.


Scheduler
---------
The global run queue and global wait queue must become per-group. This
means that a task may only be queued into one of them at a time. It
sounds like tasks may only belong to a given group, but doing so would
bring back the original issue that it's impossible to perform remote
wakeups.

We could probably ignore the group if we don't need to set the thread mask
in the task. The task's thread_mask is never manipulated using atomics, so
it's safe to complement it with a group.

The sleeping_thread_mask should become per-group. A wakeup would then only
be performed on the assigned group: either the task is not yet assigned, in
which case it self-assigns (like today), or the thread group to be woken up
is retrieved from the task itself.

Task creation currently takes a thread mask of either tid_bit, a specific
mask, or MAX_THREADS_MASK. How do we create a task able to run anywhere
(checks, Lua, ...) ?
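As a rough illustration of the "complement with a group" idea above, the task
could carry a small group ID next to its (now group-local) thread mask, both
updated together by their owner without atomics. Names and layout here are
assumptions, not a final design:

    struct task {
        /* ... existing fields ... */
        unsigned short tgid;        /* thread group the task is bound to, 0 = any */
        unsigned long  thread_mask; /* mask of allowed threads within <tgid> */
    };

    static inline void task_set_affinity2(struct task *t,
                                          unsigned short tgid,
                                          unsigned long mask)
    {
        /* only the task's owner updates these, so no atomics are needed */
        t->tgid        = tgid;
        t->thread_mask = mask;
    }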

Profiling
---------
There should be one task_profiling_mask per thread group. Enabling or
disabling profiling should be done per group (possibly by iterating over
the groups).

Thread isolation
----------------
Thread isolation is difficult as we rely solely on atomic ops to figure out
who may complete. Such an operation is rare; maybe we could have a global
read_mostly flag containing a mask of the groups that require isolation.
Then the threads_want_rdv_mask etc. can become per-group. However, setting
and clearing the bits will become problematic as this will happen in two
steps, hence will require careful ordering.
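A hedged sketch of that direction, assuming the usual HA_ATOMIC_* helpers;
the per-group array and global mask names are assumptions. Publishing the
per-group bit before the group's bit in the global mask ensures that a
reader seeing the global bit always finds the per-group request:

    /* per-group rendezvous requests, plus a global read_mostly mask with
     * one bit per group having at least one isolation request in progress.
     */
    extern volatile unsigned long tgrp_threads_want_rdv[MAX_TGROUP];
    extern volatile unsigned long tgroups_want_rdv;

    static inline void thread_isolate_request(int tgid, unsigned long ltid_bit)
    {
        /* two steps: first the request within the group, then the group flag */
        HA_ATOMIC_OR(&tgrp_threads_want_rdv[tgid - 1], ltid_bit);
        HA_ATOMIC_OR(&tgroups_want_rdv, 1UL << (tgid - 1));
    }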

FD
--
tid_bit is used in a number of atomic ops on the running_mask. If we have
one fdtab[] per group, the mask implies that it's within the group.
Theoretically we should never face a situation where an FD is reported or
manipulated for a remote group.

There will still be one poller per thread, except that this time all
operations will be related to the current thread_group. No FD may appear
in two thread_groups at once, but we probably cannot prevent that (e.g.
delayed close and reopen). Should we instead have a single shared fdtab[]
(less memory usage also) ? Maybe adding the group in the fdtab entry would
work, but when does a thread know it can leave it ? Currently this is
solved by running_mask and by update_mask. Having two tables could help
with this (each table sees the FD in a different group with a different
mask) but this looks overkill.

There's polled_mask[] which needs to be decided upon. Probably it should
be doubled as well. Note, polled_mask left fdtab[] for cacheline alignment
reasons in commit cb92f5cae4.

If we have one fdtab[] per group, what *really* prevents us from using the
same FD in multiple groups ? _fd_delete_orphan() and fd_update_events()
need to check for no-thread usage before closing the FD. This could be
a limiting factor. Enabling could require waking every poller.

Shouldn't we remerge fdinfo[] with fdtab[] (one pointer + one int/short,
used only during creation and close) ?

Another problem: if we have one fdtab[] per TG, disabling/enabling an FD
(e.g. pause/resume on a listener) becomes difficult if the FD is not
necessarily on the current TG. We'll then need a way to figure that out.
It sounds like FDs from listeners and receivers are very specific and
suffer from problems that all the other ones under high load do not.
Maybe something specific ought to be done for them, if we can guarantee
there is no risk of accidental reuse (e.g. locate the TG info in the
receiver and have a "MT" bit in the FD's flags). The risk is always that
a close() can result in an instant pop-up of the same FD on any other
thread of the same process.

Observations: right now fdtab[].thread_mask more or less corresponds to a
declaration of interest; it's very close to meaning "active per thread". It
is in fact located in the FD while it ought to do nothing there: it should
live where the FD is used, since it rules accesses to a shared resource that
is not the FD itself but what uses it. Indeed, if neither polled_mask nor
running_mask has a thread's bit, the FD is unknown to that thread and the
element using it may only be reached from above, not from the FD. As such we
ought to have a thread_mask on a listener and another one on connections;
these would indicate who uses them. A takeover could then be simplified
(atomically set exclusivity on the FD's running_mask; upon success, take
over the connection, then clear the running mask). The change probably ought
to be performed at the connection level first, not the FD level, by the way.
But running and polled are the two relevant elements: one indicates userland
knowledge, the other one kernel knowledge. For listeners there's no
exclusivity so it's a bit different, but the rule remains the same: we don't
have to know which threads are *interested* in the FD, only its holder.

Not exact in fact; see the FD notes below.
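Setting aside the caveats from the FD notes below, a hedged sketch of that
simplified takeover sequence, assuming the usual HA_ATOMIC_* helpers and the
existing fdtab[].running_mask field (the helper name is illustrative):

    /* try to take exclusive ownership of <fd>: returns 0 on success,
     * -1 if another thread is currently running on it.
     */
    static inline int fd_takeover_excl(int fd, unsigned long tid_bit)
    {
        unsigned long expected = 0;

        if (!HA_ATOMIC_CAS(&fdtab[fd].running_mask, &expected, tid_bit))
            return -1;      /* someone else is running on this FD */

        /* ... take over the connection attached to this FD here ... */

        HA_ATOMIC_AND(&fdtab[fd].running_mask, ~tid_bit);
        return 0;
    }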

activity
--------
There should be one activity array per thread group. The dump should
simply scan them all since the cumulated values are not very important
anyway.

applets
-------
They use tid_bit only for the task. It looks like the appctx's thread_mask
is never used (now removed). Furthermore, it looks like the argument is
*always* tid_bit.

CPU binding
-----------
This is going to be tough. We will need to detect that threads overlap
and are not bound (i.e. all threads on the same mask). In this case, if the
number of threads is higher than the number of threads per physical socket,
one must try hard to evenly spread them among physical sockets (e.g. one
thread group per physical socket) and start as many threads as needed on
each, bound to all threads/cores of each socket. If there is a single
socket, the same job may be done based on L3 caches. Maybe it could always
be done based on L3 caches. The difficulty behind this is the number of
sockets to be bound: it is not possible to bind several FDs per listener.
Maybe with a new bind keyword we could automatically duplicate listeners ?
In any case, the initially bound cpumap (via taskset) must always be
respected, and everything should probably start from there.

Frontend binding
----------------
We'll have to define a list of threads and thread-groups per frontend.
Probably having a group mask and the same thread-mask for each group would
suffice.

Threads should have two numbers:
  - the per-process number (e.g. 1..256)
  - the per-group number (1..64)

The "bind-thread" lines ought to use the following syntax:
  - bind 45           ## bind to process' thread 45
  - bind 1/45         ## bind to group 1's thread 45
  - bind all/45       ## bind to thread 45 in each group
  - bind 1/all        ## bind to all threads in group 1
  - bind all          ## bind to all threads
  - bind all/all      ## bind to all threads in all groups (=all)
  - bind 1/65         ## rejected
  - bind 65           ## OK if there are enough threads
  - bind 35-45        ## depends. Rejected if it crosses a group boundary.

The global directive "nbthread 28" means 28 total threads for the process.
The number of groups will sub-divide this. E.g. 4 groups will very likely
imply 7 threads per group. At the beginning, the number of groups should be
set manually since it implies config adjustments to bind lines.

There should be a trivial way to map a global thread to a group and local ID
and to do the opposite.
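For example, assuming an even split of the process-wide threads into groups
(the helper names below are illustrative, not an existing API), the mapping
could be as trivial as:

    /* map a global thread ID to (group, local ID) and back, assuming
     * <nbthread> is evenly divided into <nbtgroups> groups.
     */
    static inline void thr_to_grp(int thr, int nbthread, int nbtgroups,
                                  int *tgid, int *ltid)
    {
        int per_grp = nbthread / nbtgroups;

        *tgid = thr / per_grp + 1;   /* groups are numbered from 1 */
        *ltid = thr % per_grp;       /* local ID within the group  */
    }

    static inline int grp_to_thr(int tgid, int ltid, int nbthread, int nbtgroups)
    {
        return (tgid - 1) * (nbthread / nbtgroups) + ltid;
    }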

Panic handler + watchdog
------------------------
Will probably depend on what's done for thread_isolate().

Per-thread arrays inside structures
-----------------------------------
- listeners have a thr_conn[] array, currently limited to MAX_THREADS. Should
  we simply bump the limit ?
- same for servers with idle connections.
  => doesn't seem very practical.
- another solution might be to point to dynamically allocated arrays of
  arrays (e.g. nbthread * nbgroup) or a first level per group and a second
  per thread.
  => dynamic allocation based on the global number (sketched below)
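A minimal sketch of that conclusion, using a listener's per-thread connection
counters as an example (the structure and helper names are assumptions):

    #include <stdlib.h>

    struct li_per_thr {
        unsigned int nb_conn;   /* connections accepted by this thread */
    };

    /* allocate one slot per configured thread instead of MAX_THREADS */
    static struct li_per_thr *li_alloc_per_thr(int global_nbthread)
    {
        return calloc(global_nbthread, sizeof(struct li_per_thr));
    }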

Other
-----
- what about dynamic thread start/stop (e.g. for containers/VMs) ?
  E.g. if we decide to start $MANY threads in 4 groups, and only use
  one, in the end it will not be possible to use less than one thread
  per group, and at most 64 will be present in each group.


FD Notes
--------
  - updt_fd_polling() uses thread_mask to figure out where to send the
    update, the local list or a shared list, and which bits to set in
    update_mask. This could be changed so that it takes the update mask as
    an argument. The call from the poller's fork would just have to
    broadcast everywhere.

  - pollers use it to figure out whether or not they're concerned by the
    activity update. This looks important as otherwise we could re-enable
    polling on an FD that changed to another thread.

  - thread_mask being a per-thread active mask looks more exact and is
    precisely used this way by _update_fd(). In this case using it instead
    of running_mask to gauge a change, or temporarily locking it during a
    removal, could make sense.

  - running should be conditioned by the thread; polled should not (since it
    may be deferred or migrated). In this case testing thread_mask can be
    enough most of the time, but this requires synchronization that will
    have to be extended to the tgid. But migration seems a different beast
    that we shouldn't care about here: if first performed at the higher
    level it ought to be safe.

In practice the update_mask can be dropped to zero by the first fd_delete()
as the only authority allowed to fd_delete() is *the* owner, and as soon as
all running_mask bits are gone, the FD will be closed, hence removed from
all pollers. This will be the only way to make sure that update_mask always
refers to the current tgid.

However, it may happen that a takeover within the same group causes a thread
to read the update_mask late, while the FD is being wiped by another thread.
That other thread may close it, causing another thread in another group to
catch it, change the tgid and start to update the update_mask. This means
that it would be possible for a thread entering do_poll() to see the correct
tgid, then the FD would be closed, reopened and reassigned to another tgid,
and the thread would see its bit in the update_mask and be confused. Right
now this should already happen when the update_mask is not cleared, except
that upon wakeup a migration would be detected and that would be all.

Thus we might need to set the running bit to prevent the FD from migrating
before reading update_mask, which also implies closing on
fd_clr_running() == 0 :-(

Also, even fd_update_events() leaves a risk of updating update_mask after
clearing running, thus affecting the wrong one. Probably update_mask should
be updated before clearing running_mask there. Also, how about not creating
an update on a close ? Not trivial if done before running, unless
thread_mask==0.

###########################################################

Current state:


Mux / takeover / fd_delete() code                |||          poller code
-------------------------------------------------|||---------------------------------------------------
                                                 \|/
mux_takeover():                                   | fd_set_running():
   if (fd_takeover()<0)                           |    old = {running, thread};
      return fail;                                |    new = {tid_bit, tid_bit};
   ...                                            |
fd_takeover():                                    |    do {
   atomic_or(running, tid_bit);                   |       if (!(old.thread & tid_bit))
   old = {running, thread};                       |          return -1;
   new = {tid_bit, tid_bit};                      |       new = { running | tid_bit, old.thread }
   if (owner != expected) {                       |    } while (!dwcas({running, thread}, &old, &new));
      atomic_and(running, ~tid_bit);              |
      return -1; // fail                          | fd_clr_running():
   }                                              |    return atomic_and_fetch(running, ~tid_bit);
                                                  |
   while (old == {tid_bit, !=0 })                 | poll():
      if (dwcas({running, thread}, &old, &new)) { |    if (!owner)
         atomic_and(running, ~tid_bit);           |       continue;
         return 0; // success                     |
      }                                           |    if (!(thread_mask & tid_bit)) {
   }                                              |       epoll_ctl_del();
                                                  |       continue;
   atomic_and(running, ~tid_bit);                 |    }
   return -1; // fail                             |
                                                  |    // via fd_update_events()
fd_delete():                                      | if (fd_set_running() != -1) {
   atomic_or(running, tid_bit);                   |    iocb();
   atomic_store(thread, 0);                       |    if (fd_clr_running() == 0 && !thread_mask)
   if (fd_clr_running(fd) == 0)                   |       fd_delete_orphan();
      fd_delete_orphan();                         | }


The idle_conns_lock prevents the connection from being *picked* and released
while someone else is reading it. What it does is guarantee that on idle
connections, the caller of the IOCB will not dereference the task's context
while the connection is still in the idle list, since it might be picked and
then freed at the same instant by another thread. As soon as the IOCB manages
to get that lock, it removes the connection from the list so that it cannot
be taken over anymore. Conversely, the mux's takeover() code runs under that
lock so that if it frees the connection and task, this will appear atomic
to the IOCB. The timeout task (which is another entry point for connection
deletion) does the same. Thus, when coming from the low level (I/O or
timeout):
  - the task always exists, but the ctx checked under the lock validates it;
    removing the conn from the list prevents takeover().
  - t->context is stable, except during changes under the takeover lock. So
    h2_timeout_task may well run on a different thread than h2_io_cb().

Coming from the top:
  - takeover() done under the lock clears the task's ctx and possibly closes
    the FD (unless some running bit remains present).

Unlikely but currently possible situations:
  - multiple pollers (up to N) may have an idle connection's FD being
    polled, if the connection was passed from thread to thread. The first
    event on the connection would wake all of them. Most of them would
    see fdtab[].owner set (the late ones might miss it). All but one would
    see that their bit is missing from fdtab[].thread_mask and give up.
    However, just after this test, others might take over the connection,
    so in practice, if terribly unlucky, all but one could see their bit in
    thread_mask just before it gets removed, all of them set their bit
    in running_mask, and all of them call iocb() (sock_conn_iocb()).
    Thus all of them dereference the connection and touch the subscriber
    with no protection, then end up in conn_notify_mux(), which will call
    the mux's wake().

  - multiple pollers (up to N-1) might still be in fd_update_events()
    manipulating fdtab[].state. The cause is that the "locked" variable
    is determined by atleast2(thread_mask) but that thread_mask is read
    at a random instant (i.e. it may be stolen by another one during a
    takeover) since we don't yet hold running to prevent this from being
    done. Thus we can arrive here with thread_mask==something_else (1 bit),
    locked==0 and fdtab[].state assigned non-atomically.

  - it looks like nothing prevents h2_release() from being called on a
    thread (e.g. from the top or a task timeout) while sock_conn_iocb()
    dereferences the connection on another thread. Those killing the
    connection don't yet consider the fact that it's an FD that others
    might currently be waking up on.

###################

Problem with a counter:

A users count doesn't say who's using the FD, and two users can do the same
close in turn. The thread_mask should define who's responsible for closing
the FD, and all those with a bit in it ought to do it.


2021-08-25 - update with minimal locking on tgid value
==========

  - tgid + refcount at once using CAS
  - idle_conns lock during updates
  - update:
      if tgid differs => close happened, thus drop update
      otherwise normal stuff. Lock tgid until running if needed.
  - poll report:
      if tgid differs => closed
      if thread differs => stop polling (migrated)
      keep tgid lock until running
  - test on thread_id (see the C sketch below):
      if ((xadd(&tgid,65536) & 0xffff) != my_tgid) {
         // was closed
         sub(&tgid, 65536)
         return -1
      }
      if !(thread_id & tid_bit) => migrated/closed
      set_running()
      sub(tgid,65536)
  - note: either fd_insert() or the final close() ought to set
    polled and update to 0.
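A rough C rendering of the "test on thread_id" step above, assuming the FD's
tgid word keeps the group ID in its lower 16 bits and a reference count in
its upper 16 bits, and assuming the usual HA_ATOMIC_* helpers (field and
helper names are illustrative):

    /* take a reference on the FD's tgid if it still matches <my_tgid>:
     * returns 1 on success (reference held), 0 if the FD was closed or
     * moved to another group in the meantime.
     */
    static inline int fd_take_tgid_if(volatile unsigned int *tgid,
                                      unsigned int my_tgid)
    {
        unsigned int old = HA_ATOMIC_FETCH_ADD(tgid, 0x10000);

        if ((old & 0xffff) != my_tgid) {
            HA_ATOMIC_SUB(tgid, 0x10000);   /* was closed: drop the reference */
            return 0;
        }
        return 1;
    }

    static inline void fd_drop_tgid(volatile unsigned int *tgid)
    {
        HA_ATOMIC_SUB(tgid, 0x10000);
    }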

2021-09-13 - tid / tgroups etc.
==========

  - tid currently is the thread's global ID. It's essentially used as an
    index for arrays. It must be clearly stated that it works this way.

  - tid_bit makes no sense process-wide, so it must be redefined to represent
    the thread's tid within its group. The name is not very welcome though,
    but there are 286 uses of it that are not going to be changed that fast.

  - just like "ti" is the thread_info, we need to have "tg" pointing to the
    thread_group (a sketch follows after this list).

  - other less commonly used elements should be retrieved from ti->xxx, e.g.
    the thread's local ID.

  - lock debugging must reproduce the tgid.

  - an offset might be placed in the tgroup so that even with 64 threads max
    we could have completely separate tid_bits over several groups.

2021-09-15 - bind + listen() + rx
==========

  - thread_mask (in bind_conf->rx_settings) should become an array of
    MAX_TGROUP longs.
  - when parsing "thread 123" or "thread 2/37", the proper bit is set,
    assuming the array is either a contiguous bitfield or a tgroup array.
    An option RX_O_THR_PER_GRP or RX_O_THR_PER_PROC is set depending on
    how the thread number was parsed, so that we reject mixes.
  - end of parsing: entries translated to the cleanest form (to be determined)
  - binding: for each socket()/bind()/listen()... just perform one extra dup()
    for each tgroup and store the multiple FDs into an FD array indexed on
    MAX_TGROUP (sketched below). => allows using one FD per tgroup for the
    same socket, hence having multiple entries in all tgroup pollers without
    requiring the user to duplicate the bind line.
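A rough sketch of the per-group dup() idea above; the function and array
names are illustrative only, and error handling is reduced to the minimum:

    #include <sys/socket.h>
    #include <unistd.h>

    /* create one listening socket, then duplicate it so that each thread
     * group polls its own FD for the same socket; <fds> is assumed to hold
     * MAX_TGROUP entries stored in the receiver.
     */
    static int listener_bind_per_group(const struct sockaddr *addr, socklen_t len,
                                       int fds[], int nbtgroups)
    {
        int fd = socket(addr->sa_family, SOCK_STREAM, 0);

        if (fd < 0 || bind(fd, addr, len) < 0 || listen(fd, 1024) < 0)
            return -1;

        fds[0] = fd;
        for (int grp = 1; grp < nbtgroups; grp++) {
            /* same socket, one distinct FD per tgroup, so each group's
             * pollers get their own fdtab entry without the user having to
             * duplicate the bind line.
             */
            fds[grp] = dup(fd);
            if (fds[grp] < 0)
                return -1;
        }
        return 0;
    }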

2021-09-15 - global thread masks
==========

Some global variables currently expect to know about thread IDs and it's
uncertain what must be done with them:
  - global_tasks_mask /* Mask of threads with tasks in the global runqueue */
    => touched under the rq lock. Change it per-group ? What exact use is
       made of it ?

  - sleeping_thread_mask /* Threads that are about to sleep in poll() */
    => seems that it can be made per group

  - all_threads_mask: a bit complicated, derived from nbthread and used with
    masks and with my_ffsl() to wake threads up. Should probably be per-group
    but we might miss something for global.

  - stopping_thread_mask: used in combination with all_threads_mask, should
    move per-group.

  - threads_harmless_mask: indicates all threads that are currently harmless
    in that they promise not to access a shared resource. Must be made
    per-group, but then we'll likely need a second stage to have the harmless
    groups mask (a sketch follows after this list). threads_idle_mask,
    threads_sync_mask, threads_want_rdv_mask go with the one above. Maybe the
    right approach will be to request harmless on a group mask so that we can
    detect collisions and arbitrate them like today, but on top of this it
    becomes possible to request harmless only on the local group if desired.
    The subtlety is that requesting harmless at the group level does not mean
    it's achieved, since the requester cannot vouch for the other ones in the
    same group.
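A possible shape for that second stage; all names below are assumptions, and
maintaining the aggregate correctly is precisely the subtlety mentioned
above. Each group would keep its own harmless mask while a global read_mostly
mask keeps one bit per fully-harmless group, so the isolation requester only
scans a single word on the fast path:

    extern volatile unsigned long tgrp_threads_harmless[MAX_TGROUP]; /* per group */
    extern volatile unsigned long tgroups_harmless;  /* one bit per all-harmless group */

    /* isolation requester's fast path: are all groups entirely harmless ? */
    static inline int all_tgroups_harmless(int nbtgroups)
    {
        unsigned long all = (nbtgroups >= LONGBITS) ? ~0UL : (1UL << nbtgroups) - 1;

        return (HA_ATOMIC_LOAD(&tgroups_harmless) & all) == all;
    }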

In addition, some variables are related to the global runqueue:
  __decl_aligned_spinlock(rq_lock);   /* spin lock related to run queue */
  struct eb_root rqueue;              /* tree constituting the global run queue, accessed under rq_lock */
  unsigned int grq_total;             /* total number of entries in the global run queue, atomic */
  static unsigned int global_rqueue_ticks;  /* insertion count in the grq, use rq_lock */

And others to the global wait queue:
  __decl_aligned_rwlock(wq_lock);     /* RW lock related to the wait queue */
  struct eb_root timers;              /* sorted timers tree, global, accessed under wq_lock */


2021-09-29 - group designation and masks
==========

Neither FDs nor tasks will belong to incomplete subsets of threads spanning
over multiple thread groups. In addition there may be a difference between
configuration and operation (for FDs). This allows us to fix the following
rules:

  group  mask   description
    0      0    bind_conf: groups & thread not set. bind to any/all
                task: it would be nice to mean "run on the same as the caller".

    0     xxx   bind_conf: thread set but not group: thread IDs are global
                FD/task: group 0, mask xxx

   G>0     0    bind_conf: only group is set: bind to all threads of group G
                FD/task: mask 0 not permitted (= not owned). May be used to
                mention "any thread of this group", though already covered by
                G/xxx like today.

   G>0    xxx   bind_conf: Bind to these threads of this group
                FD/task: group G, mask xxx

It looks like keeping groups starting at zero internally complicates
everything though. But forcing them to start at 1 might also require that we
rescan all tasks to replace 0 with 1 upon startup. This would also allow
group 0 to be special and be used as the default group for any new thread
creation, so that group0.count would keep the number of unassigned threads.
Let's try:

  group  mask   description
    0      0    bind_conf: groups & thread not set. bind to any/all
                task: "run on the same group & thread as the caller".

    0     xxx   bind_conf: thread set but not group: thread IDs are global
                FD/task: invalid. Or maybe for a task we could use this to
                mean "run on current group, thread XXX", which would cover
                the need for health checks (g/t 0/0 while sleeping, 0/xxx
                while running) and have wake_expired_tasks() detect 0/0 and
                wake them up to a random group.

   G>0     0    bind_conf: only group is set: bind to all threads of group G
                FD/task: mask 0 not permitted (= not owned). May be used to
                mention "any thread of this group", though already covered by
                G/xxx like today.

   G>0    xxx   bind_conf: Bind to these threads of this group
                FD/task: group G, mask xxx

With a single group declared in the config, group 0 would implicitly find the
first one.


The problem with the approach above is that a task queued in one group+thread's
wait queue could very well receive a signal from another thread and/or group,
and that there is no indication about where the task is queued, nor how to
dequeue it. Thus it seems that it's up to the application itself to unbind/
rebind a task. This contradicts the principle of leaving a task waiting in a
wait queue and waking it anywhere.

Another possibility might be to decide that a task having a defined group but
a mask of zero is shared and will always be queued into its group's wait queue.
However, upon expiry, the scheduler would notice the thread-mask 0 and would
broadcast it to any group.

Right now in the code we have:
  - 18 calls to task_new(tid_bit)
  - 18 calls to task_new(MAX_THREADS_MASK)
  - 2 calls with a single bit

Thus it looks like "task_new_anywhere()", "task_new_on()" and
"task_new_here()" would be sufficient.