Thread groups
#############

2021-07-13 - first draft
==========

Objective
---------
- support multi-socket systems with limited cache-line bouncing between
  physical CPUs and/or L3 caches

- overcome the 64-thread limitation

- support a reasonable number of groups, e.g. if modern CPUs arrive with
  core complexes made of 8 cores, 8 complexes per chip and 2 chips in a
  system, it makes sense to support 16 groups.


Non-objective
-------------
- no need to optimize to the last possible cycle. E.g. some algorithms like
  leastconn will remain shared across all threads, servers will keep a
  single queue, etc. Global information remains global.

- no stubborn enforcement of FD sharing. Per-server idle connection lists
  can become per-group; listeners can (and probably should) be per-group.
  Other mechanisms (like SO_REUSEPORT) can already overcome this.

- no need to go beyond 64 threads per group.


Identified tasks
================

General
-------
Everywhere tid_bit is used, it absolutely needs to be complemented with a
group (either the current one or a specific one). Thread debugging will need
to be extended as masks are used extensively.


Scheduler
---------
The global run queue and global wait queue must become per-group. This
means that a task may only be queued into one of them at a time. It
suggests that tasks may only belong to a given group, but doing so would
bring back the original issue that remote wake-ups become impossible.

We could probably ignore the group if we don't need to set the thread mask
in the task. The task's thread_mask is never manipulated using atomics, so
it's safe to complement it with a group.

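A minimal sketch of what pairing a group with the existing thread_mask could
look like; the tgid field and the helper name below are hypothetical, and the
struct is shown in reduced form:

    /* hypothetical sketch only: the task carries both its group and its
     * per-group thread mask; neither field needs atomic manipulation.
     */
    struct task {
        unsigned short tgid;         /* thread group the task runs on, 0 = unassigned */
        unsigned long thread_mask;   /* allowed threads within that group */
        /* ... rest of the existing struct task ... */
    };

    /* bind a task to group <grp>, threads <mask> of that group */
    static inline void task_set_group_affinity(struct task *t,
                                               unsigned short grp,
                                               unsigned long mask)
    {
        t->tgid = grp;
        t->thread_mask = mask;
    }
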
The sleeping_thread_mask should become per-group. This implies that a wakeup
may only be performed on the assigned group: either the task is not assigned
yet, in which case it will be self-assigned (like today), or the thread group
to be woken up is retrieved from the task itself.

Task creation currently takes a thread mask of either tid_bit, a specific
mask, or MAX_THREADS_MASK. How do we create a task able to run anywhere
(checks, Lua, ...) ?

Profiling
---------
There should be one task_profiling_mask per thread group. Enabling or
disabling profiling should be done per group (possibly by iterating over
all groups), as sketched below.

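A rough sketch of a per-group toggle, assuming hypothetical tg_ctx[] contexts
holding one task_profiling_mask and a threads_enabled mask per group, and an
nbtgroups count (names are illustrative, not the current API):

    /* illustrative only: enable/disable task profiling on every group */
    void set_task_profiling(int on)
    {
        int grp;

        for (grp = 0; grp < nbtgroups; grp++) {
            unsigned long mask = on ? tg_ctx[grp].threads_enabled : 0;

            HA_ATOMIC_STORE(&tg_ctx[grp].task_profiling_mask, mask);
        }
    }
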
Thread isolation
----------------
Thread isolation is difficult as we solely rely on atomic ops to figure out
who can complete. Such an operation is rare, so maybe we could have a global
read_mostly flag containing a mask of the groups that require isolation.
Then threads_want_rdv_mask etc. can become per-group. However, setting and
clearing the bits will become problematic as this will happen in two steps,
hence will require careful ordering.

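A possible shape for the fast-path test, assuming a hypothetical read_mostly
isolating_groups mask plus per-group rendezvous masks in a tg_ctx[] array
(the two-step set/clear ordering concern above still applies):

    /* illustrative: cheap global summary first, per-group detail second */
    extern unsigned long isolating_groups;  /* read_mostly: groups requesting isolation */

    static inline int rdv_requested(int my_tgid)
    {
        if (!HA_ATOMIC_LOAD(&isolating_groups))
            return 0;               /* nobody requests isolation anywhere */
        return HA_ATOMIC_LOAD(&tg_ctx[my_tgid].threads_want_rdv_mask) != 0;
    }
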
FD
--
tid_bit is used in a number of atomic ops on the running_mask. If we have
one fdtab[] per group, the mask implies that it's within the group.
Theoretically we should never face a situation where an FD is reported or
manipulated for a remote group.

There will still be one poller per thread, except that this time all
operations will be related to the current thread_group. No FD may appear
in two thread_groups at once, but we can probably not prevent that (e.g.
delayed close and reopen). Should we instead have a single shared fdtab[]
(less memory usage also) ? Maybe adding the group in the fdtab entry would
work, but when does a thread know it can leave it ? Currently this is
solved by running_mask and by update_mask. Having two tables could help
with this (each table sees the FD in a different group with a different
mask) but this looks overkill.

There's polled_mask[] which needs to be decided upon. Probably it should be
doubled as well. Note, polled_mask left fdtab[] for cache-line alignment
reasons in commit cb92f5cae4.

If we have one fdtab[] per group, what *really* prevents the same FD from
being used in multiple groups ? _fd_delete_orphan() and fd_update_events()
need to check for no-thread usage before closing the FD. This could be
a limiting factor. Enabling could require waking every poller.

Shouldn't we remerge fdinfo[] with fdtab[] (one pointer + one int/short,
used only during creation and close) ?

Another problem: if we have one fdtab[] per TG, disabling/enabling an FD
(e.g. pause/resume on a listener) can become a problem if it's not necessarily
on the current TG. We'll then need a way to figure that one out. It sounds like
FDs from listeners and receivers are very specific and suffer from problems
that all other ones under high load do not suffer from. Maybe something
specific ought to be done for them, if we can guarantee there is no risk of
accidental reuse (e.g. locate the TG info in the receiver and have an "MT" bit
in the FD's flags). The risk is always that a close() can result in the same
FD instantly popping up on any other thread of the same process.

Observations: right now fdtab[].thread_mask more or less corresponds to a
declaration of interest; it's very close to meaning "active per thread". It is
in fact located in the FD while it ought to do nothing there, as it should be
where the FD is used, since it rules accesses to a shared resource that is not
the FD but what uses it. Indeed, if neither polled_mask nor running_mask has
a thread's bit, the FD is unknown to that thread and the element using it may
only be reached from above and not from the FD. As such we ought to have a
thread_mask on a listener and another one on connections. These ones will
indicate who uses them. A takeover could then be simplified: atomically set
exclusivity on the FD's running_mask; upon success, take over the connection,
then clear the running mask. The change probably ought to be performed at the
connection level first, not the FD level, by the way. But running and polled
are the two relevant elements: one indicates userland knowledge, the other one
kernel knowledge. For listeners there's no exclusivity so it's a bit different,
but the rule remains the same: we don't have to know what threads are
*interested* in the FD, only its holder.

Not exact in fact, see the FD notes below.
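
A minimal sketch of the simplified takeover described above, built on the
existing running_mask; the fd_claim_exclusive() helper and the overall flow
are illustrative, not the current implementation:

    /* illustrative: take exclusive ownership of an FD via running_mask.
     * Only succeeds if nobody else is currently running on that FD.
     */
    static inline int fd_claim_exclusive(int fd)
    {
        unsigned long old = 0;

        /* 0 -> tid_bit: fails if any other bit is already set */
        return HA_ATOMIC_CAS(&fdtab[fd].running_mask, &old, tid_bit);
    }

    /* takeover sketch: claim the FD, steal the connection, release */
    static int conn_takeover_sketch(struct connection *conn, int fd)
    {
        if (!fd_claim_exclusive(fd))
            return -1;                              /* someone else is using it */
        /* ... rebind <conn> to the local thread/group here ... */
        HA_ATOMIC_AND(&fdtab[fd].running_mask, ~tid_bit);
        return 0;
    }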

activity
--------
There should be one activity array per thread group. The dump should
simply scan them all since the cumulated values are not very important
anyway.

applets
-------
They use tid_bit only for the task. It looks like the appctx's thread_mask
is never used (now removed). Furthermore, it looks like the argument is
*always* tid_bit.

CPU binding
-----------
This is going to be tough. We will need to detect that threads overlap
and are not bound (i.e. all threads on the same mask). In this case, if the
number of threads is higher than the number of threads per physical socket,
one must try hard to spread them evenly across physical sockets (e.g. one
thread group per physical socket) and start as many threads as needed on each,
bound to all threads/cores of each socket. If there is a single socket, the
same job may be done based on L3 caches. Maybe it could always be done based
on L3 caches. The difficulty behind this is the number of sockets to be bound:
it is not possible to bind several FDs per listener. Maybe with a new bind
keyword we can imagine automatically duplicating listeners ? In any case,
the initially bound cpumap (via taskset) must always be respected, and
everything should probably start from there.

Frontend binding
----------------
We'll have to define a list of threads and thread-groups per frontend.
Having a group mask and the same thread-mask for each group would probably
suffice.

Threads should have two numbers:
  - the per-process number (e.g. 1..256)
  - the per-group number (1..64)

The "bind-thread" lines ought to use the following syntax:
  - bind 45        ## bind to the process' thread 45
  - bind 1/45      ## bind to group 1's thread 45
  - bind all/45    ## bind to thread 45 in each group
  - bind 1/all     ## bind to all threads in group 1
  - bind all       ## bind to all threads
  - bind all/all   ## bind to all threads in all groups (=all)
  - bind 1/65      ## rejected
  - bind 65        ## OK if there are enough threads
  - bind 35-45     ## depends; rejected if it crosses a group boundary

The global directive "nbthread 28" means 28 total threads for the process. The
number of groups will subdivide this. E.g. 4 groups will very likely imply 7
threads per group. At the beginning, the number of groups should be set
manually since it implies config adjustments to bind lines.

There should be a trivial way to map a global thread to a group and local ID
and to do the opposite (see the sketch below).

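A minimal sketch of that mapping, assuming equally sized groups and 1-based
numbering as in the examples above (threads_per_group would simply be nbthread
divided by the number of groups):

    /* illustrative only: global thread ID <-> (group, local ID), both 1-based,
     * assuming all groups hold the same number of threads.
     */
    struct thr_loc { int grp; int ltid; };

    static inline struct thr_loc thr_to_loc(int global_id, int threads_per_group)
    {
        struct thr_loc loc;

        loc.grp  = (global_id - 1) / threads_per_group + 1;
        loc.ltid = (global_id - 1) % threads_per_group + 1;
        return loc;
    }

    static inline int loc_to_thr(struct thr_loc loc, int threads_per_group)
    {
        return (loc.grp - 1) * threads_per_group + loc.ltid;
    }

    /* e.g. "nbthread 28" in 4 groups (7 per group): global thread 10 maps to
     * group 2, local thread 3, and back.
     */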

Panic handler + watchdog
------------------------
Will probably depend on what's done for thread_isolate.

Per-thread arrays inside structures
-----------------------------------
- listeners have a thr_conn[] array, currently limited to MAX_THREADS. Should
  we simply bump the limit ?
- same for servers with idle connections.
  => doesn't seem very practical.
- another solution might be to point to dynamically allocated arrays of
  arrays (e.g. nbthread * nbgroup) or a first level per group and a second
  per thread.
  => dynamic allocation based on the global number

Other
-----
- what about dynamic thread start/stop (e.g. for containers/VMs) ?
  E.g. if we decide to start $MANY threads in 4 groups, and only use
  one, in the end it will not be possible to use fewer than one thread
  per group, and at most 64 will be present in each group.


FD Notes
--------
- updt_fd_polling() uses thread_mask to figure out where to send the update,
  the local list or a shared list, and which bits to set in update_mask.
  This could be changed so that it takes the update mask as an argument
  (see the sketch after this list). The call from the poller's fork would
  just have to broadcast everywhere.

- pollers use it to figure out whether they're concerned or not by the
  activity update. This looks important as otherwise we could re-enable
  polling on an FD that moved to another thread.

- thread_mask being a per-thread active mask looks more exact and is
  precisely used this way by _update_fd(). In this case using it instead
  of running_mask to gauge a change, or to temporarily lock it during a
  removal, could make sense.

- running should be conditioned by thread. Polled not (since deferred
  or migrated). In this case testing thread_mask can be enough most of
  the time, but this requires synchronization that will have to be
  extended to the tgid. But migration seems a different beast that we
  shouldn't care about here: if first performed at the higher level it
  ought to be safe.
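
A sketch of what passing the mask explicitly could look like; the two-argument
form below is hypothetical (today updt_fd_polling() derives the target itself
from thread_mask):

    /* illustrative: let the caller decide which threads must see the update */
    static void updt_fd_polling_mask(const int fd, unsigned long update_mask)
    {
        if (update_mask == tid_bit) {
            /* strictly local: use this thread's local update list */
            fd_updt[fd_nbupdt++] = fd;
        }
        else {
            /* shared: mark the target threads, queue on the shared update
             * list and wake the relevant pollers
             */
            HA_ATOMIC_OR(&fdtab[fd].update_mask, update_mask);
            /* ... enqueue + wakeup elided ... */
        }
    }

    /* the poller's fork handler would then simply broadcast:
     *     updt_fd_polling_mask(fd, all_threads_mask);
     */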

In practice the update_mask can be dropped to zero by the first fd_delete(),
since the only authority allowed to call fd_delete() is *the* owner, and as
soon as all running_mask bits are gone, the FD will be closed, hence removed
from all pollers. This will be the only way to make sure that update_mask
always refers to the current tgid.

However, it may happen that a takeover within the same group causes a thread
to read the update_mask late, while the FD is being wiped by another thread.
That other thread may close it, causing another thread in another group to
catch it, change the tgid and start to update the update_mask. This means
that it would be possible for a thread entering do_poll() to see the correct
tgid, then the FD would be closed, reopened and reassigned to another tgid,
and the thread would see its bit in the update_mask, being confused. Right
now this should already happen when the update_mask is not cleared, except
that upon wakeup a migration would be detected and that would be all.

Thus we might need to set the running bit to prevent the FD from migrating
before reading update_mask, which also implies closing on fd_clr_running() == 0 :-(

Also even fd_update_events() leaves a risk of updating update_mask after
clearing running, thus affecting the wrong one. Probably update_mask should
be updated before clearing running_mask there. Also, how about not creating
an update on a close ? Not trivial if done before running, unless
thread_mask == 0.
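
A minimal ordering sketch for the end of fd_update_events() under that
assumption; purely illustrative, the real function does much more around
these two steps:

    /* illustrative: publish the update before dropping the running bit, so
     * that whoever observes running cleared also observes update_mask.
     */
    HA_ATOMIC_OR(&fdtab[fd].update_mask, tid_bit);    /* 1: record the update */
    if (fd_clr_running(fd) == 0 && !fdtab[fd].thread_mask)
        _fd_delete_orphan(fd);                        /* 2: last one out closes */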

###########################################################

Current state:


Mux / takeover / fd_delete() code            ||| poller code
---------------------------------------------|||---------------------------------------------------
                                             \|/
mux_takeover():                               | fd_set_running():
    if (fd_takeover() < 0)                    |     old = {running, thread};
        return fail;                          |     new = {tid_bit, tid_bit};
    ...                                       |
fd_takeover():                                |     do {
    atomic_or(running, tid_bit);              |         if (!(old.thread & tid_bit))
    old = {running, thread};                  |             return -1;
    new = {tid_bit, tid_bit};                 |         new = { running | tid_bit, old.thread }
    if (owner != expected) {                  |     } while (!dwcas({running, thread}, &old, &new));
        atomic_and(running, ~tid_bit);        |
        return -1; // fail                    | fd_clr_running():
    }                                         |     return atomic_and_fetch(running, ~tid_bit);
                                              |
    while (old == {tid_bit, !=0 })            | poll():
        if (dwcas({running, thread}, &old, &new)) { | if (!owner)
            atomic_and(running, ~tid_bit);    |         continue;
            return 0; // success              |
        }                                     |     if (!(thread_mask & tid_bit)) {
    }                                         |         epoll_ctl_del();
                                              |         continue;
    atomic_and(running, ~tid_bit);            |     }
    return -1; // fail                        |
                                              |     // via fd_update_events()
fd_delete():                                  |     if (fd_set_running() != -1) {
    atomic_or(running, tid_bit);              |         iocb();
    atomic_store(thread, 0);                  |         if (fd_clr_running() == 0 && !thread_mask)
    if (fd_clr_running(fd) == 0)              |             fd_delete_orphan();
        fd_delete_orphan();                   |     }


The idle_conns_lock prevents the connection from being *picked* and released
while someone else is reading it. What it guarantees is that on idle
connections, the caller of the IOCB will not dereference the task's context
while the connection is still in the idle list, since it might be picked and
then freed at the same instant by another thread. As soon as the IOCB manages
to grab that lock, it removes the connection from the list so that it cannot
be taken over anymore. Conversely, the mux's takeover() code runs under that
lock so that if it frees the connection and task, this will appear atomic
to the IOCB. The timeout task (which is another entry point for connection
deletion) does the same. Thus, when coming from the low level (I/O or timeout):
  - the task always exists, but the ctx checked under the lock validates it;
    removing the connection from the list prevents takeover().
  - t->context is stable, except during changes under the takeover lock. So
    h2_timeout_task may well run on a different thread than h2_io_cb().

Coming from the top:
  - takeover() done under the lock clears the task's ctx and possibly closes
    the FD (unless some running bit remains present).

Unlikely but currently possible situations:
  - multiple pollers (up to N) may have an idle connection's FD being
    polled, if the connection was passed from thread to thread. The first
    event on the connection would wake all of them. Most of them would
    see fdtab[].owner set (the late ones might miss it). All but one would
    see that their bit is missing from fdtab[].thread_mask and give up.
    However, just after this test, others might take over the connection,
    so in practice if terribly unlucky, all but one could see their bit in
    thread_mask just before it gets removed, all of them set their bit
    in running_mask, and all of them call iocb() (sock_conn_iocb()).
    Thus all of them dereference the connection and touch the subscriber
    with no protection, then end up in conn_notify_mux() which will call
    the mux's wake().

  - multiple pollers (up to N-1) might still be in fd_update_events()
    manipulating fdtab[].state. The cause is that the "locked" variable
    is determined by atleast2(thread_mask), but thread_mask is read at a
    random instant (i.e. it may be stolen by another one during a
    takeover) since we don't yet hold running to prevent this from being
    done. Thus we can arrive here with thread_mask == something_else (1 bit),
    locked == 0 and fdtab[].state assigned non-atomically.

  - it looks like nothing prevents h2_release() from being called on a
    thread (e.g. from the top or a task timeout) while sock_conn_iocb()
    dereferences the connection on another thread. Those killing the
    connection don't yet consider the fact that it's an FD that others
    might currently be waking up on.

###################

Problem with a counter:

A users count doesn't say who's using the FD, and two users could perform the
same close in turn. The thread_mask should define who's responsible for
closing the FD, and all those with a bit in it ought to do it.


2021-08-25 - update with minimal locking on tgid value
==========

- tgid + refcount at once using CAS
- idle_conns lock during updates
- update:
    if tgid differs => a close happened, thus drop the update
    otherwise normal stuff. Lock tgid until running if needed.
- poll report:
    if tgid differs => closed
    if thread differs => stop polling (migrated)
    keep the tgid lock until running
- test on thread_id (see the sketch below):
    if (xadd(&tgid, 65536) != my_tgid) {
        // was closed
        sub(&tgid, 65536);
        return -1;
    }
    if (!(thread_id & tid_bit)) => migrated/closed
    set_running();
    sub(&tgid, 65536);
- note: either fd_insert() or the final close() ought to set
  polled and update to 0.

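A slightly more concrete rendering of the xadd test above, assuming the tgid
and a 16-bit refcount share a single 32-bit word; the fd_grab_tgid() /
fd_drop_tgid() names and the refc_tgid field are illustrative:

    /* illustrative: fdtab[fd].refc_tgid == (refcount << 16) | tgid.
     * Holding a reference pins the tgid: close() must wait for the refcount
     * to drop to zero before the slot may be reassigned to another group.
     */
    static inline int fd_grab_tgid(int fd, unsigned int desired_tgid)
    {
        unsigned int old = HA_ATOMIC_FETCH_ADD(&fdtab[fd].refc_tgid, 0x10000);

        if ((old & 0xffff) == desired_tgid)
            return 1;                              /* still ours, reference held */
        HA_ATOMIC_SUB(&fdtab[fd].refc_tgid, 0x10000);
        return 0;                                  /* closed/reassigned meanwhile */
    }

    static inline void fd_drop_tgid(int fd)
    {
        HA_ATOMIC_SUB(&fdtab[fd].refc_tgid, 0x10000);
    }
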
2021-09-13 - tid / tgroups etc.
==========

- tid currently is the thread's global ID. It's essentially used as an index
  for arrays. It must be clearly stated that it works this way.

- tid_bit makes no sense process-wide, so it must be redefined to represent
  the thread's bit within its group. The name is not very welcome though, but
  there are 286 occurrences of it that are not going to be changed that fast.

- just like "ti" is the thread_info, we need to have "tg" pointing to the
  thread_group.

- other less commonly used elements should be retrieved from ti->xxx, e.g.
  the thread's local ID.

- lock debugging must reproduce the tgid.

- an offset might be placed in the tgroup so that even with 64 threads max
  we could have completely separate tid_bits over several groups (see the
  sketch below).

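A sketch of how the per-thread and per-group identifiers could be laid out,
with hypothetical ltid/ltid_bit names and a per-group base offset as suggested
above:

    /* illustrative layout only: all names below are placeholders */
    struct tgroup_info {
        unsigned int base;                     /* global ID of the group's first thread */
        unsigned int count;                    /* number of threads in this group */
    };

    extern struct tgroup_info tgroup_info[];   /* one entry per thread group */

    THREAD_LOCAL unsigned int  tid;            /* global thread ID, array index */
    THREAD_LOCAL unsigned int  ltid;           /* local (per-group) thread ID */
    THREAD_LOCAL unsigned long ltid_bit;       /* 1UL << ltid, the per-group "tid_bit" */
    THREAD_LOCAL const struct tgroup_info *tg; /* this thread's group, like "ti" */

    /* during thread init:
     *     ltid     = tid - tg->base;
     *     ltid_bit = 1UL << ltid;
     */
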
2021-09-15 - bind + listen() + rx
==========

- thread_mask (in bind_conf->rx_settings) should become an array of
  MAX_TGROUP longs.
- when parsing "thread 123" or "thread 2/37", the proper bit is set,
  assuming the array is either a contiguous bitfield or a tgroup array.
  An option RX_O_THR_PER_GRP or RX_O_THR_PER_PROC is set depending on
  how the thread number was parsed, so that we reject mixes.
- end of parsing: entries translated to the cleanest form (to be determined).
- binding: for each socket()/bind()/listen()... just perform one extra dup()
  for each tgroup and store the multiple FDs into an FD array indexed on
  MAX_TGROUP => this allows using one FD per tgroup for the same socket, hence
  having multiple entries in all tgroup pollers without requiring the user
  to duplicate the bind line (see the sketch below).

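A minimal sketch of the dup() step under the assumptions above; the
per-receiver FD array comes from the note and is not the current layout, and
error handling is reduced to the bare minimum:

    #include <unistd.h>   /* dup() */

    /* illustrative: one listening FD per thread group for the same socket,
     * so that every group's pollers get their own entry to subscribe to.
     */
    static int rx_dup_per_tgroup(int listen_fd, int fds[], int nbtgroups)
    {
        int grp;

        fds[0] = listen_fd;
        for (grp = 1; grp < nbtgroups; grp++) {
            fds[grp] = dup(listen_fd);
            if (fds[grp] < 0)
                return -1;              /* caller is expected to clean up */
        }
        return 0;
    }
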
2021-09-15 - global thread masks
==========

Some global variables currently expect to know about thread IDs and it's
uncertain what must be done with them:

- global_tasks_mask  /* Mask of threads with tasks in the global runqueue */
  => touched under the rq lock. Change it to per-group ? What exact use is
     made of it ?

- sleeping_thread_mask  /* Threads that are about to sleep in poll() */
  => it seems that it can be made per-group

- all_threads_mask: a bit complicated, derived from nbthread and used with
  masks and with my_ffsl() to wake threads up. Should probably be per-group
  but we might miss something for global use.

- stopping_thread_mask: used in combination with all_threads_mask, should
  move per-group.

- threads_harmless_mask: indicates all threads that are currently harmless in
  that they promise not to access a shared resource. Must be made per-group,
  but then we'll likely need a second stage to have the harmless groups mask.
  threads_idle_mask, threads_sync_mask and threads_want_rdv_mask go with the
  one above. Maybe the right approach will be to request harmless on a group
  mask, so that we can detect collisions and arbitrate them like today, but on
  top of this it becomes possible to request harmless only on the local group
  if desired. The subtlety is that requesting harmless at the group level does
  not mean it's achieved, since the requester cannot vouch for the other ones
  in the same group (see the sketch below).
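
A sketch of the second-stage aggregation this would imply, with hypothetical
per-group masks in tg_ctx[] and a derived "harmless groups" summary:

    /* illustrative: a group is harmless only once every enabled thread of
     * that group has set its bit in the group's threads_harmless mask.
     */
    static inline unsigned long harmless_groups_mask(int nbtgroups)
    {
        unsigned long grp_mask = 0;
        int grp;

        for (grp = 0; grp < nbtgroups; grp++) {
            unsigned long enabled  = tg_ctx[grp].threads_enabled;
            unsigned long harmless = HA_ATOMIC_LOAD(&tg_ctx[grp].threads_harmless_mask);

            if ((harmless & enabled) == enabled)
                grp_mask |= 1UL << grp;
        }
        return grp_mask;
    }

    /* the isolation requester then waits until harmless_groups_mask() covers
     * the groups it asked for; the other threads of its own group are the
     * ones it cannot vouch for, which is the subtlety mentioned above.
     */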

In addition, some variables are related to the global runqueue:
  __decl_aligned_spinlock(rq_lock);         /* spin lock related to run queue */
  struct eb_root rqueue;                    /* tree constituting the global run queue, accessed under rq_lock */
  unsigned int grq_total;                   /* total number of entries in the global run queue, atomic */
  static unsigned int global_rqueue_ticks;  /* insertion count in the grq, use rq_lock */

And others to the global wait queue:
  struct eb_root timers;                    /* sorted timers tree, global, accessed under wq_lock */
  __decl_aligned_rwlock(wq_lock);           /* RW lock related to the wait queue */

2021-09-29 - group designation and masks
==========

Neither FDs nor tasks will belong to incomplete subsets of threads spanning
multiple thread groups. In addition there may be a difference between
configuration and operation (for FDs). This allows the following rules to be
fixed:

  group  mask   description
    0      0    bind_conf: group & thread not set. Bind to any/all.
                task: it would be nice to mean "run on the same as the caller".

    0     xxx   bind_conf: thread set but not group: thread IDs are global.
                FD/task: group 0, mask xxx.

   G>0     0    bind_conf: only the group is set: bind to all threads of group G.
                FD/task: mask 0 not permitted (= not owned). May be used to
                mention "any thread of this group", though already covered by
                G/xxx like today.

   G>0    xxx   bind_conf: bind to these threads of this group.
                FD/task: group G, mask xxx.

It looks like keeping groups starting at zero internally complicates everything
though. But forcing them to start at 1 might also require that we rescan all
tasks to replace 0 with 1 upon startup. This would also allow group 0 to be
special and be used as the default group for any new thread creation, so that
group0.count would keep the number of unassigned threads. Let's try:

  group  mask   description
    0      0    bind_conf: group & thread not set. Bind to any/all.
                task: "run on the same group & thread as the caller".

    0     xxx   bind_conf: thread set but not group: thread IDs are global.
                FD/task: invalid. Or maybe for a task we could use this to
                mean "run on the current group, thread XXX", which would cover
                the need for health checks (g/t 0/0 while sleeping, 0/xxx
                while running) and have wake_expired_tasks() detect 0/0 and
                wake them up to a random group.

   G>0     0    bind_conf: only the group is set: bind to all threads of group G.
                FD/task: mask 0 not permitted (= not owned). May be used to
                mention "any thread of this group", though already covered by
                G/xxx like today.

   G>0    xxx   bind_conf: bind to these threads of this group.
                FD/task: group G, mask xxx.

With a single group declared in the config, group 0 would implicitly designate
the first one.


The problem with the approach above is that a task queued in one group+thread's
wait queue could very well receive a signal from another thread and/or group,
and there is no indication about where the task is queued, nor how to dequeue
it. Thus it seems that it's up to the application itself to unbind/rebind a
task. This contradicts the principle of leaving a task waiting in a wait queue
and waking it anywhere.

Another possibility might be to decide that a task having a defined group but
a mask of zero is shared and will always be queued into its group's wait queue.
However, upon expiry, the scheduler would notice the thread-mask 0 and would
broadcast it to any group.

Right now in the code we have:
  - 18 calls of task_new(tid_bit)
  - 18 calls of task_new(MAX_THREADS_MASK)
  - 2 calls with a single bit

Thus it looks like "task_new_anywhere()", "task_new_on()" and
"task_new_here()" would be sufficient.