| Thread groups |
| ############# |
| |
| 2021-07-13 - first draft |
| ========== |
| |
| Objective |
| --------- |
| - support multi-socket systems with limited cache-line bouncing between |
| physical CPUs and/or L3 caches |
| |
| - overcome the 64-thread limitation |
| |
| - support a reasonable number of groups. E.g. if modern CPUs arrive with
| core complexes made of 8 cores, with 8 CCs per chip and 2 chips in a
| system, it makes sense to support 16 groups.
| |
| |
| Non-objective |
| ------------- |
| - no need to optimize to the last possible cycle. I.e. some algos like |
| leastconn will remain shared across all threads, servers will keep a |
| single queue, etc. Global information remains global. |
| |
| - no stubborn enforcement of FD sharing. Per-server idle connection lists
| can become per-group; listeners can (and should probably) be per-group.
| Other mechanisms (like SO_REUSEPORT) can already overcome this.
| |
| - no need to go beyond 64 threads per group. |
| |
| |
| Identified tasks |
| ================ |
| |
| General |
| ------- |
| Everywhere tid_bit is used, we absolutely need to complement it with either
| the current group or a specific one. Thread debugging will need to be
| extended as masks are extensively used.
| |
| |
| Scheduler |
| --------- |
| The global run queue and global wait queue must become per-group. This
| means that a task may only be queued into one of them at a time. This
| suggests that tasks may only belong to a given group, but doing so would
| bring back the original issue that remote wakeups become impossible.
| |
| We could probably ignore the group if we don't need to set the thread mask
| in the task. The task's thread_mask is never manipulated using atomics, so
| it's safe to complement it with a group.
| |
| The sleeping_thread_mask should become per-group. A wakeup would then
| possibly only be performed on the assigned group, meaning that either a task
| is not assigned, in which case it will be self-assigned (like today), or the
| tg to be woken up will be retrieved from the task itself.
| |
| Task creation currently takes a thread mask of either tid_bit, a specific |
| mask, or MAX_THREADS_MASK. How to create a task able to run anywhere |
| (checks, Lua, ...) ? |
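| |
| As an illustration of the direction above (a task carrying its group plus a
| group-local thread mask, so that a remote wakeup can find the right target
| from the task itself), here is a minimal sketch; all names are hypothetical
| and the queues are reduced to opaque placeholders.
| |
|   #include <stdint.h>
| |
|   #define MAX_TGROUPS 16
| |
|   struct grp_runqueue {  /* stand-in for one per-group shared run queue */
|       void *head;
|   };
| |
|   static struct grp_runqueue grq[MAX_TGROUPS];
| |
|   struct task_sketch {
|       uint16_t tgid;        /* owning group (1-based); 0 = not assigned yet */
|       uint64_t thread_mask; /* mask local to that group, not process-wide */
|   };
| |
|   /* a remote wakeup only needs the task itself to know which group's shared
|    * run queue (and per-group sleeping_thread_mask) must be used.
|    */
|   static inline struct grp_runqueue *task_target_rq(const struct task_sketch *t,
|                                                     uint16_t current_tgid)
|   {
|       uint16_t grp = t->tgid ? t->tgid : current_tgid; /* self-assign if unset */
|       return &grq[grp - 1];
|   }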
| |
| Profiling -> completed |
| --------- |
| There should be one task_profiling_mask per thread group. Enabling or |
| disabling profiling should be made per group (possibly by iterating). |
| -> not needed anymore, one flag per thread in each thread's context. |
| |
| Thread isolation |
| ---------------- |
| Thread isolation is difficult as we solely rely on atomic ops to figure out
| who can complete. Such an operation is rare, so maybe we could have a global
| read_mostly flag containing a mask of the groups that require isolation.
| Then threads_want_rdv_mask etc. can become per-group. However, setting and
| clearing the bits will become problematic as this will happen in two steps,
| hence will require careful ordering.
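| |
| As an illustration of the two-step ordering concern above, here is a minimal
| sketch using C11 atomics; grp_want_rdv[], tgrp_want_rdv and the two helpers
| are hypothetical names, not existing code. The delicate part is clearing:
| the per-group bit is dropped first, and the group's bit in the global mask
| is only cleared by the last requester, with a re-check to close the race
| with a concurrent requester from the same group.
| |
|   #include <stdatomic.h>
|   #include <stdint.h>
| |
|   #define MAX_TGROUPS 16
| |
|   static _Atomic uint64_t grp_want_rdv[MAX_TGROUPS]; /* per-group thread bits */
|   static _Atomic uint32_t tgrp_want_rdv;             /* one bit per group */
| |
|   static void request_isolation(int grp, uint64_t ltid_bit)
|   {
|       atomic_fetch_or(&grp_want_rdv[grp], ltid_bit);
|       atomic_fetch_or(&tgrp_want_rdv, 1U << grp);
|       /* ... then wait for all other threads to become harmless ... */
|   }
| |
|   static void release_isolation(int grp, uint64_t ltid_bit)
|   {
|       if (atomic_fetch_and(&grp_want_rdv[grp], ~ltid_bit) == ltid_bit) {
|           /* we were the last requester of this group */
|           atomic_fetch_and(&tgrp_want_rdv, ~(1U << grp));
|           /* close the race with a requester that showed up meanwhile */
|           if (atomic_load(&grp_want_rdv[grp]))
|               atomic_fetch_or(&tgrp_want_rdv, 1U << grp);
|       }
|   }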
| |
| FD |
| -- |
| tid_bit is used in a number of atomic ops on the running_mask. If we have
| one fdtab[] per group, the mask implies that it's within the group.
| Theoretically we should never face a situation where an FD is reported or
| manipulated for a remote group.
| |
| There will still be one poller per thread, except that this time all |
| operations will be related to the current thread_group. No fd may appear |
| in two thread_groups at once, but we can probably not prevent that (e.g. |
| delayed close and reopen). Should we instead have a single shared fdtab[] |
| (less memory usage also) ? Maybe adding the group in the fdtab entry would |
| work, but when does a thread know it can leave it ? Currently this is |
| solved by running_mask and by update_mask. Having two tables could help |
| with this (each table sees the FD in a different group with a different |
| mask) but this looks overkill. |
| |
| There's polled_mask[] which needs to be decided upon. Probably that it |
| should be doubled as well. Note, polled_mask left fdtab[] for cacheline |
| alignment reasons in commit cb92f5cae4. |
| |
| If we have one fdtab[] per group, what *really* prevents us from using the
| same FD in multiple groups ? _fd_delete_orphan() and fd_update_events()
| need to check for no-thread usage before closing the FD. This could be
| a limiting factor. Enabling could require waking every poller.
| |
| Shouldn't we remerge fdinfo[] with fdtab[] (one pointer + one int/short, |
| used only during creation and close) ? |
| |
| Another problem: if we have one fdtab[] per TG, disabling/enabling an FD
| (e.g. pause/resume on a listener) can become a problem if it's not
| necessarily on the current TG. We'll then need a way to figure that one out.
| It sounds like FDs from listeners and receivers are very specific and suffer
| from problems that all the other ones under high load do not. Maybe
| something specific ought to be done for them, if we can guarantee there is
| no risk of accidental reuse (e.g. locate the TG info in the receiver and
| have an "MT" bit in the FD's flags). The risk is always that a close() can
| result in an instant pop-up of the same FD on any other thread of the same
| process.
| |
| Observations: right now fdtab[].thread_mask more or less corresponds to a
| declaration of interest; it's very close to meaning "active per thread". It
| is in fact located in the FD while it has no business there: it should live
| where the FD is used, since it rules accesses to a shared resource that is
| not the FD itself but whatever uses it. Indeed, if neither polled_mask nor
| running_mask has a thread's bit, the FD is unknown to that thread and the
| element using it may only be reached from above and not from the FD. As such
| we ought to have a thread_mask on a listener and another one on connections,
| indicating who uses them. A takeover could then be simplified (atomically
| set exclusivity on the FD's running_mask, and upon success take over the
| connection, then clear the running mask). By the way, the change probably
| ought to be performed at the connection level first, not the FD level. But
| running and polled are the two relevant elements: one indicates userland
| knowledge, the other one kernel knowledge. For listeners there's no
| exclusivity so it's a bit different, but the rule remains the same: we don't
| have to know which threads are *interested* in the FD, only its holder.
| |
| Not exact in fact, see FD notes below. |
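| |
| To make the simplified takeover described above more concrete, here is a
| minimal sketch (still subject to the notes below); struct fd_sketch,
| takeover_sketch() and steal_conn are hypothetical names and the error paths
| are omitted. Exclusivity is taken on running_mask, the connection is moved,
| then the bit is released.
| |
|   #include <stdatomic.h>
|   #include <stdint.h>
| |
|   struct fd_sketch {
|       _Atomic uint64_t running_mask; /* userland knowledge: who runs on it */
|       _Atomic uint64_t polled_mask;  /* kernel knowledge: who polls it */
|   };
| |
|   /* returns 0 on success, -1 if another thread was using the FD */
|   static int takeover_sketch(struct fd_sketch *fd, uint64_t my_bit,
|                              void (*steal_conn)(void))
|   {
|       uint64_t expected = 0;
| |
|       /* exclusivity: only granted when nobody else is running on the FD */
|       if (!atomic_compare_exchange_strong(&fd->running_mask,
|                                           &expected, my_bit))
|           return -1;
| |
|       steal_conn();                                 /* migrate the connection */
|       atomic_fetch_and(&fd->running_mask, ~my_bit); /* release exclusivity */
|       return 0;
|   }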
| |
| activity |
| -------- |
| There should be one activity array per thread group. The dump should
| simply scan them all since the cumulated values are not very important
| anyway.
| |
| applets |
| ------- |
| They use tid_bit only for the task. It looks like the appctx's thread_mask |
| is never used (now removed). Furthermore, it looks like the argument is |
| *always* tid_bit. |
| |
| CPU binding |
| ----------- |
| This is going to be tough. We will need to detect that threads overlap and
| are not bound (i.e. all threads on the same mask). In this case, if the
| number of threads is higher than the number of threads per physical socket,
| one must try hard to evenly spread them among physical sockets (e.g. one
| thread group per physical socket) and start as many threads as needed on
| each, bound to all threads/cores of each socket. If there is a single
| socket, the same job may be done based on L3 caches. Maybe it could always
| be done based on L3 caches. The difficulty behind this is the number of
| sockets to be bound: it is not possible to bind several FDs per listener.
| Maybe with a new bind keyword we could imagine automatically duplicating
| listeners ? In any case, the initially bound cpumap (via taskset) must
| always be respected, and everything should probably start from there.
| |
| Frontend binding |
| ---------------- |
| We'll have to define a list of threads and thread-groups per frontend.
| Probably that having a group mask and the same thread-mask for each group
| would suffice.
| |
| Threads should have two numbers: |
| - the per-process number (e.g. 1..256) |
| - the per-group number (1..64) |
| |
| The "bind-thread" lines ought to use the following syntax: |
| - bind 45 ## bind to process' thread 45 |
| - bind 1/45 ## bind to group 1's thread 45 |
| - bind all/45 ## bind to thread 45 in each group |
| - bind 1/all ## bind to all threads in group 1 |
| - bind all ## bind to all threads |
| - bind all/all ## bind to all threads in all groups (=all) |
| - bind 1/65 ## rejected |
| - bind 65 ## OK if there are enough |
| - bind 35-45 ## depends. Rejected if it crosses a group boundary. |
| |
| The global directive "nbthread 28" means 28 total threads for the process. The |
| number of groups will sub-divide this. E.g. 4 groups will very likely imply 7 |
| threads per group. At the beginning, the nbgroup should be manual since it |
| implies config adjustments to bind lines. |
| |
| There should be a trivial way to map a global thread to a group and local ID |
| and to do the opposite. |
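| |
| A minimal sketch of such a mapping, assuming each group knows its base
| (first global thread ID) and count, as noted further down ("base and count
| now"). The names tgroup_info_sketch, gtid_from() and gtid_to() are only
| illustrative.
| |
|   #include <stdint.h>
| |
|   struct tgroup_info_sketch {
|       uint32_t base;  /* first global thread ID of the group */
|       uint32_t count; /* number of threads in the group */
|   };
| |
|   static struct tgroup_info_sketch tg_info[16]; /* sorted, contiguous groups */
| |
|   /* global thread ID from group (1-based) and local ID (0-based) */
|   static inline uint32_t gtid_from(uint32_t grp, uint32_t ltid)
|   {
|       return tg_info[grp - 1].base + ltid;
|   }
| |
|   /* group and local ID from a global thread ID; a linear scan is enough
|    * here since fast paths are expected to keep both forms at hand.
|    */
|   static inline void gtid_to(uint32_t gtid, uint32_t *grp, uint32_t *ltid)
|   {
|       for (uint32_t g = 0; g < 16; g++) {
|           if (gtid < tg_info[g].base + tg_info[g].count) {
|               *grp  = g + 1;
|               *ltid = gtid - tg_info[g].base;
|               return;
|           }
|       }
|   }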
| |
| |
| Panic handler + watchdog |
| ------------------------ |
| Will probably depend on what's done for thread_isolate |
| |
| Per-thread arrays inside structures |
| ----------------------------------- |
| - listeners have a thr_conn[] array, currently limited to MAX_THREADS. Should |
| we simply bump the limit ? |
| - same for servers with idle connections. |
| => doesn't seem very practical. |
| - another solution might be to point to dynamically allocated arrays of |
| arrays (e.g. nbthread * nbgroup) or a first level per group and a second |
| per thread. |
| => dynamic allocation based on the global number |
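| |
| A trivial sketch of the retained option (dynamic allocation based on the
| global thread count) for such a per-thread array; listener_sketch, thr_conn
| and global_nbthread are placeholder names:
| |
|   #include <stdlib.h>
| |
|   struct listener_sketch {
|       unsigned int *thr_conn; /* one slot per global thread */
|   };
| |
|   static int listener_alloc_per_thread(struct listener_sketch *l,
|                                        unsigned int global_nbthread)
|   {
|       l->thr_conn = calloc(global_nbthread, sizeof(*l->thr_conn));
|       return l->thr_conn ? 0 : -1;
|   }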
| |
| Other |
| ----- |
| - what about dynamic thread start/stop (e.g. for containers/VMs) ? |
| E.g. if we decide to start $MANY threads in 4 groups, and only use |
| one, in the end it will not be possible to use less than one thread |
| per group, and at most 64 will be present in each group. |
| |
| |
| FD Notes |
| -------- |
| - updt_fd_polling() uses thread_mask to figure where to send the update, |
| the local list or a shared list, and which bits to set in update_mask. |
| This could be changed so that it takes the update mask in argument. The |
| call from the poller's fork would just have to broadcast everywhere. |
| |
| - pollers use it to figure whether they're concerned or not by the activity |
| update. This looks important as otherwise we could re-enable polling on |
| an FD that changed to another thread. |
| |
| - thread_mask being a per-thread active mask looks more exact and is |
| precisely used this way by _update_fd(). In this case using it instead |
| of running_mask to gauge a change or temporarily lock it during a |
| removal could make sense. |
| |
| - running should be conditioned by thread. Polled not (since deferred
| or migrated). In this case testing thread_mask can be enough most of
| the time, but this requires synchronization that will have to be
| extended to the tgid. But migration seems a different beast that we
| shouldn't care about here: if first performed at the higher level it
| ought to be safe.
| |
| In practice the update_mask can be dropped to zero by the first fd_delete() |
| as the only authority allowed to fd_delete() is *the* owner, and as soon as |
| all running_mask are gone, the FD will be closed, hence removed from all |
| pollers. This will be the only way to make sure that update_mask always |
| refers to the current tgid. |
| |
| However, it may happen that a takeover within the same group causes a thread |
| to read the update_mask late, while the FD is being wiped by another thread. |
| That other thread may close it, causing another thread in another group to |
| catch it, and change the tgid and start to update the update_mask. This means |
| that it would be possible for a thread entering do_poll() to see the correct |
| tgid, then the fd would be closed, reopened and reassigned to another tgid, |
| and the thread would see its bit in the update_mask, being confused. Right |
| now this should already happen when the update_mask is not cleared, except |
| that upon wakeup a migration would be detected and that would be all. |
| |
| Thus we might need to set the running bit to prevent the FD from migrating |
| before reading update_mask, which also implies closing on fd_clr_running() == 0 :-( |
| |
| Also even fd_update_events() leaves a risk of updating update_mask after |
| clearing running, thus affecting the wrong one. Probably that update_mask |
| should be updated before clearing running_mask there. Also, how about not |
| creating an update on a close ? Not trivial if done before running, unless |
| thread_mask==0. |
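| |
| The ordering suggested above (touch update_mask while our running bit is
| still set, and close when the last running bit is dropped) could look like
| the following sketch; the names are simplified and only meant to show the
| sequence:
| |
|   #include <stdatomic.h>
|   #include <stdint.h>
| |
|   struct fd_sketch {
|       _Atomic uint64_t running_mask;
|       _Atomic uint64_t update_mask;
|   };
| |
|   static void fd_done_update_sketch(struct fd_sketch *fd, uint64_t my_bit,
|                                     void (*maybe_close)(struct fd_sketch *))
|   {
|       /* 1. withdraw ourselves from the updates *before* releasing running */
|       atomic_fetch_and(&fd->update_mask, ~my_bit);
| |
|       /* 2. only then drop the running bit; the last runner of an orphaned
|        *    FD is also the one responsible for closing it
|        */
|       if (atomic_fetch_and(&fd->running_mask, ~my_bit) == my_bit)
|           maybe_close(fd);
|   }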
| |
| Note that one situation that is currently visible is that a thread closes a |
| file descriptor that it's the last one to own and to have an update for. In |
| fd_delete_orphan() it does call poller.clo() but this one is not sufficient |
| as it doesn't drop the update_mask nor does it clear the polled_mask. The |
| typical problem that arises is that the close() happens before processing |
| the last update (e.g. a close() just after a partial read), thus it still |
| has *at least* one bit set for the current thread in both update_mask and |
| polled_mask, and it is present in the update_list. Not handling it would |
| mean that the event is lost on update() from the concerned threads and |
| that some resource might leak. Handling it means zeroing the update_mask |
| and polled_mask, and deleting the update entry from the update_list, thus |
| losing the update event. And as indicated above, if the FD switches twice |
| between 2 groups, the finally called thread does not necessarily know that |
| the FD isn't the same anymore, thus it's difficult to decide whether to |
| delete it or not, because deleting the event might in fact mean deleting |
| something that was just re-added for the same thread with the same FD but |
| a different usage. |
| |
| Also it really seems unrealistic to scan a single shared update_list like |
| this using write operations. There should likely be one per thread-group. |
| But in this case there is no more choice than deleting the update event |
| upon fd_delete_orphan(). This also means that poller->clo() must do the |
| job for all of the group's threads at once. This would mean a synchronous |
| removal before the close(), which doesn't seem ridiculously expensive. It |
| just requires that any thread of a group may manipulate any other thread's |
| status for an FD and a poller. |
| |
| Note about our currently supported pollers: |
| |
| - epoll: our current code base relies on the modern version which |
| automatically removes closed FDs, so we don't have anything to do |
| when closing and we don't need the update. |
| |
| - kqueue: according to https://www.freebsd.org/cgi/man.cgi?query=kqueue, just |
| like epoll, a close() implies a removal. Our poller doesn't perform |
| any bookkeeping either so it's OK to directly close. |
| |
| - evports: https://docs.oracle.com/cd/E86824_01/html/E54766/port-dissociate-3c.html |
| says the same, i.e. close() implies a removal of all events. No local |
| processing nor bookkeeping either, we can close. |
| |
| - poll: the fd_evts[] array is global, thus shared by all threads. As such, |
| a single removal is needed to flush it for all threads at once. The |
| operation is already performed like this. |
| |
| - select: works exactly like poll() above, hence already handled. |
| |
| As a preliminary conclusion, it's safe to delete the event and reset |
| update_mask just after calling poller->clo(). If extremely unlucky (changing |
| thread mask due to takeover ?), the same FD may appear at the same time: |
| - in one or several thread-local fd_updt[] arrays. These ones are just work |
| queues, there's nothing to do to ignore them, just leave the holes with an |
| outdated FD which will be ignored once met. As a bonus, poller->clo() could |
| check if the last fd_updt[] points to this specific FD and decide to kill |
| it. |
| |
| - in the global update_list. In this case, fd_rm_from_fd_list() already |
| performs an attachment check, so it's safe to always call it before closing |
| (since no one else may be in the process of changing anything). |
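| |
| Put together, the resulting close sequence could look like the sketch below.
| The three helpers are placeholders standing for the operations discussed
| above (per-group poller cleanup, detaching from the shared update_list,
| resetting update_mask/polled_mask); only the ordering matters here.
| |
|   #include <unistd.h>
| |
|   static void poller_clo_sketch(int fd)         { (void)fd; }
|   static void update_list_detach_sketch(int fd) { (void)fd; }
|   static void reset_update_and_polled(int fd)   { (void)fd; }
| |
|   static void fd_close_sketch(int fd)
|   {
|       poller_clo_sketch(fd);         /* poller cleanup for the whole group */
|       update_list_detach_sketch(fd); /* safe even if not attached */
|       reset_update_and_polled(fd);   /* update_mask = polled_mask = 0 */
|       close(fd);                     /* the FD may now reappear anywhere */
|   }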
| |
| |
| ########################################################### |
| |
| Current state: |
| |
| |
| Mux / takeover / fd_delete() code ||| poller code |
| -------------------------------------------------|||--------------------------------------------------- |
| \|/ |
| mux_takeover(): | fd_set_running(): |
| if (fd_takeover()<0) | old = {running, thread}; |
| return fail; | new = {tid_bit, tid_bit}; |
| ... | |
| fd_takeover(): | do { |
| atomic_or(running, tid_bit); | if (!(old.thread & tid_bit)) |
| old = {running, thread}; | return -1; |
| new = {tid_bit, tid_bit}; | new = { running | tid_bit, old.thread } |
| if (owner != expected) { | } while (!dwcas({running, thread}, &old, &new)); |
| atomic_and(running, ~tid_bit); | |
| return -1; // fail | fd_clr_running(): |
| } | return atomic_and_fetch(running, ~tid_bit); |
| | |
| while (old == {tid_bit, !=0 }) | poll(): |
| if (dwcas({running, thread}, &old, &new)) { | if (!owner) |
| atomic_and(running, ~tid_bit); | continue; |
| return 0; // success | |
| } | if (!(thread_mask & tid_bit)) { |
| } | epoll_ctl_del(); |
| | continue; |
| atomic_and(running, ~tid_bit); | } |
| return -1; // fail | |
| | // via fd_update_events() |
| fd_delete(): | if (fd_set_running() != -1) { |
| atomic_or(running, tid_bit); | iocb(); |
| atomic_store(thread, 0); | if (fd_clr_running() == 0 && !thread_mask) |
| if (fd_clr_running(fd) = 0) | fd_delete_orphan(); |
| fd_delete_orphan(); | } |
| |
| |
| The idle_conns_lock prevents the connection from being *picked* and released |
| while someone else is reading it. What it does is guarantee that on idle |
| connections, the caller of the IOCB will not dereference the task's context |
| while the connection is still in the idle list, since it might be picked then |
| freed at the same instant by another thread. As soon as the IOCB manages to |
| get that lock, it removes the connection from the list so that it cannot be |
| taken over anymore. Conversely, the mux's takeover() code runs under that |
| lock so that if it frees the connection and task, this will appear atomic |
| to the IOCB. The timeout task (which is another entry point for connection |
| deletion) does the same. Thus, when coming from the low-level (I/O or timeout): |
| - task always exists, but ctx checked under lock validates; conn removal |
| from list prevents takeover(). |
| - t->context is stable, except during changes under takeover lock. So |
| h2_timeout_task may well run on a different thread than h2_io_cb(). |
| |
| Coming from the top: |
| - takeover() done under lock() clears task's ctx and possibly closes the FD |
| (unless some running remains present). |
| |
| Unlikely but currently possible situations: |
| - multiple pollers (up to N) may have an idle connection's FD being |
| polled, if the connection was passed from thread to thread. The first |
| event on the connection would wake all of them. Most of them would |
| see fdtab[].owner set (the late ones might miss it). All but one would |
| see that their bit is missing from fdtab[].thread_mask and give up. |
| However, just after this test, others might take over the connection, |
| so in practice if terribly unlucky, all but 1 could see their bit in |
| thread_mask just before it gets removed, all of them set their bit |
| in running_mask, and all of them call iocb() (sock_conn_iocb()). |
| Thus all of them dereference the connection and touch the subscriber |
| with no protection, then end up in conn_notify_mux() that will call |
| the mux's wake(). |
| |
| - multiple pollers (up to N-1) might still be in fd_update_events() |
| manipulating fdtab[].state. The cause is that the "locked" variable |
| is determined by atleast2(thread_mask) but that thread_mask is read |
| at a random instant (i.e. it may be stolen by another one during a |
| takeover) since we don't yet hold running to prevent this from being |
| done. Thus we can arrive here with thread_mask==something_else (1bit), |
| locked==0 and fdtab[].state assigned non-atomically. |
| |
| - it looks like nothing prevents h2_release() from being called on a |
| thread (e.g. from the top or task timeout) while sock_conn_iocb() |
| dereferences the connection on another thread. Those killing the |
| connection don't yet consider the fact that it's an FD that others |
| might currently be waking up on. |
| |
| ################### |
| |
| pb with counter: |
| |
| The users count doesn't say who's using the FD, and two users can do the
| same close in turn. The thread_mask should define who's responsible for
| closing the FD, and all those with a bit in it ought to do it.
| |
| |
| 2021-08-25 - update with minimal locking on tgid value |
| ========== |
| |
| - tgid + refcount at once using CAS |
| - idle_conns lock during updates |
| - update: |
| if tgid differs => close happened, thus drop update |
| otherwise normal stuff. Lock tgid until running if needed. |
| - poll report: |
| if tgid differs => closed |
| if thread differs => stop polling (migrated) |
| keep tgid lock until running |
| - test on thread_id: |
| if (xadd(&tgid,65536) != my_tgid) { |
| // was closed |
| sub(&tgid, 65536) |
| return -1 |
| } |
| if !(thread_id & tidbit) => migrated/closed |
| set_running() |
| sub(tgid,65536) |
| - note: either fd_insert() or the final close() ought to set |
| polled and update to 0. |
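| |
| A sketch of the tgid+refcount packing used by the pseudo-code above: the
| low 16 bits hold the tgid and the upper bits hold a reference count bumped
| by 65536. Note that the value returned by the xadd contains both fields, so
| the tgid comparison presumably has to mask the low 16 bits. Names are
| simplified for illustration.
| |
|   #include <stdatomic.h>
|   #include <stdint.h>
| |
|   #define TGID_MASK 0xffffU
|   #define TGID_REF  0x10000U  /* 65536: one reference count unit */
| |
|   /* grab a reference on the FD's tgid; returns 0 on success, -1 if the FD
|    * was closed or reassigned to another group in the meantime.
|    */
|   static int fd_grab_tgid_sketch(_Atomic uint32_t *tgid_refcnt, uint32_t my_tgid)
|   {
|       uint32_t old = atomic_fetch_add(tgid_refcnt, TGID_REF);
| |
|       if ((old & TGID_MASK) != my_tgid) {
|           atomic_fetch_sub(tgid_refcnt, TGID_REF); /* was closed: roll back */
|           return -1;
|       }
|       /* the caller then checks the thread mask, sets running, and finally
|        * drops the reference with atomic_fetch_sub(tgid_refcnt, TGID_REF).
|        */
|       return 0;
|   }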
| |
| 2021-09-13 - tid / tgroups etc. |
| ========== |
| |
| * tid currently is the thread's global ID. It's essentially used as an index |
| for arrays. It must be clearly stated that it works this way. |
| |
| * tasklets use the global thread id, and __tasklet_wakeup_on() must use a
| global ID as well. It's essential that tinfo[] provides instant access to
| local/global bits/indexes/arrays
| |
| - tid_bit makes no sense process-wide, so it must be redefined to represent
| the thread's tid within its group. The name is not very welcome though, but
| there are 286 occurrences of it that are not going to be changed that fast.
| => now we have ltid and ltid_bit in thread_info. thread-local tid_bit still |
| not changed though. If renamed we must make sure the older one vanishes. |
| Why not rename "ptid, ptid_bit" for the process-wide tid and "gtid, |
| gtid_bit" for the group-wide ones ? This removes the ambiguity on "tid" |
| which is half the time not the one we expect. |
| |
| * just like "ti" is the thread_info, we need to have "tg" pointing to the |
| thread_group. |
| |
| - other less commonly used elements should be retrieved from ti->xxx. E.g. |
| the thread's local ID. |
| |
| - lock debugging must reproduce tgid |
| |
| * task profiling must be made per-group (annoying), unless we want to add a |
| per-thread TH_FL_* flag and have the rare places where the bit is changed |
| iterate over all threads if needed. Sounds preferable overall. |
| |
| * an offset might be placed in the tgroup so that even with 64 threads max |
| we could have completely separate tid_bits over several groups. |
| => base and count now |
| |
| 2021-09-15 - bind + listen() + rx |
| ========== |
| |
| - thread_mask (in bind_conf->rx_settings) should become an array of |
| MAX_TGROUP longs. |
| - when parsing "thread 123" or "thread 2/37", the proper bit is set, |
| assuming the array is either a contiguous bitfield or a tgroup array. |
| An option RX_O_THR_PER_GRP or RX_O_THR_PER_PROC is set depending on |
| how the thread num was parsed, so that we reject mixes. |
| - end of parsing: entries translated to the cleanest form (to be determined) |
| - binding: for each socket()/bind()/listen()... just perform one extra dup() |
| for each tgroup and store the multiple FDs into an FD array indexed on |
| MAX_TGROUP. => allows to use one FD per tgroup for the same socket, hence |
| to have multiple entries in all tgroup pollers without requiring the user |
| to duplicate the bind line. |
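| |
| A minimal sketch of the "one extra dup() per tgroup" idea above: once the
| socket is bound and listening, each additional group receives its own FD for
| the same socket so that every group's pollers get their own entry.
| MAX_TGROUPS, receiver_sketch and rx_fds[] are placeholder names.
| |
|   #include <unistd.h>
| |
|   #define MAX_TGROUPS 16
| |
|   struct receiver_sketch {
|       int rx_fds[MAX_TGROUPS]; /* one FD per thread group, same socket */
|   };
| |
|   static int receiver_dup_per_group(struct receiver_sketch *rx, int bound_fd,
|                                     int nbtgroups)
|   {
|       rx->rx_fds[0] = bound_fd;
|       for (int g = 1; g < nbtgroups; g++) {
|           rx->rx_fds[g] = dup(bound_fd);
|           if (rx->rx_fds[g] < 0)
|               return -1;
|       }
|       return 0;
|   }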
| |
| 2021-09-15 - global thread masks |
| ========== |
| |
| Some global variables currently expect to know about thread IDs and it's |
| uncertain what must be done with them: |
| - global_tasks_mask /* Mask of threads with tasks in the global runqueue */ |
| => touched under the rq lock. Change it per-group ? What exact use is made ? |
| |
| - sleeping_thread_mask /* Threads that are about to sleep in poll() */ |
| => seems that it can be made per group |
| |
| - all_threads_mask: a bit complicated, derived from nbthread and used with |
| masks and with my_ffsl() to wake threads up. Should probably be per-group |
| but we might miss something for global. |
| |
| - stopping_thread_mask: used in combination with all_threads_mask, should |
| move per-group. |
| |
| - threads_harmless_mask: indicates all threads that are currently harmless in
| that they promise not to access a shared resource. Must be made per-group
| but then we'll likely need a second stage to have the harmless groups mask.
| threads_idle_mask, threads_sync_mask, threads_want_rdv_mask go with the one
| above. Maybe the right approach will be to request harmless on a group mask
| so that we can detect collisions and arbitrate them like today, but on top
| of this it becomes possible to request harmless only on the local group if
| desired. The subtlety is that requesting harmless at the group level does
| not mean it's achieved since the requester cannot vouch for the other ones
| in the same group.
| |
| In addition, some variables are related to the global runqueue: |
| __decl_aligned_spinlock(rq_lock); /* spin lock related to run queue */ |
| struct eb_root rqueue; /* tree constituting the global run queue, accessed under rq_lock */ |
| unsigned int grq_total; /* total number of entries in the global run queue, atomic */ |
| static unsigned int global_rqueue_ticks; /* insertion count in the grq, use rq_lock */ |
| |
| And others to the global wait queue: |
| struct eb_root timers; /* sorted timers tree, global, accessed under wq_lock */ |
| __decl_aligned_rwlock(wq_lock); /* RW lock related to the wait queue */ |
| |
| |
| 2022-06-14 - progress on task affinity |
| ========== |
| |
| The particularity of the current global run queue is to be usable for remote |
| wakeups because it's protected by a lock. There is no need for a global run |
| queue beyond this, and there could already be a locked queue per thread for |
| remote wakeups, with a random selection at wakeup time. It's just that picking |
| a pending task in a run queue among a number is convenient (though it |
| introduces some excessive locking). A task will either be tied to a single |
| group or will be allowed to run on any group. As such it's pretty clear that we |
| don't need a global run queue. When a run-anywhere task expires, either it runs |
| on the current group's runqueue with any thread, or a target thread is selected |
| during the wakeup and it's directly assigned. |
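| |
| If the target thread is to be selected during the wakeup, the policy is an
| open question; the sketch below simply picks the least loaded thread of the
| caller's group, which is only one possible choice. rq_count[] and the helper
| name are placeholders.
| |
|   /* returns the global ID of the thread to queue the wakeup on */
|   static unsigned int pick_target_thread(const unsigned int *rq_count,
|                                          unsigned int grp_base,
|                                          unsigned int grp_count)
|   {
|       unsigned int best = grp_base;
| |
|       for (unsigned int t = grp_base + 1; t < grp_base + grp_count; t++)
|           if (rq_count[t] < rq_count[best])
|               best = t;
|       return best;
|   }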
| |
| A global wait queue seems important for scheduled repetitive tasks however. But |
| maybe it's more a task for a cron-like job and there's no need for the task |
| itself to wake up anywhere, because once the task wakes up, it must be tied to |
| one (or a set of) thread(s). One difficulty if the task is temporarily assigned |
| a thread group is that it's impossible to know where it's running when trying |
| to perform a second wakeup or when trying to kill it. Maybe we'll need to have |
| two tgid for a task (desired, effective). Or maybe we can restrict the ability |
| of such a task to stay in wait queue in case of wakeup, though that sounds |
| difficult. Other approaches would be to set the GID to the current one when |
| waking up the task, and to have a flag (or sign on the GID) indicating that the |
| task is still queued in the global timers queue. We already have TASK_SHARED_WQ |
| so it seems that antoher similar flag such as TASK_WAKE_ANYWHERE could make |
| sense. But when is TASK_SHARED_WQ really used, except for the "anywhere" case ? |
| All calls to task_new() use either 1<<thr, tid_bit, all_threads_mask, or come |
| from appctx_new which does exactly the same. The only real user of a
| non-global, non-unique task_new() call is debug_parse_cli_sched(), which
| purposely allows the use of an arbitrary mask.
| |
| +----------------------------------------------------------------------------+ |
| | => we don't need one WQ per group, only a global and N local ones, hence | |
| | the TASK_SHARED_WQ flag can continue to be used for this purpose. | |
| +----------------------------------------------------------------------------+ |
| |
| Having TASK_SHARED_WQ should indicate that a task will always be queued to the |
| shared queue and will always have a temporary gid and thread mask in the run |
| queue. |
| |
| Going further, as we don't have any single case of a task bound to a small set |
| of threads, we could decide to wake up only expired tasks for ourselves by |
| looking them up using eb32sc and adopting them. Thus, there's no more need for |
| a shared runqueue nor a global_runqueue_ticks counter, and we can simply have |
| the ability to wake up a remote task. The task's thread_mask will then change |
| so that it's only a thread ID, except when the task has TASK_SHARED_WQ, in |
| which case it corresponds to the running thread. That's very close to what is |
| already done with tasklets in fact. |
| |
| |
| 2021-09-29 - group designation and masks |
| ========== |
| |
| Neither FDs nor tasks will belong to incomplete subsets of threads spanning
| over multiple thread groups. In addition there may be a difference between
| configuration and operation (for FDs). This allows us to fix the following
| rules:
| |
| group mask description |
| 0 0 bind_conf: groups & thread not set. bind to any/all |
| task: it would be nice to mean "run on the same as the caller". |
| |
| 0 xxx bind_conf: thread set but not group: thread IDs are global |
| FD/task: group 0, mask xxx |
| |
| G>0 0 bind_conf: only group is set: bind to all threads of group G |
| FD/task: mask 0 not permitted (= not owned). May be used to |
| mention "any thread of this group", though already covered by |
| G/xxx like today. |
| |
| G>0 xxx bind_conf: Bind to these threads of this group |
| FD/task: group G, mask xxx |
| |
| It looks like keeping groups starting at zero internally complicates
| everything though. But forcing them to start at 1 might also require that we
| rescan all tasks to replace 0 with 1 upon startup. This would also allow
| group 0 to be special and be used as the default group for any new thread
| creation, so that group0.count would keep the number of unassigned threads.
| Let's try:
| |
| group mask description |
| 0 0 bind_conf: groups & thread not set. bind to any/all |
| task: "run on the same group & thread as the caller". |
| |
| 0 xxx bind_conf: thread set but not group: thread IDs are global |
| FD/task: invalid. Or maybe for a task we could use this to |
| mean "run on current group, thread XXX", which would cover |
| the need for health checks (g/t 0/0 while sleeping, 0/xxx |
| while running) and have wake_expired_tasks() detect 0/0 and |
| wake them up to a random group. |
| |
| G>0 0 bind_conf: only group is set: bind to all threads of group G |
| FD/task: mask 0 not permitted (= not owned). May be used to |
| mention "any thread of this group", though already covered by |
| G/xxx like today. |
| |
| G>0 xxx bind_conf: Bind to these threads of this group |
| FD/task: group G, mask xxx |
| |
| With a single group declared in the config, group 0 would implicitly
| designate the first one.
| |
| |
| The problem with the approach above is that a task queued in one group+thread's |
| wait queue could very well receive a signal from another thread and/or group, |
| and that there is no indication about where the task is queued, nor how to |
| dequeue it. Thus it seems that it's up to the application itself to unbind/ |
| rebind a task. This contradicts the principle of leaving a task waiting in a |
| wait queue and waking it anywhere. |
| |
| Another possibility might be to decide that a task having a defined group but |
| a mask of zero is shared and will always be queued into its group's wait queue. |
| However, upon expiry, the scheduler would notice the thread-mask 0 and would |
| broadcast it to any group. |
| |
| Right now in the code we have: |
| - 18 calls of task_new(tid_bit) |
| - 17 calls of task_new_anywhere() |
| - 2 calls with a single bit |
| |
| Thus it looks like "task_new_anywhere()", "task_new_on()" and |
| "task_new_here()" would be sufficient. |