Thread groups
#############

2021-07-13 - first draft
==========

Objective
---------
- support multi-socket systems with limited cache-line bouncing between
  physical CPUs and/or L3 caches

- overcome the 64-thread limitation

- support a reasonable number of groups. E.g. if modern CPUs arrive with
  core complexes made of 8 cores, with 8 CC per chip and 2 chips in a
  system, it makes sense to support 16 groups.


Non-objective
-------------
- no need to optimize to the last possible cycle. I.e. some algos like
  leastconn will remain shared across all threads, servers will keep a
  single queue, etc. Global information remains global.

- no stubborn enforcement of FD sharing. Per-server idle connection lists
  can become per-group; listeners can (and probably should) be per-group.
  Other mechanisms (like SO_REUSEPORT) can already overcome this; see the
  sketch after this list.

- no need to go beyond 64 threads per group.
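
A minimal, stand-alone sketch of the per-group listener idea (assuming Linux
and SO_REUSEPORT; the group count and all names are illustrative, this is not
HAProxy code):

  /* One listening socket per thread group, all bound to the same
   * address/port thanks to SO_REUSEPORT, so that each group polls its
   * own FD without sharing it.
   */
  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  #define NB_TGROUPS 4

  int main(void)
  {
      int fds[NB_TGROUPS];
      struct sockaddr_in addr;

      memset(&addr, 0, sizeof(addr));
      addr.sin_family = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      addr.sin_port = htons(8080);

      for (int grp = 0; grp < NB_TGROUPS; grp++) {
          int one = 1;

          fds[grp] = socket(AF_INET, SOCK_STREAM, 0);
          if (fds[grp] < 0 ||
              setsockopt(fds[grp], SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0 ||
              bind(fds[grp], (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
              listen(fds[grp], 128) < 0) {
              perror("listener setup");
              return 1;
          }
          /* fds[grp] would only ever be registered in group <grp>'s pollers */
      }

      for (int grp = 0; grp < NB_TGROUPS; grp++)
          close(fds[grp]);
      return 0;
  }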


Identified tasks
================

General
-------
Everywhere tid_bit is used we absolutely need to find a complement using
either the current group or a specific one. Thread debugging will need to
be extended as masks are extensively used.


Scheduler
---------
The global run queue and global wait queue must become per-group. This
means that a task may only be queued into one of them at a time. It
sounds like tasks may only belong to a given group, but doing so would
bring back the original issue that it's impossible to perform remote wake
ups.

We could probably ignore the group if we don't need to set the thread mask
in the task. The task's thread_mask is never manipulated using atomics so
it's safe to complement it with a group.

The sleeping_thread_mask should become per-group. Thus a wakeup may
possibly only be performed on the assigned group, meaning that either
the task is not assigned, in which case it will be self-assigned (like
today), or the thread group to be woken up will be retrieved from the
task itself.

Task creation currently takes a thread mask of either tid_bit, a specific
mask, or MAX_THREADS_MASK. How to create a task able to run anywhere
(checks, Lua, ...) ?
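
As a rough illustration of the direction above, a task could carry a group ID
next to its (non-atomic) thread mask, and a wakeup would either self-assign an
unassigned task or target the task's own group. Everything below (struct
names, the fixed group count) is an assumption, not the actual scheduler:

  #include <stdint.h>
  #include <stdio.h>

  #define MAX_TGROUPS 16

  struct tgroup_rq {
      uint64_t sleeping_thread_mask;  /* per-group sleeping threads */
      /* the per-group run queue / wait queue would live here */
  };

  struct task_sketch {
      uint16_t tgid;         /* owning thread group, 0 = not assigned yet */
      uint64_t thread_mask;  /* mask of eligible threads *within* the group */
  };

  static struct tgroup_rq tgroup_rqs[MAX_TGROUPS + 1];

  /* wake up <t>: if unassigned, adopt it on the caller's group/thread,
   * otherwise queue it to its own group's run queue.
   */
  static void task_wakeup_sketch(struct task_sketch *t, uint16_t my_tgid,
                                 uint64_t my_tid_bit)
  {
      if (!t->tgid) {
          t->tgid = my_tgid;          /* self-assign, like today */
          t->thread_mask = my_tid_bit;
      }
      printf("queue task to group %u rq, eligible mask %#llx\n",
             t->tgid, (unsigned long long)t->thread_mask);
      (void)tgroup_rqs;               /* the per-group run queue would be used here */
  }

  int main(void)
  {
      struct task_sketch t = { .tgid = 0, .thread_mask = 0 };
      task_wakeup_sketch(&t, 2, 1ULL << 5); /* caller: group 2, local thread 5 */
      return 0;
  }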

Profiling
---------
There should be one task_profiling_mask per thread group. Enabling or
disabling profiling should be made per group (possibly by iterating).

Thread isolation
----------------
Thread isolation is difficult as we solely rely on atomic ops to figure
out who can complete. Such operations are rare, so maybe we could have a
global read_mostly flag containing a mask of the groups that require
isolation. Then threads_want_rdv_mask etc. can become per-group. However,
setting and clearing the bits will become problematic as this will happen
in two steps, hence will require careful ordering.

FD
--
tid_bit is used in a number of atomic ops on the running_mask. If we have
one fdtab[] per group, the mask implies that it's within the group.
Theoretically we should never face a situation where an FD is reported or
manipulated for a remote group.

There will still be one poller per thread, except that this time all
operations will be related to the current thread_group. No FD may appear
in two thread_groups at once, but we can probably not prevent that (e.g.
delayed close and reopen). Should we instead have a single shared fdtab[]
(less memory usage also) ? Maybe adding the group in the fdtab entry would
work, but when does a thread know it can leave it ? Currently this is
solved by running_mask and by update_mask. Having two tables could help
with this (each table sees the FD in a different group with a different
mask) but this looks overkill.

There's polled_mask[] which needs to be decided upon. It should probably
be doubled as well. Note, polled_mask left fdtab[] for cacheline
alignment reasons in commit cb92f5cae4.

If we have one fdtab[] per group, what *really* prevents us from using the
same FD in multiple groups ? _fd_delete_orphan() and fd_update_events()
need to check for no-thread usage before closing the FD. This could be
a limiting factor. Enabling could require waking every poller.

Shouldn't we remerge fdinfo[] with fdtab[] (one pointer + one int/short,
used only during creation and close) ?

Another problem: if we have one fdtab[] per TG, disabling/enabling an FD
(e.g. pause/resume on a listener) becomes a problem if it's not necessarily
on the current TG. We'll then need a way to figure that one out. It sounds
like FDs from listeners and receivers are very specific and suffer from
problems all other ones under high load do not suffer from. Maybe something
specific ought to be done for them, if we can guarantee there is no risk of
accidental reuse (e.g. locate the TG info in the receiver and have a "MT"
bit in the FD's flags). The risk is always that a close() can result in the
instant pop-up of the same FD on any other thread of the same process.

Observations: right now fdtab[].thread_mask more or less corresponds to a
declaration of interest; it's very close to meaning "active per thread". It
is in fact located in the FD while it has nothing to do there, as it should
be where the FD is used, since it rules accesses to a shared resource that
is not the FD but what uses it. Indeed, if neither polled_mask nor
running_mask has a thread's bit, the FD is unknown to that thread and the
element using it may only be reached from above and not from the FD. As
such we ought to have a thread_mask on a listener and another one on
connections. These would indicate who uses them. A takeover could then be
simplified (atomically set exclusivity on the FD's running_mask; upon
success, take over the connection, then clear the running mask). The change
should probably be performed at the connection level first, not the FD
level, by the way. But running and polled are the two relevant elements:
one indicates userland knowledge, the other one kernel knowledge. For
listeners there's no exclusivity so it's a bit different, but the rule
remains the same: we don't have to know what threads are *interested* in
the FD, only its holder.

Not exact in fact, see FD notes below.

activity
--------
There should be one activity array per thread group. The dump should
simply scan them all since the cumulated values are not very important
anyway.

applets
-------
They use tid_bit only for the task. It looks like the appctx's thread_mask
is never used (now removed). Furthermore, it looks like the argument is
*always* tid_bit.

CPU binding
-----------
This is going to be tough. We will need to detect that threads overlap
and are not bound (i.e. all threads on the same mask). In this case, if the
number of threads is higher than the number of threads per physical socket,
one must try hard to evenly spread them among physical sockets (e.g. one
thread group per physical socket) and start as many threads as needed on
each, bound to all threads/cores of each socket. If there is a single
socket, the same job may be done based on L3 caches. Maybe it could always
be done based on L3 caches. The difficulty behind this is the number of
sockets to be bound: it is not possible to bind several FDs per listener.
Maybe with a new bind keyword we could imagine automatically duplicating
listeners ? In any case, the initially bound cpumap (via taskset) must
always be respected, and everything should probably start from there.
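
For reference, a small Linux-only sketch of detecting which CPUs share an L3
cache via sysfs (the standard /sys/devices/system/cpu/.../cache entries;
minimal error handling, not HAProxy code). A real implementation would also
intersect this with the cpumap inherited from taskset:

  #include <stdio.h>

  int main(void)
  {
      char path[256], shared[256];

      for (int cpu = 0; cpu < 64; cpu++) {
          for (int idx = 0; idx < 8; idx++) {
              FILE *f;
              int level = 0;

              snprintf(path, sizeof(path),
                       "/sys/devices/system/cpu/cpu%d/cache/index%d/level",
                       cpu, idx);
              f = fopen(path, "r");
              if (!f)
                  break; /* no more cache indexes (or no such CPU) */
              if (fscanf(f, "%d", &level) != 1)
                  level = 0;
              fclose(f);
              if (level != 3)
                  continue;

              snprintf(path, sizeof(path),
                       "/sys/devices/system/cpu/cpu%d/cache/index%d/shared_cpu_list",
                       cpu, idx);
              f = fopen(path, "r");
              if (f && fgets(shared, sizeof(shared), f))
                  /* CPUs listed here share the same L3: natural candidates
                   * for forming one thread group.
                   */
                  printf("cpu%d shares L3 with: %s", cpu, shared);
              if (f)
                  fclose(f);
          }
      }
      return 0;
  }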

Frontend binding
----------------
We'll have to define a list of threads and thread-groups per frontend.
Having a group mask and a common thread-mask for each group would probably
suffice.

Threads should have two numbers:
  - the per-process number (e.g. 1..256)
  - the per-group number (1..64)

The "bind-thread" lines ought to use the following syntax:
  - bind 45        ## bind to process' thread 45
  - bind 1/45      ## bind to group 1's thread 45
  - bind all/45    ## bind to thread 45 in each group
  - bind 1/all     ## bind to all threads in group 1
  - bind all       ## bind to all threads
  - bind all/all   ## bind to all threads in all groups (=all)
  - bind 1/65      ## rejected
  - bind 65        ## OK if there are enough
  - bind 35-45     ## depends. Rejected if it crosses a group boundary.

The global directive "nbthread 28" means 28 total threads for the process.
The number of groups will sub-divide this. E.g. 4 groups will very likely
imply 7 threads per group. At the beginning, the nbgroup setting should be
manual since it implies config adjustments to bind lines.

There should be a trivial way to map a global thread to a group and local
ID and to do the opposite.
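
A trivial sketch of such a mapping, assuming equal-sized groups and the
1-based numbering described above (names and layout are assumptions):

  #include <assert.h>

  struct thr_map {
      int threads_per_group;   /* e.g. nbthread / nbtgroups */
  };

  /* global thread ID (1..nbthread) -> group (1..nbtgroups) */
  static int thr_group(const struct thr_map *m, int global_id)
  {
      return (global_id - 1) / m->threads_per_group + 1;
  }

  /* global thread ID -> local ID within its group (1..threads_per_group) */
  static int thr_local(const struct thr_map *m, int global_id)
  {
      return (global_id - 1) % m->threads_per_group + 1;
  }

  /* (group, local) -> global thread ID */
  static int thr_global(const struct thr_map *m, int group, int local)
  {
      return (group - 1) * m->threads_per_group + local;
  }

  int main(void)
  {
      struct thr_map m = { .threads_per_group = 7 }; /* nbthread 28, 4 groups */

      assert(thr_group(&m, 8) == 2 && thr_local(&m, 8) == 1);
      assert(thr_global(&m, 2, 1) == 8);
      return 0;
  }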


Panic handler + watchdog
------------------------
Will probably depend on what's done for thread_isolate.

Per-thread arrays inside structures
-----------------------------------
- listeners have a thr_conn[] array, currently limited to MAX_THREADS. Should
  we simply bump the limit ?
- same for servers with idle connections.
  => doesn't seem very practical.
- another solution might be to point to dynamically allocated arrays of
  arrays (e.g. nbthread * nbgroup) or a first level per group and a second
  per thread (see the sketch below).
  => dynamic allocation based on the global number
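
A possible shape for the dynamic allocation mentioned in the last point, with
a first level per group and a second per thread (all names are made up for
illustration):

  #include <stdlib.h>

  #define MAX_TGROUPS 16

  struct per_thread {
      unsigned int nb_conn;       /* e.g. a listener's per-thread conn count */
  };

  struct listener_sketch {
      /* one pointer per group, each pointing to an array sized for that
       * group's thread count, allocated at boot from the global settings.
       */
      struct per_thread *per_grp[MAX_TGROUPS];
  };

  static int listener_alloc_thr(struct listener_sketch *l,
                                int nbtgroups, int threads_per_group)
  {
      for (int g = 0; g < nbtgroups; g++) {
          l->per_grp[g] = calloc(threads_per_group, sizeof(*l->per_grp[g]));
          if (!l->per_grp[g])
              return -1;
      }
      return 0;
  }

  int main(void)
  {
      struct listener_sketch l = { 0 };

      if (listener_alloc_thr(&l, 4, 7) == 0)
          l.per_grp[2][5].nb_conn++;   /* group 3, local thread 6 (0-based) */
      return 0;
  }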

Other
-----
- what about dynamic thread start/stop (e.g. for containers/VMs) ?
  E.g. if we decide to start $MANY threads in 4 groups, and only use
  one, in the end it will not be possible to use less than one thread
  per group, and at most 64 will be present in each group.


FD Notes
--------
- updt_fd_polling() uses thread_mask to figure where to send the update,
  the local list or a shared list, and which bits to set in update_mask.
  This could be changed so that it takes the update mask in argument. The
  call from the poller's fork would just have to broadcast everywhere.

- pollers use it to figure whether they're concerned or not by the activity
  update. This looks important as otherwise we could re-enable polling on
  an FD that changed to another thread.

- thread_mask being a per-thread active mask looks more exact and is
  precisely used this way by _update_fd(). In this case using it instead
  of running_mask to gauge a change or temporarily lock it during a
  removal could make sense.

- running should be conditioned by thread. Polled not (since deferred
  or migrated). In this case testing thread_mask can be enough most of
  the time, but this requires synchronization that will have to be
  extended to tgid. But migration seems a different beast that we shouldn't
  care about here: if first performed at the higher level it ought to
  be safe.

In practice the update_mask can be dropped to zero by the first fd_delete()
as the only authority allowed to fd_delete() is *the* owner, and as soon as
all running_mask bits are gone, the FD will be closed, hence removed from
all pollers. This will be the only way to make sure that update_mask always
refers to the current tgid.

However, it may happen that a takeover within the same group causes a thread
to read the update_mask late, while the FD is being wiped by another thread.
That other thread may close it, causing another thread in another group to
catch it, change the tgid and start to update the update_mask. This means
that it would be possible for a thread entering do_poll() to see the correct
tgid, then the FD would be closed, reopened and reassigned to another tgid,
and the thread would see its bit in the update_mask and be confused. Right
now this should already happen when the update_mask is not cleared, except
that upon wakeup a migration would be detected and that would be all.

Thus we might need to set the running bit to prevent the FD from migrating
before reading update_mask, which also implies closing on
fd_clr_running() == 0 :-(

Also even fd_update_events() leaves a risk of updating update_mask after
clearing running, thus affecting the wrong one. update_mask should probably
be updated before clearing running_mask there. Also, how about not creating
an update on a close ? Not trivial if done before running, unless
thread_mask==0.

###########################################################

Current state:


  Mux / takeover / fd_delete() code                 ||| poller code
 ---------------------------------------------------|||---------------------------------------------------
                                                    \|/
mux_takeover():                                      | fd_set_running():
    if (fd_takeover()<0)                             |    old = {running, thread};
        return fail;                                 |    new = {tid_bit, tid_bit};
    ...                                              |
fd_takeover():                                       |    do {
    atomic_or(running, tid_bit);                     |        if (!(old.thread & tid_bit))
    old = {running, thread};                         |            return -1;
    new = {tid_bit, tid_bit};                        |        new = { running | tid_bit, old.thread }
    if (owner != expected) {                         |    } while (!dwcas({running, thread}, &old, &new));
        atomic_and(running, ~tid_bit);               |
        return -1; // fail                           | fd_clr_running():
    }                                                |     return atomic_and_fetch(running, ~tid_bit);
                                                     |
    while (old == {tid_bit, !=0 }) {                 | poll():
        if (dwcas({running, thread}, &old, &new)) {  |    if (!owner)
            atomic_and(running, ~tid_bit);           |        continue;
            return 0; // success                     |
        }                                            |    if (!(thread_mask & tid_bit)) {
    }                                                |        epoll_ctl_del();
                                                     |        continue;
    atomic_and(running, ~tid_bit);                   |    }
    return -1; // fail                               |
                                                     |    // via fd_update_events()
fd_delete():                                         |    if (fd_set_running() != -1) {
    atomic_or(running, tid_bit);                     |        iocb();
    atomic_store(thread, 0);                         |        if (fd_clr_running() == 0 && !thread_mask)
    if (fd_clr_running(fd) == 0)                     |            fd_delete_orphan();
        fd_delete_orphan();                          |    }
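
For reference, a compilable rendition of the protocol above. It assumes 32-bit
running/thread masks packed into a single 64-bit word so that a plain 64-bit
CAS stands in for the DWCAS, and it omits the owner check; function names
mirror the pseudocode but this is not HAProxy's actual fd code:

  #include <stdatomic.h>
  #include <stdint.h>
  #include <stdio.h>

  #define PACK(run, thr)   (((uint64_t)(run) << 32) | (uint32_t)(thr))
  #define RUNNING(v)       ((uint32_t)((v) >> 32))
  #define THREAD(v)        ((uint32_t)(v))

  static _Atomic uint64_t fd_masks;   /* {running, thread} for one FD */

  /* try to become the exclusive runner and owner: returns 0 on success */
  static int fd_takeover(uint32_t tid_bit)
  {
      uint64_t old, new;

      atomic_fetch_or(&fd_masks, PACK(tid_bit, 0));       /* set our running bit */
      old = atomic_load(&fd_masks);
      do {
          if (RUNNING(old) != tid_bit || !THREAD(old)) {  /* other runners / dead FD */
              atomic_fetch_and(&fd_masks, ~PACK(tid_bit, 0));
              return -1;
          }
          new = PACK(tid_bit, tid_bit);
      } while (!atomic_compare_exchange_weak(&fd_masks, &old, new));

      atomic_fetch_and(&fd_masks, ~PACK(tid_bit, 0));     /* done running */
      return 0;
  }

  /* set the running bit if we are still among the FD's threads */
  static int fd_set_running(uint32_t tid_bit)
  {
      uint64_t old = atomic_load(&fd_masks), new;

      do {
          if (!(THREAD(old) & tid_bit))
              return -1;
          new = PACK(RUNNING(old) | tid_bit, THREAD(old));
      } while (!atomic_compare_exchange_weak(&fd_masks, &old, new));
      return 0;
  }

  /* clear our running bit and return the remaining running mask */
  static uint32_t fd_clr_running(uint32_t tid_bit)
  {
      uint64_t prev = atomic_fetch_and(&fd_masks, ~PACK(tid_bit, 0));
      return RUNNING(prev) & ~tid_bit;
  }

  int main(void)
  {
      atomic_store(&fd_masks, PACK(0, 0x2));              /* owned by thread 2 */
      printf("takeover by thread 1: %d\n", fd_takeover(0x1));
      printf("set_running thread 1: %d\n", fd_set_running(0x1));
      printf("running after clear: %#x\n", (unsigned)fd_clr_running(0x1));
      return 0;
  }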


The idle_conns_lock prevents the connection from being *picked* and released
while someone else is reading it. What it does is guarantee that on idle
connections, the caller of the IOCB will not dereference the task's context
while the connection is still in the idle list, since it might be picked then
freed at the same instant by another thread. As soon as the IOCB manages to
get that lock, it removes the connection from the list so that it cannot be
taken over anymore. Conversely, the mux's takeover() code runs under that
lock so that if it frees the connection and task, this will appear atomic
to the IOCB. The timeout task (which is another entry point for connection
deletion) does the same. Thus, when coming from the low level (I/O or
timeout):
  - the task always exists, but its ctx, checked under the lock, validates
    it; removing the conn from the list prevents takeover().
  - t->context is stable, except during changes under the takeover lock. So
    h2_timeout_task may well run on a different thread than h2_io_cb().

Coming from the top:
  - takeover() done under the lock clears the task's ctx and possibly closes
    the FD (unless some running bit remains present).
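
A minimal sketch of that locking discipline (simplified stand-in types and
names, not HAProxy's API): the IOCB takes the idle-conns lock and unlinks the
connection before touching its context, while takeover() works under the same
lock:

  #include <pthread.h>
  #include <stddef.h>

  struct conn_sketch {
      struct conn_sketch *idle_next;  /* non-NULL while in the idle list */
      void *ctx;                      /* what takeover() may steal/free */
  };

  static pthread_mutex_t idle_conns_lock = PTHREAD_MUTEX_INITIALIZER;

  /* IOCB side: grab the lock, leave the idle list, then ctx is safe to use */
  static void iocb_sketch(struct conn_sketch *conn)
  {
      pthread_mutex_lock(&idle_conns_lock);
      conn->idle_next = NULL;          /* unlinked: cannot be taken over anymore */
      pthread_mutex_unlock(&idle_conns_lock);

      if (conn->ctx) {
          /* process the event; ctx cannot change under us anymore */
      }
  }

  /* takeover side: also under the lock, so freeing looks atomic to the IOCB */
  static void takeover_sketch(struct conn_sketch *conn, void *new_ctx)
  {
      pthread_mutex_lock(&idle_conns_lock);
      if (conn->idle_next)             /* still idle: we may steal it */
          conn->ctx = new_ctx;
      pthread_mutex_unlock(&idle_conns_lock);
  }

  int main(void)
  {
      struct conn_sketch c = { .idle_next = &c, .ctx = &c };
      takeover_sketch(&c, NULL);
      iocb_sketch(&c);
      return 0;
  }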

Unlikely but currently possible situations:
  - multiple pollers (up to N) may have an idle connection's FD being
    polled, if the connection was passed from thread to thread. The first
    event on the connection would wake all of them. Most of them would
    see fdtab[].owner set (the late ones might miss it). All but one would
    see that their bit is missing from fdtab[].thread_mask and give up.
    However, just after this test, others might take over the connection,
    so in practice if terribly unlucky, all but 1 could see their bit in
    thread_mask just before it gets removed, all of them set their bit
    in running_mask, and all of them call iocb() (sock_conn_iocb()).
    Thus all of them dereference the connection and touch the subscriber
    with no protection, then end up in conn_notify_mux() that will call
    the mux's wake().

  - multiple pollers (up to N-1) might still be in fd_update_events()
    manipulating fdtab[].state. The cause is that the "locked" variable
    is determined by atleast2(thread_mask) but that thread_mask is read
    at a random instant (i.e. it may be stolen by another one during a
    takeover) since we don't yet hold running to prevent this from being
    done. Thus we can arrive here with thread_mask==something_else (1 bit),
    locked==0 and fdtab[].state assigned non-atomically.

  - it looks like nothing prevents h2_release() from being called on a
    thread (e.g. from the top or a task timeout) while sock_conn_iocb()
    dereferences the connection on another thread. Those killing the
    connection don't yet consider the fact that it's an FD that others
    might currently be waking up on.

###################

Problem with a counter:

A users count doesn't say who's using the FD, and two users can perform the
same close in turn. The thread_mask should define who's responsible for
closing the FD, and all those with a bit in it ought to do it.


2021-08-25 - update with minimal locking on tgid value
==========

  - tgid + refcount at once using CAS
  - idle_conns lock during updates
  - update:
      if tgid differs => close happened, thus drop update
      otherwise normal stuff. Lock tgid until running if needed.
  - poll report:
      if tgid differs => closed
      if thread differs => stop polling (migrated)
      keep tgid lock until running
  - test on thread_id:
      if (xadd(&tgid,65536) != my_tgid) {
          // was closed
          sub(&tgid, 65536)
          return -1
      }
      if !(thread_id & tid_bit) => migrated/closed
      set_running()
      sub(tgid,65536)
  - note: either fd_insert() or the final close() ought to set
    polled and update to 0.
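
A small sketch of that packing, with the masking of the old value made
explicit (low 16 bits = tgid, upper bits = use count; purely illustrative,
not fd.c):

  #include <stdatomic.h>
  #include <stdint.h>
  #include <stdio.h>

  #define TGID_MASK   0xffffU
  #define TGID_REF    0x10000U           /* one reference = 65536 */

  static _Atomic uint32_t fd_tgid;       /* {refcount:16, tgid:16} for one FD */

  /* grab a reference on the FD's tgid; fails if it is not <my_tgid> anymore */
  static int fd_grab_tgid(uint16_t my_tgid)
  {
      uint32_t old = atomic_fetch_add(&fd_tgid, TGID_REF);

      if ((old & TGID_MASK) != my_tgid) {
          /* closed or reassigned to another group: drop our reference */
          atomic_fetch_sub(&fd_tgid, TGID_REF);
          return -1;
      }
      return 0;
  }

  static void fd_drop_tgid(void)
  {
      atomic_fetch_sub(&fd_tgid, TGID_REF);
  }

  int main(void)
  {
      atomic_store(&fd_tgid, 3);          /* FD currently owned by tgroup 3 */

      if (fd_grab_tgid(3) == 0) {
          /* tgid is pinned here: safe to test thread_id, set running, etc. */
          fd_drop_tgid();
      }
      printf("tgid word: %#x\n", (unsigned)atomic_load(&fd_tgid));
      return 0;
  }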

2021-09-13 - tid / tgroups etc.
==========

- tid currently is the thread's global ID. It's essentially used as an index
  for arrays. It must be clearly stated that it works this way.

- tid_bit makes no sense process-wide, so it must be redefined to represent
  the thread's tid within its group. The name is not very welcome though,
  but there are 286 occurrences of it that are not going to be changed that
  fast.

- just like "ti" is the thread_info, we need to have "tg" pointing to the
  thread_group.

- other less commonly used elements should be retrieved from ti->xxx. E.g.
  the thread's local ID.

- lock debugging must reproduce the tgid.

- an offset might be placed in the tgroup so that even with 64 threads max
  we could have completely separate tid_bits over several groups (as in the
  sketch below).
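
A hedged sketch of what these globals could look like (field and variable
names are assumptions):

  #include <stdint.h>
  #include <stdio.h>

  struct tgroup_info {
      uint16_t tgid;        /* group ID, starting at 1 */
      uint16_t base;        /* global ID of this group's first thread */
      uint16_t count;       /* number of threads in this group */
  };

  struct thread_info {
      struct tgroup_info *tg;   /* "tg": the group this thread belongs to */
      uint16_t ltid;            /* local thread ID within the group (0-based) */
      uint64_t ltid_bit;        /* what tid_bit becomes: a bit *within* the group */
  };

  static _Thread_local struct thread_info *ti;   /* "ti", as today */

  /* global thread ID derived from the group's base offset */
  static int thread_global_id(const struct thread_info *t)
  {
      return t->tg->base + t->ltid;
  }

  int main(void)
  {
      struct tgroup_info g2 = { .tgid = 2, .base = 8, .count = 8 };
      struct thread_info  t = { .tg = &g2, .ltid = 3, .ltid_bit = 1ULL << 3 };

      ti = &t;
      printf("tgid=%d ltid=%d global=%d\n",
             ti->tg->tgid, ti->ltid, thread_global_id(ti));
      return 0;
  }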

2021-09-15 - bind + listen() + rx
==========

- thread_mask (in bind_conf->rx_settings) should become an array of
  MAX_TGROUP longs.
- when parsing "thread 123" or "thread 2/37", the proper bit is set,
  assuming the array is either a contiguous bitfield or a tgroup array.
  An option RX_O_THR_PER_GRP or RX_O_THR_PER_PROC is set depending on
  how the thread num was parsed, so that we reject mixes.
- end of parsing: entries translated to the cleanest form (to be determined)
- binding: for each socket()/bind()/listen()... just perform one extra dup()
  for each tgroup and store the multiple FDs into an FD array indexed on
  MAX_TGROUP. => allows using one FD per tgroup for the same socket, hence
  having multiple entries in all tgroup pollers without requiring the user
  to duplicate the bind line (sketched below).
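
A minimal sketch of the dup()-per-group idea for a single listening socket
(names and the group count are illustrative assumptions):

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  #define MAX_TGROUPS 16

  struct rx_sketch {
      int fd[MAX_TGROUPS];    /* one FD per thread group for the same socket */
  };

  static int rx_bind_sketch(struct rx_sketch *rx, int nbtgroups, int port)
  {
      struct sockaddr_in addr;

      memset(&addr, 0, sizeof(addr));
      addr.sin_family = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      addr.sin_port = htons(port);

      rx->fd[0] = socket(AF_INET, SOCK_STREAM, 0);
      if (rx->fd[0] < 0 ||
          bind(rx->fd[0], (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
          listen(rx->fd[0], 128) < 0)
          return -1;

      /* one extra dup() per additional group: same socket, distinct FDs,
       * hence distinct fdtab[]/poller entries per group.
       */
      for (int grp = 1; grp < nbtgroups; grp++) {
          rx->fd[grp] = dup(rx->fd[0]);
          if (rx->fd[grp] < 0)
              return -1;
      }
      return 0;
  }

  int main(void)
  {
      struct rx_sketch rx;
      return rx_bind_sketch(&rx, 4, 8080) ? 1 : 0;
  }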

2021-09-15 - global thread masks
==========

Some global variables currently expect to know about thread IDs and it's
uncertain what must be done with them:
- global_tasks_mask   /* Mask of threads with tasks in the global runqueue */
  => touched under the rq lock. Change it per-group ? What exact use is made ?

- sleeping_thread_mask /* Threads that are about to sleep in poll() */
  => seems that it can be made per group

- all_threads_mask: a bit complicated, derived from nbthread and used with
  masks and with my_ffsl() to wake threads up. Should probably be per-group
  but we might miss something for global.

- stopping_thread_mask: used in combination with all_threads_mask, should
  move per-group.

- threads_harmless_mask: indicates all threads that are currently harmless in
  that they promise not to access a shared resource. Must be made per-group
  but then we'll likely need a second stage to have the harmless groups mask.
  threads_idle_mask, threads_sync_mask, threads_want_rdv_mask go with the one
  above. Maybe the right approach will be to request harmless on a group mask
  so that we can detect collisions and arbitrate them like today, but on top
  of this it becomes possible to request harmless only on the local group if
  desired. The subtlety is that requesting harmless at the group level does
  not mean it's achieved, since the requester cannot vouch for the other ones
  in the same group (see the two-stage sketch after this list).
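
A speculative sketch of that second stage: each group aggregates its own
harmless mask and publishes one bit in a global harmless-groups mask once all
of its threads are harmless. This only illustrates the aggregation, not the
full rendez-vous protocol, and every name here is an assumption:

  #include <stdatomic.h>
  #include <stdint.h>

  #define MAX_TGROUPS 16

  struct tgroup_ctx {
      _Atomic uint64_t threads_harmless_mask;  /* per-group, one bit per local thread */
      uint64_t threads_enabled_mask;           /* all threads of this group */
  };

  static struct tgroup_ctx ha_tgroup_ctx[MAX_TGROUPS];
  static _Atomic uint32_t harmless_groups_mask; /* stage 2: one bit per group */

  /* called by a thread entering the harmless state */
  static void thread_harmless_now(int grp, uint64_t ltid_bit)
  {
      struct tgroup_ctx *tg = &ha_tgroup_ctx[grp];
      uint64_t mask;

      mask = atomic_fetch_or(&tg->threads_harmless_mask, ltid_bit) | ltid_bit;
      if (mask == tg->threads_enabled_mask)
          /* last one in: the whole group is now harmless */
          atomic_fetch_or(&harmless_groups_mask, 1U << grp);
  }

  /* an isolation requester only has to watch harmless_groups_mask to know
   * whether the groups it targets are all quiesced.
   */
  static int groups_are_harmless(uint32_t target_groups)
  {
      return (atomic_load(&harmless_groups_mask) & target_groups) == target_groups;
  }

  int main(void)
  {
      ha_tgroup_ctx[1].threads_enabled_mask = 0x3;  /* group 1 has 2 threads */
      thread_harmless_now(1, 0x1);
      thread_harmless_now(1, 0x2);
      return groups_are_harmless(1U << 1) ? 0 : 1;
  }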

In addition, some variables are related to the global runqueue:

  __decl_aligned_spinlock(rq_lock);  /* spin lock related to run queue */
  struct eb_root rqueue;             /* tree constituting the global run queue, accessed under rq_lock */
  unsigned int grq_total;            /* total number of entries in the global run queue, atomic */
  static unsigned int global_rqueue_ticks; /* insertion count in the grq, use rq_lock */

And others to the global wait queue:

  struct eb_root timers;             /* sorted timers tree, global, accessed under wq_lock */
  __decl_aligned_rwlock(wq_lock);    /* RW lock related to the wait queue */


2021-09-29 - group designation and masks
==========

Neither FDs nor tasks will belong to incomplete subsets of threads spanning
over multiple thread groups. In addition there may be a difference between
configuration and operation (for FDs). This allows us to fix the following
rules:

  group  mask   description
    0     0     bind_conf: groups & thread not set. bind to any/all
                task: it would be nice to mean "run on the same as the caller".

    0    xxx    bind_conf: thread set but not group: thread IDs are global
                FD/task: group 0, mask xxx

   G>0    0     bind_conf: only group is set: bind to all threads of group G
                FD/task: mask 0 not permitted (= not owned). May be used to
                mention "any thread of this group", though already covered by
                G/xxx like today.

   G>0   xxx    bind_conf: bind to these threads of this group
                FD/task: group G, mask xxx

It looks like keeping groups starting at zero internally complicates
everything though. But forcing them to start at 1 might also require that we
rescan all tasks to replace 0 with 1 upon startup. This would also allow
group 0 to be special and be used as the default group for any new thread
creation, so that group0.count would keep the number of unassigned threads.
Let's try:

  group  mask   description
    0     0     bind_conf: groups & thread not set. bind to any/all
                task: "run on the same group & thread as the caller".

    0    xxx    bind_conf: thread set but not group: thread IDs are global
                FD/task: invalid. Or maybe for a task we could use this to
                mean "run on current group, thread XXX", which would cover
                the need for health checks (g/t 0/0 while sleeping, 0/xxx
                while running) and have wake_expired_tasks() detect 0/0 and
                wake them up to a random group.

   G>0    0     bind_conf: only group is set: bind to all threads of group G
                FD/task: mask 0 not permitted (= not owned). May be used to
                mention "any thread of this group", though already covered by
                G/xxx like today.

   G>0   xxx    bind_conf: bind to these threads of this group
                FD/task: group G, mask xxx

With a single group declared in the config, group 0 would implicitly find the
first one.


The problem with the approach above is that a task queued in one group+thread's
wait queue could very well receive a signal from another thread and/or group,
and there is no indication about where the task is queued, nor how to
dequeue it. Thus it seems that it's up to the application itself to unbind/
rebind a task. This contradicts the principle of leaving a task waiting in a
wait queue and waking it up anywhere.

Another possibility might be to decide that a task having a defined group but
a mask of zero is shared and will always be queued into its group's wait queue.
However, upon expiry, the scheduler would notice the thread-mask 0 and would
broadcast it to any group.

Right now in the code we have:
  - 18 calls of task_new(tid_bit)
  - 18 calls of task_new(MAX_THREADS_MASK)
  - 2 calls with a single bit

Thus it looks like "task_new_anywhere()", "task_new_on()" and
"task_new_here()" would be sufficient.
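
A hedged sketch of what these three helpers could look like on top of a
group-aware task (layout and signatures are assumptions, not the final API):

  #include <stdint.h>
  #include <stdlib.h>

  struct task_sketch {
      uint16_t tgid;         /* 0 = any group (decided at first wakeup) */
      uint64_t thread_mask;  /* 0 = any thread within the group */
  };

  static struct task_sketch *task_new_sk(uint16_t tgid, uint64_t mask)
  {
      struct task_sketch *t = calloc(1, sizeof(*t));

      if (t) {
          t->tgid = tgid;
          t->thread_mask = mask;
      }
      return t;
  }

  /* replaces task_new(MAX_THREADS_MASK): runnable anywhere */
  static struct task_sketch *task_new_anywhere(void)
  {
      return task_new_sk(0, 0);
  }

  /* replaces task_new(tid_bit): runnable only on the calling thread */
  static struct task_sketch *task_new_here(uint16_t my_tgid, uint64_t my_tid_bit)
  {
      return task_new_sk(my_tgid, my_tid_bit);
  }

  /* replaces task_new(single specific bit): one designated group/thread */
  static struct task_sketch *task_new_on(uint16_t tgid, int local_thr)
  {
      return task_new_sk(tgid, 1ULL << local_thr);
  }

  int main(void)
  {
      struct task_sketch *a = task_new_anywhere();
      struct task_sketch *b = task_new_here(1, 1ULL << 0);
      struct task_sketch *c = task_new_on(2, 5);

      free(a); free(b); free(c);
      return 0;
  }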