Willy Tarreau | 2747162 | 2021-11-18 17:45:57 +0100 | [diff] [blame] | 1 | Thread groups |
| 2 | ############# |
| 3 | |
| 4 | 2021-07-13 - first draft |
| 5 | ========== |
| 6 | |
| 7 | Objective |
| 8 | --------- |
| 9 | - support multi-socket systems with limited cache-line bouncing between |
| 10 | physical CPUs and/or L3 caches |
| 11 | |
| 12 | - overcome the 64-thread limitation |
| 13 | |
| 14 | - Support a reasonable number of groups. I.e. if modern CPUs arrive with |
| 15 | core complexes made of 8 cores, with 8 CC per chip and 2 chips in a |
| 16 | system, it makes sense to support 16 groups. |
| 17 | |
| 18 | |
| 19 | Non-objective |
| 20 | ------------- |
| 21 | - no need to optimize to the last possible cycle. I.e. some algos like |
| 22 | leastconn will remain shared across all threads, servers will keep a |
| 23 | single queue, etc. Global information remains global. |
| 24 | |
| 25 | - no stubborn enforcement of FD sharing. Per-server idle connection lists |
| 26 | can become per-group; listeners can (and should probably) be per-group. |
| 27 | Other mechanisms (like SO_REUSEADDR) can already overcome this. |
| 28 | |
| 29 | - no need to go beyond 64 threads per group. |
| 30 | |
| 31 | |
| 32 | Identified tasks |
| 33 | ================ |
| 34 | |
| 35 | General |
| 36 | ------- |
| 37 | Everywhere tid_bit is used we absolutely need to find a complement using |
| 38 | either the current group or a specific one. Thread debugging will need to |
| 39 | be extended as masks are extensively used. |
| 40 | |
| 41 | |
| 42 | Scheduler |
| 43 | --------- |
| 44 | The global run queue and global wait queue must become per-group. This |
| 45 | means that a task may only be queued into one of them at a time. It |
| 46 | sounds like tasks may only belong to a given group, but doing so would |
| 47 | bring back the original issue that it's impossible to perform remote wake |
| 48 | ups. |
| 49 | |
| 50 | We could probably ignore the group if we don't need to set the thread mask |
| 51 | in the task. the task's thread_mask is never manipulated using atomics so |
| 52 | it's safe to complement it with a group. |
| 53 | |
| 54 | The sleeping_thread_mask should become per-group. Thus possibly that a |
| 55 | wakeup may only be performed on the assigned group, meaning that either |
| 56 | a task is not assigned, in which case it be self-assigned (like today), |
| 57 | otherwise the tg to be woken up will be retrieved from the task itself. |
| 58 | |
| 59 | Task creation currently takes a thread mask of either tid_bit, a specific |
| 60 | mask, or MAX_THREADS_MASK. How to create a task able to run anywhere |
| 61 | (checks, Lua, ...) ? |
| 62 | |
Willy Tarreau | 680ed5f | 2022-06-13 15:59:39 +0200 | [diff] [blame] | 63 | Profiling -> completed |
Willy Tarreau | 2747162 | 2021-11-18 17:45:57 +0100 | [diff] [blame] | 64 | --------- |
| 65 | There should be one task_profiling_mask per thread group. Enabling or |
| 66 | disabling profiling should be made per group (possibly by iterating). |
Willy Tarreau | 680ed5f | 2022-06-13 15:59:39 +0200 | [diff] [blame] | 67 | -> not needed anymore, one flag per thread in each thread's context. |
Willy Tarreau | 2747162 | 2021-11-18 17:45:57 +0100 | [diff] [blame] | 68 | |
| 69 | Thread isolation |
| 70 | ---------------- |
| 71 | Thread isolation is difficult as we solely rely on atomic ops to figure |
| 72 | who can complete. Such operation is rare, maybe we could have a global |
| 73 | read_mostly flag containing a mask of the groups that require isolation. |
| 74 | Then the threads_want_rdv_mask etc can become per-group. However setting |
| 75 | and clearing the bits will become problematic as this will happen in two |
| 76 | steps hence will require careful ordering. |
| 77 | |
| 78 | FD |
| 79 | -- |
| 80 | Tidbit is used in a number of atomic ops on the running_mask. If we have |
| 81 | one fdtab[] per group, the mask implies that it's within the group. |
| 82 | Theoretically we should never face a situation where an FD is reported nor |
| 83 | manipulated for a remote group. |
| 84 | |
| 85 | There will still be one poller per thread, except that this time all |
| 86 | operations will be related to the current thread_group. No fd may appear |
| 87 | in two thread_groups at once, but we can probably not prevent that (e.g. |
| 88 | delayed close and reopen). Should we instead have a single shared fdtab[] |
| 89 | (less memory usage also) ? Maybe adding the group in the fdtab entry would |
| 90 | work, but when does a thread know it can leave it ? Currently this is |
| 91 | solved by running_mask and by update_mask. Having two tables could help |
| 92 | with this (each table sees the FD in a different group with a different |
| 93 | mask) but this looks overkill. |
| 94 | |
| 95 | There's polled_mask[] which needs to be decided upon. Probably that it |
| 96 | should be doubled as well. Note, polled_mask left fdtab[] for cacheline |
| 97 | alignment reasons in commit cb92f5cae4. |
| 98 | |
| 99 | If we have one fdtab[] per group, what *really* prevents from using the |
| 100 | same FD in multiple groups ? _fd_delete_orphan() and fd_update_events() |
| 101 | need to check for no-thread usage before closing the FD. This could be |
| 102 | a limiting factor. Enabling could require to wake every poller. |
| 103 | |
| 104 | Shouldn't we remerge fdinfo[] with fdtab[] (one pointer + one int/short, |
| 105 | used only during creation and close) ? |
| 106 | |
| 107 | Other problem, if we have one fdtab[] per TG, disabling/enabling an FD |
| 108 | (e.g. pause/resume on listener) can become a problem if it's not necessarily |
| 109 | on the current TG. We'll then need a way to figure that one. It sounds like |
| 110 | FDs from listeners and receivers are very specific and suffer from problems |
| 111 | all other ones under high load do not suffer from. Maybe something specific |
| 112 | ought to be done for them, if we can guarantee there is no risk of accidental |
| 113 | reuse (e.g. locate the TG info in the receiver and have a "MT" bit in the |
| 114 | FD's flags). The risk is always that a close() can result in instant pop-up |
| 115 | of the same FD on any other thread of the same process. |
| 116 | |
| 117 | Observations: right now fdtab[].thread_mask more or less corresponds to a |
| 118 | declaration of interest, it's very close to meaning "active per thread". It is |
| 119 | in fact located in the FD while it ought to do nothing there, as it should be |
| 120 | where the FD is used as it rules accesses to a shared resource that is not |
| 121 | the FD but what uses it. Indeed, if neither polled_mask nor running_mask have |
| 122 | a thread's bit, the FD is unknown to that thread and the element using it may |
| 123 | only be reached from above and not from the FD. As such we ought to have a |
| 124 | thread_mask on a listener and another one on connections. These ones will |
| 125 | indicate who uses them. A takeover could then be simplified (atomically set |
| 126 | exclusivity on the FD's running_mask, upon success, takeover the connection, |
| 127 | clear the running mask). Probably that the change ought to be performed on |
| 128 | the connection level first, not the FD level by the way. But running and |
| 129 | polled are the two relevant elements, one indicates userland knowledge, |
| 130 | the other one kernel knowledge. For listeners there's no exclusivity so it's |
| 131 | a bit different but the rule remains the same that we don't have to know |
| 132 | what threads are *interested* in the FD, only its holder. |
| 133 | |
| 134 | Not exact in fact, see FD notes below. |
| 135 | |
| 136 | activity |
| 137 | -------- |
| 138 | There should be one activity array per thread group. The dump should |
| 139 | simply scan them all since the cumuled values are not very important |
| 140 | anyway. |
| 141 | |
| 142 | applets |
| 143 | ------- |
| 144 | They use tid_bit only for the task. It looks like the appctx's thread_mask |
| 145 | is never used (now removed). Furthermore, it looks like the argument is |
| 146 | *always* tid_bit. |
| 147 | |
| 148 | CPU binding |
| 149 | ----------- |
| 150 | This is going to be tough. It will be needed to detect that threads overlap |
| 151 | and are not bound (i.e. all threads on same mask). In this case, if the number |
| 152 | of threads is higher than the number of threads per physical socket, one must |
| 153 | try hard to evenly spread them among physical sockets (e.g. one thread group |
| 154 | per physical socket) and start as many threads as needed on each, bound to |
| 155 | all threads/cores of each socket. If there is a single socket, the same job |
| 156 | may be done based on L3 caches. Maybe it could always be done based on L3 |
| 157 | caches. The difficulty behind this is the number of sockets to be bound: it |
| 158 | is not possible to bind several FDs per listener. Maybe with a new bind |
| 159 | keyword we can imagine to automatically duplicate listeners ? In any case, |
| 160 | the initially bound cpumap (via taskset) must always be respected, and |
| 161 | everything should probably start from there. |
| 162 | |
| 163 | Frontend binding |
| 164 | ---------------- |
| 165 | We'll have to define a list of threads and thread-groups per frontend. |
| 166 | Probably that having a group mask and a same thread-mask for each group |
| 167 | would suffice. |
| 168 | |
| 169 | Threads should have two numbers: |
| 170 | - the per-process number (e.g. 1..256) |
| 171 | - the per-group number (1..64) |
| 172 | |
| 173 | The "bind-thread" lines ought to use the following syntax: |
| 174 | - bind 45 ## bind to process' thread 45 |
| 175 | - bind 1/45 ## bind to group 1's thread 45 |
| 176 | - bind all/45 ## bind to thread 45 in each group |
| 177 | - bind 1/all ## bind to all threads in group 1 |
| 178 | - bind all ## bind to all threads |
| 179 | - bind all/all ## bind to all threads in all groups (=all) |
| 180 | - bind 1/65 ## rejected |
| 181 | - bind 65 ## OK if there are enough |
| 182 | - bind 35-45 ## depends. Rejected if it crosses a group boundary. |
| 183 | |
| 184 | The global directive "nbthread 28" means 28 total threads for the process. The |
| 185 | number of groups will sub-divide this. E.g. 4 groups will very likely imply 7 |
| 186 | threads per group. At the beginning, the nbgroup should be manual since it |
| 187 | implies config adjustments to bind lines. |
| 188 | |
| 189 | There should be a trivial way to map a global thread to a group and local ID |
| 190 | and to do the opposite. |
| 191 | |
| 192 | |
| 193 | Panic handler + watchdog |
| 194 | ------------------------ |
| 195 | Will probably depend on what's done for thread_isolate |
| 196 | |
| 197 | Per-thread arrays inside structures |
| 198 | ----------------------------------- |
| 199 | - listeners have a thr_conn[] array, currently limited to MAX_THREADS. Should |
| 200 | we simply bump the limit ? |
| 201 | - same for servers with idle connections. |
| 202 | => doesn't seem very practical. |
| 203 | - another solution might be to point to dynamically allocated arrays of |
| 204 | arrays (e.g. nbthread * nbgroup) or a first level per group and a second |
| 205 | per thread. |
| 206 | => dynamic allocation based on the global number |
| 207 | |
| 208 | Other |
| 209 | ----- |
| 210 | - what about dynamic thread start/stop (e.g. for containers/VMs) ? |
| 211 | E.g. if we decide to start $MANY threads in 4 groups, and only use |
| 212 | one, in the end it will not be possible to use less than one thread |
| 213 | per group, and at most 64 will be present in each group. |
| 214 | |
| 215 | |
| 216 | FD Notes |
| 217 | -------- |
| 218 | - updt_fd_polling() uses thread_mask to figure where to send the update, |
| 219 | the local list or a shared list, and which bits to set in update_mask. |
| 220 | This could be changed so that it takes the update mask in argument. The |
| 221 | call from the poller's fork would just have to broadcast everywhere. |
| 222 | |
| 223 | - pollers use it to figure whether they're concerned or not by the activity |
| 224 | update. This looks important as otherwise we could re-enable polling on |
| 225 | an FD that changed to another thread. |
| 226 | |
| 227 | - thread_mask being a per-thread active mask looks more exact and is |
| 228 | precisely used this way by _update_fd(). In this case using it instead |
| 229 | of running_mask to gauge a change or temporarily lock it during a |
| 230 | removal could make sense. |
| 231 | |
Willy Tarreau | 55b9689 | 2022-05-31 08:07:43 +0200 | [diff] [blame] | 232 | - running should be conditioned by thread. Polled not (since deferred |
Willy Tarreau | 2747162 | 2021-11-18 17:45:57 +0100 | [diff] [blame] | 233 | or migrated). In this case testing thread_mask can be enough most of |
| 234 | the time, but this requires synchronization that will have to be |
| 235 | extended to tgid.. But migration seems a different beast that we shouldn't |
| 236 | care about here: if first performed at the higher level it ought to |
| 237 | be safe. |
| 238 | |
| 239 | In practice the update_mask can be dropped to zero by the first fd_delete() |
| 240 | as the only authority allowed to fd_delete() is *the* owner, and as soon as |
| 241 | all running_mask are gone, the FD will be closed, hence removed from all |
| 242 | pollers. This will be the only way to make sure that update_mask always |
| 243 | refers to the current tgid. |
| 244 | |
| 245 | However, it may happen that a takeover within the same group causes a thread |
| 246 | to read the update_mask late, while the FD is being wiped by another thread. |
| 247 | That other thread may close it, causing another thread in another group to |
| 248 | catch it, and change the tgid and start to update the update_mask. This means |
| 249 | that it would be possible for a thread entering do_poll() to see the correct |
| 250 | tgid, then the fd would be closed, reopened and reassigned to another tgid, |
| 251 | and the thread would see its bit in the update_mask, being confused. Right |
| 252 | now this should already happen when the update_mask is not cleared, except |
| 253 | that upon wakeup a migration would be detected and that would be all. |
| 254 | |
| 255 | Thus we might need to set the running bit to prevent the FD from migrating |
| 256 | before reading update_mask, which also implies closing on fd_clr_running() == 0 :-( |
| 257 | |
| 258 | Also even fd_update_events() leaves a risk of updating update_mask after |
| 259 | clearing running, thus affecting the wrong one. Probably that update_mask |
| 260 | should be updated before clearing running_mask there. Also, how about not |
| 261 | creating an update on a close ? Not trivial if done before running, unless |
| 262 | thread_mask==0. |
| 263 | |
Willy Tarreau | d60269f | 2022-07-06 15:44:49 +0200 | [diff] [blame] | 264 | Note that one situation that is currently visible is that a thread closes a |
| 265 | file descriptor that it's the last one to own and to have an update for. In |
| 266 | fd_delete_orphan() it does call poller.clo() but this one is not sufficient |
| 267 | as it doesn't drop the update_mask nor does it clear the polled_mask. The |
| 268 | typical problem that arises is that the close() happens before processing |
| 269 | the last update (e.g. a close() just after a partial read), thus it still |
| 270 | has *at least* one bit set for the current thread in both update_mask and |
| 271 | polled_mask, and it is present in the update_list. Not handling it would |
| 272 | mean that the event is lost on update() from the concerned threads and |
| 273 | that some resource might leak. Handling it means zeroing the update_mask |
| 274 | and polled_mask, and deleting the update entry from the update_list, thus |
| 275 | losing the update event. And as indicated above, if the FD switches twice |
| 276 | between 2 groups, the finally called thread does not necessarily know that |
| 277 | the FD isn't the same anymore, thus it's difficult to decide whether to |
| 278 | delete it or not, because deleting the event might in fact mean deleting |
| 279 | something that was just re-added for the same thread with the same FD but |
| 280 | a different usage. |
| 281 | |
| 282 | Also it really seems unrealistic to scan a single shared update_list like |
| 283 | this using write operations. There should likely be one per thread-group. |
| 284 | But in this case there is no more choice than deleting the update event |
| 285 | upon fd_delete_orphan(). This also means that poller->clo() must do the |
| 286 | job for all of the group's threads at once. This would mean a synchronous |
| 287 | removal before the close(), which doesn't seem ridiculously expensive. It |
| 288 | just requires that any thread of a group may manipulate any other thread's |
| 289 | status for an FD and a poller. |
| 290 | |
| 291 | Note about our currently supported pollers: |
| 292 | |
| 293 | - epoll: our current code base relies on the modern version which |
| 294 | automatically removes closed FDs, so we don't have anything to do |
| 295 | when closing and we don't need the update. |
| 296 | |
| 297 | - kqueue: according to https://www.freebsd.org/cgi/man.cgi?query=kqueue, just |
| 298 | like epoll, a close() implies a removal. Our poller doesn't perform |
| 299 | any bookkeeping either so it's OK to directly close. |
| 300 | |
| 301 | - evports: https://docs.oracle.com/cd/E86824_01/html/E54766/port-dissociate-3c.html |
| 302 | says the same, i.e. close() implies a removal of all events. No local |
| 303 | processing nor bookkeeping either, we can close. |
| 304 | |
| 305 | - poll: the fd_evts[] array is global, thus shared by all threads. As such, |
| 306 | a single removal is needed to flush it for all threads at once. The |
| 307 | operation is already performed like this. |
| 308 | |
| 309 | - select: works exactly like poll() above, hence already handled. |
| 310 | |
| 311 | As a preliminary conclusion, it's safe to delete the event and reset |
| 312 | update_mask just after calling poller->clo(). If extremely unlucky (changing |
| 313 | thread mask due to takeover ?), the same FD may appear at the same time: |
| 314 | - in one or several thread-local fd_updt[] arrays. These ones are just work |
| 315 | queues, there's nothing to do to ignore them, just leave the holes with an |
| 316 | outdated FD which will be ignored once met. As a bonus, poller->clo() could |
| 317 | check if the last fd_updt[] points to this specific FD and decide to kill |
| 318 | it. |
| 319 | |
| 320 | - in the global update_list. In this case, fd_rm_from_fd_list() already |
| 321 | performs an attachment check, so it's safe to always call it before closing |
Ilya Shipitsin | 3b64a28 | 2022-07-29 22:26:53 +0500 | [diff] [blame^] | 322 | (since no one else may be in the process of changing anything). |
Willy Tarreau | d60269f | 2022-07-06 15:44:49 +0200 | [diff] [blame] | 323 | |
| 324 | |
Willy Tarreau | 2747162 | 2021-11-18 17:45:57 +0100 | [diff] [blame] | 325 | ########################################################### |
| 326 | |
| 327 | Current state: |
| 328 | |
| 329 | |
| 330 | Mux / takeover / fd_delete() code ||| poller code |
| 331 | -------------------------------------------------|||--------------------------------------------------- |
| 332 | \|/ |
| 333 | mux_takeover(): | fd_set_running(): |
| 334 | if (fd_takeover()<0) | old = {running, thread}; |
| 335 | return fail; | new = {tid_bit, tid_bit}; |
| 336 | ... | |
| 337 | fd_takeover(): | do { |
Willy Tarreau | 55b9689 | 2022-05-31 08:07:43 +0200 | [diff] [blame] | 338 | atomic_or(running, tid_bit); | if (!(old.thread & tid_bit)) |
Willy Tarreau | 2747162 | 2021-11-18 17:45:57 +0100 | [diff] [blame] | 339 | old = {running, thread}; | return -1; |
| 340 | new = {tid_bit, tid_bit}; | new = { running | tid_bit, old.thread } |
| 341 | if (owner != expected) { | } while (!dwcas({running, thread}, &old, &new)); |
| 342 | atomic_and(runnning, ~tid_bit); | |
| 343 | return -1; // fail | fd_clr_running(): |
| 344 | } | return atomic_and_fetch(running, ~tid_bit); |
| 345 | | |
| 346 | while (old == {tid_bit, !=0 }) | poll(): |
| 347 | if (dwcas({running, thread}, &old, &new)) { | if (!owner) |
| 348 | atomic_and(runnning, ~tid_bit); | continue; |
| 349 | return 0; // success | |
| 350 | } | if (!(thread_mask & tid_bit)) { |
| 351 | } | epoll_ctl_del(); |
| 352 | | continue; |
| 353 | atomic_and(runnning, ~tid_bit); | } |
| 354 | return -1; // fail | |
| 355 | | // via fd_update_events() |
| 356 | fd_delete(): | if (fd_set_running() != -1) { |
| 357 | atomic_or(running, tid_bit); | iocb(); |
| 358 | atomic_store(thread, 0); | if (fd_clr_running() == 0 && !thread_mask) |
| 359 | if (fd_clr_running(fd) = 0) | fd_delete_orphan(); |
| 360 | fd_delete_orphan(); | } |
| 361 | |
| 362 | |
| 363 | The idle_conns_lock prevents the connection from being *picked* and released |
| 364 | while someone else is reading it. What it does is guarantee that on idle |
| 365 | connections, the caller of the IOCB will not dereference the task's context |
| 366 | while the connection is still in the idle list, since it might be picked then |
| 367 | freed at the same instant by another thread. As soon as the IOCB manages to |
| 368 | get that lock, it removes the connection from the list so that it cannot be |
| 369 | taken over anymore. Conversely, the mux's takeover() code runs under that |
| 370 | lock so that if it frees the connection and task, this will appear atomic |
| 371 | to the IOCB. The timeout task (which is another entry point for connection |
| 372 | deletion) does the same. Thus, when coming from the low-level (I/O or timeout): |
| 373 | - task always exists, but ctx checked under lock validates; conn removal |
| 374 | from list prevents takeover(). |
| 375 | - t->context is stable, except during changes under takeover lock. So |
| 376 | h2_timeout_task may well run on a different thread than h2_io_cb(). |
| 377 | |
| 378 | Coming from the top: |
| 379 | - takeover() done under lock() clears task's ctx and possibly closes the FD |
| 380 | (unless some running remains present). |
| 381 | |
| 382 | Unlikely but currently possible situations: |
| 383 | - multiple pollers (up to N) may have an idle connection's FD being |
| 384 | polled, if the connection was passed from thread to thread. The first |
| 385 | event on the connection would wake all of them. Most of them would |
| 386 | see fdtab[].owner set (the late ones might miss it). All but one would |
| 387 | see that their bit is missing from fdtab[].thread_mask and give up. |
| 388 | However, just after this test, others might take over the connection, |
| 389 | so in practice if terribly unlucky, all but 1 could see their bit in |
| 390 | thread_mask just before it gets removed, all of them set their bit |
| 391 | in running_mask, and all of them call iocb() (sock_conn_iocb()). |
| 392 | Thus all of them dereference the connection and touch the subscriber |
| 393 | with no protection, then end up in conn_notify_mux() that will call |
| 394 | the mux's wake(). |
| 395 | |
| 396 | - multiple pollers (up to N-1) might still be in fd_update_events() |
| 397 | manipulating fdtab[].state. The cause is that the "locked" variable |
| 398 | is determined by atleast2(thread_mask) but that thread_mask is read |
| 399 | at a random instant (i.e. it may be stolen by another one during a |
| 400 | takeover) since we don't yet hold running to prevent this from being |
| 401 | done. Thus we can arrive here with thread_mask==something_else (1bit), |
| 402 | locked==0 and fdtab[].state assigned non-atomically. |
| 403 | |
| 404 | - it looks like nothing prevents h2_release() from being called on a |
| 405 | thread (e.g. from the top or task timeout) while sock_conn_iocb() |
| 406 | dereferences the connection on another thread. Those killing the |
| 407 | connection don't yet consider the fact that it's an FD that others |
| 408 | might currently be waking up on. |
| 409 | |
| 410 | ################### |
| 411 | |
| 412 | pb with counter: |
| 413 | |
| 414 | users count doesn't say who's using the FD and two users can do the same |
| 415 | close in turn. The thread_mask should define who's responsible for closing |
| 416 | the FD, and all those with a bit in it ought to do it. |
| 417 | |
| 418 | |
| 419 | 2021-08-25 - update with minimal locking on tgid value |
| 420 | ========== |
| 421 | |
| 422 | - tgid + refcount at once using CAS |
| 423 | - idle_conns lock during updates |
| 424 | - update: |
| 425 | if tgid differs => close happened, thus drop update |
| 426 | otherwise normal stuff. Lock tgid until running if needed. |
| 427 | - poll report: |
| 428 | if tgid differs => closed |
| 429 | if thread differs => stop polling (migrated) |
| 430 | keep tgid lock until running |
| 431 | - test on thread_id: |
| 432 | if (xadd(&tgid,65536) != my_tgid) { |
| 433 | // was closed |
| 434 | sub(&tgid, 65536) |
| 435 | return -1 |
| 436 | } |
| 437 | if !(thread_id & tidbit) => migrated/closed |
| 438 | set_running() |
| 439 | sub(tgid,65536) |
| 440 | - note: either fd_insert() or the final close() ought to set |
| 441 | polled and update to 0. |
| 442 | |
| 443 | 2021-09-13 - tid / tgroups etc. |
| 444 | ========== |
| 445 | |
Willy Tarreau | 1424774 | 2022-06-10 16:05:59 +0200 | [diff] [blame] | 446 | * tid currently is the thread's global ID. It's essentially used as an index |
Willy Tarreau | 2747162 | 2021-11-18 17:45:57 +0100 | [diff] [blame] | 447 | for arrays. It must be clearly stated that it works this way. |
| 448 | |
Willy Tarreau | 1424774 | 2022-06-10 16:05:59 +0200 | [diff] [blame] | 449 | * tasklets use the global thread id, and __tasklet_wakeup_on() must use a |
| 450 | global ID as well. It's capital that tinfo[] provides instant access to |
| 451 | local/global bits/indexes/arrays |
| 452 | |
Willy Tarreau | 2747162 | 2021-11-18 17:45:57 +0100 | [diff] [blame] | 453 | - tid_bit makes no sense process-wide, so it must be redefined to represent |
| 454 | the thread's tid within its group. The name is not much welcome though, but |
| 455 | there are 286 of it that are not going to be changed that fast. |
Willy Tarreau | 1424774 | 2022-06-10 16:05:59 +0200 | [diff] [blame] | 456 | => now we have ltid and ltid_bit in thread_info. thread-local tid_bit still |
| 457 | not changed though. If renamed we must make sure the older one vanishes. |
| 458 | Why not rename "ptid, ptid_bit" for the process-wide tid and "gtid, |
| 459 | gtid_bit" for the group-wide ones ? This removes the ambiguity on "tid" |
| 460 | which is half the time not the one we expect. |
Willy Tarreau | 2747162 | 2021-11-18 17:45:57 +0100 | [diff] [blame] | 461 | |
Willy Tarreau | 1424774 | 2022-06-10 16:05:59 +0200 | [diff] [blame] | 462 | * just like "ti" is the thread_info, we need to have "tg" pointing to the |
Willy Tarreau | 2747162 | 2021-11-18 17:45:57 +0100 | [diff] [blame] | 463 | thread_group. |
| 464 | |
| 465 | - other less commonly used elements should be retrieved from ti->xxx. E.g. |
| 466 | the thread's local ID. |
| 467 | |
| 468 | - lock debugging must reproduce tgid |
| 469 | |
Willy Tarreau | 680ed5f | 2022-06-13 15:59:39 +0200 | [diff] [blame] | 470 | * task profiling must be made per-group (annoying), unless we want to add a |
Willy Tarreau | 1424774 | 2022-06-10 16:05:59 +0200 | [diff] [blame] | 471 | per-thread TH_FL_* flag and have the rare places where the bit is changed |
| 472 | iterate over all threads if needed. Sounds preferable overall. |
| 473 | |
| 474 | * an offset might be placed in the tgroup so that even with 64 threads max |
Willy Tarreau | 2747162 | 2021-11-18 17:45:57 +0100 | [diff] [blame] | 475 | we could have completely separate tid_bits over several groups. |
Willy Tarreau | 1424774 | 2022-06-10 16:05:59 +0200 | [diff] [blame] | 476 | => base and count now |
Willy Tarreau | 2747162 | 2021-11-18 17:45:57 +0100 | [diff] [blame] | 477 | |
| 478 | 2021-09-15 - bind + listen() + rx |
| 479 | ========== |
| 480 | |
| 481 | - thread_mask (in bind_conf->rx_settings) should become an array of |
| 482 | MAX_TGROUP longs. |
| 483 | - when parsing "thread 123" or "thread 2/37", the proper bit is set, |
Ilya Shipitsin | 3b64a28 | 2022-07-29 22:26:53 +0500 | [diff] [blame^] | 484 | assuming the array is either a contiguous bitfield or a tgroup array. |
Willy Tarreau | 2747162 | 2021-11-18 17:45:57 +0100 | [diff] [blame] | 485 | An option RX_O_THR_PER_GRP or RX_O_THR_PER_PROC is set depending on |
| 486 | how the thread num was parsed, so that we reject mixes. |
| 487 | - end of parsing: entries translated to the cleanest form (to be determined) |
| 488 | - binding: for each socket()/bind()/listen()... just perform one extra dup() |
| 489 | for each tgroup and store the multiple FDs into an FD array indexed on |
| 490 | MAX_TGROUP. => allows to use one FD per tgroup for the same socket, hence |
| 491 | to have multiple entries in all tgroup pollers without requiring the user |
| 492 | to duplicate the bind line. |
| 493 | |
| 494 | 2021-09-15 - global thread masks |
| 495 | ========== |
| 496 | |
| 497 | Some global variables currently expect to know about thread IDs and it's |
| 498 | uncertain what must be done with them: |
| 499 | - global_tasks_mask /* Mask of threads with tasks in the global runqueue */ |
| 500 | => touched under the rq lock. Change it per-group ? What exact use is made ? |
| 501 | |
| 502 | - sleeping_thread_mask /* Threads that are about to sleep in poll() */ |
| 503 | => seems that it can be made per group |
| 504 | |
| 505 | - all_threads_mask: a bit complicated, derived from nbthread and used with |
| 506 | masks and with my_ffsl() to wake threads up. Should probably be per-group |
| 507 | but we might miss something for global. |
| 508 | |
| 509 | - stopping_thread_mask: used in combination with all_threads_mask, should |
| 510 | move per-group. |
| 511 | |
| 512 | - threads_harmless_mask: indicates all threads that are currently harmless in |
| 513 | that they promise not to access a shared resource. Must be made per-group |
| 514 | but then we'll likely need a second stage to have the harmless groups mask. |
| 515 | threads_idle_mask, threads_sync_mask, threads_want_rdv_mask go with the one |
| 516 | above. Maybe the right approach will be to request harmless on a group mask |
| 517 | so that we can detect collisions and arbiter them like today, but on top of |
| 518 | this it becomes possible to request harmless only on the local group if |
| 519 | desired. The subtlety is that requesting harmless at the group level does |
| 520 | not mean it's achieved since the requester cannot vouch for the other ones |
| 521 | in the same group. |
| 522 | |
| 523 | In addition, some variables are related to the global runqueue: |
| 524 | __decl_aligned_spinlock(rq_lock); /* spin lock related to run queue */ |
| 525 | struct eb_root rqueue; /* tree constituting the global run queue, accessed under rq_lock */ |
| 526 | unsigned int grq_total; /* total number of entries in the global run queue, atomic */ |
| 527 | static unsigned int global_rqueue_ticks; /* insertion count in the grq, use rq_lock */ |
| 528 | |
| 529 | And others to the global wait queue: |
| 530 | struct eb_root timers; /* sorted timers tree, global, accessed under wq_lock */ |
| 531 | __decl_aligned_rwlock(wq_lock); /* RW lock related to the wait queue */ |
| 532 | struct eb_root timers; /* sorted timers tree, global, accessed under wq_lock */ |
| 533 | |
| 534 | |
Willy Tarreau | 0aa6f3e | 2022-06-14 15:00:40 +0200 | [diff] [blame] | 535 | 2022-06-14 - progress on task affinity |
| 536 | ========== |
| 537 | |
| 538 | The particularity of the current global run queue is to be usable for remote |
| 539 | wakeups because it's protected by a lock. There is no need for a global run |
| 540 | queue beyond this, and there could already be a locked queue per thread for |
| 541 | remote wakeups, with a random selection at wakeup time. It's just that picking |
| 542 | a pending task in a run queue among a number is convenient (though it |
| 543 | introduces some excessive locking). A task will either be tied to a single |
| 544 | group or will be allowed to run on any group. As such it's pretty clear that we |
| 545 | don't need a global run queue. When a run-anywhere task expires, either it runs |
| 546 | on the current group's runqueue with any thread, or a target thread is selected |
| 547 | during the wakeup and it's directly assigned. |
| 548 | |
| 549 | A global wait queue seems important for scheduled repetitive tasks however. But |
| 550 | maybe it's more a task for a cron-like job and there's no need for the task |
| 551 | itself to wake up anywhere, because once the task wakes up, it must be tied to |
| 552 | one (or a set of) thread(s). One difficulty if the task is temporarily assigned |
| 553 | a thread group is that it's impossible to know where it's running when trying |
| 554 | to perform a second wakeup or when trying to kill it. Maybe we'll need to have |
| 555 | two tgid for a task (desired, effective). Or maybe we can restrict the ability |
| 556 | of such a task to stay in wait queue in case of wakeup, though that sounds |
| 557 | difficult. Other approaches would be to set the GID to the current one when |
| 558 | waking up the task, and to have a flag (or sign on the GID) indicating that the |
| 559 | task is still queued in the global timers queue. We already have TASK_SHARED_WQ |
| 560 | so it seems that antoher similar flag such as TASK_WAKE_ANYWHERE could make |
| 561 | sense. But when is TASK_SHARED_WQ really used, except for the "anywhere" case ? |
| 562 | All calls to task_new() use either 1<<thr, tid_bit, all_threads_mask, or come |
| 563 | from appctx_new which does exactly the same. The only real user of non-global, |
| 564 | non-unique task_new() call is debug_parse_cli_sched() which purposely allows to |
| 565 | use an arbitrary mask. |
| 566 | |
| 567 | +----------------------------------------------------------------------------+ |
| 568 | | => we don't need one WQ per group, only a global and N local ones, hence | |
| 569 | | the TASK_SHARED_WQ flag can continue to be used for this purpose. | |
| 570 | +----------------------------------------------------------------------------+ |
| 571 | |
| 572 | Having TASK_SHARED_WQ should indicate that a task will always be queued to the |
| 573 | shared queue and will always have a temporary gid and thread mask in the run |
| 574 | queue. |
| 575 | |
| 576 | Going further, as we don't have any single case of a task bound to a small set |
| 577 | of threads, we could decide to wake up only expired tasks for ourselves by |
| 578 | looking them up using eb32sc and adopting them. Thus, there's no more need for |
| 579 | a shared runqueue nor a global_runqueue_ticks counter, and we can simply have |
| 580 | the ability to wake up a remote task. The task's thread_mask will then change |
| 581 | so that it's only a thread ID, except when the task has TASK_SHARED_WQ, in |
| 582 | which case it corresponds to the running thread. That's very close to what is |
| 583 | already done with tasklets in fact. |
| 584 | |
| 585 | |
Willy Tarreau | 2747162 | 2021-11-18 17:45:57 +0100 | [diff] [blame] | 586 | 2021-09-29 - group designation and masks |
| 587 | ========== |
| 588 | |
| 589 | Neither FDs nor tasks will belong to incomplete subsets of threads spanning |
| 590 | over multiple thread groups. In addition there may be a difference between |
| 591 | configuration and operation (for FDs). This allows to fix the following rules: |
| 592 | |
| 593 | group mask description |
| 594 | 0 0 bind_conf: groups & thread not set. bind to any/all |
| 595 | task: it would be nice to mean "run on the same as the caller". |
| 596 | |
| 597 | 0 xxx bind_conf: thread set but not group: thread IDs are global |
| 598 | FD/task: group 0, mask xxx |
| 599 | |
| 600 | G>0 0 bind_conf: only group is set: bind to all threads of group G |
| 601 | FD/task: mask 0 not permitted (= not owned). May be used to |
| 602 | mention "any thread of this group", though already covered by |
| 603 | G/xxx like today. |
| 604 | |
| 605 | G>0 xxx bind_conf: Bind to these threads of this group |
| 606 | FD/task: group G, mask xxx |
| 607 | |
| 608 | It looks like keeping groups starting at zero internally complicates everything |
| 609 | though. But forcing it to start at 1 might also require that we rescan all tasks |
| 610 | to replace 0 with 1 upon startup. This would also allow group 0 to be special and |
| 611 | be used as the default group for any new thread creation, so that group0.count |
| 612 | would keep the number of unassigned threads. Let's try: |
| 613 | |
| 614 | group mask description |
| 615 | 0 0 bind_conf: groups & thread not set. bind to any/all |
| 616 | task: "run on the same group & thread as the caller". |
| 617 | |
| 618 | 0 xxx bind_conf: thread set but not group: thread IDs are global |
| 619 | FD/task: invalid. Or maybe for a task we could use this to |
| 620 | mean "run on current group, thread XXX", which would cover |
| 621 | the need for health checks (g/t 0/0 while sleeping, 0/xxx |
| 622 | while running) and have wake_expired_tasks() detect 0/0 and |
| 623 | wake them up to a random group. |
| 624 | |
| 625 | G>0 0 bind_conf: only group is set: bind to all threads of group G |
| 626 | FD/task: mask 0 not permitted (= not owned). May be used to |
| 627 | mention "any thread of this group", though already covered by |
| 628 | G/xxx like today. |
| 629 | |
| 630 | G>0 xxx bind_conf: Bind to these threads of this group |
| 631 | FD/task: group G, mask xxx |
| 632 | |
| 633 | With a single group declared in the config, group 0 would implicitly find the |
| 634 | first one. |
| 635 | |
| 636 | |
| 637 | The problem with the approach above is that a task queued in one group+thread's |
| 638 | wait queue could very well receive a signal from another thread and/or group, |
| 639 | and that there is no indication about where the task is queued, nor how to |
| 640 | dequeue it. Thus it seems that it's up to the application itself to unbind/ |
| 641 | rebind a task. This contradicts the principle of leaving a task waiting in a |
| 642 | wait queue and waking it anywhere. |
| 643 | |
| 644 | Another possibility might be to decide that a task having a defined group but |
| 645 | a mask of zero is shared and will always be queued into its group's wait queue. |
| 646 | However, upon expiry, the scheduler would notice the thread-mask 0 and would |
| 647 | broadcast it to any group. |
| 648 | |
| 649 | Right now in the code we have: |
| 650 | - 18 calls of task_new(tid_bit) |
Willy Tarreau | 3ccb14d | 2022-06-14 11:18:40 +0200 | [diff] [blame] | 651 | - 17 calls of task_new_anywhere() |
Willy Tarreau | 2747162 | 2021-11-18 17:45:57 +0100 | [diff] [blame] | 652 | - 2 calls with a single bit |
| 653 | |
| 654 | Thus it looks like "task_new_anywhere()", "task_new_on()" and |
| 655 | "task_new_here()" would be sufficient. |