2021-07-30 - File descriptor migration between threads

An FD migration may happen on any idle connection that experiences a takeover()
operation by another thread. In this case the acting thread becomes the owner
of the connection (and FD) while the previous one(s) need to forget about it.

File descriptor migration between threads is a fairly complex operation because
it is required to maintain a durable consistency between the pollers' states
and haproxy's desired state. Indeed, very often the FD is registered within one
thread's poller and that thread might be waiting in the system, so there is no
way to synchronously update it. This is where thread_mask, polled_mask and per
thread updates are used:

  - a thread knows if it's allowed to manipulate an FD by looking at its bit in
    the FD's thread_mask ;

  - each thread knows if it was polling an FD by looking at its bit in the
    polled_mask field ; a recent migration is usually indicated by a bit being
    present in polled_mask and absent from thread_mask.

  - other threads know whether it's safe to take over an FD by looking at the
    running mask: if it contains any other thread's bit, then other threads are
    using it and it's not safe to take it over.

  - sleeping threads are notified about the need to update their polling via
    local or global updates to the FD. Each thread has its own local update
    list and its own bit in the update_mask to know whether there are pending
    updates for it. This allows polling to reconverge with the desired state
    at the last instant before polling.
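
The role of each mask listed above can be summarized with a few predicates. The
following is only a minimal sketch using C11 atomics: the struct layout, the
mask width and the helper names (fd_may_use(), fd_recently_migrated(),
fd_takeover_looks_safe()) are assumptions made for illustration and are not
taken from the haproxy sources.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified per-FD state: one bit per thread in each mask.
 * Field names follow the text above; the layout itself is an assumption.
 */
struct fd_state {
    _Atomic unsigned long thread_mask;  /* threads allowed to manipulate the FD */
    _Atomic unsigned long polled_mask;  /* threads whose poller knows the FD */
    _Atomic unsigned long running_mask; /* threads currently using the FD */
    _Atomic unsigned long update_mask;  /* threads with a pending poll update */
};

/* Is the thread designated by <tid_bit> allowed to manipulate this FD? */
static bool fd_may_use(struct fd_state *fd, unsigned long tid_bit)
{
    return (atomic_load(&fd->thread_mask) & tid_bit) != 0;
}

/* Still polled by this thread but no longer owned by it: this usually
 * indicates a recent migration.
 */
static bool fd_recently_migrated(struct fd_state *fd, unsigned long tid_bit)
{
    return (atomic_load(&fd->polled_mask) & tid_bit) != 0 &&
           (atomic_load(&fd->thread_mask) & tid_bit) == 0;
}

/* No other thread's bit in the running mask: a takeover may be attempted. */
static bool fd_takeover_looks_safe(struct fd_state *fd, unsigned long tid_bit)
{
    return (atomic_load(&fd->running_mask) & ~tid_bit) == 0;
}
```

These checks are only hints on their own: the result may be outdated as soon as
it is returned, which is why the sequences described in this document rely on
atomic read-modify-write operations and on the idle connections lock.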

While the description above could be seen as "progressive" (it technically is)
in that there is always a transition and convergence period in a migrated FD's
life, functionally speaking it's perfectly atomic thanks to the running bit and
to the per-thread idle connections lock: no takeover is permitted without
holding the idle_conns lock, and takeover may only happen by atomically picking
a connection from the list that is also protected by this lock. In practice, an
FD is never taken over by itself, but always in the context of a connection,
and by atomically removing a connection from an idle list, it is possible to
guarantee that a connection will not be picked, hence that its FD will not be
taken over.

same thread as list!

The possible entry points to a race to use a file descriptor are the following
ones, with their respective sequences:

 1) takeover: requested by conn_backend_get() on behalf of connect_server()
     - take the idle_conns_lock, protecting against a parallel access from the
       I/O tasklet or timeout task
     - pick the first connection from the list
     - attempt an fd_takeover() on this connection's fd. Usually it works,
       unless a late wakeup of the owning thread shows up in the FD's running
       mask. The operation is performed in fd_takeover() using a DWCAS which
       tries to switch both running and thread_mask to the caller's tid_bit. A
       concurrent bit in running is enough to make it fail. This guarantees
       another thread does not wake up from I/O in the middle of the takeover.
       In case of conflict, this FD is skipped and the attempt is tried again
       with the next connection.
     - reset the task/tasklet contexts to NULL, as a signal that they are not
       allowed to run anymore. The tasks retrieve their execution context from
       the scheduler in the arguments, but will check the tasks' context from
       the structure under the lock to detect this possible change, and abort.
     - at this point the takeover succeeded, the idle_conns_lock is released
       and the connection and its FD are now owned by the caller
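
The takeover sequence above may be sketched as follows. This is an illustration
under stated assumptions, not the haproxy implementation: instead of a real
DWCAS over two words, both masks are packed into a single 64-bit word (32 bits
each) so that a regular CAS provides the same all-or-nothing update. The idle
list is a simple singly linked list, a spinlock built on atomic_flag stands in
for the idle_conns_lock, and all struct and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: running_mask is in the low 32 bits and thread_mask in the high
 * 32 bits of one word, so that a plain 64-bit CAS stands in for the DWCAS.
 */
struct conn {
    _Atomic uint64_t masks;   /* [thread_mask:32][running_mask:32] */
    void *tasklet_ctx;        /* reset to NULL once taken over */
    struct conn *next;        /* next connection in the idle list */
};

struct idle_list {
    atomic_flag lock;         /* stand-in for the idle_conns_lock */
    struct conn *head;
};

static void idle_lock(struct idle_list *l)
{
    while (atomic_flag_test_and_set(&l->lock))
        continue;
}

static void idle_unlock(struct idle_list *l)
{
    atomic_flag_clear(&l->lock);
}

/* Try to switch both masks to <tid_bit>. Fails if any other thread's bit is
 * present in the running part, or if the word changed under us.
 */
static bool fd_takeover_sketch(struct conn *c, uint32_t tid_bit)
{
    uint64_t old = atomic_load(&c->masks);
    uint64_t wanted = ((uint64_t)tid_bit << 32) | tid_bit;

    if ((uint32_t)old & ~tid_bit)
        return false;
    return atomic_compare_exchange_strong(&c->masks, &old, wanted);
}

/* Pick the first idle connection whose FD can be taken over, skipping the
 * ones in conflict. Returns NULL if none could be taken.
 */
static struct conn *takeover_first_idle(struct idle_list *l, uint32_t tid_bit)
{
    struct conn **prev, *c;

    idle_lock(l);
    for (prev = &l->head; (c = *prev) != NULL; prev = &c->next) {
        if (!fd_takeover_sketch(c, tid_bit))
            continue;           /* conflict: try the next connection */
        *prev = c->next;        /* unlink from the idle list */
        c->next = NULL;
        c->tasklet_ctx = NULL;  /* old tasks must not run anymore */
        break;
    }
    idle_unlock(l);
    return c;
}
```

In this sketch the context is reset while the lock is still held, which is what
lets a late task or tasklet detect the takeover when it checks its context
under the same lock.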

 2) poll report: happens on late rx, shutdown or error on idle conns
     - fd_set_running() is called to atomically set the running_mask and check
       that the caller's tid_bit is still present in the thread_mask. Upon
       failure the caller arranges to stop reporting that FD (e.g. by
       immediate removal or by an asynchronous update). Upon success, it's
       guaranteed that any concurrent fd_takeover() will fail the DWCAS and
       that another connection will need to be picked instead.
     - FD's state is possibly updated
     - the iocb is called if needed (almost always)
     - if the iocb didn't kill the connection, release the bit from
       running_mask, making the connection possibly available to a subsequent
       fd_takeover().
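
A possible shape for the "set running while checking thread_mask" step is shown
below. As an assumption for illustration, both masks are packed into one 64-bit
word (running in the low half, thread in the high half) so that a single CAS
covers them; the function names carry a _sketch suffix to make it clear they
are not the real fd_set_running().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout of <masks>: [thread_mask:32][running_mask:32] */

/* Set the caller's running bit only if its bit is still in thread_mask.
 * Returns false when the FD has migrated, in which case the caller must
 * arrange to stop reporting this FD.
 */
static bool fd_set_running_sketch(_Atomic uint64_t *masks, uint32_t tid_bit)
{
    uint64_t old = atomic_load(masks);

    do {
        if (!((uint32_t)(old >> 32) & tid_bit))
            return false;
        /* on CAS failure <old> is refreshed and the check is redone */
    } while (!atomic_compare_exchange_weak(masks, &old, old | tid_bit));
    return true;
}

/* Release the running bit once the iocb has returned and the connection
 * was not killed.
 */
static void fd_clr_running_sketch(_Atomic uint64_t *masks, uint32_t tid_bit)
{
    atomic_fetch_and(masks, ~(uint64_t)tid_bit);
}
```

While the running bit is held, a takeover attempt that requires the running
part to be free of other threads' bits cannot succeed, which matches the
guarantee described in the first step.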

 3) I/O tasklet, timeout task: timeout or subscribed wakeup
     - start by taking the idle_conns_lock, ensuring no takeover() will pick the
       same connection from this point.
     - check the task/tasklet's context to verify that no recently completed
       takeover() stole the connection. If it's NULL, the connection was lost,
       the lock is released and the task/tasklet killed. Otherwise it is
       guaranteed that no other thread may use that connection (current
       takeover candidates are waiting on the lock, previous owners waking from
       poll() lost their bit in the thread_mask and will not touch the FD).
     - the connection is removed from the idle conns list. From this point on,
       no other thread will find it there or try fd_takeover() on it.
     - the idle_conns_lock is now released, the connection is protected and its
       FD is not reachable by other threads anymore.
     - the task does what it has to do
     - if the connection is still usable (i.e. not upon timeout), it's inserted
       again into the idle conns list, meaning it may instantly be taken over
       by a competing thread.
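
The claim/release pattern followed by the I/O tasklet and the timeout task
could look like the sketch below. All types and helpers here are hypothetical
and heavily simplified: a singly linked list replaces the idle conns list and
an atomic_flag spinlock replaces the idle_conns_lock.

```c
#include <stdatomic.h>
#include <stddef.h>

struct conn {
    struct conn *next;         /* next connection in the idle list */
};

struct tasklet {
    void *context;             /* the connection, or NULL after a takeover */
};

struct idle_list {
    atomic_flag lock;          /* stand-in for the idle_conns_lock */
    struct conn *head;
};

static void idle_lock(struct idle_list *l)
{
    while (atomic_flag_test_and_set(&l->lock))
        continue;
}

static void idle_unlock(struct idle_list *l)
{
    atomic_flag_clear(&l->lock);
}

/* Unlink <c> from the list if it is there. The lock must be held. */
static void idle_remove(struct idle_list *l, struct conn *c)
{
    struct conn **prev;

    for (prev = &l->head; *prev; prev = &(*prev)->next) {
        if (*prev == c) {
            *prev = c->next;
            c->next = NULL;
            return;
        }
    }
}

/* Returns the connection, now unreachable by other threads, or NULL if a
 * takeover already stole it, in which case the tasklet must be killed.
 */
static struct conn *tasklet_claim_conn(struct idle_list *l, struct tasklet *t)
{
    struct conn *c;

    idle_lock(l);
    c = t->context;            /* NULL means the connection was lost */
    if (c)
        idle_remove(l, c);
    idle_unlock(l);
    return c;
}

/* Put a still usable connection back; it may be taken over right away. */
static void tasklet_release_conn(struct idle_list *l, struct conn *c)
{
    idle_lock(l);
    c->next = l->head;
    l->head = c;
    idle_unlock(l);
}
```

The context check and the removal happen under the same lock acquisition, so no
takeover can slip in between the two.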

 4) wake() callback: happens on last user after xfers (may free() the conn)
     - the connection is still owned by the caller, it's still subscribed to
       polling but the connection is idle thus inactive. Errors or shutdowns
       may be reported late, via sock_conn_iocb() and conn_notify_mux(), thus
       the running bit is set (i.e. a concurrent fd_takeover() will fail).
     - if the connection is in the list, the idle_conns_lock is grabbed, the
       connection is removed from the list, and the lock is released.
     - mux->wake() is called
     - if the connection previously was in the list, it's reinserted under the
       idle_conns_lock.
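
The remove / wake / reinsert sequence may be pictured as below. This is a
sketch with hypothetical names only. Two simplifications are assumed: the lock
is always taken to test list membership, and the wake callback is assumed to
return a negative value when it destroyed the connection, in which case nothing
is reinserted.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct conn {
    struct conn *next;         /* next connection in the idle list */
    int wake_calls;            /* only used by sample_wake() below */
};

struct idle_list {
    atomic_flag lock;          /* stand-in for the idle_conns_lock */
    struct conn *head;
};

static void idle_lock(struct idle_list *l)
{
    while (atomic_flag_test_and_set(&l->lock))
        continue;
}

static void idle_unlock(struct idle_list *l)
{
    atomic_flag_clear(&l->lock);
}

/* Unlink <c> if present (lock held). Returns true if it was in the list. */
static bool idle_remove(struct idle_list *l, struct conn *c)
{
    struct conn **prev;

    for (prev = &l->head; *prev; prev = &(*prev)->next) {
        if (*prev == c) {
            *prev = c->next;
            c->next = NULL;
            return true;
        }
    }
    return false;
}

/* Example callback: counts calls and reports the connection as still alive
 * (assumed convention: a negative return means the connection was destroyed).
 */
static int sample_wake(struct conn *c)
{
    c->wake_calls++;
    return 0;
}

static int wake_idle_conn(struct idle_list *l, struct conn *c,
                          int (*wake)(struct conn *))
{
    bool was_idle;
    int ret;

    idle_lock(l);
    was_idle = idle_remove(l, c);   /* hide it from takeover candidates */
    idle_unlock(l);

    ret = wake(c);                  /* runs without the lock */

    if (ret >= 0 && was_idle) {
        idle_lock(l);
        c->next = l->head;          /* visible to other threads again */
        l->head = c;
        idle_unlock(l);
    }
    return ret;
}
```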


With the DWCAS removal between running_mask & thread_mask:

fd_takeover:
  1  if (!CAS(&running_mask, 0, tid_bit))
  2      return fail;
  3  atomic_store(&thread_mask, tid_bit);
  4  atomic_and(&running_mask, ~tid_bit);

poller:
  1  do {
  2      /* read consistent running_mask & thread_mask */
  3      do {
  4          run = atomic_load(&running_mask);
  5          thr = atomic_load(&thread_mask);
  6      } while (run & ~thr);
  7
  8      if (!(thr & tid_bit)) {
  9          /* takeover has started */
 10          goto disable_fd;
 11      }
 12  } while (!CAS(&running_mask, run, run | tid_bit));

fd_delete:
  1  atomic_or(&running_mask, tid_bit);
  2  atomic_store(&thread_mask, 0);
  3  atomic_and(&running_mask, ~tid_bit);

The loop in poller:3-6 is used to make sure the thread_mask we read matches
the last updated running_mask. If nobody can give up on fd_takeover(), it
might even be possible to spin on thread_mask only. Late pollers will not
set running anymore with this.
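
For reference, the three sequences above translate fairly directly to C11
atomics. The version below is a sketch only: the struct, the _nodwcas function
names and the boolean return values (in place of "return fail" and "goto
disable_fd") are assumptions, and the default sequentially consistent ordering
is used everywhere for simplicity.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct fd_masks {
    _Atomic unsigned long running_mask;
    _Atomic unsigned long thread_mask;
};

/* fd_takeover: only succeeds if no thread at all is running on the FD */
static bool fd_takeover_nodwcas(struct fd_masks *fd, unsigned long tid_bit)
{
    unsigned long expected = 0;

    if (!atomic_compare_exchange_strong(&fd->running_mask, &expected, tid_bit))
        return false;
    atomic_store(&fd->thread_mask, tid_bit);
    atomic_fetch_and(&fd->running_mask, ~tid_bit);
    return true;
}

/* poller: returns true with the running bit set, or false if the FD no
 * longer belongs to the caller and must be disabled in its poller.
 */
static bool fd_poller_set_running(struct fd_masks *fd, unsigned long tid_bit)
{
    unsigned long run, thr;

    do {
        /* read consistent running_mask & thread_mask */
        do {
            run = atomic_load(&fd->running_mask);
            thr = atomic_load(&fd->thread_mask);
        } while (run & ~thr);

        if (!(thr & tid_bit))
            return false;   /* takeover has started */
    } while (!atomic_compare_exchange_strong(&fd->running_mask, &run,
                                             run | tid_bit));
    return true;
}

/* fd_delete: clear thread_mask while holding the caller's running bit */
static void fd_delete_nodwcas(struct fd_masks *fd, unsigned long tid_bit)
{
    atomic_fetch_or(&fd->running_mask, tid_bit);
    atomic_store(&fd->thread_mask, 0);
    atomic_fetch_and(&fd->running_mask, ~tid_bit);
}
```

Note that the inner loop spins for as long as running_mask contains a bit that
is absent from thread_mask, so it relies on the takeover or delete in progress
completing promptly.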