BUG/MAJOR: fd/thread: fix race between updates and closing FD

While running some L7 retries tests, Christopher and I stumbled upon a
very strange behavior showing some occasional server timeouts when the
server closes keep-alive connections quickly. The issue can be
reproduced with the following config:

    global
        expose-experimental-directives
        #tune.fd.edge-triggered on   # can speed up the issue

    defaults
        mode http
        timeout client 5s
        timeout server 10s
        timeout connect 2s

    listen f
        bind :8001
        http-reuse always
        retry-on all-retryable-errors
        server next 127.0.0.1:8002

    frontend b
        bind :8002
        timeout http-keep-alive 1  # one ms
        redirect location /

Sending fast requests without reusing the client connection on port 8001
with a single connection and at least 3 threads on haproxy occasionally
shows some glitches pauses (below with timeout server 2s):

  $ taskset -c 2,3 h1load  -e -t 1 -r 1 -c 1 http://127.0.0.1:8001/
  #     time conns tot_conn  tot_req      tot_bytes    err  cps  rps  bps   ttfb
           1     1     9794     9793         959714      0 9k79 9k79 7M67 42.94u
           2     1     9794     9793         959714      0 0.00 0.00 0.00    -
           3     1     9794     9793         959714      0 0.00 0.00 0.00    -
           4     0    16015    16015        1569470      0 6k22 6k22 4M87 522.9u
           5     0    18657    18656        1828190      2 2k63 2k63 2M06 39.22u

If this doesn't happen, limiting to a request rate close to 1/timeout
may help.

What is happening is that after several migrations, a late report
via fd_update_events() may detect that the thread is not welcome, and
will want to program an update so that the current thread's poller
disables its polling on it. It is allowed to do so because it used
fd_grab_tgid(). But what if _fd_delete_orphan() was just starting to
be called and already reset the update_mask ? We'll end up with a bit
present in the update mask, then _fd_delete_orphan() resets the tgid,
which will prevent the poller from consuming that update. The update
is not needed anymore since the FD was closed, but in this case nobody
will clear this bit until the same FD is reused again and cleared. And
as long as the thread's bit remains in the update_mask, no new updates
will be programmed for the next use of this FD on the same thread since
due to the bit being present, fd_nbupdt will not be changed. This is
what is causing this timeout.

The fix consists in making sure _fd_delete_orphan() waits for the
occasional watchers to leave, and to do this before clearing the
update_mask. This will be either fd_update_events() trying to check
its thread_mask, or the poller checking its updates, so that's pretty
short. But it definitely closes this race.

This fix is needed since the introduction of fd_grab_tgid(), hence 2.7.

Note that while testing the fix, another related issue concerning the
atomicity of running_mask vs thread_mask popped up and will have to be
fixed till 2.5 as part of another patch. It may make the tests for this
fix occasionally tigger a few BUG_ON() or face a null conn->subs in
sock_conn_iocb(), though these ones are much more difficult to trigger.
This is not caused by this fix.
1 file changed