BUG/MAJOR: fd/thread: fix race between updates and closing FD
While running some L7 retries tests, Christopher and I stumbled upon a
very strange behavior showing some occasional server timeouts when the
server closes keep-alive connections quickly. The issue can be
reproduced with the following config:
global
expose-experimental-directives
#tune.fd.edge-triggered on # can speed up the issue
defaults
mode http
timeout client 5s
timeout server 10s
timeout connect 2s
listen f
bind :8001
http-reuse always
retry-on all-retryable-errors
server next 127.0.0.1:8002
frontend b
bind :8002
timeout http-keep-alive 1 # one ms
redirect location /
Sending fast requests without reusing the client connection on port 8001
with a single connection and at least 3 threads on haproxy occasionally
shows some glitches pauses (below with timeout server 2s):
$ taskset -c 2,3 h1load -e -t 1 -r 1 -c 1 http://127.0.0.1:8001/
# time conns tot_conn tot_req tot_bytes err cps rps bps ttfb
1 1 9794 9793 959714 0 9k79 9k79 7M67 42.94u
2 1 9794 9793 959714 0 0.00 0.00 0.00 -
3 1 9794 9793 959714 0 0.00 0.00 0.00 -
4 0 16015 16015 1569470 0 6k22 6k22 4M87 522.9u
5 0 18657 18656 1828190 2 2k63 2k63 2M06 39.22u
If this doesn't happen, limiting to a request rate close to 1/timeout
may help.
What is happening is that after several migrations, a late report
via fd_update_events() may detect that the thread is not welcome, and
will want to program an update so that the current thread's poller
disables its polling on it. It is allowed to do so because it used
fd_grab_tgid(). But what if _fd_delete_orphan() was just starting to
be called and already reset the update_mask ? We'll end up with a bit
present in the update mask, then _fd_delete_orphan() resets the tgid,
which will prevent the poller from consuming that update. The update
is not needed anymore since the FD was closed, but in this case nobody
will clear this bit until the same FD is reused again and cleared. And
as long as the thread's bit remains in the update_mask, no new updates
will be programmed for the next use of this FD on the same thread since
due to the bit being present, fd_nbupdt will not be changed. This is
what is causing this timeout.
The fix consists in making sure _fd_delete_orphan() waits for the
occasional watchers to leave, and to do this before clearing the
update_mask. This will be either fd_update_events() trying to check
its thread_mask, or the poller checking its updates, so that's pretty
short. But it definitely closes this race.
This fix is needed since the introduction of fd_grab_tgid(), hence 2.7.
Note that while testing the fix, another related issue concerning the
atomicity of running_mask vs thread_mask popped up and will have to be
fixed till 2.5 as part of another patch. It may make the tests for this
fix occasionally tigger a few BUG_ON() or face a null conn->subs in
sock_conn_iocb(), though these ones are much more difficult to trigger.
This is not caused by this fix.
1 file changed