7f3f29ecd1907b111c81eb2d0275adef8b37c44f - haproxy

commit	7f3f29ecd1907b111c81eb2d0275adef8b37c44f	[log] [tgz]
author	Willy Tarreau <w@1wt.eu>	Thu Nov 17 12:05:13 2016 +0100
committer	Willy Tarreau <w@1wt.eu>	Sat Dec 24 19:52:54 2016 +0100
tree	865eb623f69b7e0fd144de69d3d2534eebee2f3e
parent	496ef8603e856b01df671421c9760c0b2aaebbbb [diff]

BUG/MEDIUM: connection: check the control layer before stopping polling

The bug described in commit 568743a ("BUG/MEDIUM: stream-int: completely
detach connection on connect error") was not a stream-interface layer bug
but a connection layer bug. There was exactly one place in the code where
we could change a file descriptor's status without first checking whether
it is valid or not, it was in conn_stop_polling(). This one is called when
the polling status is changed after an update, and calls fd_stop_both even
if we had already closed the file descriptor :

1479388298.484240 ->->->->->   conn_fd_handler > conn_cond_update_polling
1479388298.484240 ->->->->->->   conn_cond_update_polling > conn_stop_polling
1479388298.484241 ->->->->->->->   conn_stop_polling > conn_ctrl_ready
1479388298.484241                  conn_stop_polling < conn_ctrl_ready
1479388298.484241 ->->->->->->->   conn_stop_polling > fd_stop_both
1479388298.484242 ->->->->->->->->   fd_stop_both > fd_update_cache
1479388298.484242 ->->->->->->->->->   fd_update_cache > fd_release_cache_entry
1479388298.484242                      fd_update_cache < fd_release_cache_entry
1479388298.484243                    fd_stop_both < fd_update_cache
1479388298.484243                  conn_stop_polling < fd_stop_both
1479388298.484243                conn_cond_update_polling < conn_stop_polling
1479388298.484243              conn_fd_handler < conn_cond_update_polling

The problem with the previous fix above is that it break the http_proxy mode
and possibly even some Lua parts and peers to a certain extent ; all outgoing
connections where the target address is initially copied into the outgoing
connection which experience a retry would use a random outgoing address after
the retry because closing and detaching the connection causes the target
address to be lost. This was attempted to be addressed by commit 0857d7a
("BUG/MAJOR: stream: properly mark the server address as unset on connect
retry") but it used to only solve the most visible effect and not the root
cause.

Prior to this fix, it was possible to cause this config to keep CLOSE_WAIT
for as long as it takes to expire a client or server timeout (note the
missing client timeout) :

   listen test
        mode http
        bind :8002
        server s1 127.0.0.1:8001

   $ tcploop 8001 L0 W N20 A R P100 S:"HTTP/1.1 200 OK\r\nContent-length: 0\r\n\r\n" &
   $ tcploop 8002 N200 C T W S:"GET / HTTP/1.0\r\n\r\n" O P10000 K

With this patch, these CLOSE_WAIT properly vanish when both processes leave.

This commit reverts the two fixes above and replaces them with the proper
fix in connection.h. It must be backported to 1.6 and 1.5. Thanks to
Robson Roberto Souza Peixoto for providing very detailed traces showing
some obvious inconsistencies leading to finding this bug.
(cherry picked from commit 350135cf49023a8f3f5b12b406a3dd7b9b1579bc)
(cherry picked from commit d8777bd263bcff3c04d799de4755e24cee46ce81)
[wt: the code is in session.c in 1.5]

2 files changed

tree: 865eb623f69b7e0fd144de69d3d2534eebee2f3e