95fad5ba4b2c07ab1ac610d119380ffd7709f0e6 - haproxy

commit	95fad5ba4b2c07ab1ac610d119380ffd7709f0e6	[log] [tgz]
author	Bin Wang <feisuzhu@163.com>	Fri Sep 15 14:56:40 2017 +0800
committer	Willy Tarreau <w@1wt.eu>	Thu Oct 05 11:20:16 2017 +0200
tree	7f5e06a121ae064a869ab035bcbe2cfd025af238
parent	a258479e3fe17fc525d3c82b23e26e311453fd56 [diff]

BUG/MAJOR: stream-int: don't re-arm recv if send fails

When
    1) HAProxy configured to enable splice on both directions
    2) After some high load, there are 2 input channels with their socket buffer
       being non-empty and pipe being full at the same time, sitting in `fd_cache`
       without any other fds.

The 2 channels will repeatedly be stopped for receiving (pipe full) and waken
for receiving (data in socket), thus getting out and in of `fd_cache`, making
their fd swapping location in `fd_cache`.

There is a `if (entry < fd_cache_num && fd_cache[entry] != fd) continue;`
statement in `fd_process_cached_events` to prevent frequent polling, but since
the only 2 fds are constantly swapping location, `fd_cache[entry] != fd` will
always hold true, thus HAProxy can't make any progress.

The root cause of the issue is dual :
  - there is a single fd_cache, for next events and for the ones being
    processed, while using two distinct arrays would avoid the problem.

  - the write side of the stream interface wakes the read side up even
    when it couldn't write, and this one really is a bug.

Due to CF_WRITE_PARTIAL not being cleared during fast forwarding, a failed
send() attempt will still cause ->chk_rcv() to be called on the other side,
re-creating an entry for its connection fd in the cache, causing the same
sequence to be repeated indefinitely without any opportunity to make progress.

CF_WRITE_PARTIAL used to be used for what is present in these tests : check
if a recent write operation was performed. It's part of the CF_WRITE_ACTIVITY
set and is tested to check if timeouts need to be updated. It's also used to
detect if a failed connect() may be retried.

What this patch does is use CF_WROTE_DATA() to check for a successful write
for connection retransmits, and to clear CF_WRITE_PARTIAL before preparing
to send in stream_int_notify(). This way, timeouts are still updated each
time a write succeeds, but chk_rcv() won't be called anymore after a failed
write.

It seems the fix is required all the way down to 1.5.

Without this patch, the only workaround at this point is to disable splicing
in at least one direction. Strictly speaking, splicing is not absolutely
required, as regular forwarding could theorically cause the issue to happen
if the timing is appropriate, but in practice it appears impossible to
reproduce it without splicing, and even with splicing it may vary.

The following config manages to reproduce it after a few attempts (haproxy
going 100% CPU and having to be killed) :

  global
      maxpipes 50000
      maxconn 10000

  listen srv1
      option splice-request
      option splice-response
      bind :8001
      server s1 127.0.0.1:8002

  server$ tcploop 8002 L N20 A R10 S1000000 R10 S1000000 R10 S1000000 R10 S1000000 R10 S1000000
  client$ tcploop 8001 N20 C T S1000000 R10 J

2 files changed

tree: 7f5e06a121ae064a869ab035bcbe2cfd025af238