Diff - f675f094636577421e74034e439c0d7f56368852^! - haproxy

commit	f675f094636577421e74034e439c0d7f56368852
author	Willy Tarreau <w@1wt.eu>	Thu Jan 20 18:42:16 2022 +0100
committer	Christopher Faulet <cfaulet@haproxy.com>	Mon Mar 14 15:04:32 2022 +0100
tree	48801f0f6309e223ee9d19b98a0b61dab60ecb40
parent	10b09724a53c1e307fbe58d7e82059515171c08a

BUG/MINOR: stream: make the call_rate only count the no-progress calls

We have an anti-looping protection in process_stream() that detects bugs
that used to affect a few filters like compression in the past which
sometimes forgot to handle a read0 or a particular error, leaving a
thread looping at 100% CPU forever. When such a condition is detected,
an alert is emitted and the process is killed so that it can be replaced
by a sane one:

  [ALERT]    (19061) : A bogus STREAM [0x274abe0] is spinning at 2057156
             calls per second and refuses to die, aborting now! Please
             report this error to developers [strm=0x274abe0,3 src=unix
             fe=MASTER be=MASTER dst=<MCLI> txn=(nil),0 txn.req=-,0
             txn.rsp=-,0 rqf=c02000 rqa=10000 rpf=88000021 rpa=8000000
             sif=EST,40008 sib=DIS,84018 af=(nil),0 csf=0x274ab90,8600
             ab=0x272fd40,1 csb=(nil),0
             cof=0x25d5d80,1300:PASS(0x274aaf0)/RAW((nil))/unix_stream(9)
             cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0) filters={}]
    call trace(11):
    |       0x4dbaab [c7 04 25 01 00 00 00 00]: stream_dump_and_crash+0x17b/0x1b4
    |       0x4df31f [e9 bd c8 ff ff 49 83 7c]: process_stream+0x382f/0x53a3
    (...)

One problem with this detection is that it used to only count the call
rate because we weren't sure how to make it more accurate, but the
threshold was high enough to prevent accidental false positives.

There is actually one case that manages to trigger it, which is when
sending huge amounts of requests pipelined on the master CLI. Some
short requests such as "show version" are sufficient to be handled
extremely fast and to cause a wake up of an analyser to parse the
next request, then an applet to handle it, back and forth. But this
condition is not an error, since some data are being forwarded by
the stream, and it's easy to detect it.

This patch modifies the detection so that update_freq_ctr() is only
called when neither CF_READ_PARTIAL nor CF_WRITE_PARTIAL is set on
either channel, which really indicates that nothing is happening at
all.
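
The gating described above can be sketched in isolation. This is only a
simplified model, not the code from src/stream.c: the demo_* names, the
DEMO_* flag values and the plain counter are hypothetical stand-ins for
the channel flags and for struct freq_ctr, and the per-second window
handled by update_freq_ctr() is deliberately left out.

```c
#include <stddef.h>

/* Hypothetical stand-ins for CF_READ_PARTIAL / CF_WRITE_PARTIAL.
 * The values are illustrative only, not the ones used by HAProxy.
 */
#define DEMO_READ_PARTIAL   0x1u
#define DEMO_WRITE_PARTIAL  0x2u

/* Same order of magnitude as the threshold in the patch, but applied
 * here to a plain counter instead of a per-second rate (assumption).
 */
#define DEMO_THRESHOLD      100000u

/* Simplified model of a stream: two channel flag words and a counter
 * of calls that made no progress. Not HAProxy's struct stream.
 */
struct demo_stream {
	unsigned int req_flags;
	unsigned int res_flags;
	unsigned int no_progress_calls;
};

/* Accounts for one call of the stream handler. The call is counted
 * only when neither channel reports a partial read or write, i.e.
 * when nothing was transferred. Returns 1 if counted, 0 otherwise.
 */
static int demo_account_call(struct demo_stream *s)
{
	if ((s->req_flags | s->res_flags) &
	    (DEMO_READ_PARTIAL | DEMO_WRITE_PARTIAL))
		return 0;

	s->no_progress_calls++;
	return 1;
}

/* Reports whether enough no-progress calls were seen to consider the
 * stream as spinning in this simplified model.
 */
static int demo_is_spinning(const struct demo_stream *s)
{
	return s->no_progress_calls >= DEMO_THRESHOLD;
}
```

In this model, a stream whose channels keep reporting partial reads or
writes never reaches the threshold no matter how often it is called,
which corresponds to the pipelined master CLI case, while a stream
called repeatedly with no flag set is eventually flagged.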

This is largely sufficient and extremely effective, as the call above
is still caught (shutr being ignored by an analyser) while a loop on
the master CLI now has no effect. The "call_rate" field in the detailed
"show sess" output will now be much lower, except for bogus streams,
which may help spot them. This field is only there for developers
anyway so it's pretty fine to slightly adjust its meaning.

This patch could be backported to stable versions in case of reports
of such an issue, but as that's unlikely, it's not really needed.

(cherry picked from commit 6c539c4b8c24718177e7ff38af0d186ec84608ea)
[wt: bringing this to 2.5 only for now so that power users have at
 least a stable branch to upgrade to if they face this issue]
Signed-off-by: Willy Tarreau <w@1wt.eu>
(cherry picked from commit 8c48c69c0d5f0a2a8bb11a6074e640b6f38e82a7)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

diff --git a/include/haproxy/stream-t.h b/include/haproxy/stream-t.h
index ff9429a..d31ef6a 100644
--- a/include/haproxy/stream-t.h
+++ b/include/haproxy/stream-t.h

@@ -145,7 +145,7 @@
 	struct list back_refs;          /* list of users tracking this stream */
 	struct buffer_wait buffer_wait; /* position in the list of objects waiting for a buffer */
 
-	struct freq_ctr call_rate;      /* stream task call rate */
+	struct freq_ctr call_rate;      /* stream task call rate without making progress */
 
 	short store_count;
 	enum obj_type obj_type;         /* object type == OBJ_TYPE_STREAM */

diff --git a/src/stream.c b/src/stream.c
index ffed2b7..d5b4c34 100644
--- a/src/stream.c
+++ b/src/stream.c

@@ -1624,9 +1624,14 @@
 	si_sync_recv(si_f);
 	si_sync_recv(si_b);
 
-	rate = update_freq_ctr(&s->call_rate, 1);
-	if (rate >= 100000 && s->call_rate.prev_ctr) { // make sure to wait at least a full second
-		stream_dump_and_crash(&s->obj_type, read_freq_ctr(&s->call_rate));
+	/* Let's check if we're looping without making any progress, e.g. due
+	 * to a bogus analyser or the fact that we're ignoring a read0. The
+	 * call_rate counter only counts calls with no progress made.
+	 */
+	if (!((req->flags | res->flags) & (CF_READ_PARTIAL|CF_WRITE_PARTIAL))) {
+		rate = update_freq_ctr(&s->call_rate, 1);
+		if (rate >= 100000 && s->call_rate.prev_ctr) // make sure to wait at least a full second
+			stream_dump_and_crash(&s->obj_type, read_freq_ctr(&s->call_rate));
 	}
 
 	/* this data may be no longer valid, clear it */