MINOR: task: limit the number of subsequent heavy tasks with flag TASK_HEAVY

While the scheduler is priority-aware and class-aware, and consistently
tries to maintain fairness between all classes, it doesn't make use of a
fine execution budget to compensate for high-latency tasks such as TLS
handshakes. This can result in several consecutive heavy calls adding
multiple milliseconds of latency between the various steps of other
tasklets that do not even depend on them.

An ideal solution would be to add a 4th queue, have all tasks announce
their estimated cost upfront and let the scheduler maintain an auto-
refilling budget to pick from the most suitable queue.

But it turns out that a very simplified version of this already provides
impressive gains with very tiny changes and could easily be backported.
The principle is to reserve a new task flag "TASK_HEAVY" that indicates
that a task is expected to take a lot of time without yielding (e.g. an
SSL handshake typically takes 700 microseconds of crypto computation).
When the scheduler sees this flag while queuing a tasklet, it will
place it into the bulk queue. And during dequeuing, only one of these
is accepted per full round. This means that the first one will be
accepted and will not prevent other lower-priority tasks from running,
but if a new one arrives, the round stops there and control goes back
to polling. This allows more important updates for other tasks to be
collected and batched before the next call of a heavy task.

Preliminary tests consisting in placing this flag on the SSL handshake
tasklet show that response times under SSL stress fell from 14 ms
before the patch to 3.0 ms with the patch, and even 1.8 ms if
tune.sched.low-latency is set to "on".

(cherry picked from commit 74dea8caeadfaf3cac262fd4f2c532788c396fb5)
[wt: tasklet_wakeup_on() is in task.h in 2.3]
Signed-off-by: Willy Tarreau <w@1wt.eu>
diff --git a/src/task.c b/src/task.c
index a2b78e8..8689da8 100644
--- a/src/task.c
+++ b/src/task.c
@@ -403,6 +403,7 @@
 	unsigned int done = 0;
 	unsigned int queue;
 	unsigned short state;
+	char heavy_calls = 0;
 	void *ctx;
 
 	for (queue = 0; queue < TL_CLASSES;) {
@@ -449,7 +450,20 @@
 
 		budgets[queue]--;
 		t = (struct task *)LIST_ELEM(tl_queues[queue].n, struct tasklet *, list);
-		state = t->state & (TASK_SHARED_WQ|TASK_SELF_WAKING|TASK_KILLED);
+		state = t->state & (TASK_SHARED_WQ|TASK_SELF_WAKING|TASK_HEAVY|TASK_KILLED);
+
+		if (state & TASK_HEAVY) {
+			/* This is a heavy task. We'll call no more than one
+			 * per function call. If we called one already, we'll
+			 * return and announce the max possible weight so that
+			 * the caller doesn't come back too soon.
+			 */
+			if (heavy_calls) {
+				done = INT_MAX;  // 11ms instead of 3 without this
+				break; // too many heavy tasks processed already
+			}
+			heavy_calls = 1;
+		}
 
 		ti->flags &= ~TI_FL_STUCK; // this thread is still running
 		activity[tid].ctxsw++;