MINOR: tasks: refine the default run queue depth

Since a lot of internal callbacks were turned into tasklets, the run
queue depth has not been readjusted from its default of 200, which was
initially chosen to favor batched processing. Nowadays this value
appears too large, based on the following tests conducted on an 8c16t
machine with a simple config involving "balance leastconn" and a single
server (a sketch of such a setup is shown after the table). Threads
were always bound in pairs onto the two hardware threads of a same CPU
core, except for the 1-thread case, and the client maintained 1000
concurrent H1 connections. The number of requests per second is
reported for each (runqueue-depth, nbthread) couple:

 rq\thr|    1     2     4     8    16
 ------+------------------------------
     32|  120k  159k  276k  477k  698k
     40|  122k  160k  276k  478k  722k
     48|  121k  159k  274k  482k  720k
     64|  121k  160k  274k  469k  710k
    200|  114k  150k  247k  415k  613k  <-- default
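
For reference, a minimal sketch of the kind of setup used; addresses,
names, timeouts and maxconn are purely illustrative, not the exact test
configuration:

    global
        nbthread 16
        maxconn 2000
        # tune.runqueue-depth was varied between 32 and 200 across runs

    defaults
        mode http
        timeout connect 5s
        timeout client  30s
        timeout server  30s

    frontend fe
        bind :8080
        default_backend be

    backend be
        balance leastconn
        server srv1 127.0.0.1:8000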

It's possible to gain up to about 18% performance by lowering the
default value to 40. One possible explanation is that checking I/Os
more frequently allows buffers to be flushed faster and smooths the
I/O wait time over multiple operations, instead of alternating between
phases of processing, waiting for locks and waiting for new I/Os.
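
Users who prefer the previous batching behavior can of course restore
it explicitly in their configuration, for example (illustrative
snippet, not part of this patch):

    global
        # restore the former default of 200 to favor batched processing
        tune.runqueue-depth 200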

The average total round-trip time also fell from 1.62ms to 1.40ms, of
which at least 0.5ms is attributable to the testing tools, since this
is the minimum attainable on the loopback.

After some observation period it would be nice to backport this to 2.3
and 2.2, which show similar improvements, since some users have already
reported performance regressions between 1.6 and 2.2.
diff --git a/doc/configuration.txt b/doc/configuration.txt
index 450d831..65aabd8 100644
--- a/doc/configuration.txt
+++ b/doc/configuration.txt
@@ -2493,10 +2493,12 @@
 
 tune.runqueue-depth <number>
   Sets the maximum amount of task that can be processed at once when running
-  tasks. The default value is 200. Increasing it may incur latency when
-  dealing with I/Os, making it too small can incur extra overhead. When
-  experimenting with much larger values, it may be useful to also enable
-  tune.sched.low-latency to limit the maximum latency to the lowest possible.
+  tasks. The default value is 40, which tends to show the highest request rates
+  and lowest latencies. Increasing it may incur latency when dealing with I/Os,
+  while making it too small can incur extra overhead. When experimenting with
+  much larger values, it may be useful to also enable tune.sched.low-latency
+  and possibly tune.fd.edge-triggered to limit the maximum latency to the
+  lowest possible.
 
 tune.sched.low-latency { on | off }
   Enables ('on') or disables ('off') the low-latency task scheduler. By default
diff --git a/include/haproxy/defaults.h b/include/haproxy/defaults.h
index 6a8b03d..b023827 100644
--- a/include/haproxy/defaults.h
+++ b/include/haproxy/defaults.h
@@ -170,9 +170,19 @@
 #define MAX_POLL_EVENTS 200
 #endif
 
-// the max number of tasks to run at once
+// the max number of tasks to run at once. Tests have shown the following
+// number of requests/s for 1 to 16 threads (1c1t, 1c2t, 2c4t, 4c8t, 8c16t):
+//
+// rq\thr|    1     2     4     8    16
+// ------+------------------------------
+//     32|  120k  159k  276k  477k  698k
+//     40|  122k  160k  276k  478k  722k
+//     48|  121k  159k  274k  482k  720k
+//     64|  121k  160k  274k  469k  710k
+//    200|  114k  150k  247k  415k  613k
+//
 #ifndef RUNQUEUE_DEPTH
-#define RUNQUEUE_DEPTH 200
+#define RUNQUEUE_DEPTH 40
 #endif
 
 // cookie delimiter in "prefix" mode. This character is inserted between the