OPTIM: task: automatically adjust the default runqueue-depth to the threads

The recent reduction of the default runqueue size appeared to have
significantly lowered performance on low thread-count configs. Testing
various runqueue values on different workloads under thread counts
ranging from 1 to 64 showed that lower values are more optimal for high
thread counts and conversely. It could even be observed that the optimal
value for various workloads sits around 280/sqrt(nbthread), and probably
has to do with both the L3 cache usage and how to optimally interlace
the threads' activity to minimize contention. This is much easier to
configure automatically, so let's do this by default now.

(cherry picked from commit 060a7612487c244175fa1dc1e5b224015cbcf503)
Signed-off-by: Willy Tarreau <w@1wt.eu>
diff --git a/src/haproxy.c b/src/haproxy.c
index 9dfdd22..5da563b 100644
--- a/src/haproxy.c
+++ b/src/haproxy.c
@@ -2275,8 +2275,14 @@
 	if (global.tune.maxpollevents <= 0)
 		global.tune.maxpollevents = MAX_POLL_EVENTS;
 
-	if (global.tune.runqueue_depth <= 0)
-		global.tune.runqueue_depth = RUNQUEUE_DEPTH;
+	if (global.tune.runqueue_depth <= 0) {
+		/* tests on various thread counts from 1 to 64 have shown an
+		 * optimal queue depth following roughly 1/sqrt(threads).
+		 */
+		int s = my_flsl(global.nbthread);
+		s += (global.nbthread / s); // roughly twice the sqrt.
+		global.tune.runqueue_depth = RUNQUEUE_DEPTH * 2 / s;
+	}
 
 	if (global.tune.recv_enough == 0)
 		global.tune.recv_enough = MIN_RECV_AT_ONCE_ENOUGH;