MINOR: listener: improve incoming traffic distribution

By picking two random threads following the P2C algorithm, we
occasionally observe asymmetric loads on bursts of small session counts.
This is typically what makes h2load take a bit of time to reach the
final 100%, because if a thread gets two connections while the other
ones only have one, it takes twice the time to complete its work.

This patch proposes a modification of the P2C algorithm which seems
more suitable to this case: it mixes a rotating index with a random.
This way, we're certain that all threads are consulted in turn, and at
the same time we're not forced to use the one whose turn has come.
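
For illustration only, the selection described above can be sketched as
a standalone C11 program fragment. This is not the listener code: the
names rr_idx, next_rr(), pick_thread() and the qlen array are
hypothetical, rand() stands in for random(), C11 atomics stand in for
HA_ATOMIC_CAS(), and the tie-break in favour of the round-robin pick is
an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the shared rotating index (thr_idx in the patch). */
static atomic_uint rr_idx;

/* Atomically advance the shared index modulo <count> and return the value
 * it had before the update, so concurrent callers each get their own turn.
 */
static unsigned int next_rr(unsigned int count)
{
	unsigned int cur = atomic_load(&rr_idx);
	unsigned int nxt;

	do {
		nxt = cur + 1;
		if (nxt >= count)
			nxt = 0;
		/* on failure, <cur> is reloaded with the current value */
	} while (!atomic_compare_exchange_weak(&rr_idx, &cur, nxt));
	return cur;
}

/* Pick one thread among <count>: t1 comes from the rotating index, t2 is a
 * random thread necessarily different from t1, and the one with the shorter
 * queue wins. <qlen> is a hypothetical array of per-thread queue lengths.
 * Ties go to t1 (assumption), which degrades to plain round robin when all
 * queues are equal.
 */
static unsigned int pick_thread(const unsigned int *qlen, unsigned int count)
{
	unsigned int t1, t2;

	assert(count >= 2);

	t1 = next_rr(count);
	t2 = (unsigned int)rand() % (count - 1); /* 0..count-2 */
	t2 += t1 + 1;                            /* t1+1..t1+count-1 */
	if (t2 >= count)
		t2 -= count;                     /* wrap, still != t1 */

	return (qlen[t2] < qlen[t1]) ? t2 : t1;
}
```

With equal queue lengths, successive calls to pick_thread() walk the
threads in order, which is the fairness wanted on short connections. As
soon as the round-robin candidate is more loaded than the random one,
the random one is used instead.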

This significantly increases the traffic rate. Now h2load shows faster
completion, and the average request rate on H2 and the TLS resume rate
both increase by a bit more than 5% compared to pure P2C.

The index was placed into the struct bind_conf because 1) it's faster
there and 2) it's the best place to optimally distribute traffic among
a group of listeners. It's the only runtime-modified element there and
it will be quite cache-hot.
diff --git a/include/types/listener.h b/include/types/listener.h
index 876d3c3..95f3cc9 100644
--- a/include/types/listener.h
+++ b/include/types/listener.h
@@ -172,6 +172,7 @@
 	unsigned long bind_thread; /* bitmask of threads allowed to use these listeners */
 	unsigned long thr_2, thr_4, thr_8, thr_16; /* intermediate values for bind_thread counting */
 	unsigned int thr_count;    /* #threads bound */
+	unsigned int thr_idx;      /* thread indexes for queue distribution : (t2<<16)+t1 */
 	uint32_t ns_cip_magic;     /* Excepted NetScaler Client IP magic number */
 	struct list by_fe;         /* next binding for the same frontend, or NULL */
 	char *arg;                 /* argument passed to "bind" for better error reporting */
diff --git a/src/listener.c b/src/listener.c
index 9a9699c..3e080b4 100644
--- a/src/listener.c
+++ b/src/listener.c
@@ -847,12 +847,23 @@
 		count = l->bind_conf->thr_count;
 		if (count > 1 && (global.tune.options & GTUNE_LISTENER_MQ)) {
 			struct accept_queue_ring *ring;
-			int r, t1, t2, q1, q2;
+			int t1, t2, q1, q2;
+
+			/* pick a first thread ID using a round robin index,
+			 * and a second thread ID using a random. The
+			 * connection will be assigned to the one with the
+			 * least connections. This provides fairness on short
+			 * connections (round robin) and on long ones (conn
+			 * count).
+			 */
+			t1 = l->bind_conf->thr_idx;
+			do {
+				t2 = t1 + 1;
+				if (t2 >= count)
+					t2 = 0;
+			} while (!HA_ATOMIC_CAS(&l->bind_conf->thr_idx, &t1, t2));
 
-			/* pick two small distinct random values and drop lower bits */
-			r = (random() >> 8) % ((count - 1) * count);
-			t2 = r / count; // 0..thr_count-2
-			t1 = r % count; // 0..thr_count-1
+			t2 = (random() >> 8) % (count - 1);  // 0..thr_count-2
 			t2 += t1 + 1;   // necessarily different from t1
 
 			if (t2 >= count)