BUG/MEDIUM: tasks: Make sure we wake sleeping threads if needed.

When waking a task on a remote thread, we currently check 1) if this
thread was sleeping, and 2) if it was already marked as active before
writing to its pipe. Unfortunately this doesn't always work as desired
because only one thread from the mask is woken up, while the
active_tasks_mask indicates all eligible threads for this task. As a
result, if one multi-thread task (e.g. a health check) wakes up to run
on any thread, then an accept() dispatches an incoming connection on
thread 2, this thread will already have its bit set in active_tasks_mask
because of the previous wakeup and will not be woken up.

This is easily noticeable on 2.0-dev by injecting on a multi-threaded
listener with a single connection at a time while health checks are
running quickly in the background : the injection runs slowly with
random response times (the poll timeouts). In 1.9 it affects the
dequeing of server connections, which occasionally experience pauses
if multiple threads share the same queue.

The correct solution consists in adjusting the sleeping_thread_mask
when waking another thread up. This mask reflects threads that are
sleeping, hence that need to be signaled to wake up. Threads with a
bit in active_tasks_mask already don't have their sleeping_thread_mask
bit set before polling so the principle remains consistent. And by
doing so we can remove the old_active_mask field.

This should be backported to 1.9.
diff --git a/src/task.c b/src/task.c
index a376513..637e235 100644
--- a/src/task.c
+++ b/src/task.c
@@ -69,7 +69,6 @@
 {
 	void *expected = NULL;
 	int *rq_size;
-	unsigned long __maybe_unused old_active_mask;
 
 #ifdef USE_THREAD
 	if (root == &rqueue) {
@@ -125,7 +124,6 @@
 		__ha_barrier_atomic_store();
 	}
 #endif
-	old_active_mask = active_tasks_mask;
 	_HA_ATOMIC_OR(&active_tasks_mask, t->thread_mask);
 	t->rq.key = _HA_ATOMIC_ADD(&rqueue_ticks, 1);
 
@@ -160,9 +158,13 @@
 	 * wake one.
 	 */
 	if ((((t->thread_mask & all_threads_mask) & sleeping_thread_mask) ==
-	    (t->thread_mask & all_threads_mask)) &&
-	    !(t->thread_mask & old_active_mask))
-		wake_thread(my_ffsl((t->thread_mask & all_threads_mask) &~ tid_bit) - 1);
+	     (t->thread_mask & all_threads_mask))) {
+		unsigned long m = (t->thread_mask & all_threads_mask) &~ tid_bit;
+
+		m = (m & (m - 1)) ^ m; // keep lowest bit set
+		_HA_ATOMIC_AND(&sleeping_thread_mask, ~m);
+		wake_thread(my_ffsl(m) - 1);
+	}
 #endif
 	return;
 }