BUG/MAJOR: fd/threads: close a race on closing connections after takeover

As mentioned in commit 237e6a0d6 ("BUG/MAJOR: fd/thread: fix race between
updates and closing FD"), a race was found during stress tests involving
heavy backend connection reuse with many competing closes.

Here the problem is complex. The analysis in commit f69fea64e ("MAJOR:
fd: get rid of the DWCAS when setting the running_mask") that removed
the DWCAS in 2.5 overlooked a few races.

First, a takeover from thread1 could happen just after fd_update_events()
in thread2 validates it holds the tmask bit in the CAS loop. Since thread1
releases running_mask after the operation, thread2 will succeed the CAS
and both will believe the FD is theirs. This does explain the occasional
crashes seen with h1_io_cb() being called on a bad context, or
sock_conn_iocb() seeing conn->subs vanish after checking it. This issue
can be addressed using a DWCAS in both fd_takeover() and fd_update_events(),
as it was before the patch above, but this is not portable to all archs
and is not easy to emulate on those lacking it, because some operations
still happen only on individual masks since thread groups were added.
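
As an illustration of why this is painful, here is a minimal sketch of
such a DWCAS-based approach (made-up names, not HAProxy code): both
masks live in one 16-byte unit and are read and CASed together. It
requires a double-word CAS (e.g. -mcx16 plus libatomic on x86-64),
which is exactly what some archs lack:

    /* Sketch only: take our running bit while revalidating thread_mask
     * in a single 16-byte CAS, so both masks are known consistent.
     */
    struct fd_masks {
            unsigned long running_mask;
            unsigned long thread_mask;
    } __attribute__((aligned(16)));

    static int fd_take_running_dw(struct fd_masks *m, unsigned long tid_bit)
    {
            struct fd_masks old, new;

            __atomic_load(m, &old, __ATOMIC_ACQUIRE);
            do {
                    if (!(old.thread_mask & tid_bit))
                            return 0; /* FD migrated away, decline it */
                    new.running_mask = old.running_mask | tid_bit;
                    new.thread_mask  = old.thread_mask;
                    /* on failure, <old> is refreshed and we loop */
            } while (!__atomic_compare_exchange(m, &old, &new, 0,
                                                __ATOMIC_SEQ_CST,
                                                __ATOMIC_SEQ_CST));
            return 1; /* both masks were consistent at one instant */
    }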

Second, the check after fd_clr_running() that the current thread is
the last one is not sufficient: at the exact moment the operation
completes, another thread may also set and drop the running bit and see
itself as alone, and both can call _fd_delete_orphan() in parallel. In
order to prevent this from happening, we cannot rely on the absence of
others; we need an explicit flag indicating that the FD must be closed.
One approach that was attempted consisted in playing with the thread_mask
but that was not reliable since it could still match between the late
deletion and the early insertion that follows. Instead, a new FD flag
was added, FD_MUST_CLOSE, that exactly indicates that the call to
_fd_delete_orphan() must be done. It is set by fd_delete(), and
atomically cleared by the first one which checks it, and which is the
only one to call _fd_delete_orphan().
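
As a minimal sketch of the idea (generic C with made-up names, not the
HAProxy macros; HA_ATOMIC_BTR plays this role in the patch below), the
flag turns "am I the last one?" into a question that exactly one thread
can answer positively:

    #define FD_MUST_CLOSE (1U << 23)

    /* Returns non-zero for exactly one caller: the one whose atomic
     * fetch-and-clear observed FD_MUST_CLOSE still set, and which
     * therefore wins the right to perform the final close.
     */
    static int fd_claim_close(unsigned int *state)
    {
            unsigned int prev;

            prev = __atomic_fetch_and(state, ~FD_MUST_CLOSE, __ATOMIC_SEQ_CST);
            return !!(prev & FD_MUST_CLOSE);
    }

Even if two threads simultaneously believe they dropped the last
running bit, only one of them sees the flag set, so only one calls
_fd_delete_orphan().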

With both points addressed, there's no more visible race left:

- takeover() only happens under the connection list's lock and cannot
  compete with fd_delete() since fd_delete() must first remove the
  connection from the list before deleting the FD. That's also why it
  doesn't need to call _fd_delete_orphan() when dropping its running
  bit.

- takeover() sets its running bit then atomically replaces the thread
  mask, so that until that's done, it doesn't validate the condition
  to end the synchronization loop in fd_update_events(). Once it's OK,
  the previous thread's bit is lost, and this is checked for in
  fd_update_events() (see the sketch after this list).

- fd_update_events() can compete with fd_delete() at various places
  which are explained above. Since fd_delete() clears the thread mask
  only after setting its running bit and after setting the FD_MUST_CLOSE
  bit, the synchronization loop guarantees that the thread mask is seen
  before going further, and that once it's seen, the FD_MUST_CLOSE flag
  is already present.

- fd_delete() may start while fd_update_events() has already started,
  but fd_delete() must hold a bit in thread_mask before starting, and
  that is checked by the first test in fd_update_events() before setting
  the running_mask.

- the poller's _update_fd() will not compete against _fd_delete_orphan()
  nor fd_insert() thanks to the fd_grab_tgid() that's always done before
  updating the polled_mask, and guarantees that we never pretend that a
  polled_mask has a bit before the FD is added.
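
Here is a simplified sketch of the ordering between a taker and a
waiter as described in the second and third points (made-up helper
names; the real code below additionally handles the tgid refcount and
the FD_MUST_CLOSE path):

    /* Taker side: runs under the connection list's lock. The running
     * bit is published before thread_mask is swapped, so a concurrent
     * waiter cannot observe a stable state in the middle of a takeover.
     */
    static void take_fd(unsigned long *running, unsigned long *thread,
                        unsigned long new_tid_bit)
    {
            __atomic_fetch_or(running, new_tid_bit, __ATOMIC_SEQ_CST);
            __atomic_store_n(thread, new_tid_bit, __ATOMIC_SEQ_CST);
            __atomic_fetch_and(running, ~new_tid_bit, __ATOMIC_SEQ_CST);
    }

    /* Waiter side: returns 1 if we still own the FD once the masks are
     * stable, 0 if our bit was lost to a completed takeover.
     */
    static int fd_wait_stable(unsigned long *running, unsigned long *thread,
                              unsigned long tid_bit)
    {
            unsigned long rmask, tmask;

            __atomic_fetch_or(running, tid_bit, __ATOMIC_SEQ_CST);
            do {
                    rmask = __atomic_load_n(running, __ATOMIC_SEQ_CST);
                    tmask = __atomic_load_n(thread, __ATOMIC_SEQ_CST);
                    rmask &= ~tid_bit;
            } while (rmask & ~tmask); /* a takeover is still in flight */
            return !!(tmask & tid_bit);
    }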

The issue is very hard to reproduce and is extremely time-sensitive.
Some tests required a 1-ms timeout and request rates closely matching
1 kHz per server, though certain tests sometimes benefitted from
saturation. It was found that adding the following
slowdown at a few key places helped a lot and managed to trigger the
bug in 0.5 to 5 seconds instead of tens of minutes on a 20-thread
setup:

    { volatile int i = 10000; while (i--); }

In particular, placing it at key places where only one of running_mask
or thread_mask is set and not the other one yet (e.g. after the
synchronization loop in fd_update_events or after dropping the
running bit) did yield great results.

Many thanks to Olivier Houchard for his expert help analysing these
races and reviewing candidate fixes.

The patch must be backported to 2.5. Note that 2.6 does not have tgid
in FDs, and that it requires changing fd_clr_running()'s return value,
as we need the previous bit. This is provided by carefully backporting
commit d6e1987612 ("MINOR: fd: make fd_clr_running() return the previous
value instead"). Tests have shown that the lack of tgid is a showstopper
for 2.6 and that unless a better workaround is found, it could still be
preferable to backport the minimum pieces required for fd_grab_tgid()
to 2.6 so that it stays stable over the long term.
diff --git a/include/haproxy/fd-t.h b/include/haproxy/fd-t.h
index b2ca87c..d0c3134 100644
--- a/include/haproxy/fd-t.h
+++ b/include/haproxy/fd-t.h
@@ -71,6 +71,7 @@
 #define FD_EXPORTED_BIT    20  /* FD is exported and must not be closed */
 #define FD_EXCL_SYSCALL_BIT 21 /* a syscall claims exclusivity on this FD */
 #define FD_DISOWN_BIT      22  /* this fd will be closed by some external code */
+#define FD_MUST_CLOSE_BIT  23  /* this fd must be closed by the last one to drop running */
 
 
 /* and flag values */
@@ -112,6 +113,7 @@
 #define FD_EXPORTED         (1U << FD_EXPORTED_BIT)
 #define FD_EXCL_SYSCALL     (1U << FD_EXCL_SYSCALL_BIT)
 #define FD_DISOWN           (1U << FD_DISOWN_BIT)
+#define FD_MUST_CLOSE       (1U << FD_MUST_CLOSE_BIT)
 
 /* This function is used to report flags in debugging tools. Please reflect
  * below any single-bit flag addition above in the same order via the
diff --git a/src/fd.c b/src/fd.c
index 889b314..706aba8 100644
--- a/src/fd.c
+++ b/src/fd.c
@@ -419,12 +419,21 @@
 	 * will not take new bits in its running_mask so we have the guarantee
 	 * that the last thread eliminating running_mask is the one allowed to
 	 * safely delete the FD. Most of the time it will be the current thread.
+	 * We still need to set and check the one-shot flag FD_MUST_CLOSE
+	 * to take care of the rare cases where a thread wakes up on late I/O
+	 * before the thread_mask is zero, and sets its bit in the running_mask
+	 * just after the current thread finishes clearing its own bit, hence
+	 * the two threads see themselves as last ones (which they really are).
 	 */
 
 	HA_ATOMIC_OR(&fdtab[fd].running_mask, ti->ltid_bit);
+	HA_ATOMIC_OR(&fdtab[fd].state, FD_MUST_CLOSE);
 	HA_ATOMIC_STORE(&fdtab[fd].thread_mask, 0);
-	if (fd_clr_running(fd) == ti->ltid_bit)
-		_fd_delete_orphan(fd);
+	if (fd_clr_running(fd) == ti->ltid_bit) {
+		if (HA_ATOMIC_BTR(&fdtab[fd].state, FD_MUST_CLOSE_BIT)) {
+			_fd_delete_orphan(fd);
+		}
+	}
 }
 
 /* makes the new fd non-blocking and clears all other O_* flags; this is meant
@@ -587,29 +596,47 @@
 		return FD_UPDT_CLOSED;
 	}
 
-	/* do nothing if the FD was taken over under us */
+	/* Do not take running_mask if not strictly needed (will trigger a
+	 * cosmetic BUG_ON() in fd_insert() anyway if done).
+	 */
+	tmask = _HA_ATOMIC_LOAD(&fdtab[fd].thread_mask);
+	if (!(tmask & ti->ltid_bit))
+		goto do_update;
+
+	HA_ATOMIC_OR(&fdtab[fd].running_mask, ti->ltid_bit);
+
+	/* From this point, our bit may possibly be in thread_mask, but it may
+	 * still vanish, either because a takeover completed just before taking
+	 * the bit above with the new owner deleting the FD, or because a
+	 * takeover started just before taking the bit. In order to make sure a
+	 * started takeover is complete, we need to verify that all bits of
+	 * running_mask are present in thread_mask, since takeover first takes
+	 * running then atomically replaces thread_mask. Once it's stable, if
+	 * our bit remains there, no further takeover may happen because we
+	 * hold running, but if our bit is not there it means we've lost the
+	 * takeover race and have to decline touching the FD. Regarding the
+	 * risk of deletion, our bit in running_mask prevents fd_delete() from
+	 * finalizing the close, and the caller will leave the FD with a zero
+	 * thread_mask and the FD_MUST_CLOSE flag set. It will then be our
+	 * responsibility to close it.
+	 */
 	do {
-		/* make sure we read a synchronous copy of rmask and tmask
-		 * (tmask is only up to date if it reflects all of rmask's
-		 * bits).
-		 */
-		do {
-			rmask = _HA_ATOMIC_LOAD(&fdtab[fd].running_mask);
-			tmask = _HA_ATOMIC_LOAD(&fdtab[fd].thread_mask);
-		} while (rmask & ~tmask);
+		rmask = _HA_ATOMIC_LOAD(&fdtab[fd].running_mask);
+		tmask = _HA_ATOMIC_LOAD(&fdtab[fd].thread_mask);
+		rmask &= ~ti->ltid_bit;
+	} while (rmask & ~tmask);
 
-		if (!(tmask & ti->ltid_bit)) {
-			/* a takeover has started */
-			activity[tid].poll_skip_fd++;
+	/* Now tmask is stable. Do nothing if the FD was taken over under us */
 
-			/* Let the poller know this FD was lost */
-			if (!HA_ATOMIC_BTS(&fdtab[fd].update_mask, ti->ltid))
-				fd_updt[fd_nbupdt++] = fd;
+	if (!(tmask & ti->ltid_bit)) {
+		/* a takeover has started */
+		activity[tid].poll_skip_fd++;
 
-			fd_drop_tgid(fd);
-			return FD_UPDT_MIGRATED;
-		}
-	} while (!HA_ATOMIC_CAS(&fdtab[fd].running_mask, &rmask, rmask | ti->ltid_bit));
+		if (fd_clr_running(fd) == ti->ltid_bit)
+			goto closed_or_migrated;
+
+		goto do_update;
+	}
 
 	/* with running we're safe now, we can drop the reference */
 	fd_drop_tgid(fd);
@@ -706,16 +733,15 @@
 
 	/* another thread might have attempted to close this FD in the mean
 	 * time (e.g. timeout task) striking on a previous thread and closing.
-	 * This is detected by both thread_mask and running_mask being 0 after
-	 * we remove ourselves last. There is no risk the FD gets reassigned
+	 * This is detected by us being the last owners of a running_mask bit,
+	 * and the thread_mask being zero. At the moment we release the running
+	 * bit, a takeover may also happen, so in practice we check for our loss
+	 * of the thread_mask bit. There is no risk the FD gets reassigned
 	 * to a different group since it's not released until the real close()
 	 * in _fd_delete_orphan().
 	 */
-	if (fd_clr_running(fd) == ti->ltid_bit && !tmask) {
-		fd_drop_tgid(fd);
-		_fd_delete_orphan(fd);
-		return FD_UPDT_CLOSED;
-	}
+	if (fd_clr_running(fd) == ti->ltid_bit && !(tmask & ti->ltid_bit))
+		goto closed_or_migrated;
 
 	/* we had to stop this FD and it still must be stopped after the I/O
 	 * cb's changes, so let's program an update for this.
@@ -729,6 +755,35 @@
 
 	fd_drop_tgid(fd);
 	return FD_UPDT_DONE;
+
+ closed_or_migrated:
+	/* We only come here once we've last dropped running and the FD is
+	 * not for us as per !(tmask & tid_bit). It may imply we're
+	 * responsible for closing it. Otherwise it's just a migration.
+	 */
+	if (HA_ATOMIC_BTR(&fdtab[fd].state, FD_MUST_CLOSE_BIT)) {
+		fd_drop_tgid(fd);
+		_fd_delete_orphan(fd);
+		return FD_UPDT_CLOSED;
+	}
+
+	/* So we were alone, no close bit, at best the FD was migrated, at
+	 * worst it's in the process of being closed by another thread. We must
+	 * be ultra-careful as it can be re-inserted by yet another thread as
+	 * the result of socket() or accept(). Let's just tell the poller the
+	 * FD was lost. If it was closed it was already removed and this will
+	 * only cost an update for nothing.
+	 */
+
+ do_update:
+	/* The FD is not closed but we don't want the poller to wake up for
+	 * it anymore.
+	 */
+	if (!HA_ATOMIC_BTS(&fdtab[fd].update_mask, ti->ltid))
+		fd_updt[fd_nbupdt++] = fd;
+
+	fd_drop_tgid(fd);
+	return FD_UPDT_MIGRATED;
 }
 
 /* This is used by pollers at boot time to re-register desired events for