BUG/MEDIUM: fd: do not wait on FD removal in fd_delete()

Christopher discovered an issue mostly affecting 2.2, and to a lesser extent
2.3 and above, which is that it's possible to deadlock a soft-stop when
several threads are using the same listener:

          thread1                             thread2

      unbind_listener()                   fd_set_running()
          lock(listener)                  listener_accept()
          fd_delete()                          lock(listener)
             while (running_mask);  -----> deadlock
          unlock(listener)

This simple case disappeared from 2.3 due to the removal of some locked
operations at the end of listener_accept() on the regular path, but the
architectural problem is still there, caused by a lock inversion built
around the loop on running_mask in fd_clr_running_excl(): there are
situations where the caller of fd_delete() may hold a lock that prevents
other threads from dropping their bit in running_mask.
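
For illustration, here is a minimal standalone model of that inversion,
using C11 atomics and a pthread mutex in place of HAProxy's own primitives
(all identifiers below are illustrative, not the real ones):

    #include <pthread.h>
    #include <stdatomic.h>

    static pthread_mutex_t listener_lock = PTHREAD_MUTEX_INITIALIZER;
    static _Atomic unsigned long running;    /* stands for the FD's running_mask */

    static void thread1_unbind(unsigned long my_bit)
    {
        pthread_mutex_lock(&listener_lock);       /* unbind_listener() takes the lock */
        atomic_fetch_or(&running, my_bit);
        while (atomic_load(&running) != my_bit)   /* old exclusive-running wait */
            ;                                     /* spins: thread2 cannot drop its bit */
        /* ... close the FD ... */
        atomic_fetch_and(&running, ~my_bit);
        pthread_mutex_unlock(&listener_lock);
    }

    static void thread2_accept(unsigned long my_bit)
    {
        atomic_fetch_or(&running, my_bit);        /* just woke up from the poller */
        pthread_mutex_lock(&listener_lock);       /* blocks behind thread1: deadlock */
        /* ... listener_accept() work ... */
        pthread_mutex_unlock(&listener_lock);
        atomic_fetch_and(&running, ~my_bit);      /* never reached */
    }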

The real need here is to make sure the last user deletes the FD. We have
all we need to identify that last user: it's either the one calling
fd_clr_running() last or the one entering fd_delete() last, both of which
boil down to the last one calling fd_clr_running() provided that
fd_delete() calls fd_clr_running() at the end. And we can prevent new
threads from appearing in running_mask by removing their bits from
thread_mask.

So what this patch does is set the calling thread's bit in running_mask in
fd_delete(), clear thread_mask, thus marking the FD as orphaned, then clear
the running bit again and complete the deletion if that thread was the last
one. If it was not, another thread will later pass through fd_clr_running()
and will complete the deletion of the FD.
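
As a rough standalone sketch of that sequence (plain C11 atomics instead of
HAProxy's macros; fd_slot, clr_running, delete_orphan and model_fd_delete
are illustrative names, not real identifiers; the complementary check that
other threads perform when dropping their running bit is not shown since it
lives outside this excerpt):

    #include <stdatomic.h>

    struct fd_slot {
        _Atomic unsigned long running_mask;   /* threads currently using the FD */
        _Atomic unsigned long thread_mask;    /* threads allowed to pick it up */
    };

    /* drops this thread's bit and returns the remaining ones: a zero return
     * identifies the last user (this is what the prerequisite patch makes
     * fd_clr_running() report).
     */
    static unsigned long clr_running(struct fd_slot *s, unsigned long tid_bit)
    {
        /* atomic_fetch_and() returns the previous value, so mask our own bit
         * out of it to obtain what remains after our departure.
         */
        return atomic_fetch_and(&s->running_mask, ~tid_bit) & ~tid_bit;
    }

    /* stands for the real cleanup and close (_fd_delete_orphan() below) */
    static void delete_orphan(struct fd_slot *s) { (void)s; /* close(), free... */ }

    /* mirrors the new fd_delete(): claim a running bit, orphan the FD, then
     * drop the bit and only delete if nobody else is left.
     */
    static void model_fd_delete(struct fd_slot *s, unsigned long tid_bit)
    {
        atomic_fetch_or(&s->running_mask, tid_bit);   /* count ourselves as a user */
        atomic_store(&s->thread_mask, 0);             /* no new users may appear */
        if (clr_running(s, tid_bit) == 0)
            delete_orphan(s);                         /* we were the last user */
    }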

The bug is easily reproducible in 2.2 under high connection rates during
soft close. When the old process stops its listener, occasionally two
threads will deadlock and the old process will then be killed by the
watchdog. It's strongly believed that similar situations do exist in 2.3
and 2.4 (e.g. if the removal attempt happens during resume_listener()
called from listener_accept()) but if so, they should be much harder to
trigger.

This should be backported to 2.2 as the issue appeared with the FD
migration. It requires previous patches "fd: make fd_clr_running() return
the remaining running mask" and "MINOR: fd: remove the unneeded running
bit from fd_insert()".

Notes for backport: in 2.2, the fd_dodelete() function requires an extra
argument "do_close" indicating whether we want to remove and close the FD
(fd_delete) or just delete it (fd_remove). While this information is not
conveyed along the chain, we know that late calls always imply do_close=1
because do_close=0 exclusively results from fd_remove() which is only used
by the config parser and the master, both of which are single-threaded,
hence are always the last ones in the running_mask. Thus it is safe to
assume that a postponed FD deletion always implies do_close=1.
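
Purely as an illustration of this note (a hypothetical shape, not the actual
backport, and assuming a 2.2 _fd_delete_orphan() variant carrying the
do_close flag), the backported fd_dodelete() could keep its argument on the
immediate path and hard-code do_close=1 on the postponed one:

    void fd_dodelete(int fd, int do_close)
    {
            HA_ATOMIC_OR(&fdtab[fd].running_mask, tid_bit);
            HA_ATOMIC_STORE(&fdtab[fd].thread_mask, 0);
            if (fd_clr_running(fd) == 0)
                    _fd_delete_orphan(fd, do_close);
    }

    /* ... and on the path where another thread later finds it was the last
     * user, the postponed deletion can safely hard-code the close:
     */
    _fd_delete_orphan(fd, /* do_close */ 1);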

Thanks to Olivier for his help in designing this optimal solution.
diff --git a/src/fd.c b/src/fd.c
index 292a9e0..6a63ec5 100644
--- a/src/fd.c
+++ b/src/fd.c
@@ -294,23 +294,14 @@
 #undef _GET_NEXT
 #undef _GET_PREV
 
-/* Deletes an FD from the fdsets.
- * The file descriptor is also closed.
+/* deletes the FD once nobody uses it anymore, as detected by the caller by its
+ * thread_mask being zero and its running mask turning to zero. There is no
+ * protection against concurrent accesses, it's up to the caller to make sure
+ * only the last thread will call it. This is only for internal use, please use
+ * fd_delete() instead.
  */
-void fd_delete(int fd)
+void _fd_delete_orphan(int fd)
 {
-	int locked = fdtab[fd].running_mask != tid_bit;
-
-	/* We're just trying to protect against a concurrent fd_insert()
-	 * here, not against fd_takeover(), because either we're called
-	 * directly from the iocb(), and we're already locked, or we're
-	 * called from the mux tasklet, but then the mux is responsible for
-	 * making sure the tasklet does nothing, and the connection is never
-	 * destroyed.
-	 */
-	if (locked)
-		fd_set_running_excl(fd);
-
 	if (fdtab[fd].linger_risk) {
 		/* this is generally set when connecting to servers */
 		DISGUISE(setsockopt(fd, SOL_SOCKET, SO_LINGER,
@@ -328,12 +319,39 @@
 	port_range_release_port(fdinfo[fd].port_range, fdinfo[fd].local_port);
 	fdinfo[fd].port_range = NULL;
 	fdtab[fd].owner = NULL;
-	fdtab[fd].thread_mask = 0;
 	fdtab[fd].exported = 0;
+	/* perform the close() call last as it's what unlocks the instant reuse
+	 * of this FD by any other thread.
+	 */
 	close(fd);
 	_HA_ATOMIC_SUB(&ha_used_fds, 1);
-	if (locked)
-		fd_clr_running(fd);
+}
+
+/* Deletes an FD from the fdsets. The file descriptor is also closed, possibly
+ * asynchronously. Only the owning thread may do this.
+ */
+void fd_delete(int fd)
+{
+	/* we must postpone removal of an FD that may currently be in use
+	 * by another thread. This can happen in the following two situations:
+	 *   - after a takeover, the owning thread closes the connection but
+	 *     the previous one just woke up from the poller and entered
+	 *     the FD handler iocb. That thread holds an entry in running_mask
+	 *     and requires removal protection.
+	 *   - multiple threads are accepting connections on a listener, and
+	 *     one of them (or even a separate one) decides to unbind the
+	 *     listener under the listener's lock while other ones still hold
+	 *     the running bit.
+	 * In both situations the FD is marked as unused (thread_mask = 0) and
+	 * will not take new bits in its running_mask so we have the guarantee
+	 * that the last thread eliminating running_mask is the one allowed to
+	 * safely delete the FD. Most of the time it will be the current thread.
+	 */
+
+	HA_ATOMIC_OR(&fdtab[fd].running_mask, tid_bit);
+	HA_ATOMIC_STORE(&fdtab[fd].thread_mask, 0);
+	if (fd_clr_running(fd) == 0)
+		_fd_delete_orphan(fd);
 }
 
 #ifndef HA_HAVE_CAS_DW