BUG/MEDIUM: fd: do not wait on FD removal in fd_delete()

Christopher discovered an issue mostly affecting 2.2, and to a lesser extent
2.3 and above, which is that it's possible to deadlock a soft-stop when
several threads are using the same listener:

          thread1                             thread2

      unbind_listener()                   fd_set_running()
          lock(listener)                  listener_accept()
          fd_delete()                          lock(listener)
             while (running_mask);  -----> deadlock
          unlock(listener)

This simple case disappeared from 2.3 due to the removal of some locked
operations at the end of listener_accept() on the regular path, but the
architectural problem is still there, caused by a lock inversion built
around the loop on running_mask in fd_clr_running_excl(): there are
situations where the caller of fd_delete() may hold a lock that prevents
other threads from dropping their bit in running_mask.
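
For illustration, here is a minimal standalone model of that inversion,
using C11 atomics and a pthread mutex in place of HAProxy's own primitives
(all identifiers below are illustrative, not the real ones):

    #include <pthread.h>
    #include <stdatomic.h>

    static pthread_mutex_t listener_lock = PTHREAD_MUTEX_INITIALIZER;
    static _Atomic unsigned long running;    /* stands for the FD's running_mask */

    static void thread1_unbind(unsigned long my_bit)
    {
        pthread_mutex_lock(&listener_lock);       /* unbind_listener() takes the lock */
        atomic_fetch_or(&running, my_bit);
        while (atomic_load(&running) != my_bit)   /* old exclusive-running wait */
            ;                                     /* spins: thread2 cannot drop its bit */
        /* ... close the FD ... */
        atomic_fetch_and(&running, ~my_bit);
        pthread_mutex_unlock(&listener_lock);
    }

    static void thread2_accept(unsigned long my_bit)
    {
        atomic_fetch_or(&running, my_bit);        /* just woke up from the poller */
        pthread_mutex_lock(&listener_lock);       /* blocks behind thread1: deadlock */
        /* ... listener_accept() work ... */
        pthread_mutex_unlock(&listener_lock);
        atomic_fetch_and(&running, ~my_bit);      /* never reached */
    }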

The real need here is to make sure the last user deletes the FD. We have
all we need to identify that last user: it's either the one calling
fd_clr_running() last or the one entering fd_delete() last, both of which
boil down to the last one calling fd_clr_running() provided that
fd_delete() calls fd_clr_running() at the end. And we can prevent new
threads from appearing in running_mask by removing their bits from
thread_mask.

So what this patch does is set the calling thread's bit in running_mask in
fd_delete(), clear thread_mask, thus marking the FD as orphaned, then clear
the running bit again and complete the deletion if that thread was the last
one. If it was not, another thread will later pass through fd_clr_running()
and will complete the deletion of the FD.
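
As a rough standalone sketch of that sequence (plain C11 atomics instead of
HAProxy's macros; fd_slot, clr_running, delete_orphan and model_fd_delete
are illustrative names, not real identifiers; the complementary check that
other threads perform when dropping their running bit is not shown since it
lives outside this excerpt):

    #include <stdatomic.h>

    struct fd_slot {
        _Atomic unsigned long running_mask;   /* threads currently using the FD */
        _Atomic unsigned long thread_mask;    /* threads allowed to pick it up */
    };

    /* drops this thread's bit and returns the remaining ones: a zero return
     * identifies the last user (this is what the prerequisite patch makes
     * fd_clr_running() report).
     */
    static unsigned long clr_running(struct fd_slot *s, unsigned long tid_bit)
    {
        /* atomic_fetch_and() returns the previous value, so mask our own bit
         * out of it to obtain what remains after our departure.
         */
        return atomic_fetch_and(&s->running_mask, ~tid_bit) & ~tid_bit;
    }

    /* stands for the real cleanup and close (_fd_delete_orphan() below) */
    static void delete_orphan(struct fd_slot *s) { (void)s; /* close(), free... */ }

    /* mirrors the new fd_delete(): claim a running bit, orphan the FD, then
     * drop the bit and only delete if nobody else is left.
     */
    static void model_fd_delete(struct fd_slot *s, unsigned long tid_bit)
    {
        atomic_fetch_or(&s->running_mask, tid_bit);   /* count ourselves as a user */
        atomic_store(&s->thread_mask, 0);             /* no new users may appear */
        if (clr_running(s, tid_bit) == 0)
            delete_orphan(s);                         /* we were the last user */
    }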

The bug is easily reproducible in 2.2 under high connection rates during
soft close. When the old process stops its listener, occasionally two
threads will deadlock and the old process will then be killed by the
watchdog. It's strongly believed that similar situations do exist in 2.3
and 2.4 (e.g. if the removal attempt happens during resume_listener()
called from listener_accept()) but if so, they should be much harder to
trigger.

This should be backported to 2.2 as the issue appeared with the FD
migration. It requires previous patches "fd: make fd_clr_running() return
the remaining running mask" and "MINOR: fd: remove the unneeded running
bit from fd_insert()".

Notes for backport: in 2.2, the fd_dodelete() function requires an extra
argument "do_close" indicating whether we want to remove and close the FD
(fd_delete) or just delete it (fd_remove). While this information is not
conveyed along the chain, we know that late calls always imply do_close=1
because do_close=0 exclusively results from fd_remove() which is only used
by the config parser and the master, both of which are single-threaded,
hence are always the last ones in the running_mask. Thus it is safe to
assume that a postponed FD deletion always implies do_close=1.
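
Purely as an illustration of this note (a hypothetical shape, not the actual
backport, and assuming a 2.2 _fd_delete_orphan() variant carrying the
do_close flag), the backported fd_dodelete() could keep its argument on the
immediate path and hard-code do_close=1 on the postponed one:

    void fd_dodelete(int fd, int do_close)
    {
            HA_ATOMIC_OR(&fdtab[fd].running_mask, tid_bit);
            HA_ATOMIC_STORE(&fdtab[fd].thread_mask, 0);
            if (fd_clr_running(fd) == 0)
                    _fd_delete_orphan(fd, do_close);
    }

    /* ... and on the path where another thread later finds it was the last
     * user, the postponed deletion can safely hard-code the close:
     */
    _fd_delete_orphan(fd, /* do_close */ 1);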

Thanks to Olivier for his help in designing this optimal solution.
diff --git a/src/fd.c b/src/fd.c
index 292a9e0..6a63ec5 100644
--- a/src/fd.c
+++ b/src/fd.c
@@ -294,23 +294,14 @@
 #undef _GET_NEXT
 #undef _GET_PREV
 
-/* Deletes an FD from the fdsets.
- * The file descriptor is also closed.
+/* deletes the FD once nobody uses it anymore, as detected by the caller by its
+ * thread_mask being zero and its running mask turning to zero. There is no
+ * protection against concurrent accesses, it's up to the caller to make sure
+ * only the last thread will call it. This is only for internal use, please use
+ * fd_delete() instead.
  */
-void fd_delete(int fd)
+void _fd_delete_orphan(int fd)
 {
-	int locked = fdtab[fd].running_mask != tid_bit;
-
-	/* We're just trying to protect against a concurrent fd_insert()
-	 * here, not against fd_takeover(), because either we're called
-	 * directly from the iocb(), and we're already locked, or we're
-	 * called from the mux tasklet, but then the mux is responsible for
-	 * making sure the tasklet does nothing, and the connection is never
-	 * destroyed.
-	 */
-	if (locked)
-		fd_set_running_excl(fd);
-
 	if (fdtab[fd].linger_risk) {
 		/* this is generally set when connecting to servers */
 		DISGUISE(setsockopt(fd, SOL_SOCKET, SO_LINGER,
@@ -328,12 +319,39 @@
 	port_range_release_port(fdinfo[fd].port_range, fdinfo[fd].local_port);
 	fdinfo[fd].port_range = NULL;
 	fdtab[fd].owner = NULL;
-	fdtab[fd].thread_mask = 0;
 	fdtab[fd].exported = 0;
+	/* perform the close() call last as it's what unlocks the instant reuse
+	 * of this FD by any other thread.
+	 */
 	close(fd);
 	_HA_ATOMIC_SUB(&ha_used_fds, 1);
-	if (locked)
-		fd_clr_running(fd);
+}
+
+/* Deletes an FD from the fdsets. The file descriptor is also closed, possibly
+ * asynchronously. Only the owning thread may do this.
+ */
+void fd_delete(int fd)
+{
+	/* we must postpone removal of an FD that may currently be in use
+	 * by another thread. This can happen in the following two situations:
+	 *   - after a takeover, the owning thread closes the connection but
+	 *     the previous one just woke up from the poller and entered
+	 *     the FD handler iocb. That thread holds an entry in running_mask
+	 *     and requires removal protection.
+	 *   - multiple threads are accepting connections on a listener, and
+	 *     one of them (or even a separate one) decides to unbind the
+	 *     listener under the listener's lock while other ones still hold
+	 *     the running bit.
+	 * In both situations the FD is marked as unused (thread_mask = 0) and
+	 * will not take new bits in its running_mask so we have the guarantee
+	 * that the last thread eliminating running_mask is the one allowed to
+	 * safely delete the FD. Most of the time it will be the current thread.
+	 */
+
+	HA_ATOMIC_OR(&fdtab[fd].running_mask, tid_bit);
+	HA_ATOMIC_STORE(&fdtab[fd].thread_mask, 0);
+	if (fd_clr_running(fd) == 0)
+		_fd_delete_orphan(fd);
 }
 
 #ifndef HA_HAVE_CAS_DW