BUG/MEDIUM: debug/thread: make the debug handler not wait for !rdv_requests
The debug handler may deadlock with some threads waiting for isolation.
This may happend during a "show threads" command or even during a panic.
The reason is the call to thread_harmless_end() which waits for rdv_requests
to turn to zero before releasing its position in thread_dump_state,
while that one may not progress if another thread was interrupted in
thread_isolate() and is waiting for that thread to drop thread_dump_state.
In order to address this, we now use thread_harmless_end_sig() introduced
by previous commit:
MINOR: threads: add a thread_harmless_end() version that doesn't wait
However there's a catch: since commit f7afdd910 ("MINOR: debug: mark
oneself harmless while waiting for threads to finish"), there's a second
pair of thread_harmless_now()/thread_harmless_end() that surround the
loop around thread_dump_state. Marking a thread harmless before this
loop and dropping that without checking rdv_requests there could break
the harmless promise made to the other thread if it returns first and
proceeds with its isolated work. Hence we just drop this pair which was
only preventive for other signal handlers, while as indicated in that
patch's commit message, other signals are handled asynchronously and do
not require that extra protection.
This fix must be backported to 2.7.
The problem can be seen by running "show threads" in fast loops (100/s)
while reloading haproxy very quickly (10/s) and sending lots of traffic
to it (100krps, 15 Gbps). In this case the soft stop calls pool_gc()
which isolates a lot and manages to race with the dumps after a few
tens of seconds, leaving the process with all threads at 100%.
diff --git a/src/debug.c b/src/debug.c
index 05c7a5c..44f9755 100644
--- a/src/debug.c
+++ b/src/debug.c
@@ -1562,7 +1562,7 @@
ha_thread_relax();
if (!harmless)
- thread_harmless_end();
+ thread_harmless_end_sig();
/* dump if needed */
if (thread_dump_buffer)
@@ -1598,18 +1598,17 @@
}
/* now wait for all others to finish dumping: the lowest part will turn
- * to zero. Then all others decrement the done part.
+ * to zero. Then all others decrement the done part. We must not change
+ * the harmless status anymore because one of the other threads might
+ * have been interrupted in thread_isolate() waiting for all others to
+ * become harmless, and changing the situation here would break that
+ * promise.
*/
- if (!harmless)
- thread_harmless_now();
/* wait for everyone to finish*/
while (HA_ATOMIC_LOAD(&thread_dump_state) & THREAD_DUMP_PMASK)
ha_thread_relax();
- if (!harmless)
- thread_harmless_end();
-
/* we're gone. Past this point anything can happen including another
* thread trying to re-trigger a dump, so thread_dump_buffer and
* thread_dump_tid may become invalid immediately after this call.