BUG/MEDIUM: mux-h2: fail earlier on malloc in takeover()

Connection takeover was implemented for H2 in 2.2 by commit cd4159f03
("MEDIUM: mux_h2: Implement the takeover() method."). It does have one
corner case related to memory allocation failure: in case the task or
tasklet allocation fails, the connection gets released synchronously.
Unfortunately the situation is bad there, because the lower layers are
already switched to the new thread while the tasklet is either NULL or
still the old one, and calling h2_release() will also result in calls to
h2_process() and h2_process_demux(), which may process any pending
frames. Even the session remains the old one on the old thread, so that
some sess_log() calls made when facing certain demux errors will be
associated with the previous thread, possibly accessing a number of
elements belonging to another thread. There are even code paths where the
thread will try to grab the lock of its own idle conns list, believing
the connection is there, which has no useful effect. However, if the
owner thread was doing the same at the same moment, and ended up trying
to pick from the current thread (which could happen if picking a
connection for a different name), the two could even deadlock.

The risk is extremely low, but Fred managed to reproduce use-after-free
errors in conn_backend_get() after a takeover() failed by playing with
-dMfail, indicating that h2_release() had been successfully called. In
practice it's sufficient to have h2 on the server side with reuse-always
and to inject lots of requests on it with -dMfail.

This patch takes a simple but radically different approach. Instead of
starting to migrate the connection before the allocations that may fail,
it first pre-allocates a new task and tasklet, then assigns them to the
connection if the migration succeeds, otherwise it just frees them. This
way the connection no longer needs to be manipulated until it's fully
migrated, and as a bonus this means the connection will continue to exist
and the use-after-free condition is solved at the same time.
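
The ordering described above can be illustrated with the minimal sketch
below. It is not the code from this patch: struct worker, struct conn,
worker_new(), lower_takeover() and the fail_alloc counter are all
hypothetical stand-ins made up for the illustration, and the real mux has
to release the old task and tasklet with care for the thread that owns
them, which is skipped here.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for a task/tasklet and a connection; these are
 * not HAProxy's real structures.
 */
struct worker {
    int thread;              /* thread this task/tasklet is bound to */
};

struct conn {
    int thread;              /* thread currently owning the lower layers */
    struct worker *task;
    struct worker *tasklet;
};

/* When set to N > 0, the Nth next call to worker_new() fails. This only
 * imitates, deterministically, the kind of failure -dMfail provokes.
 */
static int fail_alloc;

static struct worker *worker_new(int thread)
{
    struct worker *w;

    if (fail_alloc && --fail_alloc == 0)
        return NULL;
    w = malloc(sizeof(*w));
    if (w)
        w->thread = thread;
    return w;
}

/* Stand-in for the migration of the lower layers. It always succeeds in
 * this sketch, but the caller still handles a failure.
 */
static int lower_takeover(struct conn *c, int thread)
{
    c->thread = thread;
    return 0;
}

/* Returns 0 on success, -1 on failure. On failure caused by an
 * allocation, the connection is left exactly as it was.
 */
static int takeover(struct conn *c, int thread)
{
    /* 1. allocate everything first, the connection is not touched yet */
    struct worker *new_task = worker_new(thread);
    struct worker *new_tasklet = worker_new(thread);

    if (!new_task || !new_tasklet)
        goto fail;

    /* 2. migrate the lower layers */
    if (lower_takeover(c, thread) != 0)
        goto fail;

    /* 3. commit: no allocation is left that could fail at this point */
    free(c->task);
    free(c->tasklet);
    c->task = new_task;
    c->tasklet = new_tasklet;
    return 0;

fail:
    free(new_task);          /* free(NULL) is a no-op */
    free(new_tasklet);
    return -1;
}
```

With this ordering, an allocation failure is detected while the
connection still entirely belongs to its original thread, so there is
nothing to undo and nothing to release.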

This should be backported to 2.2. Thanks to Fred for the initial analysis
of the problem!

(cherry picked from commit 4f02e3da67aece3476e78f43f1cea45685edc48a)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>
(cherry picked from commit 960f37c7c6dbfb5bed36560e8acd6b4ac06e9519)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>
(cherry picked from commit 11c42009f4f9dc863796536f23837e1ec49cd050)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>
(cherry picked from commit f2734f7f19e77ab48950d95e0062c02da4480cf9)
[cf: task_new(tid_bit) is used instead of task_new_here()]
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>
1 file changed