BUG/MAJOR: pools: fix possible race with free() in the lockless variant

In GH issue #1275, Fabiano Nunes Parente provided a nicely detailed
report showing reproducible crashes under musl. Musl is one of the libs
coming with a simple allocator for which we prefer to keep the shared
cache. On x86 we have a DWCAS so the lockless implementation is enabled
for such libraries.

And this implementation has had a small race since day one: the allocator
will need to read the first object's <next> pointer to place it into the
free list's head. If another thread picks the same element and immediately
releases it, while both the local and the shared pools are too crowded, it
will be freed to the OS. If the libc's allocator immediately releases it,
the memory area is unmapped and we can have a crash while trying to read
that pointer. However there is no problem as long as the item remains
mapped in memory because whatever value found there will not be placed
into the head since the counter will have changed.

The probability for this to happen is extremely low, but as analyzed by
Fabiano, it increases with the buffer size. On 16 threads it's relatively
easy to reproduce with 2MB buffers above 200k req/s, where it should
happen within the first 20 seconds of traffic usually.

This is a structural issue for which there are two non-trivial solutions:
  - place a read lock in the alloc call and a barrier made of lock/unlock
    in the free() call to force to serialize operations; this will have
    a big performance impact since free() is already one of the contention
    points;

  - change the allocator to use a self-locked head, similar to what is
    done in the MT_LISTS. This requires two memory writes to the head
    instead of a single one, thus the overhead is exactly one memory
    write during alloc and one during free;

This patch implements the second option. A new POOL_DUMMY pointer was
defined for the locked pointer value, allowing to both read and lock it
with a single xchg call. The code was carefully optimized so that the
locked period remains the shortest possible and that bus writes are
avoided as much as possible whenever the lock is held.

Tests show that while a bit slower than the original lockless
implementation on large buffers (2MB), it's 2.6 times faster than both
the no-cache and the locked implementation on such large buffers, and
remains as fast or faster than the all implementations when buffers are
48k or higher. Tests were also run on arm64 with similar results.

Note that this code is not used on modern libcs featuring a fast allocator.

A nice benefit of this change is that since it removes a dependency on
the DWCAS, it will be possible to remove the locked implementation and
replace it with this one, that is then usable on all systems, thus
significantly increasing their performance with large buffers.

Given that lockless pools were introduced in 1.9 (not supported anymore),
this patch will have to be backported as far as 2.0. The code changed
several times in this area and is subject to many ifdefs which will
complicate the backport. What is important is to remove all the DWCAS
code from the shared cache alloc/free lockless code and replace it with
this one. The pool_flush() code is basically the same code as the
allocator, retrieving the whole list at once. If in doubt regarding what
barriers to use in older versions, it's safe to use the generic ones.

This patch depends on the following previous commits:

 - MINOR: pools: do not maintain the lock during pool_flush()
 - MINOR: pools: call malloc_trim() under thread isolation
 - MEDIUM: pools: use a single pool_gc() function for locked and lockless

The last one also removes one occurrence of an unneeded DWCAS in the
code that was incompatible with this fix. The removal of the now unused
seq field will happen in a future patch.

Many thanks to Fabiano for his detailed report, and to Olivier for
his help on this issue.

(cherry picked from commit 2a4523f6f4cba1cb9c899d0527a1bb4c5d5c0f2e)
Signed-off-by: Willy Tarreau <w@1wt.eu>
(cherry picked from commit d0cc3761495b55764e58aff97f7c82a687c5239f)
[wt: backported into the lockless functions only; kept pool_free_area()
 and local accounting instead of pool_free_to_os(); tested with random
 pool_flush() and heavy pool_gc() calls]
Signed-off-by: Willy Tarreau <w@1wt.eu>
(cherry picked from commit bc76411e025d146e6627dae6122d243e2f8982c4)
[wt: s/__ha_cpu_relax()/pl_cpu_relax()]
Signed-off-by: Willy Tarreau <w@1wt.eu>
(cherry picked from commit 179687ecc97e161f0c2381ae4ac7167339d13316)
[wt: note: this fix includes the two missing pieces that followed this
     patch in 2.2 (187cc82c4 and 7350448be). files are memory.{c,h};
     __pool_get_first() is the function that accesses the local cache
     in 2.0 so the atomic retrieval can only be done after this part;
     there is no needed_avg here, and __pool_free() never frees by
     itself in 2.0; the flush_lock isn't used anymore]
Signed-off-by: Willy Tarreau <w@1wt.eu>
2 files changed