MEDIUM: activity: collect memory allocator statistics with USE_MEMORY_PROFILING

When built with USE_MEMORY_PROFILING the main memory allocation functions
are diverted to collect statistics per caller. It is a bit tricky because
the only way to call the original ones is to find their pointer, which
requires dlsym(), and which is not available everywhere.

Thus all functions are designed to call their fallback function (the
original one), which is preset to an initialization function that is
supposed to call dlsym() to resolve the missing symbols, and vanish.
This saves expensive tests in the critical path.

A second problem is that dlsym() calls calloc() to initialize some
error messages. After plenty of tests with posix_memalign(), valloc()
and friends, it turns out that returning NULL still makes it happy.
Thus we currently use a visit counter (in_memprof) to detect if we're
reentering, in which case all allocation functions return NULL.

In order to convert a return address to an entry in the stats, we
perform a cheap hash consisting in multiplying the pointer by a
balanced number (as many zeros as ones) and keeping the middle bits.
The hash is already pretty good like this, achieving to store up to
638 entries in a 2048-entry table without collision. But in order to
further refine this and improve the fill ratio of the table, in case
of collision we move up to 16 adjacent entries to find a free place.
This remains quite cheap and manages to store all of these inside a
1024-entries hash table with even less risk of collision.

Also, free(NULL) does not produce any stats. By doing so we reduce
from 638 to 208 the average number of entries needed for a basic
config using SSL. free(NULL) not only provides no information as it's
a NOP, but keeping it is pure pollution as it happens all the time.

When DEBUG_MEM_STATS is enabled, malloc/calloc/realloc are redefined as
macros, preventing the code from compiling. Thus, when this option is
detected, the macros are undefined as they are pointless there anyway.

The functions are optimized to quickly jump to the fallback and as such
become almost invisible in terms of processing time, execpt an extra
"if" on a read_mostly variable and a jump. Considering that this only
happens for pool misses and library routines, this remains acceptable.

Performance tests in SSL (the most stressful test) shows less than 1%
performance loss when profiling is enabled on 2c4t.

The code was written in a way to ease backporting to modern versions
(2.2+) if needed, so it keeps the long names for integers and doesn't
use the _INC version of the atomic ops.
1 file changed