tree dfa44590386a9b0192f19b93713c974d9869b4d4
parent 522841c47bba8bdd5c3b26b2eac564e8ca066ef6
author Willy Tarreau <w@1wt.eu> 1673536121 +0100
committer Willy Tarreau <w@1wt.eu> 1673537865 +0100
encoding latin1

OPTIM: global: move byte counts out of global and make them per-thread

During multiple tests we've already noticed that shared stats counters
have become a real bottleneck under large thread counts. With QUIC it's
pretty visible, with qc_snd_buf() taking 2.5% of the CPU on a 48-thread
machine at only 25 Gbps, and this CPU is entirely spent in the atomic
increment of the byte count and byte rate. It's also visible in H1/H2
but slightly less since we're working with larger buffers, hence less
frequent updates. These counters are exclusively used to report the
byte count in "show info" and the byte rate in the stats.

Let's move them to the thread_ctx struct and make the stats reader
just collect each thread's stats when requested. That's way more
efficient than competing on a single cache line.
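
As a rough illustration of the idea (not the actual patch), the sketch
below keeps one counter per thread on its own cache line and lets the
reader sum them on demand. All names here (thread_ctx_sketch,
sketch_add_out_bytes(), sketch_total_out_bytes(), MAX_THREADS) and the
64-byte cache line size are assumptions made for this example, and it
uses plain C11 atomics rather than HAProxy's own macros.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_THREADS 64  /* assumption: illustrative upper bound */

/* Hypothetical per-thread context. The counter is aligned on 64 bytes
 * (assumed cache line size) so that each thread updates its own line.
 */
struct thread_ctx_sketch {
	_Alignas(64) _Atomic uint64_t out_bytes;
};

/* Each array element must start on its own cache line. */
static_assert(sizeof(struct thread_ctx_sketch) % 64 == 0,
              "per-thread context must fill whole cache lines");

static struct thread_ctx_sketch ctx_sketch[MAX_THREADS];

/* Hot path: only the owning thread calls this with its own <tid>, so
 * the atomic add never competes with other writers for the line.
 */
static inline void sketch_add_out_bytes(unsigned int tid, uint64_t bytes)
{
	atomic_fetch_add_explicit(&ctx_sketch[tid].out_bytes, bytes,
	                          memory_order_relaxed);
}

/* Cold path: the stats reader walks all threads and sums their
 * counters. The result is a momentary snapshot, which is enough for
 * reporting purposes.
 */
static uint64_t sketch_total_out_bytes(unsigned int nbthread)
{
	uint64_t total = 0;
	unsigned int i;

	for (i = 0; i < nbthread; i++)
		total += atomic_load_explicit(&ctx_sketch[i].out_bytes,
		                              memory_order_relaxed);
	return total;
}
```

With this layout the cost moves from every send to the occasional stats
request, which is the trade-off described above.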

After this, qc_snd_buf() has totally disappeared from the perf profile
and tests made in H1 show a roughly 1% performance increase on small
objects.
