OPTIM: global: move byte counts out of global and make them per-thread
During multiple tests we've already noticed that shared stats counters
have become a real bottleneck under large thread counts. With QUIC it's
pretty visible, with qc_snd_buf() taking 2.5% of the CPU on a 48-thread
machine at only 25 Gbps, and this CPU time is entirely spent in the atomic
updates of the byte count and byte rate. It's also visible in H1/H2
but slightly less since we're working with larger buffers, hence less
frequent updates. These counters are exclusively used to report the
byte count in "show info" and the byte rate in the stats.
Let's move them to the thread_ctx struct and make the stats reader
just collect each thread's stats when requested. That's way more
efficient than competing on a single cache line.
After this, qc_snd_buf() has totally disappeared from the perf profile
and tests made in H1 show a roughly 1% performance increase on small
objects.
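For illustration, the per-thread counting pattern described above can be
sketched roughly as follows. This is a standalone approximation using C11
atomics and pthreads, not the actual HAProxy code: demo_thread_ctx,
demo_account_out(), demo_total_out() and the thread count are hypothetical
names and values chosen for the example, and the byte-rate counter is left
out.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_THREADS 4   /* hypothetical thread count for the sketch */
#define DEMO_CACHELINE   64  /* assumed cache line size */

/* Hypothetical stand-in for a per-thread context: each counter sits on
 * its own cache line so that writers never share a line.
 */
struct demo_thread_ctx {
	_Alignas(DEMO_CACHELINE) _Atomic uint64_t out_bytes;
};

static struct demo_thread_ctx demo_ctx[DEMO_MAX_THREADS];

/* Hot path: a thread only ever touches its own counter. The add stays
 * atomic so that a concurrent reader gets a consistent value, but no
 * other writer competes for this cache line.
 */
static void demo_account_out(int tid, uint64_t bytes)
{
	atomic_fetch_add_explicit(&demo_ctx[tid].out_bytes, bytes,
	                          memory_order_relaxed);
}

/* Cold path (stats reader): sum all per-thread counters on demand. */
static uint64_t demo_total_out(void)
{
	uint64_t total = 0;

	for (int i = 0; i < DEMO_MAX_THREADS; i++)
		total += atomic_load_explicit(&demo_ctx[i].out_bytes,
		                              memory_order_relaxed);
	return total;
}

/* Each worker accounts for 100000 sends of 1500 bytes. */
static void *demo_worker(void *arg)
{
	int tid = (int)(intptr_t)arg;

	for (int i = 0; i < 100000; i++)
		demo_account_out(tid, 1500);
	return NULL;
}

/* Starts the workers, waits for them, and returns the aggregated total. */
static uint64_t demo_run(void)
{
	pthread_t th[DEMO_MAX_THREADS];

	for (int i = 0; i < DEMO_MAX_THREADS; i++)
		pthread_create(&th[i], NULL, demo_worker, (void *)(intptr_t)i);
	for (int i = 0; i < DEMO_MAX_THREADS; i++)
		pthread_join(th[i], NULL);
	return demo_total_out();
}
```

The trade-off is that the reader now walks one counter per thread, which
is acceptable because reads only happen when stats are requested while
updates happen on every send.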
diff --git a/src/quic_sock.c b/src/quic_sock.c
index 9d5c5be..ba8d36f 100644
--- a/src/quic_sock.c
+++ b/src/quic_sock.c
@@ -552,8 +552,8 @@
* The reason for the latter is that freq_ctr are limited to 4GB and
* that it's not enough per second.
*/
- _HA_ATOMIC_ADD(&global.out_bytes, ret);
- update_freq_ctr(&global.out_32bps, (ret + 16) / 32);
+ _HA_ATOMIC_ADD(&th_ctx->out_bytes, ret);
+ update_freq_ctr(&th_ctx->out_32bps, (ret + 16) / 32);
return 0;
}