BUG/MINOR: freq_ctr/threads: make use of the last updated global time

The freq counters were using the thread's own time as the start of the
current period. Under contention, this occasionally allowed non-monotonic
updates on the edge of the next second. If the thread that is ahead
updates a counter first, it causes a rotation. A second thread, still on
its older time, then loses the race, tries again, and detects a different
time again, but one in the past, so it only updates the counter. A third
thread on the new date then detects a change again and provokes yet
another rotation.

This had three effects:
- rare loss of stored values during certain transitions from one
  period to the next, causing counters to report 0
- half of the threads being forced through the slow path every second
- slow convergence with many threads, where the CAS can fail often
  and up to N(N-1) attempts can be observed for N threads to complete

This patch fixes the issue in two ways:
- first, it now makes use of the monotonic global_now value, which also
  happens to be volatile and to carry the latest known time. This way
  time never jumps backwards anymore, and only the first thread
  updates it on a transition; the others do not need to.
- second, it re-reads the time in the loop after each failure, because
  if the date changed in the counter, it means that another thread knows
  a more recent one and we need to update ours. In this case, if it
  matches the new current second, the fast path is usable.
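
The two points above can be sketched roughly as follows. This is only a
simplified illustration and not the code touched by this patch: the
demo_* names are hypothetical, the counter is packed into a single
64-bit word (period start in the upper 32 bits, event count in the lower
32 bits), and the previous-period counter is left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared monotonic date: seconds are
 * assumed to live in the upper 32 bits.
 */
static _Atomic uint64_t demo_global_now;

static inline uint32_t demo_now_sec(void)
{
	return (uint32_t)(atomic_load(&demo_global_now) >> 32);
}

/* Adds <inc> events to <ctr>. The shared date is re-read on every
 * attempt, so a thread which loses the CAS race never retries with a
 * stale second.
 */
static void demo_ctr_add(_Atomic uint64_t *ctr, uint32_t inc)
{
	uint64_t old = atomic_load(ctr);
	uint64_t next;

	for (;;) {
		uint32_t now_sec = demo_now_sec(); /* re-read after each failure */
		uint32_t sec = (uint32_t)(old >> 32);
		uint32_t cnt = (uint32_t)old;

		/* the shared date is monotonic, so it cannot be older than
		 * a date that was already stored into the counter.
		 */
		assert(now_sec >= sec);

		if (sec == now_sec)
			next = ((uint64_t)sec << 32) | (uint32_t)(cnt + inc); /* fast path */
		else
			next = ((uint64_t)now_sec << 32) | inc; /* rotation */

		/* on failure, <old> is refreshed with the current value */
		if (atomic_compare_exchange_weak(ctr, &old, next))
			return;
	}
}
```

With a per-thread date, the assertion above could fail for a thread
that is late by one second, which is the situation that led to the
repeated rotations described earlier.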

This patch relies on the previous patch "MINOR: time: export the
global_now variable" and must be backported as far as 1.8.

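
For readers of the diff, here is a small sketch of how the global_now
expressions can be read. The layout is an assumption inferred from the
expressions used in this patch (seconds in the upper 32 bits,
microseconds in the lower 32 bits), and the demo_* helpers are
hypothetical names, not existing functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: (seconds << 32) | microseconds */
static inline uint64_t demo_pack_now(uint32_t sec, uint32_t usec)
{
	assert(usec < 1000000);
	return ((uint64_t)sec << 32) | usec;
}

/* equivalent of "global_now >> 32" in the diff */
static inline uint32_t demo_unpack_sec(uint64_t g)
{
	return (uint32_t)(g >> 32);
}

static inline uint32_t demo_unpack_usec(uint64_t g)
{
	return (uint32_t)g;
}

/* equivalent of "(uint32_t)global_now / 1000" in the diff: the cast
 * binds first, so this divides only the lower 32 bits.
 */
static inline uint32_t demo_unpack_ms_in_sec(uint64_t g)
{
	return (uint32_t)g / 1000;
}
```

Under this assumed layout, the last helper yields the millisecond
offset within the current second rather than an absolute millisecond
tick.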
(cherry picked from commit a1ecbca0a50ccb273bb5cdc9f031408d782a7bcf)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>
diff --git a/src/freq_ctr.c b/src/freq_ctr.c
index ad032a3..ddd9699 100644
--- a/src/freq_ctr.c
+++ b/src/freq_ctr.c
@@ -51,7 +51,7 @@
break;
}
- age = now.tv_sec - curr_sec;
+ age = (global_now >> 32) - curr_sec;
if (unlikely(age > 1))
return 0;
@@ -94,7 +94,7 @@
break;
}
- age = now.tv_sec - curr_sec;
+ age = (global_now >> 32) - curr_sec;
if (unlikely(age > 1))
curr = 0;
else {
@@ -141,7 +141,7 @@
break;
}
- age = now.tv_sec - curr_sec;
+ age = (global_now >> 32) - curr_sec;
if (unlikely(age > 1))
curr = 0;
else {
@@ -163,7 +163,7 @@
/* Reads a frequency counter taking history into account for missing time in
* current period. The period has to be passed in number of ticks and must
* match the one used to feed the counter. The counter value is reported for
- * current date (now_ms). The return value has the same precision as one input
+ * current global date. The return value has the same precision as one input
* data sample, so low rates over the period will be inaccurate but still
* appropriate for max checking. One trick we use for low values is to specially
* handle the case where the rate is between 0 and 1 in order to avoid flapping
@@ -200,7 +200,7 @@
break;
};
- remain = curr_tick + period - now_ms;
+ remain = curr_tick + period - (uint32_t)global_now / 1000;
if (unlikely((int)remain < 0)) {
/* We're past the first period, check if we can still report a
* part of last period or if we're too far away.
@@ -247,7 +247,7 @@
break;
};
- remain = curr_tick + period - now_ms;
+ remain = curr_tick + period - (uint32_t)global_now / 1000;
if (likely((int)remain < 0)) {
/* We're past the first period, check if we can still report a
* part of last period or if we're too far away.