BUG/MINOR: checks: postpone the startup of health checks by the boot time
When health checks are started at boot, now_ms could be off by the boot
time. In general it's not even noticeable, but with very large configs
taking up to one or even a few seconds to start, this can result in a
part of the servers' checks being scheduled slightly in the past. As
such, all of these checks will start grouped, partially defeating the
purpose of the spread-checks setting. For example, this can cause a burst of
connections for the network, or an excess of CPU usage during SSL
handshakes, possibly even causing some timeouts to expire early.
To compensate for this, we simply add the known boot time to the
computed delay when scheduling the startup of checks. That's very
simple and particularly efficient. For example, a config with 5k servers
in 800 backends checked every 5 seconds, which was taking 3.8 seconds to
start, previously showed this distribution of health checks despite
"spread-checks 50":
3690 08:59:25
417 08:59:26
213 08:59:27
71 08:59:28
428 08:59:29
860 08:59:30
918 08:59:31
938 08:59:32
1124 08:59:33
904 08:59:34
647 08:59:35
890 08:59:36
973 08:59:37
856 08:59:38
893 08:59:39
154 08:59:40
Now with the fix it shows this:
470 08:59:59
929 09:00:00
896 09:00:01
937 09:00:02
854 09:00:03
827 09:00:04
906 09:00:05
863 09:00:06
913 09:00:07
873 09:00:08
162 09:00:09
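As a rough illustration of the arithmetic only (a standalone sketch, not the
HAProxy code: start_delay_ms() is a hypothetical helper, and it assumes all
values are plain milliseconds), the startup delay computed for each check
before and after the fix can be written as:

```c
/* Hypothetical helper mirroring the expression used in the patch:
 *   boottime + mininter * srvpos / nbcheck
 * All arguments are assumed to be in milliseconds, except srvpos and
 * nbcheck which are the check's position and the total number of checks.
 * Passing boottime = 0 gives the delay computed before the fix.
 */
static unsigned long start_delay_ms(unsigned long boottime,
                                    unsigned long mininter,
                                    unsigned long srvpos,
                                    unsigned long nbcheck)
{
	return boottime + mininter * srvpos / nbcheck;
}
```

With figures loosely based on the example above (5000 checks, a 5000 ms
interval, 3800 ms of boot time), the check at position 0 gets a delay of 0 ms
before the fix and 3800 ms after it, and the one at position 2500 gets 2500 ms
before and 6300 ms after. Every delay shorter than the boot time was already
in the past once the scheduler started, which is what grouped those checks.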
This should be backported to all supported versions. It depends on
this commit:
MINOR: clock: measure the total boot time
For 2.8 where the internal clock is now totally independent of the human
one, a more generic fix will consist in simply updating now_ms to reflect
the startup time.
(cherry picked from commit 8e978a094d24a7835790da6be6c94e38f9888026)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>
(cherry picked from commit 3c4fb9f298894bef822d12443087bd6267d31bce)
Signed-off-by: Willy Tarreau <w@1wt.eu>
(cherry picked from commit cdfc93e19c9dbc8e399a24276e8464da900c4abc)
[wt: minor ctx adjustment]
Signed-off-by: Willy Tarreau <w@1wt.eu>
diff --git a/src/check.c b/src/check.c
index 0f5fc1e..54704bd 100644
--- a/src/check.c
+++ b/src/check.c
@@ -1356,6 +1356,7 @@
 {
 	struct task *t;
 	unsigned long thread_mask = MAX_THREADS_MASK;
+	ulong boottime = tv_ms_remain(&start_date, &ready_date);
 	if (check->type == PR_O2_EXT_CHK)
 		thread_mask = 1;
@@ -1386,7 +1387,7 @@
 		mininter = global.max_spread_checks;
 	/* check this every ms */
-	t->expire = tick_add(now_ms, MS_TO_TICKS(mininter * srvpos / nbcheck));
+	t->expire = tick_add(now_ms, MS_TO_TICKS(boottime + mininter * srvpos / nbcheck));
 	check->start = now;
 	task_queue(t);