dfe79251dab0346fe4c71e7e3da7bb87d41c880c - haproxy

commit	dfe79251dab0346fe4c71e7e3da7bb87d41c880c
author	Willy Tarreau <w@1wt.eu>	Tue Nov 03 17:47:41 2020 +0100
committer	Willy Tarreau <w@1wt.eu>	Tue Nov 03 18:02:42 2020 +0100
tree	bec35a7249ca1f854f9288127aa412e42ce0b19c
parent	e6ee820c07a503f88c767a6800c2ea4e2f346012

BUG/MEDIUM: stick-table: limit the time spent purging old entries

An interesting case was reported with threads and moderately sized stick-tables. Sometimes the watchdog would trigger during the purge. It turns out that the stick tables were sized in the 10s of K entries which is the order of magnitude of the possible number of connections, and that threads were used over distinct NUMA nodes. While at first glance nothing looks problematic there, actually there is a risk that a thread trying to purge the table faces 100% of entries still in use by a connection with (ts->ref_cnt > 0), and ends up scanning the whole table, while other threads on the other NUMA node are causing the cache lines to bounce back and forth and considerably slow down its progress to the point of possibly spending hundreds of milliseconds there, multiplied by the number of queued threads all failing on the same point.

Interestingly, smaller tables would not trigger it because the scan would be faster, and larger ones would not trigger it because plenty of entries would be idle!

The most efficient solution is to increase the table size to be large enough for this never to happen, but this is not reliable. We could have a parallel list of idle entries but that would significantly increase the storage and processing cost only to improve a few rare corner cases.

This patch takes a more pragmatic approach, it considers that it will not visit more than twice the number of nodes to be deleted, which means that it accepts to fail up to 50% of the time. Given that very small batches are programmed each time (1/256 of the table size), this means the operation will finish quickly (128 times faster than now), and will reduce the inter-thread contention. If this needs to be reconsidered, it will probably mean that the batch size needs to be fixed differently.

This needs to be backported to stable releases which extensively use threads, typically 2.0.

Kudos to Nenad Merdanovic for figuring the root cause triggering this!
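
The bounded scan described above can be illustrated with a minimal sketch. This is not the code from the patch: `struct entry`, `purge_bounded()` and the flat array are hypothetical simplifications made up for this example, and only the `ref_cnt > 0` test and the "visit at most twice the batch" rule come from the message. With a batch of 1/256 of the table, the visit budget is 1/128 of the table, which is where the "128 times faster" worst case comes from.

```c
#include <stddef.h>

/* Hypothetical, simplified table entry (an assumption for this sketch,
 * not HAProxy's real stick-table session structure).
 */
struct entry {
    int ref_cnt;  /* > 0 while a connection still references the entry */
    int present;  /* 1 while the entry is stored in the table */
};

/* Try to purge up to <to_batch> idle entries from <tab>, but never visit
 * more than 2 * <to_batch> entries. Returns the number of entries purged
 * and stores the number of entries visited into <*visited>.
 */
static size_t purge_bounded(struct entry *tab, size_t size,
                            size_t to_batch, size_t *visited)
{
    size_t max_search = to_batch * 2; /* visit budget */
    size_t purged = 0;
    size_t i;

    *visited = 0;
    for (i = 0; i < size && purged < to_batch; i++) {
        if (max_search == 0)
            break;            /* budget exhausted: give up for this round */
        max_search--;
        (*visited)++;

        if (!tab[i].present || tab[i].ref_cnt > 0)
            continue;         /* absent or still in use: skip it */

        tab[i].present = 0;   /* idle entry: purge it */
        purged++;
    }
    return purged;
}
```

For a 1024-entry table where every entry is still referenced, a batch of 1024 / 256 = 4 stops after 8 visits instead of scanning all 1024 entries, at the cost of purging nothing in that round.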