BUG/MEDIUM: stick-table: limit the time spent purging old entries
An interesting case was reported with threads and moderately sized
stick-tables. Sometimes the watchdog would trigger during the purge.
It turns out that the stick tables were sized in the 10s of K entries
which is the order of magnitude of the possible number of connections,
and that threads were used over distinct NUMA nodes. While at first
glance nothing looks problematic there, actually there is a risk that
a thread trying to purge the table faces 100% of entries still in use
by a connection with (ts->ref_cnt > 0), and ends up scanning the whole
table, while other threads on the other NUMA node are causing the
cache lines to bounce back and forth and considerably slow down its
progress to the point of possibly spending hundreds of milliseconds
there, multiplied by the number of queued threads all failing on the
same point.
Interestingly, smaller tables would not trigger it because the scan
would be faster, and larger ones would not trigger it because plenty
of entries would be idle!
The most efficient solution is to increase the table size to be large
enough for this never to happen, but this is not reliable. We could
have a parallel list of idle entries but that would significantly
increase the storage and processing cost only to improve a few rare
corner cases.
This patch takes a more pragmatic approach, it considers that it will
not visit more than twice the number of nodes to be deleted, which
means that it accepts to fail up to 50% of the time. Given that very
small batches are programmed each time (1/256 of the table size), this
means the operation will finish quickly (128 times faster than now),
and will reduce the inter-thread contention. If this needs to be
reconsidered, it will probably mean that the batch size needs to be
fixed differently.
This needs to be backported to stable releases which extensively use
threads, typically 2.0.
Kudos to Nenad Merdanovic for figuring the root cause triggering this!
1 file changed