BUG/MEDIUM: stick-table: do not leave entries in end of window during purge

At some moments expired stick table records stop being removed. This
happens when the internal time wraps around the 32-bit limit, or every
49.7 days. What precisely happens is that some elements that are collected
close to the end of the time window (2^32 - table's "expire" setting)
might have been updated and will be requeued further, at the beginning
of the next window. Here, three bad situations happen:

  - the incorrect integer-based comparison that is not aware of wrapping
    will result in the scan to restart from the freshly requeued element,
    skipping all those at the end of the window. The net effect of this
    is that at each wakeup of the expiration task, only one element from
    the end of the window will be expired, and other ones will remain
    there for a very long time, especially if they have to wait for all
    the predecessors to be picked one at a time after slow wakeups due
    to a long expiration ; this is what was observed in issue #2034
    making the table fill up and appear as not expiring at all, and it
    seems that issue #2024 reports the same problem at the same moment
    (since such issues happen for everyone roughly at the same time
    when the clock doesn't drift too much).

  - the elements that were placed at the beginning of the next window
    are skipped as well for as long as there are refreshed entries at
    the end of the previous window, so these ones participate to filling
    the table as well. This is cause by the restart from the current,
    updated node that is generally placed after most other less recently
    updated elements.

  - once the last element at the end of the window is picked, suddenly
    there is a large amount of expired entries at the beginning of the
    next window that all have to be requeued. If the expiration delay
    is large, the number can be big and it can take a long time, which
    can very likely explain the periodic crashes reported in issue #2025.
    Limiting the batch size as done in commit dfe79251d ("BUG/MEDIUM:
    stick-table: limit the time spent purging old entries") would make
    sense for process_table_expire() as well.

This patch addresses the incorrect tree scan algorithm to make sure that:
  - there's always a next element to compare against, even when dealing
    with the last one in the tree, the first one must be used ;

  - time comparisons used to decide whether to restart from the current
    element use tick_is_lt() as it is the only case where we know the
    current element will be placed before any other one (since the tree
    respects insertion ordering for duplicates)

In order to reproduce the issue, it was found that injecting traffic on
a random key that spans over half of the size of a table whose expiration
is set to 15s while the date is going to wrap in 20s does exhibit an
increase of the table's size 5s after startup, when entries start to be
pushed to the next window. It's more effective when a second load
generator constantly hammers a same key to be certain that none of them
is ready to expire. This doesn't happen anymore after this patch.

This fix needs to be backported to all stable versions. The bug has been
there for as long as the stick tables were introduced in 1.4-dev7 with
commit 3bd697e07 ("[MEDIUM] Add stick table (persistence) management
functions and types"). A cleanup could consists in deduplicating that
code by having process_table_expire() call __stktable_trash_oldest(),
with that one improved to support an optional time check.
1 file changed