2022-01-04 - Pools structure and API

1. Background
-------------

Memory allocation is a complex problem covered by a massive amount of
literature. Memory allocators found in the field cover a broad spectrum of
capabilities, performance, fragmentation, efficiency, etc.

The main difficulty of memory allocation comes from finding the optimal chunks
for arbitrary-sized requests while still preserving a low fragmentation
level. Doing this well is often expensive in CPU usage and/or memory usage.

In programs like HAProxy that deal with a large number of fixed-size objects,
there is no point in enduring all this risk of fragmentation, and the
associated costs (sometimes up to several milliseconds with certain minimalist
allocators) are simply not acceptable. A better approach consists in grouping
frequently used objects by size, knowing that due to the high repetitiveness of
operations, a freed object will immediately be needed for another operation.

This grouping of objects by size is what is called a pool. Pools are created
for certain frequently allocated objects, are usually merged together when they
are of the same size (or almost the same size), and significantly reduce the
number of calls to the memory allocator.

With the arrival of threads, pools started to become a bottleneck, so they now
implement an optional thread-local lockless cache. Finally, with the arrival of
really efficient memory allocators in modern operating systems, the shared part
has also become optional so that it doesn't consume memory if it does not bring
any value.


2. Principles
-------------

The pools architecture is selected at build time. The main options are:

  - thread-local caches and process-wide shared pool enabled (1)

    This is the default situation on most operating systems. Each thread has
    its own local cache, and when depleted it refills from the process-wide
    pool that avoids calling the standard allocator too often. It is possible
    to force this mode at build time by setting CONFIG_HAP_GLOBAL_POOLS.

  - thread-local caches only are enabled (2)

    This is the situation on operating systems where a fast and modern memory
    allocator is detected and when it is estimated that the process-wide shared
    pool will not bring any benefit. This detection is automatic at build time,
    but may also be forced at build time by setting CONFIG_HAP_NO_GLOBAL_POOLS.

  - pass-through to the standard allocator (3)

    This is used when one absolutely wants to disable pools and rely on regular
    malloc() and free() calls, essentially in order to trace memory allocations
    by call points, either internally via DEBUG_MEM_STATS, or externally via
    tools such as Valgrind. This mode of operation may be forced at build time
    by setting DEBUG_NO_POOLS.

  - pass-through to an mmap-based allocator for debugging (4)

    This is used only during deep debugging when trying to detect various
    conditions such as use-after-free. In this case each allocated object's
    size is rounded up to a multiple of a page size (4096 bytes) and an
    integral number of pages is allocated for each object using mmap(),
    surrounded by two inaccessible holes that aim to detect some out-of-bounds
    accesses. Released objects are instantly freed using munmap() so that any
    immediate subsequent access to the memory area crashes the process if the
    area had not been reallocated yet. This mode can be enabled at build time
    by setting DEBUG_UAF. It tends to consume a lot of memory, does not scale
    at all with concurrent calls, and tends to make the system stall. The
    watchdog may even trigger on some slow allocations.

There are no more provisions for running with a shared pool but no thread-local
cache: the shared pool's main goal is to compensate for the expensive calls to
the memory allocator. This gain may be huge on tiny systems using basic
allocators, but the thread-local cache will already achieve this. And on larger
threaded systems, the shared pool's benefit is visible when the underlying
allocator scales poorly, but in this case the shared pool would suffer from
the same limitations without its thread-local cache and wouldn't provide any
benefit.

Summary of the various operation modes:

                   (1)          (2)          (3)          (4)

                  User         User         User         User
                    |            |            |            |
   pool_alloc()     V            V            |            |
               +---------+  +---------+      |            |
               | Thread  |  | Thread  |      |            |
               | Local   |  | Local   |      |            |
               | Cache   |  | Cache   |      |            |
               +---------+  +---------+      |            |
                    |            |            |            |
   pool_refill*()   V            |            |            |
               +---------+      |            |            |
               | Shared  |      |            |            |
               |  Pool   |      |            |            |
               +---------+      |            |            |
                    |            |            |            |
   malloc()         V            V            V            |
               +---------+  +---------+  +---------+      |
               | Library |  | Library |  | Library |      |
               +---------+  +---------+  +---------+      |
                    |            |            |            |
   mmap()           V            V            V            V
               +---------+  +---------+  +---------+  +---------+
               |   OS    |  |   OS    |  |   OS    |  |   OS    |
               +---------+  +---------+  +---------+  +---------+

One extra build define, DEBUG_FAIL_ALLOC, is used to enforce random allocation
failure in pool_alloc() by randomly returning NULL, to test that callers
properly handle allocation failures. In this case the desired average rate of
allocation failures can be set via the global setting "tune.fail-alloc",
expressed in percent.

The thread-local caches contain the freshest objects, whose total size amounts
to CONFIG_HAP_POOL_CACHE_SIZE bytes, which was typically 1 MB before 2.6 and
is 512 kB since. The aim is to keep hot objects that still fit in the CPU
core's private L2 cache. Once these objects no longer fit into the cache,
there is no benefit in keeping them local to the thread, so they'd rather be
returned to the shared pool or the main allocator so that any other thread may
make use of them.


3. Storage in thread-local caches
---------------------------------

This section describes how objects are linked in thread-local caches. This is
not meant to be a concern for users of the pools API but it can be useful when
inspecting post-mortem dumps or when trying to figure out certain size
constraints.

Objects are stored in the local cache using a doubly-linked list. This ensures
that they can be visited in freshness order like a stack, while at the same
time being able to access them from oldest to newest when it is needed to
evict the coldest ones first:

  - releasing an object to the cache always puts it on the top.

  - allocating an object from the cache always takes the topmost one, hence
    the freshest one.

  - scanning for older objects to evict starts from the bottom, where the
    oldest ones are located.

To that end, each thread-local cache keeps a list head in the "list" member of
its "pool_cache_head" descriptor, that links all objects cast to type
"pool_cache_item" via their "by_pool" member.

Note that the mechanism described above only works for a single pool. When
trying to limit the total cache size to a certain value, all pools included,
there is also a need to arrange all objects from all pools together in the
local caches. For this, each thread_ctx maintains a list head of recently
released objects, all pools included, in its member "pool_lru_head". All items
in a thread-local cache are linked there via their "by_lru" member.

This means that releasing an object using pool_free() consists in inserting
it at the beginning of two lists:
  - the local pool_cache_head's "list" list head
  - the thread context's "pool_lru_head" list head

Allocating an object consists in picking the first entry from the pool's
"list" and deleting its "by_pool" and "by_lru" links.

Evicting an object consists in scanning the thread context's "pool_lru_head"
backwards and deleting the object's "by_pool" and "by_lru" links.

Given that entries are both inserted and removed synchronously, we have the
guarantee that the oldest object in the thread's LRU list is always the oldest
object in its pool, and that the next element is the cache's list head. This
is what allows the LRU eviction mechanism to figure out what pool an object
belongs to when releasing it.

Note:
  | Since a pool_cache_item has two list entries, on 64-bit systems it will be
  | 32 bytes long. This is the smallest size that a pool may be, and any
  | smaller size will automatically be rounded up to this size.

When build option DEBUG_POOL_INTEGRITY is set, the area of the object between
the two list elements and the end according to pool->size will be filled with
pseudo-random words during pool_put_to_cache(), and these words will be
compared between each other during pool_get_from_cache(). The process will
crash if any bit differs, as this would indicate that the memory area was
modified after the free. The pseudo-random pattern is in fact incremented by
(~0)/3 upon each free so that roughly half of the bits change each time and we
maximize the likelihood of detecting a single bit flip in either direction. In
order to avoid an immediate reuse and maximize the time the object spends in
the cache, when this option is set, objects are picked from the cache starting
from the oldest one instead of the freshest one. This way even late memory
corruptions have a chance to be detected.

When build option DEBUG_MEMORY_POOLS is set, pool objects are allocated with
one extra pointer compared to the requested size, so that the bytes that follow
the memory area point to the pool descriptor itself as long as the object is
allocated via pool_alloc(). Upon releasing via pool_free(), the pointer is
compared and the code will crash if it differs. This allows detection of both
memory overflows and objects released to the wrong pool (typically a code bug
resulting from a copy-paste error).

Thus an object will look like this depending on whether it's in the cache or
is currently in use:

         in cache               in use
      +------------+        +------------+
   <--+ by_pool.p  |        |  N bytes   |
      | by_pool.n  +-->     |            |
      +------------+        | N=16 min on|
   <--+ by_lru.p   |        |  32-bit,   |
      | by_lru.n   +-->     |  32 min on |
      +------------+        |   64-bit   |
      :            :        :            :
      |  N bytes   |        |            |
      +------------+        +------------+  \ optional, only if
      :  (unused)  :        :  pool ptr  :   > DEBUG_MEMORY_POOLS
      +------------+        +------------+  / is set at build time

Right now no provisions are made to return objects aligned on larger boundaries
than those currently covered by malloc() (i.e. two pointers). This need appears
from time to time and the layout above might evolve a little bit if needed.


4. Storage in the process-wide shared pool
------------------------------------------

In order for the shared pool not to be a contention point in a multi-threaded
environment, objects are allocated from or released to shared pools by clusters
of a few objects at once. The maximum number of objects that may be moved to or
from a shared pool at once is defined by CONFIG_HAP_POOL_CLUSTER_SIZE at build
time, and currently defaults to 8.

In order to remain scalable, the shared pool has to make some tradeoffs to
limit the number of atomic operations and the duration of any locked operation.
As such, it's composed of a singly-linked list of clusters, themselves made of
a singly-linked list of objects.

Clusters and objects are of the same type "pool_item" and are accessed from the
pool's "free_list" member. This member points to the latest pool_item inserted
into the pool by a release operation. The pool_item's "next" member points
to the next pool_item, which was the one present in the pool's free_list just
before this pool_item was inserted, and the last pool_item in the list simply
has a NULL "next" field.

The pool_item's "down" pointer points down to the next object of the same
cluster, which will be released or allocated at the same time as the first one.
Each of these items also has a NULL "next" field, and they are chained by their
respective "down" pointers until the last one is detected by a NULL value.

This results in the following layout:

        pool        pool_item    pool_item    pool_item
    +-----------+    +------+     +------+     +------+
    | free_list +--> | next +---> | next +---> | NULL |
    +-----------+    +------+     +------+     +------+
                     | down |     | NULL |     | down |
                     +--+---+     +------+     +--+---+
                        |                         |
                        V                         V
                     +------+                  +------+
                     | NULL |                  | NULL |
                     +------+                  +------+
                     | down |                  | NULL |
                     +--+---+                  +------+
                        |
                        V
                     +------+
                     | NULL |
                     +------+
                     | NULL |
                     +------+

Allocating an entry is only a matter of performing two atomic operations on
the free_list and reading the first pool_item's "next" value:

  - atomically mark the free_list as being updated by writing a "magic" pointer
  - read the first pool_item's "next" field
  - atomically replace the free_list with this value

This results in a fast operation that instantly retrieves a cluster at once.
Then outside of the critical section entries are walked over and inserted into
the local cache one at a time. In order to keep the code simple and efficient,
objects allocated from the shared pool are all placed into the local cache, and
only then the first one is allocated from the cache. This operation is
performed by the dedicated function pool_refill_local_from_shared(), which is
called from pool_get_from_cache() when the cache is empty. It means there is an
overhead of two list insert/delete operations for the first object; this could
be avoided at the expense of more complex code in the fast path, but it is
negligible since it only concerns objects that need to be visited anyway.

Freeing a group of objects consists in performing the operation the other way
around:

  - atomically mark the free_list as being updated by writing a "magic" pointer
  - write the free_list value to the to-be-released item's "next" entry
  - atomically replace the free_list with the pool_item's pointer

The cluster simply has to be prepared before being sent to the shared pool.
The operation of releasing a cluster at once is performed by the function
pool_put_to_shared_cache(), which is called from pool_evict_last_items(), which
itself is responsible for building the clusters.

Due to the way objects are stored, it is important to try to group objects as
much as possible when releasing them, because this is what will condition their
retrieval as groups as well. This is the reason why pool_evict_last_items()
uses the LRU to find a first entry but tries to pick several items at once from
a single cache. Tests have shown that CONFIG_HAP_POOL_CLUSTER_SIZE set to 8
achieves up to 6-6.5 objects on average per operation, which effectively
divides the average time spent per object by each thread by as much and pushes
the contention point further.

Also, grouping items in clusters is a property of the process-wide shared pool
and not of the thread-local caches. This means that there is no grouped
operation when not using the shared pool (mode "2" in the diagram above).


5. API
------

The following functions are public and available for user code:

struct pool_head *create_pool(char *name, uint size, uint flags)
        Create a new pool named <name> for objects of size <size> bytes. Pool
        names are truncated to their first 11 characters. Pools of very similar
        size will usually be merged if both have set the flag MEM_F_SHARED in
        <flags>. When DEBUG_DONT_SHARE_POOLS was set at build time, the pools
        also need to have the exact same name to be merged. In addition, unless
        MEM_F_EXACT is set in <flags>, the object size will usually be rounded
        up to the size of pointers (16 or 32 bytes). The name that will appear
        in the pool upon merging is the name of the first created pool. The
        returned pointer is the new (or reused) pool head, or NULL upon error.
        Pools created this way must be destroyed using pool_destroy().

void *pool_destroy(struct pool_head *pool)
        Destroy pool <pool>, that is, all of its unused objects are freed and
        the structure is freed as well if the pool didn't have any used objects
        anymore. In this case NULL is returned. If some objects remain in use,
        the pool is preserved and its pointer is returned. This ought to be
        used essentially on exit or in rare situations where some internal
        entities that hold pools have to be destroyed.

void pool_destroy_all(void)
        Destroy all pools, without checking which ones still have used entries.
        This is only meant for use on exit.

void *__pool_alloc(struct pool_head *pool, uint flags)
        Allocate an entry from the pool <pool>. The allocator will first look
        for an object in the thread-local cache if enabled, then in the shared
        pool if enabled, then will fall back to the operating system's default
        allocator. NULL is returned if the object couldn't be allocated (due to
        configured limits or lack of memory). Objects allocated this way have
        to be released using pool_free(). Like with malloc(), by default the
        contents of the returned object are undefined. If memory poisoning is
        enabled, the object will be filled with the poisoning byte. If the
        global "tune.fail-alloc" setting is non-zero and DEBUG_FAIL_ALLOC is
        enabled, a random number generator will be called to randomly return a
        NULL. The allocator's behavior may be adjusted using a few flags passed
        in <flags>:
          - POOL_F_NO_POISON : when set, disables memory poisoning (e.g. when
            pointless and expensive, like for buffers)
          - POOL_F_MUST_ZERO : when set, the memory area will be zeroed before
            being returned, similar to what calloc() does
          - POOL_F_NO_FAIL : when set, disables the random allocation failure,
            e.g. for use during early init code or critical sections

void *pool_alloc(struct pool_head *pool)
        This is an exact equivalent of __pool_alloc(pool, 0). It is the regular
        way to allocate entries from a pool.

void *pool_alloc_nocache(struct pool_head *pool)
        Allocate an entry from the pool <pool>, bypassing the cache. If shared
        pools are enabled, they will be consulted first. Otherwise the object
        is allocated using the operating system's default allocator. This is
        essentially used during early boot to pre-allocate a number of objects
        for pools which require a minimum number of entries to exist.

void *pool_zalloc(struct pool_head *pool)
        This is an exact equivalent of __pool_alloc(pool, POOL_F_MUST_ZERO).

void pool_free(struct pool_head *pool, void *ptr)
        Free an entry allocated from one of the pool_alloc() functions above
        from pool <pool>. The object will be placed into the thread-local cache
        if enabled, or in the shared pool if enabled, or will be released using
        the operating system's default allocator. When memory poisoning is
        enabled, the area will be overwritten before being released. This can
        sometimes help detect use-after-free conditions. When a local cache is
        enabled, if the local cache size becomes larger than 75% of the maximum
        size configured at build time, some objects will be evicted to the
        shared pool. Such objects are taken first from the same pool, but if
        the total size is really huge, other pools might be checked as well.
        Some checks enabled at build time will make the process immediately
        crash if the object was not allocated from this pool or experienced an
        overflow or some memory corruption.

void pool_flush(struct pool_head *pool)
        Free all unused objects from shared pool <pool>. Thread-local caches
        are not affected. This is essentially used when running low on memory
        or when stopping, in order to release a maximum amount of memory for
        the new process.

void pool_gc(struct pool_head *pool)
        Free all unused objects from all pools, but respecting the minimum
        number of spare objects required for each of them. Then, for operating
        systems which support it, indicate to the system that all unused memory
        can be released. Thread-local caches are not affected. This operation
        differs from pool_flush() in that it is run locklessly, under thread
        isolation, and on all pools in a row. It is called by the SIGQUIT
        signal handler and upon exit. Note that the obsolete argument <pool> is
        not used and the convention is to pass NULL there.

void dump_pools_to_trash(void)
        Dump the current status of all pools into the trash buffer. This is
        essentially used by the "show pools" CLI command or the SIGQUIT signal
        handler to dump them on stderr. The total report size may not exceed
        the size of the trash buffer. If it does, some entries will be missing.

void dump_pools(void)
        Dump the current status of all pools to stderr. This just calls
        dump_pools_to_trash() and writes the trash to stderr.

int pool_total_failures(void)
        Report the total number of failed allocations. This is solely used to
        report the "PoolFailed" metrics of the "show info" output. The total
        is calculated on the fly by summing the number of failures in all pools
        and is only meant to be used as an indicator rather than a precise
        measure.

ulong pool_total_allocated(void)
        Report the total number of bytes allocated in all pools, for reporting
        in the "PoolAlloc_MB" field of the "show info" output. The total is
        calculated on the fly by summing the number of allocated bytes in all
        pools and is only meant to be used as an indicator rather than a
        precise measure.

ulong pool_total_used(void)
        Report the total number of bytes used in all pools, for reporting in
        the "PoolUsed_MB" field of the "show info" output. The total is
        calculated on the fly by summing the number of used bytes in all pools
        and is only meant to be used as an indicator rather than a precise
        measure. Note that objects present in caches are accounted as used.

Some other functions exist and are only used by the pools code itself. While
not strictly forbidden to use outside of this code, it is generally recommended
to avoid touching them in order not to create undesired dependencies that will
complicate maintenance.

A few macros exist to ease the declaration of pools:

DECLARE_POOL(ptr, name, size)
        Placed at the top level of a file, this declares a global memory pool
        as variable <ptr>, with name <name> and <size> bytes per element. This
        is made via a call to REGISTER_POOL() and by assigning the resulting
        pointer to variable <ptr>. <ptr> will be created of type "struct
        pool_head *". If the pool needs to be visible outside of the function
        (which is likely), it will also need to be declared somewhere as
        "extern struct pool_head *<ptr>;". It is recommended to place such
        declarations very early in the source file so that the variable is
        already known to all subsequent functions which may use it.

DECLARE_STATIC_POOL(ptr, name, size)
        Placed at the top level of a file, this declares a static memory pool
        as variable <ptr>, with name <name> and <size> bytes per element. This
        is made via a call to REGISTER_POOL() and by assigning the resulting
        pointer to local variable <ptr>. <ptr> will be created of type "static
        struct pool_head *". It is recommended to place such declarations very
        early in the source file so that the variable is already known to all
        subsequent functions which may use it.


6. Build options
----------------

A number of build-time defines allow the pools behavior to be tuned. All of
them have to be enabled using "-Dxxx" or "-Dxxx=yyy" in the makefile's DEBUG
variable.

DEBUG_NO_POOLS
        When this is set, pools are entirely disabled, and allocations are made
        using malloc() instead. This is not recommended for production but may
        be useful for tracing allocations.

DEBUG_MEMORY_POOLS
        When this is set, an extra pointer is allocated at the end of each
        object to reference the pool the object was allocated from and detect
        buffer overflows. Then, pool_free() will provoke a crash in case it
        detects an anomaly (pointer at the end not matching the pool).

DEBUG_FAIL_ALLOC
        When enabled, a global setting "tune.fail-alloc" may be set to a non-
        zero value representing a percentage of memory allocations that will be
        made to fail in order to stress the calling code.

DEBUG_DONT_SHARE_POOLS
        When enabled, pools of similar sizes are not merged unless they have
        the exact same name.

DEBUG_UAF
        When enabled, pools are disabled and all allocations and releases pass
        through mmap() and munmap(). The memory usage significantly inflates
        and the performance degrades, but this allows detection of a lot of
        use-after-free conditions by crashing the program at the first abnormal
        access. This should not be used in production.

DEBUG_POOL_INTEGRITY
        When enabled, objects picked from the cache are checked for corruption
        by comparing their contents against a pattern that was placed when they
        were inserted into the cache. Objects are also allocated in the reverse
        order, from the oldest one to the most recent, so as to maximize the
        ability to detect such a corruption. The goal is to detect writes after
        free (or possibly hardware memory corruptions). Contrary to DEBUG_UAF
        this cannot detect reads after free, but it may possibly detect later
        corruptions and will not consume extra memory. The CPU usage will
        increase a bit due to the cost of filling/checking the area and the
        preference for cold cache objects instead of hot ones, though not as
        much as with DEBUG_UAF. This option is meant to be usable in
        production.

DEBUG_POOL_TRACING
        When enabled, the callers of pool_alloc() and pool_free() will be
        recorded into an extra memory area placed after the end of the object.
        This may only be required by developers who want to get a few more
        hints about code paths involved in some crashes, but will serve no
        purpose outside of this. It remains compatible with (and complements
        well) DEBUG_POOL_INTEGRITY above. Such information becomes meaningless
        once the objects leave the thread-local cache.

DEBUG_MEM_STATS
        When enabled, all malloc/calloc/realloc/strdup/free calls are accounted
        for per call place (file+line number), and may be displayed or reset on
        the CLI using "debug dev memstats". This is essentially used to detect
        potential leaks or abnormal usages. When pools are enabled (default),
        such calls are rare and the output will mostly contain calls induced by
        libraries. When pools are disabled, nearly all calls to pool_alloc()
        and pool_free() will also appear since they will be remapped to
        standard functions.

CONFIG_HAP_GLOBAL_POOLS
        When enabled, process-wide shared pools will be forcefully enabled even
        if not considered useful on the platform. The default is to let haproxy
        decide based on the OS and C library.

CONFIG_HAP_NO_GLOBAL_POOLS
        When enabled, process-wide shared pools will be forcefully disabled
        even if considered useful on the platform. The default is to let
        haproxy decide based on the OS and C library.

CONFIG_HAP_POOL_CACHE_SIZE
        This allows one to define the size of the per-thread cache, in bytes.
        The default value is 512 kB (524288). Smaller values will use less
        memory at the expense of a possibly higher CPU usage when using many
        threads. Higher values will give diminishing returns on performance
        while using much more memory. Usually there is no benefit in using
        more than a per-core L2 cache size. It would be better not to set this
        value lower than a few times the size of a buffer (bufsize, defaults to
        16 kB).

CONFIG_HAP_POOL_CLUSTER_SIZE
        This allows one to define the maximum number of objects that will be
        grouped together in an allocation from the shared pool. Values 4 to 8
        have experimentally shown good results with 16 threads. On systems with
        more cores or loosely coupled caches exhibiting slow atomic operations,
        it could possibly make sense to slightly increase this value.