Willy Tarreau | b64ef3e | 2022-01-11 14:48:20 +0100 | [diff] [blame] | 1 | 2022-01-04 - Pools structure and API |
| 2 | |
1. Background
-------------

Memory allocation is a complex problem covered by a massive amount of
literature. Memory allocators found in the field cover a broad spectrum of
capabilities, performance, fragmentation, efficiency etc.

The main difficulty of memory allocation comes from finding the optimal chunks
for arbitrary sized requests, that will still preserve a low fragmentation
level. Doing this well is often expensive in CPU usage and/or memory usage.

In programs like HAProxy that deal with a large number of fixed size objects,
there is no point having to endure all this risk of fragmentation, and the
associated costs (sometimes up to several milliseconds with certain minimalist
allocators) are simply not acceptable. A better approach consists in grouping
frequently used objects by size, knowing that due to the high repetitiveness of
operations, a freed object will immediately be needed for another operation.

This grouping of objects by size is what is called a pool. Pools are created
for certain frequently allocated objects, are usually merged together when they
are of the same size (or almost the same size), and significantly reduce the
number of calls to the memory allocator.

With the arrival of threads, pools started to become a bottleneck so they now
implement an optional thread-local lockless cache. Finally with the arrival of
really efficient memory allocators in modern operating systems, the shared part
has also become optional so that it doesn't consume memory if it does not bring
any value.


2. Principles
-------------

The pools architecture is selected at build time. The main options are:

  - thread-local caches and process-wide shared pool enabled (1)

    This is the default situation on most operating systems. Each thread has
    its own local cache, and when depleted it refills from the process-wide
    pool that avoids calling the standard allocator too often. It is possible
    to force this mode at build time by setting CONFIG_HAP_GLOBAL_POOLS.

  - thread-local caches only are enabled (2)

    This is the situation on operating systems where a fast and modern memory
    allocator is detected and when it is estimated that the process-wide shared
    pool will not bring any benefit. This detection is automatic at build time,
    but may also be forced at build time by setting CONFIG_HAP_NO_GLOBAL_POOLS.

  - pass-through to the standard allocator (3)

    This is used when one absolutely wants to disable pools and rely on regular
    malloc() and free() calls, essentially in order to trace memory allocations
    by call points, either internally via DEBUG_MEM_STATS, or externally via
    tools such as Valgrind. This mode of operation may be forced at build time
    by setting DEBUG_NO_POOLS.

  - pass-through to an mmap-based allocator for debugging (4)

    This is used only during deep debugging when trying to detect various
    conditions such as use-after-free. In this case each allocated object's
    size is rounded up to a multiple of a page size (4096 bytes) and an
    integral number of pages is allocated for each object using mmap(),
    surrounded by two inaccessible holes that aim to detect some out-of-bounds
    accesses. Released objects are instantly freed using munmap() so that any
    immediate subsequent access to the memory area crashes the process if the
    area had not been reallocated yet. This mode can be enabled at build time
    by setting DEBUG_UAF. It tends to consume a lot of memory and not to scale
    at all with concurrent calls, which tends to make the system stall. The
    watchdog may even trigger on some slow allocations.
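
As an illustration of mode (4), here is a minimal sketch of a page-granular
allocator with an inaccessible page on each side of the object. It is not the
HAProxy implementation: the names uaf_alloc() and uaf_free() are made up for
this example, the page size is queried with sysconf() instead of being fixed to
4096, and a POSIX system providing MAP_ANONYMOUS is assumed.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1        /* exposes MAP_ANONYMOUS with glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns <size> bytes rounded up to whole pages and
 * surrounded by two PROT_NONE pages, or NULL on failure.
 */
static void *uaf_alloc(size_t size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len  = (size + page - 1) / page * page;   /* round up to pages */
    char *base;

    assert(len >= size);                             /* no rounding overflow */

    /* reserve hole + payload + hole, everything inaccessible at first */
    base = mmap(NULL, len + 2 * page, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* only the payload becomes usable; both neighbours remain holes */
    if (mprotect(base + page, len, PROT_READ | PROT_WRITE) != 0) {
        munmap(base, len + 2 * page);
        return NULL;
    }
    return base + page;
}

/* Hypothetical helper: unmaps the object at once, so that any later access
 * faults as long as the area has not been mapped again.
 */
static void uaf_free(void *ptr, size_t size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len  = (size + page - 1) / page * page;

    if (ptr)
        munmap((char *)ptr - page, len + 2 * page);
}
```

Each call costs at least two system calls and three pages, which gives an idea
of why this mode consumes a lot of memory and scales poorly.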

There is no longer any provision for running with a shared pool but no
thread-local cache: the shared pool's main goal is to compensate for the
expensive calls to the memory allocator. This gain may be huge on tiny systems
using basic allocators, but the thread-local cache will already achieve this.
And on larger threaded systems, the shared pool's benefit is visible when the
underlying allocator scales poorly, but in this case the shared pool would
suffer from the same limitations without its thread-local cache and wouldn't
provide any benefit.

Summary of the various operation modes:

                    (1)          (2)          (3)          (4)

                   User         User         User         User
                     |            |            |            |
       pool_alloc()  V            V            |            |
                +---------+  +---------+       |            |
                | Thread  |  | Thread  |       |            |
                |  Local  |  |  Local  |       |            |
                |  Cache  |  |  Cache  |       |            |
                +---------+  +---------+       |            |
                     |            |            |            |
     pool_refill*()  V            |            |            |
                +---------+       |            |            |
                | Shared  |       |            |            |
                |  Pool   |       |            |            |
                +---------+       |            |            |
                     |            |            |            |
           malloc()  V            V            V            |
                +---------+  +---------+  +---------+       |
                | Library |  | Library |  | Library |       |
                +---------+  +---------+  +---------+       |
                     |            |            |            |
             mmap()  V            V            V            V
                +---------+  +---------+  +---------+  +---------+
                |   OS    |  |   OS    |  |   OS    |  |   OS    |
                +---------+  +---------+  +---------+  +---------+

One extra build define, DEBUG_FAIL_ALLOC, is used to enforce random allocation
failure in pool_alloc() by randomly returning NULL, to test that callers
properly handle allocation failures. In this case the desired average rate of
allocation failures can be set using the global setting "tune.fail-alloc",
expressed in percent.
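
The principle of a percentage-driven failure injection can be sketched as
follows. This is only an illustration: the variable fail_alloc_rate stands in
for the "tune.fail-alloc" setting, both function names are made up, and rand()
is used for simplicity without any claim about the generator HAProxy uses.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for "tune.fail-alloc", in percent (0..100) */
static int fail_alloc_rate = 0;

/* Returns 1 when the current allocation must be made to fail */
static int should_fail_alloc(void)
{
    assert(fail_alloc_rate >= 0 && fail_alloc_rate <= 100);
    if (!fail_alloc_rate)
        return 0;
    /* rand() % 100 lies in 0..99, so a rate of 100 always fails */
    return (rand() % 100) < fail_alloc_rate;
}

/* malloc() wrapper returning NULL at the configured average rate */
static void *alloc_maybe_fail(size_t size)
{
    if (should_fail_alloc())
        return NULL;
    return malloc(size);
}
```

A caller that does not check the returned pointer will quickly crash under
such a setting, which is the purpose of the option.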

The thread-local caches contain the freshest objects whose total size amounts
to CONFIG_HAP_POOL_CACHE_SIZE bytes, which was typically 1MB before 2.6 and
is 512kB since. The aim is to keep hot objects that still fit in the CPU core's
private L2 cache. Once these objects do not fit into the cache anymore, there's
no benefit keeping them local to the thread, so they'd rather be returned to
the shared pool or the main allocator so that any other thread may make use of
them.


3. Storage in thread-local caches
---------------------------------

This section describes how objects are linked in thread local caches. This is
not meant to be a concern for users of the pools API but it can be useful when
inspecting post-mortem dumps or when trying to figure out certain size
constraints.

Objects are stored in the local cache using a doubly-linked list. This ensures
that they can be visited by freshness order like a stack, while at the same
time being able to access them from oldest to newest when it is needed to
evict coldest ones first:

  - releasing an object to the cache always puts it on the top.

  - allocating an object from the cache always takes the topmost one, hence the
    freshest one.

  - scanning for older objects to evict starts from the bottom, where the
    oldest ones are located.

To that end, each thread-local cache keeps a list head in the "list" member of
its "pool_cache_head" descriptor, that links all objects cast to type
"pool_cache_item" via their "by_pool" member.

Note that the mechanism described above only works for a single pool. When
trying to limit the total cache size to a certain value, all pools included,
there is also a need to arrange all objects from all pools together in the
local caches. For this, each thread_ctx maintains a list head of recently
released objects, all pools included, in its member "pool_lru_head". All items
in a thread-local cache are linked there via their "by_lru" member.

This means that releasing an object using pool_free() consists in inserting
it at the beginning of two lists:
  - the local pool_cache_head's "list" list head
  - the thread context's "pool_lru_head" list head

Allocating an object consists in picking the first entry from the pool's "list"
and deleting its "by_pool" and "by_lru" links.

Evicting an object consists in scanning the thread context's "pool_lru_head"
backwards and deleting the object's "by_pool" and "by_lru" links.

Given that entries are both inserted and removed synchronously, we have the
guarantee that the oldest object in the thread's LRU list is always the oldest
object in its pool, and that the next element is the cache's list head. This is
what allows the LRU eviction mechanism to figure out which pool an object
belongs to when releasing it.
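
The three operations above can be sketched with a small circular
doubly-linked list. The member names "list", "by_pool" and "by_lru" follow the
text, but the list helpers, the "count" field, the ITEM_OF() macro and the
cache_put(), cache_get() and cache_evict_oldest() functions are assumptions
made for this example and do not reproduce the actual code.

```c
#include <assert.h>
#include <stddef.h>

struct list { struct list *n, *p; };           /* n: next, p: previous */

struct pool_cache_head { struct list list; unsigned int count; };
struct pool_cache_item { struct list by_pool; struct list by_lru; };

static void list_init(struct list *l) { l->n = l->p = l; }
static int list_is_empty(const struct list *l) { return l->n == l; }

/* insert <el> right after <head>, i.e. on the top of the stack */
static void list_insert(struct list *head, struct list *el)
{
    el->n = head->n; el->p = head;
    head->n->p = el; head->n = el;
}

static void list_delete(struct list *el)
{
    el->n->p = el->p; el->p->n = el->n;
    list_init(el);
}

#define ITEM_OF(ptr, member) ((struct pool_cache_item *) \
    ((char *)(ptr) - offsetof(struct pool_cache_item, member)))

/* release: the object goes to the top of both lists */
static void cache_put(struct pool_cache_head *ph, struct list *lru, void *obj)
{
    struct pool_cache_item *it = obj;

    list_insert(&ph->list, &it->by_pool);
    list_insert(lru, &it->by_lru);
    ph->count++;
}

/* allocate: take the topmost (freshest) object of this pool's cache */
static void *cache_get(struct pool_cache_head *ph)
{
    struct pool_cache_item *it;

    if (list_is_empty(&ph->list))
        return NULL;
    it = ITEM_OF(ph->list.n, by_pool);
    list_delete(&it->by_pool);
    list_delete(&it->by_lru);
    ph->count--;
    return it;
}

/* evict: take the oldest object of the LRU. Being also the oldest of its
 * pool, its "by_pool" successor is the owning cache's list head.
 */
static void *cache_evict_oldest(struct list *lru, struct pool_cache_head **owner)
{
    struct pool_cache_item *it;

    if (list_is_empty(lru))
        return NULL;
    it = ITEM_OF(lru->p, by_lru);
    *owner = (struct pool_cache_head *)
        ((char *)it->by_pool.n - offsetof(struct pool_cache_head, list));
    assert((*owner)->list.p == &it->by_pool);
    list_delete(&it->by_pool);
    list_delete(&it->by_lru);
    (*owner)->count--;
    return it;
}
```

The assertion in cache_evict_oldest() expresses the guarantee stated above:
it only holds because every insertion and removal updates both lists together.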

Note:
  | Since a pool_cache_item has two list entries, on 64-bit systems it will be
  | 32 bytes long. This is the smallest size that a pool may be, and any smaller
  | size will automatically be rounded up to this size.

When build option DEBUG_POOL_INTEGRITY is set, the area of the object between
the two list elements and the end according to pool->size will be filled with
pseudo-random words during pool_put_to_cache(), and these words will be
compared with each other during pool_get_from_cache(), and the process will
crash in case any bit differs, as this would indicate that the memory area was
modified after the free. The pseudo-random pattern is in fact incremented by
(~0)/3 upon each free so that roughly half of the bits change each time and we
maximize the likelihood of detecting a single bit flip in either direction. In
order to avoid an immediate reuse and maximize the time the object spends in
the cache, when this option is set, objects are picked from the cache from the
oldest one instead of the freshest one. This way even late memory corruptions
have a chance to be detected.
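
The fill and check steps may be pictured with the sketch below. It is an
assumption-laden illustration: the global fill_pattern variable and the
integrity_fill() and integrity_check() names are invented for this example,
and the check returns a status where the real code is described as crashing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical running pattern, advanced on every release */
static uintptr_t fill_pattern;

/* On release: advance the pattern by (~0)/3 (0x5555...), which changes
 * about half of the bits, then fill the <nwords> words of <area> with it.
 */
static void integrity_fill(uintptr_t *area, size_t nwords)
{
    size_t i;

    assert(nwords > 0);
    fill_pattern += ~(uintptr_t)0 / 3;
    for (i = 0; i < nwords; i++)
        area[i] = fill_pattern;
}

/* On allocation: all words must still be identical to each other.
 * Returns 1 if the area is intact, 0 if it was modified after the free.
 */
static int integrity_check(const uintptr_t *area, size_t nwords)
{
    size_t i;

    for (i = 1; i < nwords; i++)
        if (area[i] != area[0])
            return 0;
    return 1;
}
```

Comparing words with each other rather than against a stored reference means
that no extra storage is needed per object.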

When build option DEBUG_MEMORY_POOLS is set, pool objects are allocated with
one extra pointer compared to the requested size, so that the bytes that follow
the memory area point to the pool descriptor itself as long as the object is
allocated via pool_alloc(). Upon releasing via pool_free(), the pointer is
compared and the code will crash if it differs. This makes it possible to
detect both memory overflows and objects released to the wrong pool (code bug
resulting from a copy-paste error typically).
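
A minimal sketch of this trailing pointer technique is given below, on top of
malloc(). The reduced "struct pool_head" with a single "size" field and the
tagged_alloc(), tag_matches() and tagged_free() names are assumptions made for
the example, not the real definitions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct pool_head { size_t size; };        /* simplified stand-in */

/* Allocate one object plus room for a trailing pointer to its pool */
static void *tagged_alloc(struct pool_head *pool)
{
    char *obj;

    assert(pool != NULL);
    obj = malloc(pool->size + sizeof(pool));
    if (!obj)
        return NULL;
    /* memcpy() is used since <size> may not be pointer-aligned */
    memcpy(obj + pool->size, &pool, sizeof(pool));
    return obj;
}

/* Returns 1 if the pointer stored after <obj> still designates <pool> */
static int tag_matches(const struct pool_head *pool, const void *obj)
{
    const struct pool_head *tag;

    memcpy(&tag, (const char *)obj + pool->size, sizeof(tag));
    return tag == pool;
}

/* Release the object, aborting on overflow or on a wrong pool */
static void tagged_free(struct pool_head *pool, void *obj)
{
    if (!tag_matches(pool, obj))
        abort();
    free(obj);
}
```

Writing even a single byte past the end of the object alters the stored
pointer, so the mismatch is caught when the object is released.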

Thus an object will look like this depending on whether it's in the cache or
is currently in use:

           in cache                 in use
        +------------+          +------------+
     <--+ by_pool.p  |          |   N bytes  |
        | by_pool.n  +-->       |            |
        +------------+          |N=16 min on |
     <--+ by_lru.p   |          |  32-bit,   |
        | by_lru.n   +-->       | 32 min on  |
        +------------+          |   64-bit   |
        :            :          :            :
        |  N bytes   |          |            |
        +------------+          +------------+ \  optional, only if
        :  (unused)  :          :  pool ptr  :  > DEBUG_MEMORY_POOLS
        +------------+          +------------+ /  is set at build time

Right now no provisions are made to return objects aligned on larger boundaries
than those currently covered by malloc() (i.e. two pointers). This need appears
from time to time and the layout above might evolve a little bit if needed.


4. Storage in the process-wide shared pool
------------------------------------------

In order for the shared pool not to be a contention point in a multi-threaded
environment, objects are allocated from or released to shared pools by clusters
of a few objects at once. The maximum number of objects that may be moved to or
from a shared pool at once is defined by CONFIG_HAP_POOL_CLUSTER_SIZE at build
time, and currently defaults to 8.

In order to remain scalable, the shared pool has to make some tradeoffs to
limit the number of atomic operations and the duration of any locked operation.
As such, it's composed of a single-linked list of clusters, themselves made of
a single-linked list of objects.

Clusters and objects are of the same type "pool_item" and are accessed from the
pool's "free_list" member. This member points to the latest pool_item inserted
into the pool by a release operation. And the pool_item's "next" member points
to the next pool_item, which was the one present in the pool's free_list just
before the pool_item was inserted, and the last pool_item in the list simply
has a NULL "next" field.

The pool_item's "down" pointer points down to the next objects that are part of
the same cluster, that will be released or allocated at the same time as the
first one. Each of these items also has a NULL "next" field, and they are
chained by their respective "down" pointers until the last one is detected by a
NULL value.

This results in the following layout:

    pool             pool_item     pool_item     pool_item
    +-----------+    +------+      +------+      +------+
    | free_list +--> | next +-->   | next +-->   | NULL |
    +-----------+    +------+      +------+      +------+
                     | down |      | NULL |      | down |
                     +--+---+      +------+      +--+---+
                        |                           |
                        V                           V
                     +------+                    +------+
                     | NULL |                    | NULL |
                     +------+                    +------+
                     | down |                    | NULL |
                     +--+---+                    +------+
                        |
                        V
                     +------+
                     | NULL |
                     +------+
                     | NULL |
                     +------+
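
Expressed as a C structure, the layout above may be walked as in this small
sketch. The "next" and "down" member names come from the text; the reduced
structure and the cluster_len() and free_list_len() helpers are only
assumptions used for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct pool_item {
    struct pool_item *next;   /* next cluster, NULL on the last one */
    struct pool_item *down;   /* next object of the same cluster    */
};

/* Count the objects of one cluster by following "down" pointers */
static size_t cluster_len(const struct pool_item *head)
{
    const struct pool_item *it;
    size_t n = 0;

    for (it = head; it; it = it->down) {
        /* only the first item of a cluster may have a "next" */
        assert(it == head || it->next == NULL);
        n++;
    }
    return n;
}

/* Count all objects of a shared pool: follow "next" between clusters */
static size_t free_list_len(const struct pool_item *free_list)
{
    size_t n = 0;

    for (; free_list; free_list = free_list->next)
        n += cluster_len(free_list);
    return n;
}
```

Applied to the diagram, the three clusters hold 3, 1 and 2 objects, hence 6
objects in total.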

Allocating an entry is only a matter of performing two atomic operations on
the free_list and reading the first pool_item's "next" value:

  - atomically mark the free_list as being updated by writing a "magic" pointer
  - read the first pool_item's "next" field
  - atomically replace the free_list with this value

This results in a fast operation that instantly retrieves a cluster at once.
Then outside of the critical section entries are walked over and inserted into
the local cache one at a time. In order to keep the code simple and efficient,
objects allocated from the shared pool are all placed into the local cache, and
only then the first one is allocated from the cache. This operation is
performed by the dedicated function pool_refill_local_from_shared() which is
called from pool_get_from_cache() when the cache is empty. It means there is an
overhead of two list insert/delete operations for the first object and that
could be avoided at the expense of more complex code in the fast path, but this
is negligible since it only concerns objects that need to be visited anyway.

Freeing a group of objects consists in performing the operation the other way
around:

  - atomically mark the free_list as being updated by writing a "magic" pointer
  - write the free_list value to the to-be-released item's "next" entry
  - atomically replace the free_list with the pool_item's pointer

The cluster will simply have to be prepared before being sent to the shared
pool. The operation of releasing a cluster at once is performed by function
pool_put_to_shared_cache() which is called from pool_evict_last_items() which
itself is responsible for building the clusters.
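
Both sequences can be sketched with C11 atomics as shown below. This is a
simplified illustration and not the actual code: the POOL_BUSY value, the
reduced "struct shared_pool" and the three function names are assumptions, and
the default sequentially consistent ordering is used everywhere for clarity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct pool_item { struct pool_item *next, *down; };

/* Hypothetical "magic" pointer marking the free_list as being updated */
#define POOL_BUSY ((struct pool_item *)1)

struct shared_pool { _Atomic(struct pool_item *) free_list; };

/* Step 1 of both operations: swap the magic pointer in, and retry for as
 * long as another thread is found in the middle of its own update.
 */
static struct pool_item *free_list_lock(struct shared_pool *pool)
{
    struct pool_item *head;

    do {
        head = atomic_exchange(&pool->free_list, POOL_BUSY);
    } while (head == POOL_BUSY);
    return head;
}

/* Retrieve a whole cluster at once, or NULL if the pool is empty */
static struct pool_item *shared_get_cluster(struct shared_pool *pool)
{
    struct pool_item *head = free_list_lock(pool);
    struct pool_item *next = head ? head->next : NULL;  /* step 2: read */

    atomic_store(&pool->free_list, next);               /* step 3: replace */
    if (head)
        head->next = NULL;
    return head;
}

/* Release a cluster already chained through its "down" pointers */
static void shared_put_cluster(struct shared_pool *pool,
                               struct pool_item *cluster)
{
    struct pool_item *head;

    assert(cluster != NULL && cluster != POOL_BUSY);
    head = free_list_lock(pool);
    cluster->next = head;                               /* step 2: write */
    atomic_store(&pool->free_list, cluster);            /* step 3: replace */
}
```

The critical section only spans one pointer read or write regardless of the
number of objects in the cluster, which is what keeps the contention low.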

Due to the way objects are stored, it is important to try to group objects as
much as possible when releasing them because this is what will condition their
retrieval as groups as well. This is the reason why pool_evict_last_items()
uses the LRU to find a first entry but tries to pick several items at once from
a single cache. Tests have shown that CONFIG_HAP_POOL_CLUSTER_SIZE set to 8
achieves up to 6-6.5 objects on average per operation, which effectively
divides by as much the average time spent per object by each thread and pushes
the contention point further.

Also, grouping items in clusters is a property of the process-wide shared pool
and not of the thread-local caches. This means that there is no grouped
operation when not using the shared pool (mode "2" in the diagram above).


5. API
------

The following functions are public and available for user code:

struct pool_head *create_pool(char *name, uint size, uint flags)
        Create a new pool named <name> for objects of size <size> bytes. Pool
        names are truncated to their first 11 characters. Pools of very similar
        size will usually be merged if both have set the flag MEM_F_SHARED in
        <flags>. When DEBUG_DONT_SHARE_POOLS was set at build time, the pools
        also need to have the exact same name to be merged. In addition, unless
        MEM_F_EXACT is set in <flags>, the object size will usually be rounded
        up to the size of pointers (16 or 32 bytes). The name that will appear
        in the pool upon merging is the name of the first created pool. The
        returned pointer is the new (or reused) pool head, or NULL upon error.
        Pools created this way must be destroyed using pool_destroy().

void *pool_destroy(struct pool_head *pool)
        Destroy pool <pool>, that is, all of its unused objects are freed and
        the structure is freed as well if the pool didn't have any used objects
        anymore. In this case NULL is returned. If some objects remain in use,
        the pool is preserved and its pointer is returned. This ought to be
        used essentially on exit or in rare situations where some internal
        entities that hold pools have to be destroyed.

void pool_destroy_all(void)
        Destroy all pools, without checking which ones still have used entries.
        This is only meant for use on exit.

void *__pool_alloc(struct pool_head *pool, uint flags)
        Allocate an entry from the pool <pool>. The allocator will first look
        for an object in the thread-local cache if enabled, then in the shared
        pool if enabled, then will fall back to the operating system's default
        allocator. NULL is returned if the object couldn't be allocated (due to
        configured limits or lack of memory). Objects allocated this way have
        to be released using pool_free(). Like with malloc(), by default the
        contents of the returned object are undefined. If memory poisoning is
        enabled, the object will be filled with the poisoning byte. If the
        global "tune.fail-alloc" setting is non-zero and DEBUG_FAIL_ALLOC is
        enabled, a random number generator will be called to randomly return a
        NULL. The allocator's behavior may be adjusted using a few flags passed
        in <flags>:
          - POOL_F_NO_POISON : when set, disables memory poisoning (e.g. when
                               pointless and expensive, like for buffers)
          - POOL_F_MUST_ZERO : when set, the memory area will be zeroed before
                               being returned, similar to what calloc() does
          - POOL_F_NO_FAIL   : when set, disables the random allocation failure,
                               e.g. for use during early init code or critical
                               sections.

void *pool_alloc(struct pool_head *pool)
        This is an exact equivalent of __pool_alloc(pool, 0). It is the regular
        way to allocate entries from a pool.

void *pool_alloc_nocache(struct pool_head *pool)
        Allocate an entry from the pool <pool>, bypassing the cache. If shared
        pools are enabled, they will be consulted first. Otherwise the object
        is allocated using the operating system's default allocator. This is
        essentially used during early boot to pre-allocate a number of objects
        for pools which require a minimum number of entries to exist.

void *pool_zalloc(struct pool_head *pool)
        This is an exact equivalent of __pool_alloc(pool, POOL_F_MUST_ZERO).

void pool_free(struct pool_head *pool, void *ptr)
        Free an entry allocated from one of the pool_alloc() functions above
        from pool <pool>. The object will be placed into the thread-local cache
        if enabled, or in the shared pool if enabled, or will be released using
        the operating system's default allocator. When memory poisoning is
        enabled, the area will be overwritten before being released. This can
        sometimes help detect use-after-free conditions. When a local cache is
        enabled, if the local cache size becomes larger than 75% of the maximum
        size configured at build time, some objects will be evicted to the
        shared pool. Such objects are taken first from the same pool, but if
        the total size is really huge, other pools might be checked as well.
        Some debugging options enabled at build time may enforce extra checks
        so that the process will immediately crash if the object was not
        allocated from this pool or experienced an overflow or some memory
        corruption.

void pool_flush(struct pool_head *pool)
        Free all unused objects from shared pool <pool>. Thread-local caches
        are not affected. This is essentially used when running low on memory
        or when stopping, in order to release a maximum amount of memory for
        the new process.

void pool_gc(struct pool_head *pool)
        Free all unused objects from all pools, but respecting the minimum
        number of spare objects required for each of them. Then, for operating
        systems which support it, indicate to the system that all unused memory
        can be released. Thread-local caches are not affected. This operation
        differs from pool_flush() in that it is run locklessly, under thread
        isolation, and on all pools in a row. It is called by the SIGQUIT
        signal handler and upon exit. Note that the obsolete argument <pool> is
        not used and the convention is to pass NULL there.

void dump_pools_to_trash(void)
        Dump the current status of all pools into the trash buffer. This is
        essentially used by the "show pools" CLI command or the SIGQUIT signal
        handler to dump them on stderr. The total report size may not exceed
        the size of the trash buffer. If it does, some entries will be missing.

void dump_pools(void)
        Dump the current status of all pools to stderr. This just calls
        dump_pools_to_trash() and writes the trash to stderr.

int pool_total_failures(void)
        Report the total number of failed allocations. This is solely used to
        report the "PoolFailed" metrics of the "show info" output. The total
        is calculated on the fly by summing the number of failures in all pools
        and is only meant to be used as an indicator rather than a precise
        measure.

ulong pool_total_allocated(void)
        Report the total number of bytes allocated in all pools, for reporting
        in the "PoolAlloc_MB" field of the "show info" output. The total is
        calculated on the fly by summing the number of allocated bytes in all
        pools and is only meant to be used as an indicator rather than a
        precise measure.

ulong pool_total_used(void)
        Report the total number of bytes used in all pools, for reporting in
        the "PoolUsed_MB" field of the "show info" output. The total is
        calculated on the fly by summing the number of used bytes in all pools
        and is only meant to be used as an indicator rather than a precise
        measure. Note that objects present in caches are accounted as used.
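
To make the effect of the allocation flags more concrete, here is a sketch of
how they could be handled in the pass-through mode (3), where each allocation
maps to malloc(). Everything in it is an assumption made for illustration: the
flag values, the reduced "struct pool_head", the poison_byte variable, the
sketch_pool_alloc() and sketch_pool_free() names, and the choice of letting
POOL_F_MUST_ZERO take precedence over poisoning.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder values for this sketch, not the real definitions */
#define POOL_F_NO_POISON 0x1
#define POOL_F_MUST_ZERO 0x2

struct pool_head { size_t size; };     /* simplified stand-in */

/* Hypothetical poisoning configuration: negative means disabled */
static int poison_byte = -1;

/* Pass-through allocation honouring the two content-related flags */
static void *sketch_pool_alloc(struct pool_head *pool, unsigned int flags)
{
    void *p;

    assert(pool != NULL && pool->size > 0);
    p = malloc(pool->size);
    if (!p)
        return NULL;
    if (flags & POOL_F_MUST_ZERO)
        memset(p, 0, pool->size);               /* calloc()-like behavior */
    else if (poison_byte >= 0 && !(flags & POOL_F_NO_POISON))
        memset(p, poison_byte, pool->size);     /* poison the fresh area  */
    return p;
}

/* Pass-through release; <pool> is kept only to mirror pool_free() */
static void sketch_pool_free(struct pool_head *pool, void *ptr)
{
    (void)pool;
    free(ptr);
}
```

In user code the pattern stays the same whatever the mode: allocate from a
given pool, check for NULL, and release to that same pool.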

Some other functions exist and are only used by the pools code itself. While
not strictly forbidden to use outside of this code, it is generally recommended
to avoid touching them in order not to create undesired dependencies that will
complicate maintenance.

A few macros exist to ease the declaration of pools:

DECLARE_POOL(ptr, name, size)
        Placed at the top level of a file, this declares a global memory pool
        as variable <ptr>, name <name> and size <size> bytes per element. This
        is made via a call to REGISTER_POOL() and by assigning the resulting
        pointer to variable <ptr>. <ptr> will be created of type "struct
        pool_head *". If the pool needs to be visible outside of the file
        (which is likely), it will also need to be declared somewhere as
        "extern struct pool_head *<ptr>;". It is recommended to place such
        declarations very early in the source file so that the variable is
        already known to all subsequent functions which may use it.

DECLARE_STATIC_POOL(ptr, name, size)
        Placed at the top level of a file, this declares a static memory pool
        as variable <ptr>, name <name> and size <size> bytes per element. This
        is made via a call to REGISTER_POOL() and by assigning the resulting
        pointer to local variable <ptr>. <ptr> will be created of type "static
        struct pool_head *". It is recommended to place such declarations very
        early in the source file so that the variable is already known to all
        subsequent functions which may use it.


6. Build options
----------------

A number of build-time defines make it possible to tune the pools' behavior.
All of them have to be enabled using "-Dxxx" or "-Dxxx=yyy" in the makefile's
DEBUG variable.

DEBUG_NO_POOLS
        When this is set, pools are entirely disabled, and allocations are made
        using malloc() instead. This is not recommended for production but may
        be useful for tracing allocations.

DEBUG_MEMORY_POOLS
        When this is set, an extra pointer is allocated at the end of each
        object to reference the pool the object was allocated from and detect
        buffer overflows. Then, pool_free() will provoke a crash in case it
        detects an anomaly (pointer at the end not matching the pool).

DEBUG_FAIL_ALLOC
        When enabled, a global setting "tune.fail-alloc" may be set to a
        non-zero value representing a percentage of memory allocations that
        will be made to fail in order to stress the calling code.

DEBUG_DONT_SHARE_POOLS
        When enabled, pools of similar sizes are not merged unless they have
        the exact same name.

DEBUG_UAF
        When enabled, pools are disabled and all allocations and releases pass
        through mmap() and munmap(). The memory usage significantly inflates
        and the performance degrades, but this makes it possible to detect a
        lot of use-after-free conditions by crashing the program at the first
        abnormal access. This should not be used in production.

DEBUG_POOL_INTEGRITY
        When enabled, objects picked from the cache are checked for corruption
        by comparing their contents against a pattern that was placed when they
        were inserted into the cache. Objects are also allocated in the reverse
        order, from the oldest one to the most recent, so as to maximize the
        ability to detect such a corruption. The goal is to detect writes after
        free (or possibly hardware memory corruptions). Contrary to DEBUG_UAF
        this cannot detect reads after free, but may possibly detect later
        corruptions and will not consume extra memory. The CPU usage will
        increase a bit due to the cost of filling/checking the area and for the
        preference for cold cache instead of hot cache, though not as much as
        with DEBUG_UAF. This option is meant to be usable in production.

DEBUG_POOL_TRACING
        When enabled, the callers of pool_alloc() and pool_free() will be
        recorded into an extra memory area placed after the end of the object.
        This may only be required by developers who want to get a few more
        hints about code paths involved in some crashes, but will serve no
        purpose outside of this. It remains compatible with (and complements
        well) DEBUG_POOL_INTEGRITY above. Such information becomes meaningless
        once the objects leave the thread-local cache.

DEBUG_MEM_STATS
        When enabled, all malloc/calloc/realloc/strdup/free calls are accounted
        for per call place (file+line number), and may be displayed or reset on
        the CLI using "debug dev memstats". This is essentially used to detect
        potential leaks or abnormal usages. When pools are enabled (default),
        such calls are rare and the output will mostly contain calls induced by
        libraries. When pools are disabled, about all calls to pool_alloc() and
        pool_free() will also appear since they will be remapped to standard
        functions.

CONFIG_HAP_GLOBAL_POOLS
        When enabled, process-wide shared pools will be forcefully enabled even
        if not considered useful on the platform. The default is to let haproxy
        decide based on the OS and C library.

CONFIG_HAP_NO_GLOBAL_POOLS
        When enabled, process-wide shared pools will be forcefully disabled
        even if considered useful on the platform. The default is to let
        haproxy decide based on the OS and C library.

CONFIG_HAP_POOL_CACHE_SIZE
        This allows one to define the size of the per-thread cache, in bytes.
        The default value is 512 kB (524288). Smaller values will use less
        memory at the expense of a possibly higher CPU usage when using many
        threads. Higher values will give diminishing returns on performance
        while using much more memory. Usually there is no benefit in using
        more than a per-core L2 cache size. It would be better not to set this
        value lower than a few times the size of a buffer (bufsize, defaults to
        16 kB).

CONFIG_HAP_POOL_CLUSTER_SIZE
        This allows one to define the maximum number of objects that will be
        grouped together in an allocation from the shared pool. Values 4 to 8
        have experimentally shown good results with 16 threads. On systems with
        more cores or loosely coupled caches exhibiting slow atomic operations,
        it could possibly make sense to slightly increase this value.