Willy Tarreau | 0722d5d | 2022-02-24 08:58:04 +0100 | [diff] [blame^] | 1 | 2022-02-24 - Pools structure and API |
| 2 | |
| 3 | 1. Background |
| 4 | ------------- |
| 5 | |
| 6 | Memory allocation is a complex problem covered by a massive amount of |
| 7 | literature. Memory allocators found in the field cover a broad spectrum of |
| 8 | capabilities, performance, fragmentation, efficiency etc. |
| 9 | |
| 10 | The main difficulty of memory allocation comes from finding the optimal chunks |
| 11 | for arbitrary sized requests, that will still preserve a low fragmentation |
| 12 | level. Doing this well is often expensive in CPU usage and/or memory usage. |
| 13 | |
| 14 | In programs like HAProxy that deal with a large number of fixed size objects, |
| 15 | there is no point having to endure all this risk of fragmentation, and the |
| 16 | associated costs (sometimes up to several milliseconds with certain minimalist |
| 17 | allocators) are simply not acceptable. A better approach consists in grouping |
| 18 | frequently used objects by size, knowing that due to the high repetitiveness of |
| 19 | operations, a freed object will immediately be needed for another operation. |
| 20 | |
| 21 | This grouping of objects by size is what is called a pool. Pools are created |
| 22 | for certain frequently allocated objects, are usually merged together when they |
| 23 | are of the same size (or almost the same size), and significantly reduce the |
| 24 | number of calls to the memory allocator. |
| 25 | |
| 26 | With the arrival of threads, pools started to become a bottleneck so they now |
| 27 | implement an optional thread-local lockless cache. Finally with the arrival of |
| 28 | really efficient memory allocators in modern operating systems, the shared part |
| 29 | has also become optional so that it doesn't consume memory if it does not bring |
| 30 | any value. |
| 31 | |
| 32 | In 2.6-dev2, a number of debugging options that used to be configurable at |
| 33 | build time only became configurable at boot time, using keywords passed after |
| 34 | "-dM" on the command line, which set or clear bits in the pool_debugging |
| 35 | variable. The build-time options still affect the default settings however. |
| 36 | Default values may be consulted using "haproxy -dMhelp". |
| 37 | |
| 38 | |
| 39 | 2. Principles |
| 40 | ------------- |
| 41 | |
| 42 | The pools architecture is selected at build time. The main options are: |
| 43 | |
| 44 | - thread-local caches and process-wide shared pool enabled (1) |
| 45 | |
| 46 | This is the default situation on most operating systems. Each thread has |
| 47 | its own local cache, and when depleted it refills from the process-wide |
| 48 | pool that avoids calling the standard allocator too often. It is possible |
| 49 | to force this mode at build time by setting CONFIG_HAP_GLOBAL_POOLS or at |
| 50 | boot time with "-dMglobal". |
| 51 | |
| 52 | - thread-local caches only are enabled (2) |
| 53 | |
| 54 | This is the situation on operating systems where a fast and modern memory |
| 55 | allocator is detected and when it is estimated that the process-wide shared |
| 56 | pool will not bring any benefit. This detection is automatic at build time, |
| 57 | but may also be forced at build time by setting CONFIG_HAP_NO_GLOBAL_POOLS |
| 58 | or at boot time with "-dMno-global". |
| 59 | |
| 60 | - pass-through to the standard allocator (3) |
| 61 | |
| 62 | This is used when one absolutely wants to disable pools and rely on regular |
| 63 | malloc() and free() calls, essentially in order to trace memory allocations |
| 64 | by call points, either internally via DEBUG_MEM_STATS, or externally via |
| 65 | tools such as Valgrind. This mode of operation may be forced at build time |
| 66 | by setting DEBUG_NO_POOLS or at boot time with "-dMno-cache". |
| 67 | |
| 68 | - pass-through to an mmap-based allocator for debugging (4) |
| 69 | |
| 70 | This is used only during deep debugging when trying to detect various |
| 71 | conditions such as use-after-free. In this case each allocated object's |
| 72 | size is rounded up to a multiple of a page size (4096 bytes) and an |
| 73 | integral number of pages is allocated for each object using mmap(), |
| 74 | surrounded by two inaccessible holes that aim to detect some out-of-bounds |
| 75 | accesses. Released objects are instantly freed using munmap() so that any |
| 76 | immediate subsequent access to the memory area crashes the process if the |
| 77 | area had not been reallocated yet. This mode can be enabled at build time |
| 78 | by setting DEBUG_UAF. It tends to consume a lot of memory and does not scale |
| 79 | at all with concurrent calls, which tends to make the system stall. The |
| 80 | watchdog may even trigger on some slow allocations. |
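
The mmap-based mode (4) can be illustrated with a minimal sketch. It only
follows the description above (size rounded up to whole pages, one inaccessible
page on each side, instant munmap() on release) and is not the code used by
DEBUG_UAF. The names round_to_pages(), uaf_alloc() and uaf_free() are
hypothetical, and the page size is queried with sysconf() instead of assuming
4096 bytes.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round <size> up to the next multiple of <page>. */
static size_t round_to_pages(size_t size, size_t page)
{
    return (size + page - 1) / page * page;
}

/* Hypothetical allocator: maps the rounded size plus two guard pages with
 * no access rights, then opens only the middle part for reads and writes.
 */
static void *uaf_alloc(size_t size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = round_to_pages(size, page);
    char *area;

    area = mmap(NULL, len + 2 * page, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED)
        return NULL;

    if (mprotect(area + page, len, PROT_READ | PROT_WRITE) != 0) {
        munmap(area, len + 2 * page);
        return NULL;
    }
    return area + page;
}

/* Hypothetical release: the whole mapping disappears at once, so any later
 * access to <ptr> faults unless the address range was mapped again.
 */
static void uaf_free(void *ptr, size_t size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    munmap((char *)ptr - page, round_to_pages(size, page) + 2 * page);
}
```

With such a layout, a 100-byte object still costs three pages of address space
and two system calls per allocation, which matches the memory and scalability
warnings given above.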
| 81 | |
| 82 | There are no more provisions for running with a shared pool but no thread-local |
| 83 | cache: the shared pool's main goal is to compensate for the expensive calls to |
| 84 | the memory allocator. This gain may be huge on tiny systems using basic |
| 85 | allocators, but the thread-local cache will already achieve this. And on larger |
| 86 | threaded systems, the shared pool's benefit is visible when the underlying |
| 87 | allocator scales poorly, but in this case the shared pool would suffer from |
| 88 | the same limitations without its thread-local cache and wouldn't provide any |
| 89 | benefit. |
| 90 | |
| 91 | Summary of the various operation modes: |
| 92 | |
| 93 | (1) (2) (3) (4) |
| 94 | |
| 95 | User User User User |
| 96 | | | | | |
| 97 | pool_alloc() V V | | |
| 98 | +---------+ +---------+ | | |
| 99 | | Thread | | Thread | | | |
| 100 | | Local | | Local | | | |
| 101 | | Cache | | Cache | | | |
| 102 | +---------+ +---------+ | | |
| 103 | | | | | |
| 104 | pool_refill*() V | | | |
| 105 | +---------+ | | | |
| 106 | | Shared | | | | |
| 107 | | Pool | | | | |
| 108 | +---------+ | | | |
| 109 | | | | | |
| 110 | malloc() V V V | |
| 111 | +---------+ +---------+ +---------+ | |
| 112 | | Library | | Library | | Library | | |
| 113 | +---------+ +---------+ +---------+ | |
| 114 | | | | | |
| 115 | mmap() V V V V |
| 116 | +---------+ +---------+ +---------+ +---------+ |
| 117 | | OS | | OS | | OS | | OS | |
| 118 | +---------+ +---------+ +---------+ +---------+ |
| 119 | |
| 120 | One extra build define, DEBUG_FAIL_ALLOC, is used to enforce random allocation |
| 121 | failure in pool_alloc() by randomly returning NULL, to test that callers |
| 122 | properly handle allocation failures. It may also be enabled at boot time using |
| 123 | "-dMfail". In this case the desired average rate of allocation failures can be |
| 124 | fixed by global setting "tune.fail-alloc" expressed in percent. |
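
The failure injection described above may be pictured with the sketch below.
It is only an illustration: fail_rate_pct stands in for the "tune.fail-alloc"
setting, should_fail() and alloc_maybe_fail() are hypothetical names, and
rand() is used where the real code relies on its own random generator.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for "tune.fail-alloc", expressed in percent. */
static int fail_rate_pct = 0;

/* Returns 1 when the current allocation should be made to fail. */
static int should_fail(void)
{
    if (fail_rate_pct <= 0)
        return 0;
    return (rand() % 100) < fail_rate_pct;
}

/* Allocation wrapper: randomly returns NULL before calling the allocator,
 * so that callers get to exercise their error paths.
 */
static void *alloc_maybe_fail(size_t size)
{
    if (should_fail())
        return NULL;
    return malloc(size);
}
```

A rate of 0 never fails and a rate of 100 always fails; anything in between
yields that proportion of NULL returns on average.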
| 125 | |
| 126 | The thread-local caches contain the freshest objects whose total size amounts |
| 127 | to CONFIG_HAP_POOL_CACHE_SIZE bytes, which was typically 1MB before 2.6 and |
| 128 | is 512kB after. The aim is to keep hot objects that still fit in the CPU core's |
| 129 | private L2 cache. Once these objects do not fit into the cache anymore, there's |
| 130 | no benefit keeping them local to the thread, so they'd rather be returned to |
| 131 | the shared pool or the main allocator so that any other thread may make use of |
| 132 | them. |
| 133 | |
| 134 | |
| 135 | 3. Storage in thread-local caches |
| 136 | --------------------------------- |
| 137 | |
| 138 | This section describes how objects are linked in thread local caches. This is |
| 139 | not meant to be a concern for users of the pools API but it can be useful when |
| 140 | inspecting post-mortem dumps or when trying to figure out certain size constraints. |
| 141 | |
| 142 | Objects are stored in the local cache using a doubly-linked list. This ensures |
| 143 | that they can be visited by freshness order like a stack, while at the same |
| 144 | time being able to access them from oldest to newest when it is needed to |
| 145 | evict coldest ones first: |
| 146 | |
| 147 | - releasing an object to the cache always puts it on the top. |
| 148 | |
| 149 | - allocating an object from the cache always takes the topmost one, hence the |
| 150 | freshest one. |
| 151 | |
| 152 | - scanning for older objects to evict starts from the bottom, where the |
| 153 | oldest ones are located. |
| 154 | |
| 155 | To that end, each thread-local cache keeps a list head in the "list" member of |
| 156 | its "pool_cache_head" descriptor, that links all objects cast to type |
| 157 | "pool_cache_item" via their "by_pool" member. |
| 158 | |
| 159 | Note that the mechanism described above only works for a single pool. When |
| 160 | trying to limit the total cache size to a certain value, all pools included, |
| 161 | there is also a need to arrange all objects from all pools together in the |
| 162 | local caches. For this, each thread_ctx maintains a list head of recently |
| 163 | released objects, all pools included, in its member "pool_lru_head". All items |
| 164 | in a thread-local cache are linked there via their "by_lru" member. |
| 165 | |
| 166 | This means that releasing an object using pool_free() consists in inserting |
| 167 | it at the beginning of two lists: |
| 168 | - the local pool_cache_head's "list" list head |
| 169 | - the thread context's "pool_lru_head" list head |
| 170 | |
| 171 | Allocating an object consists in picking the first entry from the pool's "list" |
| 172 | and deleting its "by_pool" and "by_lru" links. |
| 173 | |
| 174 | Evicting an object consists in scanning the thread context's "pool_lru_head" |
| 175 | backwards and deleting the object's "by_pool" and "by_lru" links. |
| 176 | |
| 177 | Given that entries are both inserted and removed synchronously, we have the |
| 178 | guarantee that the oldest object in the thread's LRU list is always the oldest |
| 179 | object in its pool, and that the next element is the cache's list head. This is |
| 180 | what allows the LRU eviction mechanism to figure out which pool an object belongs to |
| 181 | when releasing it. |
| 182 | |
| 183 | Note: |
| 184 | | Since a pool_cache_item has two list entries, on 64-bit systems it will be |
| 185 | 32 bytes long. This is the smallest size that a pool may be, and any smaller |
| 186 | | size will automatically be rounded up to this size. |
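
The double linking described in this section can be sketched as follows. The
struct and member names mirror the ones quoted above (by_pool, by_lru, list),
but the list helpers, cache_put(), cache_get() and cache_evict_oldest() are
hypothetical simplifications written for this illustration, not the pools code
itself.

```c
#include <assert.h>
#include <stddef.h>

struct list { struct list *n, *p; };

struct cache_item {            /* modeled after "pool_cache_item" */
    struct list by_pool;       /* link in the per-pool cache list */
    struct list by_lru;        /* link in the per-thread LRU list */
};

struct cache_head {            /* modeled after "pool_cache_head" */
    struct list list;          /* freshest object first */
    unsigned int count;
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static void list_init(struct list *l) { l->n = l->p = l; }
static int list_empty(const struct list *l) { return l->n == l; }

/* Insert <el> right after <head>, i.e. at the top of the stack. */
static void list_insert(struct list *head, struct list *el)
{
    el->n = head->n;
    el->p = head;
    head->n->p = el;
    head->n = el;
}

static void list_delete(struct list *el)
{
    el->p->n = el->n;
    el->n->p = el->p;
    el->n = el->p = el;
}

/* Release: the object becomes the freshest entry of both lists. */
static void cache_put(struct cache_head *ph, struct list *lru, void *ptr)
{
    struct cache_item *it = ptr;

    list_insert(&ph->list, &it->by_pool);
    list_insert(lru, &it->by_lru);
    ph->count++;
}

/* Allocate: take the freshest entry of this pool, unlink it from both. */
static void *cache_get(struct cache_head *ph)
{
    struct cache_item *it;

    if (list_empty(&ph->list))
        return NULL;
    it = container_of(ph->list.n, struct cache_item, by_pool);
    list_delete(&it->by_pool);
    list_delete(&it->by_lru);
    ph->count--;
    return it;
}

/* Evict: the oldest LRU entry is also the oldest of its pool, so its
 * by_pool.n link points to the owning cache's list head.
 */
static void *cache_evict_oldest(struct list *lru, struct cache_head **owner)
{
    struct cache_item *it;

    if (list_empty(lru))
        return NULL;
    it = container_of(lru->p, struct cache_item, by_lru);
    *owner = container_of(it->by_pool.n, struct cache_head, list);
    list_delete(&it->by_pool);
    list_delete(&it->by_lru);
    (*owner)->count--;
    return it;
}
```

The eviction helper never needs a pool pointer stored in the object: it
recovers the owner from the list linkage alone, which is the property stated
above.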
| 187 | |
| 188 | When build option DEBUG_POOL_INTEGRITY is set, or the boot-time option |
| 189 | "-dMintegrity" is passed on the command line, the area of the object between |
| 190 | the two list elements and the end according to pool->size will be filled with |
| 191 | pseudo-random words during pool_put_to_cache(), and these words will be |
| 192 | compared with each other during pool_get_from_cache(), and the process will |
| 193 | crash in case any bit differs, as this would indicate that the memory area was |
| 194 | modified after the free. The pseudo-random pattern is in fact incremented by |
| 195 | (~0)/3 upon each free so that roughly half of the bits change each time and we |
| 196 | maximize the likelihood of detecting a single bit flip in either direction. In |
| 197 | order to avoid an immediate reuse and maximize the time the object spends in |
| 198 | the cache, when this option is set, objects are picked from the cache from the |
| 199 | oldest one instead of the freshest one. This way even late memory corruptions |
| 200 | have a chance to be detected. |
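
A minimal sketch of this fill-and-compare scheme is given below. It assumes
that the two list elements occupy the first four pointer-sized words of the
object (HDR_WORDS), uses a plain global where the real pattern would be kept
per thread, and returns 0 where the real code crashes. integrity_fill() and
integrity_check() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the two list elements take four pointer-sized words. */
#define HDR_WORDS 4

/* Current fill pattern (a plain global in this sketch). */
static uintptr_t fill_pattern;

/* On release: advance the pattern by (~0)/3, i.e. 0x5555..., so that about
 * half of the bits differ from the previous fill, then write it over the
 * area located after the list elements.
 */
static void integrity_fill(uintptr_t *obj, size_t size)
{
    size_t i, words = size / sizeof(*obj);

    fill_pattern += ~(uintptr_t)0 / 3;
    for (i = HDR_WORDS; i < words; i++)
        obj[i] = fill_pattern;
}

/* On allocation: every word of the area must still equal the first one.
 * Returns 1 when the area is intact, 0 when some word was modified.
 */
static int integrity_check(const uintptr_t *obj, size_t size)
{
    size_t i, words = size / sizeof(*obj);

    for (i = HDR_WORDS + 1; i < words; i++)
        if (obj[i] != obj[HDR_WORDS])
            return 0;
    return 1;
}
```

Comparing the words with each other rather than with a stored reference means
the check needs no extra per-object storage.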
| 201 | |
| 202 | When build option DEBUG_MEMORY_POOLS is set, or the boot-time option "-dMtag" |
| 203 | is passed on the executable's command line, pool objects are allocated with |
| 204 | one extra pointer compared to the requested size, so that the bytes that follow |
| 205 | the memory area point to the pool descriptor itself as long as the object is |
| 206 | allocated via pool_alloc(). Upon releasing via pool_free(), the pointer is |
| 207 | compared and the code will crash if it differs. This makes it possible to detect both |
| 208 | memory overflows and objects released to the wrong pool (code bug resulting from |
| 209 | a copy-paste error typically). |
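
The trailing pool pointer can be sketched like this. struct pool is a
hypothetical minimal descriptor, tagged_alloc() and tagged_free() are
illustrative names, malloc() replaces the real allocation path, and a -1
return replaces the crash.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical minimal pool descriptor. */
struct pool { size_t size; };

/* Allocate <pool->size> bytes plus room for one pointer, and store the
 * pool's address right after the usable area.
 */
static void *tagged_alloc(struct pool *pool)
{
    char *obj = malloc(pool->size + sizeof(pool));

    if (obj)
        memcpy(obj + pool->size, &pool, sizeof(pool));
    return obj;
}

/* Returns 1 when the pointer stored after the object designates <pool>. */
static int tag_matches(const struct pool *pool, const void *obj)
{
    const struct pool *tag;

    memcpy(&tag, (const char *)obj + pool->size, sizeof(tag));
    return tag == pool;
}

/* Release with verification: returns -1 (the real code would crash) when
 * the tag was overwritten or the object belongs to another pool.
 */
static int tagged_free(struct pool *pool, void *obj)
{
    if (!tag_matches(pool, obj))
        return -1;
    free(obj);
    return 0;
}
```

A write that runs one byte past the object lands on the tag, so both an
overflow and a release to the wrong pool show up as the same mismatch.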
| 210 | |
| 211 | Thus an object will look like this depending whether it's in the cache or is |
| 212 | currently in use: |
| 213 | |
| 214 | in cache in use |
| 215 | +------------+ +------------+ |
| 216 | <--+ by_pool.p | | N bytes | |
| 217 | | by_pool.n +--> | | |
| 218 | +------------+ |N=16 min on | |
| 219 | <--+ by_lru.p | | 32-bit, | |
| 220 | | by_lru.n +--> | 32 min on | |
| 221 | +------------+ | 64-bit | |
| 222 | : : : : |
| 223 | | N bytes | | | |
| 224 | +------------+ +------------+ \ optional, only if |
| 225 | : (unused) : : pool ptr : > DEBUG_MEMORY_POOLS |
| 226 | +------------+ +------------+ / is set at build time |
| 227 | or -dMtag at boot time |
| 228 | |
| 229 | Right now no provisions are made to return objects aligned on larger boundaries |
| 230 | than those currently covered by malloc() (i.e. two pointers). This need appears |
| 231 | from time to time and the layout above might evolve a little bit if needed. |
| 232 | |
| 233 | |
| 234 | 4. Storage in the process-wide shared pool |
| 235 | ------------------------------------------ |
| 236 | |
| 237 | In order for the shared pool not to be a contention point in a multi-threaded |
| 238 | environment, objects are allocated from or released to shared pools by clusters |
| 239 | of a few objects at once. The maximum number of objects that may be moved to or |
| 240 | from a shared pool at once is defined by CONFIG_HAP_POOL_CLUSTER_SIZE at build |
| 241 | time, and currently defaults to 8. |
| 242 | |
| 243 | In order to remain scalable, the shared pool has to make some tradeoffs to |
| 244 | limit the number of atomic operations and the duration of any locked operation. |
| 245 | As such, it's composed of a singly-linked list of clusters, themselves made of |
| 246 | a singly-linked list of objects. |
| 247 | |
| 248 | Clusters and objects are of the same type "pool_item" and are accessed from the |
| 249 | pool's "free_list" member. This member points to the latest pool_item inserted |
| 250 | into the pool by a release operation. And the pool_item's "next" member points |
| 251 | to the next pool_item, which was the one present in the pool's free_list just |
| 252 | before the pool_item was inserted, and the last pool_item in the list simply |
| 253 | has a NULL "next" field. |
| 254 | |
| 255 | The pool_item's "down" pointer points down to the next object belonging to the |
| 256 | same cluster, which will be released or allocated at the same time as the first |
| 257 | one. Each of these items also has a NULL "next" field, and they are chained by |
| 258 | their respective "down" pointers until the last one is detected by a NULL value. |
| 259 | |
| 260 | This results in the following layout: |
| 261 | |
| 262 | pool pool_item pool_item pool_item |
| 263 | +-----------+ +------+ +------+ +------+ |
| 264 | | free_list +--> | next +--> | next +--> | NULL | |
| 265 | +-----------+ +------+ +------+ +------+ |
| 266 | | down | | NULL | | down | |
| 267 | +--+---+ +------+ +--+---+ |
| 268 | | | |
| 269 | V V |
| 270 | +------+ +------+ |
| 271 | | NULL | | NULL | |
| 272 | +------+ +------+ |
| 273 | | down | | NULL | |
| 274 | +--+---+ +------+ |
| 275 | | |
| 276 | V |
| 277 | +------+ |
| 278 | | NULL | |
| 279 | +------+ |
| 280 | | NULL | |
| 281 | +------+ |
| 282 | |
| 283 | Allocating an entry is only a matter of performing two atomic operations on |
| 284 | the free_list and reading the first pool_item's "next" value: |
| 285 | |
| 286 | - atomically mark the free_list as being updated by writing a "magic" pointer |
| 287 | - read the first pool_item's "next" field |
| 288 | - atomically replace the free_list with this value |
| 289 | |
| 290 | This results in a fast operation that retrieves a whole cluster at once. |
| 291 | Then outside of the critical section entries are walked over and inserted into |
| 292 | the local cache one at a time. In order to keep the code simple and efficient, |
| 293 | objects allocated from the shared pool are all placed into the local cache, and |
| 294 | only then the first one is allocated from the cache. This operation is |
| 295 | performed by the dedicated function pool_refill_local_from_shared() which is |
| 296 | called from pool_get_from_cache() when the cache is empty. It means there is an |
| 297 | overhead of two list insert/delete operations for the first object and that |
| 298 | could be avoided at the expense of more complex code in the fast path, but this |
| 299 | is negligible since it only concerns objects that need to be visited anyway. |
| 300 | |
| 301 | Freeing a group of objects consists in performing the operation the other way |
| 302 | around: |
| 303 | |
| 304 | - atomically mark the free_list as being updated by writing a "magic" pointer |
| 305 | - write the free_list value to the to-be-released item's "next" entry |
| 306 | - atomically replace the free_list with the pool_item's pointer |
| 307 | |
| 308 | The cluster will simply have to be prepared before being sent to the shared |
| 309 | pool. The operation of releasing a cluster at once is performed by function |
| 310 | pool_put_to_shared_cache() which is called from pool_evict_last_items() which |
| 311 | itself is responsible for building the clusters. |
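
The two three-step sequences above (allocation and release of a cluster) can
be sketched with C11 atomics. This is an illustration under assumptions: the
"magic" pointer value BUSY, the use of atomic_exchange() in a retry loop and
the names shared_get_cluster() and shared_put_cluster() are all hypothetical,
and the real code may use different primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct item {                  /* modeled after "pool_item" */
    struct item *next;         /* next cluster in the shared pool */
    struct item *down;         /* next object of the same cluster */
};

/* Hypothetical "magic" pointer marking the free_list as being updated. */
#define BUSY ((struct item *)1)

struct shared_pool {
    _Atomic(struct item *) free_list;
};

/* Step 1: write the magic pointer and get the previous head. If another
 * thread already holds the list, BUSY is read back and we simply retry.
 */
static struct item *lock_free_list(struct shared_pool *p)
{
    struct item *head;

    do {
        head = atomic_exchange(&p->free_list, BUSY);
    } while (head == BUSY);
    return head;
}

/* Detach the first cluster, or return NULL if the pool is empty. */
static struct item *shared_get_cluster(struct shared_pool *p)
{
    struct item *head = lock_free_list(p);

    if (!head) {
        atomic_store(&p->free_list, (struct item *)NULL);
        return NULL;
    }
    /* Steps 2 and 3: read "next" and publish it as the new head. */
    atomic_store(&p->free_list, head->next);
    head->next = NULL;
    return head;
}

/* Push a prepared cluster (objects chained via "down") onto the pool. */
static void shared_put_cluster(struct shared_pool *p, struct item *cluster)
{
    struct item *head = lock_free_list(p);

    cluster->next = head;
    atomic_store(&p->free_list, cluster);
}
```

The critical section is limited to one pointer read or write regardless of the
cluster length; walking the "down" chain happens after the list has been
released.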
| 312 | |
| 313 | Due to the way objects are stored, it is important to try to group objects as |
| 314 | much as possible when releasing them because this is what will condition their |
| 315 | retrieval as groups as well. This is the reason why pool_evict_last_items() |
| 316 | uses the LRU to find a first entry but tries to pick several items at once from |
| 317 | a single cache. Tests have shown that CONFIG_HAP_POOL_CLUSTER_SIZE set to 8 |
| 318 | achieves up to 6-6.5 objects on average per operation, which effectively |
| 319 | divides by as much the average time spent per object by each thread and pushes |
| 320 | the contention point further. |
| 321 | |
| 322 | Also, grouping items in clusters is a property of the process-wide shared pool |
| 323 | and not of the thread-local caches. This means that there is no grouped |
| 324 | operation when not using the shared pool (mode "2" in the diagram above). |
| 325 | |
| 326 | |
| 327 | 5. API |
| 328 | ------ |
| 329 | |
| 330 | The following functions are public and available for user code: |
| 331 | |
| 332 | struct pool_head *create_pool(char *name, uint size, uint flags) |
| 333 | Create a new pool named <name> for objects of size <size> bytes. Pool |
| 334 | names are truncated to their first 11 characters. Pools of very similar |
| 335 | size will usually be merged if both have set the flag MEM_F_SHARED in |
| 336 | <flags>. When DEBUG_DONT_SHARE_POOLS was set at build time, or |
| 337 | "-dMno-merge" is passed on the executable's command line, the pools |
| 338 | also need to have the exact same name to be merged. In addition, unless |
| 339 | MEM_F_EXACT is set in <flags>, the object size will usually be rounded |
| 340 | up to the size of pointers (16 or 32 bytes). The name that will appear |
| 341 | in the pool upon merging is the name of the first created pool. The |
| 342 | returned pointer is the new (or reused) pool head, or NULL upon error. |
| 343 | Pools created this way must be destroyed using pool_destroy(). |
| 344 | |
| 345 | void *pool_destroy(struct pool_head *pool) |
| 346 | Destroy pool <pool>, that is, all of its unused objects are freed and |
| 347 | the structure is freed as well if the pool didn't have any used objects |
| 348 | anymore. In this case NULL is returned. If some objects remain in use, |
| 349 | the pool is preserved and its pointer is returned. This ought to be |
| 350 | used essentially on exit or in rare situations where some internal |
| 351 | entities that hold pools have to be destroyed. |
| 352 | |
| 353 | void pool_destroy_all(void) |
| 354 | Destroy all pools, without checking which ones still have used entries. |
| 355 | This is only meant for use on exit. |
| 356 | |
| 357 | void *__pool_alloc(struct pool_head *pool, uint flags) |
| 358 | Allocate an entry from the pool <pool>. The allocator will first look |
| 359 | for an object in the thread-local cache if enabled, then in the shared |
| 360 | pool if enabled, then will fall back to the operating system's default |
| 361 | allocator. NULL is returned if the object couldn't be allocated (due to |
| 362 | configured limits or lack of memory). Objects allocated this way have to |
| 363 | be released using pool_free(). Like with malloc(), by default the |
| 364 | contents of the returned object are undefined. If memory poisoning is |
| 365 | enabled, the object will be filled with the poisoning byte. If the |
| 366 | global "tune.fail-alloc" setting is non-zero and DEBUG_FAIL_ALLOC is |
| 367 | enabled, a random number generator will be called to randomly return a |
| 368 | NULL. The allocator's behavior may be adjusted using a few flags passed |
| 369 | in <flags>: |
| 370 | - POOL_F_NO_POISON : when set, disables memory poisoning (e.g. when |
| 371 | pointless and expensive, like for buffers) |
| 372 | - POOL_F_MUST_ZERO : when set, the memory area will be zeroed before |
| 373 | being returned, similar to what calloc() does |
| 374 | - POOL_F_NO_FAIL : when set, disables the random allocation failure, |
| 375 | e.g. for use during early init code or critical sections. |
| 376 | |
| 377 | void *pool_alloc(struct pool_head *pool) |
| 378 | This is an exact equivalent of __pool_alloc(pool, 0). It is the regular |
| 379 | way to allocate entries from a pool. |
| 380 | |
| 381 | void *pool_alloc_nocache(struct pool_head *pool) |
| 382 | Allocate an entry from the pool <pool>, bypassing the cache. If shared |
| 383 | pools are enabled, they will be consulted first. Otherwise the object |
| 384 | is allocated using the operating system's default allocator. This is |
| 385 | essentially used during early boot to pre-allocate a number of objects |
| 386 | for pools which require a minimum number of entries to exist. |
| 387 | |
| 388 | void *pool_zalloc(struct pool_head *pool) |
| 389 | This is an exact equivalent of __pool_alloc(pool, POOL_F_MUST_ZERO). |
| 390 | |
| 391 | void pool_free(struct pool_head *pool, void *ptr) |
| 392 | Free an entry allocated by one of the pool_alloc() functions above |
| 393 | from pool <pool>. The object will be placed into the thread-local cache |
| 394 | if enabled, or in the shared pool if enabled, or will be released using |
| 395 | the operating system's default allocator. When a local cache is |
| 396 | enabled, if the local cache size becomes larger than 75% of the maximum |
| 397 | size configured at build time, some objects will be evicted to the |
| 398 | shared pool. Such objects are taken first from the same pool, but if |
| 399 | the total size is really huge, other pools might be checked as well. |
| 400 | Extra checks enabled at build time may make the process crash |
| 401 | immediately if the object was not allocated from this pool, or if it |
| 402 | experienced an overflow or some memory corruption. |
| 403 | |
| 404 | void pool_flush(struct pool_head *pool) |
| 405 | Free all unused objects from shared pool <pool>. Thread-local caches |
| 406 | are not affected. This is essentially used when running low on memory |
| 407 | or when stopping, in order to release a maximum amount of memory for |
| 408 | the new process. |
| 409 | |
| 410 | void pool_gc(struct pool_head *pool) |
| 411 | Free all unused objects from all pools, but respecting the minimum |
| 412 | number of spare objects required for each of them. Then, for operating |
| 413 | systems which support it, indicate to the system that all unused memory |
| 414 | can be released. Thread-local caches are not affected. This operation |
| 415 | differs from pool_flush() in that it is run locklessly, under thread |
| 416 | isolation, and on all pools in a row. It is called by the SIGQUIT |
| 417 | signal handler and upon exit. Note that the obsolete argument <pool> is |
| 418 | not used and the convention is to pass NULL there. |
| 419 | |
| 420 | void dump_pools_to_trash(void) |
| 421 | Dump the current status of all pools into the trash buffer. This is |
| 422 | essentially used by the "show pools" CLI command or the SIGQUIT signal |
| 423 | handler to dump them on stderr. The total report size may not exceed |
| 424 | the size of the trash buffer. If it does, some entries will be missing. |
| 425 | |
| 426 | void dump_pools(void) |
| 427 | Dump the current status of all pools to stderr. This just calls |
| 428 | dump_pools_to_trash() and writes the trash to stderr. |
| 429 | |
| 430 | int pool_total_failures(void) |
| 431 | Report the total number of failed allocations. This is solely used to |
| 432 | report the "PoolFailed" metric of the "show info" output. The total |
| 433 | is calculated on the fly by summing the number of failures in all pools |
| 434 | and is only meant to be used as an indicator rather than a precise |
| 435 | measure. |
| 436 | |
| 437 | ulong pool_total_allocated(void) |
| 438 | Report the total number of bytes allocated in all pools, for reporting |
| 439 | in the "PoolAlloc_MB" field of the "show info" output. The total is |
| 440 | calculated on the fly by summing the number of allocated bytes in all |
| 441 | pools and is only meant to be used as an indicator rather than a |
| 442 | precise measure. |
| 443 | |
| 444 | ulong pool_total_used(void) |
| 445 | Report the total number of bytes used in all pools, for reporting in |
| 446 | the "PoolUsed_MB" field of the "show info" output. The total is |
| 447 | calculated on the fly by summing the number of used bytes in all pools |
| 448 | and is only meant to be used as an indicator rather than a precise |
| 449 | measure. Note that objects present in caches are accounted as used. |
| 450 | |
| 451 | Some other functions exist and are only used by the pools code itself. While |
| 452 | not strictly forbidden to use outside of this code, it is generally recommended |
| 453 | to avoid touching them in order not to create undesired dependencies that will |
| 454 | complicate maintenance. |
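
The calling pattern for the functions above may look like the sketch below.
To keep it compilable on its own, create_pool(), pool_zalloc(), pool_free()
and pool_destroy() are replaced by trivial malloc()-based stand-ins that only
reproduce the documented signatures and return values. They are not the real
implementation, the MEM_F_SHARED value is a placeholder, and struct session
and session_demo() are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* ---- stand-ins for the pools API, for illustration only ---- */
struct pool_head { unsigned int size; unsigned int used; };
#define MEM_F_SHARED 0x1       /* placeholder value */

static struct pool_head *create_pool(char *name, unsigned int size,
                                     unsigned int flags)
{
    struct pool_head *pool = calloc(1, sizeof(*pool));

    (void)name;
    (void)flags;
    if (pool)
        pool->size = size;
    return pool;
}

static void *pool_zalloc(struct pool_head *pool)
{
    void *ptr = calloc(1, pool->size);

    if (ptr)
        pool->used++;
    return ptr;
}

static void pool_free(struct pool_head *pool, void *ptr)
{
    free(ptr);
    pool->used--;
}

/* Returns NULL once destroyed, or the pool if objects remain in use. */
static void *pool_destroy(struct pool_head *pool)
{
    if (pool->used)
        return pool;
    free(pool);
    return NULL;
}

/* ---- caller side: the part this sketch is about ---- */
struct session { int id; char peer[32]; };

static int session_demo(void)
{
    static char name[] = "session";
    struct pool_head *pool;
    struct session *s;

    pool = create_pool(name, sizeof(struct session), MEM_F_SHARED);
    if (!pool)
        return -1;

    s = pool_zalloc(pool);             /* zeroed, like calloc() */
    if (!s) {
        pool_destroy(pool);
        return -1;
    }
    s->id = 42;

    /* an object is still in use: the pool is preserved and returned */
    assert(pool_destroy(pool) == pool);

    pool_free(pool, s);                /* always release to the same pool */
    return pool_destroy(pool) == NULL ? 0 : -1;
}
```

The points worth noting are the NULL checks after each allocation, the release
to the same pool the object came from, and the fact that pool_destroy() only
frees the pool once no object remains in use.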
| 455 | |
| 456 | A few macros exist to ease the declaration of pools: |
| 457 | |
| 458 | DECLARE_POOL(ptr, name, size) |
| 459 | Placed at the top level of a file, this declares a global memory pool |
| 460 | as variable <ptr>, name <name> and size <size> bytes per element. This |
| 461 | is made via a call to REGISTER_POOL() and by assigning the resulting |
| 462 | pointer to variable <ptr>. <ptr> will be created of type "struct |
| 463 | pool_head *". If the pool needs to be visible outside of the file |
| 464 | (which is likely), it will also need to be declared somewhere as |
| 465 | "extern struct pool_head *<ptr>;". It is recommended to place such |
| 466 | declarations very early in the source file so that the variable is |
| 467 | already known to all subsequent functions which may use it. |
| 468 | |
| 469 | DECLARE_STATIC_POOL(ptr, name, size) |
| 470 | Placed at the top level of a file, this declares a static memory pool |
| 471 | as variable <ptr>, name <name> and size <size> bytes per element. This |
| 472 | is made via a call to REGISTER_POOL() and by assigning the resulting |
| 473 | pointer to local variable <ptr>. <ptr> will be created of type "static |
| 474 | struct pool_head *". It is recommended to place such declarations very |
| 475 | early in the source file so that the variable is already known to all |
| 476 | subsequent functions which may use it. |
| 477 | |
| 478 | |
| 479 | 6. Build options |
| 480 | ---------------- |
| 481 | |
| 482 | A number of build-time defines allow tuning the pools behavior. All of them |
| 483 | have to be enabled using "-Dxxx" or "-Dxxx=yyy" in the makefile's DEBUG |
| 484 | variable. |
| 485 | |
| 486 | DEBUG_NO_POOLS |
| 487 | When this is set, pools are entirely disabled, and allocations are made |
| 488 | using malloc() instead. This is not recommended for production but may |
| 489 | be useful for tracing allocations. It corresponds to "-dMno-cache" at |
| 490 | boot time. |
| 491 | |
| 492 | DEBUG_MEMORY_POOLS |
| 493 | When this is set, an extra pointer is allocated at the end of each |
| 494 | object to reference the pool the object was allocated from and detect |
| 495 | buffer overflows. Then, pool_free() will provoke a crash in case it |
| 496 | detects an anomaly (pointer at the end not matching the pool). It |
| 497 | corresponds to "-dMtag" at boot time. |
| 498 | |
| 499 | DEBUG_FAIL_ALLOC |
| 500 | When enabled, a global setting "tune.fail-alloc" may be set to a non- |
| 501 | zero value representing a percentage of memory allocations that will be |
| 502 | made to fail in order to stress the calling code. It corresponds to |
| 503 | "-dMfail" at boot time. |
| 504 | |
| 505 | DEBUG_DONT_SHARE_POOLS |
| 506 | When enabled, pools of similar sizes are not merged unless they have the |
| 507 | exact same name. It corresponds to "-dMno-merge" at boot time. |
| 508 | |
| 509 | DEBUG_UAF |
| 510 | When enabled, pools are disabled and all allocations and releases pass |
| 511 | through mmap() and munmap(). The memory usage significantly inflates |
| 512 | and the performance degrades, but this makes it possible to detect a lot of |
| 513 | use-after-free conditions by crashing the program at the first abnormal |
| 514 | access. This should not be used in production. |
| 515 | |
| 516 | DEBUG_POOL_INTEGRITY |
| 517 | When enabled, objects picked from the cache are checked for corruption |
| 518 | by comparing their contents against a pattern that was placed when they |
| 519 | were inserted into the cache. Objects are also allocated in the reverse |
| 520 | order, from the oldest one to the most recent, so as to maximize the |
| 521 | ability to detect such a corruption. The goal is to detect writes after |
| 522 | free (or possibly hardware memory corruptions). Contrary to DEBUG_UAF |
| 523 | this cannot detect reads after free, but may possibly detect later |
| 524 | corruptions and will not consume extra memory. The CPU usage will |
| 525 | increase a bit due to the cost of filling/checking the area and for the |
| 526 | preference for cold cache instead of hot cache, though not as much as |
Willy Tarreau | 0722d5d | 2022-02-24 08:58:04 +0100 | [diff] [blame^] | 527 | with DEBUG_UAF. This option is meant to be usable in production. It |
| 528 | corresponds to boot-time options "-dMcold-first,integrity". |
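
The fill-then-check principle combined with oldest-first allocation can be
sketched as follows. Everything here is hypothetical and simplified: the fill
byte, the fixed-size array standing in for the thread-local cache, and the
integ_*() names are assumptions made for this example, and corruption is
reported through a flag where real code would rather abort.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INTEG_POISON    0x5A  /* arbitrary fill byte chosen for this sketch */
#define INTEG_CACHE_MAX 8

/* Hypothetical fixed-size cache of released objects, oldest at index 0. */
struct integ_cache {
	size_t obj_size;
	size_t count;
	void *slots[INTEG_CACHE_MAX];
};

/* Fill a released object with the pattern and append it to the cache.
 * Returns 0 on success, -1 if the cache is full.
 */
static int integ_put(struct integ_cache *c, void *obj)
{
	assert(c->obj_size > 0);
	if (c->count == INTEG_CACHE_MAX)
		return -1;
	memset(obj, INTEG_POISON, c->obj_size);
	c->slots[c->count++] = obj;
	return 0;
}

/* Return 1 if every byte of <obj> still holds the pattern. */
static int integ_check(const struct integ_cache *c, const void *obj)
{
	const unsigned char *p = obj;
	size_t i;

	for (i = 0; i < c->obj_size; i++)
		if (p[i] != INTEG_POISON)
			return 0;
	return 1;
}

/* Pick the oldest cached object ("cold first"). <*corrupted> is set to 1 if
 * its contents changed while it sat in the cache. Returns NULL when empty.
 */
static void *integ_get(struct integ_cache *c, int *corrupted)
{
	void *obj;

	if (!c->count)
		return NULL;
	obj = c->slots[0];
	c->count--;
	memmove(&c->slots[0], &c->slots[1], c->count * sizeof(c->slots[0]));
	*corrupted = !integ_check(c, obj);
	return obj;
}
```

Handing out the oldest object first leaves each released object in the cache
for as long as possible, which widens the window during which a stray write
can be noticed.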
Willy Tarreau | 0575d8f | 2022-01-21 19:00:25 +0100 | [diff] [blame] | 529 | |
Willy Tarreau | add43fa | 2022-01-24 15:52:51 +0100 | [diff] [blame] | 530 | DEBUG_POOL_TRACING |
| 531 | When enabled, the callers of pool_alloc() and pool_free() will be |
| 532 | recorded into an extra memory area placed after the end of the object. |
| 533 | This may only be required by developers who want to get a few more |
| 534 | hints about code paths involved in some crashes, but will serve no |
| 535 | purpose outside of this. It remains compatible with (and complements) |
| 536 | DEBUG_POOL_INTEGRITY above. Such information becomes meaningless once |
Willy Tarreau | 0722d5d | 2022-02-24 08:58:04 +0100 | [diff] [blame^] | 537 | the objects leave the thread-local cache. It corresponds to boot-time |
| 538 | option "-dMcaller". |
Willy Tarreau | add43fa | 2022-01-24 15:52:51 +0100 | [diff] [blame] | 539 | |
Willy Tarreau | b64ef3e | 2022-01-11 14:48:20 +0100 | [diff] [blame] | 540 | DEBUG_MEM_STATS |
| 541 | When enabled, all malloc/calloc/realloc/strdup/free calls are accounted |
| 542 | for per call place (file+line number), and may be displayed or reset on |
| 543 | the CLI using "debug dev memstats". This is essentially used to detect |
| 544 | potential leaks or abnormal usages. When pools are enabled (default), |
| 545 | such calls are rare and the output will mostly contain calls induced by |
| 546 | libraries. When pools are disabled, almost all calls to pool_alloc() and |
| 547 | pool_free() will also appear since they will be remapped to standard |
| 548 | functions. |
| 549 | |
| 550 | CONFIG_HAP_GLOBAL_POOLS |
| 551 | When enabled, process-wide shared pools will be forcefully enabled even |
| 552 | if not considered useful on the platform. The default is to let haproxy |
Willy Tarreau | 0722d5d | 2022-02-24 08:58:04 +0100 | [diff] [blame^] | 553 | decide based on the OS and C library. It corresponds to boot-time |
| 554 | option "-dMglobal". |
Willy Tarreau | b64ef3e | 2022-01-11 14:48:20 +0100 | [diff] [blame] | 555 | |
| 556 | CONFIG_HAP_NO_GLOBAL_POOLS |
| 557 | When enabled, process-wide shared pools will be forcefully disabled |
| 558 | even if considered useful on the platform. The default is to let |
Willy Tarreau | 0722d5d | 2022-02-24 08:58:04 +0100 | [diff] [blame^] | 559 | haproxy decide based on the OS and C library. It corresponds to |
| 560 | boot-time option "-dMno-global". |
Willy Tarreau | b64ef3e | 2022-01-11 14:48:20 +0100 | [diff] [blame] | 561 | |
| 562 | CONFIG_HAP_POOL_CACHE_SIZE |
| 563 | This allows one to define the size of the per-thread cache, in bytes. |
| 564 | The default value is 512 kB (524288). Smaller values will use less |
| 565 | memory at the expense of a possibly higher CPU usage when using many |
| 566 | threads. Higher values will give diminishing returns on performance |
| 567 | while using much more memory. Usually there is no benefit in using |
| 568 | more than a per-core L2 cache size. It would be better not to set this |
| 569 | value lower than a few times the size of a buffer (bufsize, defaults to |
| 570 | 16 kB). |
| 571 | |
| 572 | CONFIG_HAP_POOL_CLUSTER_SIZE |
| 573 | This allows one to define the maximum number of objects that will be |
| 574 | grouped together in an allocation from the shared pool. Values 4 to 8 |
| 575 | have experimentally shown good results with 16 threads. On systems with |
| 576 | more cores or loosely coupled caches exhibiting slow atomic operations, |
| 577 | it could possibly make sense to slightly increase this value. |