2022-02-24 - Pools structure and API

1. Background
-------------

Memory allocation is a complex problem covered by a massive amount of
literature. Memory allocators found in the field cover a broad spectrum of
capabilities, performance, fragmentation, efficiency, etc.

The main difficulty of memory allocation comes from finding the optimal chunks
for arbitrarily sized requests while still preserving a low fragmentation
level. Doing this well is often expensive in CPU and/or memory usage.

In programs like HAProxy that deal with a large number of fixed-size objects,
there is no point in enduring this risk of fragmentation, and the associated
costs (sometimes up to several milliseconds with certain minimalist
allocators) are simply not acceptable. A better approach consists in grouping
frequently used objects by size, knowing that due to the high repetitiveness of
operations, a freed object will immediately be needed for another operation.

This grouping of objects by size is what is called a pool. Pools are created
for certain frequently allocated objects, are usually merged together when they
are of the same size (or almost the same size), and significantly reduce the
number of calls to the memory allocator.

With the arrival of threads, pools started to become a bottleneck, so they now
implement an optional thread-local lockless cache. Finally, with the arrival of
really efficient memory allocators in modern operating systems, the shared part
has also become optional so that it doesn't consume memory if it does not bring
any value.

In 2.6-dev2, a number of debugging options that used to be configured at build
time only became boot-time options that can be modified using keywords passed
after "-dM" on the command line, which sets or clears bits in the
pool_debugging variable. The build-time options still affect the default
settings, however. Default values may be consulted using "haproxy -dMhelp".

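For example, the following invocations first list the supported debugging
keywords with their defaults, then start the process with allocation failure
injection and pool tagging enabled (the configuration path below is only an
illustration):

    $ haproxy -dMhelp
    $ haproxy -dMfail,tag -f /etc/haproxy/haproxy.cfg
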

2. Principles
-------------

The pools architecture is selected at build time and may be adjusted at boot
time. The main options are:

  - thread-local caches and process-wide shared pool enabled (1)

    This is the default situation on most operating systems. Each thread has
    its own local cache, and when depleted it refills from the process-wide
    pool, which avoids calling the standard allocator too often. It is
    possible to force this mode at build time by setting
    CONFIG_HAP_GLOBAL_POOLS or at boot time with "-dMglobal".

  - thread-local caches only are enabled (2)

    This is the situation on operating systems where a fast and modern memory
    allocator is detected and when it is estimated that the process-wide
    shared pool will not bring any benefit. This detection is automatic at
    build time, but may also be forced at build time by setting
    CONFIG_HAP_NO_GLOBAL_POOLS or at boot time with "-dMno-global".

  - pass-through to the standard allocator (3)

    This is used when one absolutely wants to disable pools and rely on
    regular malloc() and free() calls, essentially in order to trace memory
    allocations by call points, either internally via DEBUG_MEM_STATS, or
    externally via tools such as Valgrind. This mode of operation may be
    forced at build time by setting DEBUG_NO_POOLS or at boot time with
    "-dMno-cache".

  - pass-through to an mmap-based allocator for debugging (4)

    This is used only during deep debugging when trying to detect various
    conditions such as use-after-free. In this case each allocated object's
    size is rounded up to a multiple of a page size (4096 bytes) and an
    integral number of pages is allocated for each object using mmap(),
    surrounded by two inaccessible holes that aim to detect out-of-bounds
    accesses. Released objects are instantly freed using munmap() so that any
    immediate subsequent access to the memory area crashes the process if the
    area had not been reallocated yet. This mode can be enabled at build time
    by setting DEBUG_UAF, or at run time by disabling pools and enabling UAF
    with "-dMuaf". It tends to consume a lot of memory and not to scale at
    all with concurrent calls, which tends to make the system stall. The
    watchdog may even trigger on some slow allocations.

There are no more provisions for running with a shared pool but no thread-local
cache: the shared pool's main goal is to compensate for the expensive calls to
the memory allocator. This gain may be huge on tiny systems using basic
allocators, but the thread-local cache will already achieve this. And on larger
threaded systems, the shared pool's benefit is visible when the underlying
allocator scales poorly, but in this case the shared pool would suffer from
the same limitations without its thread-local cache and wouldn't provide any
benefit.

Summary of the various operation modes:

                  (1)            (2)            (3)            (4)

                 User           User           User           User
                   |              |              |              |
  pool_alloc()     V              V              |              |
              +---------+    +---------+         |              |
              | Thread  |    | Thread  |         |              |
              |  Local  |    |  Local  |         |              |
              |  Cache  |    |  Cache  |         |              |
              +---------+    +---------+         |              |
                   |              |              |              |
  pool_refill*()   V              |              |              |
              +---------+         |              |              |
              | Shared  |         |              |              |
              |  Pool   |         |              |              |
              +---------+         |              |              |
                   |              |              |              |
  malloc()         V              V              V              |
              +---------+    +---------+    +---------+         |
              | Library |    | Library |    | Library |         |
              +---------+    +---------+    +---------+         |
                   |              |              |              |
  mmap()           V              V              V              V
              +---------+    +---------+    +---------+    +---------+
              |   OS    |    |   OS    |    |   OS    |    |   OS    |
              +---------+    +---------+    +---------+    +---------+

One extra build define, DEBUG_FAIL_ALLOC, is used to enforce random allocation
failure in pool_alloc() by randomly returning NULL, to test that callers
properly handle allocation failures. It may also be enabled at boot time using
"-dMfail". In this case the desired average rate of allocation failures can be
set via the global setting "tune.fail-alloc", expressed in percent.

The thread-local caches contain the freshest objects. Their total size amounts
to the number of bytes set in global.tune.pool_cache_size, which may be
adjusted with the "tune.memory.hot-size" global option, itself defaulting to
the build-time setting CONFIG_HAP_POOL_CACHE_SIZE (1 MB before 2.6, 512 kB
since). The aim is to keep hot objects that still fit in the CPU core's
private L2 cache. Once these objects do not fit into the cache anymore,
there's no benefit in keeping them local to the thread, so they'd rather be
returned to the shared pool or the main allocator so that any other thread may
make use of them. Under extreme thread contention the cost of accessing shared
structures in the global cache or in malloc() may still be significant and it
may prove useful to increase the thread-local cache size.
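
For example, on a machine with large per-core caches, one might raise the
per-thread cache to 1 MB from the global section (the value below is only an
illustration):

    global
        tune.memory.hot-size 1048576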


3. Storage in thread-local caches
---------------------------------

This section describes how objects are linked in thread-local caches. This is
not meant to be a concern for users of the pools API but it can be useful when
inspecting post-mortem dumps or when trying to figure out certain size
constraints.

Objects are stored in the local cache using a doubly-linked list. This ensures
that they can be visited by freshness order like a stack, while at the same
time being able to access them from oldest to newest when it is needed to
evict the coldest ones first:

  - releasing an object to the cache always puts it on the top.

  - allocating an object from the cache always takes the topmost one, hence
    the freshest one.

  - scanning for older objects to evict starts from the bottom, where the
    oldest ones are located.

To that end, each thread-local cache keeps a list head in the "list" member of
its "pool_cache_head" descriptor, which links all objects cast to type
"pool_cache_item" via their "by_pool" member.

Note that the mechanism described above only works for a single pool. When
trying to limit the total cache size to a certain value, all pools included,
there is also a need to arrange all objects from all pools together in the
local caches. For this, each thread_ctx maintains a list head of recently
released objects, all pools included, in its member "pool_lru_head". All items
in a thread-local cache are linked there via their "by_lru" member.
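
The following simplified sketch summarizes these structures; the real
definitions in the HAProxy sources carry additional members (counts, sizes,
etc.) and only the fields mentioned above are shown:

    /* simplified sketch, not the exact HAProxy definitions */
    struct list { struct list *n, *p; };   /* doubly-linked list element */

    struct pool_cache_item {
        struct list by_pool;    /* links items of the same pool together */
        struct list by_lru;     /* links all cached items of the thread  */
    };

    struct pool_cache_head {
        struct list list;       /* head of the items cached for this pool */
        /* ... counters ... */
    };

    struct thread_ctx {
        struct list pool_lru_head;   /* thread-wide LRU of cached items */
        /* ... many other per-thread fields ... */
    };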

This means that releasing an object using pool_free() consists in inserting
it at the beginning of two lists:
  - the local pool_cache_head's "list" list head
  - the thread context's "pool_lru_head" list head

Allocating an object consists in picking the first entry from the pool's
"list" and deleting its "by_pool" and "by_lru" links.

Evicting an object consists in scanning the thread context's "pool_lru_head"
backwards and deleting the object's "by_pool" and "by_lru" links.

Given that entries are both inserted and removed synchronously, we have the
guarantee that the oldest object in the thread's LRU list is always the oldest
object in its pool, and that the next element is the cache's list head. This
is what allows the LRU eviction mechanism to figure out what pool an object
belongs to when releasing it.

Note:
| Since a pool_cache_item has two list entries (i.e. four pointers), on 64-bit
| systems it will be 32 bytes long. This is the smallest size that a pool may
| be, and any smaller size will automatically be rounded up to this size.

When the build option DEBUG_POOL_INTEGRITY is set, or the boot-time option
"-dMintegrity" is passed on the command line, the area of the object between
the two list elements and the end according to pool->size will be filled with
pseudo-random words during pool_put_to_cache(), and these words will be
compared against each other during pool_get_from_cache(); the process will
crash if any bit differs, as this would indicate that the memory area was
modified after the free. The pseudo-random pattern is in fact incremented by
(~0)/3 upon each free so that roughly half of the bits change each time and we
maximize the likelihood of detecting a single bit flip in either direction. In
order to avoid an immediate reuse and maximize the time the object spends in
the cache, when this option is set, objects are picked from the cache starting
from the oldest one instead of the freshest one. This way even late memory
corruptions have a chance to be detected.
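
An illustrative sketch of this fill/check principle (not the exact HAProxy
code; the function names are made up and several details are simplified):

    #include <stdlib.h>

    static unsigned long fill_pattern;

    /* called when an object is put back into the local cache */
    static void fill_area(unsigned long *area, size_t words)
    {
        fill_pattern += ~0UL / 3;   /* flips roughly half the bits each time */
        for (size_t i = 0; i < words; i++)
            area[i] = fill_pattern;
    }

    /* called when an object is taken back from the local cache; all words
     * were written with the same value, so any difference between them
     * reveals a write-after-free.
     */
    static void check_area(const unsigned long *area, size_t words)
    {
        for (size_t i = 1; i < words; i++)
            if (area[i] != area[0])
                abort();    /* crash to produce an analyzable core */
    }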

When the build option DEBUG_MEMORY_POOLS is set, or the boot-time option
"-dMtag" is passed on the executable's command line, pool objects are
allocated with one extra pointer compared to the requested size, so that the
bytes that follow the memory area point to the pool descriptor itself as long
as the object is allocated via pool_alloc(). Upon releasing via pool_free(),
the pointer is compared and the code will crash if it differs. This makes it
possible to detect both memory overflows and objects released to the wrong
pool (typically a code bug resulting from a copy-paste error).

Thus an object will look like this depending on whether it's in the cache or
currently in use:

            in cache                in use
         +------------+         +------------+
      <--+ by_pool.p  |         |  N bytes   |
         |  by_pool.n +-->      |            |
         +------------+         |N=16 min on |
      <--+ by_lru.p   |         |  32-bit,   |
         |  by_lru.n  +-->      | 32 min on  |
         +------------+         |   64-bit   |
         :            :         :            :
         |  N bytes   |         |            |
         +------------+         +------------+  \ optional, only if
         :  (unused)  :         :  pool ptr  :   > DEBUG_MEMORY_POOLS
         +------------+         +------------+  / is set at build time
                                                  or -dMtag at boot time
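
A sketch of this tagging principle (illustrative only; the helper names and
the exact offset computation differ in the real implementation):

    #include <stdlib.h>

    /* write the owning pool's pointer right past the object's area */
    static void tag_object(void *obj, size_t size, const void *pool)
    {
        *(const void **)((char *)obj + size) = pool;
    }

    /* verify the tag on release; a mismatch means an overflow occurred
     * or the object is being released to the wrong pool.
     */
    static void check_tag(void *obj, size_t size, const void *pool)
    {
        if (*(const void **)((char *)obj + size) != pool)
            abort();
    }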

Right now no provisions are made to return objects aligned on larger
boundaries than those currently covered by malloc() (i.e. two pointers). This
need appears from time to time and the layout above might evolve a little bit
if needed.


4. Storage in the process-wide shared pool
------------------------------------------

In order for the shared pool not to be a contention point in a multi-threaded
environment, objects are allocated from or released to shared pools by
clusters of a few objects at once. The maximum number of objects that may be
moved to or from a shared pool at once is defined by
CONFIG_HAP_POOL_CLUSTER_SIZE at build time, and currently defaults to 8.

In order to remain scalable, the shared pool has to make some tradeoffs to
limit the number of atomic operations and the duration of any locked
operation. As such, it's composed of a singly-linked list of clusters,
themselves made of a singly-linked list of objects.

Clusters and objects are of the same type "pool_item" and are accessed from
the pool's "free_list" member. This member points to the latest pool_item
inserted into the pool by a release operation. The pool_item's "next" member
points to the next pool_item, which was the one present in the pool's
free_list just before this pool_item was inserted, and the last pool_item in
the list simply has a NULL "next" field.

The pool_item's "down" pointer points down to the next object of the same
cluster, which will be released or allocated at the same time as the first
one. Each of these items also has a NULL "next" field, and they are chained by
their respective "down" pointers until the last one is detected by a NULL
value.

This results in the following layout:

    pool        pool_item   pool_item   pool_item
+-----------+    +------+    +------+    +------+
| free_list +--> | next +--> | next +--> | NULL |
+-----------+    +------+    +------+    +------+
                 | down |    | NULL |    | down |
                 +--+---+    +------+    +--+---+
                    |                       |
                    V                       V
                 +------+                 +------+
                 | NULL |                 | NULL |
                 +------+                 +------+
                 | down |                 | NULL |
                 +--+---+                 +------+
                    |
                    V
                 +------+
                 | NULL |
                 +------+
                 | NULL |
                 +------+

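A minimal sketch of the pool_item type and the free_list head that this layout
implies (simplified; the real "struct pool_head" contains many more fields):

    struct pool_item {
        struct pool_item *next;   /* next cluster in the pool's free_list */
        struct pool_item *down;   /* next object within the same cluster  */
    };

    struct pool_head {
        struct pool_item *free_list;   /* most recently released cluster */
        /* ... sizes, counters, name, etc. ... */
    };
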
Allocating an entry is only a matter of performing two atomic operations on
the free_list and reading the first pool_item's "next" value:

  - atomically mark the free_list as being updated by writing a "magic"
    pointer
  - read the first pool_item's "next" field
  - atomically replace the free_list with this value

This results in a fast operation that instantly retrieves a cluster at once.
Then outside of the critical section entries are walked over and inserted into
the local cache one at a time. In order to keep the code simple and efficient,
objects allocated from the shared pool are all placed into the local cache,
and only then is the first one allocated from the cache. This operation is
performed by the dedicated function pool_refill_local_from_shared(), which is
called from pool_get_from_cache() when the cache is empty. It means there is
an overhead of two list insert/delete operations for the first object, which
could be avoided at the expense of more complex code in the fast path, but
this is negligible since it only concerns objects that need to be visited
anyway.

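The retrieval sequence may be sketched as follows (illustrative only: the
POOL_BUSY marker name, the helper name and the retry loop are simplified
compared to the real code):

    struct pool_item {
        struct pool_item *next;   /* next cluster in the free_list    */
        struct pool_item *down;   /* next object in the same cluster  */
    };

    #define POOL_BUSY ((struct pool_item *)1)   /* the "magic" pointer */

    static struct pool_item *pop_cluster(struct pool_item **free_list)
    {
        struct pool_item *item;

        /* atomically mark the free_list as being updated */
        while ((item = __atomic_exchange_n(free_list, POOL_BUSY,
                                           __ATOMIC_ACQUIRE)) == POOL_BUSY)
            ;   /* another thread is updating it, try again */

        if (!item) {
            /* empty list: restore NULL and report failure */
            __atomic_store_n(free_list, NULL, __ATOMIC_RELEASE);
            return NULL;
        }
        /* atomically replace the free_list with the next cluster */
        __atomic_store_n(free_list, item->next, __ATOMIC_RELEASE);
        return item;   /* follow item->down to visit the whole cluster */
    }
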
Freeing a group of objects consists in performing the operation the other way
around:

  - atomically mark the free_list as being updated by writing a "magic"
    pointer
  - write the free_list value to the to-be-released item's "next" entry
  - atomically replace the free_list with the pool_item's pointer

The cluster will simply have to be prepared before being sent to the shared
pool. The operation of releasing a cluster at once is performed by the
function pool_put_to_shared_cache(), which is called from
pool_evict_last_items(), which itself is responsible for building the
clusters.

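The release path is the mirror image of the retrieval sketch above, reusing
its declarations (again purely illustrative):

    static void push_cluster(struct pool_item **free_list,
                             struct pool_item *item)
    {
        struct pool_item *head;

        /* atomically mark the free_list as being updated */
        while ((head = __atomic_exchange_n(free_list, POOL_BUSY,
                                           __ATOMIC_ACQUIRE)) == POOL_BUSY)
            ;
        /* chain the previous head after the released cluster... */
        item->next = head;
        /* ...and atomically publish the new head, unlocking the list */
        __atomic_store_n(free_list, item, __ATOMIC_RELEASE);
    }
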
Due to the way objects are stored, it is important to try to group objects as
much as possible when releasing them because this is what will condition their
retrieval as groups as well. This is the reason why pool_evict_last_items()
uses the LRU to find a first entry but tries to pick several items at once
from a single cache. Tests have shown that CONFIG_HAP_POOL_CLUSTER_SIZE set to
8 achieves up to 6-6.5 objects on average per operation, which effectively
divides the average time spent per object by each thread by as much and pushes
the contention point further.

Also, grouping items in clusters is a property of the process-wide shared pool
and not of the thread-local caches. This means that there is no grouped
operation when not using the shared pool (mode "2" in the diagram above).


5. API
------

The following functions are public and available for user code:

struct pool_head *create_pool(char *name, uint size, uint flags)
        Create a new pool named <name> for objects of size <size> bytes. Pool
        names are truncated to their first 11 characters. Pools of very
        similar size will usually be merged if both have set the flag
        MEM_F_SHARED in <flags>. When DEBUG_DONT_SHARE_POOLS is set at build
        time, or "-dMno-merge" is passed on the executable's command line,
        the pools also need to have the exact same name to be merged. In
        addition, unless MEM_F_EXACT is set in <flags>, the object size will
        usually be rounded up to the size of pointers (16 or 32 bytes). The
        name that will appear in the pool upon merging is the name of the
        first created pool. The returned pointer is the new (or reused) pool
        head, or NULL upon error. Pools created this way must be destroyed
        using pool_destroy().

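For example, a subsystem may create a mergeable pool for its objects like this
(the type and names below are purely illustrative):

    struct conn { int fd; /* ... */ };

    struct pool_head *pool_head_conn;

    void init_conn_pool(void)
    {
        pool_head_conn = create_pool("conn", sizeof(struct conn),
                                     MEM_F_SHARED);
        if (!pool_head_conn) {
            /* handle the error, e.g. abort startup */
        }
    }
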
void *pool_destroy(struct pool_head *pool)
        Destroy pool <pool>, that is, all of its unused objects are freed and
        the structure is freed as well if the pool didn't have any used
        objects anymore. In this case NULL is returned. If some objects
        remain in use, the pool is preserved and its pointer is returned.
        This ought to be used essentially on exit or in rare situations where
        some internal entities that hold pools have to be destroyed.

void pool_destroy_all(void)
        Destroy all pools, without checking which ones still have used
        entries. This is only meant for use on exit.

void *__pool_alloc(struct pool_head *pool, uint flags)
        Allocate an entry from the pool <pool>. The allocator will first look
        for an object in the thread-local cache if enabled, then in the
        shared pool if enabled, then will fall back to the operating system's
        default allocator. NULL is returned if the object couldn't be
        allocated (due to configured limits or lack of memory). Objects
        allocated this way have to be released using pool_free(). Like with
        malloc(), by default the contents of the returned object are
        undefined. If memory poisoning is enabled, the object will be filled
        with the poisoning byte. If the global "tune.fail-alloc" setting is
        non-zero and DEBUG_FAIL_ALLOC is enabled, a random number generator
        will be called to randomly return NULL. The allocator's behavior may
        be adjusted using a few flags passed in <flags>:
          - POOL_F_NO_POISON : when set, disables memory poisoning (e.g. when
            pointless and expensive, like for buffers)
          - POOL_F_MUST_ZERO : when set, the memory area will be zeroed
            before being returned, similar to what calloc() does
          - POOL_F_NO_FAIL : when set, disables the random allocation
            failure, e.g. for use during early init code or critical
            sections.

void *pool_alloc(struct pool_head *pool)
        This is an exact equivalent of __pool_alloc(pool, 0). It is the
        regular way to allocate entries from a pool.

void *pool_alloc_nocache(struct pool_head *pool)
        Allocate an entry from the pool <pool>, bypassing the cache. If
        shared pools are enabled, they will be consulted first. Otherwise the
        object is allocated using the operating system's default allocator.
        This is essentially used during early boot to pre-allocate a number
        of objects for pools which require a minimum number of entries to
        exist.

void *pool_zalloc(struct pool_head *pool)
        This is an exact equivalent of __pool_alloc(pool, POOL_F_MUST_ZERO).

void pool_free(struct pool_head *pool, void *ptr)
        Free an entry allocated from one of the pool_alloc() functions above
        from pool <pool>. The object will be placed into the thread-local
        cache if enabled, or in the shared pool if enabled, or will be
        released using the operating system's default allocator. When a local
        cache is enabled, if the local cache size becomes larger than 75% of
        the maximum size configured at build time, some objects will be
        evicted to the shared pool. Such objects are taken first from the
        same pool, but if the total size is really huge, other pools might be
        checked as well. Some options enabled at build time may enforce extra
        checks so that the process will immediately crash if the object was
        not allocated from this pool or experienced an overflow or some
        memory corruption.

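A typical allocation/release cycle then looks like this (a sketch reusing the
hypothetical "conn" pool from the create_pool() example above):

    static int conn_example(void)
    {
        struct conn *c = pool_alloc(pool_head_conn);

        if (!c)
            return 0;   /* allocation failures must always be handled */
        c->fd = -1;     /* contents are undefined; initialize as needed */
        /* ... use the object ... */
        pool_free(pool_head_conn, c);   /* usually lands in the local cache */
        return 1;
    }
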
void pool_flush(struct pool_head *pool)
        Free all unused objects from shared pool <pool>. Thread-local caches
        are not affected. This is essentially used when running low on memory
        or when stopping, in order to release a maximum amount of memory for
        the new process.

void pool_gc(struct pool_head *pool)
        Free all unused objects from all pools, but respecting the minimum
        number of spare objects required for each of them. Then, for
        operating systems which support it, indicate to the system that all
        unused memory can be released. Thread-local caches are not affected.
        This operation differs from pool_flush() in that it is run
        locklessly, under thread isolation, and on all pools in a row. It is
        called by the SIGQUIT signal handler and upon exit. Note that the
        obsolete argument <pool> is not used and the convention is to pass
        NULL there.

void dump_pools_to_trash(void)
        Dump the current status of all pools into the trash buffer. This is
        essentially used by the "show pools" CLI command or the SIGQUIT
        signal handler to dump them on stderr. The total report size may not
        exceed the size of the trash buffer. If it does, some entries will be
        missing.

void dump_pools(void)
        Dump the current status of all pools to stderr. This just calls
        dump_pools_to_trash() and writes the trash to stderr.

int pool_total_failures(void)
        Report the total number of failed allocations. This is solely used to
        report the "PoolFailed" metrics of the "show info" output. The total
        is calculated on the fly by summing the number of failures in all
        pools and is only meant to be used as an indicator rather than a
        precise measure.

ullong pool_total_allocated(void)
        Report the total number of bytes allocated in all pools, for
        reporting in the "PoolAlloc_MB" field of the "show info" output. The
        total is calculated on the fly by summing the number of allocated
        bytes in all pools and is only meant to be used as an indicator
        rather than a precise measure.

ullong pool_total_used(void)
        Report the total number of bytes used in all pools, for reporting in
        the "PoolUsed_MB" field of the "show info" output. The total is
        calculated on the fly by summing the number of used bytes in all
        pools and is only meant to be used as an indicator rather than a
        precise measure. Note that objects present in caches are accounted
        as used.

Some other functions exist and are only used by the pools code itself. While
not strictly forbidden to use outside of this code, it is generally
recommended to avoid touching them in order not to create undesired
dependencies that will complicate maintenance.

A few macros exist to ease the declaration of pools:

DECLARE_POOL(ptr, name, size)
        Placed at the top level of a file, this declares a global memory pool
        as variable <ptr>, name <name> and size <size> bytes per element.
        This is made via a call to REGISTER_POOL() and by assigning the
        resulting pointer to variable <ptr>. <ptr> will be created of type
        "struct pool_head *". If the pool needs to be visible outside of the
        file (which is likely), it will also need to be declared somewhere as
        "extern struct pool_head *<ptr>;". It is recommended to place such
        declarations very early in the source file so that the variable is
        already known to all subsequent functions which may use it.

DECLARE_STATIC_POOL(ptr, name, size)
        Placed at the top level of a file, this declares a static memory pool
        as variable <ptr>, name <name> and size <size> bytes per element.
        This is made via a call to REGISTER_POOL() and by assigning the
        resulting pointer to the local variable <ptr>. <ptr> will be created
        of type "static struct pool_head *". It is recommended to place such
        declarations very early in the source file so that the variable is
        already known to all subsequent functions which may use it.

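For instance, a file-local pool for a hypothetical "struct msg" type could be
declared and used as follows (names are illustrative):

    struct msg { char buf[64]; };

    DECLARE_STATIC_POOL(pool_head_msg, "msg", sizeof(struct msg));

    /* later: struct msg *m = pool_alloc(pool_head_msg); ... */
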

6. Build options
----------------

A number of build-time defines allow tuning of the pools behavior. All of
them have to be enabled using "-Dxxx" or "-Dxxx=yyy" in the makefile's DEBUG
variable.

DEBUG_NO_POOLS
        When this is set, pools are entirely disabled, and allocations are
        made using malloc() instead. This is not recommended for production
        but may be useful for tracing allocations. It corresponds to
        "-dMno-cache" at boot time.

DEBUG_MEMORY_POOLS
        When this is set, an extra pointer is allocated at the end of each
        object to reference the pool the object was allocated from and detect
        buffer overflows. Then, pool_free() will provoke a crash in case it
        detects an anomaly (pointer at the end not matching the pool). It
        corresponds to "-dMtag" at boot time.

DEBUG_FAIL_ALLOC
        When enabled, a global setting "tune.fail-alloc" may be set to a non-
        zero value representing a percentage of memory allocations that will
        be made to fail in order to stress the calling code. It corresponds
        to "-dMfail" at boot time.

DEBUG_DONT_SHARE_POOLS
        When enabled, pools of similar sizes are not merged unless they have
        the exact same name. It corresponds to "-dMno-merge" at boot time.

DEBUG_UAF
        When enabled, pools are disabled and all allocations and releases
        pass through mmap() and munmap(). The memory usage significantly
        inflates and the performance degrades, but this allows many
        use-after-free conditions to be detected by crashing the program at
        the first abnormal access. This should not be used in production. It
        corresponds to the boot-time option "-dMuaf". Caching is disabled but
        may be re-enabled using "-dMcache".

DEBUG_POOL_INTEGRITY
        When enabled, objects picked from the cache are checked for
        corruption by comparing their contents against a pattern that was
        placed when they were inserted into the cache. Objects are also
        allocated in the reverse order, from the oldest one to the most
        recent, so as to maximize the ability to detect such a corruption.
        The goal is to detect writes after free (or possibly hardware memory
        corruptions). Contrary to DEBUG_UAF this cannot detect reads after
        free, but may possibly detect later corruptions and will not consume
        extra memory. The CPU usage will increase a bit due to the cost of
        filling/checking the area and the preference for cold cache instead
        of hot cache, though not as much as with DEBUG_UAF. This option is
        meant to be usable in production. It corresponds to the boot-time
        options "-dMcold-first,integrity".

DEBUG_POOL_TRACING
        When enabled, the callers of pool_alloc() and pool_free() will be
        recorded into an extra memory area placed after the end of the
        object. This may only be required by developers who want to get a few
        more hints about code paths involved in some crashes, but will serve
        no purpose outside of this. It remains compatible with (and
        complements well) DEBUG_POOL_INTEGRITY above. Such information
        becomes meaningless once the objects leave the thread-local cache. It
        corresponds to the boot-time option "-dMcaller".

DEBUG_MEM_STATS
        When enabled, all malloc/calloc/realloc/strdup/free calls are
        accounted for per call place (file+line number), and may be displayed
        or reset on the CLI using "debug dev memstats". This is essentially
        used to detect potential leaks or abnormal usages. When pools are
        enabled (default), such calls are rare and the output will mostly
        contain calls induced by libraries. When pools are disabled, nearly
        all calls to pool_alloc() and pool_free() will also appear since they
        will be remapped to standard functions.

CONFIG_HAP_GLOBAL_POOLS
        When enabled, process-wide shared pools will be forcefully enabled
        even if not considered useful on the platform. The default is to let
        haproxy decide based on the OS and C library. It corresponds to the
        boot-time option "-dMglobal".

CONFIG_HAP_NO_GLOBAL_POOLS
        When enabled, process-wide shared pools will be forcefully disabled
        even if considered useful on the platform. The default is to let
        haproxy decide based on the OS and C library. It corresponds to the
        boot-time option "-dMno-global".

CONFIG_HAP_POOL_CACHE_SIZE
        This allows one to define the default size of the per-thread cache,
        in bytes. The default value is 512 kB (524288). Smaller values will
        use less memory at the expense of a possibly higher CPU usage when
        using many threads. Higher values will give diminishing returns on
        performance while using much more memory. Usually there is no
        benefit in using more than a per-core L2 cache size. It would be
        better not to set this value lower than a few times the size of a
        buffer (bufsize, defaults to 16 kB). In addition, keep in mind that
        this option may be changed at runtime using "tune.memory.hot-size".

CONFIG_HAP_POOL_CLUSTER_SIZE
        This allows one to define the maximum number of objects that will be
        grouped together in an allocation from the shared pool. Values 4 to 8
        have experimentally shown good results with 16 threads. On systems
        with more cores or loosely coupled caches exhibiting slow atomic
        operations, it could possibly make sense to slightly increase this
        value.