Key points are not available for this paper at this time.
Modern, high-performance memory allocators must scale to a wide array of uses, including producer-consumer workloads. In such workloads, objects are allocated by one thread and deallocated by another, which we call remote deallocations. These remote deallocations lead to contention on the allocator's synchronization mechanisms. Message-passing allocators, such as mimalloc and snmalloc, use message queues to communicate remote deallocations between threads. These queues work well for producer-consumer workloads, but there is room for optimization. We propose and characterize BatchIt, a conceptually simple optimization for such allocators: a per-slab cache of remote deallocations that enables batching of objects destined for the same slab. This optimization aims to exploit naturally-arising locality of allocations, and it generalizes across particular implementations; we have implementations for both mimalloc and snmalloc. Multi-threaded, producer-consumer benchmarks show improved performance from reduced rates of atomic operations and cache misses in the underlying allocator. Experimental results using the mimalloc-bench suite and a custom message-passing workload show that some producer-consumer workloads see over 20% performance improvement even atop the high performance these allocators already provide.
Filardo et al. (Thu,) studied this question.