What question did this study set out to answer?

The study asks whether inbound network I/O on commodity hardware can bypass DRAM by using processor caches more efficiently, rather than by requiring larger caches.

March 28, 2026

Sumeru: Towards Understanding and Achieving Cache-Optimal Inbound Network I/O

Key Points

Aimed to improve inbound network I/O performance by using processor caches more efficiently, so that receive buffers bypass DRAM.
Developed a novel cache model showing that software queue dynamics determine a receive buffer's trajectory through the non-inclusive cache hierarchy.
Designed Sumeru, incorporating a dual-path stack architecture with a shallow fast path for large flows.
Implemented cache-aware buffer pools that enforce optimal cache trajectories.
Used host-based active queue management to prevent bufferbloat.
Applied trajectory-aware dynamic cache partitioning.
Achieved near-100% cache hit rates across a wide range of workloads and network settings.
Eliminated memory-induced intra-host congestion, enhancing overall data movement.
Boosted SPDK NVMe/TCP goodput by up to 51.2%.
Improved co-located SPEC CPU 2017 scores by up to 30.1%.
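
As an illustration of why buffer pool policy can matter for cache behavior, the sketch below compares how many distinct buffers a receive path touches under FIFO versus LIFO recycling. This is a generic toy comparison, not Sumeru's buffer pool: the summary does not describe the actual policy, and the function name and parameters here are hypothetical.

```python
from collections import deque

def distinct_buffers_touched(pool_size, in_flight, packets, lifo):
    """Count distinct buffers used when at most `in_flight` buffers stay busy.

    Hypothetical toy model: one buffer is allocated per packet, and the
    oldest busy buffer is returned to the pool once more than `in_flight`
    buffers are outstanding.
    """
    free = deque(range(pool_size))  # free list of buffer ids
    busy = deque()                  # allocated buffers, oldest first
    touched = set()
    for _ in range(packets):
        # LIFO takes the most recently freed buffer; FIFO takes the oldest.
        buf = free.pop() if lifo else free.popleft()
        touched.add(buf)
        busy.append(buf)
        if len(busy) > in_flight:
            free.append(busy.popleft())
    return len(touched)

if __name__ == "__main__":
    for lifo in (False, True):
        n = distinct_buffers_touched(1024, 8, 10_000, lifo)
        print("LIFO" if lifo else "FIFO", "distinct buffers touched:", n)
```

In this toy setup, FIFO recycling cycles through the whole pool, while LIFO recycling keeps reusing a small set of recently used buffers, which is the kind of footprint a cache-aware pool would presumably aim for.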

Abstract

The slow growth of DRAM performance and ever-increasing memory bandwidth demands have made receiver-side memory a critical bottleneck for end-to-end data movement in cutting-edge data centers. Although Direct Cache Access (DCA) allows for memory-bypass I/O, existing implementations like Intel's Data Direct I/O (DDIO) have proven ineffective on 100 Gbps links, leading to a widespread belief that current processor caches are simply too small to serve modern high-speed links. This paper challenges this conclusion, arguing that the fundamental problem is not insufficient cache capacity, but inefficient cache usage. Our novel cache model reveals that software queue dynamics determine a receive buffer's path through the non-inclusive cache hierarchy (i.e., its "cache trajectory"), opening the path toward cache-optimal DRAM-bypass inbound I/O on commodity hardware with pure software modifications. Guided by the model, we design and implement Sumeru, which approaches cache-optimal I/O through four synergistic innovations: (1) a dual-path stack architecture with a shallow fast path for large flows, (2) cache-aware buffer pools enforcing optimal trajectories, (3) host-based active queue management preventing bufferbloat, and (4) trajectory-aware dynamic cache partitioning. These designs work together to consistently keep network buffers on their optimal trajectory. The result is near-100% cache hit rates on a wide range of workloads and network settings. This eliminates memory-induced intra-host congestion, improving performance for both the target throughput-bound application and co-located latency-sensitive or memory-intensive neighbors. On real-world resource-contending deployments, Sumeru achieves a Pareto improvement: It boosts SPDK NVMe/TCP goodput by up to 51.2% while simultaneously boosting co-located SPEC CPU 2017 suite scores by up to 30.1%.
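
The claim that queue dynamics, rather than cache size alone, decide whether a receive buffer is still cached when the application reads it can be illustrated with a deliberately simplified simulation. The sketch below is not the paper's cache model: it assumes a single LRU cache measured in whole buffers, ignores the non-inclusive hierarchy and DDIO specifics, and uses hypothetical names throughout.

```python
from collections import OrderedDict, deque

def simulate_hit_rate(cache_slots, queue_depth, packets=10_000):
    """Toy model of inbound I/O through a small LRU cache.

    Assumptions (not from the paper): the NIC writes each buffer into the
    cache on arrival, and the application reads a buffer only once
    `queue_depth` newer buffers are queued behind it.
    """
    cache = OrderedDict()  # resident buffer ids, oldest first
    backlog = deque()      # buffers written by the NIC, not yet consumed
    hits = reads = 0
    for buf in range(packets):
        # NIC write: insert the buffer, evicting the oldest entry if full.
        cache[buf] = None
        if len(cache) > cache_slots:
            cache.popitem(last=False)
        backlog.append(buf)
        # Application read: consume the oldest queued buffer.
        if len(backlog) > queue_depth:
            oldest = backlog.popleft()
            reads += 1
            if oldest in cache:
                hits += 1
                del cache[oldest]  # a consumed buffer frees its slot
    return hits / reads if reads else 0.0

if __name__ == "__main__":
    for depth in (16, 63, 64, 256):
        rate = simulate_hit_rate(cache_slots=64, queue_depth=depth)
        print(f"queue depth {depth:>3}: hit rate {rate:.2f}")
```

In this toy setting the hit rate collapses as soon as the standing queue no longer fits in the cache, even though the cache size is unchanged, which is consistent with the abstract's argument that bounding queue depth (for example through active queue management) matters more than raw capacity.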
