The slow growth of DRAM performance and ever-increasing memory bandwidth demands have made receiver-side memory a critical bottleneck for end-to-end data movement in cutting-edge data centers. Although Direct Cache Access (DCA) allows I/O to bypass memory, existing implementations such as Intel's Data Direct I/O (DDIO) have proven ineffective on 100 Gbps links, leading to a widespread belief that current processor caches are simply too small to serve modern high-speed links. This paper challenges that conclusion, arguing that the fundamental problem is not insufficient cache capacity but inefficient cache usage. Our novel cache model reveals that software queue dynamics determine a receive buffer's path through the non-inclusive cache hierarchy (i.e., its "cache trajectory"), opening the way toward cache-optimal, DRAM-bypass inbound I/O on commodity hardware with pure software modifications. Guided by the model, we design and implement Sumeru, which approaches cache-optimal I/O through four synergistic innovations: (1) a dual-path stack architecture with a shallow fast path for large flows, (2) cache-aware buffer pools enforcing optimal trajectories, (3) host-based active queue management preventing bufferbloat, and (4) trajectory-aware dynamic cache partitioning. These designs work together to consistently keep network buffers on their optimal trajectory. The result is near-100% cache hit rates across a wide range of workloads and network settings. Such hit rates eliminate memory-induced intra-host congestion, improving performance for both the target throughput-bound application and co-located latency-sensitive or memory-intensive neighbors. On real-world resource-contending deployments, Sumeru achieves a Pareto improvement: it boosts SPDK NVMe/TCP goodput by up to 51.2% while simultaneously boosting co-located SPEC CPU 2017 suite scores by up to 30.1%.
Wang et al. studied this question.