What does this research mean for the field?

Repurposing idle GPU compute resources to perform massively parallel page table walks (cuPTW) significantly accelerates address translation, achieving an average performance speedup of 4.43× over the baseline GPU architecture. Novelty: methodological. Consensus alignment: neutral.

May 29, 2026

cuPTW: Leveraging Idle Compute Units for Massively Parallel GPU Page Table Walks

Abstract

Virtual memory has become a cornerstone of modern GPUs, enabling unified address spaces and advanced memory management techniques. However, the performance of address translation has emerged as a critical bottleneck, particularly under irregular workloads with massive memory footprints, where frequent TLB misses and costly page table walks dominate total memory access latency. Prior work has primarily focused on improving TLB effectiveness or optimizing page table walks through batching and coalescing, but these approaches remain limited by the lack of locality and memory bandwidth constraints. In this work, we propose Compute Unit Page Table Walk (cuPTW), a novel address translation architecture that repurposes idle GPU compute resources to accelerate page table walks. Specifically, we introduce (1) a single-threaded synchronous cuPTW, which offloads translation requests to idle functional units within compute units, and (2) two optimizations that further reduce latency and improve throughput by caching page table walks in local data store memory (cuPTW-SW) and parallelizing them across multiple SIMD lanes (cuPTW-MT). When combined, cuPTW-Full transforms low-parallelism page table walks into a massively parallel computation task. Our evaluation across 15 representative GPU workloads, including deep learning, graph analytics, and scientific simulations, demonstrates that cuPTW-Full achieves a performance speedup of 4.43× on average (up to 76.09×) by improving page table walk throughput by 9.92× on average over our baseline GPU architecture. Compared to state-of-the-art GPU address translation proposals, cuPTW-Full achieves a 1.97× to 2.08× average speedup.

Cite This Study

Chen et al. (Fri,) studied this question.

synapsesocial.com/papers/6a1bc8a60a1f7575939ce704 https://doi.org/10.1145/3805633
