March 3, 2026 · Open Access

Accelerating GPGPU simulation by strategically parallelizing the compute bottleneck

Key Points

Simulation wall-time speedup averages 3.58x with 8 worker threads and reaches a maximum of 4.38x.
Profiling across diverse GPGPU benchmarks identifies SIMT-Core cluster execution within the CORE-clock cycle as the primary performance bottleneck.
A thread-pool-based parallelization scheme targets this hotspot by executing SIMT-Core clusters concurrently.
The approach prioritizes minimal modifications to the existing GPGPU-Sim codebase to support long-term maintainability.

Abstract

Cycle-accurate GPGPU simulators such as GPGPU-Sim provide invaluable insights for hardware architecture research but suffer from extremely long runtimes, which hinders research productivity. This paper proposes a strategy to accelerate GPGPU-Sim. We first profile a diverse set of GPGPU benchmarks to identify the primary performance bottleneck, which is SIMT-Core cluster execution within the CORE-clock cycle. We then implement a parallelization scheme that targets this hotspot, using a thread pool to manage concurrent execution of SIMT-Core clusters. Our approach prioritizes minimal modifications to the existing GPGPU-Sim codebase to ensure long-term maintainability. Evaluation on a simulated NVIDIA H100 model demonstrates an average simulation wall-time speedup of 3.58x with 8 worker threads and a maximum of 4.38x. The maximum cycle count error is 3.22%, and some benchmarks exhibit no error at all.
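
To make the thread-pool idea concrete, the following is a minimal C++ sketch of how a fixed set of worker threads might advance every cluster by one core cycle and then synchronize before the next cycle begins. It is an illustration under stated assumptions, not the paper's implementation and not GPGPU-Sim code: `Cluster`, `ClusterPool`, `run_cycle`, and `simulate` are hypothetical names, and the toy clusters share no state, so the sketch says nothing about the cycle count error reported above.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a SIMT-Core cluster (not a GPGPU-Sim class).
struct Cluster {
    std::uint64_t state = 0;
    // Placeholder for the per-cycle work of one cluster.
    void core_cycle() { state = state * 6364136223846793005ULL + 1442695040888963407ULL; }
};

// Fixed pool of workers; run_cycle() advances every cluster by one cycle
// and returns only after all clusters have finished (a per-cycle barrier).
class ClusterPool {
public:
    ClusterPool(std::vector<Cluster>& clusters, unsigned workers) : clusters_(clusters) {
        assert(workers > 0 && "pool needs at least one worker");
        for (unsigned i = 0; i < workers; ++i)
            threads_.emplace_back([this] { worker_loop(); });
    }
    ~ClusterPool() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        start_cv_.notify_all();
        for (auto& t : threads_) t.join();
    }
    void run_cycle() {
        std::unique_lock<std::mutex> lk(m_);
        next_ = 0;
        remaining_ = clusters_.size();
        ++generation_;                     // signals that a new cycle is available
        start_cv_.notify_all();
        done_cv_.wait(lk, [this] { return remaining_ == 0; });
    }
private:
    void worker_loop() {
        std::uint64_t seen = 0;
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            start_cv_.wait(lk, [&] { return stop_ || generation_ != seen; });
            if (stop_) return;
            seen = generation_;
            while (next_ < clusters_.size()) {
                std::size_t idx = next_++; // claim one cluster under the lock
                lk.unlock();
                clusters_[idx].core_cycle();
                lk.lock();
                if (--remaining_ == 0) done_cv_.notify_one();
            }
        }
    }
    std::vector<Cluster>& clusters_;
    std::vector<std::thread> threads_;
    std::mutex m_;
    std::condition_variable start_cv_, done_cv_;
    std::size_t next_ = 0, remaining_ = 0;
    std::uint64_t generation_ = 0;
    bool stop_ = false;
};

// Runs `cycles` core cycles; workers == 0 selects the serial baseline.
std::vector<std::uint64_t> simulate(std::size_t num_clusters, unsigned workers, int cycles) {
    std::vector<Cluster> clusters(num_clusters);
    for (std::size_t i = 0; i < num_clusters; ++i) clusters[i].state = i;
    if (workers == 0) {
        for (int c = 0; c < cycles; ++c)
            for (auto& cl : clusters) cl.core_cycle();
    } else {
        ClusterPool pool(clusters, workers);
        for (int c = 0; c < cycles; ++c) pool.run_cycle();
    }
    std::vector<std::uint64_t> out;
    for (const auto& cl : clusters) out.push_back(cl.state);
    return out;
}
```

In this toy setting, `simulate(16, 8, 100)` and the serial `simulate(16, 0, 100)` produce identical cluster states, because each cluster touches only its own data. A real simulator would also need to account for any state the clusters share, which this sketch deliberately leaves out.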
