This study uses the flagship NVIDIA CMP 170HX as its subject and applies microbenchmarking to systematically investigate the instruction-level mechanisms that limit the performance of its Tensor Cores. The core findings and contributions are threefold. First, experiments reveal for the first time a fixed 256-cycle instruction execution throttle in the CMP 170HX's Tensor Cores. The latency of a single MMA instruction is unaffected by the degree of Instruction Level Parallelism (ILP) and cannot be hidden through pipeline overlap. Furthermore, only 4 warps per Streaming Multiprocessor (SM) can issue Tensor Core instructions simultaneously, so the achievable FP16 Tensor Core throughput is only 1/32 of the theoretical peak. Second, multiple controlled experiments, including ILP scaling, warp scaling, dependency chain construction, and cross-pipeline interference, pinpoint the throttling mechanism as a dispatch-level hardware gating limitation rather than physical damage to the execution units or decoding delays. Third, based on the experimental results, a theoretical model linking microarchitecture to aggregate throughput is constructed, closing the loop from the 256-cycle fixed latency and the 4-warp issue limit to the measured total throughput of 6.3 TFLOPS.
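The chain from the 256-cycle latency and the 4-warp issue limit to the 6.3 TFLOPS figure can be checked with a short back-of-the-envelope calculation. The sketch below is illustrative and not the paper's own model. Only the 256-cycle latency and the 4-warp limit come from the abstract. The SM count (70), the clock (1.41 GHz), the MMA tile shape (m16n8k16), and the unthrottled per-SM peak (2048 FLOP per cycle) are assumptions introduced here, as are all function and constant names.

```python
# Values stated in the abstract.
MMA_LATENCY_CYCLES = 256      # fixed latency of one MMA instruction
ISSUING_WARPS_PER_SM = 4      # warps per SM that can issue MMA concurrently

# Assumptions, NOT stated in the abstract (hypothetical inputs for illustration).
SM_COUNT = 70                         # assumed number of SMs on the CMP 170HX
CLOCK_HZ = 1.41e9                     # assumed SM clock in Hz
MMA_M, MMA_N, MMA_K = 16, 8, 16       # assumed FP16 MMA tile shape (m16n8k16)
PEAK_FLOP_PER_CYCLE_PER_SM = 2048     # assumed unthrottled FP16 Tensor Core peak


def flop_per_mma(m: int, n: int, k: int) -> int:
    """FLOPs in one warp-level MMA: one multiply and one add per (m, n, k) triple."""
    return 2 * m * n * k


def throttled_flop_per_cycle_per_sm() -> float:
    """Per-SM rate when each of the issuing warps retires one MMA every 256 cycles."""
    per_warp = flop_per_mma(MMA_M, MMA_N, MMA_K) / MMA_LATENCY_CYCLES
    return ISSUING_WARPS_PER_SM * per_warp


def total_tflops() -> float:
    """Chip-wide throttled throughput in TFLOPS under the assumptions above."""
    return throttled_flop_per_cycle_per_sm() * SM_COUNT * CLOCK_HZ / 1e12


def fraction_of_peak() -> float:
    """Throttled per-SM rate as a fraction of the assumed unthrottled peak."""
    return throttled_flop_per_cycle_per_sm() / PEAK_FLOP_PER_CYCLE_PER_SM


if __name__ == "__main__":
    print(f"per-SM rate: {throttled_flop_per_cycle_per_sm():.0f} FLOP/cycle")
    print(f"total:       {total_tflops():.2f} TFLOPS")
    print(f"fraction:    1/{1 / fraction_of_peak():.0f} of assumed peak")
```

Under these assumed inputs the model gives 64 FLOP per cycle per SM, roughly 6.3 TFLOPS in total, and 1/32 of the assumed peak, which agrees with the figures quoted in the abstract. A different clock or tile shape would shift the result, so the agreement should be read as a consistency check rather than a confirmation of the assumed values.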
Kangwei Xing
Oldham Council
Kangwei Xing (Mon,) studied this question.
synapsesocial.com/papers/69b5ff8083145bc643d1c0d4 — DOI: https://doi.org/10.5281/zenodo.19002983