March 15, 2026 | Open Access

Microbenchmarking Instruction-Level Tensor Core Throttling in NVIDIA CMP 170HX

Key Points

This research investigates the instruction-level mechanisms that limit Tensor Core performance in the NVIDIA CMP 170HX.
Microbenchmarks were used to analyze instruction execution systematically.
Controlled experiments examined ILP scaling, warp scaling, dependency chains, and cross-pipeline interference.
The CMP 170HX's Tensor Cores were found to throttle instruction execution to a fixed latency of 256 cycles.
Only 4 warps per SM can issue Tensor Core instructions simultaneously, limiting FP16 Tensor Core throughput to 1/32 of the theoretical peak.
A theoretical model was constructed that links these microarchitectural limits to the measured total throughput of 6.3 TFLOPS.

Abstract

This study takes the flagship NVIDIA CMP 170HX as its subject and uses microbenchmarking to systematically investigate the instruction-level mechanisms that limit the performance of its Tensor Cores. It makes three main contributions. First, experiments reveal for the first time that the CMP 170HX's Tensor Cores throttle instruction execution to a fixed 256 cycles. The latency of a single MMA instruction is unaffected by the degree of instruction-level parallelism (ILP) and cannot be hidden through pipeline overlap. Furthermore, only 4 warps per Streaming Multiprocessor (SM) can issue Tensor Core instructions simultaneously, so the achievable FP16 Tensor Core throughput is only 1/32 of the theoretical peak. Second, multiple controlled experiments, including ILP scaling, warp scaling, dependency chain construction, and cross-pipeline interference, pinpoint the throttling mechanism as a dispatch-level hardware gating limitation rather than physical damage to the execution units or decoding delays. Third, based on the experimental results, a theoretical model is constructed that connects microarchitecture to macroscopic throughput, closing the loop from the 256-cycle fixed latency and 4-warp issue limit to the measured total throughput of 6.3 TFLOPS.
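
The step from the 256-cycle latency and the 4-warp issue limit to roughly 6.3 TFLOPS can be illustrated with a back-of-the-envelope calculation. The abstract does not state the MMA instruction shape, the SM count, the clock frequency, or the unthrottled per-SM rate, so the values below (an m16n8k16 FP16 MMA, 70 SMs, a 1.41 GHz clock, and 2048 dense FP16 FLOPs per cycle per SM) are assumptions chosen for illustration, and the function name is hypothetical. This is a minimal sketch under those assumptions, not necessarily the model used in the paper.

```python
def throttled_tflops(flops_per_mma, issuing_warps_per_sm, cycles_per_mma,
                     num_sms, clock_hz):
    """Throughput when each issuing warp completes one MMA every cycles_per_mma cycles."""
    flops_per_cycle_per_sm = issuing_warps_per_sm * flops_per_mma / cycles_per_mma
    return flops_per_cycle_per_sm * num_sms * clock_hz / 1e12


# Values taken from the abstract.
CYCLES_PER_MMA = 256        # fixed per-instruction latency
ISSUING_WARPS_PER_SM = 4    # warps that can issue Tensor Core instructions at once

# Assumptions (not stated in the abstract).
M, N, K = 16, 8, 16                 # assumed MMA tile shape (m16n8k16)
FLOPS_PER_MMA = 2 * M * N * K       # one multiply and one add per element: 4096
NUM_SMS = 70                        # assumed SM count
CLOCK_HZ = 1.41e9                   # assumed clock frequency
PEAK_FLOPS_PER_CYCLE_PER_SM = 2048  # assumed unthrottled dense FP16 rate per SM

throttled = throttled_tflops(FLOPS_PER_MMA, ISSUING_WARPS_PER_SM,
                             CYCLES_PER_MMA, NUM_SMS, CLOCK_HZ)
peak = PEAK_FLOPS_PER_CYCLE_PER_SM * NUM_SMS * CLOCK_HZ / 1e12

print(f"throttled: {throttled:.2f} TFLOPS")
print(f"peak:      {peak:.2f} TFLOPS")
print(f"ratio:     1/{peak / throttled:.0f}")
```

Under these assumed parameters each SM sustains 64 FLOPs per cycle, which gives about 6.3 TFLOPS across the chip and a 1/32 ratio to the assumed peak, consistent with the figures reported in the abstract.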
