Three big semiconductor companies in HPC are currently competing in the race for the best CPU: AMD, Intel, and NVIDIA. There are significant differences among their state-of-the-art CPU designs, spanning the entire range from instruction execution to cache behavior and main memory bandwidth. In this work, we analyze the performance of CPUs based on the Zen 4, Golden Cove, and Neoverse V2 microarchitectures. We create accurate in-core performance models for use with the Open Source Architecture Code Analyzer (OSACA) tool and compare its prediction accuracy with that of llvm-mca. Beyond the tool aspect, this comparison reveals interesting differences in in-core design points but also some commonalities. Beyond the single core, we extend our comparison by measuring data-transfer behavior through the memory hierarchy using a variety of microbenchmarks. We thoroughly investigate the “write-allocate (WA) evasion” feature, which can automatically reduce the memory traffic caused by write misses. We show that the Grace Superchip has a next-to-optimal implementation of WA evasion, while the Sapphire Rapids CPU can avoid write allocates completely only in specific scenarios. The only way to eliminate WAs on AMD Genoa is the explicit use of non-temporal stores. Finally, we study the cache hierarchy of the CPUs in view of the Execution-Cache-Memory (ECM) performance model, revealing overlapping cache hierarchies on Genoa and Grace in contrast to Sapphire Rapids.
Laukemann et al. studied this question.
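To make the effect of write allocates concrete, the following minimal sketch counts the main-memory traffic per loop iteration of simple streaming kernels with and without a write allocate on every store miss. It is an illustration only and not taken from the work above: the function name `traffic_bytes_per_iteration`, the kernel selection, and the assumption of double-precision (8-byte) elements are hypothetical choices, and the model ignores caches, prefetching, and partial cache-line effects.

```python
def traffic_bytes_per_iteration(loads, stores, write_allocate, elem_bytes=8):
    """Idealized main-memory traffic per loop iteration of a streaming kernel.

    loads, stores   -- number of distinct arrays read / written per iteration
    write_allocate  -- if True, every store miss first reads the target cache
                       line, which adds one read stream per store stream
    elem_bytes      -- element size (assumption: 8-byte doubles)
    """
    read_streams = loads + (stores if write_allocate else 0)
    write_streams = stores
    return (read_streams + write_streams) * elem_bytes


# Illustrative kernels: (load streams, store streams)
kernels = {
    "init   a[i] = s": (0, 1),
    "copy   a[i] = b[i]": (1, 1),
    "triad  a[i] = b[i] + s * c[i]": (2, 1),
}

for name, (loads, stores) in kernels.items():
    with_wa = traffic_bytes_per_iteration(loads, stores, write_allocate=True)
    without_wa = traffic_bytes_per_iteration(loads, stores, write_allocate=False)
    print(f"{name:32s} WA: {with_wa:2d} B/it   no WA: {without_wa:2d} B/it   "
          f"ratio: {with_wa / without_wa:.2f}")
```

Under these assumptions, a store-only kernel moves twice as much data when write allocates occur, and a copy kernel moves 1.5 times as much. This is the traffic that WA evasion or non-temporal stores can remove in the ideal case.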