What question did this study set out to answer?

The study evaluates whether VDR, a fixed-basis integer arithmetic method, can outperform traditional floating-point arithmetic in machine learning workloads.

May 20, 2026 | Open Access

VDR in Zig SIMD and GPU Performance versus Floating Point: Fixed-Basis Integer Arithmetic on Production Hardware

Key Points

Implemented VDR in Zig, targeting AVX-512 SIMD and NVIDIA H100 GPU tensor cores.
Retuned the basis to match machine register widths: 8-bit for weights, 16-bit for activations, and 64-bit for gradient accumulation.
Projected performance improvements from published hardware specifications rather than measured benchmarks.
Projected 1.6-1.8× throughput improvement on GEMM on H100 via INT8 tensor cores.
Projected 3-4× improvement on softmax by eliminating the Special Function Unit bottleneck.
Projected approximately 2× throughput for a full transformer forward pass on a 7B parameter model versus optimized FP16.

Abstract

Every floating-point operation can round. One rounding is negligible. Millions compound. This paper presents performance projections for VDR — Value, Denominator, Remainder — exact integer arithmetic implemented in Zig targeting AVX-512 SIMD and NVIDIA H100 GPU tensor cores. VDR eliminates accumulated arithmetic error by replacing floating-point operations with integer multiply, shift, and mask on a fixed power-of-two denominator basis, storing exact remainders rather than discarding them. The reference VDR implementation (vdr-math, Python) uses a 335-bit basis tuned for physics and transcendental computation. This paper retunes the basis to match machine register widths for LLM inference and diffusion model workloads: 8-bit for weights, 16-bit for activations, 64-bit for gradient accumulation. At these widths, VDR's divmod operation reduces to a bit shift and mask — native hardware operations on all modern processors. Projected results on H100: 1.6-1.8× throughput improvement on GEMM via INT8 tensor cores, 3-4× improvement on softmax via elimination of the Special Function Unit bottleneck, 2× effective memory bandwidth from half-size weights, and zero accumulated drift over arbitrarily long operation chains. Full transformer forward pass for a 7B parameter model projects to approximately 2× throughput versus optimized FP16, with exact results at every step. All projections are conservative estimates based on published hardware specifications. VDR delivers exact arithmetic not by trading performance for correctness, but by targeting integer execution units that are faster than their floating-point counterparts for the operations ML pipelines actually perform.
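The abstract's claim that divmod on a power-of-two basis reduces to a shift and a mask can be illustrated with a minimal sketch. The names `K`, `divmod_pow2`, and `mul_fixed` are hypothetical and chosen for illustration only; they are not the vdr-math API, and the (value, remainder) layout is an assumption about how an exact remainder might be kept alongside the quotient.

```python
# Illustrative sketch only: names and layout are assumptions, not the vdr-math API.
K = 16  # assumed basis exponent; the denominator is 2**K


def divmod_pow2(n, k=K):
    """Return (n // 2**k, n % 2**k) using only a shift and a mask."""
    return n >> k, n & ((1 << k) - 1)


def mul_fixed(a, b, k=K):
    """Multiply two numerators over 2**k.

    The raw product a*b sits over 2**(2k). Shifting by k rescales it to the
    2**k basis, and the masked low bits are kept as the remainder instead of
    being rounded away, so a*b == (value << k) + remainder holds exactly.
    """
    return divmod_pow2(a * b, k)


a = 98304  # 1.5 expressed over 2**16
b = 21845  # roughly 1/3 expressed over 2**16
value, remainder = mul_fixed(a, b)

# Nothing was discarded: the original product can be rebuilt exactly.
assert a * b == (value << K) + remainder
print(value, remainder)  # 32767 32768
```

In this example a floating-point or truncating fixed-point multiply would keep only `value`; retaining `remainder` is what allows the product to be reconstructed without loss. Python integers are unbounded, so the sketch does not model register overflow, which a fixed-width implementation would have to handle.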

Cite This Study

Geoffrey Howland studied this question.

synapsesocial.com/papers/6a0d4fbff03e14405aa9b2d1 https://doi.org/10.5281/zenodo.20266675