May 19, 2026 · Open Access

VDR-LLM-Prolog on Dedicated Silicon: From FPGA Proof-of-Concept to Integer-Native GPU Architecture

Key Points

This study explores the transition of VDR-LLM-Prolog from an FPGA proof-of-concept to a dedicated silicon architecture for exact integer arithmetic.
Specifies an integer-native, GPU-class processor architecture built around 384-bit ALUs.
Projects approximately 5 trillion Q335 multiplications per second, enough that the arithmetic becomes memory-bandwidth-bound rather than compute-bound.
Specifies the core microarchitecture, memory hierarchy, programming model, and die area estimates.
Targets a reduction in multiply latency from 9 cycles on the FPGA to 1-2 cycles on the integer-native GPU.
Reclaims die area from unused floating-point units, tensor cores, and special function units for wide integer arithmetic.
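
The throughput and memory-bound claims above can be sanity-checked with a back-of-envelope calculation. The sketch below is not taken from the paper: it uses the figures quoted in the abstract (5,120 cores, 2-2.5 GHz clocks, 1-2 cycle multiplies, 3-5 TB/s of HBM3 bandwidth, 384-bit values) and assumes, purely for illustration, that multiplies are issued back to back without pipelining and that each multiply consumes one operand fetched from off-chip memory. The function names are hypothetical.

```python
CORES = 5_120                 # Q335 cores (figure from the abstract)
BYTES_PER_VALUE = 384 // 8    # one 384-bit numerator occupies 48 bytes


def peak_multiplies_per_second(clock_hz: float, cycles_per_multiply: int) -> float:
    """Peak multiply rate, assuming every core issues unpipelined multiplies back to back."""
    return CORES * clock_hz / cycles_per_multiply


def values_streamed_per_second(bandwidth_bytes_per_s: float) -> float:
    """How many 384-bit values per second the memory system can deliver."""
    return bandwidth_bytes_per_s / BYTES_PER_VALUE


def required_reuse(clock_hz: float, cycles_per_multiply: int,
                   bandwidth_bytes_per_s: float) -> float:
    """Multiplies that must be performed per value fetched to keep the ALUs busy."""
    return (peak_multiplies_per_second(clock_hz, cycles_per_multiply)
            / values_streamed_per_second(bandwidth_bytes_per_s))


if __name__ == "__main__":
    # Conservative corner: 2 GHz clock, 2-cycle multiply, 5 TB/s bandwidth.
    print(peak_multiplies_per_second(2e9, 2))    # 5.12e12 multiplies per second
    print(required_reuse(2e9, 2, 5e12))          # about 49 multiplies per fetched value
```

Under these assumptions the conservative corner (2 GHz, 2-cycle multiply) gives 5.12 trillion multiplies per second, which matches the "approximately 5 trillion" figure. Even at 5 TB/s, memory can supply only about one 48-byte value for every 49 multiplies the cores could perform, which is consistent with the claim that the design is bandwidth-bound unless operands are heavily reused on chip.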

Abstract

The VDR-LLM-Prolog FPGA implementation (HOWL-VDR-21-2026) validates an architectural principle: Q335 exact integer arithmetic maps naturally to parallel hardware. In Q335, every value is a 384-bit numerator over an implicit fixed denominator of 2^335, and division by that denominator is bit extraction that requires zero logic. The FPGA achieves this at 150 MHz on 10 custom cores in a 200 system-on-chip. This paper asks what happens when that architecture moves to dedicated silicon designed for it. Modern GPU fabrication at 4-5 nm provides transistor budgets exceeding 80 billion, clock speeds of 2-2.5 GHz, and memory bandwidths of 3-5 TB/s via HBM3. Current GPUs dedicate substantial die area to floating-point units, tensor cores with float accumulation, and special function units for transcendentals (sin, cos, exp, rsqrt), none of which VDR-LLM-Prolog uses. This paper specifies an integer-native processor that reclaims that area for wide integer arithmetic: 384-bit ALUs with a 1-2 cycle multiply (versus 9 cycles on the FPGA); SHR335 implemented as a routing decision (zero gates, zero power, and zero latency beyond wire delay, the same property that makes it zero logic on the FPGA, now replicated across thousands of units); and a reduction network that produces exact results at every level. The design targets 5,120 Q335 cores organized into 80 streaming multiprocessors and projects approximately 5 trillion Q335 multiplications per second. That rate is high enough that the arithmetic is memory-bandwidth-bound, not compute-bound, on workloads where VDR-18 showed total multiply counts of thousands to millions per prompt. The paper specifies the core microarchitecture, the memory hierarchy, the programming model, the die area estimates, and the performance projections, treating the FPGA's validated ISA principles as the architectural contract and modern GPU fabrication as the implementation technology.
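
To make the Q335 representation concrete, here is a minimal software sketch of the arithmetic the abstract describes: a value is stored as an integer numerator over the implicit denominator 2^335, and the division after a multiply is a 335-bit right shift (the SHR335 step). This is an illustration, not the paper's ISA. The helper names, the signed two's-complement range check, the floor rounding in the shift, and raising on overflow are all assumptions made for the example; the abstract does not state how the hardware treats the low 335 bits of the double-width product or out-of-range results.

```python
from fractions import Fraction

FRAC_BITS = 335            # implicit denominator is 2**335 (from the abstract)
WORD_BITS = 384            # numerator width in bits (from the abstract)
ONE = 1 << FRAC_BITS       # Q335 encoding of the value 1


def fits_word(n: int) -> bool:
    """Assumption: numerators are signed two's-complement 384-bit integers."""
    return -(1 << (WORD_BITS - 1)) <= n < (1 << (WORD_BITS - 1))


def to_q335(value: Fraction) -> int:
    """Encode a rational as a Q335 numerator (floor division; exact for dyadic values)."""
    n = (value.numerator << FRAC_BITS) // value.denominator
    if not fits_word(n):
        raise OverflowError("value does not fit in a 384-bit Q335 numerator")
    return n


def from_q335(n: int) -> Fraction:
    """Decode a Q335 numerator back to an exact rational."""
    return Fraction(n, ONE)


def q335_mul(a: int, b: int) -> int:
    """Multiply two Q335 numerators.

    The double-width product carries 670 fractional bits; dividing by 2**335
    is a right shift, i.e. selecting bits 335 and above of the product.
    """
    result = (a * b) >> FRAC_BITS   # SHR335: no divider, only bit selection
    if not fits_word(result):
        raise OverflowError("product does not fit in a 384-bit Q335 numerator")
    return result


if __name__ == "__main__":
    a = to_q335(Fraction(3, 2))
    b = to_q335(Fraction(1, 4))
    print(from_q335(q335_mul(a, b)))   # 3/8
```

Python's arbitrary-precision integers stand in for the 384-bit datapath here, which is why the shift can be written directly. In hardware the same operation is a fixed choice of which product wires feed the result register, which is the property the abstract describes as zero logic.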
