The VDR-LLM-Prolog FPGA implementation (HOWL-VDR-21-2026) validates an architectural principle: Q335 exact integer arithmetic maps naturally to parallel hardware. In Q335, every value is a 384-bit numerator over an implicit fixed denominator of 2³³⁵, and division by that denominator is bit extraction requiring zero logic. The FPGA achieves this at 150 MHz on 10 custom cores in a 200 system-on-chip. This paper asks what happens when that architecture moves to dedicated silicon designed for it. Modern GPU fabrication at 4-5 nm provides transistor budgets exceeding 80 billion, clock speeds of 2-2.5 GHz, and memory bandwidths of 3-5 TB/s via HBM3. Current GPUs dedicate substantial die area to floating-point units, tensor cores with float accumulation, and special function units for transcendentals (sin, cos, exp, rsqrt), none of which VDR-LLM-Prolog uses. This paper specifies an integer-native processor that reclaims that area for wide integer arithmetic: 384-bit ALUs with 1-2 cycle multiply (versus 9 cycles on FPGA), SHR335 as a routing decision (zero gates, zero power, zero latency beyond wire delay; the same property that makes it zero logic on FPGA, now replicated across thousands of units), and a reduction network that produces exact results at every level. The design targets 5,120 Q335 cores organized into 80 streaming multiprocessors (64 cores each), projecting approximately 5 trillion Q335 multiplications per second. At that rate the arithmetic is memory-bandwidth-bound, not compute-bound, on workloads where VDR-18 showed total multiply counts of thousands to millions per prompt. The paper specifies the core microarchitecture, the memory hierarchy, the programming model, the die area estimates, and the performance projections, treating the FPGA's validated ISA principles as the architectural contract and modern GPU fabrication as the implementation technology.
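The Q335 representation and the throughput projection described above can be made concrete with a minimal Python sketch. It is a software model under stated assumptions, not the paper's hardware or ISA: the helper names (`to_q335`, `q335_mul`, `fits_384`, `peak_mults_per_second`) are hypothetical, the numerator is assumed to be signed two's complement, and the SHR335 step is assumed to discard the low 335 bits of the product (floor rounding). Python's arbitrary-precision integers stand in for the 384-bit datapath.

```python
# Software model of Q335 arithmetic (illustrative; names are hypothetical).
FRAC_BITS = 335          # implicit denominator is 2**335
WIDTH = 384              # numerator width in bits
ONE = 1 << FRAC_BITS     # the Q335 encoding of 1

def to_q335(num: int, den: int = 1) -> int:
    """Encode the rational num/den as a Q335 numerator (floor-rounded)."""
    return (num << FRAC_BITS) // den

def q335_mul(a: int, b: int) -> int:
    """Multiply two Q335 values.

    The raw product carries a denominator of 2**670; shifting right by 335
    restores the 2**335 denominator. In hardware this shift is a fixed
    selection of product bits, which is why the text calls it zero logic.
    Assumption: the discarded low bits are simply dropped (floor).
    """
    return (a * b) >> FRAC_BITS

def fits_384(x: int) -> bool:
    """Check that x fits a signed 384-bit numerator (signedness assumed)."""
    return -(1 << (WIDTH - 1)) <= x < (1 << (WIDTH - 1))

def peak_mults_per_second(cores: int, clock_hz: int, cycles_per_mul: int) -> int:
    """Peak multiply rate if every core issues back-to-back multiplies."""
    return cores * clock_hz // cycles_per_mul

if __name__ == "__main__":
    three = to_q335(3)
    half = to_q335(1, 2)
    product = q335_mul(three, half)
    print(product == to_q335(3, 2), fits_384(product))
    print(peak_mults_per_second(5120, 2_000_000_000, 2))
```

Pairing the lower clock bound (2 GHz) with the slower multiply (2 cycles) gives 5.12 × 10¹² multiplications per second, which is in line with the quoted figure of approximately 5 trillion; this pairing is an assumption, and the paper's own derivation may differ.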
Geoffrey Howland (Fri,) studied this question.