What question did this study set out to answer?

This research aims to improve VDR arithmetic systems through the implementation of a hardware-native Functional Remainder Unit.

May 19, 2026 · Open Access

VDR-LLM-Prolog: Functional Remainder Hardware: Adaptive Precision Through Structural Information in Silicon

Key Points

The study develops the FRU to compute exact remainders on the VDR-22 integer-native ASIC without relying on the host processor.
It specifies the FRU microarchitecture and its integration with the existing ALU in each Q335 Integer Unit.
It evaluates latency and throughput at datacenter scale with millions of concurrent sessions.
The FRU enables exact exponential softmax at latency competitive with floating-point implementations (25-40 ns for 1,024 logits, versus host-bound milliseconds without the FRU).
Continuous per-step remainder resolution during training replaces periodic Q-basis reprojection stalls with microsecond-level maintenance.
Keeping the rule-driven execution path on-chip removes the host round-trip serialization bottleneck that would otherwise limit throughput as accumulated Prolog rules handle a growing share of work autonomously.

Abstract

The VDR arithmetic system represents every value as a triple (Value, Denominator, Remainder), where the Remainder slot can hold a callable function that produces exact rational values at any requested depth. This functional remainder mechanism, specified in HOWL-VDR-1-2026 and used throughout the system for transcendental evaluation, has implications for hardware that prior papers in the series did not explore. On the VDR-22 integer-native ASIC (HOWL-VDR-22-2026), each Q335 Integer Unit (QIU) already contains a 384-bit ALU with 1-2 cycle multiply and free power-of-two division via fixed wiring. This paper specifies a Functional Remainder Unit (FRU) that extends each QIU to evaluate functional remainders (Taylor recurrences, Newton iterations, and series summations) in hardware using the existing ALU, without round-tripping to the host processor. The FRU adds approximately 500,000 transistors per QIU (a 3.4% die area increase across 5,120 units) and enables three capabilities that the base VDR-22 chip cannot provide: hardware-native exact exponential softmax at latency competitive with floating-point implementations (25-40 nanoseconds for 1,024 logits, versus host-bound milliseconds without the FRU); continuous per-step remainder resolution during training, which replaces periodic Q-basis reprojection stalls with microsecond-level maintenance; and complete Prolog unification over active VDR values carrying nonzero remainders. At single-query scale, the FRU does not change wall-clock latency: the language model forward pass, at approximately 30 microseconds per command token, dominates primitive execution at 1-100 nanoseconds by 300-30,000×. At datacenter scale with millions of concurrent sessions, the FRU eliminates host round-trips for remainder resolution, keeping the entire rule-driven execution path on the data-plane chip and removing the serialization bottleneck that would otherwise limit throughput as accumulated Prolog rules handle an increasing fraction of work autonomously. The paper specifies the FRU microarchitecture, traces the full inference chain with adaptive precision, analyzes the datacenter throughput implications, and identifies the boundary between what the FRU changes (capability and throughput at scale) and what it does not (single-query latency).
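
The idea of a remainder slot holding a callable can be illustrated in software. The following is a minimal sketch only, not the layout or semantics defined in HOWL-VDR-1-2026: the class `VDR`, the helpers `exp_remainder`, `exp_vdr`, and `softmax_exact`, and the convention that a triple resolves to Value/Denominator plus the remainder evaluated at a given depth are all assumptions made for illustration. The remainder here is a Taylor recurrence for the exponential, evaluated with exact rational arithmetic; each resolved value is an exact rational, but it is a truncation of the series at the requested depth rather than the transcendental value itself.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Callable, List, Optional


@dataclass(frozen=True)
class VDR:
    """Hypothetical stand-in for a (Value, Denominator, Remainder) triple."""
    value: int
    denominator: int
    # Assumed convention: the remainder maps a requested depth to an exact rational.
    remainder: Optional[Callable[[int], Fraction]] = None

    def resolve(self, depth: int) -> Fraction:
        """Return Value/Denominator plus the remainder evaluated at `depth`."""
        base = Fraction(self.value, self.denominator)
        if self.remainder is None:
            return base
        return base + self.remainder(depth)


def exp_remainder(x: Fraction) -> Callable[[int], Fraction]:
    """Build a functional remainder for exp(x): the Taylor terms k = 1..depth."""
    def tail(depth: int) -> Fraction:
        term = Fraction(1)
        total = Fraction(0)
        for k in range(1, depth + 1):
            term = term * x / k  # recurrence: x^k/k! from x^(k-1)/(k-1)!
            total += term
        return total
    return tail


def exp_vdr(x: Fraction) -> VDR:
    """exp(x) as the leading term 1/1 plus a functional remainder."""
    return VDR(1, 1, exp_remainder(x))


def softmax_exact(logits: List[Fraction], depth: int) -> List[Fraction]:
    """Softmax over truncated-series exponentials, in exact rationals.

    Assumes `depth` is large enough that every truncated sum is positive.
    """
    weights = [exp_vdr(x).resolve(depth) for x in logits]
    total = sum(weights, Fraction(0))
    return [w / total for w in weights]


if __name__ == "__main__":
    print(exp_vdr(Fraction(1)).resolve(3))
    print(softmax_exact([Fraction(0), Fraction(1)], 3))
```

Because every intermediate is a `Fraction`, the softmax outputs in this sketch sum to exactly 1 at any depth; raising `depth` refines the approximation to the exponential without changing the representation, which is the adaptive-precision behaviour the abstract attributes to functional remainders.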
