May 17, 2026 · Open Access

KRAQX K-Core and K-OS: A Co-Designed Transformer Inference Chip and Operating System Purpose-Built for Local LLM Inference

Abstract

Version 9.1 (May 14, 2026) revises the v9 manuscript by replacing extrapolated reference-platform benchmark anchors with direct H100 SXM5 (50.5 t/s) and NVIDIA B200 SXM (71.4 t/s) single-stream measurements on documented open-stack configurations (vLLM 0.7.3 and 0.20.1, compressed-tensors FP8). The K-Core architecture and the 70B-FP8 single-chip projection (265 t/s) are unchanged from v9. A new Section 5.4 introduces a locality-aware sensitivity case (284 t/s) as an exploratory upper-bound estimate alongside the conservative canonical baseline. The full revision history is in the manuscript's Revision Note. The companion simulator codebase is publicly available at https://github.com/Kraqx/kraqx-sim-public.

Large language model (LLM) inference workloads are fundamentally memory-bandwidth-bound, yet existing hardware (GPUs, CPUs, and even Apple Silicon) was designed for general-purpose computation and adapted to inference as a secondary use case. We present the KRAQX platform, comprising the K-CORE inference chip and K-OS, a co-designed operating system. K-CORE combines three innovations:

(1) Eight HBM4 memory stacks bonded directly to the logic die via TSMC SoIC-X hybrid bonding at 6 µm pitch, delivering 16 TB/s aggregate bandwidth at 512 GB capacity, a combination unavailable in any current product.
(2) 128 dedicated Transformer Layer Engines (TLEs) implementing attention and feed-forward operations in hardwired silicon on TSMC A16.
(3) K-OS, a co-designed operating system of 5,300 lines that eliminates all software abstraction layers the hardware renders unnecessary.

All throughput comparisons are scoped to single-stream (batch=1) dense FP8 decode on Llama3.1-70B unless otherwise noted. Cycle-accurate simulation calibrated against published H100, B200, and M3 Ultra benchmarks projects 265 tokens/second on Llama3.1-70B-FP8 at 30 watts, compared to 60 tokens/second at 700 watts for the NVIDIA H100 SXM5 and approximately 144 tokens/second at 1,000 watts for the NVIDIA B200 SXM in the same single-stream regime. Against the H100 this represents a 4.4× throughput improvement at 23× lower power; against the B200, K-CORE delivers 1.84× more throughput at 33× lower power. The 4-chip cluster configuration projects 824 tokens/second at 140 watts.

The eight-stage TLE pipeline has been validated bit-exactly end-to-end with Verilator 5.020 against an independent Python reference model: all 84 cycles close exactly, the maximum absolute error is zero Q4.12 LSBs on every observable output of every stage, and the stage-4 causal-mask invariant holds with zero violations. All simulation parameters are grounded in published fabrication specifications, and 8/8 physics constraint checks pass. The complete simulation codebase and the Verilator validation flow are released as open-source software at github.com/Kraqx/kraqx-sim.

This work is the subject of U.S. Provisional Patent Applications No. 64/053,100 (filed April 28, 2026) and No. 64/058,503 (filed May 6, 2026).
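The throughput and power ratios quoted in the comparison sentence can be reproduced with simple arithmetic. The sketch below is illustrative only and is not part of the KRAQX simulator. It uses the tokens/second and watt figures exactly as quoted in that sentence; the names `PLATFORMS` and `compare` are hypothetical, and the energy-per-token ratio is a derived quantity (watts divided by tokens/second) that the abstract does not state.

```python
# Figures as quoted in the abstract's comparison sentence:
# (tokens per second, watts), single-stream Llama3.1-70B-FP8 decode.
PLATFORMS = {
    "K-CORE": (265.0, 30.0),
    "H100 SXM5": (60.0, 700.0),
    "B200 SXM": (144.0, 1000.0),
}


def compare(baseline, target="K-CORE"):
    """Return (throughput ratio, power ratio, energy-per-token ratio)."""
    t_tps, t_watts = PLATFORMS[target]
    b_tps, b_watts = PLATFORMS[baseline]
    speedup = t_tps / b_tps              # how many times more tokens/s
    power_reduction = b_watts / t_watts  # how many times lower power
    # Joules per token = watts / (tokens per second); this derived ratio
    # is an assumption-free consequence of the two figures above.
    energy_ratio = (b_watts / b_tps) / (t_watts / t_tps)
    return speedup, power_reduction, energy_ratio


for name in ("H100 SXM5", "B200 SXM"):
    s, p, e = compare(name)
    print(f"{name}: {s:.2f}x throughput, {p:.1f}x lower power, "
          f"{e:.0f}x less energy per token")
```

Under these inputs the throughput ratios round to the 4.4× and 1.84× stated in the abstract, and the power ratios truncate to 23× and 33×.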

Authors

Robert Fields

Institutions

Quality Systems (United States)
