What question did this study set out to answer?

The study asks how far the decode throughput of a 30-billion-parameter Mixture-of-Experts model can be pushed on low-cost, CPU-only Raspberry Pi hardware.

May 25, 2026 · Open Access

Pushing Four Raspberry Pis to the Memory Wall: Bit-Exact 30B Mixture-of-Experts Inference, +16% over the Public Record

Key Points

The aim is to raise the inference throughput of a 30-billion-parameter Mixture-of-Experts model (Qwen3-30B-A3B) on low-cost Raspberry Pi hardware.
Inference ran on four Raspberry Pi 5 boards, CPU only, with no GPU or NPU.
The authors modified distributed-llama v0.16.5 with twelve source-level changes: eight framework patches plus four bit-exact kernel and op-fusion optimisations.
ARM PMU profiling identified LPDDR4X memory bandwidth as the residual bottleneck.
Decode reached 15.143 tok/s, +15.2% over the vanilla distributed-llama baseline (13.15 tok/s) on identical hardware.
This is also +16.1% above the highest publicly documented result for this model and hardware class (13.04 tok/s).
Co-located telemetry agents reduced decode throughput by 5.18% while leaving prefill unchanged, consistent with decode being bandwidth-bound.
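
The two headline percentages follow directly from the throughput figures quoted in the abstract. The short check below is only an illustrative sketch: the function name `pct_gain` is hypothetical, and the inputs are the tok/s values reported by the authors, not new measurements.

```python
def pct_gain(baseline_tok_s: float, result_tok_s: float) -> float:
    """Relative throughput gain of result over baseline, in percent."""
    return (result_tok_s / baseline_tok_s - 1.0) * 100.0


# Figures as reported in the abstract (tok/s, decode).
RESULT = 15.143           # optimised four-node cluster
VANILLA_BASELINE = 13.15  # vanilla distributed-llama, same 16 GB hardware
PUBLIC_RECORD = 13.04     # highest publicly documented result, 8 GB SKU

gain_vs_baseline = pct_gain(VANILLA_BASELINE, RESULT)
gain_vs_record = pct_gain(PUBLIC_RECORD, RESULT)

print(f"vs. vanilla baseline: +{gain_vs_baseline:.1f}%")
print(f"vs. public record:    +{gain_vs_record:.1f}%")
```

Rounded to one decimal place, the two gains match the +15.2% and +16.1% stated in the key points.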

Abstract

A 30-billion-parameter Mixture-of-Experts language model (Qwen3-30B-A3B) runs on four Raspberry Pi 5 boards (roughly €500 of CPU-only hardware, with no GPU or NPU) at 15.143 tok/s decode, bit-exact. On identical 16 GB silicon this is +15.2% over the vanilla distributed-llama baseline (13.15 → 15.143 tok/s, n=20), a confound-free figure attributable to the software and configuration work reported here. It is also +16.1% above the highest publicly documented result for this model and hardware class (13.04 tok/s; b4rtaz #255, 8 GB SKU), a cross-SKU comparison we show carries no measurable hardware confound. We adopt distributed-llama v0.16.5 with twelve source-level changes (eight framework patches plus four bit-exact kernel and op-fusion optimisations) and persistent runtime kernel tuning. Every optimisation stage is validated bit-exact (SHA-256 of the generated token-id sequence at seed=42, temperature=0). The trajectory runs from a 5.70 tok/s Llama-3.1-8B dense baseline, through an 11.40 tok/s Qwen3-MoE baseline, to the result above; end-to-end sustained serving throughput (prefill included) is 14.449 tok/s. ARM PMU profiling locates the residual bottleneck in LPDDR4X memory bandwidth (49% backend-stalled cycles, 11.4 GB/s of ~17 GB/s per node). We catalogue twenty-six unsuccessful configurations under strict constraints (bit-exact, no kernel rebuild, no model change, no overclock), and find that co-located telemetry agents tax this memory-bound decode by 5.18% while leaving compute-bound prefill unchanged, a clean confirmation that decode is bandwidth-bound. Code, patches, systemd units and reproducibility scripts: https://github.com/hellomatik-org/distributed-llama
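
The bit-exact criterion described in the abstract, a SHA-256 digest over the generated token-id sequence, can be sketched as follows. This is a minimal illustration, not the authors' validation script: the function names are hypothetical, and the serialisation (each id packed as a little-endian unsigned 32-bit integer) is an assumption, since the abstract does not say how the ids are encoded before hashing.

```python
import hashlib
import struct


def token_digest(token_ids: list[int]) -> str:
    """SHA-256 hex digest of a token-id sequence.

    Assumed encoding: each id packed as a little-endian uint32.
    The actual serialisation used in the study is not stated.
    """
    payload = struct.pack(f"<{len(token_ids)}I", *token_ids)
    return hashlib.sha256(payload).hexdigest()


def is_bit_exact(reference_ids: list[int], candidate_ids: list[int]) -> bool:
    """True when both runs produced the identical token-id sequence."""
    return token_digest(reference_ids) == token_digest(candidate_ids)


# Placeholder ids for illustration only, not real model output.
baseline_run = [151644, 872, 198, 9707]
optimised_run = [151644, 872, 198, 9707]
diverged_run = [151644, 872, 198, 9708]

print(is_bit_exact(baseline_run, optimised_run))  # identical sequences
print(is_bit_exact(baseline_run, diverged_run))   # differs in the last token
```

Because decoding at temperature=0 with a fixed seed is deterministic, any single differing token changes the digest, so comparing one hash per optimisation stage is enough to detect a numerical regression that alters the output.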
