A 30-billion-parameter Mixture-of-Experts language model (Qwen3-30B-A3B) runs on four Raspberry Pi 5 boards (roughly €500 of CPU-only hardware, with no GPU or NPU) at 15.143 tok/s decode, bit-exact. On identical 16 GB silicon this is +15.2% over the vanilla distributed-llama baseline (13.15 → 15.143 tok/s, n=20), a confound-free figure attributable to the software and configuration work reported here. It is also +16.1% above the highest publicly documented result for this model and hardware class (13.04 tok/s; b4rtaz #255, 8 GB SKU), a cross-SKU comparison that we show carries no measurable hardware confound. We build on distributed-llama v0.16.5 with twelve source-level changes (eight framework patches plus four bit-exact kernel and op-fusion optimisations) and persistent runtime kernel tuning. Every optimisation stage is validated as bit-exact by comparing the SHA-256 of the generated token-id sequence at seed=42, temperature=0. The trajectory runs from a 5.70 tok/s Llama-3.1-8B dense baseline, through an 11.40 tok/s Qwen3-MoE baseline, to the result above; end-to-end sustained serving throughput (prefill included) is 14.449 tok/s. ARM PMU profiling locates the residual bottleneck in LPDDR4X memory bandwidth (49% backend-stalled cycles, 11.4 GB/s of ~17 GB/s per node). We catalogue twenty-six unsuccessful configurations under strict constraints (bit-exact, no kernel rebuild, no model change, no overclock). We also find that co-located telemetry agents tax this memory-bound decode by 5.18% while leaving compute-bound prefill unchanged, a clean confirmation that decode is bandwidth-bound. Code, patches, systemd units and reproducibility scripts: https://github.com/hellomatik-org/distributed-llama
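The bit-exact criterion in the abstract, equal SHA-256 digests of the generated token-id sequence, can be illustrated with a minimal sketch. The abstract does not say how the token ids are serialised before hashing, so the little-endian 32-bit encoding and the names `token_fingerprint` and `is_bit_exact` below are assumptions made for illustration, not the repository's actual scripts.

```python
import hashlib
import struct


def token_fingerprint(token_ids):
    """Return the SHA-256 hex digest of a token-id sequence.

    Assumption: each id is packed as a little-endian unsigned 32-bit
    integer. A fixed-width encoding keeps sequences such as [1, 23] and
    [12, 3] distinct, which naive string concatenation would not.
    """
    payload = struct.pack(f"<{len(token_ids)}I", *token_ids)
    return hashlib.sha256(payload).hexdigest()


def is_bit_exact(reference_ids, candidate_ids):
    """True when both runs produced identical token-id sequences."""
    return token_fingerprint(reference_ids) == token_fingerprint(candidate_ids)


if __name__ == "__main__":
    # Hypothetical token ids standing in for two runs at seed=42, temperature=0.
    baseline_run = [9707, 11, 1879, 0]
    optimised_run = [9707, 11, 1879, 0]
    print("bit-exact:", is_bit_exact(baseline_run, optimised_run))
```

Comparing digests rather than raw sequences means each optimisation stage can be checked against a single stored reference hash, so a change that alters even one generated token is rejected.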
Daniel Correa Villa (Sat,) studied this question.