What question did this study set out to answer?

This study asks whether the Product of Experts (PoE) framework can serve as a scalable local learning alternative to end-to-end backpropagation.

April 22, 2026 · Open Access

Product of Experts as Scalable Local Learning: Modular Construction at 1.3B Parameters

Key Points

Used a modular 1.3B-parameter architecture with a clustered PoE design (4 stages × 6 layers), trained with per-stage detached cross-entropy losses projected through a shared output head.
Compared PoE against a matched backpropagation (BP) baseline on BPB, CORE benchmark tasks, and inference-time capabilities.
Tracked the gap across training to separate architectural cost from training-budget effects: it widens from +4.32% at step 1k to +6.52% at the final step (26,430).
Found a 6.52% BPB gap versus the BP baseline (PoE: 0.720935, BP: 0.676788), which the authors interpret as a bounded architectural cost of local learning.
Outperformed BP on commonsense reasoning (PIQA +5.0pp, CommonsenseQA +5.8pp) and algorithmic pattern recognition (BigBench CS Algorithms +11.4pp), but underperformed on rare-fact retrieval (Jeopardy −16.2pp, SQuAD −18.4pp, LAMBADA −15.0pp).
Enabled inference-time options such as stage prefix pruning (4× compute reduction at 87.5% factual accuracy) and WAND adaptive depth (1.82× wall-clock at 100% top-1 agreement).
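
As a quick check on the headline number, the 6.52% gap can be reproduced from the two BPB values reported in the abstract. The sketch below assumes the gap is the relative excess of PoE over BP, which matches the stated figure; the function name is illustrative and not taken from the paper.

```python
def relative_gap_percent(candidate: float, baseline: float) -> float:
    """Relative excess of `candidate` over `baseline`, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Final BPB values as reported in the abstract.
poe_bpb = 0.720935
bp_bpb = 0.676788

gap = relative_gap_percent(poe_bpb, bp_bpb)
print(f"PoE vs BP gap: {gap:.2f}%")  # prints "PoE vs BP gap: 6.52%"
```

Because lower BPB is better, a positive gap here means PoE trails the BP baseline.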

Abstract

We present Product of Experts (PoE) as a scalable local learning framework that replaces end-to-end backpropagation with per-stage detached cross-entropy losses projected through a shared output head. At 1.3B parameters on the ClimbMix pretraining corpus, clustered PoE (4 stages × 6 layers) produces a bounded architectural trade-off: a 6.52% BPB gap versus a matched backpropagation baseline (PoE: 0.720935, BP: 0.676788), in exchange for a family of inference-time capabilities that a standard BP-trained model cannot access without retraining or accuracy loss. The gap widens convexly through training (+4.32% at step 1k → +6.52% at step 26,430 final), with 31% of the widening concentrated in the final 6K warmdown steps. Combined with the non-compressing r=10 → r=20 budget response (6.0% → 6.52%), the evidence supports a structural-floor interpretation (H-S): the gap reflects a bounded architectural cost of local learning rather than a training-budget artifact. Architectural consequences released with this paper include: stage prefix pruning (4× compute reduction at 87.5% factual accuracy), WAND adaptive depth (1.82× wall-clock at 100% top-1 agreement), speculative decoding with zero added parameters (1.87× speedup at 88% acceptance), parallel stage composition (+2.4 logit margin via log-space expert combination), and post-hoc specialist stages via dual-head construction that preserves the base bit-identically (Δlogit = 0.0000 across 12 checkpoints). CORE benchmark results are task-polarized: PoE underperforms on rare-fact retrieval (Jeopardy −16.2pp, SQuAD −18.4pp, LAMBADA −15.0pp) but exceeds BP on commonsense reasoning (PIQA +5.0pp, CommonsenseQA +5.8pp) and algorithmic pattern recognition (BigBench CS Algorithms +11.4pp). Deployment positioning: datacenter quality-critical inference favors BP; on-device inference favors PoE's architectural elasticity.
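
The abstract mentions "log-space expert combination" without spelling out the rule. A minimal sketch of the textbook product-of-experts combination is given below: each expert's logits are converted to log-probabilities, the log-probabilities are summed, and the result is renormalized. Treating each stage's logits as one expert is an assumption here, and the function names and toy logits are hypothetical rather than the paper's implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def product_of_experts(expert_logits):
    """Sum per-expert log-probabilities, then renormalize to a distribution."""
    vocab_size = len(expert_logits[0])
    summed = [0.0] * vocab_size
    for logits in expert_logits:
        for i, log_p in enumerate(log_softmax(logits)):
            summed[i] += log_p
    # Renormalizing the summed log-probs is the same as normalizing the
    # product of the experts' probabilities.
    return [math.exp(log_p) for log_p in log_softmax(summed)]


# Toy example: three "stages" scoring a 3-token vocabulary (made-up logits).
stage_logits = [
    [2.0, 1.0, 0.0],
    [1.5, 1.4, -1.0],
    [0.5, 0.0, 0.2],
]
probs = product_of_experts(stage_logits)
print([round(p, 3) for p in probs])
```

Because the combination is a product, a token needs support from every expert to keep high probability; a single expert assigning it low probability pulls the combined score down sharply.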

Authors

Jaepil Jeong

Cognizant (United States)

Institutions

Cognizant (United States)
