We present Adaptive-K routing, a method that dynamically selects the number of experts in Mixture-of-Experts (MoE) models based on routing entropy. Instead of using a fixed top-k experts per token, our approach uses fewer experts when the router is confident (low entropy) and more experts when uncertain (high entropy).

Results on production MoE models:
- Mixtral 8x7B: 52.5% compute reduction
- Qwen-MoE: 32.4% compute reduction
- OLMoE-1B-7B: 24.7% compute reduction

When combined with quantization and speculative decoding, we achieve up to 96% total compute savings through multiplicative composition.

Code: https://github.com/Gabrobals/sbm-efficient
PyPI: pip install adaptive-k-routing
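To make the idea concrete, the following is a minimal sketch of entropy-gated expert selection, plus the arithmetic behind multiplicative composition of savings. It is an illustration under stated assumptions, not the paper's implementation and not the API of the adaptive-k-routing package: the function names, the two-level rule (k_min or k_max), and the 0.5 normalized-entropy threshold are all hypothetical choices made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(n), so the result lies in [0, 1]."""
    n = len(probs)
    if n <= 1:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(n)


def adaptive_k(probs, k_min=1, k_max=2, threshold=0.5):
    """Hypothetical rule: k_min experts when the router is confident
    (entropy below threshold), k_max experts otherwise."""
    return k_min if normalized_entropy(probs) < threshold else k_max


def route(logits, k_min=1, k_max=2, threshold=0.5):
    """Return {expert_index: weight} for one token.

    Weights are the router probabilities of the selected experts,
    renormalized to sum to 1.
    """
    probs = softmax(logits)
    k = adaptive_k(probs, k_min, k_max, threshold)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}


def combined_savings(*savings):
    """Compose independent fractional savings multiplicatively:
    total = 1 - product(1 - s_i)."""
    remaining = 1.0
    for s in savings:
        remaining *= 1.0 - s
    return 1.0 - remaining


if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # one dominant expert
    uncertain = [0.0] * 8                                   # uniform router
    print(len(route(confident)), "expert(s) for the confident token")
    print(len(route(uncertain)), "expert(s) for the uncertain token")
```

Under this sketch, a token whose router distribution is sharply peaked is sent to a single expert, while a token with a near-uniform distribution gets the full k_max. The composition helper shows why savings from separate techniques do not simply add: each technique reduces only the compute left over by the others, which is the sense in which the abstract's combined figure is multiplicative.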
Gabriele Balsamo
Gabriele Balsamo (Sat,) studied this question.
synapsesocial.com/papers/696f1a9f9e64f732b51eee13 — DOI: https://doi.org/10.5281/zenodo.18282008