We present Adaptive-K routing, a method that dynamically selects the number of experts in Mixture-of-Experts (MoE) models based on routing entropy. Instead of using a fixed top-k set of experts per token, our approach uses fewer experts when the router is confident (low entropy) and more experts when it is uncertain (high entropy).

Results on production MoE models:
- Mixtral 8x7B: 52.5% compute reduction
- Qwen-MoE: 32.4% compute reduction
- OLMoE-1B-7B: 24.7% compute reduction

When combined with quantization and speculative decoding, we achieve up to 96% total compute savings through multiplicative composition.

Code: https://github.com/Gabrobals/sbm-efficient
PyPI: pip install adaptive-k-routing
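To make the idea concrete, here is a minimal sketch of entropy-based expert selection for a single token, together with the arithmetic behind multiplicative composition. It is an illustration under stated assumptions, not the API of the adaptive-k-routing package: the function names, the two-level rule (`k_min` or `k_max`), the use of normalized entropy, and the `threshold` value are all hypothetical choices, and the savings figures passed to `combined_savings` are placeholders rather than results from this work.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routing_entropy(probs):
    """Shannon entropy (in nats) of the router distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_k(logits, k_min=1, k_max=2, threshold=0.5):
    """Pick k from normalized routing entropy (hypothetical two-level rule).

    Returns (k, weights), where weights maps each selected expert index
    to its renormalized routing probability.
    """
    if len(logits) < 2:
        raise ValueError("need at least two experts")
    probs = softmax(logits)
    # Divide by log(num_experts) so the entropy lies in [0, 1].
    h = routing_entropy(probs) / math.log(len(probs))
    # Assumption: a confident router (low entropy) gets k_min experts.
    k = k_min if h < threshold else k_max
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top = ranked[:k]
    norm = sum(probs[i] for i in top)
    return k, {i: probs[i] / norm for i in top}


def combined_savings(savings):
    """Compose independent fractional savings multiplicatively.

    Each technique leaves a fraction (1 - s) of the cost, so the total
    saving is 1 minus the product of the remaining fractions.
    """
    remaining = 1.0
    for s in savings:
        remaining *= 1.0 - s
    return 1.0 - remaining


# A peaked router distribution selects one expert; a flat one selects two.
k_confident, _ = adaptive_k([10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
k_uncertain, _ = adaptive_k([0.0] * 8)

# Placeholder values: two techniques that each halve the cost.
total = combined_savings([0.5, 0.5])
```

With these inputs the peaked distribution routes to a single expert and the uniform one to two, and two hypothetical 50% savings compose to 75%. A real router would apply the same rule per token across a batch, and the threshold would need to be tuned per model.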
Gabriele Balsamo studied this question.