Recent Multi-Layer Perceptron (MLP) mixer models have achieved state-of-the-art performance in time series forecasting. Treating each MLP-mixer as a separate expert within a mixture is expected to extend the model's representational capacity, since different experts can be activated in response to time-varying inputs. However, extending MLP-mixers into a Mixture-of-Experts (MoE) architecture significantly increases the number of trainable parameters, which makes the model harder to train. To mitigate this problem, we propose a method that combines one fully trainable global expert with multiple non-trainable local experts. Specifically, our approach clones the weights of the global expert into the local experts and then modifies their weight distributions using moment learning, a recently proposed unconventional method for training neural networks. Each local expert is thus produced by applying moment-based transformations to a shared copy of the global expert's weights, so expert specialization is obtained without independently training the additional experts. Experiments with a lightweight Time Series Mixer (TSMixer) architecture show that, across multiple benchmark settings, our method attains forecasting accuracy on par with, and in several cases better than, a fully trainable MoE baseline, while adding only a small fraction of the extra trainable parameters that such a baseline requires. This efficiency is further corroborated by measurements of memory footprint and by an effect-size-based assessment of the observed differences.
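The idea of deriving local experts from a shared copy of the global expert's weights can be illustrated with a minimal sketch. The abstract does not specify the moment-based transformation, so the code below assumes, purely for illustration, an affine map that resets the first two moments (mean and standard deviation) of the cloned weights. The function `derive_local_expert`, the weight values, and the moment targets are all hypothetical, and the parameter count assumes that two scalars per local expert are the only extra quantities.

```python
import statistics


def derive_local_expert(global_weights, target_mean, target_std):
    """Return a copy of global_weights rescaled to the given mean and std.

    Assumption: an affine map on the first two moments stands in for the
    moment-based transformation, which the abstract does not specify.
    """
    mean = statistics.fmean(global_weights)
    std = statistics.pstdev(global_weights)
    if std == 0.0:
        # Degenerate case: all weights are equal, so only the mean can move.
        return [target_mean for _ in global_weights]
    return [target_mean + target_std * (w - mean) / std for w in global_weights]


# Hypothetical global expert, represented as a flattened weight vector.
global_weights = [0.2, -0.5, 1.0, 0.3, -0.1, 0.7]

# Hypothetical (mean, std) targets, one pair per local expert.
moment_targets = [(0.0, 1.0), (0.1, 0.5), (-0.2, 2.0)]

# Each local expert is a transformed clone; the global weights are untouched.
local_experts = [
    derive_local_expert(global_weights, m, s) for m, s in moment_targets
]

# Parameter counts under this sketch's assumptions.
n_shared = len(global_weights) + 2 * len(moment_targets)
n_full_moe = len(global_weights) * (1 + len(moment_targets))
```

In this toy setting the shared scheme needs the global weights plus two scalars per local expert, whereas a fully trainable mixture stores a full weight vector for every expert. The gap widens as the per-expert weight count grows.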
Hong et al. (Mon,) studied this question.