What question did this study set out to answer?

The study asks whether an MLP-mixer time series forecaster can be extended into a Mixture-of-Experts (MoE) architecture, improving forecasting ability without the large increase in trainable parameters that a fully trainable MoE brings.

June 17, 2026 | Open Access

Mixture of TSMixer Experts for Time Series Forecasting

Key Points

Proposed a method combining a fully trainable global expert with multiple non-trainable local experts.
Cloned the weights of the global expert into the local experts, then modified their weight distributions using moment learning.
Evaluated the method with a lightweight Time Series Mixer (TSMixer) architecture across multiple benchmark settings.
Achieved forecasting accuracy on par with, and in several cases favorable to, a fully trainable multi-expert baseline.
Added only a small fraction of the extra trainable parameters that a fully trainable multi-expert baseline requires.
Corroborated the efficiency gain with memory-footprint measurements and an effect-size-based assessment of the observed differences.

Abstract

As recent Multi-Layer Perceptron (MLP) mixer models have achieved state-of-the-art performance in time series forecasting, modeling each MLP-mixer as a separate expert within a mixture is expected to extend the representational capacity of the model, allowing each expert to be activated in response to time-varying inputs. However, extending MLP-mixers into a Mixture-of-Experts (MoE) architecture significantly increases the number of trainable parameters, making the model harder to train. To mitigate this problem, we propose a method that combines a fully trainable global expert with multiple non-trainable local experts. Specifically, our approach clones the weights of the global expert into the local experts and then modifies their weight distributions using moment learning, a recently proposed unconventional method for training neural networks. Concretely, each local expert is produced by applying moment-based transformations to a shared copy of the global expert’s weights, so that expert specialization is obtained without independently training the additional experts. Experimental results using a lightweight Time Series Mixer (TSMixer) architecture demonstrate that our method achieves performance competitive with fully trainable MoE counterparts, without introducing a significant increase in trainable parameters. Across multiple benchmark settings, the proposed model attains forecasting accuracy on par with, and in several cases favorable to, a fully trainable multi-expert baseline while adding only a small fraction of the extra trainable parameters that such a baseline requires. This efficiency is further corroborated by measurements of memory footprint and by an effect-size-based assessment of the observed differences.
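
The clone-then-transform idea in the abstract can be sketched in a few lines. The abstract does not specify the moment-based transformation, so this sketch assumes the simplest reading: each local expert is an affine rescaling of the cloned global weights to a target mean and standard deviation. The names `moment_transform`, `expert_forecast`, and `softmax`, the toy linear expert (a stand-in for a TSMixer block), the moment values, and the fixed gate scores are all hypothetical.

```python
import math
import random
from statistics import fmean, pstdev


def moment_transform(weights, target_mean, target_std):
    """ASSUMED transformation: affinely map cloned weights to a target mean and std."""
    mu, sigma = fmean(weights), pstdev(weights)
    return [target_mean + target_std * (w - mu) / sigma for w in weights]


def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expert_forecast(weights, window):
    """Toy linear expert (stand-in for a TSMixer block): dot product with the window."""
    return sum(w * x for w, x in zip(weights, window))


random.seed(0)
# Global expert: in the described method, these weights are fully trainable.
global_w = [random.gauss(0.0, 0.1) for _ in range(8)]
# One (mean, std) pair per local expert; in this sketch these are the only extra trainables.
moments = [(0.0, 0.05), (0.02, 0.2)]
# Local experts are derived from a copy of the global weights, not trained independently.
local_ws = [moment_transform(global_w, m, s) for m, s in moments]

window = [0.1 * t for t in range(8)]       # toy input window
gate = softmax([0.5, 0.1, -0.2])           # hypothetical gate scores, fixed for illustration
experts = [global_w] + local_ws
forecast = sum(g * expert_forecast(w, window) for g, w in zip(gate, experts))
print(forecast)
```

In this reading, the local experts differ from the global expert only through their weight statistics, which is why specialization costs a handful of scalars rather than a full set of expert weights. A gating network that depends on the input would replace the fixed scores used here.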
