Motor imagery (MI) EEG classification is a key component of noninvasive brain–computer interfaces (BCIs) and often must satisfy strict latency constraints in online or edge deployments. Although ensembling can reliably improve MI decoding accuracy, its inference cost grows linearly with the number of ensemble members, making it impractical for low-latency applications. To address these issues, we propose an entropy-based dual-teacher distillation framework that transfers ensemble teacher knowledge to a single deployable backbone. From an information theoretic perspective, two failure modes are common in small and noisy MI datasets: elevated predictive entropy (noisy decisions) and large fluctuation across late training epochs (unstable convergence and unreliable checkpoint selection). Thus, we introduce an exponential moving average (EMA) teacher with entropy-gated activation as a low-pass filter in parameter space to reduce the student’s prediction noise. In addition, a two-stage cosine annealing schedule is employed to suppress late-stage oscillations and improve the robustness of final checkpoint selection. Experiments on two public MI benchmarks (BCI Competition IV-2a and IV-2b) with three representative backbones (EEGNet, ShallowConvNet, and ATCNet) under the subject dependent protocol show consistent accuracy gains over the ensemble teacher and strong distillation baselines. On IV-2a, our method achieves an average accuracy of 0.7713 across the backbones, surpassing both the original models (0.7222) and the corresponding ensembles (0.7482); on IV-2b, it achieves 0.8583 versus 0.8432 (original) and 0.8529 (ensemble).
Xu et al. (Tue,) studied this question.