We propose the Multimodal Spatio-Temporal Hypergraph Convolutional Network (MST-HGCN), a unified framework for automatic infantile spasm detection from synchronized video and EEG recordings. Each 5-second segment is divided into ten 0.5-second windows; within each window, video and EEG nodes are constructed and fused through synchronous hyperedges. The video skeleton is partitioned into five anatomical limb regions, and the sixteen EEG electrodes are grouped into five cortical regions, each forming an aggregated node. Temporal hyperedges link adjacent windows. To address class imbalance, the training objective combines Focal Loss with a dynamic-margin triplet loss. The dataset comprises 1,358 five-second segments from synchronized video-EEG recordings of 30 infants, covering both spasm and non-spasm events. Under five-fold cross-validation, the fusion model with the detector enabled achieves 99.19% accuracy, 98.02% precision, 98.82% recall, and 98.39% F1-score. In the independent test, the model attains 89.12% accuracy, 75.27% precision, 89.74% recall, and 81.82% F1-score, substantially reducing both missed detections and false alarms compared with single-modality baselines.
Wang et al. (Mon,) studied this question.
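To make the segment layout concrete, the following is a minimal sketch of how a 5-second segment could be windowed and how a hypergraph incidence matrix with synchronous and temporal hyperedges might be assembled. The window count and the five video plus five EEG region nodes come from the description above; everything else is an assumption for illustration, not the authors' construction. In particular, the sketch assumes one synchronous hyperedge per window spanning all ten region nodes, and one temporal hyperedge per region node per pair of adjacent windows. The function names `split_into_windows`, `node_index`, and `build_incidence` are hypothetical.

```python
import numpy as np

N_WINDOWS = 10   # ten 0.5-second windows per 5-second segment (from the text)
N_VIDEO = 5      # five anatomical limb-region nodes (from the text)
N_EEG = 5        # five cortical-region nodes (from the text)
NODES_PER_WINDOW = N_VIDEO + N_EEG


def split_into_windows(signal, n_windows=N_WINDOWS):
    """Split a (channels, samples) array into equal, non-overlapping windows.

    Returns an array of shape (n_windows, channels, samples_per_window).
    """
    channels, samples = signal.shape
    if samples % n_windows != 0:
        raise ValueError("sample count must be divisible by n_windows")
    per_window = samples // n_windows
    return signal.reshape(channels, n_windows, per_window).transpose(1, 0, 2)


def node_index(window, local):
    """Global index of region node `local` (0..9) inside window `window`."""
    return window * NODES_PER_WINDOW + local


def build_incidence():
    """Build a binary incidence matrix H of shape (n_nodes, n_hyperedges).

    Assumed layout (illustrative only):
      * one synchronous hyperedge per window joining all video and EEG nodes;
      * one temporal hyperedge per region node joining windows t and t + 1.
    """
    n_nodes = N_WINDOWS * NODES_PER_WINDOW
    hyperedges = []

    # Synchronous hyperedges: fuse both modalities within the same window.
    for t in range(N_WINDOWS):
        hyperedges.append([node_index(t, k) for k in range(NODES_PER_WINDOW)])

    # Temporal hyperedges: connect the same region node across adjacent windows.
    for t in range(N_WINDOWS - 1):
        for k in range(NODES_PER_WINDOW):
            hyperedges.append([node_index(t, k), node_index(t + 1, k)])

    H = np.zeros((n_nodes, len(hyperedges)), dtype=np.float32)
    for e, members in enumerate(hyperedges):
        H[members, e] = 1.0
    return H
```

Under these assumptions the matrix has 100 nodes and 100 hyperedges (10 synchronous, 90 temporal). A hypergraph convolution layer would typically consume `H` together with per-node features; how MST-HGCN weights or normalizes its hyperedges is not specified here, so that step is left out.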