A multimodal Transformer-based late-fusion model with pre-training on ECG and EEG signals achieved an arousal F1 score of 0.89 and a valence F1 score of 0.85, significantly outperforming single-modality models.
Transformer-based models with multimodal pre-training can improve emotion recognition from physiological signals.
Absolute Event Rate: 0.89% vs 0.87%
p-value: p < 1e-3
In this paper, we address the problem of multimodal emotion recognition from multiple physiological signals. We demonstrate that a Transformer-based approach is suitable for this task. In addition, we present how such models may be pre-trained in a multimodal scenario to improve emotion recognition performance. We evaluate the benefits of using multimodal inputs and pre-training with our approach on a state-of-the-art dataset.
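As a rough illustration of what late fusion means here, the following minimal sketch combines the outputs of two single-modality classifiers only at the decision level. The fusion rule (a weighted average of class probabilities), the function names, and the `ecg_weight` parameter are assumptions made for illustration; the summary above does not state how the authors combine the ECG and EEG branches.

```python
from math import exp


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(ecg_logits, eeg_logits, ecg_weight=0.5):
    """Fuse per-modality predictions at the decision level.

    Assumption: a weighted average of class probabilities. Each modality is
    processed by its own model, and only the final outputs are combined.
    """
    p_ecg = softmax(ecg_logits)
    p_eeg = softmax(eeg_logits)
    return [ecg_weight * a + (1.0 - ecg_weight) * b for a, b in zip(p_ecg, p_eeg)]


def predict(ecg_logits, eeg_logits, ecg_weight=0.5):
    """Return the index of the class with the highest fused probability."""
    fused = late_fusion(ecg_logits, eeg_logits, ecg_weight)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical logits for a binary low/high arousal decision.
fused_probs = late_fusion([0.2, 0.1], [-1.0, 2.0])
label = predict([0.2, 0.1], [-1.0, 2.0])
```

In this toy input the ECG branch is nearly undecided while the EEG branch strongly favours class 1, so the fused decision follows the more confident modality.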
Vazquez-Rodriguez et al. conducted a study on emotion recognition (n=40). A multimodal Transformer-based late-fusion model with pre-training (ECG + EEG) was compared against single-modality models (ECG only or EEG only) on arousal F1 score (p < 1e-3). The multimodal model achieved an arousal F1 score of 0.89 and a valence F1 score of 0.85, significantly outperforming the single-modality models.
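For readers unfamiliar with the reported metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it for a single positive class; the labels are made-up toy data, and whether the reported scores are binary or averaged over classes is not stated in this summary, so the single-class form is an assumption.

```python
def f1_score(y_true, y_pred, positive=1):
    """Compute the F1 score for one positive class.

    F1 = 2 * precision * recall / (precision + recall)
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 1 = high arousal, 0 = low arousal.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
score = f1_score(y_true, y_pred)
```

Here there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and so is the F1 score.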