Multimodal emotion recognition with high-level feature fusion of audio and text via cross-attention | Synapse