March 18, 2024 | Open Access

Cross-Modal Alignment for End-to-End Spoken Language Understanding Based on Momentum Contrastive Learning

Beida Zheng (Xinjiang University), Mijit Ablimit (Xinjiang University), Askar Hamdulla (Xinjiang University)

Abstract

End-to-end spoken language understanding (SLU) systems extract semantic intent directly from input speech, which avoids problems such as semantic drift in traditional cascade models. However, the lack of semantically labeled speech data makes model training difficult. Several recent multi-modal studies have demonstrated that aligning speech and text embeddings based on their distance in the embedding space can improve model performance. In this study, inspired by work on contrastive learning, a speech and text alignment method using momentum contrastive learning is proposed, and momentum distillation is also used so that the model can learn from imperfectly matched speech and text data. The proposed method improves intent detection accuracy by 2.14% and 5.98% on the Fluent Speech Commands and SmartLights datasets, respectively.
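
To make the alignment idea concrete, the following is a minimal sketch of a generic contrastive (InfoNCE-style) loss between paired speech and text embeddings, together with an exponential-moving-average update of a momentum encoder's parameters. It is not taken from the paper: the function names, the temperature of 0.07, and the momentum coefficient of 0.995 are illustrative assumptions, and the momentum distillation step mentioned in the abstract is not shown.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss where queries[i] and keys[i] are the positive pair.

    Assumption: `queries` are speech embeddings and `keys` are text
    embeddings from the same batch; all other keys act as negatives.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of one query against every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_denom - logits[i]
    return total / len(q)


def momentum_update(online, target, m=0.995):
    """EMA update: the momentum encoder slowly tracks the online encoder.

    Assumption: parameters are given as flat lists of floats.
    """
    return [m * t + (1.0 - m) * o for o, t in zip(online, target)]


# Toy example: two speech embeddings and their matching text embeddings.
speech = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = info_nce(speech, text_matched)
loss_swapped = info_nce(speech, text_swapped)
new_target = momentum_update([1.0, 2.0], [0.0, 0.0], m=0.9)
```

In this toy setup the loss for correctly paired embeddings is much lower than for swapped pairs, which is the signal that pulls matching speech and text representations together. The slow update in `momentum_update` is what keeps the keys produced by the momentum encoder stable across training steps.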
