An end-to-end spoken language understanding system extracts semantic intent directly from input speech, which avoids problems such as semantic drift in traditional cascade models. However, the lack of semantically labeled speech data makes model training difficult. Several recent multi-modal studies have demonstrated that aligning speech and text embeddings based on their distance in the embedding space can improve model performance. In this study, inspired by work on contrastive learning, a speech and text alignment method using momentum contrastive learning is proposed, and a momentum distillation method is also used so that the model can learn from imperfectly matched speech and text data. The proposed method improves intent detection accuracy by 2.14% and 5.98% on the Fluent Speech Commands and SmartLights datasets, respectively.
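The abstract does not give implementation details, so the following is only a minimal NumPy sketch of the general momentum contrast idea it refers to, not the authors' method. It assumes an InfoNCE-style loss in which each speech embedding is pulled toward its paired text embedding and pushed away from a queue of earlier text embeddings, plus an exponential moving average update for the momentum encoder's parameters. The function names, the temperature of 0.07, and the momentum coefficient of 0.995 are all illustrative assumptions. Momentum distillation is not covered here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def momentum_update(online_params, momentum_params, m=0.995):
    """EMA update (assumed form): theta_k <- m * theta_k + (1 - m) * theta_q."""
    return [m * k + (1.0 - m) * q for q, k in zip(online_params, momentum_params)]

def info_nce(speech_emb, text_keys, queue, temperature=0.07):
    """InfoNCE-style loss for speech-to-text alignment (hypothetical helper).

    speech_emb: (B, D) embeddings from the online speech encoder.
    text_keys:  (B, D) paired text embeddings from the momentum encoder.
    queue:      (K, D) earlier text embeddings used as negatives.
    """
    q = l2_normalize(speech_emb)
    k = l2_normalize(text_keys)
    neg = l2_normalize(queue)
    pos_logits = np.sum(q * k, axis=1, keepdims=True)   # (B, 1) similarity to the true pair
    neg_logits = q @ neg.T                               # (B, K) similarity to negatives
    logits = np.concatenate([pos_logits, neg_logits], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[:, 0].mean())                 # the positive sits in column 0
```

Under these assumptions, the loss is small when paired speech and text embeddings point in the same direction and large when they are unrelated, which is the alignment pressure that the abstract describes.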
Zheng et al. studied this question.
www.synapsesocial.com/papers/68e7397eb6db6435876b29e9 — DOI: https://doi.org/10.1109/icassp48485.2024.10448143
Beida Zheng
Mijit Ablimit
Askar Hamdulla
Xinjiang University