March 18, 2024 · Open Access

Soft Alignment of Modality Space for End-to-End Speech Translation

Yuhao Zhang (Kunming Medical University), Kaiqi Kou (Northeastern University), Bei Li (Beijing Institute of Fashion Technology)

Abstract

End-to-end Speech Translation (ST) aims to convert speech into target text within a unified model. The inherent differences between speech and text modalities often impede effective cross-modal and cross-lingual transfer. Existing methods typically employ hard alignment (H-Align) of individual speech and text segments, which can degrade textual representations. To address this, we introduce Soft Alignment (S-Align), using adversarial training to align the representation spaces of both modalities. S-Align creates a modality-invariant space while preserving individual modality quality. Experiments on three languages from the MuST-C dataset show S-Align outperforms H-Align across multiple tasks and offers translation capabilities on par with specialized translation models.
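
As a rough illustration of what adversarial alignment of two representation spaces can look like, the sketch below computes the two opposing losses of a generic modality discriminator setup. It is not taken from the paper: the function names, the choice of labels (speech = 1, text = 0), and the binary cross-entropy objective are all assumptions made for illustration, and the actual S-Align objective may differ.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def discriminator_loss(speech_logits, text_logits):
    # Hypothetical discriminator objective: binary cross-entropy on logits,
    # with speech labeled 1 and text labeled 0 (an assumed convention).
    losses = [softplus(-z) for z in speech_logits]   # -log sigmoid(z)
    losses += [softplus(z) for z in text_logits]     # -log(1 - sigmoid(z))
    return sum(losses) / len(losses)

def encoder_adversarial_loss(speech_logits, text_logits):
    # Hypothetical encoder objective: the same loss with the labels flipped,
    # so the encoder is rewarded when the discriminator is fooled.
    return discriminator_loss(text_logits, speech_logits)

# A discriminator that separates the modalities well: low loss for it,
# high loss for the encoder.
d_separated = discriminator_loss([5.0], [-5.0])
e_separated = encoder_adversarial_loss([5.0], [-5.0])

# A discriminator that cannot tell the modalities apart (logit 0 everywhere):
# both losses equal log 2, the chance-level value.
d_confused = discriminator_loss([0.0], [0.0])
e_confused = encoder_adversarial_loss([0.0], [0.0])
```

In this toy setting, training the encoder to lower its adversarial loss pushes the discriminator toward the chance-level case, which is one way to read the abstract's "modality-invariant space". No individual speech segment is forced onto a specific text segment, which is the contrast with hard alignment.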
