End-to-end Speech Translation (ST) aims to convert speech into target-language text within a single unified model. The inherent differences between the speech and text modalities often impede effective cross-modal and cross-lingual transfer. Existing methods typically employ hard alignment (H-Align) of individual speech and text segments, which can degrade textual representations. To address this, we introduce Soft Alignment (S-Align), which uses adversarial training to align the representation spaces of the two modalities. S-Align creates a modality-invariant space while preserving the quality of each modality's representations. Experiments on three languages from the MuST-C dataset show that S-Align outperforms H-Align across multiple tasks and offers translation capabilities on par with specialized translation models.
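To make the idea of adversarial alignment concrete, the following is a minimal sketch of a generic modality-discriminator objective, not the authors' implementation. It assumes a hypothetical linear discriminator (`weights`, `bias`) that tries to tell speech representations from text representations, and an encoder objective that is simply the negated discriminator loss. All function names and the toy vectors are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def discriminator_prob(rep, weights, bias):
    """Probability that a representation came from speech (label 1).

    Assumption: a linear discriminator, chosen only for illustration.
    """
    return sigmoid(sum(w * v for w, v in zip(weights, rep)) + bias)


def discriminator_loss(speech_reps, text_reps, weights, bias):
    """Binary cross-entropy for telling speech (1) apart from text (0)."""
    eps = 1e-12  # guards against log(0)
    losses = []
    for rep in speech_reps:
        losses.append(-math.log(discriminator_prob(rep, weights, bias) + eps))
    for rep in text_reps:
        losses.append(-math.log(1.0 - discriminator_prob(rep, weights, bias) + eps))
    return sum(losses) / len(losses)


def encoder_adversarial_loss(speech_reps, text_reps, weights, bias):
    """Encoder objective: the negated discriminator loss.

    Minimizing it pushes the two modalities toward representations
    that the discriminator cannot separate.
    """
    return -discriminator_loss(speech_reps, text_reps, weights, bias)


# Toy one-dimensional representations (hypothetical values).
separated = discriminator_loss([[2.0]], [[-2.0]], [3.0], 0.0)
aligned = discriminator_loss([[0.5]], [[0.5]], [3.0], 0.0)
print(separated < aligned)
```

In this toy setting, well-separated speech and text vectors are easy for the discriminator to classify, so its loss is small. When the two modalities map to the same vector, the loss cannot fall below log 2, which is the sense in which the shared space is modality-invariant.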
Zhang et al. studied this question.
www.synapsesocial.com/papers/68e7397eb6db6435876b2a12 — DOI: https://doi.org/10.1109/icassp48485.2024.10447494
Yuhao Zhang
Kaiqi Kou
Bei Li
Northeastern University
Harbin Engineering University