Direct speech-to-speech translation (S2ST) systems have emerged as a promising approach for real-time cross-lingual communication. However, these systems face significant challenges in balancing translation quality with decoding efficiency. In this paper, we present DESpeech, a novel direct S2ST model that effectively addresses this challenge through a dual-pass encoder architecture. Our architecture decomposes translation into two specialized stages: acoustic feature extraction via a speech encoder and semantic understanding via a text encoder. This modular design enables optimal resource allocation while maintaining cross-modal information flow. To enhance performance, DESpeech employs discrete units as intermediate representations and adopts a multi-task learning framework that integrates automatic speech recognition and speech-to-text translation as auxiliary tasks. The dual-pass architecture allows for efficient pre-training integration and provides a natural framework for balancing computational efficiency with translation accuracy. Experiments on the CVSS-C and GigaS2S datasets show that DESpeech consistently outperforms or matches existing methods in terms of translation quality while achieving clear improvements in inference speed, indicating a promising approach for efficient S2ST with minimal quality degradation.
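As a rough illustration of the multi-task framework mentioned above, the primary unit-prediction objective and the two auxiliary objectives (automatic speech recognition and speech-to-text translation) could be combined as a weighted sum. This is a minimal sketch, not the authors' implementation: the names `TaskWeights` and `multitask_loss` and the weight values are assumptions for illustration, since the abstract does not say how the losses are balanced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskWeights:
    # Hypothetical weights; the abstract does not state how the losses are balanced.
    unit: float = 1.0   # primary objective: predicting discrete target units
    asr: float = 0.5    # auxiliary task: automatic speech recognition
    s2tt: float = 0.5   # auxiliary task: speech-to-text translation


def multitask_loss(unit_loss: float, asr_loss: float, s2tt_loss: float,
                   weights: TaskWeights = TaskWeights()) -> float:
    """Return the weighted sum of the primary loss and the two auxiliary losses."""
    return (weights.unit * unit_loss
            + weights.asr * asr_loss
            + weights.s2tt * s2tt_loss)
```

Setting both auxiliary weights to zero reduces this to the primary unit loss alone, which is one simple way to ablate the auxiliary tasks under these assumptions.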
Li et al. (Mon,) studied this question.