Abstract

The goal of expressive speech-to-speech translation (S2ST) is to provide accurate translations while preserving the source speaker’s characteristics. However, despite recent progress, two challenges remain. First, most studies release only one model size; even when multiple sizes are provided, they are often trained separately, which limits flexible deployment across compute budgets. Second, preserving speaker identity typically requires providing speaker-related acoustic information as input to the translation model, thereby raising privacy concerns. To address these issues, we propose a cloud-edge collaborative S2ST framework that balances translation quality, efficiency, and privacy. Specifically, we attach early-exit (EE) heads to the backbone so that inference can adapt to resource constraints. To further improve translation quality, we introduce a teacher-guided difficulty classifier to label training samples by difficulty. We then use the labeled data to train a model that predicts the optimal EE layer. During inference, this model passes the predicted optimal EE layer to the translation model. Finally, to reduce bandwidth overhead and protect user privacy, we propose a retrieval-based voice preservation module. We extract acoustic features and perform similarity matching on the sender side, and reconstruct the speaker’s acoustic features on the receiver side. Experiments show that our EE strategy consistently achieves better translation quality than other methods. In addition, compared with directly transmitting the source speech, our retrieval-based method achieves better results on key metrics while improving communication efficiency and privacy.
Zhu et al. (Thu,) studied this question.
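
The retrieval-based voice preservation idea from the abstract (similarity matching on the sender side, reconstruction on the receiver side) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how features are extracted, what is retrieved, or which similarity measure is used. The sketch assumes a small codebook of voice feature vectors shared by both sides, cosine similarity as the matching score, and a single integer index as the only speaker-related data transmitted. The names `CODEBOOK`, `sender_match`, and `receiver_reconstruct` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sender_match(features: Sequence[float],
                 codebook: Sequence[Sequence[float]]) -> int:
    """Sender side: return the index of the codebook entry closest to the
    speaker's features. Only this index would leave the device."""
    if not codebook:
        raise ValueError("codebook must not be empty")
    return max(range(len(codebook)),
               key=lambda i: cosine_similarity(features, codebook[i]))


def receiver_reconstruct(index: int,
                         codebook: Sequence[Sequence[float]]) -> List[float]:
    """Receiver side: recover approximate acoustic features from the index."""
    return list(codebook[index])


# Assumption: a toy 3-entry codebook that both sides hold in advance.
CODEBOOK = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]

# Assumption: a toy feature vector standing in for extracted acoustic features.
speaker_features = [0.5, 0.9, 0.1]

index = sender_match(speaker_features, CODEBOOK)        # transmitted value
reconstructed = receiver_reconstruct(index, CODEBOOK)   # receiver-side features
```

Under these assumptions, the sender transmits a single integer instead of the source speech or its raw features, which is one way the bandwidth and privacy benefits described in the abstract could arise. The reconstructed features are only as close to the original voice as the nearest codebook entry.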