What question did this study set out to answer?

The study aims to enhance speech-to-speech translation quality while addressing privacy and resource constraints.

March 14, 2026 · Open Access

Speech to speech translation system based on cloud-edge collaboration

Key points

Proposes a cloud-edge collaborative framework for S2ST.
Attaches early-exit (EE) heads to the backbone so that inference can adapt to resource constraints.
Introduces a teacher-guided difficulty classifier that labels training samples by difficulty.
Uses a retrieval-based voice preservation module to reduce bandwidth overhead and protect user privacy.
Reports better translation quality with the early-exit strategy than with other methods.
Reports that the retrieval-based method achieves better results on key metrics than directly transmitting the source speech, while improving communication efficiency and privacy.
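
The early-exit idea in the points above can be sketched as follows. This is a toy illustration, not the authors' implementation: the backbone is a plain list of functions, and `predict_exit_layer` is a hypothetical stand-in for the learned model that predicts the optimal exit layer (here it simply maps a difficulty score in [0, 1] to a layer index and caps it by an assumed compute budget).

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]
Head = Callable[[Vector], float]


def predict_exit_layer(difficulty: float, num_layers: int, budget_layers: int) -> int:
    """Hypothetical exit predictor: harder inputs exit later.

    `difficulty` is assumed to lie in [0, 1]; `budget_layers` is the maximum
    number of layers the device can afford to run (an assumption of this sketch).
    """
    if num_layers < 1 or budget_layers < 1:
        raise ValueError("num_layers and budget_layers must be at least 1")
    difficulty = min(max(difficulty, 0.0), 1.0)
    wanted = min(int(difficulty * num_layers), num_layers - 1)
    return min(wanted, budget_layers - 1)


def run_with_early_exit(
    x: Sequence[float],
    layers: Sequence[Layer],
    exit_heads: Sequence[Head],
    exit_layer: int,
) -> float:
    """Run layers 0..exit_layer, then apply the head attached to that layer."""
    if len(layers) != len(exit_heads):
        raise ValueError("each layer needs exactly one exit head")
    if not 0 <= exit_layer < len(layers):
        raise ValueError("exit_layer is out of range")
    hidden = list(x)
    for layer in layers[: exit_layer + 1]:
        hidden = layer(hidden)
    return exit_heads[exit_layer](hidden)


# Toy backbone: every layer adds 1 to each element, every head sums the vector.
toy_layers: List[Layer] = [lambda h: [v + 1.0 for v in h] for _ in range(4)]
toy_heads: List[Head] = [sum for _ in range(4)]

chosen = predict_exit_layer(difficulty=0.5, num_layers=4, budget_layers=4)
output = run_with_early_exit([0.0, 0.0], toy_layers, toy_heads, chosen)
```

In a real system the layers would be the blocks of the translation backbone and each head would decode translated output; the only point of the sketch is that computation stops at the predicted layer instead of always running the full depth.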

Abstract

The goal of expressive speech-to-speech translation (S2ST) is to provide accurate translations while preserving the source speaker’s characteristics. However, despite recent progress, two challenges remain. First, most studies release only one model size; even when multiple sizes are provided, they are often trained separately, which limits flexible deployment across compute budgets. Second, preserving speaker identity typically requires providing speaker-related acoustic information as input to the translation model, thereby raising privacy concerns. To address these issues, we propose a cloud-edge collaborative S2ST framework that balances translation quality, efficiency, and privacy. Specifically, we attach early-exit (EE) heads to the backbone so that inference can adapt to resource constraints. To further improve translation quality, we introduce a teacher-guided difficulty classifier to label training samples by difficulty. We then use the labeled data to train a model that predicts the optimal EE layer. During inference, this model passes the predicted optimal EE layer to the translation model. Finally, to reduce bandwidth overhead and protect user privacy, we propose a retrieval-based voice preservation module. We extract acoustic features and perform similarity matching on the sender side, and reconstruct the speaker’s acoustic features on the receiver side. Experiments show that our EE strategy consistently achieves better translation quality than other methods. In addition, compared with directly transmitting the source speech, our retrieval-based method achieves better results on key metrics while improving communication efficiency and privacy.
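
The retrieval-based voice preservation step can be illustrated with a minimal sketch. It assumes, beyond what the abstract states, that sender and receiver share a fixed codebook of reference acoustic feature vectors and that matching uses cosine similarity; the function names `sender_match` and `receiver_reconstruct` are hypothetical. Under those assumptions only an integer index crosses the network, not the source speech or its raw features.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sender_match(features: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Sender side: return the index of the codebook entry most similar to `features`."""
    if not codebook:
        raise ValueError("codebook must not be empty")
    return max(
        range(len(codebook)),
        key=lambda i: cosine_similarity(features, codebook[i]),
    )


def receiver_reconstruct(index: int, codebook: Sequence[Sequence[float]]) -> List[float]:
    """Receiver side: recover approximate acoustic features from the shared codebook."""
    return list(codebook[index])


# Assumed shared codebook; real entries would be learned acoustic embeddings.
shared_codebook = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

speaker_features = [0.9, 0.1]  # extracted on the sender's device
index_to_send = sender_match(speaker_features, shared_codebook)  # only this is transmitted
reconstructed = receiver_reconstruct(index_to_send, shared_codebook)
```

Sending an index rather than audio is what would give the bandwidth and privacy benefit in this sketch; how the actual system builds its retrieval set and reconstructs features is not specified in the abstract.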
