In video dubbing, aligning translated audio with the source audio is a significant challenge. We focus on achieving this alignment efficiently for real-time, on-device video dubbing. We developed a phoneme-based, end-to-end length-sensitive speech translation (LSST) model that generates translations of varying lengths (short, normal, and long) using predefined tags. We also introduced length-aware beam search (LABS), an efficient approach that generates translations of different lengths in a single decoding pass. This approach maintained BLEU scores comparable to those of a baseline without length awareness while significantly improving synchronization between source and target audio, achieving mean opinion score (MOS) gains of 0.34 for Spanish and 0.65 for Korean.
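The abstract does not spell out how LABS works, so the following is only a toy sketch of one plausible reading: the beam is seeded with one hypothesis per length tag, and each tag keeps its own slice of the beam, so a single search loop yields a short, a normal, and a long output. The tag spellings (`<short>`, `<normal>`, `<long>`), the `toy_next_token_logprobs` scorer, and the per-tag pruning rule are all assumptions made for illustration, not the authors' implementation.

```python
import math

TAGS = ["<short>", "<normal>", "<long>"]  # hypothetical tag spellings
EOS = "</s>"


def toy_next_token_logprobs(prefix):
    """Hypothetical stand-in for the LSST decoder.

    The tag at prefix[0] sets a target number of output tokens; the toy
    model prefers to stop only once that target has been reached.
    """
    target = {"<short>": 2, "<normal>": 3, "<long>": 4}[prefix[0]]
    n_content = len(prefix) - 1  # tokens emitted after the tag
    if n_content < target:
        probs = {"a": 0.65, "b": 0.30, EOS: 0.05}
    else:
        probs = {EOS: 0.80, "a": 0.20}
    return {tok: math.log(p) for tok, p in probs.items()}


def length_aware_beam_search(step_fn, beam_size=2, max_steps=8):
    """Run one search loop that tracks a separate beam slice per length tag."""
    # Assumption: each tag seeds its own group of hypotheses.
    active = {tag: [(0.0, [tag])] for tag in TAGS}
    finished = {tag: [] for tag in TAGS}

    for _ in range(max_steps):
        # A real system would batch all groups into one decoder call;
        # here they are simply expanded inside the same step.
        for tag in TAGS:
            candidates = []
            for score, prefix in active[tag]:
                for tok, logprob in step_fn(prefix).items():
                    hyp = (score + logprob, prefix + [tok])
                    if tok == EOS:
                        finished[tag].append(hyp)
                    else:
                        candidates.append(hyp)
            # Prune within the tag group so no length variant is crowded out.
            candidates.sort(key=lambda h: h[0], reverse=True)
            active[tag] = candidates[:beam_size]

    # Best finished hypothesis per tag, with the tag and EOS stripped.
    return {
        tag: max(finished[tag], key=lambda h: h[0])[1][1:-1]
        for tag in TAGS
    }


outputs = length_aware_beam_search(toy_next_token_logprobs)
for tag in TAGS:
    print(tag, outputs[tag])
```

Pruning inside each tag group, rather than across the whole beam, is the design choice that keeps all three length variants alive until the end of the search; whether the paper does this is not stated in the abstract.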
Harveen Singh Chadha
Aswin Shanmugam Subramanian
Vikas Joshi
Chadha et al. studied this question.
www.synapsesocial.com/papers/68e6f342f8145af55aeacad0 — DOI: https://doi.org/10.48550/arxiv.2506.00740