What question did this study set out to answer?

The aim is to improve non-autoregressive speech translation (NAR-ST) through a new framework that enhances translation fluency and stability.

May 21, 2026 | Open Access

Non-autoregressive speech translation with understanding, translation, reordering and LLM-augmented correction

Key Points

The authors developed UTRo-NAST, a framework that decomposes speech translation into source speech understanding, word-by-word translation, and target-side reordering.
They introduced an LLM-augmented post-correction strategy that refines UTRo-NAST outputs using prompting mechanisms.
The framework was evaluated on the MuST-C benchmark across eight language pairs.
UTRo-NAST outperforms existing NAR-ST models and delivers translation quality comparable to strong autoregressive baselines.
The LLM-augmented variant achieves competitive or superior results compared to recent LLM-integrated speech translation systems.
UTRo-NAST decodes faster than the autoregressive baselines, and its decomposed design enhances interpretability and training stability.

Abstract

Non-autoregressive speech translation (NAR-ST) offers low latency through parallel decoding but often falls short of autoregressive (AR) systems in producing fluent and well-structured translations. The authors present UTRo-NAST, a novel NAR-ST framework that decomposes speech translation into three consecutive subtasks: source speech understanding, word-by-word translation, and target-side reordering. This divide-and-conquer design enhances interpretability and training stability while preserving fully parallel inference. To further improve performance, the authors introduce a plug-and-play large language model (LLM)-augmented post-correction strategy that refines UTRo-NAST outputs with prompting mechanisms. Experiments on the MuST-C benchmark across eight language pairs show that UTRo-NAST consistently outperforms existing NAR-ST models and delivers translation quality comparable to strong AR baselines, while maintaining faster decoding. The LLM-augmented variant achieves competitive or superior results compared to recent LLM-integrated ST systems, offering a practical path to stronger NAR-ST without iterative refinement or costly LLM fine-tuning. Overall, these results demonstrate the effectiveness and scalability of UTRo-NAST for practical speech translation.
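
The three-stage decomposition and the optional post-correction step can be pictured with a toy sketch. This is not the authors' implementation: in the actual system each stage would be a neural component operating on speech features, whereas here the input is already tokenized, the lexicon is a hand-written dictionary, and the reordering permutation is supplied by hand. All names (`understand`, `translate_word_by_word`, `reorder`, `build_correction_prompt`, `translate`) and the prompt wording are hypothetical, chosen only to illustrate the data flow.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical toy lexicon (English -> German), for illustration only.
TOY_LEXICON: Dict[str, str] = {
    "i": "ich", "have": "habe", "seen": "gesehen", "the": "den", "film": "Film",
}


def understand(speech_units: List[str]) -> List[str]:
    """Stage 1 stand-in: recover source words (here, just lowercase the tokens)."""
    return [unit.lower() for unit in speech_units]


def translate_word_by_word(source_words: List[str]) -> List[str]:
    """Stage 2: translate every position independently, keeping source order."""
    return [TOY_LEXICON.get(word, word) for word in source_words]


def reorder(target_words: List[str], permutation: List[int]) -> List[str]:
    """Stage 3: arrange the translated words into target-language order."""
    if sorted(permutation) != list(range(len(target_words))):
        raise ValueError("permutation must use each position exactly once")
    return [target_words[i] for i in permutation]


def build_correction_prompt(draft: str, target_language: str) -> str:
    """Assumed prompt format for the optional LLM post-correction step."""
    return (
        f"Correct the following {target_language} translation for fluency. "
        f"Keep the meaning unchanged.\nDraft: {draft}\nCorrected:"
    )


def translate(
    speech_units: List[str],
    permutation: List[int],
    corrector: Optional[Callable[[str], str]] = None,
    target_language: str = "German",
) -> str:
    """Run the three stages; optionally pass the draft to a corrector callable."""
    words = translate_word_by_word(understand(speech_units))
    draft = " ".join(reorder(words, permutation))
    if corrector is None:
        return draft
    # The corrector is any callable that maps a prompt string to corrected text.
    return corrector(build_correction_prompt(draft, target_language))


if __name__ == "__main__":
    print(translate(["I", "have", "seen", "the", "film"], [0, 1, 3, 4, 2]))
```

Because stage 2 treats each position independently and stage 3 is a single permutation, nothing in the sketch depends on previously generated output tokens, which is the property that allows parallel decoding. The `corrector` argument is left as a plain callable so that any LLM client could be plugged in without changing the earlier stages.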
