Non-autoregressive speech translation (NAR-ST) offers low latency through parallel decoding but often falls short of autoregressive (AR) systems in producing fluent and well-structured translations. The authors present UTRo-NAST, a novel NAR-ST framework that decomposes speech translation into three consecutive subtasks: source speech understanding, word-by-word translation, and target-side reordering. This divide-and-conquer design enhances interpretability and training stability while preserving fully parallel inference. To further improve performance, the authors introduce a plug-and-play large language model (LLM)-augmented post-correction strategy that refines UTRo-NAST outputs through prompting. Experiments on the MuST-C benchmark across eight language pairs show that UTRo-NAST consistently outperforms existing NAR-ST models and delivers translation quality comparable to strong AR baselines while decoding faster. The LLM-augmented variant achieves competitive or superior results compared to recent LLM-integrated ST systems, offering a practical path to stronger NAR-ST without iterative refinement or costly LLM fine-tuning. Overall, these results demonstrate the effectiveness and scalability of UTRo-NAST for practical speech translation.
Kuan et al. studied this question.
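The three-subtask decomposition described above (understanding, word-by-word translation, target-side reordering) can be illustrated with a toy sketch. This is not the authors' model: the lexicon, the function names, and the hand-supplied reordering indices are all hypothetical stand-ins, and text tokens take the place of speech input. The only point it makes is that each stage handles every position independently of the others, so no stage needs left-to-right generation.

```python
from typing import Dict, List

# Hypothetical toy lexicon (English -> German); not taken from the paper.
TOY_LEXICON: Dict[str, str] = {
    "i": "ich",
    "have": "habe",
    "seen": "gesehen",
    "him": "ihn",
}


def understand(speech_units: List[str]) -> List[str]:
    """Stage 1 stand-in: recover source words from the input.

    Assumption: the input is already a list of text tokens, so this
    step only normalizes case. A real system would process audio here.
    """
    return [unit.lower() for unit in speech_units]


def translate_word_by_word(source_words: List[str]) -> List[str]:
    """Stage 2 stand-in: translate each position independently.

    Unknown words are copied through unchanged.
    """
    return [TOY_LEXICON.get(word, word) for word in source_words]


def reorder(target_words: List[str], order: List[int]) -> List[str]:
    """Stage 3 stand-in: apply a permutation to reach target word order.

    Assumption: `order` is supplied by hand; a real system would predict it.
    """
    if sorted(order) != list(range(len(target_words))):
        raise ValueError("order must be a permutation of the word positions")
    return [target_words[i] for i in order]


def toy_pipeline(speech_units: List[str], order: List[int]) -> List[str]:
    """Chain the three stages in sequence."""
    source_words = understand(speech_units)
    monotonic_translation = translate_word_by_word(source_words)
    return reorder(monotonic_translation, order)


if __name__ == "__main__":
    result = toy_pipeline(["I", "have", "seen", "him"], [0, 1, 3, 2])
    print(" ".join(result))
```

In this sketch the intermediate word-by-word output ("ich habe gesehen ihn") follows the source order, and the final stage alone is responsible for moving the participle to the end of the clause.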