Abstract

Overcoming the trade-off between quality and latency is a central challenge in simultaneous speech translation. Previous work has commonly segmented the source sentence or aligned target sentences with the source’s syntax as closely as possible, enabling faster translation while maintaining quality. However, a major limitation of these studies is their reliance on existing translation test data, which often includes reordering and is therefore unsuitable for low-latency simultaneous settings. Alternatively, some studies use data transcribed from human interpreters, which is also problematic because it contains translation errors and omissions. Neither kind of data is adequate for fully evaluating simultaneous models. In this work, we present the construction, verification, and analysis of a new test set designed specifically for simultaneous settings, with a focus on keeping word and phrase order consistent with the source. The test set is built by leveraging large language models and covers three language pairs representing different levels of word order similarity to the source, with quality verified by professional interpreters. This verification provides an interpreter-grounded perspective and empirically shows the ideal level of monotonicity, along with other sentence-style characteristics including syntactic simplicity and sentence length. It also reveals the capabilities and limitations of LLMs in monotonic translation. Experiments show that existing test data tends to underestimate a model’s performance, whereas the proposed test set, simul-tst-COMMON, offers a more appropriate evaluation of simultaneous models. Moreover, the quality gap between wait-k and Local Agreement suggests that the adaptive policy (Local Agreement) more closely resembles the monotonic translation behavior of human interpreters. Finally, the analysis highlights the limitations of current metrics, which may not be fully suitable for evaluating simultaneous tasks.
Makinae et al. (Wed,) studied this question.