In this work, we present a new state-of-the-art Romanian Automatic Speech Recognition (ASR) system based on NVIDIA's FastConformer architecture, explored here for the first time in the context of Romanian. We train our model on a large corpus of mostly weakly supervised transcriptions, totaling over 2,600 hours of speech. Leveraging a hybrid decoder with both Connectionist Temporal Classification (CTC) and Token-Duration Transducer (TDT) branches, we evaluate a range of decoding strategies, including greedy decoding, ALSD, and CTC beam search with a 6-gram token-level language model. Our system achieves state-of-the-art performance across all Romanian evaluation benchmarks, including read, spontaneous, and domain-specific speech, with up to 27% relative word error rate (WER) reduction compared to the previous best-performing systems. In addition to improved transcription accuracy, our approach demonstrates practical decoding efficiency, making it suitable for both research and deployment in low-latency ASR applications.
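The headline figure above is a relative WER reduction, which can be easy to confuse with an absolute difference in WER. The following is a minimal sketch, not the authors' evaluation code, of how WER and a relative reduction are commonly computed. The function names `word_error_rate` and `relative_wer_reduction` are hypothetical, and the sketch assumes whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

As an illustration with made-up numbers (not results from the paper), a system that lowers WER from 10.0% to 7.3% achieves a 27% relative reduction, while the absolute reduction is only 2.7 percentage points.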
Gabriel Pîrlogeanu
Universitatea Națională de Știință și Tehnologie Politehnica București
Alexandru-Lucian Georgescu
Universitatea Națională de Știință și Tehnologie Politehnica București
Horia Cucu
Universitatea Națională de Știință și Tehnologie Politehnica București
synapsesocial.com/papers/690fdcdaf60c54d04ea3815e — DOI: https://doi.org/10.48550/arxiv.2511.03361