While automatic speech recognition (ASR) systems have achieved impressive performance under clean conditions, their reliability in acoustically challenging environments remains an open concern—particularly in multilingual settings. This work presents a comparative evaluation of several modern ASR models, including Whisper variants, QuartzNet, and Conformer-based architectures, under a range of controlled synthetic transformations and real-world environmental noises. Using the Common Voice 17.0 dataset in English, Italian, and German, we assess recognition robustness under additive white noise, pitch shifts, time-stretching, and ecologically valid background recordings (office, cafe, traffic) from the DEMAND dataset. Word error rate (WER) is computed across a spectrum of signal-to-noise ratios, with confidence intervals derived via bootstrap resampling to estimate variability. Unlike many studies that evaluate complete speech pipelines (enhancement front-ends followed by ASR or task-specific fine-tuning), we deliberately focus on off-the-shelf pretrained models without additional front-end processing or adaptation. This design isolates the intrinsic robustness of the ASR architectures themselves and provides a clean baseline against which future enhancement or fine-tuning strategies can be compared. To support the interpretation of extreme-noise regimes, we additionally incorporate a perceptually motivated glimpse proportion analysis, which quantifies the amount of locally audible speech under different noise types and signal-to-noise ratios. This auxiliary analysis is used to contextualize recognition failures in terms of acoustic masking rather than model performance alone. Finally, we include a limited supervised fine-tuning study on English speech for a subset of models, not as a primary contribution, but to illustrate how standard adaptation shifts robustness trends relative to the inference-only baseline. 
Our analysis highlights model- and language-specific response patterns to distortion. Larger models are generally more robust, yet they remain susceptible to temporal and spectral changes. Notably, models were more stable on Italian and German, which we hypothesize is due to more regular phoneme-to-grapheme mappings in these languages. The findings provide actionable insights into failure modes under distortion, informing the design of more robust ASR systems for deployment in diverse auditory scenarios.
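The evaluation protocol summarized above rests on two generic steps: mixing noise into speech at a target signal-to-noise ratio, and reporting WER with bootstrap confidence intervals. The following is a minimal sketch of both, not the authors' implementation. The function names, the choice to resample whole utterances, the percentile interval, and the default of 1,000 resamples are all assumptions made for illustration.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` so the mixture has the requested SNR in dB.

    Hypothetical helper: both inputs are 1-D float arrays at the same
    sampling rate; the noise is looped or trimmed to the speech length.
    """
    noise = np.resize(noise, speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose the gain so that p_speech / (gain^2 * p_noise) = 10^(snr_db / 10).
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def bootstrap_wer(refs, hyps, n_boot=1000, alpha=0.05, seed=0):
    """Corpus-level WER with a percentile bootstrap interval.

    Assumption: utterances are resampled with replacement, and every
    reference transcript contains at least one word.
    """
    errs = np.array([edit_distance(r.split(), h.split())
                     for r, h in zip(refs, hyps)])
    lens = np.array([len(r.split()) for r in refs])
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(refs), size=(n_boot, len(refs)))
    samples = errs[idx].sum(axis=1) / lens[idx].sum(axis=1)
    lo, hi = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
    return errs.sum() / lens.sum(), lo, hi
```

In a pipeline like the one described, `mix_at_snr` would be applied to each utterance before transcription, and `bootstrap_wer` would be called once per model, language, noise type, and SNR level. Resampling utterances rather than individual words keeps the errors within one recording together, which is one common choice among several.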
Sergei Katkov
Antonio Liotta
Alessandro Vietti
Free University of Bozen-Bolzano
Katkov et al. (Fri,) studied this question.
www.synapsesocial.com/papers/69db36e64fe01fead37c4dce — DOI: https://doi.org/10.1186/s13636-026-00458-1