What question did this study set out to answer?

The aim is to evaluate the robustness of various ASR models against environmental noise and synthetic distortions.

April 12, 2026 · Open Access

Speech recognition in adverse conditions: synthetic transformations and real environmental noise

Sergei Katkov (Free University of Bozen-Bolzano), Antonio Liotta, Alessandro Vietti (Free University of Bozen-Bolzano)

Key Points

The aim is to evaluate how robust off-the-shelf ASR models are to real environmental noise and synthetic distortions.
The study compares several ASR models, including Whisper variants, QuartzNet, and Conformer-based architectures, under real and synthetic noise conditions.
Testing uses the Common Voice 17.0 dataset in English, Italian, and German.
Word error rate (WER) is computed across a range of signal-to-noise ratios, with confidence intervals derived via bootstrap resampling.
A glimpse proportion analysis relates recognition failures under extreme noise to acoustic masking.
Larger ASR models were generally more robust, but remained susceptible to temporal and spectral changes.
Models were more stable on Italian and German, possibly because these languages have more regular phoneme-to-grapheme mappings.
The results characterize failure modes under distortion and can inform the design of more robust ASR systems.
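
The WER and bootstrap step named in the key points can be illustrated with a minimal sketch. It assumes a corpus-level WER (total word errors over total reference words) and a percentile bootstrap that resamples utterances with replacement; the page does not state which bootstrap variant or how many resamples the authors used, so `n_resamples`, `alpha`, the function names, and the toy transcripts are all hypothetical.

```python
import random


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """Corpus WER: total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words


def bootstrap_wer_ci(pairs, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling utterances with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        corpus_wer([rng.choice(pairs) for _ in pairs])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy (reference, hypothesis) pairs, invented for illustration only.
pairs = [
    ("the cat sat on the mat", "the cat sat on the mat"),
    ("turn the lights off", "turn the light off"),
    ("good morning everyone", "good morning"),
    ("open the window please", "open a window please"),
]

wer = corpus_wer(pairs)  # 3 errors over 17 reference words
low, high = bootstrap_wer_ci(pairs)
print(f"WER = {wer:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Resampling whole utterances rather than individual words keeps the errors within an utterance together, which is one common choice for ASR evaluation. In a setup like the one described, the procedure would be repeated for each model, language, noise type, and signal-to-noise ratio.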

Abstract

While automatic speech recognition (ASR) systems have achieved impressive performance under clean conditions, their reliability in acoustically challenging environments remains an open concern—particularly in multilingual settings. This work presents a comparative evaluation of several modern ASR models, including Whisper variants, QuartzNet, and Conformer-based architectures, under a range of controlled synthetic transformations and real-world environmental noises. Using the Common Voice 17.0 dataset in English, Italian, and German, we assess recognition robustness under additive white noise, pitch shifts, time-stretching, and ecologically valid background recordings (office, cafe, traffic) from the DEMAND dataset. Word error rate (WER) is computed across a spectrum of signal-to-noise ratios, with confidence intervals derived via bootstrap resampling to estimate variability. Unlike many studies that evaluate complete speech pipelines (enhancement front-ends followed by ASR or task-specific fine-tuning), we deliberately focus on off-the-shelf pretrained models without additional front-end processing or adaptation. This design isolates the intrinsic robustness of the ASR architectures themselves and provides a clean baseline against which future enhancement or fine-tuning strategies can be compared. To support the interpretation of extreme-noise regimes, we additionally incorporate a perceptually motivated glimpse proportion analysis, which quantifies the amount of locally audible speech under different noise types and signal-to-noise ratios. This auxiliary analysis is used to contextualize recognition failures in terms of acoustic masking rather than model performance alone. Finally, we include a limited supervised fine-tuning study on English speech for a subset of models, not as a primary contribution, but to illustrate how standard adaptation shifts robustness trends relative to the inference-only baseline. 
Our analysis highlights model- and language-specific response patterns to distortion, with larger models generally exhibiting greater robustness, yet still susceptible to temporal and spectral changes. Notably, models showed higher stability on Italian and German, which we hypothesize may be due to more regular phoneme-to-grapheme mappings in these languages. The findings provide actionable insights into failure modes under distortion, informing the design of more robust ASR systems for deployment in diverse auditory scenarios.
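
To make the signal-to-noise ratio conditions concrete, the sketch below shows one common way to mix a noise recording into speech at a target SNR, using the definition SNR_dB = 10 log10(P_speech / P_noise). This is an assumption about the general procedure, not the authors' pipeline: the abstract does not give the SNR values or the mixing code, the sine tone and Gaussian noise stand in for Common Voice speech and DEMAND recordings, and `mix_at_snr` and `measured_snr_db` are hypothetical names. Plain Python lists keep the example dependency-free; real audio code would normally use NumPy arrays.

```python
import math
import random


def power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so speech-to-noise power matches snr_db, then add it to speech.

    Returns (mixture, scaled_noise). Noise shorter than the speech is tiled,
    longer noise is truncated.
    """
    if len(noise) < len(speech):
        reps = math.ceil(len(speech) / len(noise))
        noise = list(noise) * reps
    noise = noise[:len(speech)]

    p_noise = power(noise)
    if p_noise == 0:
        raise ValueError("noise is silent; cannot scale to a target SNR")

    # From SNR_dB = 10 * log10(P_speech / P_noise), solve for the noise power.
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / p_noise)

    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled


def measured_snr_db(speech, noise):
    """SNR in dB between a speech signal and the noise added to it."""
    return 10 * math.log10(power(speech) / power(noise))


# Synthetic stand-ins: a 220 Hz tone as "speech", Gaussian white noise as "noise".
sample_rate = 16000
n_samples = sample_rate // 2
rng = random.Random(0)
speech = [0.5 * math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(n_samples)]
noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples // 3)]

for snr_db in (-5, 0, 10, 20):  # illustrative values, not the study's grid
    mixture, scaled = mix_at_snr(speech, noise, snr_db)
    print(snr_db, round(measured_snr_db(speech, scaled), 6))
```

Each mixture would then be passed to the ASR model unchanged, matching the inference-only design described above, and scored against the reference transcript.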
