While automatic speech recognition (ASR) models are advancing rapidly, they still exhibit various systematic biases. Understanding these biases can help produce fairer and more inclusive ASR pipelines. The main objective of this exploratory paper is to investigate linguistic bias in an ASR system, Whisper large-v3, when processing non-native Arabic speech, as compared to human perception of three constructs: intelligibility, comprehensibility, and foreign-accentedness. We compared word error rate (WER) across ten human listeners and the ASR system using a linear mixed-effects model, and conducted a phoneme error rate (PER) analysis to identify potential sources of linguistic bias. The analysis revealed that the ASR system (WER = 66%) performed comparably to the human raters (average WER = 67%). There was a significant relationship between WER and intelligibility: higher intelligibility ratings were associated with lower WER. In addition, higher accentedness ratings were associated with higher WER, while comprehensibility did not significantly predict WER despite a marginal positive association. These findings are further supported by the system’s bias toward unmarked phonemes at the expense of marked ones, such as emphatic and guttural sounds, highlighting persistent recognition challenges with acoustically complex segments. These findings matter for explainable and fair ASR systems and contribute to research on ASR interpretability and explainability.
Issa et al. (Thu,) studied this question.
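For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how WER and PER are conventionally computed: the token-level edit distance between a reference and a hypothesis, divided by the reference length. The function names (`edit_distance`, `error_rate`, `wer`) are hypothetical, and the sketch assumes plain whitespace tokenization with no text normalization; the paper's actual scoring pipeline (for example, how Arabic diacritics or orthographic variants were handled) is not described here and may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences
    (substitutions, insertions, and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by reference length.
    With word tokens this is WER; with phoneme tokens it is PER."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """WER over whitespace-separated words (assumed tokenization)."""
    return error_rate(reference.split(), hypothesis.split())
```

The same `error_rate` function can score a human listener's transcription or an ASR output against the reference, which is what makes a side-by-side WER comparison possible. Because the distance is divided by the reference length only, the rate can exceed 100% when the hypothesis contains many insertions.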