Pretrained multilingual End-to-End (E2E) Automatic Speech Recognition (ASR) models demonstrate remarkable capabilities but struggle with domain-specific terminology and underrepresented dialects. Fine-tuning requires expensive paired audio-transcript data, creating barriers to practical adaptation. This paper investigates language model (LM) integration via shallow fusion as an efficient, text-only adaptation method. We introduce Confidence Gated Fusion (CGF), a novel approach that dynamically determines the LM weight during decoding based on ASR model uncertainty, eliminating the expensive validation-set-dependent grid search required by traditional shallow fusion. We validate our approach using OpenAI’s Whisper across multiple model sizes on Arabic ASR, evaluating Modern Standard Arabic (MSA), Egyptian dialect (EGY), and a specialized judiciary domain. Integrating a domain-specific LM achieved up to 40.92% relative WER reduction on dialectal speech and 32.96% on the judiciary domain. Our CGF method achieved performance comparable to tuned baselines while requiring no hyperparameter optimization, with particular advantages in specialized domains (28.92% relative WER reduction on judiciary) and on smaller models, where static weights often lack robustness.
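To make the idea concrete, the following is a minimal sketch of shallow fusion with an uncertainty-dependent LM weight. The abstract does not state the gating function, so the normalized-entropy gate, the `max_weight` cap, and all function names here are illustrative assumptions and not the authors' formulation. The last helper shows how a relative WER reduction figure is conventionally computed.

```python
import math

EPS = 1e-12  # guards against log(0); an assumption of this sketch


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def gated_lm_weight(asr_probs, max_weight=0.5):
    """Hypothetical gate: LM weight grows with ASR uncertainty.

    Uncertainty is measured as entropy normalized by its maximum,
    log(vocabulary size), so the weight lies in [0, max_weight].
    """
    vocab_size = len(asr_probs)
    if vocab_size < 2:
        return 0.0
    return max_weight * entropy(asr_probs) / math.log(vocab_size)


def shallow_fusion_scores(asr_probs, lm_probs, lm_weight):
    """Standard shallow fusion: log P_asr + lm_weight * log P_lm per token."""
    return [
        math.log(max(pa, EPS)) + lm_weight * math.log(max(pl, EPS))
        for pa, pl in zip(asr_probs, lm_probs)
    ]


def fused_argmax(asr_probs, lm_probs, max_weight=0.5):
    """Pick the best token after fusing with the gated LM weight."""
    weight = gated_lm_weight(asr_probs, max_weight)
    scores = shallow_fusion_scores(asr_probs, lm_probs, weight)
    return max(range(len(scores)), key=scores.__getitem__)


def relative_wer_reduction(baseline_wer, adapted_wer):
    """Relative WER reduction in percent, given two WER values."""
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer
```

In this sketch, a confident ASR distribution keeps the LM weight near zero, so the ASR choice stands, while a flat distribution lets the LM break the tie. A fixed-weight baseline would instead pass the same `lm_weight` to `shallow_fusion_scores` at every step, which is the value a grid search would have to tune.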
Nader Essam
Wael Ali
Khaled Wassif
Procedia Computer Science
Cairo University
Benha University
Future University in Egypt
Essam et al. studied this question.
www.synapsesocial.com/papers/69c0de74fddb9876e79c137e — DOI: https://doi.org/10.1016/j.procs.2026.01.037