What question did this study set out to answer?

The aim is to enhance multilingual ASR performance in specialized domains using text-only data for adaptation.

March 23, 2026 · Open Access

Confidence Gated Fusion: Dynamic Language Model Integration for Adapting Pretrained Multilingual ASR Models with Text-Only Data

Nader Essam (Cairo University), Wael Ali, Khaled Wassif (Cairo University)

Key Points

Investigated language model integration through shallow fusion
Introduced Confidence Gated Fusion (CGF) to dynamically adjust LM weight
Evaluated on OpenAI's Whisper across multiple model sizes, covering Modern Standard Arabic (MSA), Egyptian dialect (EGY), and a specialized judiciary domain
Measured word error rate (WER) reductions with and without domain-specific LM
Achieved up to 40.92% relative WER reduction on dialectal speech
Demonstrated 32.96% relative WER reduction in the judiciary domain
CGF achieved performance comparable to tuned baselines while requiring no hyperparameter optimization
CGF showed particular advantages in specialized domains (28.92% relative WER reduction on judiciary) and on smaller ASR models, where static LM weights often lack robustness
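
The results above are reported as relative WER reductions. As a minimal sketch of how such a figure is usually computed, the Python below derives word error rate from a word-level edit distance and then the relative reduction between a baseline and an adapted system. The function names and the example numbers are hypothetical and are not taken from the paper; the paper's own scoring pipeline (text normalization, tokenization) is not described here and may differ.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - adapted) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical WER values (in percent), chosen only to illustrate the arithmetic.
print(relative_wer_reduction(baseline_wer=50.0, adapted_wer=40.0))
```

A relative reduction is measured against the baseline error, so a drop from 50% to 40% WER is a 10-point absolute reduction but a 20% relative reduction.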

Abstract

Pretrained multilingual End-to-End (E2E) Automatic Speech Recognition (ASR) models demonstrate remarkable capabilities but struggle with domain-specific terminologies and underrepresented dialects. Fine-tuning requires expensive paired audio-transcript data, creating barriers to practical adaptation. This paper investigates language model (LM) integration via shallow fusion as an efficient, text-only adaptation method. We introduce Confidence Gated Fusion (CGF), a novel approach that dynamically determines the LM weight during decoding based on ASR model uncertainty, eliminating the expensive validation-set-dependent grid search required by traditional shallow fusion. We validate our approach using OpenAI’s Whisper across multiple model sizes on Arabic ASR, evaluating Modern Standard Arabic (MSA), Egyptian dialect (EGY), and a specialized judiciary domain. Integrating a domain-specific LM achieved up to 40.92% relative WER reduction on dialectal speech and 32.96% on the judiciary domain. Our CGF method achieved comparable performance to tuned baselines while requiring no hyperparameter optimization, with particular advantages in specialized domains (28.92% relative WER reduction on judiciary) and on smaller models where static weights often lack robustness.
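
To make the idea concrete, the sketch below shows one decoding step of shallow fusion, where each candidate token is scored as log P_ASR + λ · log P_LM, with λ set from the ASR model's uncertainty instead of a fixed, grid-searched value. The abstract does not give the CGF gating function, so the normalized-entropy gate used here, the `max_lm_weight` parameter, and all function names are assumptions made for illustration only and should not be read as the authors' method.

```python
import math


def normalized_entropy(probs):
    """Entropy of a distribution divided by log(vocabulary size), in [0, 1]."""
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def fuse_step(asr_probs, lm_probs, max_lm_weight=0.5):
    """Score every token for one decoding step with an uncertainty-gated LM weight.

    ASSUMPTION: the gate is max_lm_weight * normalized entropy of the ASR
    distribution. This is an illustrative choice, not the paper's formula.
    """
    lm_weight = max_lm_weight * normalized_entropy(asr_probs)
    eps = 1e-12  # avoids log(0) for tokens with zero probability
    scores = [
        math.log(p_asr + eps) + lm_weight * math.log(p_lm + eps)
        for p_asr, p_lm in zip(asr_probs, lm_probs)
    ]
    return scores, lm_weight


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Toy three-token vocabulary; the LM strongly prefers token 1.
lm_probs = [0.1, 0.8, 0.1]

# Confident ASR step: the gate stays small and the ASR choice (token 0) is kept.
confident_scores, confident_weight = fuse_step([0.98, 0.01, 0.01], lm_probs)

# Uncertain ASR step: the gate opens and the LM tips the decision to token 1.
uncertain_scores, uncertain_weight = fuse_step([0.40, 0.35, 0.25], lm_probs)

print(argmax(confident_scores), round(confident_weight, 3))
print(argmax(uncertain_scores), round(uncertain_weight, 3))
```

In this toy setup the LM has little influence when the ASR model is confident and more influence when it is not, which is the behavior the abstract attributes to dynamic weighting. A static-weight shallow fusion would instead use the same λ at every step, chosen by grid search on a validation set.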
