What question did this study set out to answer?

This research aims to improve the efficiency and interpretability of Arabic automatic speech recognition systems using Kolmogorov-Arnold Networks.

June 3, 2026 | Open Access

KANWhisper: leveraging learnable activation functions for interpretable and efficient Arabic automatic speech recognition

Key Points

Introduced KANWhisper by replacing MLP layers in Whisper with KAN layers featuring learnable B-spline activation functions.
Conducted extensive experiments on the Common Voice Arabic dataset to evaluate performance metrics like WER and CER.
Performed phoneme-level evaluations to assess error rates for specific Arabic phonological features.
KANWhisper achieved a word error rate (WER) of 8.02% and character error rate (CER) of 2.78%, outperforming standard Whisper fine-tuning (8.61% WER), LoRA-adapted Whisper (8.10% WER), wav2vec2 XLSR-53 (11.50% WER), and SeamlessM4T v2-Large (13.20% WER).
Demonstrated a 33.3% relative reduction in error rates for Arabic confusable emphatic consonant pairs.
Layer-wise probing indicated KAN representations had up to 8 percentage points higher accuracy than MLP baselines for emphatic distinctions.
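
The WER, CER, and relative-reduction figures in the points above are all ratios built on edit distance. The following is a minimal sketch of how such metrics are commonly computed, not the authors' evaluation code. The function names are hypothetical, and ignoring spaces when computing CER is an assumption; the paper's text normalization for Arabic is not described in this summary.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; spaces are ignored here (an assumption)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def relative_reduction(baseline, improved):
    """Relative error reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    print(wer("a b c d", "a x c d"))
    print(cer("abcd", "abxd"))
    print(relative_reduction(8.61, 8.02))
```

Applied to the reported numbers, `relative_reduction(8.61, 8.02)` is roughly 0.069, so the 0.59-point absolute WER gain over standard Whisper fine-tuning corresponds to about a 6.9% relative reduction. The 33.3% figure for emphatic consonant pairs is a relative reduction of the same kind.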

Abstract

Automatic speech recognition (ASR) for Arabic poses persistent challenges due to morphological complexity, dialectal diversity, and limited annotated resources. While transformer-based models such as OpenAI’s Whisper have achieved strong baselines through transfer learning, their feed-forward sub-layers universally employ Multi-Layer Perceptrons (MLPs) with fixed activation functions, constraining both expressiveness and interpretability. This paper introduces KANWhisper, the first application of Kolmogorov-Arnold Networks (KANs) to automatic speech recognition. By replacing the MLP feed-forward layers in Whisper’s encoder and decoder with KAN layers featuring learnable B-spline activation functions, KANWhisper simultaneously enhances recognition accuracy and provides intrinsic model interpretability. Extensive experiments on the Common Voice Arabic dataset demonstrate that KANWhisper achieves a word error rate (WER) of 8.02% and character error rate (CER) of 2.78%, outperforming standard Whisper fine-tuning (8.61% WER), LoRA-adapted Whisper (8.10% WER), wav2vec2 XLSR-53 (11.50% WER), and SeamlessM4T v2-Large (13.20% WER), while using 16M fewer parameters (228M vs. 244M). Analysis of the learned activation functions reveals hierarchical specialization: lower encoder layers retain GELU-like activations for generic acoustic processing, while higher layers develop novel transformations that capture Arabic-specific phonological phenomena including emphatic consonant distinctions. Phoneme-level evaluation demonstrates a 33.3% relative reduction in error rates for Arabic confusable emphatic consonant pairs. Layer-wise representation probing confirms that KAN-enhanced representations encode emphatic distinctions with up to 8 percentage points higher accuracy than MLP baselines. 
These findings establish Kolmogorov-Arnold Networks as a viable and advantageous paradigm for speech recognition in morphologically complex languages, opening new avenues for interpretable, parameter-efficient, and accurate Arabic ASR.
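
To make the idea of a learnable B-spline activation concrete, here is a minimal pure-Python sketch of a KAN-style layer, in which every input-to-output edge carries its own univariate function. It is an illustration under stated assumptions, not the KANWhisper implementation: the class name `KANLayer`, the grid size, the input range, and the SiLU base term (taken from the common KAN formulation) are all assumptions, and the paper's actual layer widths and training setup are not given in this summary.

```python
import math
import random


def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions of `degree` at x (Cox-de Boor recursion)."""
    # Degree-0 bases are indicator functions of the half-open knot intervals.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(len(knots) - 1)]
    for k in range(1, degree + 1):
        basis = [
            (x - knots[i]) / (knots[i + k] - knots[i]) * basis[i]
            + (knots[i + k + 1] - x) / (knots[i + k + 1] - knots[i + 1]) * basis[i + 1]
            for i in range(len(knots) - k - 1)
        ]
    return basis


def silu(x):
    """Fixed base activation added to the spline (assumed, as in common KAN variants)."""
    return x / (1.0 + math.exp(-x))


class KANLayer:
    """Hypothetical KAN-style layer: output j is a sum over inputs i of phi_ji(x_i)."""

    def __init__(self, in_dim, out_dim, grid_size=5, degree=3,
                 x_range=(-1.0, 1.0), seed=0):
        rng = random.Random(seed)
        lo, hi = x_range
        step = (hi - lo) / grid_size
        self.degree = degree
        # Uniform knot vector, extended by `degree` knots on each side of the range.
        self.knots = [lo + (i - degree) * step
                      for i in range(grid_size + 2 * degree + 1)]
        n_basis = grid_size + degree
        # coef[j][i] holds the learnable spline coefficients of the edge i -> j.
        self.coef = [[[rng.gauss(0.0, 0.1) for _ in range(n_basis)]
                      for _ in range(in_dim)] for _ in range(out_dim)]
        # base_weight[j][i] scales the fixed SiLU term on the same edge.
        self.base_weight = [[1.0] * in_dim for _ in range(out_dim)]

    def edge(self, j, i, x):
        """phi_ji(x) = base_weight * silu(x) + sum_m coef_m * B_m(x)."""
        basis = bspline_basis(x, self.knots, self.degree)
        spline = sum(c * b for c, b in zip(self.coef[j][i], basis))
        return self.base_weight[j][i] * silu(x) + spline

    def forward(self, x):
        return [sum(self.edge(j, i, xi) for i, xi in enumerate(x))
                for j in range(len(self.coef))]


if __name__ == "__main__":
    layer = KANLayer(in_dim=3, out_dim=2)
    print(layer.forward([0.1, -0.4, 0.7]))
```

Because each edge function is a one-dimensional curve, it can be plotted directly and compared against a fixed activation such as GELU, which is the kind of inspection the abstract describes for lower versus higher encoder layers. In this sketch, inputs outside `x_range` fall back to the SiLU term alone once they leave the extended knot span.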

Cite This Study

Saeed et al. studied this question.

synapsesocial.com/papers/6a1fc730dee9eb8c0dce8053 https://doi.org/10.1038/s41598-026-55863-5
