What type of study is this?

This is a quantitative study.

October 23, 2025 | Open Access

Lightweight End-to-End Diacritical Arabic Speech Recognition Using CTC-Transformer with Relative Positional Encoding

Haifa Alaqel (Imam Mohammad ibn Saud Islamic University), Khalil El Hindi (Saudi Electronic University)

Key Points

The proposed model achieves a WER of 22.01% on the SASSC dataset, improving over traditional systems (best 28.4% WER).
Training has two stages: pretraining on Modern Standard Arabic (MSA), then progressive three-phase fine-tuning on diacritical Arabic datasets.
The architecture is an encoder-only Transformer with Relative Positional Encoding (RPE) trained with CTC loss, which removes the need for a conventional decoder.
With only ≈14 M parameters, the model is a viable option for environments with limited computational resources, although XLSR-Large (300 M parameters) reaches a lower WER of 12.17%.
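
The figures above are word error rates (WER). As a rough illustration of how such a number is typically computed, the sketch below divides the word-level edit distance (substitutions, deletions, insertions) by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration; the paper's actual scoring and text normalization may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against four reference words.
print(word_error_rate("a b c d", "a x c"))
```

Because words are compared as exact strings here, a word whose letters are right but whose diacritics are wrong still counts as an error, which is one reason diacritical WER tends to be higher than undiacritized WER.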

Abstract

Arabic automatic speech recognition (ASR) faces distinct challenges due to its complex morphology, dialectal variations, and the presence of diacritical marks that strongly influence pronunciation and meaning. This study introduces a lightweight approach for diacritical Arabic ASR that employs a Transformer encoder architecture enhanced with Relative Positional Encoding (RPE) and Connectionist Temporal Classification (CTC) loss, eliminating the need for a conventional decoder. A two-stage training process was applied: initial pretraining on Modern Standard Arabic (MSA), followed by progressive three-phase fine-tuning on diacritical Arabic datasets. The proposed model achieves a WER of 22.01% on the SASSC dataset, improving over traditional systems (best 28.4% WER) while using only ≈14 M parameters. In comparison, XLSR-Large (300 M parameters) achieves a WER of 12.17% but requires over 20× more parameters and substantially higher training and inference costs. Although XLSR attains lower error rates, the proposed model is far more practical for resource-constrained environments, offering reduced complexity, faster training, and lower memory usage while maintaining competitive accuracy. These results show that encoder-only Transformers with RPE, combined with CTC training and systematic architectural optimization, can effectively model Arabic phonetic structure while maintaining computational efficiency. This work establishes a new benchmark for resource-efficient diacritical Arabic ASR, making the technology more accessible for real-world deployment.
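
To illustrate why CTC training lets an encoder-only model work without a decoder, here is a minimal sketch of greedy CTC decoding: take the best label per frame, merge consecutive repeats, then drop the blank symbol. The function name `ctc_greedy_decode`, the blank index 0, and the toy vocabulary are assumptions for illustration; the abstract does not say which decoding strategy the authors use.

```python
from itertools import groupby

BLANK_ID = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores, id_to_token, blank_id=BLANK_ID):
    """Greedy CTC decoding over per-frame score rows (one row per frame)."""
    # 1. Best label per frame (argmax over the vocabulary axis).
    best_ids = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]
    # 2. Merge consecutive repeats of the same label.
    collapsed = [label for label, _ in groupby(best_ids)]
    # 3. Remove blanks; a blank between two equal labels keeps them distinct.
    return "".join(id_to_token[i] for i in collapsed if i != blank_id)


# Toy vocabulary (hypothetical): index 0 is the blank.
vocab = ["<blank>", "a", "b"]
scores = [
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.7, 0.2],    # a (repeat, merged with the previous frame)
    [0.6, 0.2, 0.2],    # blank separates the two "a" tokens
    [0.2, 0.7, 0.1],    # a
    [0.1, 0.2, 0.7],    # b
]
print(ctc_greedy_decode(scores, vocab))
```

In a diacritical setup the output vocabulary would presumably include the diacritic marks alongside the base letters, so the same collapse rule yields fully diacritized text directly from the encoder outputs.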
