Most of the world’s languages remain low-resource for automatic speech recognition (ASR). The bottleneck is not only the scarcity of labeled speech but also strong variation in pronunciation, prosody, dialect, and domain, together with a lack of linguistic tools and infrastructure. This survey reviews low-resource ASR through a unified transfer-learning perspective. We organize the literature into four connected strands: pretrain–adapt pipelines, parameter-efficient and domain-aware fine-tuning, task-based transfer through meta-learning and multi-task learning, and corpus expansion through augmentation and multimodal supervision. To compare methods that are usually reported on different tasks and metrics, we present a normalized cross-study synthesis and relate it to a unified operational risk analysis based on effective sample size m_eff, structural compatibility C_eff, and domain shift Γ. Beyond generic scaling trends, we pay particular attention to language-specific structure that is often under-modeled in low-resource ASR, especially tonal contrasts, tone sandhi, and rich or agglutinative morphology. We show how tone-aware constraints, F0-conditioned representations, morphology-aware output spaces, and auxiliary linguistic losses can complement self-supervised learning, parameter-efficient fine-tuning, and large multilingual speech models rather than replace them. The survey concludes with a synthesis roadmap that links large-scale speech foundation models, task-based transfer, language-specific inductive bias, and deployable adaptation, and with a set of concrete research questions for the next stage of low-resource ASR.
Qin et al. (Mon,) studied this question.