We present a novel speech data collection system tailored for the Korean language, combining a mobile recording app with a web-based processing backend. The platform enables large-scale crowdsourcing of speech samples and is designed to address the underrepresentation of Korean in existing AI corpora. Unlike conventional data collection efforts, our system actively targets atypical and diverse speakers (children, elderly adults, and individuals from clinical populations) whose voices are often absent from publicly available resources. The collected corpus covers the full range of Korean phonemes and reflects dialectal and stylistic variation, including spontaneous and informal speech, which is crucial given Korean’s complex phonology and sociolinguistic diversity. Critically, all collected data undergo structured post-processing: recordings are segmented, denoised, transcribed, and annotated through the web interface to support both detailed acoustic analysis and the training of robust automatic speech recognition (ASR) models. This pipeline ensures data quality while enabling efficient monitoring, correction, and labeling. Beyond improving ASR accuracy for underrepresented speaker groups, the corpus also provides a valuable resource for clinical speech assessment and diagnosis, particularly for developmental and age-related speech conditions. Our work underscores the importance of inclusive, language-specific data collection and processing frameworks for advancing AI speech technologies in non-English and culturally unique linguistic contexts.
Yoon et al. studied this question.