The effectiveness of Speech-to-Text (STT) models depends heavily on dataset-level audio and speech characteristics, yet the quantitative influence of these factors remains insufficiently explored, particularly for low-resource languages such as Vietnamese. This study examines how specific audio and speech characteristics, namely Speech Rate, Naturalness, Signal-to-Noise Ratio, Audio Coloration, and Environmental Reverberation, affect STT performance for Vietnamese. Among these, Naturalness is introduced as a new evaluative characteristic, with a dedicated metric for dataset selection. Experiments in a real-world setting with a social robot show that tailoring datasets based on these characteristics improves the accuracy of the trained models by approximately 2.66%, 4.72%, 8.36%, 5.89%, and 5.00%, respectively, compared to training on untailored datasets. Additionally, models trained on curated datasets can outperform conventional pre-trained models by up to approximately 8.7% in accuracy, highlighting the effectiveness of our approach. The methodology is most useful in practical deployments (such as social robots, voice assistants, and contact-center systems) where field audio is noisier and more reverberant and is produced by diverse, non-uniform speakers; its benefit diminishes once sufficiently large, representative training datasets exist.
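As a rough illustration of what tailoring a dataset by such characteristics could look like, the sketch below keeps only the clips whose Signal-to-Noise Ratio and Speech Rate fall inside target ranges. It is not the authors' method: the abstract does not give their selection criteria or metric definitions. The `Clip` fields, the `tailor` function, the syllables-per-second rate, and the threshold values are all hypothetical, and SNR is taken as the standard 10·log10 power ratio.

```python
import math
from dataclasses import dataclass


@dataclass
class Clip:
    """Hypothetical per-clip metadata; field names are illustrative only."""
    clip_id: str
    speech_power: float   # mean power over speech segments
    noise_power: float    # mean power over non-speech segments
    n_syllables: int      # syllable count of the transcript
    duration_s: float     # clip duration in seconds


def snr_db(speech_power: float, noise_power: float) -> float:
    """Standard SNR in decibels: 10 * log10(P_speech / P_noise)."""
    if speech_power <= 0 or noise_power <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(speech_power / noise_power)


def speech_rate(n_syllables: int, duration_s: float) -> float:
    """Speech rate in syllables per second (an assumed definition)."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return n_syllables / duration_s


def tailor(clips, snr_range, rate_range):
    """Keep clips whose SNR and speech rate both lie in the target ranges."""
    snr_lo, snr_hi = snr_range
    rate_lo, rate_hi = rate_range
    kept = []
    for clip in clips:
        snr = snr_db(clip.speech_power, clip.noise_power)
        rate = speech_rate(clip.n_syllables, clip.duration_s)
        if snr_lo <= snr <= snr_hi and rate_lo <= rate <= rate_hi:
            kept.append(clip)
    return kept


# Toy data; the values are invented for illustration.
clips = [
    Clip("a", speech_power=100.0, noise_power=1.0, n_syllables=12, duration_s=3.0),
    Clip("b", speech_power=10.0, noise_power=1.0, n_syllables=12, duration_s=3.0),
    Clip("c", speech_power=100.0, noise_power=1.0, n_syllables=30, duration_s=3.0),
]
selected = tailor(clips, snr_range=(15.0, 30.0), rate_range=(3.0, 6.0))
```

In a deployment-oriented setting the target ranges would presumably be chosen to resemble the field audio rather than to maximise cleanliness, which is why the sketch filters on a band instead of a single minimum threshold.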
Gia et al. (Mon,) studied this question.
synapsesocial.com/papers/69df2c01e4eeef8a2a6b0f07 — DOI: https://doi.org/10.1145/3797912
Kiet Pham Gia
Saigon International University
Tin Huynh
University Of Information Technology
Kiem Hoang
University Of Information Technology
ACM Transactions on Asian and Low-Resource Language Information Processing
University Of Information Technology
Wyższa Szkoła Technologii Informatycznych w Warszawie
Saigon International University