October 7, 2024 (Open Access)

Under-Represented Speech Dataset from Open Data: Case Study on the Romanian Language

Abstract

This paper introduces USPDATRO, a Romanian speech dataset constructed from open data and focused on under-represented voice types (children, young and old people, and female voices). The paper covers the methodology behind the dataset's construction, details of the dataset itself, and an evaluation of existing Romanian Automatic Speech Recognition (ASR) systems with different architectures. Results indicate that more under-represented speech content is needed when training ASR systems. Our approach can be extended to other low-resource languages, provided that open data are available.

Authors

Vasile Păiș

Artificial Intelligence Research Institute

Verginica Barbu Mititelu

Academy of Romanian Scientists

Elena Irimia

Artificial Intelligence Research Institute

Journals

Applied Sciences

Institutions

Romanian Academy
