What question did this study set out to answer?

This research aims to enhance automatic speech recognition of multilingual audio, particularly in challenging environments and with limited data.

May 14, 2026

Universal Speech and Acoustic Processing

Key Points

Collaborative development of self-supervised learning models for various speech processing tasks.
Creation of a neural blind source separation and diarization system trained on multichannel mixture signals.
Analysis of performance in noisy environments and with multiple speakers.
Achieved improved speech recognition performance in multilingual settings.
Successfully demonstrated effective separation and diarization in challenging acoustic conditions.
Aims to release a speech and acoustic processing system that works in multilingual settings such as meetings and conferences.

Abstract

Processing multilingual speech from multiple speakers remains a challenge. For under-represented languages in particular, the limited amount of transcribed speech data degrades automatic speech recognition performance. Multiple speakers in a noisy environment also make separation and diarization difficult in the frontend of automatic speech recognition. The research team at the National Institute of Advanced Industrial Science and Technology (AIST) has been collaborating with Carnegie Mellon University (CMU) to tackle these problems and achieve universal speech and audio processing. To improve performance with a limited amount of transcribed data, we are developing and analyzing self-supervised learning models for speech that can be applied to various speech processing tasks such as automatic speech recognition, voice conversion, and speech emotion recognition. For the separation and diarization of meeting recordings, we have developed a neural blind source separation and diarization system that can be trained only on multichannel mixture signals and speaker activities. Through these efforts, we seek to release a speech and acoustic processing system that works in multilingual situations such as meetings and conferences.

Cite This Study

Satoru Fukayama studied this question.

synapsesocial.com/papers/6a056899a550a87e60a20f8c https://doi.org/10.1121/10.0040511
