Processing multilingual speech from multiple speakers remains a challenge. For under-represented languages in particular, the limited amount of transcribed speech data degrades automatic speech recognition performance. Multiple speakers in a noisy environment also make separation and diarization difficult in the front-end processing for automatic speech recognition. The research team at the National Institute of Advanced Industrial Science and Technology (AIST) has been collaborating with Carnegie Mellon University (CMU) to tackle these problems, with the goal of universal speech and audio processing. To improve performance when transcribed data is limited, we are developing and analyzing self-supervised learning models for speech that can be applied to various speech processing tasks such as automatic speech recognition, voice conversion, and speech emotion recognition. For the separation and diarization of meeting recordings, we have developed a neural blind source separation and diarization method that can be trained only on multichannel mixture signals and speaker activities. Through these efforts, we seek to release a speech and acoustic processing system that works in multilingual situations such as meetings and conferences.
Satoru Fukayama (Wed,) studied this question.