Speaker diarization is the automatic segmentation of an audio recording by speaker identity. It partitions the audio stream into homogeneous segments, each corresponding to a particular speaker or speaker turn. Many speaker diarization systems struggle with overlapped speech, background noise, and low-quality speaker embeddings. Clustering speaker embeddings is a widespread approach to speaker diarization, but because it is unsupervised, it is not directly optimized to minimize diarization errors and has difficulty handling overlapping speech. End-to-end neural building blocks offer an alternative, in which a neural network directly produces the diarization output for a multi-speaker recording. The end-to-end method handles speaker overlap during both training and inference, which is especially relevant because overlapped speech increases with the number of speakers in a recording. This study presents two contributions: first, an enhanced M-Diarization Dataset with long audio files that closely resemble natural live streams and include more overlapped speech and background noise; and second, an improved end-to-end architecture. The evaluation uses three test sets: Testset1 with 2 speakers, Testset2 with 4 speakers, and Testset3 with 15 speakers. Testset3 yields the best results of the three, achieving a Diarization Error Rate (DER) of 4.9% with the pre-trained pipeline and 1.4% with the fine-tuned pipeline.
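For readers unfamiliar with the metric, the sketch below illustrates how a Diarization Error Rate is commonly computed at the frame level: missed speech, false alarm, and speaker confusion are summed and divided by the total reference speech. It is not the paper's evaluation code. The function name `frame_level_der`, the set-per-frame representation, and the toy labels are all hypothetical, and the sketch assumes hypothesis speaker labels have already been mapped to reference labels (real scorers also search for the optimal mapping and usually work on timed segments).

```python
from typing import List, Set


def frame_level_der(reference: List[Set[str]], hypothesis: List[Set[str]]) -> float:
    """Frame-level DER sketch (hypothetical helper, not the paper's scorer).

    Each list element is the set of speakers active in one frame; an empty
    set means non-speech. Labels are assumed to be already mapped between
    reference and hypothesis.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must cover the same frames")

    missed = false_alarm = confusion = total = 0
    for ref, hyp in zip(reference, hypothesis):
        n_ref, n_hyp = len(ref), len(hyp)
        n_correct = len(ref & hyp)
        # Reference speakers with no hypothesis speaker to account for them.
        missed += max(0, n_ref - n_hyp)
        # Hypothesis speakers beyond the number actually speaking.
        false_alarm += max(0, n_hyp - n_ref)
        # Speakers that were detected but given the wrong identity.
        confusion += min(n_ref, n_hyp) - n_correct
        # Overlapped frames count once per active reference speaker.
        total += n_ref

    if total == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / total


# Toy example with made-up labels: frame 3 is overlapped speech.
reference = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set()]
hypothesis = [{"A"}, {"A"}, {"A"}, {"A"}, {"B"}]
print(frame_level_der(reference, hypothesis))  # → 0.6
```

In the toy example, the overlapped frame contributes one missed speaker, the fourth frame one confusion, and the final frame one false alarm, giving 3 errors over 5 reference speaker-frames. Because overlapped frames add to both the numerator and the denominator, a system that ignores overlap is penalized in proportion to how much overlap the recording contains.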
Aung et al. studied this question.