March 3, 2026 | Open Access

On the use of DiaPer models and matching algorithm for RTVE speaker diarization 2024 dataset

Key points

The proposed DiaPer framework achieves a competitive Diarization Error Rate (DER) of 17.34% on the RTVE 2024 test set.
It delivers a 63.6% relative improvement over the baseline DiaPer model applied directly to complete recordings.
By applying DiaPer to short audio chunks and then reconciling speaker identities across them, the framework makes diarization of long recordings tractable.
ReDimNet embeddings outperform ECAPA-TDNN and ResNet embeddings at keeping speaker identities consistent across audio segments.
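For readers unfamiliar with the metrics, the standard definition of DER and the usual meaning of "relative improvement" can be written as below. The symbols are illustrative and do not come from the paper: T_miss, T_FA, and T_conf denote missed-speech, false-alarm, and speaker-confusion time, and T_speech is the total reference speech time.

```latex
\mathrm{DER} = \frac{T_{\text{miss}} + T_{\text{FA}} + T_{\text{conf}}}{T_{\text{speech}}},
\qquad
\Delta_{\text{rel}} = \frac{\mathrm{DER}_{\text{base}} - \mathrm{DER}_{\text{new}}}{\mathrm{DER}_{\text{base}}}
```

Assuming the 63.6% figure is a relative DER reduction (this summary does not state the metric explicitly), a final DER of 17.34% would imply a baseline DER of roughly 17.34 / (1 - 0.636) ≈ 47.6%. That baseline value is inferred here, not reported in this summary.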

Abstract

Speaker diarization in broadcast media presents significant challenges due to long-duration recordings, numerous speakers, and complex acoustic conditions. End-to-end neural diarization models like DiaPer (Diarization with Perceiver), which directly predict speaker activity from audio features without intermediate clustering steps, have shown promising results. However, their application to extended recordings remains computationally prohibitive due to quadratic complexity with respect to input length. This paper addresses these limitations by proposing a framework that applies DiaPer to short audio chunks and subsequently reconciles speaker identities across segments using a matching algorithm. We systematically analyze optimal chunk durations for DiaPer processing and introduce an enhanced chunk-matching algorithm leveraging state-of-the-art speaker embeddings, comparing Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN), Residual Networks (ResNet), and Reshape Dimensions Network (ReDimNet) architectures. Our experimental evaluation on the challenging Radio Televisión Española (RTVE) datasets shows that ReDimNet embeddings consistently outperform alternatives, achieving substantial improvements in speaker identity consistency across segments. The proposed approach yields a Diarization Error Rate (DER) of 17.34% on the RTVE 2024 test set, which is competitive with state-of-the-art systems while achieving a 63.6% relative improvement over the baseline DiaPer model applied directly to complete audio recordings. This demonstrates that end-to-end neural approaches can be successfully extended to hour-long recordings while maintaining computational efficiency.

• Comprehensive DiaPer framework for broadcast media using chunk-based processing.
• Optimal chunking: 1-min training with 2-min inference for RTVE diarization.
• ReDimNet outperforms ECAPA-TDNN and ResNet in speaker embedding for chunk matching.
• 63.6% boost over baseline DiaPer on challenging RTVE 2024 hour-long recordings.
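The chunk-then-match idea described above can be illustrated with a minimal sketch. This is not the paper's algorithm, which the abstract does not specify: it assumes a simple greedy cosine-similarity assignment, and the function names (`split_into_chunks`, `match_chunk`), the 0.5 threshold, and the toy 2-dimensional embeddings are all hypothetical. In practice, the per-speaker embeddings would come from a speaker embedding model such as ReDimNet.

```python
import math


def split_into_chunks(total_seconds, chunk_seconds=120.0):
    """Return (start, end) times covering the recording in fixed-size chunks.

    The 120 s default mirrors the 2-min inference chunks mentioned in the
    highlights; it is an illustrative default, not a tuned value.
    """
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    bounds = []
    start = 0.0
    while start < total_seconds:
        bounds.append((start, min(start + chunk_seconds, total_seconds)))
        start += chunk_seconds
    return bounds


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def match_chunk(global_speakers, chunk_embeddings, threshold=0.5):
    """Map each chunk-local speaker index to a global speaker index.

    global_speakers: list of reference embeddings, extended in place when a
        local speaker matches nobody (hypothetical bookkeeping choice).
    chunk_embeddings: one embedding per speaker found in the current chunk.
    threshold: assumed minimum cosine similarity for accepting a match.
    """
    # Score every local/global pair and visit them from most to least similar.
    pairs = sorted(
        (
            (cosine(emb, ref), local_id, global_id)
            for local_id, emb in enumerate(chunk_embeddings)
            for global_id, ref in enumerate(global_speakers)
        ),
        reverse=True,
    )
    mapping = {}
    used_global = set()
    for score, local_id, global_id in pairs:
        if score < threshold:
            break  # all remaining pairs are even less similar
        if local_id in mapping or global_id in used_global:
            continue  # keep the assignment one-to-one within a chunk
        mapping[local_id] = global_id
        used_global.add(global_id)
    # Unmatched local speakers become new global speakers.
    for local_id, emb in enumerate(chunk_embeddings):
        if local_id not in mapping:
            global_speakers.append(list(emb))
            mapping[local_id] = len(global_speakers) - 1
    return mapping


# Toy usage: two chunks, with the speakers appearing in swapped order.
speakers = []
first = match_chunk(speakers, [[1.0, 0.0], [0.0, 1.0]])
second = match_chunk(speakers, [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]])
print(first)   # {0: 0, 1: 1}
print(second)  # {0: 1, 1: 0, 2: 2}
```

The greedy assignment keeps the sketch dependency-free; an optimal one-to-one assignment (for example the Hungarian algorithm) would be a natural substitute. The sketch also never updates the stored reference embeddings after a match, which a real system would likely do.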

Cite This Study

Álvarez-Trejos et al. studied this question.

synapsesocial.com/papers/69a76875badf0bb9e87e4b4a
https://doi.org/10.1016/j.csl.2026.101948
