Speaker diarization in broadcast media presents significant challenges due to long-duration recordings, numerous speakers, and complex acoustic conditions. End-to-end neural diarization models like DiaPer (Diarization with Perceiver), which directly predict speaker activity from audio features without intermediate clustering steps, have shown promising results. However, their application to extended recordings remains computationally prohibitive due to quadratic complexity with respect to input length. This paper addresses these limitations by proposing a framework that applies DiaPer to short audio chunks and subsequently reconciles speaker identities across segments using a matching algorithm. We systematically analyze optimal chunk durations for DiaPer processing and introduce an enhanced chunk-matching algorithm leveraging state-of-the-art speaker embeddings, comparing Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN), Residual Networks (ResNet), and Reshape Dimensions Network (ReDimNet) architectures. Our experimental evaluation on the challenging Radio Televisión Española (RTVE) datasets shows that ReDimNet embeddings consistently outperform alternatives, achieving substantial improvements in speaker identity consistency across segments. The proposed approach yields a Diarization Error Rate (DER) of 17.34% on the RTVE 2024 test set, which is competitive with state-of-the-art systems while achieving a 63.6% relative improvement over the baseline DiaPer model applied directly to complete audio recordings. This demonstrates that end-to-end neural approaches can be successfully extended to hour-long recordings while maintaining computational efficiency.

• Comprehensive DiaPer framework for broadcast media using chunk-based processing.
• Optimal chunking: 1-min training with 2-min inference for RTVE diarization.
• ReDimNet outperforms ECAPA-TDNN and ResNet in speaker embedding for chunk matching.
• 63.6% boost over baseline DiaPer on challenging RTVE 2024 hour-long recordings.
Álvarez-Trejos et al. studied this question.
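
The chunk-then-match idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify how matching is done, so the sketch assumes one speaker embedding per local speaker per chunk, cosine similarity as the score, Hungarian assignment (`scipy.optimize.linear_sum_assignment`) as the matcher, and a hypothetical acceptance threshold of 0.5. The function names `chunk_bounds` and `match_chunk` are illustrative, and the embeddings are toy vectors standing in for ECAPA-TDNN, ResNet, or ReDimNet outputs.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def chunk_bounds(total_sec, chunk_sec=120.0):
    """Split [0, total_sec) into consecutive chunks of at most chunk_sec seconds."""
    starts = np.arange(0.0, total_sec, chunk_sec)
    return [(float(s), float(min(s + chunk_sec, total_sec))) for s in starts]


def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def match_chunk(global_embs, local_embs, threshold=0.5):
    """Map each local speaker of one chunk to a global speaker id.

    global_embs: list of embeddings of speakers seen so far (extended in place).
    local_embs:  list of embeddings, one per speaker found in the current chunk.
    threshold:   hypothetical minimum cosine similarity to accept a match.
    """
    labels = [-1] * len(local_embs)
    if global_embs and local_embs:
        sim = cosine_similarity(np.stack(local_embs), np.stack(global_embs))
        # One-to-one assignment that maximizes total similarity.
        rows, cols = linear_sum_assignment(sim, maximize=True)
        for r, c in zip(rows, cols):
            if sim[r, c] >= threshold:
                labels[r] = int(c)
    # Unmatched local speakers are registered as new global speakers.
    for i, lab in enumerate(labels):
        if lab == -1:
            labels[i] = len(global_embs)
            global_embs.append(np.asarray(local_embs[i], dtype=float))
    return labels


# Toy example: three orthogonal "speakers" a, b, c.
a, b, c = np.eye(3)
known = []
labels_chunk1 = match_chunk(known, [a, b])             # both speakers are new
labels_chunk2 = match_chunk(known, [b + 0.05 * a, c])  # b reappears, c is new
bounds = chunk_bounds(300.0, chunk_sec=120.0)          # 5-minute recording
```

In this sketch the stored global embeddings are never updated after a match. A practical system would likely refine them (for example with a running average) as more chunks are processed, but the abstract does not say how the proposed method handles this.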