November 29, 2025 | Open Access

AVCLNet: Multimodal Multispeaker Tracking Network Using Audio‐Visual Contrastive Learning

Abstract

Audio‐visual speaker tracking aims to determine the locations of multiple speakers in a scene by leveraging signals captured from multisensor platforms. Multimodal fusion methods can improve both the accuracy and the robustness of speaker tracking. However, in complex multispeaker scenarios, critical challenges such as cross‐modal feature discrepancy, ambiguity in localising weak sound sources and frequent identity switch errors remain unresolved. These problems severely hinder the modelling of speaker identity consistency and consequently lead to degraded tracking accuracy and unstable trajectories. To address them, this paper proposes a multimodal multispeaker tracking network using audio‐visual contrastive learning (AVCLNet). AVCLNet integrates heterogeneous modal representations into a unified space through audio‐visual contrastive learning, which facilitates cross‐modal feature alignment, mitigates cross‐modal feature bias and enhances identity‐consistent representations. In the audio‐visual measurement stage, we design a vision‐guided weak sound source weighted enhancement method, which leverages visual cues to establish cross‐modal mappings and employs a spatiotemporal dynamic weighting mechanism to improve the detectability of weak sound sources. Furthermore, in the data association phase, a dual geometric constraint strategy combines 2D and 3D spatial geometric information to reduce frequent identity switch errors. Experiments on the AV16.3 and CAV3D datasets show that AVCLNet outperforms state‐of‐the‐art methods, demonstrating superior robustness in multispeaker scenarios.
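The abstract does not give the form of the contrastive objective, so the following is only an illustrative sketch of one common choice, a symmetric InfoNCE loss over paired audio and visual embeddings. The function names, the temperature value and the toy embeddings are all assumptions made for illustration and are not taken from the paper; the sketch also assumes non-zero embedding vectors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_diag(logits):
    """Return the log-softmax of the diagonal (matched-pair) entry of each row."""
    out = []
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum = m + math.log(sum(math.exp(v - m) for v in row))
        out.append(row[i] - log_sum)
    return out


def symmetric_info_nce(audio_emb, visual_emb, temperature=0.1):
    """Symmetric InfoNCE loss (hypothetical helper, not the paper's loss).

    Row i of audio_emb and row i of visual_emb are assumed to belong to
    the same speaker; every other pairing is treated as a negative.
    """
    a = [l2_normalize(x) for x in audio_emb]
    v = [l2_normalize(x) for x in visual_emb]
    # sim[i][j] = cosine(a_i, v_j) / temperature
    sim = [[sum(p * q for p, q in zip(ai, vj)) / temperature for vj in v]
           for ai in a]
    sim_t = [list(col) for col in zip(*sim)]  # visual-to-audio direction
    audio_to_visual = -sum(log_softmax_diag(sim)) / len(sim)
    visual_to_audio = -sum(log_softmax_diag(sim_t)) / len(sim_t)
    return 0.5 * (audio_to_visual + visual_to_audio)


# Toy embeddings (made up): two speakers, three feature dimensions.
audio = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
visual_aligned = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.1]]
visual_swapped = [visual_aligned[1], visual_aligned[0]]

loss_aligned = symmetric_info_nce(audio, visual_aligned)
loss_swapped = symmetric_info_nce(audio, visual_swapped)
```

In this toy setup the loss is lower when each audio embedding is paired with the visual embedding of the same speaker than when the pairs are swapped, which is the behaviour a cross‐modal alignment objective of this kind is meant to reward.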
