Pirated videos cause substantial economic losses to video platforms and harm content creators. However, existing video copy detection methods are often visual-only and perform poorly on short copied segments. To address these limitations, we propose a multimodal video copy detection framework that integrates deep visual and audio features. We further enhance cross-video matching with a Transformer-based attention module that combines self-attention and cross-attention, producing more discriminative similarity maps. For robust localization of short copied segments, we introduce an interleaved subsampling module (ISM) in the localization stage. Experiments on VCDB and VCSL demonstrate the effectiveness of our approach. At the segment level, our method achieves F1 scores of 77.32% on VCDB and 67.17% on VCSL. On VCSL, the average video-level FRR/FAR score is 7.175%. Overall, the proposed method consistently outperforms prior video copy detection approaches.
Wang et al. studied this question.
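To make the pipeline in the abstract more concrete, the following is a minimal sketch of two of its ingredients: a frame-to-frame similarity map computed from fused visual and audio features, and an interleaved split of a frame sequence. It is an illustration under stated assumptions, not the method itself. Fusion by concatenating L2-normalized features, the use of cosine similarity, and the stride-based interpretation of interleaved subsampling are all assumptions. The function names (`fuse_features`, `similarity_map`, `interleaved_subsample`) are hypothetical, and the Transformer attention module is omitted entirely.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_features(visual, audio):
    """Assumed fusion: concatenate per-frame normalized visual and audio features."""
    return [l2_normalize(v) + l2_normalize(a) for v, a in zip(visual, audio)]


def similarity_map(query, reference):
    """Cosine similarity between every query frame and every reference frame."""
    q = [l2_normalize(f) for f in query]
    r = [l2_normalize(f) for f in reference]
    return [[sum(a * b for a, b in zip(qf, rf)) for rf in r] for qf in q]


def interleaved_subsample(frames, stride):
    """Assumed reading of interleaving: one subsequence per offset 0..stride-1."""
    return [frames[offset::stride] for offset in range(stride)]


# Toy data: two frames, each with a 2-D visual and a 2-D audio feature.
visual = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse_features(visual, audio)

# Comparing a video with itself gives high values on the diagonal.
sim = similarity_map(fused, fused)

# Six frame indices split into two interleaved subsequences.
parts = interleaved_subsample(list(range(6)), 2)
```

In this sketch every frame appears in exactly one subsequence, so a short copied segment still contributes frames to each subsampled view. Whether the ISM behaves this way is an assumption here.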