What question did this study set out to answer?

The aim is to develop a more effective method for detecting copied video segments using both visual and auditory features.

March 1, 2026 | Open Access

Transformer-based audio-visual features for video copy detection

Key Points

Proposed a multimodal video copy detection framework incorporating visual and auditory deep features.
Implemented a Transformer-based attention module to enhance cross-video feature matching.
Introduced an interleaved subsampling module for robust localization of short copied segments.
Conducted experiments on the VCDB and VCSL datasets to validate the method's effectiveness.
Achieved an F1-score of 77.32% on VCDB and 67.17% on VCSL at the segment level.
Obtained an average video-level FRR/FAR score of 7.175% on VCSL.
Demonstrated consistent superiority over previous video copy detection methods.

Abstract

Pirated videos cause substantial economic losses to video platforms and harm content creators. However, existing video copy detection methods are often visual-only and perform poorly on short-duration copied segments. To address these limitations, we propose a multimodal video copy detection framework that integrates visual and auditory deep features. We further enhance cross-video matching by applying a Transformer-based attention module with self-attention and cross-attention, producing more discriminative similarity maps. For robust localization of short copied segments, we introduce an interleaved subsampling module (ISM) within the localization stage. Experiments on VCDB and VCSL demonstrate the effectiveness of our approach. At the segment level, our method achieves F1-scores of 77.32% on VCDB and 67.17% on VCSL. On VCSL, the average video-level FRR/FAR score is 7.175%. Overall, the proposed method consistently outperforms prior video copy detection approaches.
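
To make the idea of a cross-video similarity map concrete, here is a minimal sketch in plain Python. It is not the paper's method: it assumes per-frame visual and audio feature vectors are already available, fuses them by simple concatenation, and scores frame pairs with cosine similarity in place of the Transformer-based self-attention and cross-attention described above. The function names (`l2_normalize`, `fuse`, `similarity_map`) and the toy feature values are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(visual, audio):
    """Fuse per-frame features (assumed fusion rule, not the paper's).

    Each modality is normalized on its own so neither dominates, the two
    vectors are concatenated, and the result is normalized again.
    """
    return [
        l2_normalize(l2_normalize(v) + l2_normalize(a))
        for v, a in zip(visual, audio)
    ]


def similarity_map(query, reference):
    """Return a (query frames x reference frames) cosine-similarity matrix.

    Inputs are expected to be unit-length frame vectors, so the dot
    product equals the cosine similarity.
    """
    return [
        [sum(q * r for q, r in zip(q_frame, r_frame)) for r_frame in reference]
        for q_frame in query
    ]


if __name__ == "__main__":
    # Toy features: 2 query frames and 3 reference frames, 2-D per modality.
    query = fuse([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    reference = fuse(
        [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]],
        [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]],
    )
    for row in similarity_map(query, reference):
        print([round(value, 3) for value in row])
```

In a map like this, a copied segment would typically appear as a diagonal streak of high values, which a localization stage then has to detect. Short copies yield short streaks, which is the case the ISM is described as targeting.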

Wang et al. studied this question.

synapsesocial.com/papers/69a3d79dec16d51705d2de5c https://doi.org/10.1007/s44443-026-00578-w
