Efficiently and accurately extracting key information from videos has become a core challenge in computer vision. Traditional methods rely on handcrafted features and shallow models, which makes it difficult to capture complex spatiotemporal dynamics, while existing deep learning schemes still fall short in detection accuracy and precision. To improve the accuracy and robustness of key information detection in videos, an improved 3D DenseNet is proposed. The method introduces the P3D module, which decomposes spatiotemporal convolution to reduce computational complexity; integrates a dual-attention mechanism (channel-spatial and temporal) to enhance feature expression; and combines a self-distillation structure with a cross-modal attention mechanism to fuse visual and auditory information. The proposed method reached accuracies of 95.4% and 93.1% on the UCF-Crime and Surveillance Fight datasets, significantly higher than those of traditional models. It also had the lowest errors, 0.18 and 0.19, and the highest AUC values, 91.8% and 96.0%. Moreover, after P3D and the attention mechanism were introduced, the accuracy of the proposed method improved by 25%. The method improves the accuracy of key information detection in videos and provides a new solution for multi-modal video understanding in complex scenes.
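The complexity reduction attributed to the P3D decomposition can be illustrated by counting weights. The sketch below is only an illustration under assumptions not stated in the abstract: a 3×3×3 kernel, 64 input and output channels, no bias terms, and a serial arrangement in which a 1×3×3 spatial convolution is followed by a 3×1×1 temporal convolution. The function names are hypothetical and the actual layer configuration of the proposed network may differ.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Number of weights in one 3D convolution (bias omitted)."""
    return c_in * c_out * kt * kh * kw


def full_3d_params(c_in, c_out, k=3):
    """Weights of a standard k x k x k spatiotemporal convolution."""
    return conv3d_params(c_in, c_out, k, k, k)


def p3d_params(c_in, c_out, k=3):
    """Weights of a decomposed block: 1 x k x k spatial convolution,
    then k x 1 x 1 temporal convolution (serial arrangement, assumed)."""
    spatial = conv3d_params(c_in, c_out, 1, k, k)
    temporal = conv3d_params(c_out, c_out, k, 1, 1)
    return spatial + temporal


# Assumed channel width of 64; chosen only for illustration.
full = full_3d_params(64, 64)
decomposed = p3d_params(64, 64)
ratio = decomposed / full

print(f"full 3D conv weights:   {full}")
print(f"decomposed weights:     {decomposed}")
print(f"decomposed / full:      {ratio:.3f}")
```

With equal input and output channel counts, the decomposed block needs 9 + 3 = 12 weights per channel pair instead of 27, so under these assumptions it uses roughly 44% of the weights of the full 3D convolution.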
Changjia Liu studied this question.