• Proposed a hierarchical VAD model capturing multi-granularity temporal information.
• Proposed learning comprehensive textual descriptions using VLMs and LLMs.
• Achieved robust detection of both fine-grained and coarse-grained anomalies.
• Introduced a training-free VAD framework based on similarity of aligned concepts.
• Demonstrated superior performance on various datasets and supervision levels.

Video Anomaly Detection (VAD) is a crucial task in computer vision, with applications in surveillance, transportation, and industrial monitoring. Recent advances in Vision-Language Models (VLMs) have opened a promising direction for VAD in weakly supervised and unsupervised settings by leveraging both visual and textual modalities. However, existing VLM-based methods often overlook coarse-to-fine temporal information, which limits their ability to handle complex anomalies. To address this issue, we propose a hierarchical VLM that enhances visual-textual feature representation by capturing video content at multiple levels of abstraction. Our algorithm generates a hierarchical view of the video by dividing it into short and long views. We extract hierarchical visual features and construct a bag of comprehensive textual descriptions of anomalies using existing VLMs, without relying on ground-truth data. Our model fuses these modalities and is fine-tuned for anomaly score prediction in weakly supervised, unsupervised, and one-class settings. We also introduce a training-free VAD framework based on similarity scores. By aligning complex concepts across hierarchical views, our model captures both fine-grained details and high-level contextual information, leading to robust feature representations. Extensive experiments on the UCF-Crime, ShanghaiTech, and XD-Violence datasets demonstrate the superior performance of our method compared with state-of-the-art VAD methods. The code is available on GitHub at https://github.com/MR81224/HVLMCLD-VAD.
Radi et al. (Mon,) studied this question.
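
The training-free idea described in the abstract, scoring frames by the similarity between hierarchical visual features and textual anomaly descriptions, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that per-frame visual embeddings and text embeddings of anomaly descriptions have already been produced by some VLM and share one embedding space. The function names, the mean pooling used to build short and long views, the window sizes, the maximum over descriptions, and the fusion weight `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def window_mean(features, window):
    """Mean-pool frame features over a centered window, one output row per frame.

    Assumption: a temporal "view" is approximated by averaging neighboring frames.
    """
    n = len(features)
    half = window // 2
    return np.stack([
        features[max(0, i - half): min(n, i + half + 1)].mean(axis=0)
        for i in range(n)
    ])


def anomaly_scores(frame_feats, text_feats, short_window=3, long_window=9, alpha=0.5):
    """Fuse short-view and long-view similarity to a bag of anomaly descriptions.

    frame_feats: (n_frames, dim) visual embeddings (assumed precomputed).
    text_feats:  (n_descriptions, dim) text embeddings (assumed precomputed).
    Returns one score per frame; higher means more similar to some description.
    """
    text = l2_normalize(np.asarray(text_feats, dtype=float))
    frames = np.asarray(frame_feats, dtype=float)
    per_view = []
    for window in (short_window, long_window):
        view = l2_normalize(window_mean(frames, window))
        similarity = view @ text.T               # (n_frames, n_descriptions)
        per_view.append(similarity.max(axis=1))  # best-matching description
    # Hypothetical fusion rule: weighted average of the two views.
    return alpha * per_view[0] + (1.0 - alpha) * per_view[1]
```

In this sketch the short window responds to brief, localized events, while the long window contributes surrounding context. Any threshold applied to the resulting scores would have to be chosen per dataset.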