March 3, 2026 | Open Access

Hierarchical vision-language model with comprehensive language description for video anomaly detection

Key Points

The model robustly detects both fine- and coarse-grained anomalies in video.
The model outperforms state-of-the-art VAD methods on the UCF-Crime, ShanghaiTech, and XD-Violence datasets.
A hierarchical approach captures multi-granularity temporal information for better anomaly representation.
Applications include surveillance, transportation, and industrial monitoring.

Abstract

• Proposed a hierarchical VAD model capturing multi-granularity temporal information.
• Proposed learning comprehensive textual descriptions using VLMs and LLMs.
• Achieved robust detection of both fine- and coarse-grained anomalies.
• Introduced a training-free VAD framework based on similarity of aligned concepts.
• Demonstrated superior performance on various datasets and supervision levels.

Video Anomaly Detection (VAD) is a crucial task in computer vision, with applications in surveillance, transportation, and industrial monitoring. Recent advances in Vision-Language Models (VLMs) have shown promise for VAD in weakly supervised and unsupervised settings by leveraging visual and textual modalities. However, existing VLM-based methods often overlook coarse-to-fine temporal information, limiting their ability to handle complex anomalies. To address this issue, we propose a hierarchical VLM that enhances visual-textual feature representation by capturing video content at multiple levels of abstraction. Our algorithm generates a hierarchical view of the video, dividing it into short and long views. We extract hierarchical visual features and construct a bag containing comprehensive textual descriptions of anomalies using existing VLMs without relying on ground truth data. Our model fuses these modalities and is fine-tuned for anomaly score prediction in weakly supervised, unsupervised, and one-class settings. We also introduce a training-free VAD framework based on similarity scores. By aligning complex concepts across hierarchical views, our model captures both fine-grained details and high-level contextual information, leading to robust feature representations. Extensive experiments on the UCF-Crime, ShanghaiTech, and XD-Violence datasets demonstrate the superior performance of our method compared to state-of-the-art VAD methods. The code is available on GitHub at: https://github.com/MR81224/HVLMCLD-VAD.
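The training-free idea in the abstract (similarity scores computed over short and long views) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the window sizes, the mean-pooling of frame features, the use of a single "normal" description as a contrast, and the equal-weight fusion of the two views are all assumptions made for illustration. Embeddings are assumed to come from some pretrained vision-language encoder and are passed in as plain lists of floats.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def mean_pool(frames: Sequence[Vector]) -> List[float]:
    """Average a non-empty list of equal-length feature vectors."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def split_views(frames: Sequence[Vector], window: int) -> List[List[float]]:
    """Cut the frame sequence into consecutive windows and mean-pool each one."""
    return [mean_pool(frames[i:i + window]) for i in range(0, len(frames), window)]


def view_score(view: Vector, anomaly_texts: Sequence[Vector], normal_text: Vector) -> float:
    """Assumed scoring rule: best match in the anomaly bag minus match to 'normal'."""
    return max(cosine(view, t) for t in anomaly_texts) - cosine(view, normal_text)


def frame_scores(frames: Sequence[Vector], anomaly_texts: Sequence[Vector],
                 normal_text: Vector, short: int = 2, long: int = 4) -> List[float]:
    """Per-frame score: equal-weight average of the short and long views covering it."""
    short_s = [view_score(v, anomaly_texts, normal_text) for v in split_views(frames, short)]
    long_s = [view_score(v, anomaly_texts, normal_text) for v in split_views(frames, long)]
    return [0.5 * (short_s[i // short] + long_s[i // long]) for i in range(len(frames))]


# Toy 2-D embeddings (hypothetical): four "normal" frames followed by four "anomalous" ones.
frames = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
scores = frame_scores(frames, anomaly_texts=[[0.0, 1.0]], normal_text=[1.0, 0.0])
```

In this toy setup the last four frames score higher than the first four. Short windows react to brief events while long windows add context, which is the intuition behind combining both granularities; how the paper actually weights or aligns them is not stated in the abstract.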

Cite This Study

Radi et al. studied this question.

synapsesocial.com/papers/69a765debadf0bb9e87dacae https://doi.org/10.1016/j.knosys.2026.115466