What question did this study set out to answer?

The study aims to develop an annotation-free framework for video anomaly detection that enhances interpretability and performance.

March 15, 2026

SOR-BDNet: Semantic-Optical Representation for Boundary-Aware Video Anomaly Detection with GPT-4o

Key Points

The proposed SOR-BDNet framework integrates semantic and motion information for video anomaly detection.
It uses RAFT to estimate dense motion fields and combines them with RGB images to form spatiotemporal inputs.
A GPT-4o module generates semantic captions, and a memory bank stores captions of normal events for comparison.
It achieves frame-level accuracies of 97.96% on UCSD-Ped2, 82.86% on Avenue, 87.36% on ShanghaiTech, and 85.64% on UCF-Crime.
The authors report improved interpretability and generalization across diverse surveillance scenarios.
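
As a rough illustration of the RAFT-plus-RGB fusion step in the key points, the sketch below color-codes a dense flow field and places it next to the RGB frame. It does not run RAFT; the flow field is assumed to be already estimated. The summary does not say whether the concatenation is channel-wise or spatial, so the side-by-side layout here is an assumption, as are the function names flow_to_rgb and fuse_side_by_side and the hue/brightness color coding.

```python
import colorsys
import math


def flow_to_rgb(flow):
    """Color-code a dense flow field: hue encodes direction, brightness encodes speed.

    `flow` is a list of rows, each row a list of (u, v) displacement pairs.
    This coding is a common visualization convention, assumed here for illustration.
    """
    magnitudes = [math.hypot(u, v) for row in flow for (u, v) in row]
    # Normalize by the largest displacement; fall back to 1.0 for a static frame.
    scale = max(magnitudes, default=0.0) or 1.0
    image = []
    for row in flow:
        out_row = []
        for u, v in row:
            hue = (math.atan2(v, u) / (2 * math.pi)) % 1.0
            value = min(math.hypot(u, v) / scale, 1.0)
            r, g, b = colorsys.hsv_to_rgb(hue, 1.0, value)
            out_row.append((round(r * 255), round(g * 255), round(b * 255)))
        image.append(out_row)
    return image


def fuse_side_by_side(rgb, flow):
    """Place the RGB frame and its color-coded flow map next to each other.

    Assumption: spatial (side-by-side) concatenation, so the result is still
    an ordinary image that a captioning model could accept.
    """
    flow_image = flow_to_rgb(flow)
    if len(rgb) != len(flow_image) or any(
        len(a) != len(b) for a, b in zip(rgb, flow_image)
    ):
        raise ValueError("RGB frame and flow field must have the same height and width")
    return [rgb_row + flow_row for rgb_row, flow_row in zip(rgb, flow_image)]


# Tiny 1x2 frame: the left pixel is static, the right pixel moves to the right.
frame = [[(10, 20, 30), (40, 50, 60)]]
flow = [[(0.0, 0.0), (1.0, 0.0)]]
fused = fuse_side_by_side(frame, flow)
```

In this toy input the fused row is four pixels wide: the two original RGB pixels followed by the two flow pixels, where the static pixel is black and the moving pixel is fully saturated.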

Abstract

In recent years, video anomaly detection (VAD) has shifted from conventional appearance-based modeling to semantically driven frameworks empowered by large language models (LLMs). Traditional reconstruction- and prediction-based methods, relying on motion or appearance patterns learned from normal data, often misclassify previously unseen yet semantically normal events as anomalies. To address this limitation, we propose SOR-BDNet (Semantic-Optical Representation with Boundary Detection Network), an annotation-free multimodal VAD framework that jointly leverages visual appearance and motion dynamics to generate interpretable semantic representations at the frame level. Specifically, we employ RAFT to estimate dense motion fields and concatenate the resulting flow maps with RGB images to form unified spatiotemporal inputs. These fused representations are fed into a GPT-4o-based module that generates semantic captions capturing object semantics and motion cues. Anomalies are detected by measuring semantic deviations from a memory bank constructed from normal captions. To further refine temporal boundaries, we design a boundary refinement module that integrates visual continuity constraints with contrastive feature learning based on a Swin Transformer backbone. Extensive experiments on four challenging benchmarks (UCSD-Ped2, Avenue, ShanghaiTech, and UCF-Crime) demonstrate that SOR-BDNet achieves frame-level accuracies of 97.96%, 82.86%, 87.36%, and 85.64%, respectively. These results highlight the robustness and scalability of the proposed framework, while significantly improving interpretability and generalization across diverse real-world surveillance scenarios. The source code and pretrained models are available at https://github.com/syi-coder/SOR-BDNet-Semantic-Optical-Representation-for-Boundary-Aware-Video-Anomaly-Detection-with-GPT-4o.
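
The memory-bank step described in the abstract can be illustrated with a minimal sketch: a frame's caption is scored by how far it deviates from the closest caption of a normal event. The abstract does not specify how captions are embedded or compared, so the bag-of-words embedding and cosine similarity below are stand-in assumptions, and the names embed, cosine, and CaptionMemoryBank are hypothetical rather than taken from the released code.

```python
import math
import re
from collections import Counter


def embed(caption):
    """Stand-in text embedding: lowercase bag-of-words counts.

    Assumption for illustration only; a real system would likely use a
    learned sentence encoder.
    """
    return Counter(re.findall(r"[a-z0-9]+", caption.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CaptionMemoryBank:
    """Stores embeddings of captions that describe normal events."""

    def __init__(self, normal_captions):
        self.entries = [embed(caption) for caption in normal_captions]

    def anomaly_score(self, caption):
        """Semantic deviation: 1 minus the similarity to the nearest normal caption."""
        query = embed(caption)
        best = max((cosine(query, entry) for entry in self.entries), default=0.0)
        return 1.0 - best


bank = CaptionMemoryBank([
    "a person walks along the sidewalk",
    "people walk on the path",
])
normal_score = bank.anomaly_score("a person walks along the sidewalk")
unusual_score = bank.anomaly_score("a truck drives through the crowd")
```

A caption already in the bank scores close to zero, while a caption about an unseen event scores higher. Turning these scores into frame-level decisions would require a threshold, which the abstract does not give.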
