In recent years, video anomaly detection (VAD) has shifted from conventional appearance-based modeling to semantically driven frameworks powered by large language models (LLMs). Traditional reconstruction- and prediction-based methods rely on motion or appearance patterns learned from normal data, and therefore often misclassify previously unseen yet semantically normal events as anomalies. To address this limitation, we propose SOR-BDNet (Semantic-Optical Representation with Boundary Detection Network), an annotation-free multimodal VAD framework that jointly leverages visual appearance and motion dynamics to generate interpretable semantic representations at the frame level. Specifically, we employ RAFT to estimate dense motion fields and concatenate the resulting flow maps with the RGB images to form unified spatiotemporal inputs. These fused representations are fed into a GPT-4o-based module that generates semantic captions describing both object semantics and motion cues. Anomalies are detected by measuring the semantic deviation of each caption from a memory bank constructed from captions of normal data. To further refine the temporal boundaries of detected anomalies, we design a boundary refinement module that integrates visual continuity constraints with contrastive feature learning based on a Swin Transformer backbone. Extensive experiments on four challenging benchmarks (UCSD-Ped2, Avenue, ShanghaiTech, and UCF-Crime) show that SOR-BDNet achieves frame-level accuracies of 97.96%, 82.86%, 87.36%, and 85.64%, respectively. These results indicate that the proposed framework is robust and scalable, and that it improves interpretability and generalization across diverse real-world surveillance scenarios. The source code and pretrained models are available at https://github.com/syi-coder/SOR-BDNet-Semantic-Optical-Representation-for-Boundary-Aware-Video-Anomaly-Detection-with-GPT-4o .
Sun et al. (Fri,) studied this question.
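The memory-bank step described in the abstract (scoring a frame by how far its caption deviates from captions of normal data) can be illustrated with a minimal sketch. The abstract does not specify the text encoder or the distance measure, so everything here is an assumption for illustration: `embed` is a toy bag-of-words stand-in for a real sentence encoder, cosine similarity is an assumed deviation measure, and the captions are invented examples rather than GPT-4o output.

```python
import math
from collections import Counter


def embed(caption: str) -> Counter:
    """Toy bag-of-words embedding (assumption: stands in for a real text encoder)."""
    return Counter(caption.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def anomaly_score(caption: str, memory_bank: list[Counter]) -> float:
    """Return 1 minus the highest similarity to any normal caption, clamped to [0, 1]."""
    if not memory_bank:
        raise ValueError("memory bank must contain at least one normal caption")
    query = embed(caption)
    best = max(cosine(query, entry) for entry in memory_bank)
    return min(1.0, max(0.0, 1.0 - best))


# Hypothetical captions of normal frames; a real bank would hold model-generated captions.
normal_captions = [
    "a person walks along the sidewalk",
    "two people walk slowly across the plaza",
]
memory_bank = [embed(c) for c in normal_captions]

score_normal = anomaly_score("a person walks along the sidewalk", memory_bank)
score_odd = anomaly_score("a truck drives fast through the crowd", memory_bank)
print(f"normal caption: {score_normal:.3f}, unusual caption: {score_odd:.3f}")
```

A caption that matches an entry in the bank scores near zero, while one sharing few words with any normal caption scores higher. In practice a frame would be flagged when its score exceeds a threshold chosen on validation data; that threshold, like the encoder, is not given in the abstract.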