Weakly-Supervised Video Anomaly Detection Method Based on Semantic Augmentation and Rule-Guided Learning
DOI: 10.12677/csa.2026.162034. Supported by research project funding.
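Authors: Wang Jinqiuyu (王津秋渝), Song Chunlin (宋春林): Department of Information and Communication Engineering, College of Electronic and Information Engineering, Tongji University, Shanghai; Xu Xuhui (徐旭辉): State Key Laboratory of Marine Geology, Tongji University, Shanghai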
Keywords: Video Anomaly Detection; Weakly-Supervised Learning; Cross-Modal Pre-Trained Models; Multi-Dimensional Semantic Prompts
Abstract: Video Anomaly Detection (VAD) aims to automatically identify abnormal events in long surveillance videos and is a key technology in applications such as intelligent security and smart transportation. Because anomalous events are rare and fine-grained annotation is costly, most existing methods adopt a weakly supervised learning paradigm. They nevertheless tend to suffer from insufficient semantic expression of anomalies, failures of cross-modal alignment, and unstable training caused by label noise. To address these challenges, this paper proposes SAGE-VAD (Semantic-Augmented & Guided Enhancement for VAD), a framework based on semantic augmentation and rule guidance. First, a Hybrid Prompt Ensemble (HPE) mechanism integrates manual templates with multi-dimensional descriptions generated by large language models (LLMs) to construct high-coverage category prototypes. Second, frame-level rule scores (Teacher Scores) are introduced as priors, and a consistency constraint suppresses noisy activations and improves keyframe selection in the selector branch. Experiments show performance gains on both the UCF-Crime and XD-Violence datasets: the method reaches a video-level AUC of 87.47% on UCF-Crime and a video-level Average Precision (AP) of 85.08% on XD-Violence. These results support the effectiveness of semantic augmentation and rule guidance in weakly supervised video anomaly detection.
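The abstract names two mechanisms without giving their formulas, so the following is only a minimal sketch of how such components might look, not the authors' implementation. It assumes that a class prototype is a weighted mean of normalized prompt embeddings, that the consistency constraint is a mean squared gap between predicted and teacher frame scores, and that keyframes are chosen from a blend of the two scores. All function names and the parameters `llm_weight` and `alpha` are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def build_prototype(template_embs, llm_embs, llm_weight=0.5):
    """Blend two groups of prompt embeddings into one unit-length prototype.

    Assumption: each group is averaged after per-prompt normalization, then
    the two means are mixed with `llm_weight` and renormalized.
    """
    def group_mean(embs):
        normed = [l2_normalize(e) for e in embs]
        dim = len(normed[0])
        return [sum(e[i] for e in normed) / len(normed) for i in range(dim)]

    t_mean = group_mean(template_embs)
    g_mean = group_mean(llm_embs)
    mixed = [(1 - llm_weight) * a + llm_weight * b
             for a, b in zip(t_mean, g_mean)]
    return l2_normalize(mixed)

def consistency_loss(pred_scores, teacher_scores):
    """Mean squared gap between predicted and teacher frame scores (assumed form)."""
    n = len(pred_scores)
    return sum((p - t) ** 2 for p, t in zip(pred_scores, teacher_scores)) / n

def select_keyframes(pred_scores, teacher_scores, k, alpha=0.5):
    """Return the indices of the k frames with the highest blended score.

    Assumption: the blend is (1 - alpha) * model score + alpha * teacher score.
    """
    blended = [(1 - alpha) * p + alpha * t
               for p, t in zip(pred_scores, teacher_scores)]
    order = sorted(range(len(blended)), key=lambda i: blended[i], reverse=True)
    return sorted(order[:k])

# Toy data: 2-D embeddings stand in for real text-encoder outputs.
prototype = build_prototype([[1.0, 0.0], [0.8, 0.6]], [[0.0, 1.0]])
loss = consistency_loss([0.9, 0.1], [1.0, 0.0])
# Frame 0 has a high model score but a low teacher score, so frame 2 is chosen.
keyframes = select_keyframes([0.9, 0.2, 0.8], [0.0, 0.1, 0.9], k=1)
```

In the toy call, the model score alone would pick frame 0, while the blended score picks frame 2. This illustrates how a rule-based prior could suppress a noisy activation under these assumptions.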
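Cite this article: Wang Jinqiuyu, Song Chunlin, Xu Xuhui. Weakly-Supervised Video Anomaly Detection Method Based on Semantic Augmentation and Rule-Guided Learning [J]. Computer Science and Application, 2026, 16(2): 1-14. https://doi.org/10.12677/csa.2026.162034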
