Violence Video Recognition Based on Two-Stage Collaborative Data Augmentation
DOI: 10.12677/jisp.2026.151008. Supported by research project funding.
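Authors: Chenxi Wen (文晨曦), Shanzheng Yang (杨善铮): School of Mathematics and Computer Science, Gannan Normal University, Ganzhou, Jiangxi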
Keywords: Violence Video Recognition, Data Augmentation, Two-Stage Collaborative Data Augmentation
Abstract: Violence video recognition is a crucial technology in modern public security, and well-designed data augmentation can improve its recognition accuracy. Existing data augmentation methods struggle to fully cover violence information in both the temporal and spatial domains. To address this, a Two-stage Collaborative Data Augmentation Network (TCDANet) is proposed. In the training phase, a Spatiotemporal Random Crop (STRCrop) strategy generates diverse backgrounds and action representations that contain spatiotemporal violence information, making the model more robust when learning spatiotemporal violence features. In the testing phase, a Cross Area Crop (CACrop) strategy expands the cropping field of view and improves coverage of the regions that carry violence features. Extensive experiments on the VSD2015 dataset show that TCDANet, using only the visual modality, outperforms advanced methods and achieves leading performance. Through this two-stage collaborative augmentation mechanism, the study offers a new data augmentation approach for violence video recognition.
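The abstract names the two cropping strategies but does not give their exact procedures. The following is only a minimal sketch of one plausible reading, not the authors' implementation: the training-time crop is assumed to pick a random temporal window and a random spatial window, and the test-time crop is assumed to take five views arranged as a cross (center, top, bottom, left, right). The function names `st_random_crop` and `cross_area_crops`, the `(T, H, W, C)` array layout, and the five-view layout are all illustrative assumptions.

```python
import numpy as np


def st_random_crop(video, num_frames, crop_size, rng):
    """Training-time sketch (assumed behaviour, not the paper's exact STRCrop).

    video: array of shape (T, H, W, C) with T >= num_frames and
    H, W >= crop_size. Returns a clip of shape
    (num_frames, crop_size, crop_size, C) taken from a random temporal
    window and a random spatial window.
    """
    t, h, w, _ = video.shape
    t0 = rng.integers(0, t - num_frames + 1)   # random start frame
    y0 = rng.integers(0, h - crop_size + 1)    # random top edge
    x0 = rng.integers(0, w - crop_size + 1)    # random left edge
    return video[t0:t0 + num_frames, y0:y0 + crop_size, x0:x0 + crop_size]


def cross_area_crops(video, crop_size):
    """Test-time sketch (assumed behaviour, not the paper's exact CACrop).

    Returns five full-length views laid out as a cross: center, top,
    bottom, left, right. Output shape: (5, T, crop_size, crop_size, C).
    """
    _, h, w, _ = video.shape
    cy = (h - crop_size) // 2   # top edge of the vertically centered crop
    cx = (w - crop_size) // 2   # left edge of the horizontally centered crop
    offsets = [
        (cy, cx),               # center
        (0, cx),                # top
        (h - crop_size, cx),    # bottom
        (cy, 0),                # left
        (cy, w - crop_size),    # right
    ]
    return np.stack(
        [video[:, y:y + crop_size, x:x + crop_size] for y, x in offsets]
    )
```

Under these assumptions, a 32-frame video of size 128 x 171 cropped with `num_frames=16` and `crop_size=112` yields one training clip of shape `(16, 112, 112, 3)` and five test views of shape `(32, 112, 112, 3)` each. How the predictions from the test views are combined (for example, by averaging scores) is not stated in the abstract and is left open here.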
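Cite this article: Wen, C. and Yang, S. (2026) Violence Video Recognition Based on Two-Stage Collaborative Data Augmentation. Journal of Image and Signal Processing, 15(1), 89-101. https://doi.org/10.12677/jisp.2026.151008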
