Video Action Recognition Based on Spatiotemporal Sampling

DOI: 10.12677/airr.2024.132032
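Authors: Wang Guan (王冠), Peng Menghao (彭梦昊), Tao Yingcheng (陶应诚), Xu Hao (徐浩), Jing Shengen (景圣恩): School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, Anhui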
Keywords: Video Action Recognition; Spatio-Temporal Sampling; Video Transformer

Abstract: Video features carry temporally and spatially redundant information about how an action is performed. This information is unrelated to the action category, so it interferes with recognition and leads to misclassification. This paper proposes a video action recognition model based on spatiotemporal sampling, consisting of key frame sampling and a video Transformer with Token sampling. Key frame sampling quantifies the pixel differences between adjacent frames to identify key frames that contain significant changes. It accumulates the update probabilities of multiple consecutive frames to handle potentially long intervals between two key frames, and it introduces a trainable sampling probability threshold that binarizes the update probability, which strengthens the modeling of key frames. This process preserves the key information in the video. Because different Tokens contribute differently to the recognition task, the spatiotemporal Transformer blocks use a data-dependent Token sampling strategy that reduces the number of Tokens layer by layer, which cuts spatial redundancy and lowers the computational cost. A fully connected layer then produces the final action prediction. Experiments on the ActivityNet-v1.3 and Mini-Kinetics datasets show that the proposed method matches the accuracy of existing action recognition methods while requiring less computation.
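The two sampling steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It makes several assumptions: frames are flattened lists of pixel intensities in [0, 1], the mean absolute difference between adjacent frames stands in for the update probability, the threshold is a fixed constant rather than a trained parameter, and Token sampling is simplified to deterministic top-k selection over importance scores that are taken as given. All names (frame_difference, sample_key_frames, sample_tokens, threshold, keep_ratio) are hypothetical.

```python
import math
from typing import List, Sequence

Frame = Sequence[float]  # assumption: a frame flattened to intensities in [0, 1]


def frame_difference(prev: Frame, curr: Frame) -> float:
    """Mean absolute pixel difference between two adjacent frames."""
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)


def sample_key_frames(frames: List[Frame], threshold: float = 0.25) -> List[int]:
    """Return the indices of the frames kept as key frames.

    The per-step difference is treated as an update probability and is
    accumulated over consecutive frames, so a slow drift eventually yields
    a key frame even when no single step reaches the threshold.
    """
    if not frames:
        return []
    keep = [0]           # assumption: the first frame is always kept
    accumulated = 0.0
    for t in range(1, len(frames)):
        accumulated += frame_difference(frames[t - 1], frames[t])
        if accumulated >= threshold:  # binarize: sample this frame
            keep.append(t)
            accumulated = 0.0         # restart accumulation after a key frame
    return keep


def sample_tokens(scores: List[float], keep_ratio: float) -> List[int]:
    """Keep the highest-scoring Tokens, returned in their original order.

    `scores` are hypothetical per-Token importance values; top-k selection
    is a stand-in for the paper's data-dependent sampling.
    """
    k = max(1, math.ceil(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Usage: an abrupt change at frame 3 and a slow drift over several frames.
abrupt = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [0.9, 0.9], [0.9, 0.9]]
drift = [[0.0], [0.1], [0.2], [0.3], [0.4]]
abrupt_keys = sample_key_frames(abrupt)
drift_keys = sample_key_frames(drift)
kept_tokens = sample_tokens([0.1, 0.9, 0.3, 0.7], keep_ratio=0.5)
```

Calling sample_tokens once per Transformer stage, each time on the Tokens that survived the previous stage, would give the layer-by-layer reduction the abstract describes. In a trained model the threshold and the importance scores would be learned rather than fixed.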

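Cite this article: Wang Guan, Peng Menghao, Tao Yingcheng, Xu Hao, Jing Shengen. Video Action Recognition Based on Spatiotemporal Sampling [J]. Artificial Intelligence and Robotics Research, 2024, 13(2): 300-312. https://doi.org/10.12677/airr.2024.132032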

