Multi-Stream Action Recognition Network Fusing Multi-Modal Features
DOI: 10.12677/CSA.2021.112045
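Authors: Zhang Binbin, Jiang Zhaohui, Li Junjun: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, Anhui
Keywords: Action Recognition, Attention Mechanism, Posture Sequence, 3D Convolution, Pose Network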
Abstract: Current action recognition networks are not sufficiently robust to interference, and a single feature struggles to represent actions robustly. To address these problems, this paper proposes a multi-stream action recognition network that fuses multi-modal features. First, a 3D convolutional neural network extracts appearance features from RGB video frames and motion features from optical flow frames, and an attention mechanism learns the weights of the important information. In parallel, a pose network models the spatiotemporal features of human pose sequences, compensating for what the appearance and motion features fail to express about an action. Finally, action recognition is performed by learning from all three features. Experiments on the JHMDB dataset show that the method outperforms most current state-of-the-art methods.
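The abstract does not say how the three streams are combined, so the following is only a minimal sketch of one common option, late fusion by weighted averaging of per-stream class probabilities. The stream names, the example scores, and the helper names `softmax` and `fuse_streams` are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Convert raw class scores to probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(stream_scores, weights=None):
    """Late fusion: weighted average of per-stream class probabilities.

    stream_scores maps a stream name to its list of raw class scores.
    weights maps a stream name to a non-negative weight (equal by default).
    This is an assumed fusion rule, not the one used by the authors.
    """
    names = list(stream_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}
    weight_sum = sum(weights[name] for name in names)
    num_classes = len(stream_scores[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        probs = softmax(stream_scores[name])
        for c in range(num_classes):
            fused[c] += weights[name] / weight_sum * probs[c]
    return fused


# Hypothetical raw scores for three action classes from each stream.
scores = {
    "rgb": [2.0, 1.0, 0.1],    # appearance stream
    "flow": [0.5, 2.5, 0.3],   # motion stream
    "pose": [0.2, 2.0, 0.4],   # pose stream
}
fused = fuse_streams(scores)
predicted = max(range(len(fused)), key=fused.__getitem__)
```

In this toy input the appearance stream alone favors class 0, while the motion and pose streams favor class 1, so the fused prediction follows the two agreeing streams. This illustrates why complementary modalities can correct a single misleading feature.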
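Citation: Zhang Binbin, Jiang Zhaohui, Li Junjun. Multi-Stream Action Recognition Network Fusing Multi-Modal Features [J]. Computer Science and Application, 2021, 11(2): 451-460. https://doi.org/10.12677/CSA.2021.112045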
