Hand Skeleton Action Recognition Based on Local Feature Shift Network
Abstract: Visual instability and complex environments make actions captured from a first-person view difficult to recognize accurately. This paper proposes an action recognition framework based on a local feature shift network. The framework first builds an undirected spatio-temporal graph topology of the hand skeleton and uses an adaptive graph convolutional network to extract joint features and connection information from that graph. Next, to obtain global spatial information, it uses a ResNet152 network to extract RGB features. The hand skeleton features and the RGB image features are then fed separately into the proposed local feature shift convolutional network, which improves the model's generalization through mutual learning between samples. Experiments on the FPHA dataset show that the framework recognizes actions accurately, indicating that the model copes effectively with video background interference and is robust.
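The first step of the framework, building an undirected spatio-temporal graph over the hand skeleton, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a 21-joint hand (one wrist joint plus five four-joint finger chains) with a hypothetical joint ordering, connects joints along bones within each frame, and links each joint to itself in the next frame. The function names `hand_bones`, `spatio_temporal_edges`, and `adjacency` are illustrative only.

```python
# Assumption: 21 joints per hand. Joint 0 is the wrist and each of the five
# fingers is a chain of 4 joints (1-4, 5-8, ..., 17-20). The joint ordering
# actually used in the paper is not stated in the abstract.
NUM_JOINTS = 21


def hand_bones():
    """Return the undirected bone list (pairs of joint indices) for one frame."""
    bones = []
    for finger in range(5):
        base = 1 + 4 * finger
        bones.append((0, base))           # wrist to the first joint of the finger
        for j in range(base, base + 3):
            bones.append((j, j + 1))      # consecutive joints along the finger
    return bones


def spatio_temporal_edges(num_frames):
    """Build the undirected edge set over num_frames * NUM_JOINTS nodes.

    Node (t, j) is flattened to the index t * NUM_JOINTS + j.
    """
    edges = set()
    for t in range(num_frames):
        offset = t * NUM_JOINTS
        for a, b in hand_bones():
            edges.add((offset + a, offset + b))  # spatial edge inside frame t
        if t + 1 < num_frames:
            for j in range(NUM_JOINTS):
                # temporal edge: same joint in frames t and t + 1
                edges.add((offset + j, offset + NUM_JOINTS + j))
    return edges


def adjacency(num_frames):
    """Return a symmetric 0/1 adjacency matrix (list of lists) with self-loops."""
    n = num_frames * NUM_JOINTS
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1                     # self-loop keeps each joint's own feature
    for a, b in spatio_temporal_edges(num_frames):
        adj[a][b] = 1
        adj[b][a] = 1                     # undirected: mirror every edge
    return adj
```

A graph convolution layer would typically normalize such a matrix and multiply it with the per-joint feature matrix. An adaptive variant, as named in the abstract, would additionally learn corrections to the fixed adjacency, which this sketch does not cover.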
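Cite this article: Tian Wenhao (田文浩), Chen Junhong (陈俊洪), Zhong Jingmou (钟经谋), Liu Wenyin (刘文印). Hand Skeleton Action Recognition Based on Local Feature Shift Network [J]. Computer Science and Application (计算机科学与应用), 2022, 12(8): 1877-1886. https://doi.org/10.12677/CSA.2022.128188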
