Cross Domain Action Recognition Based on Deep Dual Auto-Encoder Network
DOI: 10.12677/SEA.2022.115092
Author: Mi Haokun, School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai
Keywords: Dual Auto-Encoder, Deep Network, Transfer Learning, Human Action Recognition
Abstract: To address the need for large amounts of labeled video data when training action recognition models, this work uses still action images to reduce the number of training videos required and to improve video action recognition performance. A Deep Dual Auto-Encoder Network (DDAEN) algorithm is proposed. First, keyframes are extracted from the videos to serve as the bridge for transferring knowledge from images to videos. Then, based on the action categories of the images, keyframes, and videos, cross-modal and self-modal similarity matrices are defined, and a deep network reduces the modality gap between images and videos, yielding domain-invariant features for all three. Finally, the domain-invariant features and semantic features are fed into a pair of auto-encoders whose middle hidden layers produce aligned features; these aligned features are concatenated (fused in series) to train an SVM classifier for video action recognition. Experiments on the ASD→UCF101 and Stanford40→UCF101 datasets show that, when labeled video samples are scarce, DDAEN improves the recognition rate by a relative 25.44% and 20.82% over the iDTs+SVM video-only baseline that uses no images, confirming that DDAEN can exploit images to raise video action recognition accuracy.
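Two of the steps above can be sketched concretely: defining similarity matrices from action-class labels, and fusing aligned features in series before SVM training. This is a minimal illustration, not the paper's implementation: it assumes the common binary class-agreement definition of cross-modal similarity, and the `similarity_matrix` helper, the toy labels, and the feature dimensions are all hypothetical.

```python
def similarity_matrix(labels_a, labels_b):
    """Binary similarity: entry (i, j) is 1.0 when sample i and sample j
    share the same action class, else 0.0.
    Cross-modal: labels_a = image labels, labels_b = keyframe/video labels.
    Self-modal: labels_a and labels_b are the same label list."""
    return [[1.0 if a == b else 0.0 for b in labels_b] for a in labels_a]

# Toy action-class labels: 3 images and 4 video keyframes over 2 classes.
image_labels = [0, 1, 1]
video_labels = [0, 0, 1, 1]

S_cross = similarity_matrix(image_labels, video_labels)  # 3 x 4 cross-modal
S_self = similarity_matrix(video_labels, video_labels)   # 4 x 4 self-modal

# After the auto-encoder pair maps visual and semantic inputs into the
# shared hidden space, the aligned features are fused in series
# (concatenated per sample) before training the SVM classifier.
aligned_visual = [[0.2, 0.5], [0.1, 0.9], [0.7, 0.3], [0.4, 0.4]]  # 4 x 2 (toy)
aligned_semantic = [[1.0], [0.0], [1.0], [0.0]]                    # 4 x 1 (toy)
fused = [v + s for v, s in zip(aligned_visual, aligned_semantic)]  # 4 x 3
```

In the paper the fused features train an SVM classifier (e.g. via LIBSVM); the concatenation above shows only the serial-fusion step that produces the SVM's input.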
Citation: Mi, H.-K. (2022) Cross Domain Action Recognition Based on Deep Dual Auto-Encoder Network. Software Engineering and Applications, 11(5), 893-904. https://doi.org/10.12677/SEA.2022.115092

[28] Zhang, J., Li, W.-Q. and Ogunbona, P. (2017) Joint Geometrical and Statistical Alignment for Visual Domain Adaptation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 21-26 July 2017, 1859-1867. [Google Scholar] [CrossRef