Research on Transductive Zero-Shot Learning for Action Recognition
DOI: 10.12677/CSA.2022.123050
Author: Qiuping Qi (齐秋平), Department of Computer Science and Technology, Tongji University, Shanghai
Keywords: Zero-Shot Learning; Transductive Learning; Video Action Recognition
Abstract: With the rapid development of the Internet of Things and smart devices, online information has gradually shifted from massive text data to more intuitive image and video data. Abundant video data brings many conveniences, but understanding and classifying its content also poses new challenges. Existing deep learning methods rely heavily on large numbers of labeled samples, and the knowledge they learn does not extend to new categories. Zero-shot learning, a special case of transfer learning, has attracted considerable attention because it can extend from seen categories to unseen ones. This paper proposes an action recognition method based on transductive zero-shot learning. Visual information is first mapped into the semantic space, and recognition is then performed by nearest-neighbor search in that space. A loss function with a bias term is introduced to improve recognition accuracy while alleviating the strong bias problem. On the UCF101, HMDB51, and Olympic Sports datasets, the model achieves recognition accuracies of 26.8%, 20.3%, and 46.5%, respectively, which supports the effectiveness of the method.
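The embed-then-search step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the linear projection `W`, the two-dimensional toy class prototypes, the helper names, and the use of cosine similarity are all assumptions made for illustration, and the loss function with a bias term is not reproduced because the abstract does not specify it.

```python
import math
from typing import Dict, List, Sequence


def project(W: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Map a visual feature x into the semantic space as W @ x (assumed linear map)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict(
    W: Sequence[Sequence[float]],
    x: Sequence[float],
    prototypes: Dict[str, Sequence[float]],
) -> str:
    """Return the class whose semantic prototype is nearest to the projected feature."""
    z = project(W, x)
    return max(prototypes, key=lambda name: cosine(z, prototypes[name]))


# Hypothetical toy data: 3-D visual features, 2-D semantic space.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
prototypes = {"archery": [1.0, 0.0], "surfing": [0.0, 1.0]}

label = predict(W, [0.9, 0.1, 0.2], prototypes)
```

In a zero-shot setting the `prototypes` dictionary would hold semantic vectors (for example, word embeddings of class names) for the unseen categories only, so a video can be labeled with a class that contributed no labeled training samples.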
Cite this article: Qi, Q. (2022) Research on Transductive Zero-Shot Learning for Action Recognition. Computer Science and Application, 12(3), 499-507. https://doi.org/10.12677/CSA.2022.123050