视频社会关系识别的多尺度图推理模型
Multi-Scale Graph Reasoning Model for Video Social Relation Recognition
DOI: 10.12677/CSA.2021.112042
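Authors: Xu Fei (许飞), Zhang Tianyu (张天雨), Shi Junbiao (史俊彪): School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, Anhui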
Keywords: Social Relation Recognition; Multi-Scale; Graph Convolution; Attention Mechanism
Abstract: Human social relation recognition, an important problem in video classification, has become an active research topic in computer vision. Because videos contain a large amount of information, much of it redundant, and only a few key frames, accurately identifying the key information for social relation reasoning is crucial. To this end, this paper proposes a multi-scale graph reasoning model for video social relation recognition. First, we extract spatio-temporal features and semantic object information from the video to obtain a rich and robust representation of social relations. Next, multi-scale graph convolution applies different receptive fields for temporal reasoning and captures the interactions between people and semantic objects. In particular, we use an attention mechanism to evaluate the effect of each semantic object in different scenes. Experimental results on the SRIV dataset show that the proposed method outperforms most state-of-the-art methods.
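The two ideas named in the abstract, multi-scale graph convolution over time and attention over semantic objects, can be illustrated with a minimal sketch. This is not the authors' implementation: the temporal window sizes, the row-normalized adjacency, the averaging across scales, and the dot-product attention are all assumptions chosen for illustration, and every function name below is hypothetical.

```python
import math


def temporal_adjacency(num_frames, window):
    """Row-normalized adjacency linking frames at most `window` steps apart.

    Self-loops are included. The window plays the role of a receptive
    field (an assumption; the abstract does not specify the graph).
    """
    adj = []
    for i in range(num_frames):
        row = [1.0 if abs(i - j) <= window else 0.0 for j in range(num_frames)]
        total = sum(row)
        adj.append([v / total for v in row])
    return adj


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def graph_conv(adj, feats, weight):
    """One graph convolution layer: ReLU(A_hat @ H @ W)."""
    return relu(matmul(matmul(adj, feats), weight))


def multi_scale_graph_conv(feats, weight, windows=(1, 2, 3)):
    """Average graph convolution outputs computed with several temporal windows."""
    outputs = [
        graph_conv(temporal_adjacency(len(feats), w), feats, weight)
        for w in windows
    ]
    n = len(outputs)
    out_dim = len(outputs[0][0])
    return [
        [sum(o[i][j] for o in outputs) / n for j in range(out_dim)]
        for i in range(len(feats))
    ]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_objects(object_feats, query):
    """Dot-product attention: weight each object feature by similarity to `query`."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in object_feats]
    weights = softmax(scores)
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, object_feats))
        for d in range(len(query))
    ]
    return weights, pooled


# Toy usage: 4 frames with 2-dimensional features and an identity weight matrix.
frame_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]
fused = multi_scale_graph_conv(frame_feats, identity)
weights, pooled = attend_objects([[1.0, 0.0], [0.0, 2.0]], query=fused[0])
```

Larger windows let each frame aggregate information from more distant frames, which is one simple way to obtain different receptive fields; the attention weights sum to one and indicate how strongly each object contributes to the pooled representation.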
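Citation: Xu, F., Zhang, T.Y. and Shi, J.B. (2021) Multi-Scale Graph Reasoning Model for Video Social Relation Recognition. Computer Science and Application, 11(2), 423-434. https://doi.org/10.12677/CSA.2021.112042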
