Method and Application of Visual Relationship Detection Based on Deep Learning
DOI: 10.12677/JISP.2022.113016
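Authors: Jingjing Tang (汤婧婧), Aiye Shi (石爱业), Lili Zhang (张丽丽), Lizhong Xu (徐立中): College of Computer and Information, Hohai University, Nanjing, Jiangsu; Jing Huang (黄晶): Business School, Hohai University, Nanjing, Jiangsu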
Keywords: Computer Vision, Deep Learning, Neural Networks, Visual Relationship Detection
Abstract: With the continued development and wide application of deep learning, many areas of computer vision have made substantial progress, for example in image classification, object detection, and image segmentation. Visual relationship detection (VRD) is an important computer vision task that aims to recognize the relationships or interactions between objects in an image. It matters for understanding both images and the visual world, and it is a key step in applied computer vision research. Compared with general object detection, VRD must predict not only the category and trajectory of each object but also the relationships between objects. Researchers have proposed many methods for this task, with notable advances in recent years from deep learning based on deep neural networks. This survey introduces the VRD task, basic deep learning methods, traditional VRD methods, a categorization of deep learning models and frameworks for VRD, and applications of VRD in computer vision.
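Cite this article: Tang, J., Huang, J., Shi, A., Zhang, L. and Xu, L. (2022) Method and Application of Visual Relationship Detection Based on Deep Learning. Journal of Image and Signal Processing (图像与信号处理), 11(3), 144-161. https://doi.org/10.12677/JISP.2022.113016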
