Transformer-Based Human Interaction Detection: A Survey
DOI: 10.12677/csa.2024.148175. Supported by research project funding.
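Authors: Guan Yinfan (管尹凡), Nuerguli Aizimuba (努尔古丽·艾子木把), Wang Huiling (王慧玲)*: School of Network Security and Information Technology, Yili Normal University, Yining, Xinjiang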
Keywords: Human Interaction Detection, Transformer, Deep Learning, Object Detection
Abstract: Human-object interaction (HOI) detection aims to localize the humans and objects in an image and to classify the interactions between them. Practical HOI detection systems perform human-centric scene understanding and therefore have considerable potential impact on applications such as surveillance event detection and robot imitation learning. Following the recent success of Transformer networks in object detection, Transformer-based HOI detection methods have been actively developed and have driven much of the recent progress in the field. These methods use the Transformer's self-attention mechanism to extract contextual semantic information and to learn embeddings that represent HOI instances, and they have become a new trend in HOI detection. This paper reviews recent progress and groups existing methods into four categories: early end-to-end models, models that use DETR variants and improved backbone networks, language-image pre-trained models, and DETR-based two-stage models. It describes the current state of Transformer-based HOI detection, analyzes the strengths and weaknesses of each category, traces how the methods have developed, and concludes with an outlook on future research directions.
Cite this article: Guan Yinfan, Nuerguli Aizimuba, Wang Huiling. Transformer-Based Human Interaction Detection: A Survey [J]. Computer Science and Application, 2024, 14(8): 179-193. https://doi.org/10.12677/csa.2024.148175
