End-to-End Video Gaze Target Detection with Spatial-Temporal Transformers
Abstract: Gaze target detection aims to locate the target of a person's gaze. HGTTR introduced the Transformer architecture to this task, removing the need for the additional head detector that convolutional neural networks require. It detects head positions and gaze targets simultaneously in an end-to-end manner and outperforms traditional convolutional neural networks. However, current methods still leave considerable room for improvement on video datasets. They learn the gaze target within a single video frame and do not model temporal changes across the video, so they cannot handle dynamic gaze, out-of-focus frames, or motion blur. When a person's gaze target changes continuously, the lack of temporal modeling can cause the predicted gaze target to deviate from the true one. For the same reason, the model cannot compensate for features lost to defocus and motion blur. In this work, we propose an end-to-end video gaze target detection model based on spatial-temporal Transformers. First, we propose an inter-frame local deformable attention mechanism to handle missing features. Second, building on deformable attention, we propose an inter-frame deformable attention mechanism that uses the temporal differences between adjacent video frames to select sampling points dynamically, thereby modeling dynamic gaze. Finally, we propose a temporal Transformer that aggregates the gaze relation query vectors and gaze relation features of the current frame and the reference frames. The temporal Transformer consists of three parts: a temporal gaze relation feature encoder that encodes multi-frame spatial information, a temporal gaze relation query encoder that fuses gaze relation queries, and a temporal gaze relation decoder that produces the detection results for the current frame. By modeling three levels (the space within a single frame, the relation between adjacent frames, and the frame sequence), the method addresses dynamic gaze, defocus, and motion blur, which are common in video data. Extensive experiments show that our method achieves strong performance on both the VideoAttentionTarget and VideoCoAtt datasets.
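The abstract does not specify how the temporal gaze relation query encoder fuses queries, so the following is only a minimal sketch of one plausible reading: each current-frame query attends to the queries of the reference frames through plain scaled dot-product cross-attention, followed by a residual connection. The function names `softmax` and `fuse_queries`, the toy vectors, and the absence of learned projections are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_queries(current_queries, reference_queries):
    """Fuse each current-frame query with all reference-frame queries.

    Hypothetical helper: scaled dot-product cross-attention plus a
    residual connection, with no learned projection matrices.
    """
    dim = len(current_queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in current_queries:
        # Similarity between this query and every reference-frame query.
        scores = [
            scale * sum(a * b for a, b in zip(query, ref))
            for ref in reference_queries
        ]
        weights = softmax(scores)
        # Attention-weighted average of the reference-frame queries.
        context = [
            sum(w * ref[d] for w, ref in zip(weights, reference_queries))
            for d in range(dim)
        ]
        # Residual connection keeps the current-frame information.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


# Toy example: two current-frame queries, three reference-frame queries
# (assumed to be flattened across all reference frames).
current = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
fused = fuse_queries(current, reference)
```

In this sketch a reference query that resembles the current query receives a larger weight, which is the basic way information from neighboring frames could reinforce a query whose own frame is blurred or out of focus. A real model would add learned projections, multiple heads, and normalization layers.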
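Cite this article: Peng, M.H., Wang, G., Xu, H. and Jing, S.E. (2024) End-to-End Video Gaze Target Detection with Spatial-Temporal Transformers. Journal of Image and Signal Processing, 13(2), 190-209. https://doi.org/10.12677/jisp.2024.132017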
