Gaze Target Detection Based on Predictive Consistency Embedding

DOI: 10.12677/JISP.2023.122015
Authors: 史俊彪, 骆文杰, 熊思璇, 单东风, 江朝晖, 韩超 (School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, Anhui)
Keywords: Gaze Target Detection; Gaze Following; Domain Adaptation; RGB Image; Depth Image

Abstract: This paper studies gaze target detection in images taken from a third-person perspective. We propose a deep architecture that infers where a person in the scene is looking. The model is trained on scene images, depth images, and head images, all of which carry rich contextual information. Unlike existing methods, our model does not require gaze-angle supervision and does not rely on head orientation or eye information. Extensive experiments show that our method achieves stronger performance on multiple benchmark datasets. We also study domain adaptation for gaze target detection, using consistency embedding to align the source and target domains so that the model can effectively handle the gap between datasets.

Cite this article: 史俊彪, 骆文杰, 熊思璇, 单东风, 江朝晖, 韩超. Gaze Target Detection Based on Predictive Consistency Embedding [J]. Journal of Image and Signal Processing, 2023, 12(2): 144-157. https://doi.org/10.12677/JISP.2023.122015

