Visual Localization and Text-Visual Interaction Attention for Medical Visual Question Localized-Answering Tasks
DOI: 10.12677/airr.2025.144085
Author: Zheng Shaoyi (郑少义), College of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou, Zhejiang
Keywords: Visual Question Localized-Answering (VQLA), Attention Mechanism
Abstract: Visual question answering (VQA) in the medical domain aims to produce accurate answers to clinical questions about medical images. Existing medical VQA systems have made considerable progress, but in clinical surgery it is also essential to identify the location of the surgical region accurately. Introducing Visual Question Localized-Answering (VQLA) into surgical scenarios can therefore better assist clinicians with operations that demand precise localization. However, existing VQLA methods mostly rely on simple attention mechanisms to fuse modalities and lack deep interaction among features within a single modality and across modalities, which leads to inaccurate localization of answer regions and insufficient understanding of the question. To address this, we propose a Visual Localization and Text-Visual Interaction (VLTVI) Attention that models visual features more comprehensively along the channel and spatial dimensions in order to localize answer regions accurately. In addition, a hierarchical text-visual interaction attention is introduced to deepen the model's understanding of question semantics and strengthen its reasoning about answers. Extensive experiments on two public VQLA datasets built from the MICCAI EndoVis-2017 and EndoVis-2018 surgical videos show that the proposed method achieves new state-of-the-art performance on medical VQLA tasks. Detailed ablation studies and visualizations are also provided to validate the key attention modules.
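Cite this article: Zheng Shaoyi. Visual Localization and Text-Visual Interaction Attention for Medical Visual Question Localized-Answering Tasks [J]. Artificial Intelligence and Robotics Research, 2025, 14(4): 893-905. https://doi.org/10.12677/airr.2025.144085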
