Dynamic Question Encoding and Knowledge Contrastive Reasoning Model for Medical Visual Question Answering
Abstract: Medical Visual Question Answering (Med-VQA) aims to predict reliable and accurate answers from medical images and natural language questions. Existing methods rely primarily on analyzing medical image features; they lack in-depth modeling of question semantics and fail to account for the distinct semantic-understanding requirements of open-ended and closed-ended questions. Furthermore, medical questions are often highly ambiguous and context-dependent, and textual queries frequently lack sufficient descriptive content, so fusing image and text features alone yields insufficient cross-modal alignment. To address these issues, this paper proposes the Dynamic Question Encoding and Knowledge Contrastive Reasoning (DKCR) model. A dynamic question encoding module adaptively models questions according to their type, enriching the semantic representation of open-ended questions while avoiding redundant features for closed-ended ones. To mitigate cross-modal semantic bias, DKCR's knowledge contrastive learning jointly constrains the latent representation space along two pathways, knowledge-image and knowledge-question, promoting the consistency and complementarity of cross-modal features. In addition, external knowledge is incorporated to compensate for insufficient query semantics and enrich question representations, and a knowledge-driven co-attention mechanism enables deep interaction and fine-grained alignment among the visual, textual, and knowledge modalities. Experiments on the VQA-RAD and SLAKE datasets show that DKCR outperforms existing methods in overall performance, and ablation studies further confirm the individual value and synergistic benefit of each module.
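The abstract names its mechanisms without giving formulas. As a rough illustration of the first two ideas, the Python/PyTorch sketch below routes questions through type-specific encoders and applies a symmetric InfoNCE constraint along the knowledge-image and knowledge-question pathways. Everything here is an assumption for illustration: the names (DynamicQuestionEncoder, info_nce, knowledge_contrastive_loss), the encoder capacities, and the alpha/temperature values are not taken from the DKCR paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicQuestionEncoder(nn.Module):
    # Hypothetical sketch: give open-ended questions a deeper encoding path and
    # closed-ended questions a lightweight one, mirroring the abstract's goal of
    # richer open-ended semantics without redundant closed-ended features.
    def __init__(self, dim=768):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.open_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.closed_encoder = nn.Linear(dim, dim)

    def forward(self, q_tokens, is_open):
        # q_tokens: (batch, seq, dim) token embeddings; is_open: (batch,) bool mask
        open_out = self.open_encoder(q_tokens).mean(dim=1)
        closed_out = self.closed_encoder(q_tokens.mean(dim=1))
        return torch.where(is_open.unsqueeze(-1), open_out, closed_out)

def info_nce(a, b, temperature=0.07):
    # Symmetric InfoNCE: row i of `a` and `b` forms a positive pair; every other
    # in-batch pairing acts as a negative.
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def knowledge_contrastive_loss(k, v, q, alpha=0.5):
    # Joint constraint over the knowledge-image (k, v) and knowledge-question
    # (k, q) pathways; `alpha` is a hypothetical balancing weight.
    return alpha * info_nce(k, v) + (1 - alpha) * info_nce(k, q)

The symmetric form averages both retrieval directions, and anchoring both loss terms on the knowledge embedding k is one plausible reading of the abstract's "joint constraint"; the actual DKCR loss may differ.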
Citation: Zhang, A.B. (2026) Dynamic Question Encoding and Knowledge Contrastive Reasoning Model for Medical Visual Question Answering. Computer Science and Application, 16(4), 378-395. https://doi.org/10.12677/csa.2026.164138
