Skin Lesion Image Segmentation Enhancement Method Based on Cross-Modal Attention
DOI: 10.12677/mos.2025.144355
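Authors: Fanjun Su*, Xiangchen Meng: School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai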
Keywords: Medical Image Segmentation, Transformer, U-Net, Cross-Modal Learning
Abstract: Accurate skin lesion segmentation is crucial for precise clinical diagnosis and treatment. However, existing methods rely mainly on single-modal image data and struggle with the morphological diversity of skin lesions and the semantic ambiguity of complex cases. To address this, this study proposes a Cross-Modal Attention-Guided Skin Lesion Segmentation Network (CMG-Net), which overcomes the performance bottleneck of conventional methods through cross-modal information fusion. The network builds a cross-modal data collaboration mechanism that integrates clinical text descriptions (semantic information such as lesion color and boundary characteristics) with visual features. It also includes a Transformer-based Cross-Modal Feature Fusion Module (CMFM), which uses a dual-stream cross-attention mechanism to align visual and semantic features and let them interact in a complementary way. Within this module, the text branch uses a pre-trained language model to extract deep semantic representations, while the visual branch uses a dynamic parameter-sharing strategy to extract modality-specific features. Experiments on the public skin imaging dataset ISIC2017 show that CMG-Net significantly outperforms existing single-modal methods on complex cases, particularly in distinguishing morphologically similar lesions, where IoU and the Dice coefficient improve by 4.2% and 4.3%, respectively.
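The abstract reports its gains in IoU and the Dice coefficient. For readers unfamiliar with these overlap metrics, the following is a minimal sketch of their standard definitions on binary masks. It is not the authors' evaluation code: the function names (`overlap_counts`, `iou`, `dice`) and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary mask: 1 = lesion pixel, 0 = background


def overlap_counts(pred: Mask, target: Mask) -> Tuple[int, int, int]:
    """Return (intersection, predicted area, target area) in pixels."""
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")
    inter = pred_area = target_area = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            pred_area += 1 if p else 0
            target_area += 1 if t else 0
    return inter, pred_area, target_area


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union: |P ∩ T| / |P ∪ T|."""
    inter, pred_area, target_area = overlap_counts(pred, target)
    union = pred_area + target_area - inter
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def dice(pred: Mask, target: Mask) -> float:
    """Dice coefficient: 2|P ∩ T| / (|P| + |T|)."""
    inter, pred_area, target_area = overlap_counts(pred, target)
    total = pred_area + target_area
    # Assumption: two empty masks count as a perfect match.
    return 2 * inter / total if total else 1.0


# Toy 2x3 masks (hypothetical data, not taken from ISIC2017).
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
print(iou(pred, target))   # 2 shared pixels / 4 in the union
print(dice(pred, target))  # 2 * 2 shared pixels / (3 + 3)
```

For any pair of masks the two metrics are related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU.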
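Cite this article: Su, F. and Meng, X. (2025) Skin Lesion Image Segmentation Enhancement Method Based on Cross-Modal Attention. Modeling and Simulation, 14(4), 1072-1084. https://doi.org/10.12677/mos.2025.144355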
