Improving ICAFusion's Multimodal Object Detection Based on Dynamic Gate Control and Attention Masking Strategy
DOI: 10.12677/csa.2025.153060
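Authors: Gu Haoyu, Yuan Chunmiao: School of Software, Tiangong University, Tianjin; Yang Qingyong*: School of Software and Communication, Tianjin Sino-German University of Applied Sciences, Tianjin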
Keywords: Multimodal Object Detection, Cross-Modal, Feature Refinement
Abstract: Traditional convolutional feature fusion methods (e.g., CNN-based fusion) are limited by local receptive fields, so they struggle to capture long-range relationships between modalities and are sensitive to image misalignment. Transformers offer global modeling capability, but stacking them directly sharply increases computational complexity and parameter count. ICAFusion partially addresses these issues through an iterative cross-modal attention mechanism, yet two limitations remain: 1) the Cross-modal Feature Enhancement (CFE) module lacks dynamic weight adjustment, so it adapts poorly to quality differences between modalities; 2) the Iterative Cross-modal Feature Enhancement (ICFE) module has limited capacity for local feature optimization and fine-grained processing. To address these shortcomings, this paper proposes an improved multimodal feature fusion framework. In the CFE module, a dynamic gating mechanism and an attention masking strategy are added to adaptively balance the contributions of the modal features and filter out invalid information. In the ICFE module, a Fine-grained Feature Refinement Module (FRFM) is introduced, which combines local convolution, linear transformation, and a gating mechanism to refine features, improving modality complementarity and feature representation. Experimental results show that the improved model significantly increases object detection accuracy and robustness on the KAIST and FLIR datasets; on FLIR, the high-threshold metrics mAP75 and mAP50-95 improve by 2.7% and 2.4%, respectively.
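The abstract names a dynamic gate and an attention mask inside cross-modal attention but does not give their formulas. The sketch below is only an illustration of the general idea, not the authors' CFE module. Everything specific in it is an assumption made for the example: the function names, the thresholding rule used as the mask, the scalar sigmoid gate, and the parameters `threshold`, `gate_weights`, and `gate_bias`. It uses plain Python lists so it runs without any deep learning library.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def masked_attention_weights(query, keys, threshold):
    """Scaled dot-product attention weights with a simple mask.

    Assumed masking rule (hypothetical): weights below `threshold` are
    zeroed and the rest renormalised. The largest weight is always kept
    so the result never becomes all zeros.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    top = max(weights)
    kept = [w if (w >= threshold or w == top) else 0.0 for w in weights]
    total = sum(kept)
    return [w / total for w in kept]

def gated_cross_attention(query_tokens, context_tokens, gate_weights,
                          gate_bias=0.0, threshold=0.1):
    """Fuse each query token with attended context from the other modality.

    Assumed gate (hypothetical): one scalar per token,
    g = sigmoid(gate_weights . [query; attended] + gate_bias),
    output = g * attended + (1 - g) * query.
    """
    fused = []
    for q in query_tokens:
        w = masked_attention_weights(q, context_tokens, threshold)
        attended = [sum(wj * tok[i] for wj, tok in zip(w, context_tokens))
                    for i in range(len(q))]
        gate = sigmoid(sum(gw * x for gw, x in zip(gate_weights, q + attended))
                       + gate_bias)
        fused.append([gate * a + (1.0 - gate) * qi
                      for qi, a in zip(q, attended)])
    return fused

# Toy usage: two "RGB" tokens attend over three "thermal" tokens.
rgb = [[1.0, 0.0], [0.0, 1.0]]
thermal = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
fused = gated_cross_attention(rgb, thermal, gate_weights=[0.0] * 4)
```

With a strongly negative `gate_bias` the gate closes and each token passes through almost unchanged, which is one plausible way a degraded modality could be down-weighted; a learned gate would set this per input rather than by hand.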
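Citation: Gu Haoyu, Yuan Chunmiao, Yang Qingyong. Improving ICAFusion's Multimodal Object Detection Based on Dynamic Gate Control and Attention Masking Strategy[J]. Computer Science and Application, 2025, 15(3): 83-93. https://doi.org/10.12677/csa.2025.153060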
