RTRFNet: A Robust Fusion Network for RGB-T Segmentation—A Study of Robustness to Missing Sensor Modalities
Abstract: RGB-Thermal (RGB-T) semantic segmentation is crucial for robotic systems operating in challenging environments such as low-light or dark conditions. However, conventional multimodal fusion methods tightly couple the features of the two modalities, so a model degrades severely when a sensor signal goes missing in real-world deployment. To address this, this paper proposes RTRFNet, a robust multimodal network built on a teacher-student learning mechanism that removes the hard dependence on dual-source input at inference time. During training, a lightweight Channel Attention Feature Fusion Module (CA-FFM) aggregates complementary cross-modal cues into a complete joint semantic representation for a central lightweight perception head (the teacher branch). A Multimodal Knowledge Distillation (MKD) strategy then uses the high-quality soft distributions output by the teacher branch to implicitly supervise two fully independent single-stream networks, one RGB and one thermal (the student branches), so that each internalizes rich cross-modal contextual knowledge. This joint training scheme makes inference highly flexible: with both modalities available, the teacher network is removed and the system performs parameter-efficient decision-level mean fusion; when one sensor fails, only the surviving branch is activated for accurate unimodal inference. Extensive experiments on mainstream benchmark datasets show that RTRFNet matches state-of-the-art accuracy under full modalities while exhibiting exceptional robustness and lightweight deployment advantages under missing-modality conditions.
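The abstract describes CA-FFM only as a lightweight channel-attention fusion module, so the sketch below is a minimal PyTorch illustration of the idea rather than the paper's implementation: a squeeze-and-excitation-style gate reweights the concatenated RGB and thermal channels before a 1x1 projection produces the joint representation consumed by the teacher branch. The class name ChannelAttentionFusion and all internal choices (reduction ratio, gate layout) are assumptions.

import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Hypothetical stand-in for the paper's CA-FFM: an SE-style channel
    gate over the concatenated RGB and thermal feature maps, followed by
    a 1x1 projection back to a single joint representation."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                               # squeeze: one value per channel
            nn.Conv2d(2 * channels, 2 * channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels // reduction, 2 * channels, 1),
            nn.Sigmoid(),                                          # excitation: weights in (0, 1)
        )
        self.proj = nn.Conv2d(2 * channels, channels, 1)           # fuse back to `channels`

    def forward(self, f_rgb: torch.Tensor, f_th: torch.Tensor) -> torch.Tensor:
        x = torch.cat([f_rgb, f_th], dim=1)   # stack the two modalities channel-wise
        x = x * self.gate(x)                  # reweight channels by learned attention
        return self.proj(x)                   # joint semantic representation for the teacher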
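Likewise, the MKD objective and the inference-time behavior can be made concrete with a standard temperature-softened distillation loss and a decision-level mean fusion with a unimodal fallback. The abstract specifies only that the teacher's soft distributions supervise the two student branches and that inference either averages the students' decisions or runs the surviving branch alone; the Hinton-style KL form, the temperature value, and the function names below are assumptions.

import torch
import torch.nn.functional as F

def mkd_loss(student_logits: torch.Tensor,
             teacher_logits: torch.Tensor,
             t: float = 4.0) -> torch.Tensor:
    """Per-pixel KL divergence from the teacher's softened distribution to
    one student branch; applied once per student during joint training."""
    log_q = F.log_softmax(student_logits / t, dim=1)   # student log-probs per pixel
    p = F.softmax(teacher_logits.detach() / t, dim=1)  # teacher soft targets (no gradient)
    kl = (p * (p.clamp_min(1e-8).log() - log_q)).sum(dim=1)  # KL over the class dimension
    return kl.mean() * t * t  # t^2 keeps gradient scale comparable across temperatures

@torch.no_grad()
def segment(rgb_student, thermal_student, rgb=None, thermal=None):
    """Decision-level mean fusion with a single-sensor fallback: average
    per-pixel class logits when both inputs are alive, otherwise run only
    the surviving branch."""
    logits = [net(x) for net, x in ((rgb_student, rgb), (thermal_student, thermal))
              if x is not None]
    assert logits, "at least one modality must be provided"
    return torch.stack(logits).mean(dim=0).argmax(dim=1)  # per-pixel class labels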
Citation: Tan, K., Hong, Z. and Xiong, L. (2026) RTRFNet: A Robust Fusion Network for RGB-T Segmentation—A Study of Robustness to Missing Sensor Modalities. Artificial Intelligence and Robotics Research, 15(2), 638-650. https://doi.org/10.12677/airr.2026.152061
