Cross-Modal Fusion Network for RGB-D Joint Semantic Segmentation and Boundary Detection
DOI: 10.12677/airr.2026.151027
Author: Li Chaojie (李超杰), School of Automobile and Transportation, Xihua University, Chengdu, Sichuan
Keywords: Semantic Segmentation, Boundary Detection, Joint Learning Network, RGB-D
Abstract: Semantic segmentation and boundary detection are two key tasks for accurate environmental perception in autonomous vehicles. However, most existing methods either treat them as independent tasks or simply stack semantic and boundary features, ignoring their intrinsic relationship. Without explicit modeling of the dependency between objects and their boundaries, such methods tend to produce blurred boundaries and category confusion in regions with similar colors. To address this, we propose a cross-modal joint perception network that introduces depth information as a geometric prior for RGB images and establishes a dynamic boundary guidance mechanism, in which boundary information and geometric structure jointly guide semantic segmentation. Specifically, the network uses a dual-branch structure to capture RGB and depth information separately, and introduces a Boundary-Guided Cross-modality Fusion Module (BGCF) that dynamically fuses RGB and depth features at different levels and establishes global dependencies between the two modalities, yielding more accurate multi-level fused features. To further capture multi-scale global information, the network adopts the Adaptive Pyramid Context Module (APC). In the decoding stage, two independent decoders are used: the semantic decoder produces the segmentation result through the BGCF module, while the boundary decoder uses a residual structure to fuse local details with global information and improve boundary detection accuracy. Experimental results show that the method achieves superior segmentation accuracy and boundary detection accuracy on the Cityscapes dataset.
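The abstract does not give the internal definition of the BGCF module, so the following is only a minimal sketch of the general idea of boundary-guided fusion of RGB and depth features. It assumes a simple per-pixel sigmoid gate derived from a boundary map. The function name `boundary_guided_fusion`, the gating formula, and the array shapes are all hypothetical and are not taken from the paper, and the sketch omits the global dependency modeling the abstract mentions.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def boundary_guided_fusion(rgb_feat, depth_feat, boundary_logits):
    """Hypothetical gated fusion; NOT the paper's BGCF definition.

    rgb_feat, depth_feat: arrays of shape (C, H, W) from the two branches.
    boundary_logits: array of shape (H, W), higher values near boundaries.
    """
    if rgb_feat.shape != depth_feat.shape:
        raise ValueError("RGB and depth features must have the same shape")
    if boundary_logits.shape != rgb_feat.shape[1:]:
        raise ValueError("boundary map must match the spatial size (H, W)")

    # Per-pixel gate in (0, 1), broadcast over the channel axis.
    gate = sigmoid(boundary_logits)[None, :, :]

    # Assumption: near boundaries (gate close to 1) the geometric cue from
    # depth is weighted more; elsewhere the RGB appearance cue dominates.
    return (1.0 - gate) * rgb_feat + gate * depth_feat


# Toy usage with made-up tensors: 2 channels on a 3x3 feature map.
rgb = np.zeros((2, 3, 3))
depth = np.ones((2, 3, 3))
neutral_boundary = np.zeros((3, 3))  # sigmoid(0) = 0.5 everywhere
fused = boundary_guided_fusion(rgb, depth, neutral_boundary)
```

With a neutral boundary map the gate is 0.5 at every pixel, so the fused features are the plain average of the two branches. A learned boundary map would instead shift the weighting pixel by pixel.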
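Cite this article: Li, C. Cross-Modal Fusion Network for RGB-D Joint Semantic Segmentation and Boundary Detection [J]. Artificial Intelligence and Robotics Research, 2026, 15(1): 277-287. https://doi.org/10.12677/airr.2026.151027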
