ESINet: Real-Time Semantic Segmentation Network Based on Enhanced Spatial Information
Abstract: To achieve real-time semantic segmentation, existing methods often use a very simple detail branch, which extracts spatial context information inadequately. In addition, conventional convolution computes a large number of redundant features, which becomes a bottleneck for inference speed. This paper proposes ESINet, a real-time semantic segmentation model with three components. First, an improved lightweight spatial optimization convolution module (SpaceOptiConv) generates a small number of feature maps and concatenates them with the intrinsic features, which reduces redundant feature-map computation and lowers both computational complexity and parameter count. Second, a multi-scale hybrid attention mechanism (HMA) adds multi-scale feature extraction to hybrid attention so that feature information at different scales is captured efficiently. Third, a composite loss function combines an Intersection over Union loss (IoULoss), which optimizes the overall overlap of segmented regions, with an online hard example mining cross-entropy loss (OhemCELoss), which focuses on class imbalance, small objects, and complex boundaries to improve local classification accuracy. For 2048 × 1024 inputs, ESINet achieves 81.6% mIoU on the Cityscapes test set at 144.5 FPS on an NVIDIA TITAN RTX. Compared with the baseline model, this is an improvement of 5.6% in mIoU and 16.4 FPS in segmentation speed.
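The composite loss described in the abstract can be illustrated with a small sketch. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the abstract gives no formulas or hyperparameters, so the soft (probability-based) IoU formulation, the OHEM probability threshold `thresh`, the `min_kept` floor, the weighting `iou_weight`, and all function names here are hypothetical choices. Pixels are represented as a flat list of per-class logits with one integer label each, and at least one pixel is assumed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ohem_ce_loss(probs, labels, thresh=0.7, min_kept=2):
    """Mean cross-entropy over hard pixels only (assumed OHEM variant).

    A pixel counts as hard when its loss exceeds -log(thresh), i.e. the
    predicted probability of the true class is below `thresh`. If fewer
    than `min_kept` pixels qualify, the `min_kept` highest-loss pixels
    are used instead.
    """
    losses = sorted(
        (-math.log(p[y]) for p, y in zip(probs, labels)), reverse=True
    )
    cutoff = -math.log(thresh)
    hard = [loss for loss in losses if loss > cutoff]
    if len(hard) < min_kept:
        hard = losses[:min_kept]
    return sum(hard) / len(hard)


def soft_iou_loss(probs, labels, num_classes, eps=1e-6):
    """1 minus the class-averaged soft IoU (assumed formulation)."""
    total = 0.0
    for c in range(num_classes):
        # Soft intersection: predicted probability mass on true-class pixels.
        inter = sum(p[c] for p, y in zip(probs, labels) if y == c)
        # Soft union: predicted mass + ground-truth count - intersection.
        pred_mass = sum(p[c] for p in probs)
        gt_count = sum(1 for y in labels if y == c)
        union = pred_mass + gt_count - inter
        total += (inter + eps) / (union + eps)
    return 1.0 - total / num_classes


def composite_loss(logits, labels, num_classes, iou_weight=1.0):
    """OHEM cross-entropy plus a weighted soft IoU term (weight is assumed)."""
    probs = [softmax(z) for z in logits]
    ce = ohem_ce_loss(probs, labels)
    iou = soft_iou_loss(probs, labels, num_classes)
    return ce + iou_weight * iou


# Toy usage: two pixels, two classes.
toy_logits = [[10.0, -10.0], [-10.0, 10.0]]
loss_correct = composite_loss(toy_logits, [0, 1], num_classes=2)
loss_wrong = composite_loss(toy_logits, [1, 0], num_classes=2)
```

In this sketch the IoU term measures region-level overlap across all pixels, while the OHEM term averages only over the hardest pixels, which mirrors the division of roles the abstract describes. A real training setup would compute both terms on batched tensors in a deep learning framework.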
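Citation: Mei Jinzhuang, Wang Chao. ESINet: Real-Time Semantic Segmentation Network Based on Enhanced Spatial Information [J]. Artificial Intelligence and Robotics Research, 2025, 14(6): 1476-1488. https://doi.org/10.12677/airr.2025.146138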
