Visual-Based Dual-Branch Structure for Bird’s-Eye-View Object Detection
Abstract: In vision-based 3D object detection under the bird’s-eye view, the loss of fine detail during pseudo point cloud generation leads to weak detection of small and distant objects, as well as high computational cost and a large parameter count. To address these issues, this paper proposes a vision-based dual-branch 3D object detection scheme. First, a cross-stage partial network is employed for feature extraction, enhancing the model’s ability to learn multi-level features while improving computational efficiency. Second, an auxiliary feature extraction network is introduced to fully exploit multi-scale feature information, strengthening the representation of small and distant objects. Finally, an efficient feature fusion strategy integrates the pseudo point cloud features with the outputs of the auxiliary branch, effectively merging two-dimensional detail with three-dimensional spatial information. Experimental results on the nuScenes dataset show that the proposed method improves mean average precision by approximately 2% and the overall detection score by 6.5% over the BEVDet baseline. These findings indicate that the proposed framework improves 3D object detection accuracy while maintaining computational efficiency, providing robust support for the environmental perception capabilities of autonomous driving systems.
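The fusion step described above — combining pseudo point cloud BEV features with the auxiliary branch output — can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: all tensor shapes, names, and the concatenate-then-project fusion rule are assumptions, with the 1×1-convolution-style channel mixing expressed as a linear projection over the channel axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (illustrative only, not taken from the paper):
# C_pc  - channels of the pseudo point cloud BEV features
# C_aux - channels of the auxiliary 2D branch, projected to the same BEV grid
C_pc, C_aux, C_out, H, W = 64, 32, 64, 128, 128

bev_pc = rng.standard_normal((C_pc, H, W))    # pseudo point cloud branch
bev_aux = rng.standard_normal((C_aux, H, W))  # auxiliary multi-scale branch

def fuse_bev(f_pc, f_aux, w):
    """Concatenate the two branches along the channel axis, then mix
    channels with a 1x1-convolution-style projection w of shape
    (C_out, C_pc + C_aux)."""
    f = np.concatenate([f_pc, f_aux], axis=0)  # (C_pc + C_aux, H, W)
    return np.einsum("oc,chw->ohw", w, f)      # (C_out, H, W)

w = rng.standard_normal((C_out, C_pc + C_aux)) * 0.01
fused = fuse_bev(bev_pc, bev_aux, w)
print(fused.shape)  # (64, 128, 128)
```

In practice the projection would be a learned 1×1 convolution inside the detection network; the sketch only shows how the two feature maps, once aligned on a common BEV grid, can be merged per cell.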
Citation: Li, Z. and Wang, C. (2025) Visual-Based Dual-Branch Structure for Bird’s-Eye-View Object Detection. Artificial Intelligence and Robotics Research, 14(4), 842-854. https://doi.org/10.12677/airr.2025.144080

References

[1] Ma, Y., Wang, T., Bai, X., Yang, H., Hou, Y., Wang, Y., et al. (2024) Vision-Centric BEV Perception: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46, 10978-10997.
[2] Wang, Y., Guizilini, V.C., Zhang, T., et al. (2022) DETR3D: 3D Object Detection from Multi-View Images via 3D-to-2D Queries. 2022 Conference on Robot Learning, Auckland, 14-18 December 2022, 180-191.
[3] Liu, Y., Wang, T., Zhang, X. and Sun, J. (2022) PETR: Position Embedding Transformation for Multi-View 3D Object Detection. In: Lecture Notes in Computer Science, Springer, 531-548.
[4] Chen, Z., Li, Z., Zhang, S., Fang, L., Jiang, Q. and Zhao, F. (2022) Graph-DETR3D. Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, 10-14 October 2022, 5999-6008.
[5] Peng, L., Chen, Z., Fu, Z., Liang, P. and Cheng, E. (2023) BEVSegFormer: Bird’s Eye View Semantic Segmentation from Arbitrary Camera Rigs. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, 2-7 January 2023, 5924-5932.
[6] Reinauer, R., Caorsi, M. and Berkouk, N. (2021) Persformer: A Transformer Architecture for Topological Machine Learning. arXiv preprint.
[7] Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T., et al. (2025) BEVFormer: Learning Bird’s-Eye-View Representation from Lidar-Camera via Spatiotemporal Transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47, 2020-2036.
[8] Liu, Y., Yan, J., Jia, F., Li, S., Gao, A., Wang, T., et al. (2023) PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, 1-6 October 2023, 3239-3249.
[9] Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020) An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint.
[10] Hu, H., Wang, F., Su, J., et al. (2023) EA-LSS: Edge-Aware Lift-Splat-Shot Framework for 3D BEV Object Detection. arXiv preprint.
[11] Simonyan, K. and Zisserman, A. (2014) Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint.
[12] Huang, J., Huang, G., Zhu, Z., et al. (2021) BEVDet: High-Performance Multi-Camera 3D Object Detection in Bird’s-Eye-View. arXiv preprint.
[13] Daubechies, I., DeVore, R., Foucart, S., Hanin, B. and Petrova, G. (2021) Nonlinear Approximation and (Deep) ReLU Networks. Constructive Approximation, 55, 127-172.
[14] Gürbüz, Y.Z., Şener, O. and Alatan, A.A. (2023) Generalized Sum Pooling for Metric Learning. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, 1-6 October 2023, 5439-5450.
[15] Zou, Z., Chen, K., Shi, Z., Guo, Y. and Ye, J. (2023) Object Detection in 20 Years: A Survey. Proceedings of the IEEE, 111, 257-276.
[16] Hao, S., Zhou, Y. and Guo, Y. (2020) A Brief Survey on Semantic Segmentation with Deep Learning. Neurocomputing, 406, 302-321.
[17] Chen, X., Yang, C., Mo, J., Sun, Y., Karmouni, H., Jiang, Y., et al. (2024) CSPNeXt: A New Efficient Token Hybrid Backbone. Engineering Applications of Artificial Intelligence, 132, Article 107886.
[18] Lin, T., Dollar, P., Girshick, R., He, K., Hariharan, B. and Belongie, S. (2017) Feature Pyramid Networks for Object Detection. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, 21-26 July 2017, 936-944.
[19] Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., et al. (2020) nuScenes: A Multimodal Dataset for Autonomous Driving. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, 13-19 June 2020, 11618-11628.
[20] Zhang, Z., Schwing, A.G., Fidler, S. and Urtasun, R. (2015) Monocular Object Instance Segmentation and Depth Ordering with CNNs. 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, 7-13 December 2015, 2614-2622.
[21] Zhou, X., Wang, D. and Krähenbühl, P. (2019) Objects as Points. arXiv preprint.
[22] Wang, T., Zhu, X., Pang, J. and Lin, D. (2021) FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection. 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, 11-17 October 2021, 913-922.