FM-VXNet: A Study on an Improved Multimodal 3D Object Detection Algorithm Based on MVX-Net
Abstract: Multimodal 3D object detection fuses information from different modalities to overcome the limitations of any single modality, and it has shown significant value in fields such as autonomous driving and robot navigation. Existing methods still have three shortcomings: the image branch has limited capacity for modeling global semantics; cross-modal fusion mostly relies on simple feature concatenation and fails to fully exploit the complementary potential between modalities; and point cloud voxel features are easily disturbed by noise or redundant information when the point density is unevenly distributed. To address these problems, this paper proposes the Frequency-domain Multimodal Voxel Network (FM-VXNet). The model builds on the Multimodal Voxel Network (MVX-Net) and comprises three core modules: (1) the Frequency and Spatial Fusion Module (FFCM), introduced in the image branch, which uses the Fast Fourier Transform (FFT) to strengthen global semantic perception; (2) the Bidirectional Cross-Modal Gated Attention (Bi-CMGA) module, which performs bidirectional interactive fusion between image and point cloud features and adds a channel-wise gating mechanism to suppress noise and improve the discriminative power of the fused features; and (3) the Bimodal Density-aware Attention (BiDA) module, applied in the voxel feature encoding stage, which uses density awareness and channel recalibration to mitigate noise in sparse voxels and redundancy in dense voxels. Experiments on the KITTI dataset show that, in the bird's-eye-view (BEV) detection task, FM-VXNet reaches a mean Average Precision (mAP) of 96.3%, 95.2%, and 92.9% under the easy, moderate, and hard settings, respectively. In the 3D detection task, it reaches an mAP of 96.2%, 88.9%, and 87.7% under the same three settings, an average improvement of 5.7% to 8.2% over mainstream algorithms such as BEVFusion and MVX-Net. By introducing frequency-domain analysis, bidirectional gated attention, and density-aware mechanisms, this work offers a new direction for research on multimodal 3D object detection.
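The abstract does not describe the internals of FFCM, so the following is only a minimal sketch of the general principle it appeals to: filtering a feature map in the frequency domain lets every output position depend on every input position, which is why an FFT-based branch can capture global context. The function names `frequency_branch` and `fuse_frequency_spatial`, the blending weight `alpha`, and the random filter weights are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def frequency_branch(feat, freq_weight):
    """Filter a real (C, H, W) feature map in the frequency domain.

    freq_weight is a complex array of shape (C, H, W // 2 + 1); in a real
    network it would be learned. Here it is a hypothetical stand-in.
    """
    spec = np.fft.rfft2(feat, axes=(-2, -1))      # 2D FFT of each channel
    spec = spec * freq_weight                     # element-wise spectral filter
    return np.fft.irfft2(spec, s=feat.shape[-2:], axes=(-2, -1))


def fuse_frequency_spatial(feat, freq_weight, alpha=0.5):
    """Blend the frequency-filtered features with the original spatial ones.

    A simple weighted sum is assumed here; the paper may fuse differently.
    """
    return alpha * frequency_branch(feat, freq_weight) + (1.0 - alpha) * feat


# Demonstration: a single impulse spreads over the whole map after filtering.
rng = np.random.default_rng(0)
C, H, W = 2, 8, 8
weight = (rng.normal(size=(C, H, W // 2 + 1))
          + 1j * rng.normal(size=(C, H, W // 2 + 1)))

impulse = np.zeros((C, H, W))
impulse[:, 0, 0] = 1.0                            # one active pixel per channel
response = frequency_branch(impulse, weight)
touched = np.count_nonzero(np.abs(response) > 1e-9)
print(f"positions influenced by one input pixel: {touched} of {C * H * W}")
```

Multiplying the spectrum element-wise is equivalent to a circular convolution whose kernel spans the entire feature map, so a single layer of this kind has a global receptive field, whereas a small spatial convolution only sees a local neighborhood.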
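Cite this article: Zheng Guanghai, Zhang Wei, Zhang Qian. FM-VXNet: A Study on an Improved Multimodal 3D Object Detection Algorithm Based on MVX-Net [J]. Computer Science and Application, 2025, 15(11): 143-155. https://doi.org/10.12677/csa.2025.1511292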
