Spatial Intelligence Powered by Multimodal Large Language Models: Technological Advances, Evaluation Frameworks, and Future Challenges
DOI: 10.12677/csa.2025.1512327
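Authors: Wang Chengwei (王承伟), Zhao Hongyang (赵虹阳): School of Artificial Intelligence, Xinjiang Vocational University of Technology (新疆理工职业大学), Tumxuk, Xinjiang; Liu Xiaohua (刘小华)*: School of Artificial Intelligence, Xinjiang Vocational University of Technology (新疆理工职业大学), Tumxuk, Xinjiang; School of Artificial Intelligence, Shenzhen Polytechnic University (深圳职业技术大学), Shenzhen, Guangdong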
Keywords: Multimodal Large Language Models, Spatial Intelligence, Embodied AI, Evaluation Benchmarks
Abstract: With the rapid development of Multimodal Large Language Models (MLLMs) in recent years, spatial intelligence, the core capability that connects perception, reasoning, and action, is becoming a key breakthrough point for AI moving into the physical world. This paper systematically reviews the technical evolution of multimodal large models in 3D visual understanding, spatial perception and reasoning, and embodied interaction. It focuses on spatial representation methods built on multi-source heterogeneous data such as video, depth maps, and point clouds, and summarizes the current mainstream evaluation benchmarks and typical applications. It also identifies significant remaining challenges in cross-view consistency, compositional reasoning, and dynamic scene modeling, and outlines future research directions, aiming to provide theoretical support and a technical roadmap for building spatial intelligence systems.
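Cite this article: Wang Chengwei, Zhao Hongyang, Liu Xiaohua. Spatial Intelligence Powered by Multimodal Large Language Models: Technological Advances, Evaluation Frameworks, and Future Challenges[J]. Computer Science and Application (计算机科学与应用), 2025, 15(12): 118-124. https://doi.org/10.12677/csa.2025.1512327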
