[1] Azuma, D., Miyanishi, T., Kurita, S. and Kawanabe, M. (2022) ScanQA: 3D Question Answering for Spatial Scene Understanding. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, 19-24 June 2022, 19107-19117.
[2] Gemini Team, Georgiev, P., Lei, V.I., Burnell, R., et al. (2024) Gemini 1.5: Unlocking Multimodal Understanding across Millions of Tokens of Context. https://arxiv.org/abs/2403.05530
[3] Wu, J., Wang, Y., Xue, T., et al. (2017) MarrNet: 3D Shape Reconstruction via 2.5D Sketches. Advances in Neural Information Processing Systems, 30, 1-11.
[4] Yan, Y.J., Jian, M.W., Liu, H.Z., et al. (2023) A Survey of Deep Learning-Based Visual SLAM. Proceedings of the 2023 27th Annual Conference on New Network Technologies and Applications, Network Application Branch of the China Computer Users Association, Zhenjiang, 2023, 55-58. https://kns.cnki.net/kcms2/article/abstract?v=Jz-lw5xPjDacNhj1bXaSiYL-pJVHArlVQka4-Jhyj_MDeqg7raOKKh0NFINXQ1P91RSnND436dfb9QWqC8bbi2fdpqdlFRQzHBE9hBFcGXmp3XW7US1p-8jpuVxRv36_5a5e0YkPqd4NGj1uex8glTGHy1Fm8PhteH9dLZCQrbLCOvh8UhV3anYyMwXXTMpE&uniplatform=NZKPT&language=CHS
[5] Cai, W., Ponomarenko, I., Yuan, J., Li, X., Yang, W., Dong, H., et al. (2025) SpatialBot: Precise Spatial Understanding with Vision Language Models. 2025 IEEE International Conference on Robotics and Automation (ICRA), Atlanta, 19-23 May 2025, 9490-9498.
[6] Liu, C., Wang, H., Henry, F., et al. (2025) MIRAGE: A Multi-Modal Benchmark for Spatial Perception, Reasoning, and Intelligence. https://arxiv.org/abs/2505.10604
[7] Yang, S., Xu, R., Xie, Y., et al. (2025) MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence. https://arxiv.org/abs/2505.23764
[8] Chen, Z., Wu, J.N., Wang, W.H., et al. (2023) InternVL: Scaling Up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. https://ieeexplore.ieee.org/document/10656429
[9] Gemini Team, Anil, R., Borgeaud, S., et al. (2023) Gemini: A Family of Highly Capable Multimodal Models. https://arxiv.org/abs/2312.11805
[10] Wang, P., Bai, S., Tan, S., et al. (2024) Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. https://arxiv.org/abs/2409.12191
[11] Newcombe, R.A., Fitzgibbon, A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., et al. (2011) KinectFusion: Real-Time Dense Surface Mapping and Tracking. 2011 10th IEEE International Symposium on Mixed and Augmented Reality, Basel, 26-29 October 2011, 127-136.
[12] Wu, D., Liu, F., Hung, Y.H., et al. (2025) Spatial-MLLM: Boosting MLLM Capabilities in Visual-Based Spatial Intelligence. https://arxiv.org/abs/2505.23747
[13] Liao, K., Wu, S., Wu, Z., et al. (2025) Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation. https://arxiv.org/abs/2510.08673
[14] Zhang, L., Rao, A. and Agrawala, M. (2023) Adding Conditional Control to Text-to-Image Diffusion Models. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, 1-6 October 2023, 3813-3824.
[15] Rombach, R., Blattmann, A., Lorenz, D., Esser, P. and Ommer, B. (2022) High-Resolution Image Synthesis with Latent Diffusion Models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, 18-24 June 2022, 10674-10685.
[16] Yang, J., Yang, S., Gupta, A.W., Han, R., Fei-Fei, L. and Xie, S. (2025) Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, 10-17 June 2025, 10632-10643.
[17] Hurst, A., Lerer, A., Goucher, A.P., et al. (2024) GPT-4o System Card. https://arxiv.org/abs/2410.21276
[18] Sprague, Z., Yin, F., Rodriguez, J.D., et al. (2024) To CoT or Not to CoT? Chain-of-Thought Helps Mainly on Math and Symbolic Reasoning. https://arxiv.org/abs/2409.12183
[19] Xu, R., Wang, W., Tang, H., et al. (2025) Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models. https://arxiv.org/abs/2505.17015
[20] Zheng, D., Huang, S. and Wang, L. (2025) Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, 10-17 June 2025, 2995-9006.
[21] Mildenhall, B., Srinivasan, P.P., Tancik, M., et al. (2021) NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Communications of the ACM, 65, 99-106.
[22] Gao, C., Saraf, A., Kopf, J. and Huang, J. (2021) Dynamic View Synthesis from Dynamic Monocular Video. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, 10-17 October 2021, 5692-5701.
[23] Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., et al. (2019) Habitat: A Platform for Embodied AI Research. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, 27 October-2 November 2019, 9338-9346.
[24] Kolve, E., Mottaghi, R., Han, W., et al. (2017) AI2-THOR: An Interactive 3D Environment for Visual AI. https://arxiv.org/abs/1712.05474