Spatial Reasoning in Large Vision-Language Models via GRPO-Based Reinforcement Learning
Abstract: Vision-language models (VLMs) perform well on visual semantic tasks but remain weak at spatial and geometric reasoning, such as relative depth estimation, 3D localization, and reasoning about spatial relations among multiple entities. These weaknesses stem mainly from two causes: high-quality spatial data is scarce, and conventional supervised training does little to elicit a model's spatial reasoning process. To address both, this paper proposes an end-to-end framework that integrates automated annotation, data synthesis, and model training to systematically improve the spatial cognition of VLMs. We build an image spatial annotation pipeline that requires no human intervention; it combines high-recall detection, mask refinement, and recovery of depth and camera parameters to annotate the spatial information in a photograph completely. On top of these annotations, we design a task-oriented data synthesis module that generates spatial reasoning data covering qualitative and quantitative questions in both single-hop and multi-hop form. We further design a reinforcement learning training pipeline based on the Group Relative Policy Optimization (GRPO) algorithm and combine it with customized reward functions and a curriculum learning strategy, which enables stable training of large models on spatial tasks. Experimental results show that the framework outperforms mainstream models, including Qwen2.5-VL and InternVL, on multiple public and self-constructed benchmarks. These results validate that semantically and geometrically consistent data construction, together with an optimized reinforcement learning training strategy, is effective in improving the spatial reasoning ability of VLMs.
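For readers unfamiliar with GRPO, the following minimal sketch illustrates the group-relative advantage that gives the algorithm its name: several answers are sampled for the same prompt, each is scored by a reward function, and each reward is normalized against the mean and standard deviation of its own group. This is an illustration of the general technique, not the author's implementation. The function name `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the example reward values are all assumptions made here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples in its group.

    rewards: scalar rewards for answers sampled from the same prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every answer in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one spatial question,
# e.g. 1.0 for a correct answer and 0.5 for a partially correct one.
rewards = [1.0, 0.0, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.3f}")
```

Answers scoring above the group mean receive a positive advantage and are reinforced, while those below the mean are discouraged. When all answers in a group receive the same reward, every advantage is zero and the group contributes no learning signal, which is one reason reward design and task difficulty scheduling matter for stable training.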
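Cite this article: Li Yiran. Spatial Reasoning in Large Vision-Language Models via GRPO-Based Reinforcement Learning [J]. Artificial Intelligence and Robotics Research, 2026, 15(1): 9-16. https://doi.org/10.12677/airr.2026.151002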