|
[1]
|
Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X. and Liu, S. (2024) SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models. In: Advances in Neural Information Processing Systems, Curran Associates Inc, 135062-135093.
|
|
[2]
|
Ren, S., He, K., Girshick, R. and Sun, J. (2017) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 1137-1149.
|
|
[3]
|
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., et al. (2023) Segment Anything. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, 1-6 October 2023, 4015-4026.
|
|
[4]
|
Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J. and Zhao, H. (2024) Depth Anything V2. In: Advances in Neural Information Processing Systems, Curran Associates Inc, 21875-21911.
|
|
[5]
|
Li, Z., Wang, Q., Zhang, F. and Tan, P. (2025) MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Monocular Videos of Dynamic Scenes. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, 10-17 June 2025, 10486-10496.
|
|
[6]
|
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Xiao, M., Li, Y.K., Wu, Y. and Guo, D. (2024) DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. https://arxiv.org/abs/2402.03300
|
|
[7]
|
He, J., Liu, J., Liu, C.Y., Yan, R., Wang, C., et al. (2025) Skywork Open Reasoner 1 Technical Report. https://arxiv.org/abs/2502.06657
|
|
[8]
|
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., et al. (2025) Qwen2.5-VL Technical Report. https://arxiv.org/abs/2502.13923
|
|
[9]
|
Yang, J., Yang, S., Gupta, A., Han, R., et al. (2024) Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces. https://arxiv.org/abs/2406.18385
|
|
[10]
|
Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Li, Y., Liu, Z. and Li, C. (2024) LLaVA-OneVision: Easy Visual Task Transfer. https://arxiv.org/abs/2408.03326
|
|
[11]
|
Ouyang, K., Liu, Y., Wu, H., Liu, Y., Zhou, H., Zhou, J., Meng, F. and Sun, X. (2025) SpaceR: Reinforcing MLLMs in Video Spatial Reasoning. https://arxiv.org/abs/2501.01805
|
|
[12]
|
Deng, N., Gu, L., Ye, S., He, Y., Chen, Z., Li, S., Wang, H., Wei, X., Yang, T., Dou, M., et al. (2025) InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models. https://arxiv.org/abs/2502.14028
|
|
[13]
|
Ray, A., Duan, J., Brown, E., Tan, R., Bashkirova, D., Hendrix, R., Ehsani, K., Kembhavi, A., Plummer, B.A., Krishna, R., et al. (2024) SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models. https://arxiv.org/abs/2412.07755
|