SpatialGeo enhances spatial reasoning in MLLMs through a novel vision encoder that generates spatial-aware visual embeddings.
The overall architecture of SpatialGeo is shown in the figure below. It consists of three major modules:
1) a CLIP module, pairing the CLIP encoder with its adapter to extract instance-level semantic features;
2) a MoGe module, pairing the MoGe encoder with its adapter to embed a mixture of geometry and semantic features;
3) an LLM module, which takes the interleaved geometry and semantic embeddings together with text tokens as input to generate answers.
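To make this composition concrete, the sketch below outlines how the three modules could fit together in PyTorch. It is an illustrative approximation only, not the released implementation: the wrapper classes, the `out_dim` attribute, the equal token counts from both encoders, and the `inputs_embeds` interface are all assumptions for exposition.

```python
import torch
import torch.nn as nn

class SpatialGeoSketch(nn.Module):
    """Illustrative composition of the three SpatialGeo modules.
    All module interfaces here are assumptions, not the released code."""

    def __init__(self, clip_encoder, moge_encoder, llm, llm_dim):
        super().__init__()
        # 1) CLIP module: CLIP encoder + adapter for instance-level
        #    semantic features.
        self.clip_encoder = clip_encoder
        self.clip_adapter = nn.Linear(clip_encoder.out_dim, llm_dim)
        # 2) MoGe module: MoGe encoder + adapter for mixed
        #    geometry/semantic features.
        self.moge_encoder = moge_encoder
        self.moge_adapter = nn.Linear(moge_encoder.out_dim, llm_dim)
        # 3) LLM module: consumes visual embeddings plus text tokens.
        self.llm = llm

    def forward(self, image, text_embeds):
        sem = self.clip_adapter(self.clip_encoder(image))  # (B, N, D)
        geo = self.moge_adapter(self.moge_encoder(image))  # (B, N, D)
        # Interleave geometry and semantic tokens (assumes both
        # encoders emit the same number of tokens N).
        visual = torch.stack([geo, sem], dim=2).flatten(1, 2)  # (B, 2N, D)
        # Prepend the visual embeddings to the text embeddings and let
        # the LLM generate the answer autoregressively.
        return self.llm(inputs_embeds=torch.cat([visual, text_embeds], dim=1))
```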
Detailed information about the source code and model is available on the following website.
We compare SpatialGeo with SOTA MLLMs, i.e., LLaVA-v1.5-7B [1], GPT-4.1 [2], and SpatialRGPT [3], on spatial VQA benchmarks including SpatialRGPT-Bench [3] and SpatialScore [4].
We select different types of questions from SpatialRGPT-Bench for demonstration.
Fig. 1 Vertical Distance
Fig. 2 Width Data
Fig. 3 Height Data
We select different types of questions from SpatialScore for demonstration.
Fig. 4 Bounding Box Distance
Fig. 5 Object Distance
Fig. 6 Object Distance
Fig. 7 Object Localization
Fig. 8 Camera and Image Transformation
________________________
[1] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. “Improved baselines with visual instruction tuning.” CVPR, pp. 26296-26306, 2024.
[2] OpenAI. “Introducing GPT-4.1 in the API.” April 2025. [Online]. Available: https://openai.com/index/gpt-4-1/
[3] An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. “SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models.” NeurIPS, pp. 135062-135093, 2024.
[4] Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, and Weidi Xie. “SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding.” arXiv preprint arXiv:2505.17012, 2025.