
SpatialGeo: Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion


SpatialGeo enhances spatial reasoning in multimodal LLMs (MLLMs) through a novel vision encoder that generates spatially aware visual embeddings.

The figure below shows the overall architecture of SpatialGeo, which comprises three major modules: 1) a CLIP module, consisting of the CLIP encoder and its adapter, which extracts instance-level semantic features; 2) a MoGe module, consisting of the MoGe encoder and its adapter, which embeds a mixture of geometry and semantic features; and 3) an LLM module, which takes the interleaved geometry and semantic embeddings together with text tokens as input and generates the answer to the question.
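The data flow through the three modules can be sketched roughly as follows. This is a minimal illustration, not the released implementation: the page does not specify the adapter design, the token dimensions, or the exact interleaving pattern, so the linear adapters, the token-by-token alternation (geometry first), and all function names (`project`, `interleave`, `build_llm_input`) are assumptions made for this example. The encoder outputs are stand-in lists of numbers rather than real CLIP or MoGe features.

```python
from typing import List

Token = List[float]


def project(tokens: List[Token], weight: List[List[float]]) -> List[Token]:
    """Hypothetical linear adapter: map each token (d_in) through weight (d_in x d_out)."""
    columns = list(zip(*weight))  # one tuple of length d_in per output dimension
    return [[sum(x * w for x, w in zip(tok, col)) for col in columns] for tok in tokens]


def interleave(geo: List[Token], sem: List[Token]) -> List[Token]:
    """Assumed interleaving: alternate geometry and semantic tokens (g0, s0, g1, s1, ...)."""
    if len(geo) != len(sem):
        raise ValueError("this sketch assumes equal numbers of geometry and semantic tokens")
    out: List[Token] = []
    for g, s in zip(geo, sem):
        out.append(g)
        out.append(s)
    return out


def build_llm_input(
    clip_feats: List[Token],
    moge_feats: List[Token],
    clip_adapter: List[List[float]],
    moge_adapter: List[List[float]],
    text_embeds: List[Token],
) -> List[Token]:
    """Assemble the LLM input: interleaved visual tokens followed by text tokens."""
    sem = project(clip_feats, clip_adapter)  # CLIP module: semantic embeddings
    geo = project(moge_feats, moge_adapter)  # MoGe module: geometry-semantic embeddings
    return interleave(geo, sem) + text_embeds


if __name__ == "__main__":
    # Toy stand-ins: 2 visual tokens per encoder, LLM embedding size 2, 3 text tokens.
    clip_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]           # 2 tokens, d_in = 3
    moge_feats = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # 2 tokens, d_in = 4
    clip_adapter = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]        # 3 x 2
    moge_adapter = [[0.5, 0.5], [0.0, 0.0], [0.0, 0.0], [1.0, -1.0]]  # 4 x 2
    text_embeds = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    seq = build_llm_input(clip_feats, moge_feats, clip_adapter, moge_adapter, text_embeds)
    print(len(seq))  # 7: four interleaved visual tokens plus three text tokens
```

Both adapters project into the same embedding size so that geometry, semantic, and text tokens can share one input sequence; that shared size is the only constraint this sketch relies on.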



Code

Detailed information about the source code and model is available at the following website.


Spatial VQA Datasets

We compare SpatialGeo with state-of-the-art (SOTA) MLLMs, namely LLaVA-v1.5-7B [1], GPT-4.1 [2], and SpatialRGPT [3], on spatial VQA datasets, including SpatialRGPT-Bench [3] and SpatialScore [4].


Examples From SpatialRGPT-Bench

We select different types of questions from SpatialRGPT-Bench for demonstration.

Fig.1 Vertical Distance


Fig.2 Width Data


Fig.3 Height Data



Examples From SpatialScore

We select different types of questions from SpatialScore for demonstration.

Fig.4 Bounding Boxes Distance


Fig.5 Objects Distance


Fig.6 Objects Distance


Fig.7 Object Localization


Fig.8 Camera and Image Transformation


________________________

References

[1] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. “Improved baselines with visual instruction tuning.” CVPR, pp. 26296-26306. 2024.

[2] OpenAI. “Introducing gpt-4.1 in the API.” April 2025. [Online]. Available: https://openai.com/index/gpt-4-1/

[3] An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. “SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models.” NeurIPS, pp. 135062-135093. 2025.

[4] Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, and Weidi Xie. “SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding.” arXiv preprint arXiv:2505.17012, 2025.