๐
๐
Old Age
GeoX-Bench: Benchmarking Cross-View Geo-Localization and Pose Estimation Capabilities of Large Multimodal Models
November 17, 2025 ยท Declared Dead ยท ๐ arXiv.org
Authors
Yushuo Zheng, Jiangyong Ying, Huiyu Duan, Chunyi Li, Zicheng Zhang, Jing Liu, Xiaohong Liu, Guangtao Zhai
arXiv ID
2511.13259
Category
cs.CV: Computer Vision
Cross-listed
cs.AI
Citations
0
Venue
arXiv.org
Repository
https://github.com/IntMeGroup/GeoX-Bench}
Last Checked
2 months ago
Abstract
Large multimodal models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks, however their knowledge and abilities in the cross-view geo-localization and pose estimation domains remain unexplored, despite potential benefits for navigation, autonomous driving, outdoor robotics, \textit{etc}. To bridge this gap, we introduce \textbf{GeoX-Bench}, a comprehensive \underline{Bench}mark designed to explore and evaluate the capabilities of LMMs in \underline{cross}-view \underline{Geo}-localization and pose estimation. Specifically, GeoX-Bench contains 10,859 panoramic-satellite image pairs spanning 128 cities in 49 countries, along with corresponding 755,976 question-answering (QA) pairs. Among these, 42,900 QA pairs are designated for benchmarking, while the remaining are intended to enhance the capabilities of LMMs. Based on GeoX-Bench, we evaluate the capabilities of 25 state-of-the-art LMMs on cross-view geo-localization and pose estimation tasks, and further explore the empowered capabilities of instruction-tuning. Our benchmark demonstrate that while current LMMs achieve impressive performance in geo-localization tasks, their effectiveness declines significantly on the more complex pose estimation tasks, highlighting a critical area for future improvement, and instruction-tuning LMMs on the training data of GeoX-Bench can significantly improve the cross-view geo-sense abilities. The GeoX-Bench is available at \textcolor{magenta}{https://github.com/IntMeGroup/GeoX-Bench}.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Computer Vision
๐
๐
Old Age
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
R.I.P.
๐ป
Ghosted
You Only Look Once: Unified, Real-Time Object Detection
๐
๐
Old Age
SSD: Single Shot MultiBox Detector
๐
๐
Old Age
Squeeze-and-Excitation Networks
R.I.P.
๐ป
Ghosted
Rethinking the Inception Architecture for Computer Vision
Died the same way โ ๐ 404 Not Found
R.I.P.
๐
404 Not Found
Deep High-Resolution Representation Learning for Visual Recognition
R.I.P.
๐
404 Not Found
HuggingFace's Transformers: State-of-the-art Natural Language Processing
R.I.P.
๐
404 Not Found
CCNet: Criss-Cross Attention for Semantic Segmentation
R.I.P.
๐
404 Not Found