๐
๐
Old Age
RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts
December 07, 2024 ยท Declared Dead ยท ๐ arXiv.org
Authors
Xu Liu, Zhouhui Lian
arXiv ID
2412.05679
Category
cs.CV: Computer Vision
Citations
24
Venue
arXiv.org
Repository
https://github.com/xuliu-cyber/RSUniVLM}{here}
Last Checked
1 month ago
Abstract
Remote Sensing Vision-Language Models (RS VLMs) have made much progress in the tasks of remote sensing (RS) image comprehension. While performing well in multi-modal reasoning and multi-turn conversations, the existing models lack pixel-level understanding and struggle with multi-image inputs. In this work, we propose RSUniVLM, a unified, end-to-end RS VLM designed for comprehensive vision understanding across multiple granularity, including image-level, region-level, and pixel-level tasks. RSUniVLM also performs effectively in multi-image analysis, with instances of change detection and change captioning. To enhance the model's ability to capture visual information at different levels without increasing model size, we design a novel architecture called Granularity-oriented Mixture of Experts to constraint the model to about 1 billion parameters. We also construct a large-scale RS instruction-following dataset based on a variety of existing datasets in both RS and general domain, encompassing various tasks such as object localization, visual question answering, and semantic segmentation. Substantial experiments have been conducted to validate the superiority of the proposed RSUniVLM up to state-of-the-art across various RS tasks. Code and model will be available at \href{https://github.com/xuliu-cyber/RSUniVLM}{here}.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Computer Vision
๐
๐
Old Age
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
R.I.P.
๐ป
Ghosted
You Only Look Once: Unified, Real-Time Object Detection
๐
๐
Old Age
SSD: Single Shot MultiBox Detector
๐
๐
Old Age
Squeeze-and-Excitation Networks
R.I.P.
๐ป
Ghosted
Rethinking the Inception Architecture for Computer Vision
Died the same way โ ๐ 404 Not Found
R.I.P.
๐
404 Not Found
Deep High-Resolution Representation Learning for Visual Recognition
R.I.P.
๐
404 Not Found
HuggingFace's Transformers: State-of-the-art Natural Language Processing
R.I.P.
๐
404 Not Found
CCNet: Criss-Cross Attention for Semantic Segmentation
R.I.P.
๐
404 Not Found