Earth vision research typically focuses on extracting geospatial object locations and categories but neglects the relations between objects and comprehensive reasoning over them. Based on city planning needs, we develop a multi-modal multi-task VQA dataset (EarthVQA) to advance relational reasoning-based judging, counting, and comprehensive analysis. The EarthVQA dataset contains 6,000 images, corresponding semantic masks, and 208,593 QA pairs with urban and rural governance requirements embedded.
Characteristics:
Multi-level annotations: Each image is paired with a semantic mask and QA pairs, which together support relational reasoning-based remote sensing visual question answering.
Applicable QA pairs: All QA pairs are designed around actual city planning needs.
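To make the multi-level annotation structure concrete, the following is a minimal Python sketch of how one image-mask-QA sample might be represented in memory. The class names, field names, file paths, and example questions are all hypothetical illustrations; they are not the dataset's actual file layout or loader API. Only the three task types (judging, counting, comprehensive analysis) come from the description above.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class QAPair:
    """One question-answer pair (hypothetical schema)."""
    question: str
    answer: str
    task: str  # "judging", "counting", or "comprehensive analysis"


@dataclass
class Sample:
    """One image with its semantic mask and QA pairs (hypothetical schema)."""
    image_path: str  # placeholder path, not the real directory layout
    mask_path: str   # placeholder path, not the real directory layout
    qa_pairs: List[QAPair] = field(default_factory=list)


def count_tasks(sample: Sample) -> Counter:
    """Count how many QA pairs of each task type a sample carries."""
    return Counter(qa.task for qa in sample.qa_pairs)


# Illustrative sample; the questions and answers are invented for this sketch.
sample = Sample(
    image_path="images/example.png",
    mask_path="masks/example.png",
    qa_pairs=[
        QAPair("Are there any buildings near the water?", "Yes", "judging"),
        QAPair("How many buildings are in the image?", "3", "counting"),
    ],
)
task_counts = count_tasks(sample)
```

Keeping the mask next to the image and its QA pairs in one record reflects the pairing described above, so a model or an evaluation script can reach all three annotation levels for a given scene.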