Remote sensing change detection aims to perceive changes on the Earth's surface from remote sensing data acquired at different times and to feed these changes back to users. However, most existing methods focus only on detecting change regions and lack the ability to interact with users to identify the changes a user is interested in. In this paper, we introduce a new benchmark named Change Detection Question Answering and Grounding (CDQAG), which extends the traditional change detection task by providing interpretable textual answers and intuitive visual evidence. To this end, we construct the first CDQAG dataset, termed QAG-360K, comprising over 360K triplets of questions, textual answers, and corresponding high-quality visual masks. It encompasses 10 essential land-cover categories and 8 comprehensive question types, providing a large-scale and diverse resource for remote sensing applications. Based on this, we present VisTA, a simple yet effective baseline method that unifies the tasks of question answering and grounding by delivering both textual and visual answers. Our method achieves state-of-the-art results on both the classic CDVQA dataset and the proposed CDQAG benchmark. Extensive qualitative and quantitative experimental results provide useful insights for developing better CDQAG models, and we hope that our work can inspire further research in this important yet underexplored direction.
Figure 3. The architecture of our VisTA model as a simple baseline for the CDQAG task. First, the given pair of remote sensing images and the question are encoded into vision features $F_v$ and language features $[F_s, F_w]$, respectively. $F_v$ and $F_w$ are fed into a vision-language decoder to produce refined multimodal features $F_{vl}$. Next, the Q&A selector produces the Q&A feature $F_s$, which is then activated as a selection weight to filter the pixel decoder's output, yielding a coarse mask $M_c$. Finally, the text-visual answer decoder predicts the textual answer and the corresponding visual answer.
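To make the data flow in Figure 3 concrete, below is a minimal PyTorch sketch of a VisTA-style forward pass, assuming the pipeline described in the caption. All module choices, dimensions, the channel-concatenation of the bi-temporal pair, and the mean-pooled sentence feature are hypothetical simplifications, not the authors' implementation; the official model uses pretrained encoders and a dedicated text-visual answer decoder.

```python
import torch
import torch.nn as nn

class VisTASketch(nn.Module):
    """Hypothetical sketch of the Figure 3 pipeline (not the official code)."""

    def __init__(self, d_model=256, vocab_size=10000, num_answers=32, num_queries=16):
        super().__init__()
        # Placeholder encoders; the paper builds on pretrained backbones.
        self.vision_encoder = nn.Conv2d(6, d_model, kernel_size=16, stride=16)
        self.text_encoder = nn.Embedding(vocab_size, d_model)
        # Vision-language decoder: word features F_w attend to vision features F_v.
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.vl_decoder = nn.TransformerDecoder(layer, num_layers=2)
        # Q&A selector (simplified to a linear map) and a 1x1-conv pixel decoder.
        self.qa_selector = nn.Linear(d_model, num_queries)
        self.pixel_decoder = nn.Conv2d(d_model, num_queries, kernel_size=1)
        self.answer_head = nn.Linear(d_model, num_answers)

    def forward(self, img_t1, img_t2, question_ids):
        # Encode the bi-temporal pair (channel concatenation as a stand-in).
        x = torch.cat([img_t1, img_t2], dim=1)            # (B, 6, H, W)
        f_v = self.vision_encoder(x)                      # (B, C, h, w)
        f_v_seq = f_v.flatten(2).transpose(1, 2)          # (B, h*w, C)
        # Language features: word features F_w and a sentence feature F_s.
        f_w = self.text_encoder(question_ids)             # (B, L, C)
        f_s = f_w.mean(dim=1)                             # (B, C)
        # Refined multimodal features F_vl.
        f_vl = self.vl_decoder(f_w, f_v_seq)              # (B, L, C)
        # Q&A selector: activate the Q&A feature into selection weights.
        weights = self.qa_selector(f_s).sigmoid()         # (B, Q)
        # Filter the pixel decoder's output to get the coarse mask M_c.
        masks = self.pixel_decoder(f_v)                   # (B, Q, h, w)
        m_c = (weights[:, :, None, None] * masks).sum(1)  # (B, h, w)
        # Textual answer from the pooled multimodal feature.
        answer_logits = self.answer_head(f_vl.mean(dim=1))  # (B, num_answers)
        return answer_logits, m_c

# Example usage with dummy inputs:
model = VisTASketch()
logits, mask = model(torch.rand(1, 3, 256, 256),
                     torch.rand(1, 3, 256, 256),
                     torch.randint(0, 10000, (1, 12)))
```

The key design point this sketch illustrates is that the textual and visual answers share one multimodal representation: the same question-conditioned features both classify the answer and gate the mask proposals, which is how the model "unifies" answering and grounding.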
@article{li2024show,
title={Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection},
author={Li, Ke and Dong, Fuyu and Wang, Di and Li, Shaofeng and Wang, Quan and Gao, Xinbo and Chua, Tat-Seng},
journal={arXiv preprint arXiv:2410.23828},
year={2024}
}