Show Me What and Where has Changed?
Question Answering and Grounding for Remote Sensing Change Detection

1. Xidian University, 2. Chongqing University of Posts and Telecommunications,
3. National University of Singapore
Corresponding Author

Figure 1. Change detection (CD) identifies surface changes from multi-temporal images. Classic visual question answering (VQA) only supports textual answers. In comparison, the proposed change detection question answering and grounding (CDQAG) supports well-founded answers, i.e., textual answers (“what has changed”) and relevant visual feedback (“where has changed”).


Abstract

Remote sensing change detection aims to perceive changes on the Earth's surface from remote sensing data acquired at different times and to feed these changes back to users. However, most existing methods focus only on detecting change regions and cannot interact with users to identify the changes they are interested in. In this paper, we introduce a new benchmark named Change Detection Question Answering and Grounding (CDQAG), which extends the traditional change detection task by providing interpretable textual answers and intuitive visual evidence. To this end, we construct the first CDQAG dataset, termed QAG-360K, comprising over 360K triplets of questions, textual answers, and corresponding high-quality visual masks. It encompasses 10 essential land-cover categories and 8 comprehensive question types, providing a large-scale and diverse dataset for remote sensing applications. Based on this, we present VisTA, a simple yet effective baseline method that unifies question answering and grounding by delivering both textual and visual answers. Our method achieves state-of-the-art results on both the classic CDVQA and the proposed CDQAG datasets. Extensive qualitative and quantitative experimental results provide useful insights for developing better CDQAG models, and we hope that our work can inspire further research in this important yet underexplored direction.

Task: CDQAG

CDQAG takes a pair of remote sensing images and a question as input, and outputs a textual answer together with a corresponding visual segmentation mask. Unlike classic VQA methods, which provide only natural-language responses, CDQAG offers both textual answers and corresponding visual explanations (as shown in Figure 1), which is critical for reliable remote sensing interpretation.
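Concretely, a CDQAG model can be viewed as a mapping from a bi-temporal image pair and a question to an (answer, mask) pair. The sketch below illustrates this interface only; the `CDQAGModel` class, its method names, and the example inputs are hypothetical placeholders, not the released API.

```python
import torch

# Hypothetical interface sketch: CDQAGModel and predict() are illustrative
# names, not the released VisTA API.
class CDQAGModel(torch.nn.Module):
    def predict(self, img_t1, img_t2, question):
        """Return a textual answer and a binary change mask.

        img_t1, img_t2: (3, H, W) tensors for the two acquisition dates.
        question:       natural-language string, e.g. "What has changed?"
        """
        raise NotImplementedError  # stands in for the actual model

# Usage (shapes only; no real data or weights are loaded here):
# model = CDQAGModel()
# answer, mask = model.predict(img_t1, img_t2, "Which land-cover class increased?")
# answer -> e.g. "buildings"; mask -> (H, W) tensor localizing the change
```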

Benchmark Dataset: QAG-360K

Figure 2. Examples of the proposed QAG-360K dataset.
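Each QAG-360K sample pairs a question with a textual answer and a pixel-level mask for a bi-temporal image pair. The record layout below is an illustrative assumption about how one such triplet could be organized; the field names and file paths are hypothetical, not the official annotation schema.

```python
# Illustrative sketch of one question-answer-mask triplet; all field names
# and file paths are hypothetical, not the released QAG-360K format.
sample = {
    "image_t1": "images/t1/scene_0001.png",   # earlier acquisition
    "image_t2": "images/t2/scene_0001.png",   # later acquisition
    "question": "What has the road changed into?",
    "answer": "buildings",                    # textual answer ("what")
    "mask": "masks/scene_0001_q3.png",        # binary mask ("where")
}
```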

Benchmark Method: VisTA

Figure 3. The architecture of our VisTA model, a simple baseline for the CDQAG task. First, the two input remote sensing images and the question are encoded into vision features $F_v$ and language features $[F_s, F_w]$, respectively. $F_v$ and $F_w$ are fed into a vision-language decoder to produce refined multimodal features $F_{vl}$. Next, the Q&A selector is used to generate the Q&A features, which are activated as a selection weight to filter the pixel decoder's output, resulting in a coarse mask $M_c$. Finally, the text-visual answer decoder predicts the textual answer and the corresponding visual answer.
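The caption above describes a single forward pass. The sketch below retraces that data flow with placeholder submodules; every module, tensor shape, and the sigmoid-based selection weight are assumptions made for illustration, not the exact VisTA implementation.

```python
import torch
import torch.nn as nn

class VisTASketch(nn.Module):
    """Toy retrace of the Figure 3 data flow; all submodules are placeholders."""

    def __init__(self, vision_enc, text_enc, vl_decoder, qa_selector,
                 pixel_decoder, answer_decoder):
        super().__init__()
        self.vision_enc = vision_enc          # bi-temporal image encoder -> F_v
        self.text_enc = text_enc              # question encoder -> [F_s, F_w]
        self.vl_decoder = vl_decoder          # cross-modal decoder -> F_vl
        self.qa_selector = qa_selector        # produces Q&A (selection) features
        self.pixel_decoder = pixel_decoder    # dense features for mask prediction
        self.answer_decoder = answer_decoder  # joint text-visual answer head

    def forward(self, img_t1, img_t2, question_tokens):
        f_v = self.vision_enc(img_t1, img_t2)            # vision features F_v
        f_s, f_w = self.text_enc(question_tokens)        # sentence / word features
        f_vl = self.vl_decoder(f_v, f_w)                 # refined multimodal F_vl
        qa = self.qa_selector(f_s, f_vl)                 # Q&A features
        weight = torch.sigmoid(qa)                       # assumed activation
        coarse_mask = weight * self.pixel_decoder(f_vl)  # filtered coarse mask M_c
        answer_logits, mask = self.answer_decoder(qa, f_vl, coarse_mask)
        return answer_logits, mask                       # "what" and "where"
```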


Experiments

To the best of our knowledge, we are the first to benchmark state-of-the-art methods on the QAG-360K and CDVQA datasets.

Table 1. VisTA results: comparison on QAG-360K test set


Table 2. VisTA results: comparison on CDVQA test set

Downloads



BibTeX

Please consider citing CDQAG if it helps your research.
        
@article{li2024show,
  title={Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection},
  author={Li, Ke and Dong, Fuyu and Wang, Di and Li, Shaofeng and Wang, Quan and Gao, Xinbo and Chua, Tat-Seng},
  journal={arXiv preprint arXiv:2410.23828},
  year={2024}
}