Show Me What and Where has Changed?
Question Answering and Grounding for Remote Sensing Change Detection

1. Xidian University, 2. Chongqing University of Posts and Telecommunications,
3. National University of Singapore
Corresponding Author

Figure 1. Change detection (CD) identifies surface changes from multi-temporal images. Classic visual question answering (VQA) only supports textual answers. In comparison, the proposed change detection question answering and grounding (CDQAG) supports well-founded answers, i.e., textual answers (“what has changed”) and relevant visual feedback (“where has changed”).


Abstract

Remote sensing change detection aims to perceive changes occurring on the Earth's surface from remote sensing data acquired at different times and to feed these changes back to humans. However, most existing methods focus only on detecting change regions and lack the ability to interact with users to identify the changes they expect. In this paper, we introduce a new benchmark named Change Detection Question Answering and Grounding (CDQAG), which extends the traditional change detection task by providing interpretable textual answers and intuitive visual evidence. To this end, we construct the first CDQAG dataset, termed QAG-360K, comprising over 360K triplets of questions, textual answers, and corresponding high-quality visual masks. It encompasses 10 essential land-cover categories and 8 comprehensive question types, providing a large-scale and diverse dataset for remote sensing applications. Based on this, we present VisTA, a simple yet effective baseline method that unifies the tasks of question answering and grounding by delivering both visual and textual answers. Our method achieves state-of-the-art results on both the classic CDVQA and the proposed CDQAG datasets. Extensive qualitative and quantitative experimental results provide useful insights for the development of better CDQAG models, and we hope that our work can inspire further research in this important yet underexplored direction.

Task: CDQAG

CDQAG takes a pair of remote sensing images and a question as input, and outputs a textual answer together with a corresponding visual segmentation. Unlike classic VQA methods, which provide only natural-language responses, CDQAG offers both textual answers and correlated visual explanations (as shown in Figure 1), which is critical for reliable remote sensing interpretation.
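
For concreteness, the sketch below shows the expected input/output signature of a CDQAG model in PyTorch-style code; the class, helper names, and thresholding are illustrative assumptions, not part of the released implementation.

# A minimal I/O sketch of the CDQAG task, assuming a PyTorch-style model.
# The names, shapes, and 0.5 threshold are illustrative assumptions.
from dataclasses import dataclass
import torch


@dataclass
class CDQAGOutput:
    answer: str          # textual answer ("what has changed")
    mask: torch.Tensor   # binary change mask of shape (H, W) ("where has changed")


def run_cdqag(model, img_t1: torch.Tensor, img_t2: torch.Tensor, question: str) -> CDQAGOutput:
    """Run one CDQAG query: two co-registered images of shape (3, H, W) and a question."""
    with torch.no_grad():
        answer, mask_logits = model(img_t1, img_t2, question)
    return CDQAGOutput(answer=answer, mask=(mask_logits.sigmoid() > 0.5).squeeze(0))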

Benchmark Dataset: QAG-360K

Figure 2. Examples of the proposed QAG-360K dataset.
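
The snippet below sketches how a QAG-360K triplet (question, textual answer, visual mask) attached to a bi-temporal image pair might be loaded; the file layout and field names are assumptions for illustration and should be adapted to the actual release format.

# A hypothetical loader for QAG-360K samples; the annotation file layout and
# record fields below are assumptions, not the official dataset format.
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class QAG360KDataset(Dataset):
    """Each sample is a (question, textual answer, visual mask) triplet
    attached to a pair of bi-temporal remote sensing images."""

    def __init__(self, root: str, split: str = "train"):
        self.root = Path(root)
        # Assumed annotation file: one JSON record per triplet.
        self.records = json.loads((self.root / f"{split}.json").read_text())

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        r = self.records[idx]
        img_t1 = Image.open(self.root / r["image_t1"]).convert("RGB")
        img_t2 = Image.open(self.root / r["image_t2"]).convert("RGB")
        mask = Image.open(self.root / r["mask"])  # grounding mask ("where has changed")
        return img_t1, img_t2, r["question"], r["answer"], mask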

Benchmark Method: VisTA

Figure 3. The architecture of our VisTA model as a simple baseline for the CDQAG task. First, the two given remote sensing images and the question are encoded into vision features $F_v$ and language features $[F_s,F_w]$, respectively. $F_v$ and $F_w$ are fed into a vision-language decoder to produce the refined multimodal features $F_{vl}$. Next, the Q&A selector generates Q&A features from $F_s$, which are activated as a selection weight to filter the pixel decoder's output, resulting in the coarse mask $M_c$. Finally, the text-visual answer decoder predicts the textual answer and the corresponding visual answer.
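
To make the data flow in Figure 3 easier to follow, the sketch below mirrors the caption step by step with placeholder modules; the module interfaces, feature shapes, and the dot-product selection are assumptions, not the exact VisTA implementation.

# A schematic forward pass following the Figure 3 caption. All submodules are
# placeholders supplied by the caller; only the data flow is meant to mirror
# the description, not the exact VisTA implementation.
import torch
import torch.nn as nn


class VisTASketch(nn.Module):
    def __init__(self, vision_encoder, text_encoder, vl_decoder,
                 qa_selector, pixel_decoder, answer_decoder):
        super().__init__()
        self.vision_encoder = vision_encoder   # bi-temporal images -> F_v
        self.text_encoder = text_encoder       # question -> (F_s, F_w)
        self.vl_decoder = vl_decoder           # (F_v, F_w) -> F_vl
        self.qa_selector = qa_selector         # F_s -> Q&A features
        self.pixel_decoder = pixel_decoder     # F_vl -> per-pixel embeddings
        self.answer_decoder = answer_decoder   # -> textual + visual answers

    def forward(self, img_t1, img_t2, question_tokens):
        f_v = self.vision_encoder(img_t1, img_t2)
        f_s, f_w = self.text_encoder(question_tokens)
        f_vl = self.vl_decoder(f_v, f_w)      # refined multimodal features
        f_qa = self.qa_selector(f_s)          # Q&A features, assumed shape (B, Q, C)
        pixel_feats = self.pixel_decoder(f_vl)  # assumed shape (B, C, H, W)
        # The Q&A features act as a selection weight over the pixel decoder
        # output, yielding the coarse mask M_c (assumed dot-product selection).
        m_c = torch.einsum("bqc,bchw->bqhw", f_qa, pixel_feats)
        answer_logits, mask = self.answer_decoder(f_qa, f_vl, m_c)
        return answer_logits, mask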


Experiments

To the best of our knowledge, we benchmark all state-of-the-art methods on the proposed QAG-360K and the classic CDVQA datasets.

Table 1. VisTA results: comparison on QAG-360K test set


Table 2. VisTA results: comparison on CDVQA test set
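
For reference, the sketch below implements two metrics commonly used for this kind of benchmark, textual answer accuracy and mask intersection-over-union (IoU); the exact metric suite reported in Tables 1 and 2 follows the paper, so this is illustrative only.

# Illustrative metrics: answer accuracy and mask IoU. The metrics actually
# reported in the tables follow the paper; these are assumed stand-ins.
import torch


def answer_accuracy(pred_answers, gt_answers) -> float:
    """Exact-match accuracy over textual answers."""
    correct = sum(p == g for p, g in zip(pred_answers, gt_answers))
    return correct / max(len(gt_answers), 1)


def mask_iou(pred_mask: torch.Tensor, gt_mask: torch.Tensor, eps: float = 1e-6) -> float:
    """IoU between two binary masks of shape (H, W)."""
    pred, gt = pred_mask.bool(), gt_mask.bool()
    inter = (pred & gt).sum().item()
    union = (pred | gt).sum().item()
    return (inter + eps) / (union + eps)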

Downloads



BibTeX

Please consider citing CDQAG if it helps your research.
@misc{li2024changedquestionansweringgrounding,
  title={Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection},
  author={Ke Li and Fuyu Dong and Di Wang and Shaofeng Li and Quan Wang and Xinbo Gao and Tat-Seng Chua},
  year={2024},
  eprint={2410.23828},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.23828},
}