What question did this study set out to answer?

The aim is to improve land cover change interpretation in satellite imagery through interactive analysis.

February 10, 2026 | Open Access

DeltaVLM: Interactive Remote Sensing Image Change Analysis via Instruction-Guided Difference Perception

Key Points

Developed DeltaVLM, a vision language model for interactive remote sensing image change analysis (RSICA).
Introduced a fine-tuned bi-temporal vision encoder that independently extracts semantic features from each image in the input pair.
Implemented a visual difference perception module using a cross-semantic relation measuring (CSRM) mechanism.
Constructed ChangeChat-105k, an instruction-following dataset with over 105k samples.
DeltaVLM outperforms existing methods in single-turn captioning and multi-turn interactive change analysis.
Achieved state-of-the-art performance compared with both general multimodal models and specialized remote sensing vision language models.

Abstract

The accurate interpretation of land cover changes in multi-temporal satellite imagery is critical for Earth observation. However, existing methods typically yield static outputs—such as binary masks or fixed captions—lacking interactivity and user guidance. To address this limitation, we introduce remote sensing image change analysis (RSICA), a novel paradigm that enables the instruction-guided, multi-turn exploration of temporal differences in bi-temporal images through visual question answering. To realize RSICA, we propose DeltaVLM, a vision language model specifically designed for interactive change understanding. DeltaVLM comprises three key components: (1) a fine-tuned bi-temporal vision encoder that independently extracts semantic features from each image in the input pair; (2) a visual difference perception module with a cross-semantic relation measuring (CSRM) mechanism to interpret changes; and (3) an instruction-guided Q-former that selects query-relevant change features and aligns them with a frozen large language model to generate context-aware responses. We also present ChangeChat-105k, a large-scale instruction-following dataset containing over 105k diverse samples. Extensive experiments show that DeltaVLM achieves state-of-the-art performance in both single-turn captioning and multi-turn interactive change analysis, surpassing both general multimodal models and specialized remote sensing vision language models.
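The three-stage data flow described in the abstract can be illustrated with a toy sketch. This is not the DeltaVLM implementation: the linear "encoder", the cosine-gated difference, and the single-head attention below are assumed stand-ins chosen only to show how features might move from a shared encoder, through a difference step, to instruction-conditioned queries. All function and variable names are hypothetical.

```python
import math
import random


def encode(image, weights):
    """Toy shared encoder: project every patch with the same weights (hypothetical)."""
    return [[sum(p * w for p, w in zip(patch, col)) for col in weights]
            for patch in image]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def difference_features(feats_t1, feats_t2):
    """Assumed stand-in for difference perception (NOT the paper's CSRM):
    per-patch feature difference, scaled by how dissimilar the two features are."""
    out = []
    for f1, f2 in zip(feats_t1, feats_t2):
        gate = 1.0 - cosine(f1, f2)
        out.append([gate * (b - a) for a, b in zip(f1, f2)])
    return out


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_with_instruction(queries, instruction_vec, diff_feats):
    """Single-head cross-attention: queries shifted by the instruction embedding
    attend over the difference features (a simplification of a Q-former)."""
    dim = len(instruction_vec)
    selected = []
    for q in queries:
        q_cond = [a + b for a, b in zip(q, instruction_vec)]
        scores = [sum(x * y for x, y in zip(q_cond, f)) / math.sqrt(dim)
                  for f in diff_feats]
        attn = softmax(scores)
        selected.append([sum(w * f[d] for w, f in zip(attn, diff_feats))
                         for d in range(dim)])
    return selected


def change_tokens(img_t1, img_t2, weights, queries, instruction_vec):
    """Encode each image independently, compute differences, select per instruction."""
    f1 = encode(img_t1, weights)
    f2 = encode(img_t2, weights)
    diff = difference_features(f1, f2)
    return select_with_instruction(queries, instruction_vec, diff)


if __name__ == "__main__":
    rng = random.Random(0)
    patch_dim, dim, n_patches, n_queries = 6, 4, 9, 3
    weights = [[rng.uniform(-1, 1) for _ in range(patch_dim)] for _ in range(dim)]
    before = [[rng.uniform(0, 1) for _ in range(patch_dim)] for _ in range(n_patches)]
    after = [[rng.uniform(0, 1) for _ in range(patch_dim)] for _ in range(n_patches)]
    queries = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_queries)]
    instruction = [rng.uniform(-1, 1) for _ in range(dim)]
    tokens = change_tokens(before, after, weights, queries, instruction)
    print(len(tokens), "change tokens of width", len(tokens[0]))
```

In the described system, the selected change features would then be aligned with a frozen large language model to generate the response; that step is omitted here because the abstract gives no detail on it.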
