Sagar Soni*, Akshay Dudhane*, Hiyam Debary*, Mustansar Fiaz*, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell Watson, Levente J. Klein, Salman Khan and Fahad Khan
* Equally contributing first authors
- Timeline: EarthDial is accepted to CVPR-25.
- Timeline: The EarthDial paper is released on arXiv: https://arxiv.org/abs/2412.15190
Automated analysis of vast Earth observation data via interactive Vision-Language Models (VLMs) can unlock new opportunities for environmental monitoring, disaster response, and resource management. Existing generic VLMs do not perform well on Remote Sensing data, while the recent Geo-spatial VLMs remain restricted to a fixed resolution and a few sensor modalities. In this paper, we introduce EarthDial, a conversational assistant specifically designed for Earth Observation (EO) data, transforming complex, multi-sensory Earth observations into interactive, natural language dialogues. EarthDial supports multi-spectral, multi-temporal, and multi-resolution imagery, enabling a wide range of remote sensing tasks, including classification, detection, captioning, question answering, visual reasoning, and visual grounding. To achieve this, we introduce an extensive instruction tuning dataset comprising over 11.11M instruction pairs covering RGB, Synthetic Aperture Radar (SAR), and multispectral modalities such as Near-Infrared (NIR) and infrared. Furthermore, EarthDial handles bi-temporal and multi-temporal sequence analysis for applications like change detection. Our extensive experimental results on 44 downstream datasets demonstrate that EarthDial outperforms existing generic and domain-specific models, achieving better generalization across various EO tasks.
- Clone this repository and navigate to the EarthDial folder
git clone https://github.com/hiyamdebary/EarthDial.git
cd EarthDial
- Install Package
conda create -n earthdial python=3.9 -y
conda activate earthdial
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages required for training
pip install flash-attn==2.3.6 --no-build-isolation
git pull
pip uninstall transformers
pip install -e .
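As an optional sanity check, you can verify that the core dependencies import correctly (a minimal sketch; it assumes the steps above completed without errors):
# Optional environment check after installation.
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)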
The EarthDial models are available on the Hugging Face Hub. You can download them using the huggingface_hub Python package.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="akshaydudhane/EarthDial_4B_RGB",
    repo_type="model",
    local_dir="checkpoints/EarthDial_4B_RGB"
)
Please check out our models for all public EarthDial checkpoints, and see the demo section for instructions on how to run the EarthDial demo.
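Once a checkpoint is downloaded, it can be loaded with the transformers library. The snippet below is a minimal sketch that assumes the checkpoint ships its own model class via remote code (as InternVL2-based checkpoints typically do); see the demo folder for the exact inference interface.
import torch
from transformers import AutoModel, AutoTokenizer

path = "checkpoints/EarthDial_4B_RGB"  # local directory from the download step above
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,  # reduce memory footprint on a single GPU
    trust_remote_code=True,      # load the model class bundled with the checkpoint
).eval().cuda()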
Before downloading the datasets, ensure that the following directory structure exists:
mkdir -p src/earthdial/trainset
Download the training data from the Hugging Face repo. Use the following Python script to download the training set:
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="akshaydudhane/EarthDial-Dataset",
    repo_type="dataset",
    allow_patterns="training_set/**",
    local_dir=""
)
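To inspect what the dataset repository contains before downloading everything, you can list its files first (an optional sketch using the huggingface_hub list_repo_files helper):
from huggingface_hub import list_repo_files

# List the repository contents without downloading them.
files = list_repo_files("akshaydudhane/EarthDial-Dataset", repo_type="dataset")
train_files = [f for f in files if f.startswith("training_set/")]
print(f"{len(train_files)} files under training_set/")
print(train_files[:5])  # preview a few entries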
We train EarthDial on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
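For example, with the EarthDial-4B global batch size of 128, the per-device batch size and accumulation steps can be traded off against the number of GPUs (the per-device values below are illustrative, not the exact training configuration):
# global batch size = per_device_train_batch_size x gradient_accumulation_steps x num_gpus
# 8 GPUs: 16 per device x 1 accumulation step  x 8 = 128
# 2 GPUs: 16 per device x 4 accumulation steps x 2 = 128
per_device_train_batch_size = 16
gradient_accumulation_steps = 4
num_gpus = 2
assert per_device_train_batch_size * gradient_accumulation_steps * num_gpus == 128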
The hyperparameters used in pretraining and finetuning are provided below.
Model | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
---|---|---|---|---|---|
EarthDial-4B | 128 | 4e-5 | 1 | 4096 | 0.1 |
We evaluate EarthDial on a diverse set of 44 benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding. We do not evaluate using beam search, to keep the inference process consistent with the real-time outputs of the chat demo. The evaluation script will be updated soon.
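For reference, greedy decoding corresponds to the following transformers generation settings (a minimal sketch; the released evaluation script may set additional parameters):
from transformers import GenerationConfig

# Greedy decoding: no sampling, single beam, so outputs are deterministic.
generation_config = GenerationConfig(
    do_sample=False,
    num_beams=1,
    max_new_tokens=512,  # illustrative cap on the response length
)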
- RS multimodal instruction-following dataset. We present a novel data-generation pipeline that leverages existing object detection datasets to create short descriptions of the images and then builds conversations from the generated text alone. We further add visual question-answering and scene-classification abilities using their corresponding datasets. This results in a total of 11.11M instruction pairs for the RS domain.
- EarthDial. Leveraging our dataset, we fine-tune InternVL2 to create the remote-sensing vision-language model EarthDial. Our three-stage fine-tuning is efficient and avoids forgetting the necessary context embedded in the RGB, multi-temporal, and multi-spectral datasets; an MLP projection is trained to align images into the word-embedding space of the LLM (Phi-3 Mini), as sketched after this list. This allows EarthDial to retain the conversation and instruction-following abilities of SoTA VLMs and extend its domain knowledge to remote sensing tasks.
- Evaluation benchmark. We also address the lack of evaluation benchmarks to assess the capability of existing VLMs on remote-sensing conversations. To this end, we set up evaluation protocols for conversation grounding in RS, as well as a suite of tasks to allow comparisons with future efforts in this direction. We show various supervised and zero-shot evaluations for different remote sensing tasks, including image captioning, visual question answering, and scene classification, to demonstrate the generalizability of the EarthDial conversational VLM.
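As a rough illustration of the alignment step described above, the vision-to-language projection can be sketched as a small MLP that maps vision-encoder features into the LLM embedding dimension (the layer sizes below are assumptions for illustration, not the exact EarthDial configuration):
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Two-layer MLP that maps vision tokens into the LLM word-embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 3072):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vision_tokens)

projector = VisionToLLMProjector()
print(projector(torch.randn(1, 256, 1024)).shape)  # torch.Size([1, 256, 3072])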
EarthDial can accomplish multiple tasks for remote-sensing (RS) image comprehension in a unified framework. Given suitable task tokens and user queries, the model can generate visually grounded responses (text with corresponding object locations - shown on top), visual question answering on images and regions (top left and bottom right, respectively) as well as scene classification (top right) and normal natural language conversations (bottom). This makes it the first RS VLM with grounding capability.
An overview of EarthDial - the first grounded large vision-language model for remote sensing. Given an image input together with a user query, a visual backbone is first used to encode patch-level tokens at a higher resolution via interpolating positional encodings. A multi-layer perceptron (MLP) is used to adapt vision tokens to the language space suitable for input to a Large Language Model (Phi-3 Mini). Besides visual inputs, region locations can also be input to the model together with task-specific prompts that specify the desired task required by the user. Given this context, the LLM can generate natural language responses interleaved with corresponding object locations. EarthDial can perform multiple tasks as shown above, e.g., scene classification, image/region captioning, VQA, and grounded conversations.
Types of annotations available in the EarthDial instruction set. For a given RS image, we obtain object attribute and relationship information, referring expressions, and region captions, along with their corresponding region annotations. We created 11.11M RS instruction pairs; the data will be available soon.
Please refer to the demo folder for more information about the EarthDial demo.
@article{soni2024earthdial,
  title={EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues},
  author={Sagar Soni and Akshay Dudhane and Hiyam Debary and Mustansar Fiaz and Muhammad Akhtar Munir and Muhammad Sohail Danish and Paolo Fraccaro and Campbell Watson and Levente J. Klein and Salman Khan and Fahad Khan},
  journal={arXiv preprint arXiv:2412.15190},
  year={2025},
  url={https://arxiv.org/abs/2412.15190}
}
We thank IBM Research and MBZUAI for their collaborative support and guidance.