Sagar Soni*, Akshay Dudhane*, Hiyam Debary*, Mustansar Fiaz*, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell Watson, Levente J. Klein, Salman Khan and Fahad Khan
* Equally contributing first authors
- Timeline: EarthDial is accepted to CVPR-25.
- Timeline: The EarthDial paper is released on arXiv: https://arxiv.org/abs/2412.15190
Automated analysis of vast Earth observation data via interactive Vision-Language Models (VLMs) can unlock new opportunities for environmental monitoring, disaster response, and resource management. Existing generic VLMs do not perform well on Remote Sensing data, while the recent Geo-spatial VLMs remain restricted to a fixed resolution and a few sensor modalities. In this paper, we introduce EarthDial, a conversational assistant specifically designed for Earth Observation (EO) data, transforming complex, multi-sensory Earth observations into interactive, natural language dialogues. EarthDial supports multi-spectral, multi-temporal, and multi-resolution imagery, enabling a wide range of remote sensing tasks, including classification, detection, captioning, question answering, visual reasoning, and visual grounding. To achieve this, we introduce an extensive instruction tuning dataset comprising over 11.11M instruction pairs covering RGB, Synthetic Aperture Radar (SAR), and multispectral modalities such as Near-Infrared (NIR) and infrared. Furthermore, EarthDial handles bi-temporal and multi-temporal sequence analysis for applications like change detection. Our extensive experimental results on 44 downstream datasets demonstrate that EarthDial outperforms existing generic and domain-specific models, achieving better generalization across various EO tasks.
- Clone this repository and navigate to the EarthDial folder
git clone https://github.com/hiyamdebary/EarthDial.git
cd EarthDial
- Install Package
conda create -n earthdial python=3.9 -y
conda activate earthdial
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages required for training
pip install flash-attn==2.3.6 --no-build-isolation
git pull
pip uninstall transformers
pip install -e .
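As an optional sanity check, you can verify that the core dependencies import correctly (a minimal sketch; it assumes the steps above completed without errors):
# Optional environment check after installation.
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)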
The EarthDial models are available on the Hugging Face Hub. You can download them using the huggingface_hub Python package.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="akshaydudhane/EarthDial_4B_RGB",
    repo_type="model",
    local_dir="checkpoints/EarthDial_4B_RGB"
)
Please check out our models for all public EarthDial checkpoints, and see the demo section for instructions on how to run the EarthDial demo.
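Once a checkpoint is downloaded, it can be loaded with the transformers library. The snippet below is a minimal sketch that assumes the checkpoint ships its own model class via remote code (as InternVL2-based checkpoints typically do); see the demo folder for the exact inference interface.
import torch
from transformers import AutoModel, AutoTokenizer

path = "checkpoints/EarthDial_4B_RGB"  # local directory from the download step above
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,  # reduce memory footprint on a single GPU
    trust_remote_code=True,      # load the model class bundled with the checkpoint
).eval().cuda()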
Before downloading the datasets, ensure that the following directory structure exists:
mkdir -p src/earthdial/trainset
Download the training data from the Hugging Face repo. Use the following Python script to download the training set:
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="akshaydudhane/EarthDial-Dataset",
    repo_type="dataset",
    allow_patterns="training_set/**",
    local_dir=""
)
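To inspect what the dataset repository contains before downloading everything, you can list its files first (an optional sketch using the huggingface_hub list_repo_files helper):
from huggingface_hub import list_repo_files

# List the repository contents without downloading them.
files = list_repo_files("akshaydudhane/EarthDial-Dataset", repo_type="dataset")
train_files = [f for f in files if f.startswith("training_set/")]
print(f"{len(train_files)} files under training_set/")
print(train_files[:5])  # preview a few entries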
We train EarthDial on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
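For example, with the EarthDial-4B global batch size of 128, the per-device batch size and accumulation steps can be traded off against the number of GPUs (the per-device values below are illustrative, not the exact training configuration):
# global batch size = per_device_train_batch_size x gradient_accumulation_steps x num_gpus
# 8 GPUs: 16 per device x 1 accumulation step  x 8 = 128
# 2 GPUs: 16 per device x 4 accumulation steps x 2 = 128
per_device_train_batch_size = 16
gradient_accumulation_steps = 4
num_gpus = 2
assert per_device_train_batch_size * gradient_accumulation_steps * num_gpus == 128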
The hyperparameters used in pretraining and finetuning are provided below.
Model | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
---|---|---|---|---|---|
EarthDial-4B | 128 | 4e-5 | 1 | 4096 | 0.1 |
We evaluate EarthDial on a diverse set of 44 benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding. We do not evaluate using beam search, to keep the inference process consistent with the real-time outputs of the chat demo. The evaluation script will be updated soon.
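For reference, greedy decoding corresponds to the following transformers generation settings (a minimal sketch; the released evaluation script may set additional parameters):
from transformers import GenerationConfig

# Greedy decoding: no sampling, single beam, so outputs are deterministic.
generation_config = GenerationConfig(
    do_sample=False,
    num_beams=1,
    max_new_tokens=512,  # illustrative cap on the response length
)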
- RS multimodal instruction-following dataset. We present a novel data-generation pipeline that leverages existing object detection datasets to create short descriptions of the images and then builds conversations from the generated text alone. We further add visual question-answering and scene-classification abilities using their corresponding datasets. This results in a total of 11.11M instruction pairs for the RS domain.
- EarthDial. Leveraging our dataset, we fine-tune InternVL2 to create the remote-sensing vision-language model EarthDial. Our three-stage fine-tuning is efficient and avoids forgetting the necessary context embedded in the RGB, multi-temporal, and multi-spectral datasets; an MLP projection is trained to align images into the word-embedding space of the LLM (Phi-3 Mini), as sketched after this list. This allows EarthDial to retain the conversation and instruction-following abilities of SoTA VLMs and extend its domain knowledge to remote sensing tasks.
- Evaluation benchmark. We also address the lack of evaluation benchmarks to assess the capability of existing VLMs on remote-sensing conversations. To this end, we set up evaluation protocols for conversation grounding in RS, as well as a suite of tasks to allow comparisons with future efforts in this direction. We show various supervised and zero-shot evaluations for different remote sensing tasks, including image captioning, visual question answering, and scene classification, to demonstrate the generalizability of the EarthDial conversational VLM.
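As a rough illustration of the alignment step described above, the vision-to-language projection can be sketched as a small MLP that maps vision-encoder features into the LLM embedding dimension (the layer sizes below are assumptions for illustration, not the exact EarthDial configuration):
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Two-layer MLP that maps vision tokens into the LLM word-embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 3072):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vision_tokens)

projector = VisionToLLMProjector()
print(projector(torch.randn(1, 256, 1024)).shape)  # torch.Size([1, 256, 3072])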
EarthDial can accomplish multiple tasks for remote-sensing (RS) image comprehension in a unified framework. Given suitable task tokens and user queries, the model can generate visually grounded responses (text with corresponding object locations - shown on top), visual question answering on images and regions (top left and bottom right, respectively) as well as scene classification (top right) and normal natural language conversations (bottom). This makes it the first RS VLM with grounding capability.
An overview of EarthDial - the first grounded large vision-language model for remote sensing. Given an image input together with a user query, a visual backbone is first used to encode patch-level tokens at a higher resolution via interpolating positional encodings. A multi-layer perceptron (MLP) is used to adapt vision tokens to the language space suitable for input to a Large Language Model (Phi-3 Mini). Besides visual inputs, region locations can also be input to the model together with task-specific prompts that specify the desired task required by the user. Given this context, the LLM can generate natural language responses interleaved with corresponding object locations. EarthDial can perform multiple tasks as shown above, e.g., scene classification, image/region captioning, VQA, and grounded conversations.
Types of annotations available in the EarthDial instruction set. For a given RS image, we obtain object attribute and relationship information, referring expressions, and region captions, along with their corresponding region annotations. We created 11.11M RS instruction pairs; the data will be available soon.
Please refer to the demo folder for more information about the EarthDial demo.
@article{soni2024earthdial,
  title={EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues},
  author={Sagar Soni and Akshay Dudhane and Hiyam Debary and Mustansar Fiaz and Muhammad Akhtar Munir and Muhammad Sohail Danish and Paolo Fraccaro and Campbell Watson and Levente J. Klein and Salman Khan and Fahad Khan},
  journal={arXiv preprint arXiv:2412.15190},
  year={2025},
  url={https://arxiv.org/abs/2412.15190}
}
We thank IBM Research and MBZUAI for their collaborative support and guidance.