EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues [CVPR 2025]


* Equally contributing first authors

IBM Research Lab, Mohamed bin Zayed University of AI



πŸ“’ Latest Updates

  • EarthDial has been accepted at CVPR 2025 πŸŽ‰.
  • The EarthDial paper is released on arXiv: https://arxiv.org/abs/2412.15190. πŸ”₯πŸ”₯

EarthDial: Overview

Automated analysis of vast Earth observation data via interactive Vision-Language Models (VLMs) can unlock new opportunities for environmental monitoring, disaster response, and resource management. Existing generic VLMs do not perform well on remote sensing data, while recent geospatial VLMs remain restricted to a fixed resolution and a few sensor modalities. In this paper, we introduce EarthDial, a conversational assistant specifically designed for Earth Observation (EO) data, transforming complex, multi-sensory Earth observations into interactive, natural language dialogues. EarthDial supports multi-spectral, multi-temporal, and multi-resolution imagery, enabling a wide range of remote sensing tasks, including classification, detection, captioning, question answering, visual reasoning, and visual grounding. To achieve this, we introduce an extensive instruction tuning dataset comprising over 11.11M instruction pairs covering RGB, Synthetic Aperture Radar (SAR), and multispectral modalities such as Near-Infrared (NIR) and infrared. Furthermore, EarthDial handles bi-temporal and multi-temporal sequence analysis for applications like change detection. Our extensive experimental results on 44 downstream datasets demonstrate that EarthDial outperforms existing generic and domain-specific models, achieving better generalization across various EO tasks.


Contents

  • Install
  • EarthDial Weights and Demo
  • Train
  • Evaluation
  • Contributions
  • Architecture
  • Model Demo
  • Citation

Install

  1. Clone this repository and navigate to the EarthDial folder
git clone https://github.com/hiyamdebary/EarthDial.git
cd EarthDial
  2. Install the package
conda create -n earthdial python=3.9 -y
conda activate earthdial
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training
pip install flash-attn==2.3.6 --no-build-isolation
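
Optionally, you can sanity-check the environment after installation. This is a quick illustrative check, not part of the official setup:

# Quick environment sanity check (illustrative, not part of the official setup).
import torch
import flash_attn

print("torch:", torch.__version__)
print("flash-attn:", flash_attn.__version__)
print("CUDA available:", torch.cuda.is_available())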

Upgrade to the latest code base

git pull
pip uninstall transformers
pip install -e .

EarthDial Weights and Demo

The EarthDial models are available on the Hugging Face Hub.

EarthDial_4B_RGB

EarthDial_4B_MS

EarthDial_4B_Methane_UHI

🧩 Download Instructions

You can download them using the huggingface_hub Python package.

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="akshaydudhane/EarthDial_4B_RGB",
    repo_type="model",
    local_dir="checkpoints/EarthDial_4B_RGB"
)

Please check out our Hugging Face model page for all public EarthDial checkpoints, and see the demo section for instructions on how to run the EarthDial demo.
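
Once a checkpoint is downloaded, the snippet below is a minimal inference sketch. It assumes the checkpoint exposes an InternVL2-style transformers interface (loaded with trust_remote_code and queried through a chat method); the exact entry points and image preprocessing may differ, so please follow the demo folder for the supported workflow.

# Minimal inference sketch (assumptions: InternVL2-style remote code with a
# `chat` method; see the demo folder for the supported workflow).
import torch
from transformers import AutoModel, AutoTokenizer

path = "checkpoints/EarthDial_4B_RGB"  # downloaded with snapshot_download above

model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# `pixel_values` must hold the preprocessed image tiles expected by the vision
# backbone; the demo code contains the exact preprocessing pipeline.
# generation_config = dict(max_new_tokens=256, do_sample=False)
# response = model.chat(tokenizer, pixel_values, "Describe this scene.", generation_config)
# print(response)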

Train

Download Training Data

Before downloading the datasets, ensure that the following directory structure exists:

mkdir -p src/earthdial/trainset

Download the training data from the Hugging Face repo. Use the following Python script to download the training set:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="akshaydudhane/EarthDial-Dataset",
    repo_type="dataset",
    allow_patterns="training_set/**",
    local_dir=""
)

Stage 1: RS Conversational Pretraining

Stage 2: RS RGB and Temporal Finetuning

Stage 3: RS Multispectral and SAR Finetuning

We train EarthDial on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
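
For example (the per-device batch size below is an assumed value, not the released configuration), rebalancing the launch arguments for a 4-GPU machine while keeping the global batch size at 128 looks like this:

# Illustrative batch-size rebalancing; per_device_train_batch_size is an assumption.
num_gpus = 4                      # scaling down from the reference 8-GPU setup
per_device_train_batch_size = 4   # assumed per-device batch size
gradient_accumulation_steps = 128 // (per_device_train_batch_size * num_gpus)  # -> 8

global_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
assert global_batch_size == 128   # matches the EarthDial-4B setting in the table below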

Hyperparameters

The hyperparameters used in both pretraining and finetuning are provided below.

Hyperparameter Global Batch Size Learning rate Epochs Max length Weight decay
EarthDial-4B 128 4e-5 1 4096 0.1

Evaluation

We evaluate EarthDial on a diverse set of 44 benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding. We do not use beam search, so that the inference process is consistent with the real-time outputs of the chat demo. The evaluation scripts will be updated soon.
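
For reference, greedy decoding in Hugging Face transformers corresponds to turning off sampling and beam search. The values below are a generic sketch; the exact generation arguments used by our evaluation scripts may differ.

# Greedy decoding settings (illustrative; not the official evaluation config).
generation_config = dict(
    max_new_tokens=256,  # assumed output budget
    do_sample=False,     # no sampling -> deterministic, reproducible outputs
    num_beams=1,         # a single beam is equivalent to greedy decoding
)
# response = model.chat(tokenizer, pixel_values, question, generation_config)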

πŸ† Contributions

  • RS multimodal instruction-following dataset. We present a novel data generation pipeline that leverages existing object detection datasets to create short descriptions of the images and then builds conversations from the generated text alone. We further add visual question-answering and scene classification abilities using their corresponding datasets. This results in a total of 11.11M instruction pairs for the RS domain.

  • EarthDial. Leveraging our dataset, we fine-tune InternVL2 to create the remote-sensing-domain vision-language model EarthDial. Our three-stage fine-tuning is efficient and avoids forgetting the context embedded in the RGB, multi-temporal, and multi-spectral datasets; an MLP projection is trained to align image features with the word embedding space of the LLM (Phi-3 Mini). This allows EarthDial to retain the conversation and instruction-following abilities of state-of-the-art VLMs while extending its domain knowledge to remote sensing tasks.

  • Evaluation Benchmark. We also address the lack of evaluation benchmarks for assessing the capability of existing VLMs on remote-sensing conversations. To this end, we set up evaluation protocols for conversation grounding in RS, along with a suite of tasks to enable comparisons with future efforts in this direction. We report both supervised and zero-shot evaluations for different remote sensing tasks, including image captioning, visual question answering, and scene classification, to demonstrate the generalizability of the EarthDial conversational VLM.


πŸ‘οΈπŸ’¬ EarthDial : Grounded Large Vision-Language Model for Remote Sensing

EarthDial can accomplish multiple tasks for remote-sensing (RS) image comprehension in a unified framework. Given suitable task tokens and user queries, the model can generate visually grounded responses (text with corresponding object locations - shown on top), visual question answering on images and regions (top left and bottom right, respectively) as well as scene classification (top right) and normal natural language conversations (bottom). This makes it the first RS VLM with grounding capability.


πŸ›°οΈ EarthDial : Architecture

An overview of EarthDial - the first grounded large vision-language model for remote sensing. Given an image input together with a user query, a visual backbone is first used to encode patch-level tokens at a higher resolution via interpolating positional encodings. A multi-layer perceptron (MLP) is used to adapt vision tokens to the language space suitable for input to a Large Language Model (Phi-3 Mini). Besides visual inputs, region locations can also be input to the model together with task-specific prompts that specify the desired task required by the user. Given this context, the LLM can generate natural language responses interleaved with corresponding object locations. EarthDial can perform multiple tasks as shown above, e.g., scene classification, image/region captioning, VQA, and grounded conversations.

EarthDial Architecture
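
For intuition only, here is a toy sketch of this vision-to-language alignment step. Module names and dimensions are assumptions for illustration (e.g., a 3072-dimensional LLM embedding space, as in Phi-3 Mini), not the released implementation.

# Toy sketch of projecting vision tokens into the LLM embedding space
# (illustrative shapes and module names; not the released implementation).
import torch
import torch.nn as nn

class ToyVisionToLLMProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=3072):
        super().__init__()
        # Two-layer MLP mapping patch-level vision features to LLM embeddings.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens):      # (batch, num_patches, vision_dim)
        return self.mlp(vision_tokens)     # (batch, num_patches, llm_dim)

vision_tokens = torch.randn(1, 256, 1024)   # stand-in for encoded image patches
text_embeddings = torch.randn(1, 32, 3072)  # stand-in for the embedded user query

projected = ToyVisionToLLMProjector()(vision_tokens)
llm_inputs = torch.cat([projected, text_embeddings], dim=1)  # sequence fed to the LLM
print(llm_inputs.shape)  # torch.Size([1, 288, 3072])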


πŸ” RS Multimodal Instruction Dataset

Types of annotations available in the EarthDial instruction set. For a given RS image, we obtain object attribute and relationship information, referring expressions, and region captions along with their corresponding region annotations. We created 11.11M RS instruction pairs; the data will be available soon.

Model Demo

Please refer to the demo folder for more information about the EarthDial demo.

EarthDial Architectural



πŸ“œ Citation

@article{soni2024earthdial,
  title={EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues}, 
  author={Sagar Soni and Akshay Dudhane and Hiyam Debary and Mustansar Fiaz and Muhammad Akhtar Munir and Muhammad Sohail Danish and Paolo Fraccaro and Campbell Watson and Levente J. Klein and Salman Khan and Fahad Khan},
  journal={arXiv preprint arXiv:2412.15190},
  year={2025},
  url={https://arxiv.org/abs/2412.15190}
}

πŸ™ Acknowledgement

We thank IBM Research and MBZUAI for their collaborative support and guidance.

