187 changes: 187 additions & 0 deletions 3.test_cases/pytorch/nanoVLM/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,187 @@
# NanoVLM Test Case

This test case demonstrates distributed training of [NanoVLM](https://github.com/huggingface/nanoVLM/), a repository for training and fine-tuning a small Vision-Language Model with a lightweight, pure-PyTorch implementation.


## 1. Prerequisites

This guide assumes that you have the following:

- A functional Slurm cluster on AWS. This test case also assumes that the cluster nodes run an Ubuntu-based OS.
- Docker, along with [Pyxis](https://github.com/NVIDIA/pyxis) and [Enroot](https://github.com/NVIDIA/enroot) for running containers under Slurm.
- An FSx for Lustre filesystem mounted at `/fsx` on all Slurm nodes. This test case also assumes that the home directory is shared across nodes.

Make sure that your current directory is under a shared filesystem such as `/fsx`.

## 2. Clone this repo

```bash
cd ~
git clone https://github.com/aws-samples/awsome-distributed-training/
cd awsome-distributed-training/3.test_cases/pytorch/nanoVLM/
```


## 3. Install Dependencies and Prepare Virtual Environment

Create a virtual environment and install the dependencies needed to download the dataset and test generation in the subsequent sections.

```bash
sudo apt install python3.10-venv
python3 -m venv nanoVLM-env
source nanoVLM-env/bin/activate
pip install torch numpy torchvision pillow datasets huggingface-hub transformers wandb einops accelerate loguru lmms_eval

```

## 4. Hugging Face token

First, create a Hugging Face account if you do not have one. Log in to your account and create an access token on the [tokens page](https://huggingface.co/settings/tokens).


### Log in with your Hugging Face token

```bash
huggingface-cli login
```

You will be prompted to input the token. Paste the token and answer `n` when asked to add the token as a git credential.

```

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible):
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /fsx/ubuntu/.cache/huggingface/token
Login successful
```

Then export the saved token as `HF_TOKEN` for use in the subsequent steps:

```bash
export HF_TOKEN=$(cat /path_where_the_token_is_saved_from_the_above_step)
```
For example:
```bash
export HF_TOKEN=$(cat /fsx/ubuntu/.cache/huggingface/token)
```
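
The same lookup can be mirrored in Python when a script needs the token. The following is a minimal sketch rather than part of this sample: `resolve_hf_token` is a hypothetical helper, and its default token path (`~/.cache/huggingface/token`) is an assumption based on the login output above, so it may differ on your cluster.

```python
import os
from pathlib import Path


def resolve_hf_token(env=None, token_file=None):
    """Return the token from HF_TOKEN, falling back to the saved token file.

    `token_file` defaults to ~/.cache/huggingface/token, which is an
    assumption based on the login output shown above.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if token:
        return token
    if token_file is not None:
        path = Path(token_file)
    else:
        path = Path.home() / ".cache" / "huggingface" / "token"
    if path.is_file():
        return path.read_text().strip()
    raise RuntimeError(f"HF_TOKEN is not set and no token file was found at {path}")
```

Preferring the environment variable keeps the behaviour consistent with the `export HF_TOKEN=...` step, while the file fallback covers shells where the export was not repeated.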

## 5. Clone the nanoVLM repository

```bash
git clone https://github.com/huggingface/nanoVLM.git
cd nanoVLM
git checkout 9de5e17ac2f4c578c32085131d966464cdd252b5
cd ..
```
This sample has been developed with the above commit hash.

## 6. Download the dataset required for the training

Specify the path where the dataset will be downloaded, for example:

```bash
export DATASET_DIR=$PWD/datasets/cauldron
```

The download job saves the datasets under `$DATASET_DIR`. By default it downloads the `clevr`, `vqav2`, and `docvqa` configs.

### (Optional) To download the entire dataset, set `configs` on line 24 of `slurm/download_dataset.sbatch` to the entry below:
the way you put this ### (Optional) here, it seems that the command sbatch download_dataset.sbatch is also optional. Put the Optional information as a markdown blockquote instead.


```python
from datasets import get_dataset_config_names  # add next to the existing load_dataset import
configs = get_dataset_config_names("HuggingFaceM4/the_cauldron")
```

```bash
cd slurm
sbatch download_dataset.sbatch
this is a mandatory command. Move out of this optional section.

Add how long it takes. I know it depends on the internet connection, but it gives the user an idea of how long this step runs. Mine took a total of 253 seconds.

Add an example output that the user can see on their log files:

Downloading 1/3: clevr
✓ Saved clevr in 111.5s
Downloading 2/3: vqav2
✓ Saved vqav2 in 100.7s
Downloading 3/3: docvqa

```
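
The download script catches per-config exceptions and only prints a `✗ Failed` line, so the job can finish even when a config was not saved. A minimal sketch of a follow-up check, assuming each config is saved to `$DATASET_DIR/<config>` as in `download_dataset.sbatch`; `missing_configs` is a hypothetical helper and is not part of this sample:

```python
from pathlib import Path


def missing_configs(dataset_dir, configs=("clevr", "vqav2", "docvqa")):
    """Return the configs whose directory is absent or empty under dataset_dir."""
    root = Path(dataset_dir)
    missing = []
    for name in configs:
        target = root / name
        # A config counts as present only if its directory exists and has content.
        if not target.is_dir() or not any(target.iterdir()):
            missing.append(name)
    return missing
```

For example, `missing_configs(os.environ["DATASET_DIR"])` (after `import os`) returns an empty list once all three default configs are on disk. It only checks that files exist and does not validate their contents.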

## 7. Update the dataset and checkpoint path in the NanoVLM config

```bash
cd ..
it's nice to tell the user where they are or they should land. Example:

Now, let's move back one folder where you have created the dataset folder.

cd ..

You should find yourself on the awsome-distributed-training/3.test_cases/pytorch/nanoVLM folder.

etc....

sed -i "s|train_dataset_path: str = '[^']*'|train_dataset_path: str = '$DATASET_DIR'|" $PWD/nanoVLM/models/config.py
```

Since this demo is just to showcase the workflow, we can also reduce the number of evaluation tasks from [mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa] to just using [mmstar,mmmu] with the command below:

```bash
sed -i "s/lmms_eval_tasks: str = 'mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa'/lmms_eval_tasks: str = 'mmstar,mmmu'/" $PWD/nanoVLM/models/config.py
```

```bash
export CHECKPOINT_DIR=$PWD/nanoVLM/checkpoints
explain what this line does. Add an explanation before telling the user to run this command.

```

```bash
sed -i "s|vlm_checkpoint_path: str = '[^']*'|vlm_checkpoint_path: str = '$CHECKPOINT_DIR'|" $PWD/nanoVLM/models/config.py
and then you can move this line to be on the same code block as the export CHECKPOINT_DIR.

```

Disable logging metrics to wandb for this sample:
```bash
sed -i "s/log_wandb: bool = True/log_wandb: bool = False/" $PWD/nanoVLM/models/config.py
```

### (Optional) If training and running evaluations on g5 instances, update the configuration as below to avoid OOM issues.
```bash

sed -i \
-e 's/lm_max_position_embeddings: int = 8192/lm_max_position_embeddings: int = 2048/' \
-e 's/lm_max_length: int = 8192/lm_max_length: int = 2048/' \
-e 's/max_img_size: int = 2048/max_img_size: int = 1024/' \
-e 's/data_cutoff_idx: int = None/data_cutoff_idx: int = 5000/' \
-e 's/gradient_accumulation_steps: int = 8/gradient_accumulation_steps: int = 4/' \
-e 's/eval_interval: int = 500/eval_interval: int = 50/' \
-e 's/stats_log_interval: int = 100/stats_log_interval: int = 10/' \
-e 's/max_training_steps: int = 80100/max_training_steps: int = 500/' \
-e 's/max_images_per_example: int = 8/max_images_per_example: int = 2/' \
-e 's/max_images_per_knapsack: int = 36/max_images_per_knapsack: int = 8/' \
-e 's/max_sample_length: int = 8192/max_sample_length: int = 2048/' \
-e 's/train_dataset_name: tuple\[str, ...\] = ("all", )/train_dataset_name: tuple[str, ...] = ("default",)/' \
-e 's/log_wandb: bool = True/log_wandb: bool = False/' \
-e "s/lmms_eval_tasks: str = 'mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa'/lmms_eval_tasks: str = 'mmstar,mmmu'/" \
$PWD/nanoVLM/models/config.py
```
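
`sed` exits successfully even when a pattern matches nothing, so if the defaults in `config.py` differ from the values assumed in the commands above, the file is left unchanged without any warning. The sketch below is one way to inspect the result. It assumes each field is declared on a single line as `name: type = value`, which is what the `sed` patterns imply; `read_config_fields` is a hypothetical helper, not part of nanoVLM.

```python
import re


def read_config_fields(source, names):
    """Map each field name to the raw text of its default value, or None if absent.

    Matches dataclass-style lines such as `log_wandb: bool = True`.
    """
    found = {}
    for name in names:
        pattern = rf"^\s*{re.escape(name)}\s*:[^=\n]*=\s*(.+?)\s*$"
        match = re.search(pattern, source, flags=re.MULTILINE)
        found[name] = match.group(1) if match else None
    return found


if __name__ == "__main__":
    # Illustrative input only; in practice pass the text of nanoVLM/models/config.py.
    sample = "    log_wandb: bool = False\n    lm_max_length: int = 2048\n"
    print(read_config_fields(sample, ["log_wandb", "lm_max_length", "eval_interval"]))
```

Comparing the returned values with the ones you intended to set shows which substitutions did not apply.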

## 8. Build and Configure the NanoVLM Job Container
The provided Dockerfile (`nanovlm.Dockerfile`) sets up the environment with all required dependencies:

```bash
docker build -t nanovlm:latest -f nanovlm.Dockerfile .
enroot import -o nanovlm.sqsh dockerd://nanovlm:latest
```
## 9. Launch Training

maybe some explanation what this sbatch script does?

```bash
cd slurm
sbatch launch_training.sbatch
```
Note the checkpoint path reported in the training job's `.out` log file. It is used in the subsequent sections for evaluation and generation.

For example:

```
/fsx/ubuntu/nanoVLM/checkpoints/nanoVLM_siglip2-base-patch16-512_1024_mp4_SmolLM2-360M-Instruct_2xGPU_5000samples_bs8_500_lr_vision_5e-05-language_5e-05-0.00512_0923-230408/step_450
```
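
The evaluation script is launched with `--steps 450`, so it helps to know which step directories a run actually produced. A minimal sketch, assuming checkpoints are laid out as `<run_dir>/step_<N>` as in the example path above; `latest_step_dir` is a hypothetical helper, not part of nanoVLM:

```python
import re
from pathlib import Path


def latest_step_dir(run_dir):
    """Return the step_<N> subdirectory with the largest N, or None if there is none."""
    best = None
    for child in Path(run_dir).iterdir():
        match = re.fullmatch(r"step_(\d+)", child.name)
        if child.is_dir() and match:
            step = int(match.group(1))
            if best is None or step > best[0]:
                best = (step, child)
    return best[1] if best else None
```

Steps are compared as integers because a plain string sort would rank `step_50` above `step_450`.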

## 10. Run evaluation

maybe some explanation what this sbatch script does?

```bash
sbatch launch_evaluation.sbatch
```

## 11. Test generation

```bash
cd ../nanoVLM
python generate.py --checkpoint $CHECKPOINT_DIR

```
12 changes: 12 additions & 0 deletions 3.test_cases/pytorch/nanoVLM/nanovlm.Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
FROM public.ecr.aws/hpc-cloud/nccl-tests:latest

RUN apt update && apt install -y nvtop

RUN pip install torch numpy torchvision pillow datasets huggingface-hub transformers wandb einops accelerate loguru lmms_eval
RUN pip install sagemaker-mlflow
RUN mkdir -p /nanoVLM
RUN ln -s /usr/bin/python3 /usr/bin/python

COPY nanoVLM/ /nanoVLM/

WORKDIR /nanoVLM
41 changes: 41 additions & 0 deletions 3.test_cases/pytorch/nanoVLM/slurm/download_dataset.sbatch
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
#!/bin/bash
#SBATCH --job-name=download_dataset
#SBATCH --output=logs/download_%A.out
is this an array job? I don't think so.... instead of %A and %a you can use %j. The name is cleaner that way.

#SBATCH --error=logs/download_%A.err
is this an array job? I don't think so.... instead of %A and %a you can use %j. The name is cleaner that way.

#SBATCH --nodes=2
do you need 2 nodes to download datasets?

#SBATCH --cpus-per-task=48
do you need to specify CPUs per task? And what if there are more or less than 48?

#SBATCH --partition=p5en
is this necessary? It will fail if the cluster doesn't have this partition name.


HF_TOKEN=$HF_TOKEN

mkdir -p logs
mkdir -p $DATASET_DIR


python3 << 'EOF'
from datasets import load_dataset
import os
import time

local_path = os.environ["DATASET_DIR"]

os.makedirs(local_path, exist_ok=True)

configs = ["clevr", "vqav2", "docvqa"]
total_start = time.time()

for i, config in enumerate(configs):
    print(f"Downloading {i+1}/{len(configs)}: {config}")
    start_time = time.time()
    try:
        dataset = load_dataset("HuggingFaceM4/the_cauldron", config)
        dataset.save_to_disk(f"{local_path}/{config}")
        elapsed = time.time() - start_time
        print(f"✓ Saved {config} in {elapsed:.1f}s")
    except Exception as e:
        elapsed = time.time() - start_time
        print(f"✗ Failed {config} after {elapsed:.1f}s: {e}")

total_elapsed = time.time() - total_start
print(f"Total time: {total_elapsed:.1f}s")
EOF
41 changes: 41 additions & 0 deletions 3.test_cases/pytorch/nanoVLM/slurm/launch_evaluation.sbatch
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
#!/bin/bash
#SBATCH --job-name=nanovlm_eval_workshop
#SBATCH --output=logs/eval/%A_%a.out
is this an array job? I don't think so.... instead of %A and %a you can use %j. The name is cleaner that way.

#SBATCH --error=logs/eval/%A_%a.err
is this an array job? I don't think so.... instead of %A and %a you can use %j. The name is cleaner that way.

#SBATCH --time=48:00:00
#SBATCH --nodes=1
#SBATCH --partition=p5en
is this necessary? It will fail if the cluster doesn't have this partition name.

#SBATCH --array=0

cd ..

mkdir -p logs/eval

export CONTAINER_IMAGE=$(pwd)/nanovlm.sqsh

export FSX_MOUNT=$(pwd):$(pwd)
export PYTHONPATH=$(pwd)/nanoVLM:$PYTHONPATH

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

declare -a ARGS=(
--container-image $CONTAINER_IMAGE
--container-mounts $FSX_MOUNT
)

export EVAL_SCRIPT=utils/run_checkpoint_evaluations.py

declare -a EVAL_ARGS=(
--checkpoints_dir $CHECKPOINT_DIR
--steps 450
--eval_tasks mmstar,mmmu
--batch_size 1
)

AUTO_RESUME=""
if [ -d "/opt/sagemaker_cluster" ]; then
    echo "Detected Hyperpod cluster.. enabling --auto-resume=1"
    AUTO_RESUME="--auto-resume=1"
fi

srun ${AUTO_RESUME} -l "${ARGS[@]}" python $EVAL_SCRIPT "${EVAL_ARGS[@]}"
57 changes: 57 additions & 0 deletions 3.test_cases/pytorch/nanoVLM/slurm/launch_training.sbatch
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
#!/bin/bash
#SBATCH --job-name=train_nanoVLM
#SBATCH --output=logs/train_nanoVLM/%A.out
is this an array job? I don't think so.... instead of %A and %a you can use %j. The name is cleaner that way.

#SBATCH --error=logs/train_nanoVLM/%A.err
is this an array job? I don't think so.... instead of %A and %a you can use %j. The name is cleaner that way.

#SBATCH --time=01:00:00
#SBATCH --nodes=4
#SBATCH --partition=p5en
don't need partition name. Will fail if doesn't exist.


GPUS_PER_NODE=8 #set to 1 for g5.8xlarge
should we have a step on the README explaining this? Maybe a command you run to setup this based on the instance you are running on?


cd ..

export CONTAINER_IMAGE=$(pwd)/nanovlm.sqsh

export FSX_MOUNT=$(pwd):$(pwd)


export NCCL_DEBUG=INFO
export FI_PROVIDER=efa
#export FI_EFA_USE_HUGE_PAGE=0 # Set to 0 when you see os.fork() causes OSError: Cannot allocate memory. Disabling huge page causes minor performance hit.
## Switching SYNC_MEMOPS to zero can boost throughput with FSDP
## Disables CU_POINTER_ATTRIBUTE_SYNC_MEMOPS
## Reduces memory synchronizations
## https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html
export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
# LD_PRELOAD is required for PyTorch to find the NCCL library
# This path assumes you are using the Deep Learning AMI
# If you are not using the DLAMI, you may need to update this path
export LD_PRELOAD=/usr/local/cuda-12.8/lib/libnccl.so
export NCCL_SOCKET_IFNAME=^docker,lo,veth,eth

declare -a ARGS=(
--container-image $CONTAINER_IMAGE
--container-mounts $FSX_MOUNT
)

declare -a TORCHRUN_ARGS=(
--nproc_per_node=$GPUS_PER_NODE
--nnodes=$SLURM_JOB_NUM_NODES
--rdzv_id=$SLURM_JOB_ID
--rdzv_backend=c10d
--rdzv_endpoint=$SLURMD_NODENAME:29500
)

export TRAIN_SCRIPT=train.py

declare -a TRAINING_ARGS=(
--no_log_wandb
)

AUTO_RESUME=""
if [ -d "/opt/sagemaker_cluster" ]; then
    echo "Detected Hyperpod cluster.. enabling --auto-resume=1"
    AUTO_RESUME="--auto-resume=1"
fi

srun ${AUTO_RESUME} -l "${ARGS[@]}" torchrun "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}"