182 changes: 182 additions & 0 deletions 3.test_cases/pytorch/nanoVLM/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,182 @@
# NanoVLM Test Case

This test case demonstrates distributed training of [NanoVLM](https://github.com/huggingface/nanoVLM/), a repository for training/finetuning a small sized Vision-Language Model with a lightweight implementation in pure PyTorch.


## 1. Prerequisites

This guide assumes that you have the following:

- A functional Slurm cluster on AWS whose nodes run an Ubuntu-based OS.
- Docker, plus [Pyxis](https://github.com/NVIDIA/pyxis) and [Enroot](https://github.com/NVIDIA/enroot) so that Slurm can run containers.
- An FSx for Lustre filesystem mounted on `/fsx` on all Slurm nodes. This test case also assumes that the home directory is a shared directory.

Make sure that your current directory is under a shared filesystem such as `/fsx`.

## 2. Install Dependencies and Prepare Virtual Environment

Create a virtual environment and install the dependencies needed to download the dataset and test generation in subsequent sections.

```bash
python3 -m venv nanoVLM-env
source nanoVLM-env/bin/activate
pip install torch numpy torchvision pillow datasets huggingface-hub transformers wandb einops accelerate loguru lmms_eval

```

## 3. Hugging Face token

First, create a Hugging Face account, log in, and create an access [token](https://huggingface.co/settings/tokens) from the Hugging Face Tokens page.

Save the token onto the head node:

### Get Hugging Face token

```bash
huggingface-cli login
```

You will be prompted to input the token. Paste the token and answer `n` when asked to add the token as a git credential.

```

_| _| _| _| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_|_|_| _|_| _|_|_| _|_|_|_|
_| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|
_|_|_|_| _| _| _| _|_| _| _|_| _| _| _| _| _| _|_| _|_|_| _|_|_|_| _| _|_|_|
_| _| _| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|
_| _| _|_| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _| _| _| _|_|_| _|_|_|_|

To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible):
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /fsx/ubuntu/.cache/huggingface/token
Login successful
```

The saved token is then made available to the jobs as `${HF_TOKEN}`; for example, `download_dataset.sbatch` exports it from the token file shown above.
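
As a rough illustration of how a script could pick up the token, the sketch below prefers an already exported `HF_TOKEN` and falls back to the token file written by `huggingface-cli login`. The helper name `resolve_hf_token` and the fallback order are assumptions made for this example, not something the test case ships.

```python
from pathlib import Path
from typing import Mapping, Optional

# Location written by `huggingface-cli login`; in the log above the home
# directory is /fsx/ubuntu, so this resolves to the path shown there.
DEFAULT_TOKEN_FILE = Path.home() / ".cache" / "huggingface" / "token"


def resolve_hf_token(env: Mapping[str, str], token_file: Path) -> Optional[str]:
    """Return the token from HF_TOKEN if set, else from the cached token file.

    `resolve_hf_token` is a hypothetical helper used only for illustration.
    """
    token = env.get("HF_TOKEN", "").strip()
    if token:
        return token
    if token_file.is_file():
        # An empty file is treated the same as a missing token.
        return token_file.read_text().strip() or None
    return None
```

A caller would typically pass `os.environ` and `DEFAULT_TOKEN_FILE`, and stop with a clear message when the result is `None`.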

## 4. Clone this repo

```bash
cd ~
git clone https://github.com/aws-samples/awsome-distributed-training/
cd awsome-distributed-training/3.test_cases/pytorch/nanoVLM/slurm
```

## 5. Download the dataset required for the training
The default dataset path is `/fsx/ubuntu/datasets/nanoVLM/cauldron`, and the default datasets are `clevr`, `vqav2`, and `docvqa`.

Can you please change the code so it takes a different directory as the base path? Then change the sbatch file to parse that variable and use it. The idea is to allow anyone to define the base path (e.g. /home/user or /lustre/ubuntu) instead of using a hard-coded choice.

Contributor Author

addressed


### (Optional) To download the entire dataset, set `configs` to the entry below:

Under the (optional) section you have a sbatch download_datasets.sbatch . Is that optional? Or optional is just editing the file to download additional datasets?

Contributor Author

addressed


```python
configs = get_dataset_config_names("HuggingFaceM4/the_cauldron")

where should I place this? Can you please add the existing line and how it should look?

Contributor Author

addressed

```

```bash
sbatch download_dataset.sbatch

this is a mandatory command. Move out of this optional section.


Add how long it takes. I know it depends on the internet connection, but it gives the user an idea of the time needed to run this step. Mine took a total of 253 seconds.

Add an example output that the user can see on their log files:

Downloading 1/3: clevr
✓ Saved clevr in 111.5s
Downloading 2/3: vqav2
✓ Saved vqav2 in 100.7s
Downloading 3/3: docvqa

```

```
Downloading 1/3: clevr
✓ Saved clevr in 113.5s
Downloading 2/3: vqav2
✓ Saved vqav2 in 101.2s
Downloading 3/3: docvqa
✓ Saved docvqa in 41.7s
Total time: 256.3s
```
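
One way the hard-coded `/fsx/ubuntu` prefix could be made configurable, as requested in the review, is to derive the dataset directory from an environment variable. The sketch below assumes a hypothetical variable named `NANOVLM_BASE_PATH`; neither that name nor the helper `dataset_dir` comes from the test case itself.

```python
import posixpath
from typing import Mapping

# Default taken from the paths used in this README.
DEFAULT_BASE_PATH = "/fsx/ubuntu"


def dataset_dir(env: Mapping[str, str]) -> str:
    """Build the cauldron dataset directory under a configurable base path.

    NANOVLM_BASE_PATH is a hypothetical variable name chosen for this sketch.
    """
    # Drop trailing slashes so the joined path has no doubled separators;
    # fall back to "/" if the base path was the filesystem root.
    base = env.get("NANOVLM_BASE_PATH", DEFAULT_BASE_PATH).rstrip("/") or "/"
    return posixpath.join(base, "datasets", "nanoVLM", "cauldron")
```

With nothing exported, this yields the default path used above; exporting `NANOVLM_BASE_PATH=/home/user` before submitting the job would move the datasets under `/home/user`.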

## 6. Clone the nanoVLM repository

```bash
cd ..

it's nice to tell the user where they are or they should land. Example:

Now, let's move back one folder where you have created the dataset folder.

cd ..

You should find yourself on the awsome-distributed-training/3.test_cases/pytorch/nanoVLM folder.

etc....

git clone https://github.com/huggingface/nanoVLM.git

make sure you clone a specific commit hash as the external repo might change without notice and the example stop working.

Contributor Author

addressed

cd nanoVLM
```

## 7. Update the dataset path in the config

```bash
sed -i "s|train_dataset_path: str = '[^']*'|train_dataset_path: str = '/fsx/ubuntu/datasets/nanoVLM/cauldron'|" /fsx/ubuntu/nanoVLM/nanoVLM/models/config.py

please allow for diff base path

Contributor Author

addressed

```
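
Note that `sed -i` exits successfully even when the pattern matches nothing, so a changed upstream `config.py` could leave the path untouched without any warning. A minimal Python sketch of the same substitution that fails loudly is shown below; the helper name `set_train_dataset_path` is an assumption for illustration, and the pattern is the one from the `sed` command above.

```python
import re


def set_train_dataset_path(config_source: str, dataset_path: str) -> str:
    """Rewrite the `train_dataset_path` default in the text of config.py.

    Hypothetical helper mirroring the sed expression above; raises if the
    expected line is not present instead of silently doing nothing.
    """
    pattern = r"train_dataset_path: str = '[^']*'"
    replacement = f"train_dataset_path: str = '{dataset_path}'"
    # A callable replacement keeps backslashes in the path literal.
    new_source, count = re.subn(pattern, lambda _match: replacement, config_source)
    if count == 0:
        raise ValueError("train_dataset_path assignment not found in config source")
    return new_source
```

To apply it, read `models/config.py` into a string, pass it through the function, and write the result back.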

Since this demo only showcases the workflow, we can also reduce the number of evaluation tasks from `[mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa]` to just `[mmstar,mmmu]` with the command below:

typo: redunce should be reduce

Contributor Author

addressed


```bash
sed -i "s/lmms_eval_tasks: str = 'mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa'/lmms_eval_tasks: str = 'mmstar,mmmu'/" /fsx/ubuntu/nanoVLM/nanoVLM/models/config.py

the path here should be ./awsome-distributed-training/3.test_cases/pytorch/nanoVLM/nanoVLM/models/config.py instead of /fsx/ubuntu/nanoVLM/nanoVLM/models/config.py

Contributor Author

addressed

```


### (Optional) If training and running evaluations on g5 instances, update the configuration as below to avoid OOM issues.
```bash

sed -i \
-e 's/lm_max_position_embeddings: int = 8192/lm_max_position_embeddings: int = 2048/' \
-e 's/lm_max_length: int = 8192/lm_max_length: int = 2048/' \
-e 's/max_img_size: int = 2048/max_img_size: int = 1024/' \
-e "s|vlm_checkpoint_path: str = 'checkpoints'|vlm_checkpoint_path: str = '/fsx/ubuntu/nanoVLM/checkpoints'|" \
-e 's/data_cutoff_idx: int = None/data_cutoff_idx: int = 5000/' \
-e 's/gradient_accumulation_steps: int = 8/gradient_accumulation_steps: int = 4/' \
-e 's/eval_interval: int = 500/eval_interval: int = 50/' \
-e 's/stats_log_interval: int = 100/stats_log_interval: int = 10/' \
-e 's/max_training_steps: int = 80100/max_training_steps: int = 500/' \
-e 's/max_images_per_example: int = 8/max_images_per_example: int = 2/' \
-e 's/max_images_per_knapsack: int = 36/max_images_per_knapsack: int = 8/' \
-e 's/max_sample_length: int = 8192/max_sample_length: int = 2048/' \
-e 's/train_dataset_name: tuple\[str, ...\] = ("all", )/train_dataset_name: tuple[str, ...] = ("default",)/' \
-e 's/log_wandb: bool = True/log_wandb: bool = False/' \
-e 's/use_lmms_eval: bool = True/use_lmms_eval: bool = False/' \
-e "s/lmms_eval_tasks: str = 'mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa'/lmms_eval_tasks: str = 'mmstar,mmmu'/" \
/fsx/ubuntu/nanoVLM/nanoVLM/models/config.py
```

## 8. Build and Configure the NanoVLM Job Container
The provided Dockerfile (`nanovlm.dockerfile`) sets up the environment with all required dependencies:

```bash
cd ..
docker build -t nanovlm:latest -f nanovlm.dockerfile .
enroot import -o nanovlm.sqsh dockerd://nanovlm:latest


```
## 9. Launch Training


maybe some explanation what this sbatch script does?

```bash
cd
sbatch launch_training.sbatch
```
Note the checkpoint path reported in the Slurm output log; it is used in the subsequent sections for evaluation and generation.

For example:

```
/fsx/ubuntu/nanoVLM/checkpoints/nanoVLM_siglip2-base-patch16-512_1024_mp4_SmolLM2-360M-Instruct_2xGPU_5000samples_bs8_500_lr_vision_5e-05-language_5e-05-0.00512_0923-230408/step_450
```
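
`launch_evaluation.sbatch` passes the checkpoint location as `--checkpoints_dir` and the step as a separate `--steps` value. Assuming those correspond to the run directory and the `step_<N>` suffix of a path like the one above, the split could be sketched as follows; `split_checkpoint_path` is a hypothetical helper, not part of nanoVLM.

```python
import posixpath
import re
from typing import Tuple


def split_checkpoint_path(path: str) -> Tuple[str, int]:
    """Split '<run_dir>/step_<N>' into the run directory and the integer step.

    Hypothetical helper; it assumes the final path component is 'step_<N>'.
    """
    run_dir, leaf = posixpath.split(path.rstrip("/"))
    match = re.fullmatch(r"step_(\d+)", leaf)
    if match is None:
        raise ValueError(f"expected a path ending in step_<N>, got: {path}")
    return run_dir, int(match.group(1))
```

For the example path above, the second element of the result is `450`, which matches the `--steps 450` value in the evaluation script.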

## 10. Run evaluation
Update the checkpoint directory in `launch_evaluation.sbatch` to the checkpoint generated above:

```
export CHECKPOINT_DIR="your-checkpoint-directory"
```


maybe some explanation what this sbatch script does?

```bash
cd
sbatch launch_evaluation.sbatch
```

## 11. Test generation
Export the checkpoint directory in your terminal

```
export CHECKPOINT_DIR="your-checkpoint-directory"
```

```bash
cd ..
python generate.py --checkpoint $CHECKPOINT_DIR

```
12 changes: 12 additions & 0 deletions 3.test_cases/pytorch/nanoVLM/nanovlm.dockerfile
@@ -0,0 +1,12 @@
FROM public.ecr.aws/hpc-cloud/nccl-tests:latest

RUN apt update && apt install -y nvtop

RUN pip install torch numpy torchvision pillow datasets huggingface-hub transformers wandb einops accelerate loguru lmms_eval

RUN mkdir -p /nanoVLM
RUN ln -s /usr/bin/python3 /usr/bin/python

COPY nanoVLM/ /nanoVLM/

WORKDIR /nanoVLM
45 changes: 45 additions & 0 deletions 3.test_cases/pytorch/nanoVLM/slurm/download_dataset.sbatch
@@ -0,0 +1,45 @@
#!/bin/bash
#SBATCH --job-name=download_dataset
#SBATCH --output=logs/download_%A.out

is this an array job? I don't think so.... instead of %A and %a you can use %j. The name is cleaner that way.

#SBATCH --error=logs/download_%A.err

is this an array job? I don't think so.... instead of %A and %a you can use %j. The name is cleaner that way.

#SBATCH --nodes=2

do you need 2 nodes to download datasets?

#SBATCH --cpus-per-task=48

do you need to specify CPUs per task? And what if there are more or fewer than 48?

#SBATCH --partition=p5en

is this necessary? It will fail if the cluster doesn't have this partition name.


export HF_TOKEN=$(cat /fsx/ubuntu/.cache/huggingface/token)

instead of sourcing that token from a cache directory, add instructions to the README asking the user to export HF_TOKEN and then you use the env variable on your sbatch file.

Contributor Author

addressed


cd /fsx/ubuntu/nanoVLM

allow users to define a different base path. Example: instead of /fsx/ubuntu I want to use /home/user.

Contributor Author

addressed


# The README creates the virtual environment as `nanoVLM-env`
source nanoVLM-env/bin/activate

mkdir -p logs
mkdir -p /fsx/ubuntu/datasets/nanoVLM/cauldron

same here. I want to use a different base path.



python3 << 'EOF'
from datasets import load_dataset
import os
import time

local_path = "/fsx/ubuntu/datasets/nanoVLM/cauldron"

and again, different base path.


os.makedirs(local_path, exist_ok=True)

configs = ["clevr", "vqav2", "docvqa"]
total_start = time.time()

for i, config in enumerate(configs):
    print(f"Downloading {i+1}/{len(configs)}: {config}")
    start_time = time.time()
    try:
        dataset = load_dataset("HuggingFaceM4/the_cauldron", config)
        dataset.save_to_disk(f"{local_path}/{config}")
        elapsed = time.time() - start_time
        print(f"✓ Saved {config} in {elapsed:.1f}s")
    except Exception as e:
        elapsed = time.time() - start_time
        print(f"✗ Failed {config} after {elapsed:.1f}s: {e}")

total_elapsed = time.time() - total_start
print(f"Total time: {total_elapsed:.1f}s")
EOF
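
The loop above catches every exception and keeps going, so the heredoc exits with status 0 even if all three downloads fail. A possible variation, sketched below, collects the failures and derives a non-zero exit code from them. The names `download_all` and `exit_code` are hypothetical, and `download_one` stands in for the `load_dataset` / `save_to_disk` pair.

```python
from typing import Callable, Iterable, List, Tuple


def download_all(
    configs: Iterable[str],
    download_one: Callable[[str], None],
) -> List[Tuple[str, str]]:
    """Run `download_one` for each config and return (config, error) pairs.

    Hypothetical helper: one failed config does not stop the others, but
    every failure is recorded so the caller can act on it afterwards.
    """
    failures: List[Tuple[str, str]] = []
    for config in configs:
        try:
            download_one(config)
        except Exception as exc:
            failures.append((config, str(exc)))
    return failures


def exit_code(failures: List[Tuple[str, str]]) -> int:
    """Return 1 if any config failed, otherwise 0."""
    return 1 if failures else 0
```

Ending the script with `sys.exit(exit_code(failures))` should let Slurm report the job as failed when a dataset is missing, since the heredoc is the last command in the batch script.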
43 changes: 43 additions & 0 deletions 3.test_cases/pytorch/nanoVLM/slurm/launch_evaluation.sbatch
@@ -0,0 +1,43 @@
#!/bin/bash
#SBATCH --job-name=nanovlm_eval_workshop
#SBATCH --output=logs/eval_new/%A_%a.out
#SBATCH --error=logs/eval_new/%A_%a.err
#SBATCH --time=48:00:00
#SBATCH --nodes=1
#SBATCH --partition=p5en

is this necessary? It will fail if the cluster doesn't have this partition name.

#SBATCH --array=0

cd /fsx/ubuntu/nanoVLM

another place to allow for different base path.

Contributor Author

addressed


export CONTAINER_IMAGE=$(pwd)/nanovlm.sqsh

export FSX_MOUNT=/fsx/ubuntu:/fsx/ubuntu

export PYTHONPATH=/fsx/ubuntu/nanoVLM:$PYTHONPATH

export CHECKPOINT_DIR="" # UPDATE


export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

declare -a ARGS=(
--container-image $CONTAINER_IMAGE
--container-mounts $FSX_MOUNT
)

export EVAL_SCRIPT=utils/run_checkpoint_evaluations.py

declare -a EVAL_ARGS=(
--checkpoints_dir $CHECKPOINT_DIR
--steps 450
--batch_size 1
)


AUTO_RESUME=""
if [ -d "/opt/sagemaker_cluster" ]; then
echo "Detected Hyperpod cluster.. enabling --auto-resume=1"
AUTO_RESUME="--auto-resume=1"
fi

srun ${AUTO_RESUME} -l "${ARGS[@]}" python $EVAL_SCRIPT "${EVAL_ARGS[@]}"
57 changes: 57 additions & 0 deletions 3.test_cases/pytorch/nanoVLM/slurm/launch_training.sbatch
@@ -0,0 +1,57 @@
#!/bin/bash
#SBATCH --job-name=train_nanoVLM
#SBATCH --output=logs/train_nanoVLM/%A.out

is this an array job? I don't think so.... instead of %A and %a you can use %j. The name is cleaner that way.

#SBATCH --error=logs/train_nanoVLM/%A.err

is this an array job? I don't think so.... instead of %A and %a you can use %j. The name is cleaner that way.

#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --partition=p5en

don't need partition name. Will fail if doesn't exist.


GPUS_PER_NODE=8 #set to 1 for g5.8xlarge

should we have a step on the README explaining this? Maybe a command you run to setup this based on the instance you are running on?


cd /fsx/ubuntu/nanoVLM

export CONTAINER_IMAGE=$(pwd)/nanovlm.sqsh

export FSX_MOUNT=/fsx/ubuntu:/fsx/ubuntu


export NCCL_DEBUG=INFO
export FI_PROVIDER=efa
#export FI_EFA_USE_HUGE_PAGE=0 # Set to 0 when you see os.fork() causes OSError: Cannot allocate memory. Disabling huge page causes minor performance hit.
## Switching SYNC_MEMOPS to zero can boost throughput with FSDP
## Disables CU_POINTER_ATTRIBUTE_SYNC_MEMOPS
## Reduces memory synchronizations
## https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html
export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
# LD_PRELOAD is required for PyTorch to find the NCCL library
# This path assumes you are using the Deep Learning AMI
# If you are not using the DLAMI, you may need to update this path
export LD_PRELOAD=/usr/local/cuda-12.8/lib/libnccl.so
export NCCL_SOCKET_IFNAME=^docker,lo,veth,eth

declare -a ARGS=(
--container-image $CONTAINER_IMAGE
--container-mounts $FSX_MOUNT
)

declare -a TORCHRUN_ARGS=(
--nproc_per_node=$GPUS_PER_NODE
--nnodes=$SLURM_JOB_NUM_NODES
--rdzv_id=$SLURM_JOB_ID
--rdzv_backend=c10d
--rdzv_endpoint=$SLURMD_NODENAME:29500
)

export TRAIN_SCRIPT=train.py

declare -a TRAINING_ARGS=(
--no_log_wandb
)

AUTO_RESUME=""
if [ -d "/opt/sagemaker_cluster" ]; then
echo "Detected Hyperpod cluster.. enabling --auto-resume=1"
AUTO_RESUME="--auto-resume=1"
fi

srun ${AUTO_RESUME} -l "${ARGS[@]}" torchrun "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}"
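
To make the relationship between `GPUS_PER_NODE`, `--nodes`, and the `torchrun` arguments explicit, the sketch below rebuilds the `TORCHRUN_ARGS` array from the same Slurm variables and computes the resulting number of training processes. The function names are assumptions for illustration; the flags and the port `29500` are the ones used in the script above.

```python
from typing import List, Mapping


def build_torchrun_args(
    env: Mapping[str, str], gpus_per_node: int, port: int = 29500
) -> List[str]:
    """Mirror the TORCHRUN_ARGS array from launch_training.sbatch.

    Hypothetical helper; `env` is expected to hold the Slurm variables
    that the batch script reads.
    """
    return [
        f"--nproc_per_node={gpus_per_node}",
        f"--nnodes={env['SLURM_JOB_NUM_NODES']}",
        f"--rdzv_id={env['SLURM_JOB_ID']}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={env['SLURMD_NODENAME']}:{port}",
    ]


def world_size(env: Mapping[str, str], gpus_per_node: int) -> int:
    """Total number of training processes: nodes times processes per node."""
    return int(env["SLURM_JOB_NUM_NODES"]) * gpus_per_node
```

With `--nodes=2` and `GPUS_PER_NODE=8` this gives 16 processes; setting `GPUS_PER_NODE=1`, as the script comment suggests for g5.8xlarge, gives 2.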