aws-samples · allela-roy · Sep 25, 2025 · Sep 30, 2025 · Oct 1, 2025 · Oct 7, 2025
diff --git a/3.test_cases/pytorch/nanoVLM/README.md b/3.test_cases/pytorch/nanoVLM/README.md
@@ -0,0 +1,187 @@
+# NanoVLM Test Case
+
+This test case demonstrates distributed training of [NanoVLM](https://github.com/huggingface/nanoVLM/), a lightweight, pure-PyTorch repository for training and fine-tuning a small Vision-Language Model.
+
+
+## 1. Prerequisites
+
+This guide assumes that you have the following:
+
+- A functional Slurm cluster on AWS whose nodes run an Ubuntu-based OS.
+- Docker, plus [Pyxis](https://github.com/NVIDIA/pyxis) and [Enroot](https://github.com/NVIDIA/enroot) for running containers under Slurm.
+- An FSx for Lustre filesystem mounted at `/fsx` on all Slurm nodes. This test case also assumes that the home directory is shared across nodes.
+
+Make sure that your current directory is on a shared filesystem such as `/fsx`.
+
+## 2. Clone this repo
+
+  ```bash
+  cd ~
+  git clone https://github.com/aws-samples/awsome-distributed-training/
+  cd awsome-distributed-training/3.test_cases/pytorch/nanoVLM/
+  ```
+
+
+## 3. Install Dependencies and Prepare Virtual Environment
+
+Create a virtual environment and install the dependencies needed to download the dataset and test generation in later sections.
+
+  ```bash
+  sudo apt install python3.10-venv
+  python3 -m venv nanoVLM-env
+  source nanoVLM-env/bin/activate
+  pip install torch numpy torchvision pillow datasets huggingface-hub transformers wandb einops accelerate loguru lmms_eval
+
+  ```
+
+## 4. Hugging Face token
+
+First, create a Hugging Face account, then log in and create an access token on the [Access Tokens](https://huggingface.co/settings/tokens) page.
+
+
+### Log in with your Hugging Face token
+
+```bash
+huggingface-cli login
+```
+
+You will be prompted to input the token. Paste the token and answer `n` when asked to add the token as a git credential.
+
+```
+
+    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
+    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
+    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
+    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
+    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
+
+    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
+Enter your token (input will not be visible): 
+Add token as git credential? (Y/n) n
+Token is valid (permission: read).
+Your token has been saved to /fsx/ubuntu/.cache/huggingface/token
+Login successful
+```
+
+Then export the saved token as `HF_TOKEN` for use in the subsequent steps:
+
+```bash
+export HF_TOKEN=$(cat /path_where_the_token_is_saved_from_the_above_step)
+```
+For example:
+```bash
+export HF_TOKEN=$(cat /fsx/ubuntu/.cache/huggingface/token)
+```
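+
+Before submitting jobs, it can help to confirm that `HF_TOKEN` is non-empty. The sketch below uses a hypothetical helper, `require_env`, that is not part of this repository; it relies on Bash indirect expansion (`${!name}`) and never prints the token itself.
+
+```bash
+# Hypothetical helper (not part of this repo): report whether the named
+# environment variable is set, without printing its value.
+require_env() {
+  local name="$1"
+  local value="${!name:-}"   # Bash indirect expansion: value of the variable called $name
+  if [ -z "$value" ]; then
+    echo "ERROR: $name is not set" >&2
+    return 1
+  fi
+  echo "$name is set (${#value} characters)"
+}
+
+require_env HF_TOKEN || echo "Run the export command above before continuing."
+```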
+
+## 5. Clone the nanoVLM repository
+
+```bash
+git clone https://github.com/huggingface/nanoVLM.git
+cd nanoVLM
+git checkout 9de5e17ac2f4c578c32085131d966464cdd252b5
+cd ..
+```
+This sample was developed against the commit hash above.
+
+## 6. Download the dataset required for the training
+
+Specify the path where the dataset will be downloaded, for example:
+
+```bash
+export DATASET_DIR=$PWD/datasets/cauldron
+```
+
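+The download job saves the datasets under `$DATASET_DIR`. By default it downloads the `clevr`, `vqav2`, and `docvqa` configurations of `HuggingFaceM4/the_cauldron`.
+
+### (Optional) To download the entire dataset, replace the `configs` assignment on line 24 of `slurm/download_dataset.sbatch` with the entry below. This also requires importing `get_dataset_config_names` from `datasets` in that script:
+
+```python
+from datasets import get_dataset_config_names
+configs = get_dataset_config_names("HuggingFaceM4/the_cauldron")
+```
+
+Submit the download job. Slurm does not create the directory for the `--output` and `--error` files, so create `logs` first:
+
+```bash
+cd slurm
+mkdir -p logs
+sbatch download_dataset.sbatch
+```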
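+
+Once the job finishes, each configuration should have its own subdirectory under `$DATASET_DIR`, because the download script saves to `{local_path}/{config}`. A minimal check, assuming the three default configurations; `check_datasets` is a hypothetical helper, not part of this repository:
+
+```bash
+# Hypothetical helper: report which dataset subdirectories exist under a root.
+check_datasets() {
+  local root="$1"; shift
+  local name
+  for name in "$@"; do
+    if [ -d "$root/$name" ]; then
+      echo "ok: $name"
+    else
+      echo "missing: $name"
+    fi
+  done
+}
+
+# Falls back to the example path from this section if DATASET_DIR is unset.
+check_datasets "${DATASET_DIR:-$PWD/datasets/cauldron}" clevr vqav2 docvqa
+```
+
+A `missing` entry usually means the corresponding download failed; the job's `.err` and `.out` files under `logs/` show the error printed by the script.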
+
+## 7. Update the dataset and checkpoint paths in the NanoVLM config
+
+```bash
+cd ..
+sed -i "s|train_dataset_path: str = '[^']*'|train_dataset_path: str = '$DATASET_DIR'|" $PWD/nanoVLM/models/config.py
+```
+
+Since this demo only showcases the workflow, you can also reduce the evaluation tasks from `mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa` to just `mmstar,mmmu` with the command below:
+
+```bash
+sed -i "s/lmms_eval_tasks: str = 'mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa'/lmms_eval_tasks: str = 'mmstar,mmmu'/" $PWD/nanoVLM/models/config.py
+```
+
+Set the checkpoint directory and write it into the config:
+```bash
+export CHECKPOINT_DIR=$PWD/nanoVLM/checkpoints
+```
+
+```bash
+sed -i "s|vlm_checkpoint_path: str = '[^']*'|vlm_checkpoint_path: str = '$CHECKPOINT_DIR'|" $PWD/nanoVLM/models/config.py
+```
+
+Disable logging metrics to wandb for this sample:
+```bash
+sed -i "s/log_wandb: bool = True/log_wandb: bool = False/" $PWD/nanoVLM/models/config.py
+```
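+
+Because `sed` exits successfully even when a pattern matches nothing, it is worth printing the edited fields to confirm the substitutions took effect. A small sketch, assuming the same `config.py` path as the commands above; `show_config_fields` is a hypothetical helper:
+
+```bash
+# Hypothetical helper: print (with line numbers) the config fields edited above.
+show_config_fields() {
+  grep -nE 'train_dataset_path|vlm_checkpoint_path|lmms_eval_tasks|log_wandb' "$1"
+}
+
+CONFIG_FILE="$PWD/nanoVLM/models/config.py"
+if [ -f "$CONFIG_FILE" ]; then
+  show_config_fields "$CONFIG_FILE" || echo "No matching fields found in $CONFIG_FILE"
+fi
+```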
+
+### (Optional) If training and running evaluations on g5 instances, update the configuration as shown below to avoid out-of-memory (OOM) issues.
+```bash
+
+sed -i \
+  -e 's/lm_max_position_embeddings: int = 8192/lm_max_position_embeddings: int = 2048/' \
+  -e 's/lm_max_length: int = 8192/lm_max_length: int = 2048/' \
+  -e 's/max_img_size: int = 2048/max_img_size: int = 1024/' \
+  -e 's/data_cutoff_idx: int = None/data_cutoff_idx: int = 5000/' \
+  -e 's/gradient_accumulation_steps: int = 8/gradient_accumulation_steps: int = 4/' \
+  -e 's/eval_interval: int = 500/eval_interval: int = 50/' \
+  -e 's/stats_log_interval: int = 100/stats_log_interval: int = 10/' \
+  -e 's/max_training_steps: int = 80100/max_training_steps: int = 500/' \
+  -e 's/max_images_per_example: int = 8/max_images_per_example: int = 2/' \
+  -e 's/max_images_per_knapsack: int = 36/max_images_per_knapsack: int = 8/' \
+  -e 's/max_sample_length: int = 8192/max_sample_length: int = 2048/' \
+  -e 's/train_dataset_name: tuple\[str, ...\] = ("all", )/train_dataset_name: tuple[str, ...] = ("default",)/' \
+  -e 's/log_wandb: bool = True/log_wandb: bool = False/' \
+  -e "s/lmms_eval_tasks: str = 'mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa'/lmms_eval_tasks: str = 'mmstar,mmmu'/" \
+  $PWD/nanoVLM/models/config.py
+```
+
+## 8. Build and Configure the NanoVLM Job Container
+The provided Dockerfile (`nanovlm.Dockerfile`) sets up the environment with all required dependencies:
+
+```bash
+docker build -t nanovlm:latest -f nanovlm.Dockerfile .
+enroot import -o nanovlm.sqsh  dockerd://nanovlm:latest
+```
+## 9. Launch Training
+
+```bash
+cd slurm
+mkdir -p logs/train_nanoVLM  # Slurm does not create the --output/--error directory
+sbatch launch_training.sbatch
+```
+Note the checkpoint path printed in the job's `.out` log file under `logs/train_nanoVLM/`; it is used in the subsequent sections for evaluation and generation.
+
+For example:
+
+```
+/fsx/ubuntu/nanoVLM/checkpoints/nanoVLM_siglip2-base-patch16-512_1024_mp4_SmolLM2-360M-Instruct_2xGPU_5000samples_bs8_500_lr_vision_5e-05-language_5e-05-0.00512_0923-230408/step_450
+```
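+
+If the log is no longer at hand, the checkpoint can also be located on disk. The sketch below assumes checkpoints are written as `step_<N>` directories somewhere under `$CHECKPOINT_DIR`, as in the example path above; `latest_step_dir` is a hypothetical helper that prints the directory with the highest step number across all runs under the given root.
+
+```bash
+# Hypothetical helper: print the step_<N> directory with the largest N under a root.
+latest_step_dir() {
+  find "$1" -type d -name 'step_*' \
+    | awk -F'step_' '{ print $NF "\t" $0 }' \
+    | sort -n \
+    | tail -n 1 \
+    | cut -f2-
+}
+
+if [ -d "${CHECKPOINT_DIR:-}" ]; then
+  latest_step_dir "$CHECKPOINT_DIR"
+fi
+```
+
+Note that `launch_evaluation.sbatch` passes `--steps 450`, so adjust that value if the step found here differs.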
+
+## 10. Run evaluation
+
+```bash
+mkdir -p logs/eval  # Slurm does not create the --output/--error directory
+sbatch launch_evaluation.sbatch
+```
+
+## 11. Test generation
+
+```bash
+cd ../nanoVLM
+python generate.py --checkpoint $CHECKPOINT_DIR
+
+```
diff --git a/3.test_cases/pytorch/nanoVLM/nanovlm.Dockerfile b/3.test_cases/pytorch/nanoVLM/nanovlm.Dockerfile
@@ -0,0 +1,12 @@
+FROM public.ecr.aws/hpc-cloud/nccl-tests:latest
+
+RUN apt update && apt install -y nvtop
+
+RUN pip install torch numpy torchvision pillow datasets huggingface-hub transformers wandb einops accelerate loguru lmms_eval
+RUN pip install sagemaker-mlflow
+RUN mkdir -p /nanoVLM 
+RUN ln -s /usr/bin/python3 /usr/bin/python
+
+COPY nanoVLM/ /nanoVLM/
+
+WORKDIR /nanoVLM
diff --git a/3.test_cases/pytorch/nanoVLM/slurm/download_dataset.sbatch b/3.test_cases/pytorch/nanoVLM/slurm/download_dataset.sbatch
@@ -0,0 +1,41 @@
+#!/bin/bash
+#SBATCH --job-name=download_dataset
+#SBATCH --output=logs/download_%A.out
+#SBATCH --error=logs/download_%A.err
+#SBATCH --nodes=2
+#SBATCH --cpus-per-task=48
+#SBATCH --partition=p5en
+
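+# HF_TOKEN and DATASET_DIR must be exported in the submitting shell;
+# sbatch propagates the environment to the job by default.
+export HF_TOKEN
+
+mkdir -p logs
+mkdir -p "$DATASET_DIR"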
+
+
+python3 << 'EOF'
+from datasets import load_dataset
+import os
+import time
+
+local_path = os.environ["DATASET_DIR"]
+
+os.makedirs(local_path, exist_ok=True)
+
+configs = ["clevr", "vqav2", "docvqa"]
+total_start = time.time()
+
+for i, config in enumerate(configs):
+    print(f"Downloading {i+1}/{len(configs)}: {config}")
+    start_time = time.time()
+    try:
+        dataset = load_dataset("HuggingFaceM4/the_cauldron", config)
+        dataset.save_to_disk(f"{local_path}/{config}")
+        elapsed = time.time() - start_time
+        print(f"✓ Saved {config} in {elapsed:.1f}s")
+    except Exception as e:
+        elapsed = time.time() - start_time
+        print(f"✗ Failed {config} after {elapsed:.1f}s: {e}")
+
+total_elapsed = time.time() - total_start
+print(f"Total time: {total_elapsed:.1f}s")
+EOF
diff --git a/3.test_cases/pytorch/nanoVLM/slurm/launch_evaluation.sbatch b/3.test_cases/pytorch/nanoVLM/slurm/launch_evaluation.sbatch
@@ -0,0 +1,41 @@
+#!/bin/bash
+#SBATCH --job-name=nanovlm_eval_workshop
+#SBATCH --output=logs/eval/%A_%a.out
+#SBATCH --error=logs/eval/%A_%a.err
+#SBATCH --time=48:00:00
+#SBATCH --nodes=1
+#SBATCH --partition=p5en
+#SBATCH --array=0
+
+cd ..
+
+mkdir -p logs/eval
+
+export CONTAINER_IMAGE=$(pwd)/nanovlm.sqsh
+
+export FSX_MOUNT=$(pwd):$(pwd)
+export PYTHONPATH=$(pwd)/nanoVLM:$PYTHONPATH
+
+export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+
+declare -a ARGS=(
+    --container-image $CONTAINER_IMAGE
+    --container-mounts $FSX_MOUNT
+)
+
+export EVAL_SCRIPT=utils/run_checkpoint_evaluations.py
+
+declare -a EVAL_ARGS=(
+    --checkpoints_dir $CHECKPOINT_DIR
+    --steps 450
+    --eval_tasks mmstar,mmmu 
+    --batch_size 1
+)
+
+AUTO_RESUME=""
+if [ -d "/opt/sagemaker_cluster" ]; then
+    echo "Detected Hyperpod cluster.. enabling --auto-resume=1"
+    AUTO_RESUME="--auto-resume=1"
+fi
+
+srun ${AUTO_RESUME} -l "${ARGS[@]}" python $EVAL_SCRIPT "${EVAL_ARGS[@]}"
diff --git a/3.test_cases/pytorch/nanoVLM/slurm/launch_training.sbatch b/3.test_cases/pytorch/nanoVLM/slurm/launch_training.sbatch
@@ -0,0 +1,57 @@
+#!/bin/bash
+#SBATCH --job-name=train_nanoVLM
+#SBATCH --output=logs/train_nanoVLM/%A.out
+#SBATCH --error=logs/train_nanoVLM/%A.err
+#SBATCH --time=01:00:00
+#SBATCH --nodes=4
+#SBATCH --partition=p5en
+
+GPUS_PER_NODE=8 #set to 1 for g5.8xlarge
+
+cd ..
+
+export CONTAINER_IMAGE=$(pwd)/nanovlm.sqsh
+
+export FSX_MOUNT=$(pwd):$(pwd)
+
+
+export NCCL_DEBUG=INFO
+export FI_PROVIDER=efa
+#export FI_EFA_USE_HUGE_PAGE=0    # Set to 0 when you see os.fork() causes OSError: Cannot allocate memory.  Disabling huge page causes minor performance hit.
+## Switching SYNC_MEMOPS to zero can boost throughput with FSDP
+## Disables CU_POINTER_ATTRIBUTE_SYNC_MEMOPS
+## Reduces memory synchronizations
+## https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html
+export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
+# LD_PRELOAD is required for PyTorch to find the NCCL library
+# This path assumes you are using the Deep Learning AMI
+# If you are not using the DLAMI, you may need to update this path
+export LD_PRELOAD=/usr/local/cuda-12.8/lib/libnccl.so
+export NCCL_SOCKET_IFNAME=^docker,lo,veth,eth
+
+declare -a ARGS=(
+    --container-image $CONTAINER_IMAGE
+    --container-mounts $FSX_MOUNT
+)
+
+declare -a TORCHRUN_ARGS=(
+    --nproc_per_node=$GPUS_PER_NODE
+    --nnodes=$SLURM_JOB_NUM_NODES
+    --rdzv_id=$SLURM_JOB_ID
+    --rdzv_backend=c10d
+    --rdzv_endpoint=$SLURMD_NODENAME:29500 
+)
+
+export TRAIN_SCRIPT=train.py
+
+declare -a TRAINING_ARGS=(
+    --no_log_wandb
+)
+
+AUTO_RESUME=""
+if [ -d "/opt/sagemaker_cluster" ]; then
+    echo "Detected Hyperpod cluster.. enabling --auto-resume=1"
+    AUTO_RESUME="--auto-resume=1"
+fi
+
+srun ${AUTO_RESUME}  -l "${ARGS[@]}" torchrun "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}"