diff --git a/3.test_cases/pytorch/nanoVLM/README.md b/3.test_cases/pytorch/nanoVLM/README.md
new file mode 100644
index 000000000..46d9a7a0a
--- /dev/null
+++ b/3.test_cases/pytorch/nanoVLM/README.md
@@ -0,0 +1,187 @@
+# NanoVLM Test Case
+
+This test case demonstrates distributed training of [NanoVLM](https://github.com/huggingface/nanoVLM/), a lightweight, pure-PyTorch repository for training and fine-tuning a small Vision-Language Model.
+
+
+## 1. Prerequisites
+
+This guide assumes that you have the following:
+
+- A functional Slurm cluster on AWS whose nodes run an Ubuntu-based OS.
+- Docker, plus [Pyxis](https://github.com/NVIDIA/pyxis) and [Enroot](https://github.com/NVIDIA/enroot) so that Slurm can run containers.
+- An FSx for Lustre filesystem mounted at `/fsx` on all Slurm nodes. The home directory is assumed to be on a shared filesystem as well.
+
+Make sure that your current directory is under a shared filesystem such as `/fsx`.
+
+## 2. Clone this repo
+
+```bash
+cd ~
+git clone https://github.com/aws-samples/awsome-distributed-training/
+cd awsome-distributed-training/3.test_cases/pytorch/nanoVLM/
+```
+
+
+## 3. Install Dependencies and Prepare Virtual Environment
+
+Create a virtual environment and install the dependencies needed to download the dataset and test generation in later sections.
+
+```bash
+sudo apt install python3.10-venv
+python3 -m venv nanoVLM-env
+source nanoVLM-env/bin/activate
+pip install torch numpy torchvision pillow datasets huggingface-hub transformers wandb einops accelerate loguru lmms_eval
+
+```
+
+## 4. Hugging Face token
+
+First, create a Hugging Face account, log in, and create an access [token](https://huggingface.co/settings/tokens).
+
+
+### Get Hugging Face token
+
+```bash
+huggingface-cli login
+```
+
+You will be prompted to input the token. Paste the token and answer `n` when asked to add the token as a git credential.
+
+```
+
+    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
+    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
+    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
+    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
+    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
+
+    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
+Enter your token (input will not be visible):
+Add token as git credential? (Y/n) n
+Token is valid (permission: read).
+Your token has been saved to /fsx/ubuntu/.cache/huggingface/token
+Login successful
+```
+
+Then export the saved token as `HF_TOKEN` for use in the subsequent steps:
+
+```bash
+export HF_TOKEN=$(cat /path_where_the_token_is_saved_from_the_above_step)
+```
+For example:
+```bash
+export HF_TOKEN=$(cat /fsx/ubuntu/.cache/huggingface/token)
+```
+
+## 5. Clone the nanoVLM repository
+
+```bash
+git clone https://github.com/huggingface/nanoVLM.git
+cd nanoVLM
+git checkout 9de5e17ac2f4c578c32085131d966464cdd252b5
+cd ..
+```
+This sample was developed against the commit hash above.
+
+## 6. Download the dataset required for training
+
+Specify the path to download the dataset to, for example:
+
+```bash
+export DATASET_DIR=$PWD/datasets/cauldron
+```
+
+The dataset is downloaded to `$DATASET_DIR`, and the default subsets are `["clevr", "vqav2", "docvqa"]`.
+
+### (Optional) To download the entire dataset, set `configs` on line 24 of `slurm/download_dataset.sbatch` to the entry below and import `get_dataset_config_names` from `datasets`:
+
+```python
+configs = get_dataset_config_names("HuggingFaceM4/the_cauldron")
+```
+
+```bash
+cd slurm
+sbatch download_dataset.sbatch
+```
+
+## 7. Update the dataset and checkpoint path in the NanoVLM config
+
+```bash
+cd ..
+sed -i "s|train_dataset_path: str = '[^']*'|train_dataset_path: str = '$DATASET_DIR'|" $PWD/nanoVLM/models/config.py
+```
+
+Since this demo only showcases the workflow, you can also reduce the evaluation tasks from `mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa` to just `mmstar,mmmu` with the command below:
+
+```bash
+sed -i "s/lmms_eval_tasks: str = 'mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa'/lmms_eval_tasks: str = 'mmstar,mmmu'/" $PWD/nanoVLM/models/config.py
+```
+
+```bash
+export CHECKPOINT_DIR=$PWD/nanoVLM/checkpoints
+```
+
+```bash
+sed -i "s|vlm_checkpoint_path: str = '[^']*'|vlm_checkpoint_path: str = '$CHECKPOINT_DIR'|" $PWD/nanoVLM/models/config.py
+```
+
+Disable logging metrics to wandb for this sample:
+```bash
+sed -i "s/log_wandb: bool = True/log_wandb: bool = False/" $PWD/nanoVLM/models/config.py
+```
+
+### (Optional) If training and running evaluations on g5 instances, update the configuration as shown below to avoid out-of-memory (OOM) errors.
+```bash
+
+sed -i \
+    -e 's/lm_max_position_embeddings: int = 8192/lm_max_position_embeddings: int = 2048/' \
+    -e 's/lm_max_length: int = 8192/lm_max_length: int = 2048/' \
+    -e 's/max_img_size: int = 2048/max_img_size: int = 1024/' \
+    -e 's/data_cutoff_idx: int = None/data_cutoff_idx: int = 5000/' \
+    -e 's/gradient_accumulation_steps: int = 8/gradient_accumulation_steps: int = 4/' \
+    -e 's/eval_interval: int = 500/eval_interval: int = 50/' \
+    -e 's/stats_log_interval: int = 100/stats_log_interval: int = 10/' \
+    -e 's/max_training_steps: int = 80100/max_training_steps: int = 500/' \
+    -e 's/max_images_per_example: int = 8/max_images_per_example: int = 2/' \
+    -e 's/max_images_per_knapsack: int = 36/max_images_per_knapsack: int = 8/' \
+    -e 's/max_sample_length: int = 8192/max_sample_length: int = 2048/' \
+    -e 's/train_dataset_name: tuple\[str, ...\] = ("all", )/train_dataset_name: tuple[str, ...] = ("default",)/' \
+    -e 's/log_wandb: bool = True/log_wandb: bool = False/' \
+    -e "s/lmms_eval_tasks: str = 'mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa'/lmms_eval_tasks: str = 'mmstar,mmmu'/" \
+    $PWD/nanoVLM/models/config.py
+```
+
+## 8. Build and Configure the NanoVLM Job Container
+The provided Dockerfile (`nanovlm.Dockerfile`) sets up the environment with all required dependencies:
+
+```bash
+docker build -t nanovlm:latest -f nanovlm.Dockerfile .
+enroot import -o nanovlm.sqsh dockerd://nanovlm:latest
+```
+## 9. Launch Training
+
+```bash
+cd slurm
+sbatch launch_training.sbatch
+```
+Note the checkpoint path reported in the Slurm job log files under `logs/train_nanoVLM/`; it is used in the evaluation and generation sections below.
+
+For example:
+
+```
+/fsx/ubuntu/nanoVLM/checkpoints/nanoVLM_siglip2-base-patch16-512_1024_mp4_SmolLM2-360M-Instruct_2xGPU_5000samples_bs8_500_lr_vision_5e-05-language_5e-05-0.00512_0923-230408/step_450
+```
+
+## 10. Run evaluation
+
+```bash
+sbatch launch_evaluation.sbatch
+```
+
+## 11. Test generation
+
+```bash
+cd ../nanoVLM
+python generate.py --checkpoint $CHECKPOINT_DIR
+
+```
diff --git a/3.test_cases/pytorch/nanoVLM/nanovlm.Dockerfile b/3.test_cases/pytorch/nanoVLM/nanovlm.Dockerfile
new file mode 100644
index 000000000..07854a3f8
--- /dev/null
+++ b/3.test_cases/pytorch/nanoVLM/nanovlm.Dockerfile
@@ -0,0 +1,12 @@
+FROM public.ecr.aws/hpc-cloud/nccl-tests:latest
+
+RUN apt update && apt install -y nvtop
+
+RUN pip install torch numpy torchvision pillow datasets huggingface-hub transformers wandb einops accelerate loguru lmms_eval
+RUN pip install sagemaker-mlflow
+RUN mkdir -p /nanoVLM
+RUN ln -s /usr/bin/python3 /usr/bin/python
+
+COPY nanoVLM/ /nanoVLM/
+
+WORKDIR /nanoVLM
diff --git a/3.test_cases/pytorch/nanoVLM/slurm/download_dataset.sbatch b/3.test_cases/pytorch/nanoVLM/slurm/download_dataset.sbatch
new file mode 100644
index 000000000..2dde95c01
--- /dev/null
+++ b/3.test_cases/pytorch/nanoVLM/slurm/download_dataset.sbatch
@@ -0,0 +1,41 @@
+#!/bin/bash
+#SBATCH --job-name=download_dataset
+#SBATCH --output=logs/download_%A.out
+#SBATCH --error=logs/download_%A.err
+#SBATCH --nodes=2
+#SBATCH --cpus-per-task=48
+#SBATCH --partition=p5en
+
+HF_TOKEN=$HF_TOKEN
+
+mkdir -p logs
+mkdir -p $DATASET_DIR
+
+
+python3 << 'EOF'
+from datasets import load_dataset
+import os
+import time
+
+local_path = os.environ["DATASET_DIR"]
+
+os.makedirs(local_path, exist_ok=True)
+
+configs = ["clevr", "vqav2", "docvqa"]
+total_start = time.time()
+
+for i, config in enumerate(configs):
+    print(f"Downloading {i+1}/{len(configs)}: {config}")
+    start_time = time.time()
+    try:
+        dataset = load_dataset("HuggingFaceM4/the_cauldron", config)
+        dataset.save_to_disk(f"{local_path}/{config}")
+        elapsed = time.time() - start_time
+        print(f"✓ Saved {config} in {elapsed:.1f}s")
+    except Exception as e:
+        elapsed = time.time() - start_time
+        print(f"✗ Failed {config} after {elapsed:.1f}s: {e}")
+
+total_elapsed = time.time() - total_start
+print(f"Total time: {total_elapsed:.1f}s")
+EOF
diff --git a/3.test_cases/pytorch/nanoVLM/slurm/launch_evaluation.sbatch b/3.test_cases/pytorch/nanoVLM/slurm/launch_evaluation.sbatch
new file mode 100644
index 000000000..c25de3423
--- /dev/null
+++ b/3.test_cases/pytorch/nanoVLM/slurm/launch_evaluation.sbatch
@@ -0,0 +1,41 @@
+#!/bin/bash
+#SBATCH --job-name=nanovlm_eval_workshop
+#SBATCH --output=logs/eval/%A_%a.out
+#SBATCH --error=logs/eval/%A_%a.err
+#SBATCH --time=48:00:00
+#SBATCH --nodes=1
+#SBATCH --partition=p5en
+#SBATCH --array=0
+
+cd ..
+
+mkdir -p logs/eval
+
+export CONTAINER_IMAGE=$(pwd)/nanovlm.sqsh
+
+export FSX_MOUNT=$(pwd):$(pwd)
+export PYTHONPATH=$(pwd)/nanoVLM:$PYTHONPATH
+
+export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+
+declare -a ARGS=(
+    --container-image $CONTAINER_IMAGE
+    --container-mounts $FSX_MOUNT
+)
+
+export EVAL_SCRIPT=utils/run_checkpoint_evaluations.py
+
+declare -a EVAL_ARGS=(
+    --checkpoints_dir $CHECKPOINT_DIR
+    --steps 450
+    --eval_tasks mmstar,mmmu
+    --batch_size 1
+)
+
+AUTO_RESUME=""
+if [ -d "/opt/sagemaker_cluster" ]; then
+    echo "Detected Hyperpod cluster.. enabling --auto-resume=1"
+    AUTO_RESUME="--auto-resume=1"
+fi
+
+srun ${AUTO_RESUME} -l "${ARGS[@]}" python $EVAL_SCRIPT "${EVAL_ARGS[@]}"
diff --git a/3.test_cases/pytorch/nanoVLM/slurm/launch_training.sbatch b/3.test_cases/pytorch/nanoVLM/slurm/launch_training.sbatch
new file mode 100644
index 000000000..ae09ee7d1
--- /dev/null
+++ b/3.test_cases/pytorch/nanoVLM/slurm/launch_training.sbatch
@@ -0,0 +1,57 @@
+#!/bin/bash
+#SBATCH --job-name=train_nanoVLM
+#SBATCH --output=logs/train_nanoVLM/%A.out
+#SBATCH --error=logs/train_nanoVLM/%A.err
+#SBATCH --time=01:00:00
+#SBATCH --nodes=4
+#SBATCH --partition=p5en
+
+GPUS_PER_NODE=8 #set to 1 for g5.8xlarge
+
+cd ..
+
+export CONTAINER_IMAGE=$(pwd)/nanovlm.sqsh
+
+export FSX_MOUNT=$(pwd):$(pwd)
+
+
+export NCCL_DEBUG=INFO
+export FI_PROVIDER=efa
+#export FI_EFA_USE_HUGE_PAGE=0 # Set to 0 when you see os.fork() causes OSError: Cannot allocate memory. Disabling huge page causes minor performance hit.
+## Switching SYNC_MEMOPS to zero can boost throughput with FSDP
+## Disables CU_POINTER_ATTRIBUTE_SYNC_MEMOPS
+## Reduces memory synchronizations
+## https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html
+export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
+# LD_PRELOAD is required for PyTorch to find the NCCL library
+# This path assumes you are using the Deep Learning AMI
+# If you are not using the DLAMI, you may need to update this path
+export LD_PRELOAD=/usr/local/cuda-12.8/lib/libnccl.so
+export NCCL_SOCKET_IFNAME=^docker,lo,veth,eth
+
+declare -a ARGS=(
+    --container-image $CONTAINER_IMAGE
+    --container-mounts $FSX_MOUNT
+)
+
+declare -a TORCHRUN_ARGS=(
+    --nproc_per_node=$GPUS_PER_NODE
+    --nnodes=$SLURM_JOB_NUM_NODES
+    --rdzv_id=$SLURM_JOB_ID
+    --rdzv_backend=c10d
+    --rdzv_endpoint=$SLURMD_NODENAME:29500
+)
+
+export TRAIN_SCRIPT=train.py
+
+declare -a TRAINING_ARGS=(
+    --no_log_wandb
+)
+
+AUTO_RESUME=""
+if [ -d "/opt/sagemaker_cluster" ]; then
+    echo "Detected Hyperpod cluster.. enabling --auto-resume=1"
+    AUTO_RESUME="--auto-resume=1"
+fi
+
+srun ${AUTO_RESUME} -l "${ARGS[@]}" torchrun "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}"
\ No newline at end of file
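
Note on section 7 of the README in this diff: the `sed` commands that patch `models/config.py` exit with status 0 even when a pattern matches nothing, so a default that differs from the hard-coded value would be skipped silently. A minimal sketch of a stricter alternative follows, assuming the config keeps the `name: type = default` layout that the `sed` patterns target. The helper `set_config_default` and the stand-in `sample` text are hypothetical and not part of nanoVLM or this PR; only the field names `log_wandb` and `max_training_steps` come from the README.

```python
import re


def set_config_default(source: str, field: str, value: str) -> str:
    """Rewrite the default of `field: <type> = <default>` and fail on no match.

    Hypothetical helper for illustration; it is not part of nanoVLM.
    """
    # Capture everything up to and including "= ", then replace the rest of the line.
    pattern = re.compile(
        rf"^([ \t]*{re.escape(field)}[ \t]*:[^=\n]*=[ \t]*).*$",
        re.MULTILINE,
    )
    # A callable replacement avoids backslash-escape surprises in `value`.
    new_source, count = pattern.subn(lambda m: m.group(1) + value, source, count=1)
    if count == 0:
        raise ValueError(f"field {field!r} not found; the config layout may have changed")
    return new_source


# Stand-in for a fragment of models/config.py (assumed layout, not copied from nanoVLM).
sample = (
    "class Config:\n"
    "    log_wandb: bool = True\n"
    "    max_training_steps: int = 80100\n"
)

patched = set_config_default(sample, "log_wandb", "False")
patched = set_config_default(patched, "max_training_steps", "500")
print(patched)
```

Matching on the field name rather than on the old default means the override still applies when the upstream default changes, and a renamed or removed field raises an error instead of passing unnoticed.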