Adding nanoVLM sample #864
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,187 @@ | ||
| # NanoVLM Test Case | ||
|
|
||
| This test case demonstrates distributed training of [NanoVLM](https://github.com/huggingface/nanoVLM/), a repository for training/finetuning a small sized Vision-Language Model with a lightweight implementation in pure PyTorch. | ||
|
|
||
|
|
||
| ## 1. Prerequisites | ||
|
|
||
| This guide assumes that you have the following: | ||
|
|
||
| - A functional Slurm cluster on AWS. This test case also assumes that the cluster nodes run an Ubuntu-based OS. | ||
| - Docker. For Slurm, [Pyxis](https://github.com/NVIDIA/pyxis) and [Enroot](https://github.com/NVIDIA/enroot) need to be installed as well. | ||
| - An FSx for Lustre filesystem mounted on `/fsx` on all Slurm nodes. This test case also assumes that the home directory is a shared directory. | ||
|
|
||
| Make sure that your current directory is under a shared filesystem such as `/fsx`. | ||
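One way to confirm that the working directory is on the shared filesystem before continuing is a small check like the one below. This is a minimal sketch: the helper name `is_under_shared_fs` is hypothetical, and `/fsx` is the mount point assumed by this guide (pass a different mount as the second argument if your shared directory lives elsewhere).

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if a path is the shared mount itself
# or lives somewhere beneath it. The mount defaults to /fsx (assumption).
is_under_shared_fs() {
  local path="$1"
  local mount="${2:-/fsx}"
  case "$path" in
    "$mount"|"$mount"/*) return 0 ;;
    *) return 1 ;;
  esac
}

if is_under_shared_fs "$PWD"; then
  echo "OK: $PWD is under /fsx"
else
  echo "WARNING: $PWD is not under /fsx" >&2
fi
```

The match is on the path prefix only, so it does not prove that the directory is actually mounted on every node; it only catches the common mistake of working from a node-local directory.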
|
|
||
| ## 2. Clone this repo | ||
|
|
||
| ```bash | ||
| cd ~ | ||
| git clone https://github.com/aws-samples/awsome-distributed-training/ | ||
| cd awsome-distributed-training/3.test_cases/pytorch/nanoVLM/ | ||
| ``` | ||
|
|
||
|
|
||
| ## 3. Install Dependencies and Prepare Virtual Environment | ||
|
|
||
| Create a virtual environment and install the dependencies needed to download the dataset and test generation in subsequent sections. | ||
|
|
||
| ```bash | ||
| sudo apt install python3.10-venv | ||
| python3 -m venv nanoVLM-env | ||
| source nanoVLM-env/bin/activate | ||
| pip install torch numpy torchvision pillow datasets huggingface-hub transformers wandb einops accelerate loguru lmms_eval | ||
|
|
||
| ``` | ||
|
|
||
| ## 4. Hugging Face token | ||
|
|
||
| First, create a Hugging Face account to retrieve a [token](https://huggingface.co/settings/tokens). Log in to your account and create an access token from Hugging Face Tokens. | ||
|
|
||
|
|
||
| ### Get Hugging Face token | ||
|
|
||
| ```bash | ||
| huggingface-cli login | ||
| ``` | ||
|
|
||
| You will be prompted to input the token. Paste the token and answer `n` when asked to add the token as a git credential. | ||
|
|
||
| ``` | ||
|
|
||
| _| _| _| _| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_|_|_| _|_| _|_|_| _|_|_|_| | ||
| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _| | ||
| _|_|_|_| _| _| _| _|_| _| _|_| _| _| _| _| _| _|_| _|_|_| _|_|_|_| _| _|_|_| | ||
| _| _| _| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _| | ||
| _| _| _|_| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _| _| _| _|_|_| _|_|_|_| | ||
|
|
||
| To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens . | ||
| Enter your token (input will not be visible): | ||
| Add token as git credential? (Y/n) n | ||
| Token is valid (permission: read). | ||
| Your token has been saved to /fsx/ubuntu/.cache/huggingface/token | ||
| Login successful | ||
| ``` | ||
|
|
||
| Then export the saved token as `HF_TOKEN` for use in the subsequent steps: | ||
|
|
||
| ```bash | ||
| export HF_TOKEN=$(cat /path_where_the_token_is_saved_from_the_above_step) | ||
| ``` | ||
| For example: | ||
| ```bash | ||
| export HF_TOKEN=$(cat /fsx/ubuntu/.cache/huggingface/token) | ||
| ``` | ||
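Before submitting any job that relies on the token, it can be worth checking that the variable was actually populated. The following is a minimal sketch; the helper name `token_looks_set` is hypothetical, and the check only verifies that the value is non-empty and free of whitespace, not that Hugging Face accepts it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the given string is non-empty and
# contains no whitespace (for example, from reading the wrong file).
token_looks_set() {
  case "$1" in
    ""|*[[:space:]]*) return 1 ;;
    *) return 0 ;;
  esac
}

if token_looks_set "${HF_TOKEN:-}"; then
  echo "HF_TOKEN is set (${#HF_TOKEN} characters)"
else
  echo "HF_TOKEN is empty or malformed; re-run the export step" >&2
fi
```

Printing only the length avoids echoing the token itself into terminal history or log files.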
|
|
||
| ## 5. Clone the nanoVLM repository | ||
|
|
||
| ```bash | ||
| git clone https://github.com/huggingface/nanoVLM.git | ||
| cd nanoVLM | ||
| git checkout 9de5e17ac2f4c578c32085131d966464cdd252b5 | ||
| cd .. | ||
| ``` | ||
| This sample has been developed with the above commit hash. | ||
|
|
||
| ## 6. Download the dataset required for the training | ||
|
|
||
| Specify the path to download the dataset to, for example: | ||
|
|
||
| ```bash | ||
| export DATASET_DIR=$PWD/datasets/cauldron | ||
| ``` | ||
|
|
||
| The default dataset path is `$DATASET_DIR`, and the default datasets are `["clevr", "vqav2", "docvqa"]`. | ||
|
|
||
| ### (Optional) To download the entire dataset, set `configs` to the entry below on line 24 of the `slurm/download_dataset.sbatch` file: | ||
|
|
||
| ```python | ||
| configs = get_dataset_config_names("HuggingFaceM4/the_cauldron") | ||
| ``` | ||
|
|
||
| ```bash | ||
| cd slurm | ||
| sbatch download_dataset.sbatch | ||
|
Review comment: this is a mandatory command. Move out of this
Review comment: Add how long it takes. I know it depends on the internet connection, but it gives the user an idea of the time needed to run this step. Mine took a total time of. Add an example output that the user can see in their log files:
||
| ``` | ||
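Once the download job has finished, a quick check that each configuration was saved can catch a partial download before training starts. This is a minimal sketch, assuming the layout produced by `dataset.save_to_disk(f"{local_path}/{config}")` in `download_dataset.sbatch` (one subdirectory per config); the helper name `missing_configs` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the name of every expected config whose
# directory is missing under the dataset root.
missing_configs() {
  local dataset_dir="$1"
  shift
  local config
  for config in "$@"; do
    [ -d "$dataset_dir/$config" ] || echo "$config"
  done
}

# Default configs taken from download_dataset.sbatch.
missing=$(missing_configs "${DATASET_DIR:-$PWD/datasets/cauldron}" clevr vqav2 docvqa)
if [ -z "$missing" ]; then
  echo "All dataset configs are present"
else
  echo "Missing dataset configs: $missing" >&2
fi
```

A directory that exists is not necessarily complete, so the job log is still the authoritative record of which configs succeeded or failed.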
|
|
||
| ## 7. Update the dataset and checkpoint path in the NanoVLM config | ||
|
|
||
| ```bash | ||
| cd .. | ||
|
Review comment: It's nice to tell the user where they are or where they should land. Example:
cd ..
etc....
||
| sed -i "s|train_dataset_path: str = '[^']*'|train_dataset_path: str = '$DATASET_DIR'|" $PWD/nanoVLM/models/config.py | ||
| ``` | ||
|
|
||
| Since this demo is just to showcase the workflow, we can also reduce the number of evaluation tasks from [mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa] to just using [mmstar,mmmu] with the command below: | ||
|
|
||
| ```bash | ||
| sed -i "s/lmms_eval_tasks: str = 'mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa'/lmms_eval_tasks: str = 'mmstar,mmmu'/" $PWD/nanoVLM/models/config.py | ||
| ``` | ||
|
|
||
| ```bash | ||
| export CHECKPOINT_DIR=$PWD/nanoVLM/checkpoints | ||
|
Review comment: Explain what this line does. Add an explanation before telling the user to run this command.
||
| ``` | ||
|
|
||
| ```bash | ||
| sed -i "s|vlm_checkpoint_path: str = '[^']*'|vlm_checkpoint_path: str = '$CHECKPOINT_DIR'|" $PWD/nanoVLM/models/config.py | ||
|
Review comment: And then you can move this line into the same code block as the `export CHECKPOINT_DIR`.
||
| ``` | ||
|
|
||
| Disable logging metrics to wandb for this sample: | ||
| ```bash | ||
| sed -i "s/log_wandb: bool = True/log_wandb: bool = False/" $PWD/nanoVLM/models/config.py | ||
| ``` | ||
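`sed -i` exits successfully and leaves the file unchanged when its pattern does not match, for example if the defaults in `config.py` differ from the commit pinned in this guide, so it can be useful to confirm each substitution. The following is a minimal sketch run against a throwaway file: the helper name `config_has` and the sample line are hypothetical, and only the `sed` expression is taken from this README.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the file contains the given fixed string.
config_has() {
  grep -qF -- "$2" "$1"
}

# Throwaway stand-in for nanoVLM/models/config.py (sample line is an assumption).
demo_cfg=$(mktemp)
printf '%s\n' "    log_wandb: bool = True" > "$demo_cfg"

# Same substitution as in the README, applied to the stand-in file.
sed -i "s/log_wandb: bool = True/log_wandb: bool = False/" "$demo_cfg"

if config_has "$demo_cfg" "log_wandb: bool = False"; then
  echo "substitution applied"
else
  echo "pattern did not match; check config.py manually" >&2
fi
```

The same check can be pointed at `$PWD/nanoVLM/models/config.py` with the expected `train_dataset_path` or `vlm_checkpoint_path` line after running the commands above.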
|
|
||
| ### (Optional) If training and running evaluations on g5 instances, update the configuration as below to avoid OOM issues. | ||
| ```bash | ||
|
|
||
| sed -i \ | ||
| -e 's/lm_max_position_embeddings: int = 8192/lm_max_position_embeddings: int = 2048/' \ | ||
| -e 's/lm_max_length: int = 8192/lm_max_length: int = 2048/' \ | ||
| -e 's/max_img_size: int = 2048/max_img_size: int = 1024/' \ | ||
| -e 's/data_cutoff_idx: int = None/data_cutoff_idx: int = 5000/' \ | ||
| -e 's/gradient_accumulation_steps: int = 8/gradient_accumulation_steps: int = 4/' \ | ||
| -e 's/eval_interval: int = 500/eval_interval: int = 50/' \ | ||
| -e 's/stats_log_interval: int = 100/stats_log_interval: int = 10/' \ | ||
| -e 's/max_training_steps: int = 80100/max_training_steps: int = 500/' \ | ||
| -e 's/max_images_per_example: int = 8/max_images_per_example: int = 2/' \ | ||
| -e 's/max_images_per_knapsack: int = 36/max_images_per_knapsack: int = 8/' \ | ||
| -e 's/max_sample_length: int = 8192/max_sample_length: int = 2048/' \ | ||
| -e 's/train_dataset_name: tuple\[str, ...\] = ("all", )/train_dataset_name: tuple[str, ...] = ("default",)/' \ | ||
| -e 's/log_wandb: bool = True/log_wandb: bool = False/' \ | ||
| -e "s/lmms_eval_tasks: str = 'mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa'/lmms_eval_tasks: str = 'mmstar,mmmu'/" \ | ||
| $PWD/nanoVLM/models/config.py | ||
| ``` | ||
|
|
||
| ## 8. Build and Configure the NanoVLM Job Container | ||
| The provided Dockerfile (`nanoVLM.Dockerfile`) will set up the environment with all required dependencies: | ||
|
|
||
| ```bash | ||
| docker build -t nanovlm:latest -f nanovlm.Dockerfile . | ||
| enroot import -o nanovlm.sqsh dockerd://nanovlm:latest | ||
| ``` | ||
| ## 9. Launch Training | ||
|
|
||
|
Review comment: Maybe some explanation of what this sbatch script does?
||
| ```bash | ||
| cd slurm | ||
| sbatch launch_training.sbatch | ||
| ``` | ||
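`launch_training.sbatch` hard-codes `GPUS_PER_NODE=8`, with a comment to set it to 1 for g5.8xlarge. One way to find the right value for a given instance type is to count the GPUs on a compute node. This is a minimal sketch, assuming `nvidia-smi` is installed on the compute nodes; the helper name `count_local_gpus` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count the GPUs visible on the current node, or print 0
# when nvidia-smi is unavailable (for example, on a CPU-only head node).
count_local_gpus() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # `nvidia-smi -L` prints one line per GPU, each starting with "GPU ".
    nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true
  else
    echo 0
  fi
}

echo "GPUs on this node: $(count_local_gpus)"
```

On a Slurm cluster the head node often has no GPUs, so the script would need to run on a compute node (for instance through `srun -N 1`) for the count to be meaningful.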
| Note the path where the checkpoints are written, as reported in the Slurm output log (`logs/train_nanoVLM/<job_id>.out`). This path is used in the subsequent sections for evaluation and generation. | ||
|
|
||
| For example: | ||
|
|
||
| ``` | ||
| /fsx/ubuntu/nanoVLM/checkpoints/nanoVLM_siglip2-base-patch16-512_1024_mp4_SmolLM2-360M-Instruct_2xGPU_5000samples_bs8_500_lr_vision_5e-05-language_5e-05-0.00512_0923-230408/step_450 | ||
| ``` | ||
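When a run has written several checkpoints, the most recent step directory can be picked out with a version-aware sort. This is a minimal sketch, assuming checkpoints are saved as `step_<N>` subdirectories as in the example path above; the helper name `latest_step_dir` is hypothetical, and `sort -V` requires GNU coreutils (available on Ubuntu).

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the step_* subdirectory with the highest step
# number directly under the given run directory (prints nothing if none exist).
latest_step_dir() {
  local run_dir="$1"
  # Version sort orders step_50 < step_100 < step_450, unlike a plain sort.
  find "$run_dir" -maxdepth 1 -type d -name 'step_*' | sort -V | tail -n 1
}

# Example usage (the run directory name is a placeholder):
#   latest_step_dir "$CHECKPOINT_DIR/<run_name>"
```

The step number in the result can then be compared with the `--steps` value passed in `launch_evaluation.sbatch`.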
|
|
||
| ## 10. Run evaluation | ||
|
|
||
|
Review comment: Maybe some explanation of what this sbatch script does?
||
| ```bash | ||
| sbatch launch_evaluation.sbatch | ||
| ``` | ||
|
|
||
| ## 11. Test generation | ||
|
|
||
| ```bash | ||
| cd ../nanoVLM | ||
| python generate.py --checkpoint $CHECKPOINT_DIR | ||
|
|
||
| ``` | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| FROM public.ecr.aws/hpc-cloud/nccl-tests:latest | ||
|
|
||
| RUN apt update && apt install -y nvtop | ||
|
|
||
| RUN pip install torch numpy torchvision pillow datasets huggingface-hub transformers wandb einops accelerate loguru lmms_eval | ||
| RUN pip install sagemaker-mlflow | ||
| RUN mkdir -p /nanoVLM | ||
| RUN ln -s /usr/bin/python3 /usr/bin/python | ||
|
|
||
| COPY nanoVLM/ /nanoVLM/ | ||
|
|
||
| WORKDIR /nanoVLM |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| #!/bin/bash | ||
| #SBATCH --job-name=download_dataset | ||
| #SBATCH --output=logs/download_%A.out | ||
|
Review comment: Is this an array job? I don't think so. Instead of `%A` and `%a` you can use `%j`. The name is cleaner that way.
||
| #SBATCH --error=logs/download_%A.err | ||
|
Review comment: Is this an array job? I don't think so. Instead of `%A` and `%a` you can use `%j`. The name is cleaner that way.
||
| #SBATCH --nodes=2 | ||
|
Review comment: Do you need 2 nodes to download datasets?
||
| #SBATCH --cpus-per-task=48 | ||
|
Review comment: Do you need to specify CPUs per task? And what if there are more or fewer than 48?
||
| #SBATCH --partition=p5en | ||
|
Review comment: Is this necessary? It will fail if the cluster doesn't have this partition name.
||
|
|
||
| HF_TOKEN=$HF_TOKEN | ||
|
|
||
| mkdir -p logs | ||
| mkdir -p $DATASET_DIR | ||
|
|
||
|
|
||
| python3 << 'EOF' | ||
| from datasets import load_dataset | ||
| import os | ||
| import time | ||
|
|
||
| local_path = os.environ["DATASET_DIR"] | ||
|
|
||
| os.makedirs(local_path, exist_ok=True) | ||
|
|
||
| configs = ["clevr", "vqav2", "docvqa"] | ||
| total_start = time.time() | ||
|
|
||
| for i, config in enumerate(configs): | ||
| print(f"Downloading {i+1}/{len(configs)}: {config}") | ||
| start_time = time.time() | ||
| try: | ||
| dataset = load_dataset("HuggingFaceM4/the_cauldron", config) | ||
| dataset.save_to_disk(f"{local_path}/{config}") | ||
| elapsed = time.time() - start_time | ||
| print(f"✓ Saved {config} in {elapsed:.1f}s") | ||
| except Exception as e: | ||
| elapsed = time.time() - start_time | ||
| print(f"✗ Failed {config} after {elapsed:.1f}s: {e}") | ||
|
|
||
| total_elapsed = time.time() - total_start | ||
| print(f"Total time: {total_elapsed:.1f}s") | ||
| EOF | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,41 @@ | ||
| #!/bin/bash | ||
| #SBATCH --job-name=nanovlm_eval_workshop | ||
| #SBATCH --output=logs/eval/%A_%a.out | ||
|
Review comment: Is this an array job? I don't think so.... instead of
||
| #SBATCH --error=logs/eval/%A_%a.err | ||
|
Review comment: Is this an array job? I don't think so. Instead of `%A` and `%a` you can use `%j`. The name is cleaner that way.
||
| #SBATCH --time=48:00:00 | ||
| #SBATCH --nodes=1 | ||
| #SBATCH --partition=p5en | ||
|
Review comment: Is this necessary? It will fail if the cluster doesn't have this partition name.
||
| #SBATCH --array=0 | ||
|
|
||
| cd .. | ||
|
|
||
| mkdir -p logs/eval | ||
|
|
||
| export CONTAINER_IMAGE=$(pwd)/nanovlm.sqsh | ||
|
|
||
| export FSX_MOUNT=$(pwd):$(pwd) | ||
| export PYTHONPATH=$(pwd)/nanoVLM:$PYTHONPATH | ||
|
|
||
| export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True | ||
|
|
||
| declare -a ARGS=( | ||
| --container-image $CONTAINER_IMAGE | ||
| --container-mounts $FSX_MOUNT | ||
| ) | ||
|
|
||
| export EVAL_SCRIPT=utils/run_checkpoint_evaluations.py | ||
|
|
||
| declare -a EVAL_ARGS=( | ||
| --checkpoints_dir $CHECKPOINT_DIR | ||
| --steps 450 | ||
| --eval_tasks mmstar,mmmu | ||
| --batch_size 1 | ||
| ) | ||
|
|
||
| AUTO_RESUME="" | ||
| if [ -d "/opt/sagemaker_cluster" ]; then | ||
| echo "Detected Hyperpod cluster.. enabling --auto-resume=1" | ||
| AUTO_RESUME="--auto-resume=1" | ||
| fi | ||
|
|
||
| srun ${AUTO_RESUME} -l "${ARGS[@]}" python $EVAL_SCRIPT "${EVAL_ARGS[@]}" | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,57 @@ | ||
| #!/bin/bash | ||
| #SBATCH --job-name=train_nanoVLM | ||
| #SBATCH --output=logs/train_nanoVLM/%A.out | ||
|
Review comment: Is this an array job? I don't think so. Instead of `%A` and `%a` you can use `%j`. The name is cleaner that way.
||
| #SBATCH --error=logs/train_nanoVLM/%A.err | ||
|
Review comment: Is this an array job? I don't think so. Instead of `%A` and `%a` you can use `%j`. The name is cleaner that way.
||
| #SBATCH --time=01:00:00 | ||
| #SBATCH --nodes=4 | ||
| #SBATCH --partition=p5en | ||
|
Review comment: The partition name isn't needed. The job will fail if the partition doesn't exist.
||
|
|
||
| GPUS_PER_NODE=8 #set to 1 for g5.8xlarge | ||
|
Review comment: Should we have a step in the README explaining this? Maybe a command you run to set this up based on the instance you are running on?
||
|
|
||
| cd .. | ||
|
|
||
| export CONTAINER_IMAGE=$(pwd)/nanovlm.sqsh | ||
|
|
||
| export FSX_MOUNT=$(pwd):$(pwd) | ||
|
|
||
|
|
||
| export NCCL_DEBUG=INFO | ||
| export FI_PROVIDER=efa | ||
| #export FI_EFA_USE_HUGE_PAGE=0 # Set to 0 when you see os.fork() causes OSError: Cannot allocate memory. Disabling huge page causes minor performance hit. | ||
| ## Switching SYNC_MEMOPS to zero can boost throughput with FSDP | ||
| ## Disables CU_POINTER_ATTRIBUTE_SYNC_MEMOPS | ||
| ## Reduces memory synchronizations | ||
| ## https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html | ||
| export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 | ||
| # LD_PRELOAD is required for PyTorch to find the NCCL library | ||
| # This path assumes you are using the Deep Learning AMI | ||
| # If you are not using the DLAMI, you may need to update this path | ||
| export LD_PRELOAD=/usr/local/cuda-12.8/lib/libnccl.so | ||
| export NCCL_SOCKET_IFNAME=^docker,lo,veth,eth | ||
|
|
||
| declare -a ARGS=( | ||
| --container-image $CONTAINER_IMAGE | ||
| --container-mounts $FSX_MOUNT | ||
| ) | ||
|
|
||
| declare -a TORCHRUN_ARGS=( | ||
| --nproc_per_node=$GPUS_PER_NODE | ||
| --nnodes=$SLURM_JOB_NUM_NODES | ||
| --rdzv_id=$SLURM_JOB_ID | ||
| --rdzv_backend=c10d | ||
| --rdzv_endpoint=$SLURMD_NODENAME:29500 | ||
| ) | ||
|
|
||
| export TRAIN_SCRIPT=train.py | ||
|
|
||
| declare -a TRAINING_ARGS=( | ||
| --no_log_wandb | ||
| ) | ||
|
|
||
| AUTO_RESUME="" | ||
| if [ -d "/opt/sagemaker_cluster" ]; then | ||
| echo "Detected Hyperpod cluster.. enabling --auto-resume=1" | ||
| AUTO_RESUME="--auto-resume=1" | ||
| fi | ||
|
|
||
| srun ${AUTO_RESUME} -l "${ARGS[@]}" torchrun "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" | ||
Review comment: The way you put this `### (Optional)` here, it seems that the command `sbatch download_dataset.sbatch` is also optional. Put the `Optional` information as a markdown `blockquote` instead.