Adding nanoVLM sample #864
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,182 @@ | ||
| # NanoVLM Test Case | ||
|
|
||
| This test case demonstrates distributed training of [NanoVLM](https://github.com/huggingface/nanoVLM/), a repository for training/finetuning a small sized Vision-Language Model with a lightweight implementation in pure PyTorch. | ||
|
|
||
|
|
||
| ## 1. Prerequisites | ||
|
|
||
| This guide assumes that you have the following: | ||
|
|
||
| - A functional Slurm cluster on AWS. This test case also assumes that the cluster nodes use an Ubuntu-based OS. | ||
| - Docker, [Pyxis](https://github.com/NVIDIA/pyxis) and [Enroot](https://github.com/NVIDIA/enroot) need to be installed as well. | ||
| - An FSx for Lustre filesystem mounted on `/fsx` in all Slurm nodes. Also, this test case assumes that the home directory is also a shared directory. | ||
|
|
||
| Make sure that your current directory is under a shared filesystem such as `/fsx`. | ||
|
|
||
| ## 2. Install Dependencies and Prepare Virtual Environment | ||
|
|
||
| Create a virtual environment and install the dependencies needed to download the dataset and test generation in subsequent sections. | ||
|
|
||
| ```bash | ||
| python3 -m venv nanoVLM-env | ||
| source nanoVLM-env/bin/activate | ||
| pip install torch numpy torchvision pillow datasets huggingface-hub transformers wandb einops accelerate loguru lmms_eval | ||
|
|
||
| ``` | ||
|
|
||
| ## 3. Hugging Face token | ||
|
|
||
| First, create a Hugging Face account, then log in and create an access [token](https://huggingface.co/settings/tokens). | ||
|
|
||
| Save the token onto the head node: | ||
|
|
||
| ### Get huggingface token | ||
|
|
||
| ```bash | ||
| huggingface-cli login | ||
| ``` | ||
|
|
||
| You will be prompted to input the token. Paste the token and answer `n` when asked to add the token as a git credential. | ||
|
|
||
| ``` | ||
|
|
||
| _| _| _| _| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_|_|_| _|_| _|_|_| _|_|_|_| | ||
| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _| | ||
| _|_|_|_| _| _| _| _|_| _| _|_| _| _| _| _| _| _|_| _|_|_| _|_|_|_| _| _|_|_| | ||
| _| _| _| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _| | ||
| _| _| _|_| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _| _| _| _|_|_| _|_|_|_| | ||
|
|
||
| To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens . | ||
| Enter your token (input will not be visible): | ||
| Add token as git credential? (Y/n) n | ||
| Token is valid (permission: read). | ||
| Your token has been saved to /fsx/ubuntu/.cache/huggingface/token | ||
| Login successful | ||
| ``` | ||
|
|
||
| Then use the saved token `${HF_TOKEN}` to create configuration. | ||
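One way to make the saved token available as `HF_TOKEN` is sketched below. It assumes the token file sits where the login output above reports it (`~/.cache/huggingface/token`, or `$HF_HOME/token` when `HF_HOME` is set). The helper name `load_hf_token` is hypothetical and not part of this PR.

```bash
# Hypothetical helper: export HF_TOKEN from the file written by `huggingface-cli login`.
# Assumes the default cache location unless HF_HOME is set or a path is passed in.
load_hf_token() {
    local token_file="${1:-${HF_HOME:-$HOME/.cache/huggingface}/token}"
    if [ ! -r "$token_file" ]; then
        echo "Token file not found: $token_file (run 'huggingface-cli login' first)" >&2
        return 1
    fi
    # Strip any trailing newline or carriage return before exporting.
    HF_TOKEN="$(tr -d '\r\n' < "$token_file")"
    export HF_TOKEN
}
```

Calling `load_hf_token` in the shell that submits the jobs would then expose the token to anything that reads `HF_TOKEN`.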
|
|
||
| ## 4. Clone this repo | ||
|
|
||
| ```bash | ||
| cd ~ | ||
| git clone https://github.com/aws-samples/awsome-distributed-training/ | ||
| cd awsome-distributed-training/3.test_cases/pytorch/nanoVLM/slurm | ||
| ``` | ||
|
|
||
| ## 5. Download the dataset required for the training | ||
| The default dataset path will be '/fsx/ubuntu/datasets/nanoVLM/cauldron' and the datasets are ["clevr", "vqav2", "docvqa"]. | ||
|
|
||
| ### (Optional) You can modify `download_dataset.sbatch` to download the entire dataset by setting `configs` to the entry below: | ||
|
||
|
|
||
| ```python | ||
| configs = get_dataset_config_names("HuggingFaceM4/the_cauldron") | ||
|
||
| ``` | ||
|
|
||
| ```bash | ||
| sbatch download_dataset.sbatch | ||
|
Review comment: This is a mandatory command. Move out of this.
Review comment: Add how long it takes. I know it depends on the internet connection, but it gives the user an idea of the time needed to run this step. Mine took a total time of. Add an example output that the user can see in their log files:
||
| ``` | ||
|
|
||
| ``` | ||
| Downloading 1/3: clevr | ||
| ✓ Saved clevr in 113.5s | ||
| Downloading 2/3: vqav2 | ||
| ✓ Saved vqav2 in 101.2s | ||
| Downloading 3/3: docvqa | ||
| ✓ Saved docvqa in 41.7s | ||
| Total time: 256.3s | ||
| ``` | ||
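After the job finishes, it can be worth confirming that each dataset config actually landed on disk, since the download script catches exceptions and keeps going. The check below is a minimal sketch; the function name `check_datasets` is hypothetical, and the directory layout (`<base>/<config>`) is taken from the `save_to_disk` call in `download_dataset.sbatch`.

```bash
# Hypothetical check: confirm each expected dataset config was saved to disk.
# Usage: check_datasets <base_dir> <config> [<config> ...]
check_datasets() {
    local base_dir="$1"
    shift
    local name
    local missing=0
    for name in "$@"; do
        if [ -d "$base_dir/$name" ]; then
            echo "found:   $base_dir/$name"
        else
            echo "missing: $base_dir/$name" >&2
            missing=1
        fi
    done
    # Non-zero exit status signals that at least one config is absent.
    return "$missing"
}
```

With the default path this would be invoked as `check_datasets /fsx/ubuntu/datasets/nanoVLM/cauldron clevr vqav2 docvqa`.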
|
|
||
| ## 6. Clone the nanoVLM repository | ||
|
|
||
| ```bash | ||
| cd .. | ||
|
Review comment: It's nice to tell the user where they are or where they should land. Example:
cd ..
etc....
||
| git clone https://github.com/huggingface/nanoVLM.git | ||
|
||
| cd nanoVLM | ||
| ``` | ||
|
|
||
| ## 7. Update the dataset path in the config | ||
|
|
||
| ```bash | ||
| sed -i "s|train_dataset_path: str = '[^']*'|train_dataset_path: str = '/fsx/ubuntu/datasets/nanoVLM/cauldron'|" /fsx/ubuntu/nanoVLM/nanoVLM/models/config.py | ||
|
||
| ``` | ||
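Because `sed -i` rewrites the file silently and does nothing when the pattern fails to match, it can help to try the expression on a scratch file first. The sketch below runs the same substitution on a made-up one-line sample; the sample value `some/default/path` is an assumption for illustration, not the actual default in `config.py`. It relies on GNU `sed`, as found on Ubuntu.

```bash
# Try the substitution on a scratch file before editing the real config.py.
# The sample line imitates the field the sed expression targets; its value is made up.
scratch="$(mktemp)"
printf "    train_dataset_path: str = 'some/default/path'\n" > "$scratch"

sed -i "s|train_dataset_path: str = '[^']*'|train_dataset_path: str = '/fsx/ubuntu/datasets/nanoVLM/cauldron'|" "$scratch"

# Print the resulting line so the change can be inspected.
grep -n "train_dataset_path" "$scratch"
```

The same `grep -n "train_dataset_path"` run against the real `config.py` after the edit shows whether the path was updated.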
|
|
||
| Since this demo only showcases the workflow, we can also reduce the number of evaluation tasks from [mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa] to just [mmstar,mmmu] with the command below: | ||
|
||
|
|
||
| ```bash | ||
| sed -i "s/lmms_eval_tasks: str = 'mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa'/lmms_eval_tasks: str = 'mmstar,mmmu'/" /fsx/ubuntu/nanoVLM/nanoVLM/models/config.py | ||
|
||
| ``` | ||
|
|
||
|
|
||
| ### (Optional) If training and running evaluations on g5 instances, update the configuration as below to avoid OOM issues. | ||
| ```bash | ||
|
|
||
| sed -i \ | ||
| -e 's/lm_max_position_embeddings: int = 8192/lm_max_position_embeddings: int = 2048/' \ | ||
| -e 's/lm_max_length: int = 8192/lm_max_length: int = 2048/' \ | ||
| -e 's/max_img_size: int = 2048/max_img_size: int = 1024/' \ | ||
| -e "s|vlm_checkpoint_path: str = 'checkpoints'|vlm_checkpoint_path: str = '/fsx/ubuntu/nanoVLM/checkpoints'|" \ | ||
| -e 's/data_cutoff_idx: int = None/data_cutoff_idx: int = 5000/' \ | ||
| -e 's/gradient_accumulation_steps: int = 8/gradient_accumulation_steps: int = 4/' \ | ||
| -e 's/eval_interval: int = 500/eval_interval: int = 50/' \ | ||
| -e 's/stats_log_interval: int = 100/stats_log_interval: int = 10/' \ | ||
| -e 's/max_training_steps: int = 80100/max_training_steps: int = 500/' \ | ||
| -e 's/max_images_per_example: int = 8/max_images_per_example: int = 2/' \ | ||
| -e 's/max_images_per_knapsack: int = 36/max_images_per_knapsack: int = 8/' \ | ||
| -e 's/max_sample_length: int = 8192/max_sample_length: int = 2048/' \ | ||
| -e 's/train_dataset_name: tuple\[str, ...\] = ("all", )/train_dataset_name: tuple[str, ...] = ("default",)/' \ | ||
| -e 's/log_wandb: bool = True/log_wandb: bool = False/' \ | ||
| -e 's/use_lmms_eval: bool = True/use_lmms_eval: bool = False/' \ | ||
| -e "s/lmms_eval_tasks: str = 'mmstar,mmmu,ocrbench,textvqa,docvqa,scienceqa,mme,infovqa'/lmms_eval_tasks: str = 'mmstar,mmmu'/" \ | ||
| /fsx/ubuntu/nanoVLM/nanoVLM/models/config.py | ||
| ``` | ||
|
|
||
| ## 8. Build and Configure the NanoVLM Job Container | ||
| The provided Dockerfile (`nanoVLM.Dockerfile`) will set up the environment with all required dependencies: | ||
|
|
||
| ```bash | ||
| cd .. | ||
| docker build -t nanovlm:latest -f nanovlm.Dockerfile . | ||
| enroot import -o nanovlm.sqsh dockerd://nanovlm:latest | ||
|
|
||
|
|
||
| ``` | ||
| ## 9. Launch Training | ||
|
|
||
|
Review comment: Maybe some explanation of what this sbatch script does?
||
| ```bash | ||
| cd | ||
| sbatch launch_training.sbatch | ||
| ``` | ||
| Note the path where the checkpoints are written, as reported in the Slurm output log file. It is used in the subsequent sections for evaluation and generation. | ||
|
|
||
| For example: | ||
|
|
||
| ``` | ||
| /fsx/ubuntu/nanoVLM/checkpoints/nanoVLM_siglip2-base-patch16-512_1024_mp4_SmolLM2-360M-Instruct_2xGPU_5000samples_bs8_500_lr_vision_5e-05-language_5e-05-0.00512_0923-230408/step_450 | ||
| ``` | ||
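To see which step was saved last inside a run directory, something like the sketch below could be used. The helper name `latest_step_dir` is hypothetical, and the `step_*` naming is inferred from the example path above. It relies on GNU `sort -V` for numeric-aware ordering.

```bash
# Hypothetical helper: print the highest-numbered step_* directory inside a run directory.
latest_step_dir() {
    local run_dir="$1"
    # sort -V is numeric-aware, so step_50 sorts before step_450 (GNU coreutils).
    find "$run_dir" -maxdepth 1 -type d -name 'step_*' | sort -V | tail -n 1
}
```

Passing the run directory (the part of the path before `/step_450` in the example) prints the newest step directory, or nothing if no checkpoint has been written yet.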
|
|
||
| ## 10. Run evaluation | ||
| Update the checkpoint directory in `launch_evaluation.sbatch` to the checkpoint generated above: | ||
|
|
||
| ``` | ||
| export CHECKPOINT_DIR="your-checkpoint-directory" | ||
| ``` | ||
|
|
||
|
Review comment: Maybe some explanation of what this sbatch script does?
||
| ```bash | ||
| cd | ||
| sbatch launch_evaluation.sbatch | ||
| ``` | ||
|
|
||
| ## 11. Test generation | ||
| Export the checkpoint directory in your terminal | ||
|
|
||
| ``` | ||
| export CHECKPOINT_DIR="your-checkpoint-directory" | ||
| ``` | ||
|
|
||
| ```bash | ||
| cd .. | ||
| python generate.py --checkpoint $CHECKPOINT_DIR | ||
|
|
||
| ``` | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| FROM public.ecr.aws/hpc-cloud/nccl-tests:latest | ||
|
|
||
| RUN apt update && apt install -y nvtop | ||
|
|
||
| RUN pip install torch numpy torchvision pillow datasets huggingface-hub transformers wandb einops accelerate loguru lmms_eval | ||
|
|
||
| RUN mkdir -p /nanoVLM | ||
| RUN ln -s /usr/bin/python3 /usr/bin/python | ||
|
|
||
| COPY nanoVLM/ /nanoVLM/ | ||
|
|
||
| WORKDIR /nanoVLM |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,45 @@ | ||
| #!/bin/bash | ||
| #SBATCH --job-name=download_dataset | ||
| #SBATCH --output=logs/download_%A.out | ||
|
Review comment: Is this an array job? I don't think so. Instead of %A and %a you can use %j. The name is cleaner that way.
||
| #SBATCH --error=logs/download_%A.err | ||
|
Review comment: Is this an array job? I don't think so. Instead of %A and %a you can use %j. The name is cleaner that way.
||
| #SBATCH --nodes=2 | ||
|
Review comment: Do you need 2 nodes to download datasets?
||
| #SBATCH --cpus-per-task=48 | ||
|
Review comment: Do you need to specify CPUs per task? And what if there are more or fewer than 48?
||
| #SBATCH --partition=p5en | ||
|
Review comment: Is this necessary? It will fail if the cluster doesn't have this partition name.
||
|
|
||
| export HF_TOKEN=$(cat /fsx/ubuntu/.cache/huggingface/token) | ||
|
||
|
|
||
| cd /fsx/ubuntu/nanoVLM | ||
|
||
|
|
||
| source nanoVLM-env/bin/activate | ||
|
|
||
| mkdir -p logs | ||
| mkdir -p /fsx/ubuntu/datasets/nanoVLM/cauldron | ||
|
||
|
|
||
|
|
||
| python3 << 'EOF' | ||
| from datasets import load_dataset | ||
| import os | ||
| import time | ||
|
|
||
| local_path = "/fsx/ubuntu/datasets/nanoVLM/cauldron" | ||
|
||
|
|
||
| os.makedirs(local_path, exist_ok=True) | ||
|
|
||
| configs = ["clevr", "vqav2", "docvqa"] | ||
| total_start = time.time() | ||
|
|
||
| for i, config in enumerate(configs): | ||
| print(f"Downloading {i+1}/{len(configs)}: {config}") | ||
| start_time = time.time() | ||
| try: | ||
| dataset = load_dataset("HuggingFaceM4/the_cauldron", config) | ||
| dataset.save_to_disk(f"{local_path}/{config}") | ||
| elapsed = time.time() - start_time | ||
| print(f"✓ Saved {config} in {elapsed:.1f}s") | ||
| except Exception as e: | ||
| elapsed = time.time() - start_time | ||
| print(f"✗ Failed {config} after {elapsed:.1f}s: {e}") | ||
|
|
||
| total_elapsed = time.time() - total_start | ||
| print(f"Total time: {total_elapsed:.1f}s") | ||
| EOF | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,43 @@ | ||
| #!/bin/bash | ||
| #SBATCH --job-name=nanovlm_eval_workshop | ||
| #SBATCH --output=logs/eval_new/%A_%a.out | ||
| #SBATCH --error=logs/eval_new/%A_%a.err | ||
| #SBATCH --time=48:00:00 | ||
| #SBATCH --nodes=1 | ||
| #SBATCH --partition=p5en | ||
|
Review comment: Is this necessary? It will fail if the cluster doesn't have this partition name.
||
| #SBATCH --array=0 | ||
|
|
||
| cd /fsx/ubuntu/nanoVLM | ||
|
||
|
|
||
| export CONTAINER_IMAGE=$(pwd)/nanovlm.sqsh | ||
|
|
||
| export FSX_MOUNT=/fsx/ubuntu:/fsx/ubuntu | ||
|
|
||
| export PYTHONPATH=/fsx/ubuntu/nanoVLM:$PYTHONPATH | ||
|
|
||
| export CHECKPOINT_DIR="" # UPDATE | ||
|
|
||
|
|
||
| export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True | ||
|
|
||
| declare -a ARGS=( | ||
| --container-image $CONTAINER_IMAGE | ||
| --container-mounts $FSX_MOUNT | ||
| ) | ||
|
|
||
| export EVAL_SCRIPT=utils/run_checkpoint_evaluations.py | ||
|
|
||
| declare -a EVAL_ARGS=( | ||
| --checkpoints_dir $CHECKPOINT_DIR | ||
| --steps 450 | ||
| --batch_size 1 | ||
| ) | ||
|
|
||
|
|
||
| AUTO_RESUME="" | ||
| if [ -d "/opt/sagemaker_cluster" ]; then | ||
| echo "Detected Hyperpod cluster.. enabling --auto-resume=1" | ||
| AUTO_RESUME="--auto-resume=1" | ||
| fi | ||
|
|
||
| srun ${AUTO_RESUME} -l "${ARGS[@]}" python $EVAL_SCRIPT "${EVAL_ARGS[@]}" | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,57 @@ | ||
| #!/bin/bash | ||
| #SBATCH --job-name=train_nanoVLM | ||
| #SBATCH --output=logs/train_nanoVLM/%A.out | ||
|
Review comment: Is this an array job? I don't think so. Instead of %A and %a you can use %j. The name is cleaner that way.
||
| #SBATCH --error=logs/train_nanoVLM/%A.err | ||
|
Review comment: Is this an array job? I don't think so. Instead of %A and %a you can use %j. The name is cleaner that way.
||
| #SBATCH --time=01:00:00 | ||
| #SBATCH --nodes=2 | ||
| #SBATCH --partition=p5en | ||
|
Review comment: The partition name isn't needed. The job will fail if the partition doesn't exist.
||
|
|
||
| GPUS_PER_NODE=8 #set to 1 for g5.8xlarge | ||
|
Review comment: Should we have a step in the README explaining this? Maybe a command you run to set this up based on the instance you are running on?
||
|
|
||
| cd /fsx/ubuntu/nanoVLM | ||
|
|
||
| export CONTAINER_IMAGE=$(pwd)/nanovlm.sqsh | ||
|
|
||
| export FSX_MOUNT=/fsx/ubuntu:/fsx/ubuntu | ||
|
|
||
|
|
||
| export NCCL_DEBUG=INFO | ||
| export FI_PROVIDER=efa | ||
| #export FI_EFA_USE_HUGE_PAGE=0 # Set to 0 when you see os.fork() causes OSError: Cannot allocate memory. Disabling huge page causes minor performance hit. | ||
| ## Switching SYNC_MEMOPS to zero can boost throughput with FSDP | ||
| ## Disables CU_POINTER_ATTRIBUTE_SYNC_MEMOPS | ||
| ## Reduces memory synchronizations | ||
| ## https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html | ||
| export FI_EFA_SET_CUDA_SYNC_MEMOPS=0 | ||
| # LD_PRELOAD is required for PyTorch to find the NCCL library | ||
| # This path assumes you are using the Deep Learning AMI | ||
| # If you are not using the DLAMI, you may need to update this path | ||
| export LD_PRELOAD=/usr/local/cuda-12.8/lib/libnccl.so | ||
| export NCCL_SOCKET_IFNAME=^docker,lo,veth,eth | ||
|
|
||
| declare -a ARGS=( | ||
| --container-image $CONTAINER_IMAGE | ||
| --container-mounts $FSX_MOUNT | ||
| ) | ||
|
|
||
| declare -a TORCHRUN_ARGS=( | ||
| --nproc_per_node=$GPUS_PER_NODE | ||
| --nnodes=$SLURM_JOB_NUM_NODES | ||
| --rdzv_id=$SLURM_JOB_ID | ||
| --rdzv_backend=c10d | ||
| --rdzv_endpoint=$SLURMD_NODENAME:29500 | ||
| ) | ||
|
|
||
| export TRAIN_SCRIPT=train.py | ||
|
|
||
| declare -a TRAINING_ARGS=( | ||
| --no_log_wandb | ||
| ) | ||
|
|
||
| AUTO_RESUME="" | ||
| if [ -d "/opt/sagemaker_cluster" ]; then | ||
| echo "Detected Hyperpod cluster.. enabling --auto-resume=1" | ||
| AUTO_RESUME="--auto-resume=1" | ||
| fi | ||
|
|
||
| srun ${AUTO_RESUME} -l "${ARGS[@]}" torchrun "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}" | ||
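The review comment on `GPUS_PER_NODE` asks for a way to set the value from the instance type instead of editing it by hand. One possible approach, sketched below, counts the GPUs that `nvidia-smi -L` lists on the node running the batch script and falls back to a default otherwise. The helper name `detect_gpus_per_node` is hypothetical, and the sketch assumes every node in the job has the same GPU count.

```bash
# Hypothetical helper: count GPUs on this node, falling back to a default when
# nvidia-smi is unavailable (for example on a login node without GPUs).
detect_gpus_per_node() {
    local fallback="${1:-1}"
    local count=""
    if command -v nvidia-smi >/dev/null 2>&1; then
        # `nvidia-smi -L` prints one line per GPU, each starting with "GPU <index>:".
        count="$(nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true)"
    fi
    if [ -n "$count" ] && [ "$count" -gt 0 ]; then
        echo "$count"
    else
        echo "$fallback"
    fi
}

# Use the detected count, or 8 when detection is not possible.
GPUS_PER_NODE="$(detect_gpus_per_node 8)"
```

With this in place the value passed to `--nproc_per_node` would follow the hardware, for example 1 on a single-GPU g5 instance.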
Review comment: Can you please change the code so it takes a different directory as the base path? Then change the sbatch file to parse that variable and use it. The idea is to allow anyone to define the base path (e.g. /home/user or /lustre/ubuntu) instead of using a hard-coded choice.
Review comment: Addressed.
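The revised files are not shown in this view. As an illustration only, the base-path request could be met with a shell default like the sketch below. The variable names `BASE_PATH` and `NANOVLM_DIR` are assumptions, not names confirmed by the PR; `CONTAINER_IMAGE` and `FSX_MOUNT` are the variables the scripts already use.

```bash
# Hypothetical variable name: BASE_PATH. Override it at submit time, e.g.
#   BASE_PATH=/lustre/ubuntu sbatch launch_training.sbatch
BASE_PATH="${BASE_PATH:-/fsx/ubuntu}"

# Derive the paths that the scripts currently hard-code.
NANOVLM_DIR="${BASE_PATH}/nanoVLM"
CONTAINER_IMAGE="${NANOVLM_DIR}/nanovlm.sqsh"
FSX_MOUNT="${BASE_PATH}:${BASE_PATH}"
export BASE_PATH NANOVLM_DIR CONTAINER_IMAGE FSX_MOUNT
```

Since `sbatch` propagates the submitting shell's environment by default, setting `BASE_PATH` on the command line is enough to redirect every derived path.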