# Add llama3-1-70b-2node-bf16-seq8192-gbs2048 recipes on A4 (#34)

Status: Merged
**training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/Chart.yaml** — 20 additions

```yaml
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v2
name: a4_jobset_workload
description: a4_jobset_workload
type: application
version: 0.1.0
appVersion: "1.16.0"
```
**...4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/README.md** — 153 additions

<!-- mdformat global-off -->
# Pretrain llama3-1-70b-seq8192-gbs2048-mbs1-gpus16 workloads on a4 GKE Node pools with NVIDIA NeMo Framework

This recipe outlines the steps for running a llama3-1-70b-seq8192-gbs2048-mbs1-gpus16 pretraining workload on [a4 GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the [NVIDIA NeMo framework](https://github.com/NVIDIA/nemo).
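The workload name encodes its shape: sequence length 8192, global batch size 2048 sequences, micro batch size 1, and 16 GPUs across 2 nodes. As a back-of-the-envelope sketch only — it assumes pure data parallelism over all 16 GPUs, which is an assumption; the actual NeMo config may shard the 70B model with tensor or pipeline parallelism — the implied gradient accumulation and tokens per step are:

```shell
# Numbers implied by the recipe name (seq8192, gbs2048, mbs1, gpus16).
# Assumption: every GPU is a data-parallel replica; the real NeMo
# config may partition the 70B model differently.
SEQ_LEN=8192          # sequence length
GLOBAL_BATCH=2048     # global batch size in sequences
MICRO_BATCH=1         # micro batch size per GPU
NUM_GPUS=16           # 2 nodes x 8 GPUs

# Gradient accumulation steps under the pure data-parallel assumption.
ACCUM=$(( GLOBAL_BATCH / (MICRO_BATCH * NUM_GPUS) ))
# Tokens consumed per optimizer step.
TOKENS_PER_STEP=$(( GLOBAL_BATCH * SEQ_LEN ))

echo "accumulation=${ACCUM} tokens_per_step=${TOKENS_PER_STEP}"
```

Under that assumption each GPU accumulates 128 micro batches per step, and each optimizer step consumes roughly 16.8M tokens.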
## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Pretraining job configuration and deployment - A Helm chart is used to configure and deploy the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset) resource, which manages the execution of the [NeMo pretraining workload](https://github.com/NVIDIA/nemo).
## Test environment

This recipe has been optimized for and tested with the following configuration:

- GKE cluster - created by following the Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4) for an a4 GKE cluster.
## Training dataset

This recipe uses a mock pretraining dataset provided by the NeMo framework.
## Docker container image

This recipe uses the following Docker images:

- `nvcr.io/nvidia/nemo:25.07`
- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.0`
## Run the recipe

From your client workstation, complete the following steps:
### Configure environment settings

Set the environment variables to match your environment:

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs://
export KUEUE_NAME=<KUEUE_NAME>
```
Replace the following values:

- `<PROJECT_ID>`: your Google Cloud project ID.
- `<CLUSTER_REGION>`: the region where your cluster is located.
- `<CLUSTER_NAME>`: the name of your GKE cluster.
- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4`. Make sure to verify the name of the local queue in your cluster.
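Because `<GCS_BUCKET>` must not carry the `gs://` prefix, a small guard can normalize the value before it reaches Helm. This is a hypothetical helper, not part of the recipe, and the bucket name below is an example:

```shell
# Hypothetical guard: strip an accidental gs:// prefix from the bucket name.
GCS_BUCKET="gs://my-training-bucket"   # example value with the unwanted prefix
GCS_BUCKET="${GCS_BUCKET#gs://}"       # bash prefix removal; a no-op if absent
echo "$GCS_BUCKET"
```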
Set the default project:

```bash
gcloud config set project $PROJECT_ID
```
### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder.

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=$(git rev-parse --show-toplevel)
export RECIPE_ROOT=$REPO_ROOT/training/a4/llama3-1-70b-seq8192-gbs2048-mbs1-gpus16/nemo-pretraining-gke/2_nodes
cd $RECIPE_ROOT
```
### Get cluster credentials

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```
### Configure and submit a pretraining job

#### Using 2 nodes (16 GPUs) with bf16 precision

To execute the job with the default settings, run the following command from your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4-llama3-1-70b-seq8192-gbs2048-mbs1-gpus16-2node
helm install $WORKLOAD_NAME . -f values.yaml \
  --set-file workload_launcher=launcher.sh \
  --set-file workload_config=llama3-1-70b-seq8192-gbs2048-mbs1-gpus16.py \
  --set workload.image=nvcr.io/nvidia/nemo:25.07 \
  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
  --set volumes.gcsMounts[0].mountPath=/job-logs \
  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
  --set queue=${KUEUE_NAME}
```
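Helm release names are expected to be lowercase and length-limited (commonly cited as at most 53 characters), while `$USER` may contain uppercase characters. The following defensive sketch is an assumption, not part of the recipe; `USER_NAME` is a stand-in for `$USER`:

```shell
# Sketch: sanitize a workload name before using it as a Helm release name.
# Assumption: release names should be lowercase, [a-z0-9-], and <= 53 chars.
USER_NAME="Alice"   # hypothetical stand-in for $USER
WORKLOAD_NAME="${USER_NAME}-a4-llama3-1-70b-seq8192-gbs2048-mbs1-gpus16-2node"
# Lowercase the name and truncate it to 53 characters.
WORKLOAD_NAME=$(echo "$WORKLOAD_NAME" | tr '[:upper:]' '[:lower:]' | cut -c1-53)
echo "$WORKLOAD_NAME"
```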
**Examples**

- To set the number of training steps to 100, run the following command from your client:

  ```bash
  cd $RECIPE_ROOT
  export WORKLOAD_NAME=$USER-a4-llama3-1-70b-seq8192-gbs2048-mbs1-gpus16-2node
  helm install $WORKLOAD_NAME . -f values.yaml \
    --set-file workload_launcher=launcher.sh \
    --set-file workload_config=llama3-1-70b-seq8192-gbs2048-mbs1-gpus16.py \
    --set workload.image=nvcr.io/nvidia/nemo:25.07 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    --set volumes.gcsMounts[0].mountPath=/job-logs \
    --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
    --set queue=${KUEUE_NAME} \
    --set workload.arguments[0]="trainer.max_steps=100"
  ```
### Monitor the job

To check the status of the pods in your job, run the following command:

```bash
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace the following:

- `JOB_NAME_PREFIX`: your job name prefix. For example, `$USER-a4-llama3-1-70b-seq8192-gbs2048-mbs1-gpus16-2node`.
To get the logs for one of the pods, run the following command:

```bash
kubectl logs POD_NAME
```

Information about the training job's progress, including crucial details such as loss, step count, and step time, is generated by the rank 0 process. This process runs on the pod whose name begins with `JOB_NAME_PREFIX-workload-0-0`. For example: `$USER-a4-llama3-1-70b-seq8192-gbs2048-mbs1-gpus16-2node-workload-0-0-s9zrv`.
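The rank 0 pod can be picked out of a pod listing with a simple pattern match. The sketch below runs against a canned list of hypothetical pod names rather than a live cluster; in practice the list would come from `kubectl get pods`:

```shell
# Sketch: select the rank 0 pod from a (canned) pod listing.
# Hypothetical pod names; a live list would come from kubectl get pods.
PODS="demo-2node-workload-0-0-s9zrv
demo-2node-workload-0-1-x7ab2"
# The rank 0 pod is the one whose name contains "-workload-0-0-".
RANK0_POD=$(echo "$PODS" | grep -- '-workload-0-0-')
echo "$RANK0_POD"
```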
### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart. To uninstall the Helm release, run the following command from your client:

```bash
helm uninstall $USER-a4-llama3-1-70b-seq8192-gbs2048-mbs1-gpus16-2node
```
**training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/launcher.sh** — 105 additions
```bash
usage()
{
cat << EOF
usage: bash ./launcher.sh [config-override [config-override ...]]
config-override  (Optional) A NeMo configuration override. E.g. trainer.max_steps=10000.
EOF
}

parse_args() {
  while [ "$1" != "" ]; do
    case $(grep -o "=" <<< "$1" | wc -l) in
      1 )
        config_overrides+=("$1")
        ;;
      * )
        echo "Invalid config override: $1"
        usage
        exit 1
    esac
    shift
  done
  config_overrides="${config_overrides[*]}"
}

config_overrides=()
parse_args "$@"

if [ -z "${config_overrides}" ]; then
  echo "No NeMo config overrides specified"
else
  echo "NeMo config overrides:"
  echo "  ${config_overrides}"
fi

export LD_LIBRARY_PATH="$NCCL_PLUGIN_PATH"
ldconfig $LD_LIBRARY_PATH
echo "Added $LD_LIBRARY_PATH to ldconfig:"
ldconfig -p | grep libcuda | sed 's/^/  /'
echo ""

if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then
  explicit_log_dir=${EXPLICIT_LOG_DIR}
else
  explicit_log_dir=workload_logs
fi
echo "Logging to ${explicit_log_dir}"

if [[ -n "${TOKENIZER_PATH}" ]]; then
  echo "Getting tokenizer files"
  cp ${TOKENIZER_PATH}/* .
  echo ""
fi

echo "Launching Torch distributed on the node rank $JOB_COMPLETION_INDEX out of $NNODES nodes"

# Update NeMo-Run so we can export the config.
pip install git+https://github.com/NVIDIA/NeMo-Run.git@6550ff68204e5095452098eed3765ed765de5d33
pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger

# Export the NeMo 2.0 config to YAML.
python ${NEMO_LAUNCH_SCRIPT} --factory "recipe()" \
  trainer.num_nodes="$NNODES" \
  log.explicit_log_dir="${explicit_log_dir}" \
  trainer.max_steps=25 \
  trainer.num_nodes=2 \
  trainer.devices=8 \
  ${config_overrides} \
  --to-yaml exported_nemo_config.yaml

# Create the nsys directory.
mkdir -p ${explicit_log_dir}/nsys

OMP_NUM_THREADS=12 NSYS_CONFIG_DIRECTIVES="AgentLaunchTimeoutSec=240;AppLaunchTimeoutSec=240" TORCH_NCCL_ENABLE_MONITORING=0 \
/usr/local/bin/nsys profile -s none -t nvtx,cuda --capture-range=cudaProfilerApi --capture-range-end=stop \
  -o ${explicit_log_dir}/nsys/noderank-${JOB_COMPLETION_INDEX} \
  --session-new "nemo-rank${JOB_COMPLETION_INDEX}"-$RANDOM \
  --wait all \
  torchrun \
    --nproc-per-node="${GPUS_PER_NODE}" \
    --nnodes="${NNODES}" \
    --node_rank="${JOB_COMPLETION_INDEX}" \
    --rdzv_id="${JOB_IDENTIFIER}" \
    --master_addr="${MASTER_ADDR}" \
    --master_port="${MASTER_PORT}" \
    ${NEMO_LAUNCH_SCRIPT} --factory "recipe()" \
      trainer.num_nodes="$NNODES" \
      log.explicit_log_dir="${explicit_log_dir}" \
      trainer.max_steps=25 \
      trainer.num_nodes=2 \
      trainer.devices=8 \
      ${config_overrides}

if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
  mkdir -p ${ARTIFACT_DIR}
  cp -r ${explicit_log_dir}/* ${ARTIFACT_DIR}/
  cp ${NEMO_LAUNCH_SCRIPT} ${ARTIFACT_DIR}/run-cli.py
  cp dllogger.json ${ARTIFACT_DIR}/dllogger.json
  cp exported_nemo_config.yaml ${ARTIFACT_DIR}/nemo-configuration.yaml
  env > ${ARTIFACT_DIR}/environ.txt
  ls ${ARTIFACT_DIR}
fi
echo "Training completed"
echo "Pod on $(hostname --fqdn) is exiting"
```
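The launcher accepts a config override only when the argument contains exactly one `=` (the `case` statement in `parse_args` counts occurrences). That rule can be reproduced standalone as a small sketch mirroring the launcher's logic:

```shell
# Sketch: the launcher's validity rule for a config override is
# "the argument contains exactly one '='". Reproduce it as a function.
is_valid_override() {
  # grep -o prints one match per line; wc -l counts them.
  [ "$(grep -o "=" <<< "$1" | wc -l)" -eq 1 ]
}

is_valid_override "trainer.max_steps=100" && echo "accepted: trainer.max_steps=100"
is_valid_override "no-equals-here" || echo "rejected: no-equals-here"
```

Arguments with zero or multiple `=` characters (for example a stray `a=b=c`) are rejected, which matches the `* )` branch of the launcher's `case`.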