diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/Chart.yaml b/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/Chart.yaml
new file mode 100644
index 0000000..af46c11
--- /dev/null
+++ b/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/Chart.yaml
@@ -0,0 +1,20 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+apiVersion: v2
+name: a4_jobset_workload
+description: a4_jobset_workload
+type: application
+version: 0.1.0
+appVersion: "1.16.0"
diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/README.md b/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/README.md
new file mode 100644
index 0000000..bd4bdac
--- /dev/null
+++ b/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/README.md
@@ -0,0 +1,153 @@
+
+# Pretrain llama3-1-70b-seq8192-gbs2048-mbs1-gpus16 workloads on A4 GKE Node pools with NVIDIA NeMo Framework
+
+This recipe outlines the steps for running a llama3-1-70b-seq8192-gbs2048-mbs1-gpus16 pretraining
+workload on [A4 GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the
+[NVIDIA NeMo framework](https://github.com/NVIDIA/nemo).
+
+## Orchestration and deployment tools
+
+For this recipe, the following setup is used:
+
+- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
+- Pretraining job configuration and deployment - A Helm chart is used to
+  configure and deploy the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset) resource, which manages the execution of the
+  [NeMo pretraining workload](https://github.com/NVIDIA/nemo).
+
+## Test environment
+
+This recipe has been optimized for and tested with the following configuration:
+
+- GKE cluster - To create an A4 GKE cluster, follow the Cluster Toolkit
+  [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4).
+
+## Training dataset
+
+This recipe uses a mock pretraining dataset provided by the NeMo framework.
+
+## Docker container images
+
+This recipe uses the following Docker images:
+
+- `nvcr.io/nvidia/nemo:25.07`
+- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.0`
+
+## Run the recipe
+
+From your client workstation, complete the following steps:
+
+### Configure environment settings
+
+Set the environment variables to match your environment:
+
+  ```bash
+  export PROJECT_ID=<PROJECT_ID>
+  export CLUSTER_REGION=<CLUSTER_REGION>
+  export CLUSTER_NAME=<CLUSTER_NAME>
+  export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs://
+  export KUEUE_NAME=<KUEUE_NAME>
+  ```
+
+Replace the following values:
+
+  - `<PROJECT_ID>`: your Google Cloud project ID.
+  - `<CLUSTER_REGION>`: the region where your cluster is located.
+  - `<CLUSTER_NAME>`: the name of your GKE cluster.
+  - `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
+  - `<KUEUE_NAME>`: the name of the Kueue local queue.
+    The default queue created by the Cluster Toolkit is `a4`. Make sure to verify the name of the local queue in your cluster.
+
+Set the default project:
+
+  ```bash
+  gcloud config set project $PROJECT_ID
+  ```
+
+### Get the recipe
+
+Clone the `gpu-recipes` repository and set a reference to the recipe folder.
+
+```bash
+git clone https://github.com/ai-hypercomputer/gpu-recipes.git
+cd gpu-recipes
+export REPO_ROOT=`git rev-parse --show-toplevel`
+export RECIPE_ROOT=$REPO_ROOT/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe
+cd $RECIPE_ROOT
+```
+
+### Get cluster credentials
+
+```bash
+gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
+```
+
+### Configure and submit a pretraining job
+
+#### Using 2 nodes (16 GPUs) with bf16 precision
+
+To execute the job with the default settings, run the following command from
+your client:
+
+  ```bash
+  cd $RECIPE_ROOT
+  export WORKLOAD_NAME=$USER-a4-llama3-1-70b
+  helm install $WORKLOAD_NAME . -f values.yaml \
+    --set-file workload_launcher=launcher.sh \
+    --set-file workload_config=llama3-1-70b-seq8192-gbs2048-mbs1-gpus16.py \
+    --set workload.image=nvcr.io/nvidia/nemo:25.07 \
+    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+    --set volumes.gcsMounts[0].mountPath=/job-logs \
+    --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+    --set queue=${KUEUE_NAME}
+  ```
+
+**Examples**
+
+- To set the number of training steps to 100, run the following command from
+  your client:
+
+  ```bash
+  cd $RECIPE_ROOT
+  export WORKLOAD_NAME=$USER-a4-llama3-1-70b
+  helm install $WORKLOAD_NAME . -f values.yaml \
+    --set-file workload_launcher=launcher.sh \
+    --set-file workload_config=llama3-1-70b-seq8192-gbs2048-mbs1-gpus16.py \
+    --set workload.image=nvcr.io/nvidia/nemo:25.07 \
+    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
+    --set volumes.gcsMounts[0].mountPath=/job-logs \
+    --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
+    --set queue=${KUEUE_NAME} \
+    --set workload.arguments[0]="trainer.max_steps=100"
+  ```
+
+### Monitor the job
+
+To check the status of pods in your job, run the following command:
+
+```bash
+kubectl get pods | grep JOB_NAME_PREFIX
+```
+
+Replace the following:
+
+- `JOB_NAME_PREFIX`: your job name prefix, for example `$USER-a4-llama3-1-70b`.
+
+To get the logs for one of the pods, run the following command:
+
+```bash
+kubectl logs POD_NAME
+```
+
+The rank 0 process reports the training job's progress, including key metrics
+such as loss, step count, and step time.
+This process runs on the pod whose name begins with
+`JOB_NAME_PREFIX-workload-0-0`.
+For example: `$USER-a4-llama3-1-70b-workload-0-0-s9zrv`.
+
+### Uninstall the Helm release
+
+You can delete the job and other resources created by the Helm chart. To
+uninstall the Helm release, run the following command from your client:
+
+```bash
+helm uninstall $USER-a4-llama3-1-70b
+```
diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/launcher.sh b/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/launcher.sh
new file mode 100644
index 0000000..f848d79
--- /dev/null
+++ b/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/launcher.sh
@@ -0,0 +1,105 @@
+usage()
+{
+cat << EOF
+usage: bash ./launcher.sh [config-override [config-override ...]]
+config-override (Optional) A NeMo configuration override. E.g. trainer.max_steps=10000.
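+Multiple overrides can be passed, separated by spaces.
+example: bash ./launcher.sh trainer.max_steps=1000 trainer.val_check_interval=500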
+EOF +} + +parse_args() { + while [ "$1" != "" ]; do + case $(grep -o "=" <<< "$1" | wc -l) in + 1 ) + config_overrides+=("$1") + ;; + * ) + echo "Invalid config override: $1" + usage + exit 1 + esac + shift + done + config_overrides="${config_overrides[*]}" +} + +config_overrides=() +parse_args "$@" + +if [ -z "${config_overrides}" ]; then + echo "No NeMo config overrides specified" +else + echo "NeMo config overrides:" + echo " ${config_overrides}" +fi + +export LD_LIBRARY_PATH="$NCCL_PLUGIN_PATH" +ldconfig $LD_LIBRARY_PATH +echo "Added $LD_LIBRARY_PATH to ldconfig:" +ldconfig -p | grep libcuda | sed 's/^/ /' +echo "" + +if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then + explicit_log_dir=${EXPLICIT_LOG_DIR} +else + explicit_log_dir=workload_logs +fi +echo "Logging to ${explicit_log_dir}" + +if [[ -n "${TOKENIZER_PATH}" ]]; then + echo "Getting tokenizer files" + cp ${TOKENIZER_PATH}/* . + echo "" +fi + +echo "Launching Torch distributed on the node rank $JOB_COMPLETION_INDEX out of $NNODES nodes" + + +# Update nemo run so we can export the config. +pip install git+https://github.com/NVIDIA/NeMo-Run.git@6550ff68204e5095452098eed3765ed765de5d33 +pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger + + +# Export the nemo2 config to yaml. +python ${NEMO_LAUNCH_SCRIPT} --factory "recipe()" \ +trainer.num_nodes="$NNODES" \ +log.explicit_log_dir="${explicit_log_dir}" \ +trainer.max_steps=25 \ +trainer.num_nodes=2 \ +trainer.devices=8 \ +${config_overrides} \ +--to-yaml exported_nemo_config.yaml + +# Create the nsys directory. +mkdir -p ${explicit_log_dir}/nsys + +OMP_NUM_THREADS=12 NSYS_CONFIG_DIRECTIVES="AgentLaunchTimeoutSec=240;AppLaunchTimeoutSec=240" TORCH_NCCL_ENABLE_MONITORING=0 \ +/usr/local/bin/nsys profile -s none -t nvtx,cuda --capture-range=cudaProfilerApi --capture-range-end=stop \ +-o ${explicit_log_dir}/nsys/noderank-${JOB_COMPLETION_INDEX} \ +--session-new "nemo-rank${JOB_COMPLETION_INDEX}"-$RANDOM \ +--wait all \ +torchrun \ +--nproc-per-node="${GPUS_PER_NODE}" \ +--nnodes="${NNODES}" \ +--node_rank="${JOB_COMPLETION_INDEX}" \ +--rdzv_id="${JOB_IDENTIFIER}" \ +--master_addr="${MASTER_ADDR}" \ +--master_port="${MASTER_PORT}" \ +${NEMO_LAUNCH_SCRIPT} --factory "recipe()" \ +trainer.num_nodes="$NNODES" \ +log.explicit_log_dir="${explicit_log_dir}" \ +trainer.max_steps=25 \ +trainer.num_nodes=2 \ +trainer.devices=8 \ +${config_overrides} + +if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then + mkdir -p ${ARTIFACT_DIR} + cp -r ${explicit_log_dir}/* ${ARTIFACT_DIR}/ + cp ${NEMO_LAUNCH_SCRIPT} ${ARTIFACT_DIR}/run-cli.py + cp dllogger.json ${ARTIFACT_DIR}/dllogger.json + cp exported_nemo_config.yaml ${ARTIFACT_DIR}/nemo-configuration.yaml + env > ${ARTIFACT_DIR}/environ.txt + ls ${ARTIFACT_DIR} +fi +echo "Training completed" +echo "Pod on $(hostname --fqdn) is exiting" \ No newline at end of file diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/llama3-1-70b-seq8192-gbs2048-mbs1-gpus16.py b/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/llama3-1-70b-seq8192-gbs2048-mbs1-gpus16.py new file mode 100644 index 0000000..36fcd20 --- /dev/null +++ b/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/llama3-1-70b-seq8192-gbs2048-mbs1-gpus16.py @@ -0,0 +1,142 @@ +"""Nemo2 pretraining recipe for Llama 3.1 70B model.""" + +from nemo.collections import llm +from nemo.collections.llm.recipes import llama31_70b +from nemo.lightning.pytorch.callbacks import NsysCallback +from 
nemo.lightning.pytorch.callbacks.flops_callback import FLOPsMeasurementCallback +from nemo.utils.loggers.dllogger import DLLogger +import nemo_run as run +from scripts.performance.helpers import ( + set_primary_perf_configs, +) +from scripts.performance.utils import get_comm_overlap_callback_idx + + +def recipe( + profile_enabled: bool = False, + profile_start_step: int = 0, + profile_end_step: int = 0, + profile_ranks: str = "0", +) -> run.Partial: + """Returns a Nemo2 training recipe for Llama 3.1 70B model. + + Args: + profile_enabled: Whether to enable Nsys profiling. + profile_start_step: The step to start profiling. + profile_end_step: The step to end profiling. + profile_ranks: The ranks to profile, comma separated. + + Returns: + A Nemo2 training recipe. + """ + # Start from the Nemo standard recipe. + pretrain = llama31_70b.pretrain_recipe(performance_mode=True) + + num_nodes = 2 + num_gpus_per_node = 8 + mbs = 1 + gbs = 2048 + max_steps = 25 + tp_size = 2 + pp_size = 2 + cp_size = 2 + vp_size = 5 # Virtual Pipeline Parallelism + ep_size = 1 # Expert Parallelism + enable_cuda_graphs = False + compute_dtype = "bf16" + fp8_recipe = None # Not needed for bf16 + nccl_communicator_config_path = None + use_mcore_fsdp = False + use_fsdp_double_buffer = False + use_user_buffer_registration = False + use_sharp = False + keep_fsdp_fp8_transpose_cache = False + + pretrain = set_primary_perf_configs( + pretrain, + "pre_train", + num_nodes=num_nodes, + num_gpus_per_node=num_gpus_per_node, + mbs=mbs, + gbs=gbs, + max_steps=max_steps, + tp_size=tp_size, + pp_size=pp_size, + cp_size=cp_size, + vp_size=vp_size, + ep_size=ep_size, + enable_cuda_graphs=enable_cuda_graphs, + compute_dtype=compute_dtype, + fp8_recipe=fp8_recipe, + nccl_communicator_config_path=nccl_communicator_config_path, + use_mcore_fsdp=use_mcore_fsdp, + use_fsdp_double_buffer=use_fsdp_double_buffer, + use_user_buffer_registration=use_user_buffer_registration, + use_sharp=use_sharp, + keep_fsdp_fp8_transpose_cache=keep_fsdp_fp8_transpose_cache, + ) + + # Sequence Length (model and data) + pretrain.model.config.seq_length = 8192 + pretrain.data.seq_length = 8192 + + # Set the number of steps to 50 for a quicker benchmark. + pretrain.trainer.max_steps = 50 + + # Disable validation batches. + pretrain.trainer.limit_val_batches = 0.0 + pretrain.trainer.val_check_interval = 100 + + # Add the Nsys profiling callback if enabled. + if profile_enabled: + pretrain.trainer.callbacks.append( + run.Config( + NsysCallback, + start_step=profile_start_step, + end_step=profile_end_step, + ranks=[int(x) for x in profile_ranks.split(",")], + gen_shape=False, + ) + ) + + # Add the FLOPs measurement callback. + pretrain.trainer.callbacks.append( + run.Config( + FLOPsMeasurementCallback, + model_name="llama31-70b", + model_config=pretrain.model.config, + data_config=pretrain.data, + ) + ) + + # When `performance_mode` is enabled, the Megatron communication overlap + # callback is already added to the recipe. + # https://github.com/NVIDIA-NeMo/NeMo/blob/90a396a567ebb4e8c1c41e454dc00cb71f911317/nemo/collections/llm/recipes/llama31_70b.py#L231 + comm_overlap_callback_idx = get_comm_overlap_callback_idx( + pretrain.trainer.callbacks + ) + pretrain.trainer.callbacks[ + comm_overlap_callback_idx + ].tp_comm_bootstrap_backend = "nccl" + + # Disable checkpointing. + pretrain.log.ckpt = None + pretrain.trainer.enable_checkpointing = False + + # Log every step. 
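+  # (With log_every_n_steps=1, loss and step-time metrics are reported on every
+  # training step instead of the Lightning default of every 50 steps.)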
+ pretrain.trainer.log_every_n_steps = 1 + + # Enable DLLogger + dllogger_config = run.Config( + DLLogger, + verbose=True, + stdout=True, + json_file="dllogger.json", + ) + pretrain.log.extra_loggers = [dllogger_config] + + return pretrain + + +if __name__ == "__main__": + run.cli.main(llm.pretrain, default_factory=recipe) diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/recipe_launch_command.sh b/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/recipe_launch_command.sh new file mode 100644 index 0000000..3b59f17 --- /dev/null +++ b/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/recipe_launch_command.sh @@ -0,0 +1 @@ +helm install vishwasreddy-ubench-distributed-3ubr . -f values.yaml --set-file workload_launcher=launcher.sh --set-file workload_config=llama3-1-70b-seq8192-gbs2048-mbs1-gpus16.py --set workload.image=nvcr.io/nvidia/nemo:25.07 --set volumes.gcsMounts[0].bucketName=ubench-logs --set volumes.gcsMounts[0].mountPath=/job-logs --set workload.envs[0].value=/job-logs/vishwasreddy-ubench-distributed-3ubr \ No newline at end of file diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/templates/workload-config-configmap.yaml b/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/templates/workload-config-configmap.yaml new file mode 100644 index 0000000..f34b508 --- /dev/null +++ b/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/templates/workload-config-configmap.yaml @@ -0,0 +1,26 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-config" +data: + workload-configuration: |- +{{- if .Values.workload_config }} +{{ .Values.workload_config | nindent 4 }} +{{- else }} +{{ "config: null" | nindent 4 }} +{{- end }} diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/templates/workload-job.yaml b/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/templates/workload-job.yaml new file mode 100644 index 0000000..baa2c1e --- /dev/null +++ b/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/templates/workload-job.yaml @@ -0,0 +1,329 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
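+
+# The Helm template variables defined below generate a unique job timestamp and
+# UUID, and derive the node count and GPUs per node from .Values.workload.gpus
+# (8 GPUs per A4 node).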
+ +{{$timestamp := now | date "2006-01-02-15-04-05"}} +{{$jobSuffix := randAlphaNum 4 | lower}} +{{$jobuuid := uuidv4}} +{{$nodes := div .Values.workload.gpus 8 | max 1}} +{{$gpusPerNode := min .Values.workload.gpus 8}} +{{- $root := . -}} + +apiVersion: jobset.x-k8s.io/v1alpha2 +kind: JobSet +metadata: + name: "{{ .Release.Name }}" + namespace: default + labels: + {{- if $root.Values.queue }} + kueue.x-k8s.io/queue-name: "{{ $root.Values.queue }}" + {{- end }} +spec: + {{- if $root.Values.queue }} + suspend: true + {{- end }} + failurePolicy: + maxRestarts: {{ default 0 $root.Values.workload.max_workload_restarts }} + replicatedJobs: + - name: workload + replicas: 1 + template: + spec: + parallelism: {{ $nodes }} + completions: {{ $nodes }} + backoffLimit: 0 + completionMode: Indexed + activeDeadlineSeconds: 14400 # 4 hours (4 * 60 * 60) + ttlSecondsAfterFinished: 43200 # 12 hours (12 * 60 * 60) + template: + metadata: + annotations: + kubectl.kubernetes.io/default-container: workload + {{- if $root.Values.volumes.gcsVolumes }} + gke-gcsfuse/volumes: "true" + gke-gcsfuse/cpu-limit: "500m" + gke-gcsfuse/memory-limit: "1Ti" + gke-gcsfuse/ephemeral-storage-limit: "2Ti" + {{- end }} + {{- if $root.Values.volumes.psVolumes }} + gke-parallelstore/volumes: "true" + gke-parallelstore/cpu-limit: "0" + gke-parallelstore/memory-limit: "0" + {{- end }} + {{- if and $root.Values.queue $root.Values.tasSettings.topologyRequest }} + {{- toYaml .Values.tasSettings.topologyRequest | nindent 14 }} + {{- end }} + {{- if and $root.Values.queue $root.Values.dwsSettings.maxRunDurationSeconds }} + provreq.kueue.x-k8s.io/maxRunDurationSeconds: "{{ $root.Values.dwsSettings.maxRunDurationSeconds }}" + {{- end }} + {{- if not $root.Values.network.hostNetwork }} + networking.gke.io/default-interface: "eth0" + networking.gke.io/interfaces: | + {{- if $root.Values.network.subnetworks }} + [ + {{- range $i, $subnetwork := $root.Values.network.subnetworks }} + {"interfaceName":"eth{{ $i }}","network":"{{ $subnetwork }}"}{{ eq $i 9 | ternary "" ","}} + {{- end }} + ] + {{- else }} + [ + {"interfaceName":"eth0","network":"default"}, + {"interfaceName":"eth1","network":"gvnic-1"}, + {{- range $i := until 8 }} + {"interfaceName":"eth{{ add 2 $i }}","network":"rdma-{{ $i }}"}{{ eq $i 7 | ternary "" ","}} + {{- end }} + ] + {{- end }} + {{- end }} + spec: + {{- if $root.Values.network.hostNetwork }} + hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet + {{- end }} + subdomain: "{{.Release.Name}}" + restartPolicy: Never + {{- if $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "In" + values: + {{- range $hostname := $root.Values.targetNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + {{- if $root.Values.avoidNodes }} + {{- if not $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + {{- end }} + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: "NotIn" + values: + {{- range $hostname := $root.Values.avoidNodes }} + - {{ $hostname }} + {{- end }} + {{- end }} + tolerations: + - operator: "Exists" + key: nvidia.com/gpu + - operator: "Exists" + key: cloud.google.com/impending-node-termination + + volumes: + {{ if $root.Values.network.gibVersion }} + - name: gib + emptyDir: {} + {{ end }} + + - name: workload-configuration + configMap: + name: "{{.Release.Name}}-config" + items: + - key: 
workload-configuration + path: {{ $root.Values.workload.configFile | default "workload-configuration" }} + + - name: workload-launcher + configMap: + name: "{{.Release.Name}}-launcher" + + - name: shared-memory + emptyDir: + medium: "Memory" + sizeLimit: 250Gi + + {{- range $pvc := $root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + persistentVolumeClaim: + claimName: "{{ $pvc.claimName }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + csi: + driver: gcsfuse.csi.storage.gke.io + volumeAttributes: + bucketName: "{{ $gcs.bucketName }}" + {{- if $gcs.mountOptions }} + mountOptions: "{{ $gcs.mountOptions }}" + {{- end }} + {{- end}} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + hostPath: + path: /mnt/stateful_partition/kube-ephemeral-ssd + {{- end }} + + initContainers: + {{ if $root.Values.network.gibVersion }} + - name: nccl-plugin-installer + image: {{ $root.Values.network.gibVersion }} + imagePullPolicy: Always + args: + - | + set -ex + /scripts/container_entry.sh install --install-nccl + cp -R /var/lib/gib/lib64/. /target/usr/local/gib/lib64 + cp -R /var/lib/gib/. /target/usr/local/gib + command: + - /bin/sh + - -c + volumeMounts: + - mountPath: /target/usr/local/gib + name: gib + {{ end}} + + containers: + {{- if $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-sidecar + image: {{ $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-metadata-prefetch + image: {{ $root.Values.workload.gcsSidecarImage }} + {{- end }} + {{- if $root.Values.workload.psSidecarImage }} + - name: gke-parallelstore-sidecar + image: {{ $root.Values.workload.psSidecarImage }} + {{- end }} + + - name: workload + image: "{{ $root.Values.workload.image }}" + imagePullPolicy: Always + {{- if $root.Values.network.hostNetwork }} + securityContext: + privileged: true + {{- end }} + env: + - name: JOB_IDENTIFIER + value: "{{ .Release.Name }}-{{ $timestamp }}" + - name: JOB_TIMESTAMP + value: "{{ $timestamp }}" + - name: JOB_UUID + value: "{{ $jobuuid }}" + - name: JOB_ORCHESTRATOR + value: "gke" + # Add RANK based on the pod's index provided by the Indexed Job + # This is crucial for torch.distributed initialization. 
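+              # (JOB_COMPLETION_INDEX is read from the Indexed Job's completion-index
+              # annotation; launcher.sh passes it to torchrun as --node_rank.)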
+ - name: JOB_COMPLETION_INDEX + valueFrom: + fieldRef: + fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index'] + - name: RANK_0_FQDN + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: HOSTNAME_PREFIX + value: "{{.Release.Name}}-workload-" + - name: DOMAIN_NAME + value: "{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_ADDR + value: "{{.Release.Name}}-workload-0-0.{{.Release.Name}}.default.svc.cluster.local" + - name: MASTER_PORT + value: "6002" + - name: WORLD_SIZE + value: "{{ $root.Values.workload.gpus }}" + - name: NNODES + value: "{{ $nodes }}" + - name: GPUS_PER_NODE + value: "{{ $gpusPerNode }}" + + - name: NCCL_PLUGIN_PATH + value: /usr/local/gib/lib64 + + {{ if $root.Values.network.gibVersion }} + - name: NCCL_INIT_SCRIPT + value: "/usr/local/gib/scripts/set_nccl_env.sh" + {{ end }} + + {{ if $root.Values.network.ncclSettings }} + {{- toYaml .Values.network.ncclSettings | nindent 14 }} + {{ end }} + + {{ if $root.Values.workload.envs }} + {{- toYaml .Values.workload.envs | nindent 14 }} + {{ end }} + + command: + - bash + - -c + - | + echo "Pod on $(hostname --fqdn) is running" + echo "Pod is assigned job index of $JOB_COMPLETION_INDEX" + + if [[ -n "${NCCL_INIT_SCRIPT}" ]]; then + echo "Running NCCL init script: ${NCCL_INIT_SCRIPT}" + source ${NCCL_INIT_SCRIPT} + fi + + # Overriding NCCL_SOCKET_IFNAME definition + export NCCL_SOCKET_IFNAME="eth0,eth1" + export NCCL_TUNER_CONFIG_PATH=/usr/local/gib/configs/tuner_config_a4.txtpb + + echo "Launching workload with the following arguments:" + {{- range $root.Values.workload.defaultArguments }} + echo " {{ . }}" + {{- end }} + {{- range $root.Values.workload.arguments }} + echo " {{ . }}" + {{- end }} + echo "" + + sleep 10 + + bash /workload/launcher/launch-workload.sh \ + {{- range $root.Values.workload.defaultArguments }} + {{ . }} \ + {{- end }} + {{- range $root.Values.workload.arguments }} + {{ . }} \ + {{- end }} + + + volumeMounts: + {{ if $root.Values.network.gibVersion }} + - name: gib + mountPath: /usr/local/gib + {{ end }} + + - name: workload-configuration + mountPath: {{ $root.Values.workload.configPath | default "/workload/configs" }} + + - name: workload-launcher + mountPath: /workload/launcher + + - name: shared-memory + mountPath: /dev/shm + + {{- range $pvc := $root.Values.volumes.pvcMounts }} + - name: "{{ $pvc.claimName }}" + mountPath: "{{ $pvc.mountPath }}" + {{- end }} + + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + mountPath: "{{ $gcs.mountPath }}" + {{- end }} + + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + mountPath: "{{ $root.Values.volumes.ssdMountPath }}" + {{- end }} + + resources: + limits: + nvidia.com/gpu: {{ $gpusPerNode }} diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/templates/workload-launcher-configmap.yaml b/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/templates/workload-launcher-configmap.yaml new file mode 100644 index 0000000..7026e0f --- /dev/null +++ b/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/templates/workload-launcher-configmap.yaml @@ -0,0 +1,28 @@ +# yamllint disable +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-launcher" +data: + launch-workload.sh: |- +{{- if .Values.workload_launcher }} +{{ .Values.workload_launcher | nindent 4 }} +{{- else }} + #!/bin/bash + echo "No workload launcher specified" + exit 1 +{{- end }} diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/templates/workload-svc.yaml b/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/templates/workload-svc.yaml new file mode 100644 index 0000000..7cfe220 --- /dev/null +++ b/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/templates/workload-svc.yaml @@ -0,0 +1,22 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v1 +kind: Service +metadata: + name: "{{ .Release.Name }}" +spec: + clusterIP: None + selector: + jobset.sigs.k8s.io/jobset-name: "{{ .Release.Name }}" diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/values.yaml b/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/values.yaml new file mode 100644 index 0000000..832131a --- /dev/null +++ b/training/a4/llama3-1-70b/nemo-pretraining-gke/2node-bf16-seq8192-gbs2048/recipe/values.yaml @@ -0,0 +1,33 @@ +dwsSettings: + maxRunDurationSeconds: null +network: + gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.0 + hostNetwork: true + ncclSettings: + - name: NCCL_DEBUG + value: WARN + subnetworks[]: null +queue: null +tasSettings: + topologyRequest: + kueue.x-k8s.io/podset-preferred-topology: kubernetes.io/hostname +volumes: + gcsMounts: + - bucketName: null + mountPath: null + gcsVolumes: true + psVolumes: false +workload: + arguments[]: null + configFile: llama3-1-70b-seq8192-gbs2048-mbs1-gpus16.py + configPath: /workload/configs/ + defaultArguments[]: null + envs: + - name: ARTIFACT_DIR + value: null + - name: GLOO_SOCKET_IFNAME + value: eth0 + - name: NEMO_LAUNCH_SCRIPT + value: /workload/configs/llama3-1-70b-seq8192-gbs2048-mbs1-gpus16.py + gpus: 16 + image: nvcr.io/nvidia/nemo:25.07
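+# Note: bucketName, mountPath, the ARTIFACT_DIR value, and queue are left null here;
+# the README's helm install command supplies them at install time via --set flags.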