
Commit 5be5988

Author: Copybara
Copybara import of gpu-recipes:
- 3bcbd7d6d51b34722d3d1a5ebbfc4164bfec891e Adding Llama3.1-70B MaxText for A4 pretraining recipe GitOrigin-RevId: 3bcbd7d6d51b34722d3d1a5ebbfc4164bfec891e
1 parent 0211dc1 commit 5be5988

File tree

5 files changed: +330, −0 lines

README.md — 1 addition, 0 deletions

```diff
@@ -39,6 +39,7 @@ Models | GPU Machine Type
 
 Models             | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe
 ------------------ | ---------------- | --------- | ------------- | ------------ | ------------------
+**Llama-3.1-70B**  | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | MaxText | Pre-training | GKE | [Link](./training/a4/llama3-1-70b/maxtext-pretraining-gke/README.md)
 **Llama-3.1-70B**  | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | NeMo    | Pre-training | GKE | [Link](./training/a4/llama3-1-70b/nemo-pretraining-gke/README.md)
 **Llama-3.1-405B** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | MaxText | Pre-training | GKE | [Link](./training/a4/llama3-1-405b/maxtext-pretraining-gke/README.md)
 **Llama-3.1-405B** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | NeMo    | Pre-training | GKE | [Link](./training/a4/llama3-1-405b/nemo-pretraining-gke/README.md)
```
src/frameworks/a4/maxtext-configs/llama3-1-70b-256gpus-a4-bf16.yaml — 18 additions, 0 deletions

```yaml
hardware: gpu
dcn_data_parallelism: 1
dcn_fsdp_parallelism: 32
ici_fsdp_parallelism: 8
per_device_batch_size: 8
max_target_length: 8192
learning_rate: 0.001
model_name: llama3.1-70b
enable_checkpointing: false
attention: cudnn_flash_te
remat_policy: full
use_iota_embed: true
dataset_type: synthetic
logits_dot_in_fp32: false
scan_layers: True
enable_goodput_recording: false
monitor_goodput: false
save_config_to_gcs: true
```
src/frameworks/a4/maxtext-configs/llama3-1-70b-256gpus-a4-fp8.yaml — 19 additions, 0 deletions

```yaml
hardware: gpu
dcn_data_parallelism: 1
dcn_fsdp_parallelism: 32
ici_fsdp_parallelism: 8
per_device_batch_size: 8
max_target_length: 8192
learning_rate: 0.001
model_name: llama3.1-70b
enable_checkpointing: false
quantization: fp8
attention: cudnn_flash_te
remat_policy: full
use_iota_embed: true
dataset_type: synthetic
logits_dot_in_fp32: false
scan_layers: True
enable_goodput_recording: false
monitor_goodput: false
save_config_to_gcs: true
```
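
The two configs differ only in the `quantization: fp8` line. As a quick consistency check (editorial arithmetic, not part of the commit), the parallelism and batch settings imply the 256-GPU topology and the per-step token count that appear in the example training log later in this recipe:

```
dcn_fsdp_parallelism (32) × ici_fsdp_parallelism (8) = 256 GPUs (32 nodes × 8 GPUs)
256 GPUs × per_device_batch_size (8)                 = 2,048 sequences per step
2,048 sequences × max_target_length (8,192)          = 16,777,216 tokens per step
```

The final figure matches the `total_weights: 16777216` field in the sample log entry shown below.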
training/a4/llama3-1-70b/maxtext-pretraining-gke/README.md — 235 additions, 0 deletions

# Pretrain Llama-3.1-70B workloads on A4 GKE Node pools using MaxText

This recipe outlines the steps for running a Llama-3.1-70B pretraining workload
on [A4 GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the
[MaxText framework](https://github.com/AI-Hypercomputer/maxtext).

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration -
  [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Job configuration and deployment - a Helm chart is used to configure and
  deploy the
  [Kubernetes Indexed Job](https://kubernetes.io/blog/2021/04/19/introducing-indexed-jobs).
  This job encapsulates the
  [MaxText pretraining workload](https://github.com/AI-Hypercomputer/maxtext/blob/main/MaxText/train.py).
  The chart generates the job's manifest, adhering to best practices for using
  RDMA over Ethernet (RoCE) with Google Kubernetes Engine (GKE).

## Test environment

This recipe has been optimized for and tested with the following configuration:

- A cluster with 32
  [a4-highgpu-8g](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms)
  machines.
- Machine placement in the cluster configured using a
  [compact placement policy](https://cloud.google.com/kubernetes-engine/docs/how-to/compact-placement).
- FP8 and BF16 precision training.
- A synthetic pretraining dataset provided by the MaxText framework. By
  default, the job is configured to execute 15 training steps. If you want to
  change the number of training steps, see
  [Configure and submit a pretraining job](#configure-and-submit-a-pretraining-job).

## Prerequisites

Before running this recipe, ensure your environment is configured as follows:

- A GKE cluster with the following setup:
    - An A4 node pool (32 nodes, 256 GPUs)
    - Kueue with topology-aware scheduling enabled
- A Google Cloud Storage (GCS) bucket to store results. *Important: this
  bucket must be in the same region as the GKE cluster*.
- A client workstation with the following pre-installed:
    - Google Cloud SDK
    - Helm
    - kubectl

To prepare the required environment, see the
[GKE environment setup guide](../../../../docs/configuring-environment-gke-a4.md).
## Run the recipe

It is recommended to use Cloud Shell as your client to complete the steps.
Cloud Shell comes pre-installed with the necessary utilities, including
`kubectl`, the Google Cloud SDK, and Helm.

### Launch Cloud Shell

In the Google Cloud console, start a
[Cloud Shell Instance](https://console.cloud.google.com/?cloudshell=true).

### Configure environment settings

From your client, complete the following steps:

1. Set the environment variables to match your environment:

    ```bash
    export PROJECT_ID=<PROJECT_ID>
    export REGION=<REGION>
    export CLUSTER_REGION=<CLUSTER_REGION>
    export CLUSTER_NAME=<CLUSTER_NAME>
    export GCS_BUCKET=<GCS_BUCKET>
    export KUEUE_NAME=<KUEUE_NAME>
    ```

    Replace the following values:

    - `<PROJECT_ID>`: your Google Cloud project ID
    - `<REGION>`: the region where you want to run Cloud Build
    - `<CLUSTER_REGION>`: the region where your cluster is located
    - `<CLUSTER_NAME>`: the name of your GKE cluster
    - `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Do not include
      the `gs://` prefix
    - `<KUEUE_NAME>`: the name of the Kueue queue configured for
      topology-aware scheduling. The default queue created by the Cluster
      Toolkit is `a4-high`. Verify the name of your local queue by running
      `kubectl get queues` and modify it as needed.

1. Set the default project:

    ```bash
    gcloud config set project $PROJECT_ID
    ```

### Get the recipe

From your client, clone the `gpu-recipes` repository and set a reference to the
recipe folder.

```bash
cd
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4/llama3-1-70b/maxtext-pretraining-gke
```

### Get cluster credentials

From your client, get the credentials for your cluster.

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```
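
Optionally, you can verify that the cluster matches the prerequisites before
submitting the job. The following checks are a sketch rather than part of the
recipe; the accelerator label on your nodes may differ by cluster setup:

```bash
# GPU nodes: selects nodes that carry the GKE accelerator label.
kubectl get nodes -l cloud.google.com/gke-accelerator

# Kueue local queues: the queue you exported as KUEUE_NAME should be listed.
kubectl get queues

# Bucket location: should match the cluster region (see Prerequisites).
gcloud storage buckets describe gs://${GCS_BUCKET} --format="value(location)"
```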
### Configure and submit a pretraining job

#### Using 32 nodes (256 GPUs)

The default job setting is 15 training steps and fp8 precision. To execute the
job with the default settings, run the following command from your client:

```bash
cd $RECIPE_ROOT
helm install -f values.yaml \
    --set-file maxtext_config=$REPO_ROOT/src/frameworks/a4/maxtext-configs/llama3-1-70b-256gpus-a4-fp8.yaml \
    --set workload.image=us-central1-docker.pkg.dev/deeplearning-images/reproducibility/jax-maxtext-gpu:jax0.5.1-cuda_dl25.02-rev1-maxtext-20150317 \
    --set workload.run_name=$USER-llama-3-1-70b-maxtext-fp8 \
    --set workload.gpus=256 \
    --set queue=$KUEUE_NAME \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    $USER-llama-3-1-70b-maxtext-fp8 \
    $REPO_ROOT/src/helm-charts/a4/maxtext-training
```

To run the job with bf16 precision, use the bf16 MaxText config instead:

```bash
cd $RECIPE_ROOT
helm install -f values.yaml \
    --set-file maxtext_config=$REPO_ROOT/src/frameworks/a4/maxtext-configs/llama3-1-70b-256gpus-a4-bf16.yaml \
    --set workload.image=us-central1-docker.pkg.dev/deeplearning-images/reproducibility/jax-maxtext-gpu:jax0.5.1-cuda_dl25.02-rev1-maxtext-20150317 \
    --set workload.run_name=$USER-llama-3-1-70b-maxtext-bf16 \
    --set workload.gpus=256 \
    --set queue=$KUEUE_NAME \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    $USER-llama-3-1-70b-maxtext-bf16 \
    $REPO_ROOT/src/helm-charts/a4/maxtext-training
```

#### Configure job settings

**Examples**

- To set the number of training steps to 100, run the following command from
  your client:

  ```bash
  cd $RECIPE_ROOT
  helm install -f values.yaml \
      --set-file maxtext_config=$REPO_ROOT/src/frameworks/a4/maxtext-configs/llama3-1-70b-256gpus-a4-fp8.yaml \
      --set workload.image=us-central1-docker.pkg.dev/deeplearning-images/reproducibility/jax-maxtext-gpu:jax0.5.1-cuda_dl25.02-rev1-maxtext-20150317 \
      --set workload.run_name=$USER-llama-3-1-70b-maxtext-fp8 \
      --set workload.gpus=256 \
      --set queue=$KUEUE_NAME \
      --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
      --set workload.steps=100 \
      $USER-llama-3-1-70b-maxtext-fp8 \
      $REPO_ROOT/src/helm-charts/a4/maxtext-training
  ```
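
You can also render the chart without submitting anything. This step is not
part of the recipe, but `helm template` accepts the same arguments as
`helm install` and prints the generated Job manifest, which is useful for
reviewing the RoCE and Kueue settings before launch:

```bash
cd $RECIPE_ROOT
helm template -f values.yaml \
    --set-file maxtext_config=$REPO_ROOT/src/frameworks/a4/maxtext-configs/llama3-1-70b-256gpus-a4-fp8.yaml \
    --set workload.image=us-central1-docker.pkg.dev/deeplearning-images/reproducibility/jax-maxtext-gpu:jax0.5.1-cuda_dl25.02-rev1-maxtext-20150317 \
    --set workload.run_name=$USER-llama-3-1-70b-maxtext-fp8 \
    --set workload.gpus=256 \
    --set queue=$KUEUE_NAME \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    $USER-llama-3-1-70b-maxtext-fp8 \
    $REPO_ROOT/src/helm-charts/a4/maxtext-training | less
```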
### Monitor the job

To check the status of pods in the indexed job, run the following command from
your client:

```bash
kubectl get pods | grep $USER-llama-3-1-70b-maxtext-fp8
```

To get the logs for one of the pods, run the following command from your client:

```bash
kubectl logs "<pod_name>"
```

### Analyze results

When the job completes, it writes TensorBoard logs to the following location:

```
gs://${GCS_BUCKET}/maxtext/$JOB_ID/tensorboard/$JOB_ID/
├── events.out.tfevents....
...
```
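
To browse these events locally, one option (a sketch, not prescribed by the
recipe) is to copy them out of the bucket with `gsutil` and point a local
TensorBoard installation at them. Replace `$JOB_ID` with the ID of your run:

```bash
# Copy the TensorBoard event files from the results bucket to the local machine.
gsutil -m cp -r gs://${GCS_BUCKET}/maxtext/$JOB_ID/tensorboard .

# Serve them locally; requires TensorBoard (pip install tensorboard).
tensorboard --logdir ./tensorboard
```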
To inspect the text logs generated by MaxText, retrieve them from any Pod in the
job using the following command: `kubectl logs "<pod_name>"`

Here is an example of a log entry:

```
completed step: 11, seconds: 20.935, TFLOP/s/device: 1507.142, Tokens/s/device: 3130.521, total_weights: 16777216, loss: 12.329
```

The logs show the step time in seconds and the TFLOP/s/device.

### Calculate training performance metrics (eMFU)

This section explains how to calculate the effective Model FLOPS Utilization
(eMFU) using the logs from the pods. Using the example log entry from the
previous step, which reports 1507.142 TFLOP/s/device, you can compute the eMFU
with the following formula:

```
        TFLOP/s/device    1507.142
eMFU = ---------------- = -------- = 0.6737 = 67.37%
        MAX TFLOP B200      2237
```

MAX TFLOP B200 BF16: 2237
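
The same computation can be scripted. Below is a sketch, not part of the
recipe, that averages the TFLOP/s/device values across all steps reported in a
pod's log, assuming the log format shown above:

```bash
# Average the reported TFLOP/s/device values and convert to eMFU
# against the 2237 TFLOP/s BF16 peak used in this recipe.
kubectl logs "<pod_name>" \
  | grep -o 'TFLOP/s/device: [0-9.]*' \
  | awk -v peak=2237 '{ sum += $2; n++ }
      END { if (n) printf "steps: %d  mean TFLOP/s/device: %.3f  eMFU: %.2f%%\n", n, sum/n, 100 * sum / (n * peak) }'
```

Early warm-up steps typically run slower, so you may want to drop the first few
measurements before averaging.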
### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart. To
uninstall the Helm releases, run the following commands from your client:

```bash
helm uninstall $USER-llama-3-1-70b-maxtext-fp8
helm uninstall $USER-llama-3-1-70b-maxtext-bf16
```
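
To confirm that the releases were removed, `helm list` can filter by name; for
example:

```bash
# Lists releases whose names match the filter; empty output means none remain.
helm list --filter "$USER-llama-3-1-70b-maxtext"
```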
training/a4/llama3-1-70b/maxtext-pretraining-gke/values.yaml — 57 additions, 0 deletions
@@ -0,0 +1,57 @@
1+
# Copyright 2024 Google LLC
2+
#
3+
# Licensed under the Apache License, Version 2.0 (the "License");
4+
# you may not use this file except in compliance with the License.
5+
# You may obtain a copy of the License at
6+
#
7+
# http://www.apache.org/licenses/LICENSE-2.0
8+
#
9+
# Unless required by applicable law or agreed to in writing, software
10+
# distributed under the License is distributed on an "AS IS" BASIS,
11+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12+
# See the License for the specific language governing permissions and
13+
# limitations under the License.
14+
15+
queue:
16+
podsetPreferredTopology: "kubernetes.io/hostname"
17+
18+
volumes:
19+
# The VM host path for SSDs is assumed at /mnt/stateful_partition/kube-ephemeral-ssd
20+
ssdMountPath: "/ssd"
21+
22+
gcsMounts:
23+
- bucketName:
24+
mountPath: /gcs
25+
26+
workload:
27+
gpus: 256 # This should be one of: {<= 8, multiple of 8}
28+
steps: 15
29+
30+
network:
31+
hostNetwork: True
32+
subnetworks[]:
33+
gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.0.5
34+
ncclSettings:
35+
- name: NCCL_DEBUG
36+
value: "INFO"
37+
- name: JAX_ENABLE_COMPILATION_CACHE
38+
value: False
39+
40+
xlaFlags: >-
41+
--xla_gpu_enable_latency_hiding_scheduler=true
42+
--xla_gpu_enable_triton_gemm=false
43+
--xla_gpu_enable_command_buffer=FUSION,CUSTOM_CALL
44+
--xla_gpu_all_reduce_combine_threshold_bytes=2147483648
45+
--xla_gpu_all_gather_combine_threshold_bytes=2147483648
46+
--xla_gpu_reduce_scatter_combine_threshold_bytes=16777216
47+
--xla_gpu_enable_pipelined_all_gather=true
48+
--xla_gpu_enable_pipelined_reduce_scatter=true
49+
--xla_gpu_enable_pipelined_all_reduce=true
50+
--xla_gpu_enable_while_loop_double_buffering=true
51+
--xla_gpu_enable_all_gather_combine_by_dim=false
52+
--xla_gpu_enable_reduce_scatter_combine_by_dim=false
53+
--xla_disable_hlo_passes=rematerialization
54+
55+
56+
57+
