Commit 2850f84

Author: Copybara
Copybara import of gpu-recipes:
- 40fdebcca607c41b2db9c54761568e5c9fb3701b Change a4high to a4 to match external comms

GitOrigin-RevId: 40fdebcca607c41b2db9c54761568e5c9fb3701b
1 parent 748a454 commit 2850f84

19 files changed, +550 -514 lines changed

README.md

Lines changed: 5 additions & 5 deletions
@@ -35,12 +35,12 @@ Models | GPU Machine Type
 **Llama-3.1-405B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | NeMo. | Pre-training | GKE | [Link](./training/a3ultra/llama3-1-405b/nemo-pretraining-gke/README.md)
 **Mixtral-8-7B** | [A3 Ultra (NVIDIA H200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a3-ultra-vms) | NeMo | Pre-training | GKE | [Link](./training/a3ultra/mixtral-8x7b/nemo-pretraining-gke/README.md)
 
-### Training benchmarks A4 High
+### Training benchmarks A4
 
-Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe
------------------- | ----------------------------------------------------------------------------------------------------------- | --------- | ------------- | ------------ | ------------------
-**Llama-3.1-405B** | [A4 High (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-high-vms) | MaxText | Pre-training | GKE | [Link](./training/a4high/llama3-1-405b/maxtext-pretraining-gke/README.md)
-**Llama-3.1-405B** | [A4 High (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-high-vms) | NeMo | Pre-training | GKE | [Link](./training/a4high/llama3-1-405b/nemo-pretraining-gke/README.md)
+Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe
+------------------ | ---------------------------------------------------------------------------------------------------- | --------- | ------------- | ------------ | ------------------
+**Llama-3.1-405B** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | MaxText | Pre-training | GKE | [Link](./training/a4/llama3-1-405b/maxtext-pretraining-gke/README.md)
+**Llama-3.1-405B** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | NeMo | Pre-training | GKE | [Link](./training/a4/llama3-1-405b/nemo-pretraining-gke/README.md)
 
 ### Inference benchmarks A3 Mega

docs/configuring-environment-gke-a4-high.md renamed to docs/configuring-environment-gke-a4.md

Lines changed: 11 additions & 11 deletions
@@ -1,6 +1,6 @@
-# Configuring the environment for running benchmark recipes on a GKE Cluster with A4 High Node Pools
+# Configuring the environment for running benchmark recipes on a GKE Cluster with A4 Node Pools
 
-This [guide](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute) outlines the steps to configure the environment required to run benchmark recipes on a [Google Kubernetes Engine (GKE) cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/kubernetes-engine-overview) with [A4 High](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) node pools.
+This [guide](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute) outlines the steps to configure the environment required to run benchmark recipes on a [Google Kubernetes Engine (GKE) cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/kubernetes-engine-overview) with [A4](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) node pools.
 
 ## Prerequisites

@@ -26,7 +26,7 @@ Before you begin, ensure you have completed the following:
 
 ## Reserve capacity
 
-To ensure that your workloads have the A4 High GPU resources required for these
+To ensure that your workloads have the A4 GPU resources required for these
 instructions, you can create a [future reservation request](https://cloud.google.com/compute/docs/instances/future-reservations-overview).
 With this request, you can reserve blocks of capacity for a defined duration in the
 future. At that date and time in the future, Compute Engine automatically
@@ -77,7 +77,7 @@ The environment comprises of the following components:
 - [Artifact Registry](https://cloud.google.com/artifact-registry/docs/overview): serves as a
 private container registry for storing and managing Docker images used in the deployment.
 - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine/docs/concepts/kubernetes-engine-overview)
-Cluster with A4 High Node Pools: provides a managed Kubernetes environment to run benchmark
+Cluster with A4 Node Pools: provides a managed Kubernetes environment to run benchmark
 recipes.
 
 ## Set up the client workstation
@@ -150,16 +150,16 @@ Replace the following:
 repository descriptions are not encrypted.
 
 
-## Create a GKE Cluster with A4 High Node Pools
+## Create a GKE Cluster with A4 Node Pools
 
 Follow [this guide]() for
-detailed instructions to create a GKE cluster with A4 High node pools and required GPU driver versions.
+detailed instructions to create a GKE cluster with A4 node pools and required GPU driver versions.
 
 The documentation uses [ Cluster Toolkit](https://cloud.google.com/cluster-toolkit/docs/overview) to create your GKE cluster quickly while incorporating best practices:
 
 - Creation of the necessary VPC networks and subnets.
 - Creation of a GKE cluster with multi-networking enabled.
-- Creation of an A4 High node pool with NVIDIA B200 GPUs.
+- Creation of an A4 node pool with NVIDIA B200 GPUs.
 - Installation of the required components for GPUDirect-RDMA and NCCL plugin.
 
 1. [Launch Cloud Shell](https://cloud.google.com/shell/docs/launching-cloud-shell). You can use a
@@ -205,13 +205,13 @@ The documentation uses [ Cluster Toolkit](https://cloud.google.com/cluster-toolk
 previous step to store the state of Terraform deployment.
 * `PROJECT_ID`: your Google Cloud project ID.
 * `COMPUTE_REGION`: the compute region for the cluster.
-* `COMPUTE_ZONE`: the compute zone for the node pool of A4 High machines.
+* `COMPUTE_ZONE`: the compute zone for the node pool of A4 machines.
 * `IP_ADDRESS/SUFFIX`: The IP address range that you want to allow to
 connect with the cluster. This CIDR block must include the IP address of
 the machine to call Terraform.
 * `RESERVATION_NAME`: the name of your reservation.
 * `BLOCK_NAME`: the name of a specific block within the reservation.
-* `NODE_COUNT`: the number of A4 High nodes in your cluster.
+* `NODE_COUNT`: the number of A4 nodes in your cluster.
 
 To modify advanced settings, edit
 `examples/gke-a4-highgpu/gke-a4-highgpu.yaml`.
@@ -220,7 +220,7 @@ The documentation uses [ Cluster Toolkit](https://cloud.google.com/cluster-toolk
 to provide access to Terraform.
 
 1. Deploy the blueprint to provision the GKE infrastructure
-using A4 High machine types:
+using A4 machine types:
 
 ```sh
 cd ~/cluster-toolkit
@@ -242,7 +242,7 @@ VPC networks and GKE cluster:
 
 ## What's next
 
-Once you have set up your GKE cluster with A4 High node pools, you can proceed to deploy and
+Once you have set up your GKE cluster with A4 node pools, you can proceed to deploy and
 run your [benchmark recipes](../README.md#benchmarks-support-matrix).
 
 ## Get Help
