docs/configuring-environment-gke-a4.md
+11 -11 (11 additions and 11 deletions)
@@ -1,6 +1,6 @@
1
-
# Configuring the environment for running benchmark recipes on a GKE Cluster with A4 High Node Pools
1
+
# Configuring the environment for running benchmark recipes on a GKE Cluster with A4 Node Pools
2
2
3
-
This [guide](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute) outlines the steps to configure the environment required to run benchmark recipes on a [Google Kubernetes Engine (GKE) cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/kubernetes-engine-overview) with [A4 High](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) node pools.
3
+
This [guide](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute) outlines the steps to configure the environment required to run benchmark recipes on a [Google Kubernetes Engine (GKE) cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/kubernetes-engine-overview) with [A4](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) node pools.
4
4
5
5
## Prerequisites
6
6
@@ -26,7 +26,7 @@ Before you begin, ensure you have completed the following:
26
26
27
27
## Reserve capacity
28
28
29
-
To ensure that your workloads have the A4 High GPU resources required for these
29
+
To ensure that your workloads have the A4 GPU resources required for these
30
30
instructions, you can create a [future reservation request](https://cloud.google.com/compute/docs/instances/future-reservations-overview).
31
31
With this request, you can reserve blocks of capacity for a defined duration in the
32
32
future. At that date and time in the future, Compute Engine automatically
@@ -77,7 +77,7 @@ The environment comprises of the following components:
77
77
- [Artifact Registry](https://cloud.google.com/artifact-registry/docs/overview): serves as a
78
78
private container registry for storing and managing Docker images used in the deployment.
80
-
Cluster with A4 High Node Pools: provides a managed Kubernetes environment to run benchmark
80
+
Cluster with A4 Node Pools: provides a managed Kubernetes environment to run benchmark
81
81
recipes.
82
82
83
83
## Set up the client workstation
@@ -150,16 +150,16 @@ Replace the following:
150
150
repository descriptions are not encrypted.
151
151
152
152
153
-
## Create a GKE Cluster with A4 High Node Pools
153
+
## Create a GKE Cluster with A4 Node Pools
154
154
155
155
Follow [this guide]() for
156
-
detailed instructions to create a GKE cluster with A4 High node pools and required GPU driver versions.
156
+
detailed instructions to create a GKE cluster with A4 node pools and required GPU driver versions.
157
157
158
158
The documentation uses [Cluster Toolkit](https://cloud.google.com/cluster-toolkit/docs/overview) to create your GKE cluster quickly while incorporating best practices:
159
159
160
160
- Creation of the necessary VPC networks and subnets.
161
161
- Creation of a GKE cluster with multi-networking enabled.
162
-
- Creation of an A4 High node pool with NVIDIA B200 GPUs.
162
+
- Creation of an A4 node pool with NVIDIA B200 GPUs.
163
163
- Installation of the required components for GPUDirect-RDMA and NCCL plugin.
164
164
165
165
1. [Launch Cloud Shell](https://cloud.google.com/shell/docs/launching-cloud-shell). You can use a
@@ -205,13 +205,13 @@ The documentation uses [ Cluster Toolkit](https://cloud.google.com/cluster-toolk
205
205
previous step to store the state of Terraform deployment.
206
206
* `PROJECT_ID`: your Google Cloud project ID.
207
207
* `COMPUTE_REGION`: the compute region for the cluster.
208
-
* `COMPUTE_ZONE`: the compute zone for the node pool of A4 High machines.
208
+
* `COMPUTE_ZONE`: the compute zone for the node pool of A4 machines.
209
209
* `IP_ADDRESS/SUFFIX`: the IP address range that you want to allow to
210
210
connect with the cluster. This CIDR block must include the IP address of
211
211
the machine that calls Terraform.
212
212
* `RESERVATION_NAME`: the name of your reservation.
213
213
* `BLOCK_NAME`: the name of a specific block within the reservation.
214
-
* `NODE_COUNT`: the number of A4 High nodes in your cluster.
214
+
* `NODE_COUNT`: the number of A4 nodes in your cluster.
215
215
216
216
To modify advanced settings, edit
217
217
`examples/gke-a4-highgpu/gke-a4-highgpu.yaml`.
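
Because the deployment depends on several placeholder values, a small pre-flight check may help catch an unset value before Terraform runs. The following is a minimal sketch, not part of Cluster Toolkit or of the guide itself: the helper names `check_required_vars` and `is_positive_integer` are hypothetical, and the sketch assumes the placeholders listed above have been exported as shell variables with the same names.

```sh
# Hypothetical pre-flight helpers; they are not provided by Cluster Toolkit.
# Assumption: the placeholders from the list above are set as shell variables.

# Prints the name of every listed variable that is unset or empty.
# Returns 0 if all are set, 1 otherwise.
# Note: plain POSIX sh has no local variables, so missing/name/value are global.
check_required_vars() {
  missing=0
  for name in "$@"; do
    # Read the value of the variable whose name is stored in $name.
    # The names come from a fixed list chosen by the caller, so eval is safe here.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Succeeds only if the argument is made of digits and is greater than zero,
# which is the shape NODE_COUNT is expected to have.
is_positive_integer() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -gt 0 ]
}
```

One possible use is to call `check_required_vars PROJECT_ID COMPUTE_REGION COMPUTE_ZONE RESERVATION_NAME BLOCK_NAME NODE_COUNT` followed by `is_positive_integer "$NODE_COUNT"`, and to continue to the deployment step only if both succeed.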
@@ -220,7 +220,7 @@ The documentation uses [ Cluster Toolkit](https://cloud.google.com/cluster-toolk
220
220
to provide access to Terraform.
221
221
222
222
1. Deploy the blueprint to provision the GKE infrastructure
223
-
using A4 High machine types:
223
+
using A4 machine types:
224
224
225
225
```sh
226
226
cd ~/cluster-toolkit
@@ -242,7 +242,7 @@ VPC networks and GKE cluster:
242
242
243
243
## What's next
244
244
245
-
Once you have set up your GKE cluster with A4 High node pools, you can proceed to deploy and
245
+
Once you have set up your GKE cluster with A4 node pools, you can proceed to deploy and
246
246
run your [benchmark recipes](../README.md#benchmarks-support-matrix).