Commit 2aa1fb0

Fix single-host TPU nodepools, always set num-nodes if not flex. (#822)

1 parent d5a2b2e · commit 2aa1fb0

11 files changed: +413 −11
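The behavior change the goldens below encode can be sketched as follows. `SystemCharacteristics` mirrors the dataclass printed in the dry-run logs, but `build_nodepool_flags` and its `flex` parameter are illustrative names, not xpk's actual API; this is a sketch of what the updated goldens imply, not the real implementation.

```python
# Sketch of the node-pool flag logic implied by this commit's goldens.
# SystemCharacteristics mirrors the dataclass printed in the dry-run logs;
# build_nodepool_flags and `flex` are hypothetical names, not xpk's API.
from dataclasses import dataclass


@dataclass
class SystemCharacteristics:
    topology: str
    vms_per_slice: int
    gce_machine_type: str


def build_nodepool_flags(system: SystemCharacteristics, flex: bool = False) -> list[str]:
    flags = [f"--machine-type={system.gce_machine_type}"]
    if not flex:
        # The fix: always pass --num-nodes when not using flex provisioning,
        # including for single-host (vms_per_slice == 1) node pools.
        flags.append(f"--num-nodes={system.vms_per_slice}")
    if system.vms_per_slice > 1:
        # Per the updated goldens, --max-pods-per-node stays only on
        # multi-host node pools; it was dropped from the single-host command.
        flags.append("--max-pods-per-node 15")
    return flags


# tpu7x-8 (2x2x1) is single-host; tpu7x-16 (2x2x2) is multi-host.
single_host = SystemCharacteristics("2x2x1", 1, "tpu7x-standard-4t")
multi_host = SystemCharacteristics("2x2x2", 2, "tpu7x-standard-4t")
print(build_nodepool_flags(single_host))
print(build_nodepool_flags(multi_host))
```

This matches the two golden commands below: the single-host `tpu7x-8` node pool gets `--num-nodes=1` and loses `--max-pods-per-node 15`, while the multi-host `tpu7x-16` pool gets `--num-nodes=2` and keeps it.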

goldens.yaml

Lines changed: 2 additions & 0 deletions

@@ -5,6 +5,8 @@ goldens:
   command: xpk cluster create --project=golden-project --zone=us-central1-a --enable-autoprovisioning --cluster=golden-cluster --tpu-type=tpu7x-8 --on-demand --dry-run
 "Basic cluster create":
   command: xpk cluster create --project=golden-project --zone=us-central1-a --cluster=golden-cluster --tpu-type=tpu7x-8 --spot --dry-run
+"Cluster create for multi-host nodepool":
+  command: xpk cluster create --project=golden-project --zone=us-central1-a --cluster=golden-cluster --tpu-type=tpu7x-16 --spot --dry-run
 "Cluster create with CPU and memory limits below capacity":
   command: xpk cluster create --project=golden-project --zone=us-central1-a --cluster=golden-cluster --tpu-type=tpu7x-8 --spot --cpu-limit=1 --memory-limit=1Mi --dry-run
 "Cluster create with CPU and memory limits above capacity":

goldens/Basic_cluster_create.txt

Lines changed: 1 addition & 1 deletion

@@ -52,7 +52,7 @@ kubectl get configmap golden-cluster-resources-configmap -o=custom-columns="ConfigData:data" --no-headers=true
 [XPK] Existing node pool names ['0']
 [XPK] Task: `Retrieve resource policy` is implemented by the following command not running since it is a dry run.
 gcloud compute resource-policies describe tpu7x-8-2x2x1-placement-policy --project=golden-project --region=us-central1
-[XPK] To complete NodepoolCreate-golden-cluster-np-0 we are executing gcloud beta container node-pools create golden-cluster-np-0 --location=us-central1 --cluster=golden-cluster --project=golden-project --node-locations=us-central1-a --machine-type=tpu7x-standard-4t --host-maintenance-interval=AS_NEEDED --spot --placement-policy=tpu7x-8-2x2x1-placement-policy --enable-gvnic --node-version=0 --num-nodes=1 --scopes=storage-full,gke-default,"https://www.googleapis.com/auth/cloud-platform" --max-pods-per-node 15
+[XPK] To complete NodepoolCreate-golden-cluster-np-0 we are executing gcloud beta container node-pools create golden-cluster-np-0 --location=us-central1 --cluster=golden-cluster --project=golden-project --node-locations=us-central1-a --machine-type=tpu7x-standard-4t --host-maintenance-interval=AS_NEEDED --spot --placement-policy=tpu7x-8-2x2x1-placement-policy --enable-gvnic --node-version=0 --num-nodes=1 --scopes=storage-full,gke-default,"https://www.googleapis.com/auth/cloud-platform"
 [XPK] Breaking up a total of 1 commands into 1 batches
 [XPK] Pretending all the jobs succeeded
 [XPK] Create or delete node pool request complete.
Lines changed: 201 additions & 0 deletions

@@ -0,0 +1,201 @@
+$ xpk cluster create --project=golden-project --zone=us-central1-a --cluster=golden-cluster --tpu-type=tpu7x-16 --spot --dry-run
+[XPK] Starting xpk v0.14.3
+[XPK] Starting cluster create for cluster golden-cluster:
+[XPK] Working on golden-project and us-central1-a
+[XPK] Task: `Determine server supported GKE versions for default rapid gke version` is implemented by the following command not running since it is a dry run.
+gcloud container get-server-config --project=golden-project --region=us-central1 --flatten="channels" --filter="channels.channel=RAPID" --format="value(channels.defaultVersion)"
+[XPK] Task: `Determine server supported GKE versions for valid versions` is implemented by the following command not running since it is a dry run.
+gcloud container get-server-config --project=golden-project --region=us-central1 --flatten="channels" --filter="channels.channel=RAPID" --format="value(channels.validVersions)"
+[XPK] Task: `Find if Cluster Exists` is implemented by the following command not running since it is a dry run.
+gcloud container clusters list --project=golden-project --filter=location~"us-central1.*" --format="csv[no-heading](name)"
+[XPK] Task: `GKE Cluster Create` is implemented by the following command not running since it is a dry run.
+gcloud beta container clusters create golden-cluster --project=golden-project --region=us-central1 --node-locations=us-central1-a --cluster-version=0 --machine-type=e2-standard-16 --enable-autoscaling --total-min-nodes 1 --total-max-nodes 1000 --num-nodes 6 --enable-dns-access --autoscaling-profile=optimize-utilization --labels=gke_product_type=xpk --location-policy=BALANCED --scopes=storage-full,gke-default
+[XPK] Task: `Find cluster region or zone` is implemented by the following command not running since it is a dry run.
+gcloud container clusters list --project=golden-project --filter=name=golden-cluster --format="value(location)"
+[XPK] Task: `Check if Private Nodes is enabled in cluster.` is implemented by the following command not running since it is a dry run.
+gcloud container clusters describe golden-cluster --project=golden-project --location=us-central1 --format="value(privateClusterConfig.enablePrivateNodes)"
+[XPK] Private Nodes is not enabled on the cluster.
+[XPK] Cluster is public and no need to authorize networks.
+[XPK] Try 1: get-credentials-dns-endpoint to cluster golden-cluster
+[XPK] Task: `get-credentials-dns-endpoint to cluster golden-cluster` is implemented by the following command not running since it is a dry run.
+gcloud container clusters get-credentials golden-cluster --location=us-central1 --dns-endpoint --project=golden-project && kubectl config view && kubectl config set-context --current --namespace=default
+[XPK] Testing credentials with kubectl...
+[XPK] Task: `kubectl get pods` is implemented by the following command not running since it is a dry run.
+kubectl get pods
+[XPK] Credentials test succeeded.
+[XPK] Finished get-credentials and kubectl setup.
+[XPK] Task: 'Checking CoreDNS deployment existence' in progress for namespace: kube-system
+[XPK] Task: `Check CoreDNS deployment in kube-system` is implemented by the following command not running since it is a dry run.
+kubectl get deployment coredns -n kube-system
+[XPK] Now verifying CoreDNS readiness...
+[XPK] Task: `Waiting for kubeDNS to be checked.` is implemented by the following command not running since it is a dry run.
+kubectl get deployment kube-dns -n kube-system --ignore-not-found
+[XPK] kube-dns deployment not found.
+[XPK] Verifying if CoreDNS is available...
+[XPK] Task: `Wait for coredns available` is implemented by the following command not running since it is a dry run.
+kubectl wait deployment/coredns --for=condition=Available=true --namespace=kube-system --timeout=240s
+[XPK] CoreDNS has successfully started and passed verification.
+[XPK] CoreDNS deployment 'coredns' found in namespace 'kube-system'.
+[XPK] Skipping CoreDNS deployment since it already exists.
+[XPK] Task: `Determine current gke master version` is implemented by the following command not running since it is a dry run.
+gcloud beta container clusters describe golden-cluster --location us-central1 --project golden-project --format="value(currentMasterVersion)"
+[XPK] Creating 1 node pool or pools of tpu7x-16
+We assume that the underlying system is: SystemCharacteristics(topology='2x2x2', vms_per_slice=2, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-16', supports_sub_slicing=False, requires_workload_policy=True)
+[XPK] Task: `Get All Node Pools` is implemented by the following command not running since it is a dry run.
+gcloud beta container node-pools list --cluster golden-cluster --project=golden-project --location=us-central1 --format="csv[no-heading](name)"
+[XPK] Creating 1 node pool or pools of tpu7x-16
+Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x2', vms_per_slice=2, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-16', supports_sub_slicing=False, requires_workload_policy=True)
+[XPK] Task: `Get Node Pool Zone` is implemented by the following command not running since it is a dry run.
+gcloud beta container node-pools describe 0 --cluster golden-cluster --project=golden-project --location=us-central1 --format="value(locations)"
+[XPK] Task: `GKE Cluster Get ConfigMap` is implemented by the following command not running since it is a dry run.
+kubectl get configmap golden-cluster-resources-configmap -o=custom-columns="ConfigData:data" --no-headers=true
+[XPK] Existing node pool names ['0']
+[XPK] Task: `Retrieve resource policy` is implemented by the following command not running since it is a dry run.
+gcloud compute resource-policies describe tpu7x-16-2x2x2-placement-policy --project=golden-project --region=us-central1
+[XPK] To complete NodepoolCreate-golden-cluster-np-0 we are executing gcloud beta container node-pools create golden-cluster-np-0 --location=us-central1 --cluster=golden-cluster --project=golden-project --node-locations=us-central1-a --machine-type=tpu7x-standard-4t --host-maintenance-interval=AS_NEEDED --spot --placement-policy=tpu7x-16-2x2x2-placement-policy --enable-gvnic --node-version=0 --num-nodes=2 --scopes=storage-full,gke-default,"https://www.googleapis.com/auth/cloud-platform" --max-pods-per-node 15
+[XPK] Breaking up a total of 1 commands into 1 batches
+[XPK] Pretending all the jobs succeeded
+[XPK] Create or delete node pool request complete.
+[XPK] Creating ConfigMap for cluster
+[XPK] Breaking up a total of 2 commands into 1 batches
+[XPK] Pretending all the jobs succeeded
+[XPK] Enabling the jobset API on our cluster, to be deprecated when Jobset is globally available
+[XPK] Try 1: Install Jobset on golden-cluster
+[XPK] Task: `Install Jobset on golden-cluster` is implemented by the following command not running since it is a dry run.
+kubectl apply --server-side --force-conflicts -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.8.0/manifests.yaml
+[XPK] Task: `Count total nodes` is implemented by the following command not running since it is a dry run.
+kubectl get node --no-headers | wc -l
+[XPK] Try 1: Updating jobset Controller Manager resources
+[XPK] Task: `Updating jobset Controller Manager resources` is implemented by the following command not running since it is a dry run.
+kubectl apply -f 1b31e624e490f9c8c4ef4e369f08d3fa467990af5a261e4405bd045265d70e95
+[XPK] Try 1: Install PathwaysJob on golden-cluster
+[XPK] Task: `Install PathwaysJob on golden-cluster` is implemented by the following command not running since it is a dry run.
+kubectl apply --server-side -f https://github.com/google/pathways-job/releases/download/v0.1.4/install.yaml
+[XPK] Enabling Kueue on the cluster
+[XPK] Task: `Get kueue version on server` is implemented by the following command not running since it is a dry run.
+kubectl get deployment kueue-controller-manager -n kueue-system -o jsonpath='{.spec.template.spec.containers[0].image}'
+[XPK] Installing Kueue version v0.14.3...
+[XPK] Try 1: Install Kueue
+[XPK] Task: `Install Kueue` is implemented by the following command not running since it is a dry run.
+kubectl apply --server-side --force-conflicts -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.14.3/manifests.yaml
+[XPK] Task: `Wait for Kueue to be available` is implemented by the following command not running since it is a dry run.
+kubectl wait deploy/kueue-controller-manager -n kueue-system --for=condition=available --timeout=10m
+[XPK] Applying following Kueue resources:
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ResourceFlavor
+metadata:
+  name: "1xtpu7x-16"
+spec:
+  nodeLabels: {"cloud.google.com/gke-tpu-accelerator": "tpu7x", "cloud.google.com/gke-tpu-topology": "2x2x2"}
+
+---
+
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: AdmissionCheck
+metadata:
+  name: dws-prov
+spec:
+  controllerName: kueue.x-k8s.io/provisioning-request
+  parameters:
+    apiGroup: kueue.x-k8s.io
+    kind: ProvisioningRequestConfig
+    name: dws-config
+---
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ProvisioningRequestConfig
+metadata:
+  name: dws-config
+spec:
+  provisioningClassName: queued-provisioning.gke.io
+  podSetUpdates:
+    nodeSelector:
+    - key: autoscaling.gke.io/provisioning-request
+      valueFromProvisioningClassDetail: ResizeRequestName
+  managedResources:
+  - google.com/tpu
+---
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ClusterQueue
+metadata:
+  name: "cluster-queue"
+spec:
+  preemption:
+    reclaimWithinCohort: Never # Don't preempt other queues in the cohort.
+    withinClusterQueue: LowerPriority
+  namespaceSelector: {} # match all.
+  resourceGroups: [{'coveredResources': ['google.com/tpu'], 'flavors': [{'name': '1xtpu7x-16', 'resources': [{'name': 'google.com/tpu', 'nominalQuota': 8}]}]}]
+
+---
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: LocalQueue
+metadata:
+  namespace: default
+  name: multislice-queue
+spec:
+  clusterQueue: cluster-queue
+---
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+  name: very-low
+value: 100
+globalDefault: false
+description: "Very Low"
+---
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+  name: low
+value: 250
+globalDefault: false
+description: "Low"
+---
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+  name: medium
+value: 500
+globalDefault: false
+description: "Medium"
+---
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+  name: high
+value: 750
+globalDefault: false
+description: "High"
+---
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+  name: very-high
+value: 1000
+globalDefault: false
+description: "Very High"
+[XPK] Task: `Applying Kueue Custom Resources` is implemented by the following command not running since it is a dry run.
+kubectl apply -f f0a510ac08b9c6d8f549478c49836dca41a72a347c491acac1fa70272d531056
+[XPK] Task: `Count total nodes` is implemented by the following command not running since it is a dry run.
+kubectl get node --no-headers | wc -l
+[XPK] Try 1: Updating Kueue Controller Manager resources
+[XPK] Task: `Updating Kueue Controller Manager resources` is implemented by the following command not running since it is a dry run.
+kubectl patch deployment kueue-controller-manager -n kueue-system --type='strategic' --patch='{"spec": {"template": {"spec": {"containers": [{"name": "manager", "resources": {"limits": {"memory": "4096Mi"}}}]}}}}'
+[XPK] Verifying kjob installation
+[XPK] Task: `Verify kjob installation ` is implemented by the following command not running since it is a dry run.
+kubectl-kjob help
+[XPK] kjob found
+[XPK] Applying kjob CDRs
+[XPK] Task: `Create kjob CRDs on cluster` is implemented by the following command not running since it is a dry run.
+kubectl kjob printcrds | kubectl apply --server-side -f -
+[XPK] Creating kjob CRDs succeeded
+[XPK] Task: `GKE Cluster Get ConfigMap` is implemented by the following command not running since it is a dry run.
+kubectl get configmap golden-cluster-resources-configmap -o=custom-columns="ConfigData:data" --no-headers=true
+[XPK] Task: `Creating JobTemplate` is implemented by the following command not running since it is a dry run.
+kubectl apply -f 4abb796ed6e7c9d7256a51f13124efd989fc12ee83839bed432fcf7d64f68e61
+[XPK] Task: `Creating PodTemplate` is implemented by the following command not running since it is a dry run.
+kubectl apply -f a63aa3c4593c38ad90671fd8b067d1886f6313ad558379b364b51791aa50f4e8
+[XPK] Task: `Creating AppProfile` is implemented by the following command not running since it is a dry run.
+kubectl apply -f 1d13ddebae3c90a05ba26b312df088982dd0df0edc4f4013b88384e476c20486
+[XPK] GKE commands done! Resources are created.
+[XPK] See your GKE Cluster here: https://console.cloud.google.com/kubernetes/clusters/details/us-central1/golden-cluster/details?project=golden-project
+[XPK] Exiting XPK cleanly
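One detail worth cross-checking in the golden above: the ClusterQueue's `nominalQuota: 8` for `google.com/tpu` is consistent with the printed SystemCharacteristics (one slice, `vms_per_slice=2`, `chips_per_vm=4`). A minimal sketch of that arithmetic, with an illustrative helper name that is not part of xpk:

```python
# Cross-check of the ClusterQueue quota in the golden above.
# tpu_nominal_quota is a hypothetical helper, not xpk code.
def tpu_nominal_quota(num_slices: int, vms_per_slice: int, chips_per_vm: int) -> int:
    # Total google.com/tpu chips Kueue should admit across all slices.
    return num_slices * vms_per_slice * chips_per_vm


# tpu7x-16: one slice of 2x2x2 topology = 2 VMs x 4 chips per VM.
print(tpu_nominal_quota(1, 2, 4))  # → 8
```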

0 commit comments