Open
Description
I am following this recipe: https://github.com/AI-Hypercomputer/gpu-recipes/blob/main/inference/a3mega/deepseek-r1-671b/sglang-serving-gke/README.md
I have already enabled the Cloud Storage FUSE CSI driver.
gcloud container node-pools create a3-mega \
  --location=${ZONE} \
  --num-nodes=2 \
  --machine-type=a3-megagpu-8g \
  --accelerator=type=nvidia-h100-mega-80gb,count=8,gpu-driver-version=LATEST \
  --placement-type=COMPACT \
  --cluster=${CLUSTER_NAME} \
  --spot
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m45s default-scheduler Successfully assigned default/tiangel-serving-deepseek-r1-model-0 to gke-tiangel-cluster-1-a3-mega-91214fa0-kz3w
Normal Pulling 6m45s kubelet Pulling image "us-central1-artifactregistry.gcr.io/gke-release/gke-release/gcs-fuse-csi-driver-sidecar-mounter:v1.8.3-gke.2@sha256:07a5a7b18b083c47031c540e1664eb0c777a50e523dde030d8b0effdc9bb8761"
Normal Pulled 6m44s kubelet Successfully pulled image "us-central1-artifactregistry.gcr.io/gke-release/gke-release/gcs-fuse-csi-driver-sidecar-mounter:v1.8.3-gke.2@sha256:07a5a7b18b083c47031c540e1664eb0c777a50e523dde030d8b0effdc9bb8761" in 604ms (604ms including waiting). Image size: 31687282 bytes.
Normal Created 6m44s kubelet Created container: gke-gcsfuse-sidecar
Normal Started 6m44s kubelet Started container gke-gcsfuse-sidecar
Normal Pulling 6m44s kubelet Pulling image "us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-gpudirecttcpx-dev:v1.0.8-1"
Normal Pulled 5m29s kubelet Successfully pulled image "us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-gpudirecttcpx-dev:v1.0.8-1" in 1m14.607s (1m14.607s including waiting). Image size: 7599167601 bytes.
Normal Created 5m29s kubelet Created container: nccl-plugin-installer
Normal Started 5m29s kubelet Started container nccl-plugin-installer
Normal Pulled 4m39s kubelet Successfully pulled image "us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.14" in 46.581s (46.581s including waiting). Image size: 5243380426 bytes.
Normal Pulling 4m39s kubelet Pulling image "us-central1-docker.pkg.dev/gpu-launchpad-playground/wwoo-docker/sglang:v0.4.3.post2-cu125-srt"
Normal Pulled 2m57s kubelet Successfully pulled image "us-central1-docker.pkg.dev/gpu-launchpad-playground/wwoo-docker/sglang:v0.4.3.post2-cu125-srt" in 1m41.712s (1m41.712s including waiting). Image size: 11753203350 bytes.
Normal Created 2m57s kubelet Created container: sglang-leader
Normal Started 2m57s kubelet Started container sglang-leader
Normal Pulling 2m56s (x2 over 5m26s) kubelet Pulling image "us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.14"
Normal Created 2m56s (x2 over 4m39s) kubelet Created container: tcpxo-daemon
Normal Started 2m56s (x2 over 4m39s) kubelet Started container tcpxo-daemon
Normal Pulled 2m56s kubelet Successfully pulled image "us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.14" in 682ms (682ms including waiting). Image size: 5243380426 bytes.
Normal Killing 2m53s kubelet Stopping container gke-gcsfuse-sidecar
Normal Killing 2m53s kubelet Stopping container tcpxo-daemon
Normal Killing 2m53s kubelet Stopping container sglang-leader
Warning Unhealthy 2m35s kubelet Readiness probe failed: dial tcp 10.84.3.5:30000: connect: connection refused