a3mega DeepSeek recipe: deployment fails for both vLLM and SGLang #3

@salander0411

I am following this recipe: https://github.com/AI-Hypercomputer/gpu-recipes/blob/main/inference/a3mega/deepseek-r1-671b/sglang-serving-gke/README.md

I have already enabled the Cloud Storage FUSE CSI driver. The node pool was created with:

gcloud container node-pools create a3-mega \
  --location=${ZONE} \
  --num-nodes=2 \
  --machine-type=a3-megagpu-8g \
  --accelerator=type=nvidia-h100-mega-80gb,count=8,gpu-driver-version=LATEST \
  --placement-type=COMPACT \
  --cluster=${CLUSTER_NAME} \
  --spot

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  6m45s                  default-scheduler  Successfully assigned default/tiangel-serving-deepseek-r1-model-0 to gke-tiangel-cluster-1-a3-mega-91214fa0-kz3w
  Normal   Pulling    6m45s                  kubelet            Pulling image "us-central1-artifactregistry.gcr.io/gke-release/gke-release/gcs-fuse-csi-driver-sidecar-mounter:v1.8.3-gke.2@sha256:07a5a7b18b083c47031c540e1664eb0c777a50e523dde030d8b0effdc9bb8761"
  Normal   Pulled     6m44s                  kubelet            Successfully pulled image "us-central1-artifactregistry.gcr.io/gke-release/gke-release/gcs-fuse-csi-driver-sidecar-mounter:v1.8.3-gke.2@sha256:07a5a7b18b083c47031c540e1664eb0c777a50e523dde030d8b0effdc9bb8761" in 604ms (604ms including waiting). Image size: 31687282 bytes.
  Normal   Created    6m44s                  kubelet            Created container: gke-gcsfuse-sidecar
  Normal   Started    6m44s                  kubelet            Started container gke-gcsfuse-sidecar
  Normal   Pulling    6m44s                  kubelet            Pulling image "us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-gpudirecttcpx-dev:v1.0.8-1"
  Normal   Pulled     5m29s                  kubelet            Successfully pulled image "us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-gpudirecttcpx-dev:v1.0.8-1" in 1m14.607s (1m14.607s including waiting). Image size: 7599167601 bytes.
  Normal   Created    5m29s                  kubelet            Created container: nccl-plugin-installer
  Normal   Started    5m29s                  kubelet            Started container nccl-plugin-installer
  Normal   Pulled     4m39s                  kubelet            Successfully pulled image "us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.14" in 46.581s (46.581s including waiting). Image size: 5243380426 bytes.
  Normal   Pulling    4m39s                  kubelet            Pulling image "us-central1-docker.pkg.dev/gpu-launchpad-playground/wwoo-docker/sglang:v0.4.3.post2-cu125-srt"
  Normal   Pulled     2m57s                  kubelet            Successfully pulled image "us-central1-docker.pkg.dev/gpu-launchpad-playground/wwoo-docker/sglang:v0.4.3.post2-cu125-srt" in 1m41.712s (1m41.712s including waiting). Image size: 11753203350 bytes.
  Normal   Created    2m57s                  kubelet            Created container: sglang-leader
  Normal   Started    2m57s                  kubelet            Started container sglang-leader
  Normal   Pulling    2m56s (x2 over 5m26s)  kubelet            Pulling image "us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.14"
  Normal   Created    2m56s (x2 over 4m39s)  kubelet            Created container: tcpxo-daemon
  Normal   Started    2m56s (x2 over 4m39s)  kubelet            Started container tcpxo-daemon
  Normal   Pulled     2m56s                  kubelet            Successfully pulled image "us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.14" in 682ms (682ms including waiting). Image size: 5243380426 bytes.
  Normal   Killing    2m53s                  kubelet            Stopping container gke-gcsfuse-sidecar
  Normal   Killing    2m53s                  kubelet            Stopping container tcpxo-daemon
  Normal   Killing    2m53s                  kubelet            Stopping container sglang-leader
  Warning  Unhealthy  2m35s                  kubelet            Readiness probe failed: dial tcp 10.84.3.5:30000: connect: connection refused
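
A possible next step for triage, offered as a sketch rather than anything from the recipe: the events above show `sglang-leader`, `tcpxo-daemon`, and `gke-gcsfuse-sidecar` being stopped a few seconds after the leader started, but they do not say why. The hypothetical helper below (`triage_cmds` is an illustrative name) only prints standard `kubectl` commands for review. It assumes the pod is still in the `default` namespace and that the container names match the ones in the events.

```shell
# Hypothetical helper: prints triage commands for a pod instead of running
# them, so they can be reviewed before execution.
triage_cmds() {
  pod="$1"
  # Full pod status, including each container's last state and exit code.
  echo "kubectl describe pod ${pod}"
  # Logs from the previous (stopped) instance of each container.
  # Container names are taken from the events above.
  for c in sglang-leader tcpxo-daemon gke-gcsfuse-sidecar; do
    echo "kubectl logs ${pod} -c ${c} --previous"
  done
}

# Pod name as it appears in the "Scheduled" event.
triage_cmds tiangel-serving-deepseek-r1-model-0
```

To actually run the printed commands, the output can be piped to `sh`. `--previous` only returns logs if the kubelet still has the terminated container available.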
