feat(llm-katan): Add Kubernetes deployment support #710
Conversation
- Add comprehensive Kustomize manifests (base + overlays for gpt35/claude)
- Implement initContainer for efficient model caching using PVC
- Fix config.py to read YLLM_SERVED_MODEL_NAME from environment variables
- Add deployment documentation with examples for Kind cluster / Minikube

This enables running multiple llm-katan instances in Kubernetes, each serving different model aliases while sharing the same underlying model. The overlays (gpt35, claude) demonstrate multi-instance deployments where each instance exposes a different served model name (e.g., gpt-3.5-turbo, claude-3-haiku-20240307) via the API. The served model name now works via environment variables, enabling Kubernetes deployments to expose different model names via the API.

Signed-off-by: Noa Limoy <nlimoy@nlimoy-thinkpadp1gen7.raanaii.csb>
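The environment-variable fix can be sketched as follows. The variable name `YLLM_SERVED_MODEL_NAME` comes from the PR; the helper function and default value here are illustrative, not the actual `config.py` code:

```python
import os

# Hypothetical helper mirroring the described fix: the served model name
# is taken from the YLLM_SERVED_MODEL_NAME environment variable when set,
# so each Kubernetes instance can override it independently.
def get_served_model_name(default: str = "qwen2.5-0.5b") -> str:
    return os.environ.get("YLLM_SERVED_MODEL_NAME", default)

# An overlay would inject a different value per instance, e.g.:
os.environ["YLLM_SERVED_MODEL_NAME"] = "gpt-3.5-turbo"
print(get_served_model_name())  # -> gpt-3.5-turbo
```

With this in place, the same container image serves `gpt-3.5-turbo` in one Deployment and `claude-3-haiku-20240307` in another, with only the env var differing.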
Hey @noalimoy, I'll try to catch you during the week to talk about this one.
Hi @Xunzhuo,
Could you share more details on what exactly you're expecting for these sections? I want to be sure I'm implementing the intended scope. Thanks!
@noalimoy ideally we should replace all the base-model.yaml with qwen0.6B https://github.com/vllm-project/semantic-router/blob/main/deploy/kubernetes/ai-gateway/aigw-resources/base-model.yaml
Summary
This PR adds comprehensive Kubernetes deployment support for llm-katan, enabling multi-instance deployments with model aliasing capabilities.
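As an illustration of the Kustomize layout this PR describes, one overlay's `kustomization.yaml` could look roughly like the sketch below. The file path, patch structure, and field values are assumptions for illustration, not the PR's actual manifests:

```yaml
# overlays/gpt35/kustomization.yaml (illustrative sketch, not the PR's actual file)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: llm-katan-system
resources:
  - ../../base
patches:
  - target:
      kind: Deployment
      name: llm-katan
    patch: |-
      - op: add
        path: /spec/template/spec/containers/0/env/-
        value:
          name: YLLM_SERVED_MODEL_NAME
          value: gpt-3.5-turbo
```

A second overlay (e.g. `overlays/claude`) would differ only in the name suffix and the env value, which is what keeps multi-instance deployments cheap to add.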
Kubernetes Manifests (Kustomize-based)
- Base manifests deployed into a dedicated namespace (llm-katan-system)

Multi-Instance Support (Overlays)
- gpt35 overlay serving the gpt-3.5-turbo alias
- claude overlay serving the claude-3-haiku-20240307 alias

Model Caching Optimization
- initContainer (model-downloader) pre-downloads models to a PVC
- Uses python:3.11-slim + hf download for a ~45MB lightweight init

Bug Fix (config.py)
- Adds YLLM_SERVED_MODEL_NAME environment variable support

Documentation
- Deployment guide (deploy/docs/README.md)

Test Results
Deployment Validation (Kind Cluster)
Resources Created:
API Validation:
```shell
# GPT35 instance
$ curl http://llm-katan-gpt35:8000/v1/models
{"data":[{"id":"gpt-3.5-turbo",...}]}

# Claude instance
$ curl http://llm-katan-claude:8000/v1/models
{"data":[{"id":"claude-3-haiku-20240307",...}]}
```
Motivation
This implementation addresses the need to run multiple llm-katan instances in Kubernetes, each exposing a different served model name while sharing the same underlying model.
The Kustomize structure (base + overlays) enables adding further instances by defining new overlays rather than duplicating the base manifests.
Related issue: #278