feat(llm-katan): Add Kubernetes deployment support #710
Conversation
- Add comprehensive Kustomize manifests (base + overlays for gpt35/claude)
- Implement initContainer for efficient model caching using PVC
- Fix config.py to read YLLM_SERVED_MODEL_NAME from environment variables
- Add deployment documentation with examples for Kind cluster / Minikube

This enables running multiple llm-katan instances in Kubernetes, each serving different model aliases while sharing the same underlying model. The overlays (gpt35, claude) demonstrate multi-instance deployments where each instance exposes a different served model name (e.g., gpt-3.5-turbo, claude-3-haiku-20240307) via the API. The served model name now works via environment variables, enabling Kubernetes deployments to expose different model names via the API.

Signed-off-by: Noa Limoy <nlimoy@nlimoy-thinkpadp1gen7.raanaii.csb>
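The environment-variable fix can be sketched as follows. The variable name `YLLM_SERVED_MODEL_NAME` comes from the PR; the helper function and default value here are illustrative, not the actual `config.py` code:

```python
import os

# Hypothetical helper mirroring the described fix: the served model name
# is taken from the YLLM_SERVED_MODEL_NAME environment variable when set,
# so each Kubernetes instance can override it independently.
def get_served_model_name(default: str = "qwen2.5-0.5b") -> str:
    return os.environ.get("YLLM_SERVED_MODEL_NAME", default)

# An overlay would inject a different value per instance, e.g.:
os.environ["YLLM_SERVED_MODEL_NAME"] = "gpt-3.5-turbo"
print(get_served_model_name())  # -> gpt-3.5-turbo
```

With this in place, the same container image serves `gpt-3.5-turbo` in one Deployment and `claude-3-haiku-20240307` in another, with only the env var differing.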
Hey @noalimoy, I'll try to catch you during the week to talk about this one.
Hi @Xunzhuo,
Could you share more details on what exactly you're expecting for these sections? I want to be sure I'm implementing the intended scope. Thanks!
@noalimoy ideally we should replace all the base-model.yaml with qwen0.6B https://github.com/vllm-project/semantic-router/blob/main/deploy/kubernetes/ai-gateway/aigw-resources/base-model.yaml
Summary
This PR adds comprehensive Kubernetes deployment support for llm-katan, enabling multi-instance deployments with model aliasing capabilities.
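As an illustration of the Kustomize layout this PR describes, one overlay's `kustomization.yaml` could look roughly like the sketch below. The file path, patch structure, and field values are assumptions for illustration, not the PR's actual manifests:

```yaml
# overlays/gpt35/kustomization.yaml (illustrative sketch, not the PR's actual file)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: llm-katan-system
resources:
  - ../../base
patches:
  - target:
      kind: Deployment
      name: llm-katan
    patch: |-
      - op: add
        path: /spec/template/spec/containers/0/env/-
        value:
          name: YLLM_SERVED_MODEL_NAME
          value: gpt-3.5-turbo
```

A second overlay (e.g. `overlays/claude`) would differ only in the name suffix and the env value, which is what keeps multi-instance deployments cheap to add.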
Kubernetes Manifests (Kustomize-based)
- Base manifests deployed into a dedicated namespace (llm-katan-system)

Multi-Instance Support (Overlays)
- gpt35 overlay serving the gpt-3.5-turbo alias
- claude overlay serving the claude-3-haiku-20240307 alias

Model Caching Optimization
- initContainer (model-downloader) pre-downloads models to a PVC
- Uses python:3.11-slim + hf download for a ~45MB lightweight init

Bug Fix (config.py)
- Adds YLLM_SERVED_MODEL_NAME environment variable support

Documentation
- Deployment guide (deploy/docs/README.md)

Test Results
Deployment Validation (Kind Cluster)
Resources Created:
API Validation:
```shell
# GPT35 instance
$ curl http://llm-katan-gpt35:8000/v1/models
{"data":[{"id":"gpt-3.5-turbo",...}]}

# Claude instance
$ curl http://llm-katan-claude:8000/v1/models
{"data":[{"id":"claude-3-haiku-20240307",...}]}
```
Motivation
This implementation addresses the need to run multiple llm-katan instances in Kubernetes, each exposing a different served model name while sharing the same underlying model.
The Kustomize structure (base + overlays) enables adding further instances by defining new overlays rather than duplicating the base manifests.
Related issue: #278