diff --git a/e2e-tests/llm-katan/README.md b/e2e-tests/llm-katan/README.md
index 8d6d9658d..5c25f7e78 100644
--- a/e2e-tests/llm-katan/README.md
+++ b/e2e-tests/llm-katan/README.md
@@ -38,6 +38,25 @@ docker run -p 8000:8000 ghcr.io/vllm-project/semantic-router/llm-katan:latest \
 llm-katan --served-model-name "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
 ```
 
+#### Option 3: Kubernetes
+
+```bash
+# Quick start with make targets
+make kube-deploy-llm-katan-gpt35    # Deploy GPT-3.5 simulation
+make kube-deploy-llm-katan-claude   # Deploy Claude simulation
+make kube-deploy-llm-katan-multi    # Deploy both models
+
+# Or manually with kubectl
+kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35
+kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/claude
+
+# Port forward and test
+make kube-port-forward-llm-katan LLM_KATAN_OVERLAY=gpt35
+curl http://localhost:8000/health
+```
+
+**📚 For a comprehensive Kubernetes deployment guide, see [deploy/docs/README.md](deploy/docs/README.md)**
+
 ### Setup
 
 #### HuggingFace Token (Required)
diff --git a/e2e-tests/llm-katan/deploy/docs/README.md b/e2e-tests/llm-katan/deploy/docs/README.md
new file mode 100644
index 000000000..19797288a
--- /dev/null
+++ b/e2e-tests/llm-katan/deploy/docs/README.md
@@ -0,0 +1,941 @@
+# LLM Katan - Kubernetes Deployment
+
+Comprehensive Kubernetes support for deploying LLM Katan in cloud-native environments.
+
+## Overview
+
+This directory provides production-ready Kubernetes manifests using Kustomize for deploying LLM Katan - a lightweight LLM server designed for testing and development workflows.
+
+**Local Development:** This guide includes complete setup examples for both **kind** and **minikube** clusters, making it easy to run LLM Katan locally for development and testing.
+
+## Architecture
+
+### Pod Structure
+
+Each deployment consists of two containers:
+
+- **initContainer (model-downloader)**: Downloads models from HuggingFace to PVC
+  - Image: `python:3.11-slim` (~45MB)
+  - Checks if model exists before downloading
+  - Runs once before main container starts
+
+- **main container (llm-katan)**: Serves the LLM API
+  - Image: `ghcr.io/vllm-project/semantic-router/llm-katan:latest` (~1.35GB)
+  - Loads model from PVC cache
+  - Exposes OpenAI-compatible API on port 8000
+
+### Storage
+
+- **PersistentVolumeClaim**: 5Gi for model caching
+- **Mount Path**: `/cache/models/`
+- **Access Mode**: ReadWriteOnce (single Pod write)
+- Models persist across Pod restarts
+
+### Namespace
+
+All resources deploy to the `llm-katan-system` namespace. Each overlay creates isolated instances within this namespace:
+
+- **gpt35**: Simulates GPT-3.5-turbo
+- **claude**: Simulates Claude-3-Haiku
+
+### Resource Naming
+
+Kustomize applies `nameSuffix` to avoid conflicts:
+
+- Base: `llm-katan`
+- gpt35 overlay: `llm-katan-gpt35` (via `nameSuffix: -gpt35`)
+- claude overlay: `llm-katan-claude` (via `nameSuffix: -claude`)
+
+**How it works:**
+
+```yaml
+# overlays/gpt35/kustomization.yaml
+nameSuffix: -gpt35  # Automatically appends to all resource names
+```
+
+This creates unique resource names for each overlay without manual patches, allowing multiple instances to coexist in the same namespace.
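+
+To preview the suffixing without touching a cluster, you can render an overlay locally (an optional check; the names shown are what the overlay is expected to produce):
+
+```bash
+# Render the gpt35 overlay and list the resource names it would create
+kubectl kustomize e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35 | grep "name: llm-katan"
+# Expect entries such as:
+#   name: llm-katan-gpt35         (Deployment/Service)
+#   name: llm-katan-models-gpt35  (PVC)
+```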
+ +### Networking + +- **Service Type**: ClusterIP (internal only) +- **Port**: 8000 (HTTP) +- **Endpoints**: `/health`, `/v1/models`, `/v1/chat/completions`, `/metrics` + +### Health Checks + +- **Startup Probe**: 30s initial delay, 60 failures (15 min max startup) +- **Liveness Probe**: 15s delay, checks every 20s +- **Readiness Probe**: 5s delay, checks every 10s + +## Directory Structure + +``` +e2e-tests/llm-katan/deploy/ +├── docs/ # Documentation +│ └── README.md # This file - comprehensive deployment guide +│ +└── kubernetes/ # Kubernetes manifests + ├── base/ # Base Kubernetes manifests + │ ├── namespace.yaml # llm-katan-system namespace + │ ├── deployment.yaml # Main deployment with health checks + │ ├── service.yaml # ClusterIP service (port 8000) + │ ├── pvc.yaml # Model cache storage (5Gi) + │ └── kustomization.yaml # Base kustomization + │ + ├── components/ # Reusable Kustomize components + │ └── common/ # Common labels for all resources + │ └── kustomization.yaml # Shared label definitions + │ + └── overlays/ # Environment-specific configurations + ├── gpt35/ # GPT-3.5-turbo simulation + │ └── kustomization.yaml # Overlay with patches for gpt35 + │ + └── claude/ # Claude-3-Haiku simulation + └── kustomization.yaml # Overlay with patches for claude +``` + +## Prerequisites + +Before starting, ensure you have the following tools installed: + +- [Docker](https://docs.docker.com/get-docker/) - Container runtime +- **Local Kubernetes cluster** (choose one): + - [kind](https://kind.sigs.k8s.io/docs/user/quick-start/#installation) - Kubernetes in Docker (recommended for CI/CD) + - [minikube](https://minikube.sigs.k8s.io/docs/start/) - Local Kubernetes (recommended for development) +- [kubectl](https://kubernetes.io/docs/tasks/tools/) - Kubernetes CLI +- `kustomize` (built into kubectl 1.14+) + +## Local Cluster Setup + +This guide provides examples for both **kind** and **minikube** clusters. Choose the one that best fits your needs. + +### Option 1: kind (Kubernetes in Docker) + +**Installation:** + +```bash +# Install kind +curl -Lo ./kind https://kind.sigs.k8s.io/dl/latest/kind-linux-amd64 +chmod +x ./kind +sudo mv ./kind /usr/local/bin/kind + +# Verify installation +kind version +``` + +**Create Cluster:** + +```bash +# Create a basic cluster +kind create cluster --name llm-katan-test + +# Verify cluster is running +kubectl cluster-info --context kind-llm-katan-test +kind get clusters +``` + +**Load Docker Image (Required):** + +```bash +# Build the image first (if not already built) +docker build -t ghcr.io/vllm-project/semantic-router/llm-katan:latest -f Dockerfile . + +# Load image into kind cluster +kind load docker-image ghcr.io/vllm-project/semantic-router/llm-katan:latest --name llm-katan-test +``` + +### Option 2: minikube + +**Installation:** + +```bash +# Download minikube +cd /tmp && curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 + +# Install minikube +sudo install /tmp/minikube-linux-amd64 /usr/local/bin/minikube + +# Verify installation +minikube version +``` + +**Start Cluster:** + +```bash +# Start with recommended resources (16GB for running multiple instances) +minikube start --driver=docker --memory=16384 --cpus=4 + +# Verify cluster is running +minikube status +kubectl cluster-info +``` + +**Load Docker Image (Required):** + +```bash +# Build the image first (if not already built) +docker build -t ghcr.io/vllm-project/semantic-router/llm-katan:latest -f Dockerfile . 
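+
+# Optional sanity check: confirm the image exists locally before loading it
+docker image ls ghcr.io/vllm-project/semantic-router/llm-katan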
+ +# Load image into minikube +minikube image load ghcr.io/vllm-project/semantic-router/llm-katan:latest + +# Verify image is loaded +minikube image ls | grep llm-katan +``` + +### Switching Between Clusters + +If you have multiple clusters (kind, minikube, etc.), you need to select which one kubectl should use: + +```bash +# List all contexts +kubectl config get-contexts + +# Switch to kind +kubectl config use-context kind-llm-katan-test + +# Switch to minikube +kubectl config use-context minikube + +# Check current context +kubectl config current-context +``` + +The `*` symbol indicates the active context. All `kubectl` commands will target this cluster. + +### Configuration + +Environment variables are defined directly in `deployment.yaml`: + +| Variable | Default | Description | +|----------|---------|-------------| +| `YLLM_MODEL` | `Qwen/Qwen3-0.6B` | HuggingFace model to load | +| `YLLM_SERVED_MODEL_NAME` | (empty) | Model name for API (defaults to YLLM_MODEL) | +| `YLLM_BACKEND` | `transformers` | Backend: `transformers` or `vllm` | +| `YLLM_HOST` | `0.0.0.0` | Server bind address | +| `YLLM_PORT` | `8000` | Server port | + +### Resource Limits + +Default per instance: + +```yaml +resources: + requests: + cpu: "1" + memory: "3Gi" + limits: + cpu: "2" + memory: "6Gi" +``` + +**GPU Support:** + +LLM Katan is optimized for CPU workloads with tiny models. For GPU testing scenarios: + +```yaml +# Add to deployment.yaml resources section +limits: + nvidia.com/gpu: 1 +``` + +**Note:** For production GPU deployments with larger models, use the main Semantic Router instead of LLM Katan. + +### Storage + +- **PVC Size**: 5Gi (adjust in overlays if needed) +- **Access Mode**: ReadWriteOnce +- **Mount Path**: `/cache/models/` +- **Purpose**: Cache downloaded models between restarts + +## Complete Workflows + +### Quick Start (Using Make) + +Complete setup from scratch using make targets: + +```bash +# 1. Create kind cluster (if using kind) +make create-cluster KIND_CLUSTER_NAME=llm-katan-test + +# 2. Build and load Docker image +make docker-build-llm-katan +make kube-load-llm-katan-image KIND_CLUSTER_NAME=llm-katan-test + +# 3. Deploy both models +make kube-deploy-llm-katan-multi + +# 4. Check status +make kube-status-llm-katan + +# 5. Test deployment +make kube-test-llm-katan + +# 6. 
Access the service (in another terminal) +make kube-port-forward-llm-katan +# Then: curl http://localhost:8000/health +``` + +### Development Workflow + +For iterative development and testing: + +```bash +# Build and deploy +make docker-build-llm-katan +make kube-load-llm-katan-image +make kube-deploy-llm-katan-gpt35 + +# Make changes, rebuild, and redeploy +make docker-build-llm-katan +make kube-load-llm-katan-image +kubectl rollout restart deployment/llm-katan-gpt35 -n llm-katan-system + +# View logs during testing +make kube-logs-llm-katan +``` + +### Testing Multiple Models + +For testing routing between different LLM models: + +```bash +# Deploy both models +make kube-deploy-llm-katan-multi + +# Port-forward both (in separate terminals) +# Terminal 1: +make kube-port-forward-llm-katan LLM_KATAN_OVERLAY=gpt35 PORT=8000 + +# Terminal 2: +make kube-port-forward-llm-katan LLM_KATAN_OVERLAY=claude PORT=8001 + +# Test both endpoints +curl http://localhost:8000/v1/models # GPT-3.5 +curl http://localhost:8001/v1/models # Claude +``` + +## Deployment Options + +You have two main ways to deploy LLM Katan: + +### Option A: Using Make Targets (Recommended) + +**Best for:** Daily use, automation, simplified commands + +See the [Complete Workflows](#complete-workflows) section above for step-by-step guides. + +```bash +# Quick deployment +make kube-deploy-llm-katan-multi # Deploy both models +make kube-status-llm-katan # Check status +make kube-test-llm-katan # Verify deployment +``` + +### Option B: Using kubectl Directly + +**Best for:** Custom configurations, troubleshooting, learning Kubernetes + +**Deploy from repository root:** + +```bash +# Single model +kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35 + +# Both models +kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35 +kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/claude + +# Verify +kubectl get all -n llm-katan-system +``` + +## Make Targets + +All commands should be run from the repository root. 
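+
+The variables in the table below combine freely on a single invocation. For example (an illustrative pair of calls, not a required sequence):
+
+```bash
+# Deploy the claude overlay, then expose it on a non-default local port
+make kube-deploy-llm-katan LLM_KATAN_OVERLAY=claude
+make kube-port-forward-llm-katan LLM_KATAN_OVERLAY=claude PORT=8001
+```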
+ +### Configuration Variables + +The following environment variables can be used to customize the make targets: + +| Variable | Default | Description | +|----------|---------|-------------| +| `LLM_KATAN_OVERLAY` | `gpt35` | Overlay to deploy: `gpt35`, `claude`, or `all` (for undeploy) | +| `LLM_KATAN_NAMESPACE` | `llm-katan-system` | Kubernetes namespace for deployments | +| `LLM_KATAN_BASE_PATH` | `e2e-tests/llm-katan/deploy/kubernetes` | Base path to Kubernetes manifests | +| `PORT` | `8000` | Local port for port-forwarding | +| `KIND_CLUSTER_NAME` | `semantic-router-cluster` | Kind cluster name | + +### Deployment + +```bash +# Deploy single overlay +make kube-deploy-llm-katan # Deploy with default overlay (gpt35) +make kube-deploy-llm-katan LLM_KATAN_OVERLAY=claude # Deploy with custom overlay + +# Deploy specific overlays +make kube-deploy-llm-katan-gpt35 # Deploy GPT-3.5 simulation +make kube-deploy-llm-katan-claude # Deploy Claude simulation + +# Deploy multiple overlays +make kube-deploy-llm-katan-multi # Deploy both gpt35 and claude +``` + +### Status & Monitoring + +```bash +# Show deployment status +make kube-status-llm-katan # Show all llm-katan resources + +# View logs +make kube-logs-llm-katan # View logs (default: gpt35) +make kube-logs-llm-katan LLM_KATAN_OVERLAY=claude # View Claude logs +``` + +### Testing & Debugging + +```bash +# Test deployment +make kube-test-llm-katan # Test deployment (default: gpt35) +make kube-test-llm-katan LLM_KATAN_OVERLAY=claude # Test Claude deployment + +# Port forward for local access +make kube-port-forward-llm-katan # Port forward to localhost:8000 (gpt35) +make kube-port-forward-llm-katan LLM_KATAN_OVERLAY=claude # Port forward Claude +make kube-port-forward-llm-katan LLM_KATAN_OVERLAY=claude PORT=8001 # Custom port +``` + +### Image Management + +```bash +# Build and load Docker images +make docker-build-llm-katan # Build llm-katan Docker image +make kube-load-llm-katan-image # Load image into kind cluster +``` + +### Cleanup + +```bash +# Remove specific deployment +make kube-undeploy-llm-katan # Remove default overlay (gpt35) +make kube-undeploy-llm-katan LLM_KATAN_OVERLAY=gpt35 # Remove gpt35 deployment +make kube-undeploy-llm-katan LLM_KATAN_OVERLAY=claude # Remove claude deployment + +# Remove all deployments +make kube-undeploy-llm-katan LLM_KATAN_OVERLAY=all # Remove all llm-katan deployments +``` + +### Help + +```bash +# Show Kubernetes makefile help +make help-kube # Display all available Kubernetes targets +``` + +## Direct kubectl Commands + +### Deploy + +```bash +# Deploy using kustomize overlays +kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35 +kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/claude + +# Deploy both +kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35 && \ +kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/claude +``` + +### Status + +```bash +# Get all resources +kubectl get all -n llm-katan-system + +# Get pods +kubectl get pods -n llm-katan-system -o wide + +# Get services +kubectl get svc -n llm-katan-system + +# Get PVCs +kubectl get pvc -n llm-katan-system +``` + +### Logs + +```bash +# View logs +kubectl logs -n llm-katan-system -l app=llm-katan-gpt35 -f +kubectl logs -n llm-katan-system -l app=llm-katan-claude -f + +# View init container logs (model download) +kubectl logs -n llm-katan-system -l app=llm-katan-gpt35 -c model-downloader +``` + +### Port Forward + +```bash +# Forward to localhost +kubectl port-forward -n 
llm-katan-system svc/llm-katan-gpt35 8000:8000 +kubectl port-forward -n llm-katan-system svc/llm-katan-claude 8001:8000 +``` + +### Testing + +```bash +# Health check +curl http://localhost:8000/health + +# List models +curl http://localhost:8000/v1/models + +# Chat completion +curl -X POST http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "gpt-3.5-turbo", + "messages": [{"role": "user", "content": "Hello!"}] + }' + +# Metrics +curl http://localhost:8000/metrics +``` + +### Cleanup + +```bash +# Remove specific deployment +kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35 +kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/overlays/claude + +# Remove entire namespace +kubectl delete namespace llm-katan-system +``` + +## Testing & Verification + +### Health Check + +```bash +kubectl port-forward -n llm-katan-system svc/llm-katan 8000:8000 +curl http://localhost:8000/health + +# Expected response: +# {"status":"ok","model":"Qwen/Qwen3-0.6B","backend":"transformers"} +``` + +### Chat Completion + +```bash +curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen/Qwen3-0.6B", + "messages": [{"role": "user", "content": "Hello!"}] + }' +``` + +### Models Endpoint + +```bash +curl http://localhost:8000/v1/models +``` + +### Metrics (Prometheus) + +```bash +# Don't forget to port-forward first +kubectl port-forward -n llm-katan-system svc/llm-katan 8000:8000 + +# Get metrics +curl http://localhost:8000/metrics +``` + +## Best Practices + +1. **Memory Allocation**: Allocate minimum 8Gi RAM for single instance, 16Gi for multi-model deployments +2. **Model Caching**: Keep PVCs to avoid re-downloading models (first deploy: 5-15 min, cached: 1-3 min) +3. **Cluster Selection**: Use `kind` for CI/CD and automated testing, `minikube` for local development with dashboard +4. **Iterative Testing**: Use `kubectl rollout restart` instead of redeploy for faster iterations (1-3 min vs 5-15 min) +5. **Tool Choice**: Use Make targets for simplified workflows, kubectl for fine-grained control and troubleshooting +6. **Debugging**: Watch pods with `-w` flag, check init container logs for download issues, use `describe pod` for events +7. **Production**: LLM Katan is for testing only - for production use `/deploy/helm/`, `/deploy/kubernetes/`, `/deploy/kserve/`, or `/deploy/openshift/` +8. **Security**: Deployments use non-root containers and enforce resource limits for secure operation + +## Advanced Integration + +### Service Mesh Compatibility + +LLM Katan deployments work with service mesh solutions like Istio and Linkerd: + +**Automatic Features:** + +- mTLS encryption between pods +- Traffic metrics and observability +- Automatic retries and circuit breakers +- Advanced load balancing + +**Enable sidecar injection:** + +```bash +# Label namespace for automatic injection +kubectl label namespace llm-katan-system istio-injection=enabled + +# Redeploy to inject sidecars +kubectl rollout restart deployment -n llm-katan-system +``` + +**Note:** For production Semantic Router with service mesh, see `/deploy/kubernetes/istio/` + +### Testing Semantic Router with LLM Katan + +LLM Katan simulates LLM APIs (GPT, Claude) locally, enabling you to test Semantic Router **without API costs**. + +**Use Case:** Test intelligent routing logic before deploying to production with real LLM APIs. 
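+
+Both instances run the same tiny Qwen model; only the advertised name differs (set via `YLLM_SERVED_MODEL_NAME` in each overlay), so a router sees what look like two distinct providers. A quick way to confirm this, assuming the port-forwards shown earlier are active:
+
+```bash
+curl -s http://localhost:8000/v1/models | grep '"id"'   # should report "gpt-3.5-turbo"
+curl -s http://localhost:8001/v1/models | grep '"id"'   # should report "claude-3-haiku-20240307"
+```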
+
+#### Step 1: Deploy LLM Katan
+
+```bash
+# Deploy both GPT-3.5 and Claude simulators
+make kube-deploy-llm-katan-multi
+
+# Verify services are running
+kubectl get svc -n llm-katan-system
+# NAME               TYPE        CLUSTER-IP      PORT(S)
+# llm-katan-gpt35    ClusterIP   10.96.186.147   8000/TCP
+# llm-katan-claude   ClusterIP   10.96.119.98    8000/TCP
+```
+
+#### Step 2: Configure Semantic Router
+
+Update `config/config.yaml` to point to LLM Katan endpoints:
+
+```yaml
+# config/config.yaml
+
+vllm_endpoints:
+  - name: "gpt35-katan"
+    address: "llm-katan-gpt35.llm-katan-system"  # Kubernetes DNS
+    port: 8000
+    weight: 1
+
+  - name: "claude-katan"
+    address: "llm-katan-claude.llm-katan-system"
+    port: 8000
+    weight: 1
+
+model_config:
+  "gpt-3.5-turbo":
+    preferred_endpoints: ["gpt35-katan"]
+
+  "claude-3-haiku-20240307":
+    preferred_endpoints: ["claude-katan"]
+
+categories:
+  - name: coding
+    utterances:
+      - "write code"
+      - "debug"
+    model_scores:
+      "gpt-3.5-turbo": 0.9
+```
+
+#### Step 3: Deploy and Test
+
+```bash
+# Deploy Semantic Router (using Helm)
+helm install semantic-router deploy/helm/semantic-router \
+  -f config/config.yaml
+
+# Or run locally
+make run-router
+
+# Test routing (with the Semantic Router port-forwarded to 8080)
+curl -X POST http://localhost:8080/api/v1/route \
+  -H "Content-Type: application/json" \
+  -d '{
+    "text": "Write a Python function to sort a list",
+    "stream": false
+  }'
+```
+
+### Deployment Verification
+
+Use the automated verification script:
+
+```bash
+# Run comprehensive deployment checks (default: llm-katan-system namespace)
+./e2e-tests/llm-katan/deploy/kubernetes/verify-deployment.sh
+
+# Or specify namespace and service name
+./e2e-tests/llm-katan/deploy/kubernetes/verify-deployment.sh llm-katan-system llm-katan-gpt35
+./e2e-tests/llm-katan/deploy/kubernetes/verify-deployment.sh llm-katan-system llm-katan-claude
+```
+
+## Troubleshooting
+
+### Common Issues
+
+**Common pod errors:**
+
+- **OOMKilled (Exit Code 137)**: Pod exceeded memory limit during model loading
+  - Solution for minikube: Restart with more RAM: `minikube delete && minikube start --memory=16384 --cpus=4`
+  - Solution for manifests: Increase memory in `deployment.yaml` (current: 6Gi)
+- **ImagePullBackOff**: Image not available in cluster
+  - For kind: `kind load docker-image ghcr.io/vllm-project/semantic-router/llm-katan:latest --name llm-katan-test`
+  - For minikube: `minikube image load ghcr.io/vllm-project/semantic-router/llm-katan:latest`
+- **Init:CrashLoopBackOff**: Model download failed
+  - Check initContainer logs: `kubectl logs -n llm-katan-system -l app.kubernetes.io/name=llm-katan -c model-downloader`
+
+**Pod not starting:**
+
+```bash
+# Check pod status
+kubectl get pods -n llm-katan-system
+
+# Describe pod for events
+kubectl describe pod -n llm-katan-system -l app.kubernetes.io/name=llm-katan
+
+# Check initContainer logs (model download)
+kubectl logs -n llm-katan-system -l app.kubernetes.io/name=llm-katan -c model-downloader
+
+# Check main container logs
+kubectl logs -n llm-katan-system -l app.kubernetes.io/name=llm-katan -c llm-katan -f
+```
+
+**LLM Katan not responding:**
+
+```bash
+# Check deployment status
+kubectl get deployment -n llm-katan-system
+
+# Check service
+kubectl get svc -n llm-katan-system
+
+# Check if port-forward is active
+ps aux | grep "port-forward" | grep llm-katan
+
+# Test health endpoint
+kubectl port-forward -n llm-katan-system svc/llm-katan-gpt35 8000:8000 &
+curl http://localhost:8000/health
+```
+
+**PVC issues:**
+
+```bash
+# Check PVC status
+kubectl get pvc -n 
llm-katan-system + +# Check PVC details +kubectl describe pvc -n llm-katan-system + +# Check volume contents (if pod is running) +kubectl exec -n llm-katan-system -- ls -lah /cache/models/ +``` + +## Cleanup + +**Remove Specific Overlay:** + +```bash +# Remove gpt35 instance +kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35/ + +# Remove claude instance +kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/overlays/claude/ +``` + +**Remove All llm-katan Resources:** + +```bash +# Delete entire namespace (removes everything) +kubectl delete namespace llm-katan-system + +# Or delete base deployment +kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/base/ +``` + +**Cleanup Local Cluster:** + +```bash +# For kind +kind delete cluster --name llm-katan-test +# Or if using default cluster name +kind delete cluster + +# For minikube +minikube stop # Stop the cluster (preserves state) +minikube delete # Delete the cluster entirely +``` + +## CI/CD Integration + +### GitHub Actions Example + +Complete workflow with e2e tests: + +```yaml +name: LLM Katan E2E Tests + +on: + pull_request: + branches: [main] + push: + branches: [main] + +jobs: + test-deployment: + runs-on: ubuntu-latest + timeout-minutes: 30 + + steps: + - name: Checkout code + uses: actions/checkout@v3 + + - name: Set up Python + uses: actions/setup-python@v4 + with: + python-version: '3.11' + + - name: Install test dependencies + run: pip install pytest requests + + - name: Create kind cluster + run: make create-cluster KIND_CLUSTER_NAME=ci-test + + - name: Build and load Docker image + run: | + make docker-build-llm-katan + make kube-load-llm-katan-image KIND_CLUSTER_NAME=ci-test + + - name: Deploy LLM Katan (both models) + run: make kube-deploy-llm-katan-multi + + - name: Wait for deployments + run: | + make kube-test-llm-katan LLM_KATAN_OVERLAY=gpt35 + make kube-test-llm-katan LLM_KATAN_OVERLAY=claude + + - name: Run integration tests + run: | + # Port-forward in background + kubectl port-forward -n llm-katan-system svc/llm-katan-gpt35 8000:8000 & + kubectl port-forward -n llm-katan-system svc/llm-katan-claude 8001:8000 & + sleep 5 + + # Run e2e tests (if available) + # pytest e2e-tests/ -v + + # Or simple health check + curl -f http://localhost:8000/health + curl -f http://localhost:8001/health + + - name: Show logs on failure + if: failure() + run: | + kubectl get all -n llm-katan-system + kubectl logs -n llm-katan-system -l app=llm-katan-gpt35 --tail=100 + kubectl logs -n llm-katan-system -l app=llm-katan-claude --tail=100 + + - name: Cleanup + if: always() + run: | + make kube-undeploy-llm-katan LLM_KATAN_OVERLAY=all + make delete-cluster KIND_CLUSTER_NAME=ci-test +``` + +### GitLab CI Example + +```yaml +test-llm-katan: + stage: test + script: + - make create-cluster + - make docker-build-llm-katan + - make kube-load-llm-katan-image + - make kube-deploy-llm-katan-multi + - make kube-test-llm-katan + after_script: + - make delete-cluster + +``` + +## Quick Reference + +### Essential Make Commands (Recommended) + +**From repository root:** + +```bash +# Deployment +make kube-deploy-llm-katan-multi # Deploy both models +make kube-deploy-llm-katan-gpt35 # Deploy GPT-3.5 only +make kube-deploy-llm-katan-claude # Deploy Claude only + +# Status & Logs +make kube-status-llm-katan # Show all resources +make kube-logs-llm-katan # View logs (gpt35) +make kube-logs-llm-katan LLM_KATAN_OVERLAY=claude + +# Testing +make kube-test-llm-katan # Test gpt35 +make kube-port-forward-llm-katan # Access at 
localhost:8000 + +# Cleanup +make kube-undeploy-llm-katan LLM_KATAN_OVERLAY=gpt35 +make kube-undeploy-llm-katan LLM_KATAN_OVERLAY=all +``` + +### Direct kubectl Commands (For Advanced Use) + +**When you need more control:** + +```bash +# Deploy +kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35 +kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/claude + +# Status +kubectl get all,pvc -n llm-katan-system +kubectl get pods -n llm-katan-system -o wide +kubectl describe pod -n llm-katan-system -l app=llm-katan-gpt35 + +# Logs +kubectl logs -n llm-katan-system -l app=llm-katan-gpt35 -f +kubectl logs -n llm-katan-system -c model-downloader # Init container + +# Port-forward +kubectl port-forward -n llm-katan-system svc/llm-katan-gpt35 8000:8000 +kubectl port-forward -n llm-katan-system svc/llm-katan-claude 8001:8000 + +# Testing +kubectl exec -n llm-katan-system deployment/llm-katan-gpt35 -- curl localhost:8000/health + +# Cleanup +kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35 +kubectl delete namespace llm-katan-system +``` + +### Resource Specifications + +| Component | Value | +|-----------|-------| +| **Namespace** | `llm-katan-system` | +| **Service Port** | `8000` | +| **PVC Size** | `5Gi` | +| **CPU Request** | `1 core` | +| **CPU Limit** | `2 cores` | +| **Memory Request** | `3Gi` | +| **Memory Limit** | `6Gi` | +| **Startup Timeout** | `15 minutes` | + +### API Endpoints + +| Endpoint | Description | +|----------|-------------| +| `/health` | Health check | +| `/v1/models` | List available models | +| `/v1/chat/completions` | Chat completion (OpenAI compatible) | +| `/metrics` | Prometheus metrics | diff --git a/e2e-tests/llm-katan/deploy/kubernetes/base/deployment.yaml b/e2e-tests/llm-katan/deploy/kubernetes/base/deployment.yaml new file mode 100644 index 000000000..1931164b1 --- /dev/null +++ b/e2e-tests/llm-katan/deploy/kubernetes/base/deployment.yaml @@ -0,0 +1,144 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: llm-katan +spec: + selector: + matchLabels: {} + replicas: 1 + template: + metadata: + labels: {} + spec: + # Create a non-root user for security (matching Dockerfile) + securityContext: + fsGroup: 1000 + runAsUser: 1000 + runAsNonRoot: true + + initContainers: + # Pre-download model to cache for faster startup + # Uses lightweight python:3.11-slim image and checks if model exists before downloading + - name: model-downloader + image: python:3.11-slim + imagePullPolicy: IfNotPresent + securityContext: + runAsUser: 0 # Run as root to install packages + runAsNonRoot: false + allowPrivilegeEscalation: false + command: ["/bin/bash", "-c"] + args: + - | + set -e + + MODEL_ID="${YLLM_MODEL:-Qwen/Qwen3-0.6B}" + MODEL_DIR=$(basename "$MODEL_ID") + + mkdir -p /cache/models + cd /cache/models + + # Check if model already exists in PVC + if [ -d "$MODEL_DIR" ]; then + echo "Model $MODEL_ID already cached. Skipping download." + exit 0 + fi + + # Model not found, proceed with download + echo "Downloading model $MODEL_ID..." 
+ pip install --no-cache-dir huggingface_hub[cli] + hf download "$MODEL_ID" --local-dir "$MODEL_DIR" + env: + - name: YLLM_MODEL + value: "Qwen/Qwen3-0.6B" + - name: HF_HUB_CACHE + value: "/tmp/hf_cache" + volumeMounts: + - name: models-volume + mountPath: /cache/models + resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "500m" + + containers: + - name: llm-katan + image: ghcr.io/vllm-project/semantic-router/llm-katan:latest + imagePullPolicy: IfNotPresent + + # Command is set via environment variables + # Default: llm-katan --model Qwen/Qwen3-0.6B --host 0.0.0.0 --port 8000 + + ports: + - name: http + containerPort: 8000 + protocol: TCP + + env: + # These can be overridden via ConfigMap in overlays + - name: YLLM_MODEL + value: "/cache/models/Qwen3-0.6B" # Local path to downloaded model + - name: YLLM_PORT + value: "8000" + - name: YLLM_HOST + value: "0.0.0.0" + - name: YLLM_BACKEND + value: "transformers" + - name: PYTHONUNBUFFERED + value: "1" + - name: PYTHONDONTWRITEBYTECODE + value: "1" + + volumeMounts: + - name: models-volume + mountPath: /cache/models + + livenessProbe: + httpGet: + path: /health + port: http + initialDelaySeconds: 15 + periodSeconds: 20 + timeoutSeconds: 5 + failureThreshold: 3 + + readinessProbe: + httpGet: + path: /health + port: http + initialDelaySeconds: 5 + periodSeconds: 10 + timeoutSeconds: 3 + failureThreshold: 3 + + startupProbe: + httpGet: + path: /health + port: http + initialDelaySeconds: 30 + periodSeconds: 15 + timeoutSeconds: 5 + failureThreshold: 60 # 15 minutes max startup time (for slow model downloads) + + resources: + requests: + memory: "3Gi" + cpu: "1" + limits: + memory: "6Gi" + cpu: "2" + + securityContext: + allowPrivilegeEscalation: false + readOnlyRootFilesystem: false # HuggingFace needs to write to cache + runAsNonRoot: true + capabilities: + drop: + - ALL + + volumes: + - name: models-volume + persistentVolumeClaim: + claimName: llm-katan-models diff --git a/e2e-tests/llm-katan/deploy/kubernetes/base/kustomization.yaml b/e2e-tests/llm-katan/deploy/kubernetes/base/kustomization.yaml new file mode 100644 index 000000000..53b95679c --- /dev/null +++ b/e2e-tests/llm-katan/deploy/kubernetes/base/kustomization.yaml @@ -0,0 +1,21 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +metadata: + name: llm-katan-base + +namespace: llm-katan-system + + +resources: + - namespace.yaml + - pvc.yaml + - deployment.yaml + - service.yaml + +# Images (can be overridden in overlays) +images: + - name: llm-katan + newName: ghcr.io/vllm-project/semantic-router/llm-katan + newTag: latest + diff --git a/e2e-tests/llm-katan/deploy/kubernetes/base/namespace.yaml b/e2e-tests/llm-katan/deploy/kubernetes/base/namespace.yaml new file mode 100644 index 000000000..f53e19f9a --- /dev/null +++ b/e2e-tests/llm-katan/deploy/kubernetes/base/namespace.yaml @@ -0,0 +1,4 @@ +apiVersion: v1 +kind: Namespace +metadata: + name: llm-katan-system diff --git a/e2e-tests/llm-katan/deploy/kubernetes/base/pvc.yaml b/e2e-tests/llm-katan/deploy/kubernetes/base/pvc.yaml new file mode 100644 index 000000000..ed12f2a5f --- /dev/null +++ b/e2e-tests/llm-katan/deploy/kubernetes/base/pvc.yaml @@ -0,0 +1,10 @@ +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: llm-katan-models +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 5Gi # Increased for model cache (~600MB model + overhead) diff --git a/e2e-tests/llm-katan/deploy/kubernetes/base/service.yaml 
b/e2e-tests/llm-katan/deploy/kubernetes/base/service.yaml new file mode 100644 index 000000000..a8cd3bfee --- /dev/null +++ b/e2e-tests/llm-katan/deploy/kubernetes/base/service.yaml @@ -0,0 +1,12 @@ +apiVersion: v1 +kind: Service +metadata: + name: llm-katan +spec: + type: ClusterIP + ports: + - name: http + port: 8000 + targetPort: http + protocol: TCP + diff --git a/e2e-tests/llm-katan/deploy/kubernetes/components/common/kustomization.yaml b/e2e-tests/llm-katan/deploy/kubernetes/components/common/kustomization.yaml new file mode 100644 index 000000000..5312fe4af --- /dev/null +++ b/e2e-tests/llm-katan/deploy/kubernetes/components/common/kustomization.yaml @@ -0,0 +1,10 @@ +apiVersion: kustomize.config.k8s.io/v1alpha1 +kind: Component + +# Common labels applied to all resources that use this component +labels: +- includeSelectors: true + pairs: + app.kubernetes.io/name: llm-katan + app.kubernetes.io/part-of: semantic-router-workspaces + app.kubernetes.io/managed-by: kustomize diff --git a/e2e-tests/llm-katan/deploy/kubernetes/overlays/claude/kustomization.yaml b/e2e-tests/llm-katan/deploy/kubernetes/overlays/claude/kustomization.yaml new file mode 100644 index 000000000..c9367b969 --- /dev/null +++ b/e2e-tests/llm-katan/deploy/kubernetes/overlays/claude/kustomization.yaml @@ -0,0 +1,42 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +metadata: + name: llm-katan-claude + +resources: + - ../../base + +components: + - ../../components/common + +nameSuffix: -claude + +patches: + - target: + kind: Deployment + name: llm-katan + patch: |- + - op: add + path: /spec/template/spec/containers/0/env/- + value: + name: YLLM_SERVED_MODEL_NAME + value: "claude-3-haiku-20240307" + - op: add + path: /spec/template/metadata/labels/model-alias + value: "claude-3-haiku" + - target: + kind: Service + name: llm-katan + patch: |- + - op: add + path: /metadata/labels/model-alias + value: "claude-3-haiku" + # Update PVC reference in deployment to match suffixed PVC name + - target: + kind: Deployment + name: llm-katan + patch: |- + - op: replace + path: /spec/template/spec/volumes/0/persistentVolumeClaim/claimName + value: llm-katan-models-claude diff --git a/e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35/kustomization.yaml b/e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35/kustomization.yaml new file mode 100644 index 000000000..3f714d60b --- /dev/null +++ b/e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35/kustomization.yaml @@ -0,0 +1,41 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +resources: + - ../../base + +components: + - ../../components/common + +nameSuffix: -gpt35 + +patches: + - target: + kind: Deployment + name: llm-katan + patch: |- + - op: add + path: /spec/template/spec/containers/0/env/- + value: + name: YLLM_SERVED_MODEL_NAME + value: "gpt-3.5-turbo" + - op: add + path: /spec/template/metadata/labels/model-alias + value: "gpt-3.5-turbo" + + - target: + kind: Service + name: llm-katan + patch: |- + - op: add + path: /metadata/labels/model-alias + value: "gpt-3.5-turbo" + + # Update PVC reference in deployment to match suffixed PVC name + - target: + kind: Deployment + name: llm-katan + patch: |- + - op: replace + path: /spec/template/spec/volumes/0/persistentVolumeClaim/claimName + value: llm-katan-models-gpt35 diff --git a/e2e-tests/llm-katan/deploy/kubernetes/verify-deployment.sh b/e2e-tests/llm-katan/deploy/kubernetes/verify-deployment.sh new file mode 100644 index 000000000..bdca079ca --- /dev/null +++ 
b/e2e-tests/llm-katan/deploy/kubernetes/verify-deployment.sh
@@ -0,0 +1,252 @@
+#!/bin/bash
+# Verification script for LLM Katan Kubernetes deployment
+# Usage: ./verify-deployment.sh [namespace] [service-name]
+
+set -e
+
+# Colors for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+BLUE='\033[0;34m'
+NC='\033[0m' # No Color
+
+# Default values
+NAMESPACE="${1:-llm-katan-system}"
+SERVICE="${2:-llm-katan}"
+PORT=8000
+
+# Functions
+log_info() {
+    echo -e "${BLUE}[INFO]${NC} $1"
+}
+
+log_success() {
+    echo -e "${GREEN}[SUCCESS]${NC} $1"
+}
+
+log_warning() {
+    echo -e "${YELLOW}[WARNING]${NC} $1"
+}
+
+log_error() {
+    echo -e "${RED}[ERROR]${NC} $1"
+}
+
+# Track overall status
+FAILED=0
+
+echo "=================================================="
+echo "LLM Katan Deployment Verification"
+echo "=================================================="
+log_info "Namespace: $NAMESPACE"
+log_info "Service: $SERVICE"
+echo ""
+
+# Check 1: Namespace exists
+log_info "Checking namespace..."
+if kubectl get namespace "$NAMESPACE" &> /dev/null; then
+    log_success "Namespace $NAMESPACE exists"
+else
+    log_error "Namespace $NAMESPACE not found"
+    FAILED=1
+fi
+echo ""
+
+# Check 2: Deployment exists
+log_info "Checking deployments..."
+if kubectl get deployment -n "$NAMESPACE" &> /dev/null; then
+    DEPLOYMENT_COUNT=$(kubectl get deployment -n "$NAMESPACE" -o name | wc -l)
+    log_success "Found $DEPLOYMENT_COUNT deployment(s)"
+    kubectl get deployment -n "$NAMESPACE"
+else
+    log_error "No deployments found"
+    FAILED=1
+fi
+echo ""
+
+# Check 3: Pods are running
+log_info "Checking pods..."
+POD_STATUS=$(kubectl get pods -n "$NAMESPACE" -o jsonpath='{.items[*].status.phase}' 2>/dev/null || echo "")
+if [ -z "$POD_STATUS" ]; then
+    log_error "No pods found"
+    FAILED=1
+else
+    RUNNING_PODS=$(echo "$POD_STATUS" | tr ' ' '\n' | grep -c "Running" || true)
+    TOTAL_PODS=$(echo "$POD_STATUS" | wc -w)
+
+    if [ "$RUNNING_PODS" -eq "$TOTAL_PODS" ] && [ "$RUNNING_PODS" -gt 0 ]; then
+        log_success "All $RUNNING_PODS/$TOTAL_PODS pods are running"
+        kubectl get pods -n "$NAMESPACE"
+    else
+        log_error "Only $RUNNING_PODS/$TOTAL_PODS pods are running"
+        kubectl get pods -n "$NAMESPACE"
+        FAILED=1
+    fi
+fi
+echo ""
+
+# Check 4: Services exist
+log_info "Checking services..."
+if kubectl get svc -n "$NAMESPACE" -o name | grep -q "$SERVICE"; then
+    log_success "Service $SERVICE exists"
+    kubectl get svc -n "$NAMESPACE" | grep "$SERVICE" || true
+else
+    log_error "Service $SERVICE not found"
+    FAILED=1
+fi
+echo ""
+
+# Check 5: PVC bound
+log_info "Checking PersistentVolumeClaims..."
+PVC_COUNT=$(kubectl get pvc -n "$NAMESPACE" -o name 2>/dev/null | wc -l)
+if [ "$PVC_COUNT" -gt 0 ]; then
+    BOUND_PVCS=$(kubectl get pvc -n "$NAMESPACE" -o jsonpath='{.items[*].status.phase}' 2>/dev/null | tr ' ' '\n' | grep -c "Bound" || true)
+    if [ "$BOUND_PVCS" -eq "$PVC_COUNT" ]; then
+        log_success "All $PVC_COUNT PVC(s) are bound"
+        kubectl get pvc -n "$NAMESPACE"
+    else
+        log_error "Only $BOUND_PVCS/$PVC_COUNT PVC(s) are bound"
+        kubectl get pvc -n "$NAMESPACE"
+        FAILED=1
+    fi
+else
+    log_warning "No PVCs found (optional)"
+fi
+echo ""
+
+# Check 6: ConfigMaps exist
+log_info "Checking ConfigMaps..."
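+# ConfigMaps are optional here: llm-katan reads its settings from env vars set
+# directly in deployment.yaml, so their absence is reported as a warning only.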
+CM_COUNT=$(kubectl get configmap -n "$NAMESPACE" -o name 2>/dev/null | wc -l)
+if [ "$CM_COUNT" -gt 0 ]; then
+    log_success "Found $CM_COUNT ConfigMap(s)"
+    kubectl get configmap -n "$NAMESPACE" -o name
+else
+    log_warning "No ConfigMaps found (may use default config)"
+fi
+echo ""
+
+# Check 7: Pod readiness
+log_info "Checking pod readiness..."
+READY_PODS=$(kubectl get pods -n "$NAMESPACE" -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}' 2>/dev/null | tr ' ' '\n' | grep -c "True" || true)
+TOTAL_PODS=$(kubectl get pods -n "$NAMESPACE" -o name 2>/dev/null | wc -l)
+if [ "$READY_PODS" -eq "$TOTAL_PODS" ] && [ "$READY_PODS" -gt 0 ]; then
+    log_success "All $READY_PODS/$TOTAL_PODS pods are ready"
+else
+    log_error "Only $READY_PODS/$TOTAL_PODS pods are ready"
+    FAILED=1
+fi
+echo ""
+
+# Check 8: Recent pod restarts
+log_info "Checking for pod restarts..."
+MAX_RESTARTS=$(kubectl get pods -n "$NAMESPACE" -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}' 2>/dev/null | tr ' ' '\n' | sort -rn | head -1)
+MAX_RESTARTS=${MAX_RESTARTS:-0}
+if [ "$MAX_RESTARTS" -eq 0 ]; then
+    log_success "No pod restarts detected"
+elif [ "$MAX_RESTARTS" -lt 3 ]; then
+    log_warning "Some pods have restarted (max: $MAX_RESTARTS times)"
+else
+    log_error "High restart count detected (max: $MAX_RESTARTS times)"
+    FAILED=1
+fi
+echo ""
+
+# Check 9: Endpoint connectivity (requires port-forward)
+log_info "Testing endpoint connectivity..."
+log_info "Setting up port-forward..."
+
+# Start port-forward in background
+kubectl port-forward -n "$NAMESPACE" "svc/$SERVICE" "$PORT:$PORT" &> /dev/null &
+PF_PID=$!
+sleep 3
+
+# Test health endpoint
+if curl -f -s -m 5 "http://localhost:$PORT/health" &> /dev/null; then
+    log_success "Health endpoint responding"
+
+    # Try to get actual response
+    HEALTH_RESPONSE=$(curl -s -m 5 "http://localhost:$PORT/health" 2>/dev/null || echo "{}")
+    log_info "Response: $HEALTH_RESPONSE"
+else
+    log_error "Health endpoint not responding"
+    FAILED=1
+fi
+
+# Test models endpoint
+log_info "Testing /v1/models endpoint..."
+if curl -f -s -m 5 "http://localhost:$PORT/v1/models" &> /dev/null; then
+    log_success "Models endpoint responding"
+    MODELS=$(curl -s -m 5 "http://localhost:$PORT/v1/models" 2>/dev/null | grep -o '"id":"[^"]*"' || echo "")
+    if [ -n "$MODELS" ]; then
+        log_info "Models: $MODELS"
+    fi
+else
+    log_error "Models endpoint not responding"
+    FAILED=1
+fi
+
+# Test metrics endpoint
+log_info "Testing /metrics endpoint..."
+if curl -f -s -m 5 "http://localhost:$PORT/metrics" &> /dev/null; then
+    log_success "Metrics endpoint responding"
+    METRICS_LINES=$(curl -s -m 5 "http://localhost:$PORT/metrics" 2>/dev/null | wc -l)
+    log_info "Metrics: $METRICS_LINES lines"
+else
+    log_warning "Metrics endpoint not responding (may not be enabled)"
+fi
+
+# Cleanup port-forward
+kill $PF_PID 2>/dev/null || true
+wait $PF_PID 2>/dev/null || true
+echo ""
+
+# Check 10: Resource usage (if metrics-server available)
+log_info "Checking resource usage..."
+if kubectl top pod -n "$NAMESPACE" &> /dev/null; then
+    log_success "Resource metrics available"
+    kubectl top pod -n "$NAMESPACE"
+else
+    log_warning "metrics-server not available (optional)"
+fi
+echo ""
+
+# Check 11: Recent logs for errors
+log_info "Checking recent logs for errors..."
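+# grep -ic prints a count (0 on no match) but exits non-zero when nothing matches;
+# the '|| true' guard below keeps 'set -e' from aborting the script on a clean log.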
+ERROR_COUNT=$(kubectl logs -n "$NAMESPACE" -l app.kubernetes.io/name=llm-katan --tail=100 2>/dev/null | grep -ic "error\|exception\|failed" || true)
+if [ "$ERROR_COUNT" -eq 0 ]; then
+    log_success "No errors in recent logs"
+else
+    log_warning "Found $ERROR_COUNT error messages in recent logs"
+    log_info "Recent errors:"
+    kubectl logs -n "$NAMESPACE" -l app.kubernetes.io/name=llm-katan --tail=100 2>/dev/null | grep -i "error\|exception\|failed" | tail -5 || true
+fi
+echo ""
+
+# Final summary
+echo "=================================================="
+echo "Verification Summary"
+echo "=================================================="
+
+if [ $FAILED -eq 0 ]; then
+    log_success "All critical checks passed!"
+    echo ""
+    log_info "Deployment is healthy and ready to use."
+    echo ""
+    log_info "Access the service:"
+    echo "  kubectl port-forward -n $NAMESPACE svc/$SERVICE $PORT:$PORT"
+    echo "  curl http://localhost:$PORT/health"
+    echo ""
+    exit 0
+else
+    log_error "Some checks failed!"
+    echo ""
+    log_info "Troubleshooting steps:"
+    echo "  1. Check pod logs: kubectl logs -n $NAMESPACE -l app.kubernetes.io/name=llm-katan"
+    echo "  2. Describe pods: kubectl describe pod -n $NAMESPACE -l app.kubernetes.io/name=llm-katan"
+    echo "  3. Check events: kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp'"
+    echo ""
+    exit 1
+fi
diff --git a/e2e-tests/llm-katan/llm_katan/config.py b/e2e-tests/llm-katan/llm_katan/config.py
index 1f91d8ac9..138ed9d92 100644
--- a/e2e-tests/llm-katan/llm_katan/config.py
+++ b/e2e-tests/llm-katan/llm_katan/config.py
@@ -31,15 +31,14 @@ def __post_init__(self):
 
         # Apply environment variable overrides
         self.model_name = os.getenv("YLLM_MODEL", self.model_name)
+        self.served_model_name = os.getenv("YLLM_SERVED_MODEL_NAME", self.served_model_name)
         self.port = int(os.getenv("YLLM_PORT", str(self.port)))
         self.backend = os.getenv("YLLM_BACKEND", self.backend)
         self.host = os.getenv("YLLM_HOST", self.host)
 
         # Validate backend
         if self.backend not in ["transformers", "vllm"]:
-            raise ValueError(
-                f"Invalid backend: {self.backend}. Must be 'transformers' or 'vllm'"
-            )
+            raise ValueError(f"Invalid backend: {self.backend}. Must be 'transformers' or 'vllm'")
 
     @property
     def device_auto(self) -> str:
diff --git a/tools/make/kube.mk b/tools/make/kube.mk
index 0ff5b34fd..1e97010eb 100644
--- a/tools/make/kube.mk
+++ b/tools/make/kube.mk
@@ -166,6 +166,135 @@ setup: create-cluster deploy ## Complete setup: create cluster and deploy
 cleanup: undeploy delete-cluster ## Complete cleanup: undeploy and delete cluster
 	@echo "$(GREEN)[SUCCESS]$(NC) Complete cleanup finished!"
 
+##@ LLM Katan Kubernetes
+
+# LLM Katan configuration
+LLM_KATAN_NAMESPACE ?= llm-katan-system
+LLM_KATAN_BASE_PATH ?= e2e-tests/llm-katan/deploy/kubernetes
+LLM_KATAN_OVERLAY ?= gpt35
+LLM_KATAN_IMAGE ?= $(DOCKER_REGISTRY)/llm-katan:$(DOCKER_TAG)
+
+.PHONY: kube-deploy-llm-katan kube-deploy-llm-katan-gpt35 kube-deploy-llm-katan-claude \
+    kube-undeploy-llm-katan kube-status-llm-katan kube-logs-llm-katan \
+    kube-port-forward-llm-katan kube-test-llm-katan kube-load-llm-katan-image \
+    kube-deploy-llm-katan-multi help-kube-llm-katan
+
+# Deploy llm-katan with specified overlay
+kube-deploy-llm-katan: ## Deploy llm-katan to cluster (OVERLAY=gpt35|claude, default: gpt35)
+	@echo "$(BLUE)[INFO]$(NC) Deploying llm-katan with overlay: $(LLM_KATAN_OVERLAY)"
+	@if ! 
kubectl cluster-info &>/dev/null; then \
+		echo "$(RED)[ERROR]$(NC) Kubernetes cluster is not accessible"; \
+		echo "$(BLUE)[INFO]$(NC) Run 'make create-cluster' first"; \
+		exit 1; \
+	fi
+	@echo "$(BLUE)[INFO]$(NC) Applying Kubernetes manifests..."
+	@kubectl apply -k $(LLM_KATAN_BASE_PATH)/overlays/$(LLM_KATAN_OVERLAY)
+	@echo "$(BLUE)[INFO]$(NC) Waiting for namespace to be ready..."
+	@kubectl wait --for=jsonpath='{.status.phase}'=Active namespace/$(LLM_KATAN_NAMESPACE) --timeout=60s || true
+	@echo "$(BLUE)[INFO]$(NC) Waiting for deployment to be ready..."
+	@kubectl wait --for=condition=Available deployment/llm-katan-$(LLM_KATAN_OVERLAY) \
+		-n $(LLM_KATAN_NAMESPACE) --timeout=600s || echo "$(YELLOW)[WARNING]$(NC) Deployment not ready yet, check status with: make kube-status-llm-katan"
+	@echo "$(GREEN)[SUCCESS]$(NC) LLM Katan deployment completed!"
+	@echo "$(BLUE)[INFO]$(NC) Deployment status:"
+	@kubectl get pods -n $(LLM_KATAN_NAMESPACE) -l app.kubernetes.io/name=llm-katan -o wide
+
+# Deploy llm-katan with gpt35 overlay
+kube-deploy-llm-katan-gpt35: ## Deploy llm-katan with GPT-3.5 overlay
+	@$(MAKE) kube-deploy-llm-katan LLM_KATAN_OVERLAY=gpt35
+	@echo "$(GREEN)[SUCCESS]$(NC) GPT-3.5 simulation deployed!"
+	@echo "$(BLUE)[INFO]$(NC) Test with: make kube-test-llm-katan LLM_KATAN_OVERLAY=gpt35"
+
+# Deploy llm-katan with claude overlay
+kube-deploy-llm-katan-claude: ## Deploy llm-katan with Claude overlay
+	@$(MAKE) kube-deploy-llm-katan LLM_KATAN_OVERLAY=claude
+	@echo "$(GREEN)[SUCCESS]$(NC) Claude simulation deployed!"
+	@echo "$(BLUE)[INFO]$(NC) Test with: make kube-test-llm-katan LLM_KATAN_OVERLAY=claude"
+
+# Deploy both overlays for multi-model testing
+kube-deploy-llm-katan-multi: ## Deploy both gpt35 and claude overlays
+	@echo "$(BLUE)[INFO]$(NC) Deploying multiple llm-katan instances..."
+	@$(MAKE) kube-deploy-llm-katan-gpt35
+	@echo ""
+	@$(MAKE) kube-deploy-llm-katan-claude
+	@echo ""
+	@echo "$(GREEN)[SUCCESS]$(NC) Multi-model deployment completed!"
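+# Both overlays share the llm-katan-system namespace; the per-overlay nameSuffix
+# keeps their Deployments, Services, and PVCs distinct.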
+	@echo "$(BLUE)[INFO]$(NC) Available models:"
+	@kubectl get pods -n $(LLM_KATAN_NAMESPACE) -o wide
+
+# Remove llm-katan from the cluster
+kube-undeploy-llm-katan: ## Remove llm-katan from cluster (OVERLAY=gpt35|claude|all, default: gpt35)
+	@echo "$(BLUE)[INFO]$(NC) Removing llm-katan overlay: $(LLM_KATAN_OVERLAY)"
+	@if [ "$(LLM_KATAN_OVERLAY)" = "all" ]; then \
+		echo "$(BLUE)[INFO]$(NC) Removing all llm-katan deployments..."; \
+		kubectl delete -k $(LLM_KATAN_BASE_PATH)/overlays/gpt35 --ignore-not-found=true; \
+		kubectl delete -k $(LLM_KATAN_BASE_PATH)/overlays/claude --ignore-not-found=true; \
+	else \
+		kubectl delete -k $(LLM_KATAN_BASE_PATH)/overlays/$(LLM_KATAN_OVERLAY) --ignore-not-found=true; \
+	fi
+	@echo "$(GREEN)[SUCCESS]$(NC) LLM Katan undeployment completed"
+
+# Show llm-katan deployment status
+kube-status-llm-katan: ## Show llm-katan deployment status
+	@echo "$(BLUE)[INFO]$(NC) LLM Katan deployment status"
+	@echo "$(BLUE)[INFO]$(NC) Namespace: $(LLM_KATAN_NAMESPACE)"
+	@echo ""
+	@echo "$(BLUE)[INFO]$(NC) Pods:"
+	@kubectl get pods -n $(LLM_KATAN_NAMESPACE) -o wide || echo "$(RED)[ERROR]$(NC) Cannot get pods"
+	@echo ""
+	@echo "$(BLUE)[INFO]$(NC) Services:"
+	@kubectl get services -n $(LLM_KATAN_NAMESPACE) || echo "$(RED)[ERROR]$(NC) Cannot get services"
+	@echo ""
+	@echo "$(BLUE)[INFO]$(NC) PVCs:"
+	@kubectl get pvc -n $(LLM_KATAN_NAMESPACE) || echo "$(RED)[ERROR]$(NC) Cannot get PVCs"
+	@echo ""
+	@echo "$(BLUE)[INFO]$(NC) Deployments:"
+	@kubectl get deployments -n $(LLM_KATAN_NAMESPACE) || echo "$(RED)[ERROR]$(NC) Cannot get deployments"
+
+# Show llm-katan logs
+kube-logs-llm-katan: ## Show llm-katan logs (OVERLAY=gpt35|claude, default: gpt35)
+	@echo "$(BLUE)[INFO]$(NC) Showing llm-katan logs for overlay: $(LLM_KATAN_OVERLAY)"
+	@kubectl logs -n $(LLM_KATAN_NAMESPACE) deployment/llm-katan-$(LLM_KATAN_OVERLAY) -f
+
+# Port forward llm-katan API
+kube-port-forward-llm-katan: ## Port forward llm-katan API (OVERLAY=gpt35|claude, PORT=8000)
+	@$(eval PORT ?= 8000)
+	@echo "$(BLUE)[INFO]$(NC) Port forwarding llm-katan API (overlay: $(LLM_KATAN_OVERLAY))"
+	@echo "$(YELLOW)[INFO]$(NC) Access API at: http://localhost:$(PORT)"
+	@echo "$(YELLOW)[INFO]$(NC) Health check: curl http://localhost:$(PORT)/health"
+	@echo "$(YELLOW)[INFO]$(NC) Models: curl http://localhost:$(PORT)/v1/models"
+	@echo "$(YELLOW)[INFO]$(NC) Press Ctrl+C to stop port forwarding"
+	@kubectl port-forward -n $(LLM_KATAN_NAMESPACE) svc/llm-katan-$(LLM_KATAN_OVERLAY) $(PORT):8000
+
+# Test llm-katan deployment
+kube-test-llm-katan: ## Test llm-katan deployment (OVERLAY=gpt35|claude, default: gpt35)
+	@echo "$(BLUE)[INFO]$(NC) Testing llm-katan deployment (overlay: $(LLM_KATAN_OVERLAY))"
+	@echo "$(BLUE)[INFO]$(NC) Checking pod status..."
+	@kubectl get pods -n $(LLM_KATAN_NAMESPACE) -l app.kubernetes.io/name=llm-katan -o wide
+	@echo ""
+	@echo "$(BLUE)[INFO]$(NC) Checking service..."
+	@kubectl get svc -n $(LLM_KATAN_NAMESPACE) llm-katan-$(LLM_KATAN_OVERLAY)
+	@echo ""
+	@echo "$(BLUE)[INFO]$(NC) Checking pod readiness..."
+	@kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=llm-katan \
+		-n $(LLM_KATAN_NAMESPACE) --timeout=60s || echo "$(RED)[ERROR]$(NC) Pod not ready"
+	@echo ""
+	@echo "$(BLUE)[INFO]$(NC) Testing API endpoint (requires port-forward in another terminal)..."
+ @echo "$(YELLOW)[INFO]$(NC) Run in another terminal: make kube-port-forward-llm-katan LLM_KATAN_OVERLAY=$(LLM_KATAN_OVERLAY)" + @echo "$(YELLOW)[INFO]$(NC) Then test with: curl http://localhost:8000/health" + @echo "$(GREEN)[SUCCESS]$(NC) Deployment test completed" + +# Load llm-katan image into kind cluster +kube-load-llm-katan-image: ## Load llm-katan Docker image into kind cluster + @echo "$(BLUE)[INFO]$(NC) Loading llm-katan Docker image into kind cluster" + @if ! kind get clusters | grep -q "^$(KIND_CLUSTER_NAME)$$"; then \ + echo "$(RED)[ERROR]$(NC) Cluster $(KIND_CLUSTER_NAME) does not exist"; \ + echo "$(BLUE)[INFO]$(NC) Run 'make create-cluster' first"; \ + exit 1; \ + fi + @echo "$(BLUE)[INFO]$(NC) Loading image: $(LLM_KATAN_IMAGE)" + @kind load docker-image $(LLM_KATAN_IMAGE) --name $(KIND_CLUSTER_NAME) + @echo "$(GREEN)[SUCCESS]$(NC) LLM Katan image loaded successfully" + # Help target help-kube: ## Show Kubernetes makefile help @echo "$(BLUE)Configuration variables:$(NC)"