diff --git a/e2e-tests/llm-katan/README.md b/e2e-tests/llm-katan/README.md
index 8d6d9658d..5c25f7e78 100644
--- a/e2e-tests/llm-katan/README.md
+++ b/e2e-tests/llm-katan/README.md
@@ -38,6 +38,25 @@ docker run -p 8000:8000 ghcr.io/vllm-project/semantic-router/llm-katan:latest \
 llm-katan --served-model-name "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
 ```
 
+#### Option 3: Kubernetes
+
+```bash
+# Quick start with make targets
+make kube-deploy-llm-katan-gpt35    # Deploy GPT-3.5 simulation
+make kube-deploy-llm-katan-claude   # Deploy Claude simulation
+make kube-deploy-llm-katan-multi    # Deploy both models
+
+# Or manually with kubectl
+kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35
+kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/claude
+
+# Port forward and test
+make kube-port-forward-llm-katan LLM_KATAN_OVERLAY=gpt35
+curl http://localhost:8000/health
+```
+
+**📚 For a comprehensive Kubernetes deployment guide, see [deploy/docs/README.md](deploy/docs/README.md)**
+
 ### Setup
 
 #### HuggingFace Token (Required)
diff --git a/e2e-tests/llm-katan/deploy/docs/README.md b/e2e-tests/llm-katan/deploy/docs/README.md
new file mode 100644
index 000000000..19797288a
--- /dev/null
+++ b/e2e-tests/llm-katan/deploy/docs/README.md
@@ -0,0 +1,941 @@
+# LLM Katan - Kubernetes Deployment
+
+Comprehensive Kubernetes support for deploying LLM Katan in cloud-native environments.
+
+## Overview
+
+This directory provides production-ready Kubernetes manifests using Kustomize for deploying LLM Katan - a lightweight LLM server designed for testing and development workflows.
+
+**Local Development:** This guide includes complete setup examples for both **kind** and **minikube** clusters, making it easy to run LLM Katan locally for development and testing.
+
+## Architecture
+
+### Pod Structure
+
+Each deployment consists of two containers:
+
+- **initContainer (model-downloader)**: Downloads models from HuggingFace to PVC
+  - Image: `python:3.11-slim` (~45MB)
+  - Checks if model exists before downloading
+  - Runs once before main container starts
+
+- **main container (llm-katan)**: Serves the LLM API
+  - Image: `ghcr.io/vllm-project/semantic-router/llm-katan:latest` (~1.35GB)
+  - Loads model from PVC cache
+  - Exposes OpenAI-compatible API on port 8000
+
+### Storage
+
+- **PersistentVolumeClaim**: 5Gi for model caching
+- **Mount Path**: `/cache/models/`
+- **Access Mode**: ReadWriteOnce (single Pod write)
+- Models persist across Pod restarts
+
+### Namespace
+
+All resources deploy to the `llm-katan-system` namespace. Each overlay creates isolated instances within this namespace:
+
+- **gpt35**: Simulates GPT-3.5-turbo
+- **claude**: Simulates Claude-3-Haiku
+
+### Resource Naming
+
+Kustomize applies `nameSuffix` to avoid conflicts:
+
+- Base: `llm-katan`
+- gpt35 overlay: `llm-katan-gpt35` (via `nameSuffix: -gpt35`)
+- claude overlay: `llm-katan-claude` (via `nameSuffix: -claude`)
+
+**How it works:**
+
+```yaml
+# overlays/gpt35/kustomization.yaml
+nameSuffix: -gpt35  # Automatically appends to all resource names
+```
+
+This creates unique resource names for each overlay without manual patches, allowing multiple instances to coexist in the same namespace.
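+
+To preview the suffixing without touching a cluster, you can render an overlay locally (an optional check; the names shown are what the overlay is expected to produce):
+
+```bash
+# Render the gpt35 overlay and list the resource names it would create
+kubectl kustomize e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35 | grep "name: llm-katan"
+# Expect entries such as:
+#   name: llm-katan-gpt35         (Deployment/Service)
+#   name: llm-katan-models-gpt35  (PVC)
+```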
+ +### Networking + +- **Service Type**: ClusterIP (internal only) +- **Port**: 8000 (HTTP) +- **Endpoints**: `/health`, `/v1/models`, `/v1/chat/completions`, `/metrics` + +### Health Checks + +- **Startup Probe**: 30s initial delay, 60 failures (15 min max startup) +- **Liveness Probe**: 15s delay, checks every 20s +- **Readiness Probe**: 5s delay, checks every 10s + +## Directory Structure + +``` +e2e-tests/llm-katan/deploy/ +├── docs/ # Documentation +│ └── README.md # This file - comprehensive deployment guide +│ +└── kubernetes/ # Kubernetes manifests + ├── base/ # Base Kubernetes manifests + │ ├── namespace.yaml # llm-katan-system namespace + │ ├── deployment.yaml # Main deployment with health checks + │ ├── service.yaml # ClusterIP service (port 8000) + │ ├── pvc.yaml # Model cache storage (5Gi) + │ └── kustomization.yaml # Base kustomization + │ + ├── components/ # Reusable Kustomize components + │ └── common/ # Common labels for all resources + │ └── kustomization.yaml # Shared label definitions + │ + └── overlays/ # Environment-specific configurations + ├── gpt35/ # GPT-3.5-turbo simulation + │ └── kustomization.yaml # Overlay with patches for gpt35 + │ + └── claude/ # Claude-3-Haiku simulation + └── kustomization.yaml # Overlay with patches for claude +``` + +## Prerequisites + +Before starting, ensure you have the following tools installed: + +- [Docker](https://docs.docker.com/get-docker/) - Container runtime +- **Local Kubernetes cluster** (choose one): + - [kind](https://kind.sigs.k8s.io/docs/user/quick-start/#installation) - Kubernetes in Docker (recommended for CI/CD) + - [minikube](https://minikube.sigs.k8s.io/docs/start/) - Local Kubernetes (recommended for development) +- [kubectl](https://kubernetes.io/docs/tasks/tools/) - Kubernetes CLI +- `kustomize` (built into kubectl 1.14+) + +## Local Cluster Setup + +This guide provides examples for both **kind** and **minikube** clusters. Choose the one that best fits your needs. + +### Option 1: kind (Kubernetes in Docker) + +**Installation:** + +```bash +# Install kind +curl -Lo ./kind https://kind.sigs.k8s.io/dl/latest/kind-linux-amd64 +chmod +x ./kind +sudo mv ./kind /usr/local/bin/kind + +# Verify installation +kind version +``` + +**Create Cluster:** + +```bash +# Create a basic cluster +kind create cluster --name llm-katan-test + +# Verify cluster is running +kubectl cluster-info --context kind-llm-katan-test +kind get clusters +``` + +**Load Docker Image (Required):** + +```bash +# Build the image first (if not already built) +docker build -t ghcr.io/vllm-project/semantic-router/llm-katan:latest -f Dockerfile . + +# Load image into kind cluster +kind load docker-image ghcr.io/vllm-project/semantic-router/llm-katan:latest --name llm-katan-test +``` + +### Option 2: minikube + +**Installation:** + +```bash +# Download minikube +cd /tmp && curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 + +# Install minikube +sudo install /tmp/minikube-linux-amd64 /usr/local/bin/minikube + +# Verify installation +minikube version +``` + +**Start Cluster:** + +```bash +# Start with recommended resources (16GB for running multiple instances) +minikube start --driver=docker --memory=16384 --cpus=4 + +# Verify cluster is running +minikube status +kubectl cluster-info +``` + +**Load Docker Image (Required):** + +```bash +# Build the image first (if not already built) +docker build -t ghcr.io/vllm-project/semantic-router/llm-katan:latest -f Dockerfile . 
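+
+# Optional sanity check: confirm the image exists locally before loading it
+docker image ls ghcr.io/vllm-project/semantic-router/llm-katan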
+ +# Load image into minikube +minikube image load ghcr.io/vllm-project/semantic-router/llm-katan:latest + +# Verify image is loaded +minikube image ls | grep llm-katan +``` + +### Switching Between Clusters + +If you have multiple clusters (kind, minikube, etc.), you need to select which one kubectl should use: + +```bash +# List all contexts +kubectl config get-contexts + +# Switch to kind +kubectl config use-context kind-llm-katan-test + +# Switch to minikube +kubectl config use-context minikube + +# Check current context +kubectl config current-context +``` + +The `*` symbol indicates the active context. All `kubectl` commands will target this cluster. + +### Configuration + +Environment variables are defined directly in `deployment.yaml`: + +| Variable | Default | Description | +|----------|---------|-------------| +| `YLLM_MODEL` | `Qwen/Qwen3-0.6B` | HuggingFace model to load | +| `YLLM_SERVED_MODEL_NAME` | (empty) | Model name for API (defaults to YLLM_MODEL) | +| `YLLM_BACKEND` | `transformers` | Backend: `transformers` or `vllm` | +| `YLLM_HOST` | `0.0.0.0` | Server bind address | +| `YLLM_PORT` | `8000` | Server port | + +### Resource Limits + +Default per instance: + +```yaml +resources: + requests: + cpu: "1" + memory: "3Gi" + limits: + cpu: "2" + memory: "6Gi" +``` + +**GPU Support:** + +LLM Katan is optimized for CPU workloads with tiny models. For GPU testing scenarios: + +```yaml +# Add to deployment.yaml resources section +limits: + nvidia.com/gpu: 1 +``` + +**Note:** For production GPU deployments with larger models, use the main Semantic Router instead of LLM Katan. + +### Storage + +- **PVC Size**: 5Gi (adjust in overlays if needed) +- **Access Mode**: ReadWriteOnce +- **Mount Path**: `/cache/models/` +- **Purpose**: Cache downloaded models between restarts + +## Complete Workflows + +### Quick Start (Using Make) + +Complete setup from scratch using make targets: + +```bash +# 1. Create kind cluster (if using kind) +make create-cluster KIND_CLUSTER_NAME=llm-katan-test + +# 2. Build and load Docker image +make docker-build-llm-katan +make kube-load-llm-katan-image KIND_CLUSTER_NAME=llm-katan-test + +# 3. Deploy both models +make kube-deploy-llm-katan-multi + +# 4. Check status +make kube-status-llm-katan + +# 5. Test deployment +make kube-test-llm-katan + +# 6. 
Access the service (in another terminal) +make kube-port-forward-llm-katan +# Then: curl http://localhost:8000/health +``` + +### Development Workflow + +For iterative development and testing: + +```bash +# Build and deploy +make docker-build-llm-katan +make kube-load-llm-katan-image +make kube-deploy-llm-katan-gpt35 + +# Make changes, rebuild, and redeploy +make docker-build-llm-katan +make kube-load-llm-katan-image +kubectl rollout restart deployment/llm-katan-gpt35 -n llm-katan-system + +# View logs during testing +make kube-logs-llm-katan +``` + +### Testing Multiple Models + +For testing routing between different LLM models: + +```bash +# Deploy both models +make kube-deploy-llm-katan-multi + +# Port-forward both (in separate terminals) +# Terminal 1: +make kube-port-forward-llm-katan LLM_KATAN_OVERLAY=gpt35 PORT=8000 + +# Terminal 2: +make kube-port-forward-llm-katan LLM_KATAN_OVERLAY=claude PORT=8001 + +# Test both endpoints +curl http://localhost:8000/v1/models # GPT-3.5 +curl http://localhost:8001/v1/models # Claude +``` + +## Deployment Options + +You have two main ways to deploy LLM Katan: + +### Option A: Using Make Targets (Recommended) + +**Best for:** Daily use, automation, simplified commands + +See the [Complete Workflows](#complete-workflows) section above for step-by-step guides. + +```bash +# Quick deployment +make kube-deploy-llm-katan-multi # Deploy both models +make kube-status-llm-katan # Check status +make kube-test-llm-katan # Verify deployment +``` + +### Option B: Using kubectl Directly + +**Best for:** Custom configurations, troubleshooting, learning Kubernetes + +**Deploy from repository root:** + +```bash +# Single model +kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35 + +# Both models +kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35 +kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/claude + +# Verify +kubectl get all -n llm-katan-system +``` + +## Make Targets + +All commands should be run from the repository root. 
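+
+The variables in the table below combine freely on a single invocation. For example (an illustrative pair of calls, not a required sequence):
+
+```bash
+# Deploy the claude overlay, then expose it on a non-default local port
+make kube-deploy-llm-katan LLM_KATAN_OVERLAY=claude
+make kube-port-forward-llm-katan LLM_KATAN_OVERLAY=claude PORT=8001
+```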
+ +### Configuration Variables + +The following environment variables can be used to customize the make targets: + +| Variable | Default | Description | +|----------|---------|-------------| +| `LLM_KATAN_OVERLAY` | `gpt35` | Overlay to deploy: `gpt35`, `claude`, or `all` (for undeploy) | +| `LLM_KATAN_NAMESPACE` | `llm-katan-system` | Kubernetes namespace for deployments | +| `LLM_KATAN_BASE_PATH` | `e2e-tests/llm-katan/deploy/kubernetes` | Base path to Kubernetes manifests | +| `PORT` | `8000` | Local port for port-forwarding | +| `KIND_CLUSTER_NAME` | `semantic-router-cluster` | Kind cluster name | + +### Deployment + +```bash +# Deploy single overlay +make kube-deploy-llm-katan # Deploy with default overlay (gpt35) +make kube-deploy-llm-katan LLM_KATAN_OVERLAY=claude # Deploy with custom overlay + +# Deploy specific overlays +make kube-deploy-llm-katan-gpt35 # Deploy GPT-3.5 simulation +make kube-deploy-llm-katan-claude # Deploy Claude simulation + +# Deploy multiple overlays +make kube-deploy-llm-katan-multi # Deploy both gpt35 and claude +``` + +### Status & Monitoring + +```bash +# Show deployment status +make kube-status-llm-katan # Show all llm-katan resources + +# View logs +make kube-logs-llm-katan # View logs (default: gpt35) +make kube-logs-llm-katan LLM_KATAN_OVERLAY=claude # View Claude logs +``` + +### Testing & Debugging + +```bash +# Test deployment +make kube-test-llm-katan # Test deployment (default: gpt35) +make kube-test-llm-katan LLM_KATAN_OVERLAY=claude # Test Claude deployment + +# Port forward for local access +make kube-port-forward-llm-katan # Port forward to localhost:8000 (gpt35) +make kube-port-forward-llm-katan LLM_KATAN_OVERLAY=claude # Port forward Claude +make kube-port-forward-llm-katan LLM_KATAN_OVERLAY=claude PORT=8001 # Custom port +``` + +### Image Management + +```bash +# Build and load Docker images +make docker-build-llm-katan # Build llm-katan Docker image +make kube-load-llm-katan-image # Load image into kind cluster +``` + +### Cleanup + +```bash +# Remove specific deployment +make kube-undeploy-llm-katan # Remove default overlay (gpt35) +make kube-undeploy-llm-katan LLM_KATAN_OVERLAY=gpt35 # Remove gpt35 deployment +make kube-undeploy-llm-katan LLM_KATAN_OVERLAY=claude # Remove claude deployment + +# Remove all deployments +make kube-undeploy-llm-katan LLM_KATAN_OVERLAY=all # Remove all llm-katan deployments +``` + +### Help + +```bash +# Show Kubernetes makefile help +make help-kube # Display all available Kubernetes targets +``` + +## Direct kubectl Commands + +### Deploy + +```bash +# Deploy using kustomize overlays +kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35 +kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/claude + +# Deploy both +kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35 && \ +kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/claude +``` + +### Status + +```bash +# Get all resources +kubectl get all -n llm-katan-system + +# Get pods +kubectl get pods -n llm-katan-system -o wide + +# Get services +kubectl get svc -n llm-katan-system + +# Get PVCs +kubectl get pvc -n llm-katan-system +``` + +### Logs + +```bash +# View logs +kubectl logs -n llm-katan-system -l app=llm-katan-gpt35 -f +kubectl logs -n llm-katan-system -l app=llm-katan-claude -f + +# View init container logs (model download) +kubectl logs -n llm-katan-system -l app=llm-katan-gpt35 -c model-downloader +``` + +### Port Forward + +```bash +# Forward to localhost +kubectl port-forward -n 
llm-katan-system svc/llm-katan-gpt35 8000:8000 +kubectl port-forward -n llm-katan-system svc/llm-katan-claude 8001:8000 +``` + +### Testing + +```bash +# Health check +curl http://localhost:8000/health + +# List models +curl http://localhost:8000/v1/models + +# Chat completion +curl -X POST http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "gpt-3.5-turbo", + "messages": [{"role": "user", "content": "Hello!"}] + }' + +# Metrics +curl http://localhost:8000/metrics +``` + +### Cleanup + +```bash +# Remove specific deployment +kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35 +kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/overlays/claude + +# Remove entire namespace +kubectl delete namespace llm-katan-system +``` + +## Testing & Verification + +### Health Check + +```bash +kubectl port-forward -n llm-katan-system svc/llm-katan 8000:8000 +curl http://localhost:8000/health + +# Expected response: +# {"status":"ok","model":"Qwen/Qwen3-0.6B","backend":"transformers"} +``` + +### Chat Completion + +```bash +curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen/Qwen3-0.6B", + "messages": [{"role": "user", "content": "Hello!"}] + }' +``` + +### Models Endpoint + +```bash +curl http://localhost:8000/v1/models +``` + +### Metrics (Prometheus) + +```bash +# Don't forget to port-forward first +kubectl port-forward -n llm-katan-system svc/llm-katan 8000:8000 + +# Get metrics +curl http://localhost:8000/metrics +``` + +## Best Practices + +1. **Memory Allocation**: Allocate minimum 8Gi RAM for single instance, 16Gi for multi-model deployments +2. **Model Caching**: Keep PVCs to avoid re-downloading models (first deploy: 5-15 min, cached: 1-3 min) +3. **Cluster Selection**: Use `kind` for CI/CD and automated testing, `minikube` for local development with dashboard +4. **Iterative Testing**: Use `kubectl rollout restart` instead of redeploy for faster iterations (1-3 min vs 5-15 min) +5. **Tool Choice**: Use Make targets for simplified workflows, kubectl for fine-grained control and troubleshooting +6. **Debugging**: Watch pods with `-w` flag, check init container logs for download issues, use `describe pod` for events +7. **Production**: LLM Katan is for testing only - for production use `/deploy/helm/`, `/deploy/kubernetes/`, `/deploy/kserve/`, or `/deploy/openshift/` +8. **Security**: Deployments use non-root containers and enforce resource limits for secure operation + +## Advanced Integration + +### Service Mesh Compatibility + +LLM Katan deployments work with service mesh solutions like Istio and Linkerd: + +**Automatic Features:** + +- mTLS encryption between pods +- Traffic metrics and observability +- Automatic retries and circuit breakers +- Advanced load balancing + +**Enable sidecar injection:** + +```bash +# Label namespace for automatic injection +kubectl label namespace llm-katan-system istio-injection=enabled + +# Redeploy to inject sidecars +kubectl rollout restart deployment -n llm-katan-system +``` + +**Note:** For production Semantic Router with service mesh, see `/deploy/kubernetes/istio/` + +### Testing Semantic Router with LLM Katan + +LLM Katan simulates LLM APIs (GPT, Claude) locally, enabling you to test Semantic Router **without API costs**. + +**Use Case:** Test intelligent routing logic before deploying to production with real LLM APIs. 
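+
+Both instances run the same tiny Qwen model; only the advertised name differs (set via `YLLM_SERVED_MODEL_NAME` in each overlay), so a router sees what look like two distinct providers. A quick way to confirm this, assuming the port-forwards shown earlier are active:
+
+```bash
+curl -s http://localhost:8000/v1/models | grep '"id"'   # should report "gpt-3.5-turbo"
+curl -s http://localhost:8001/v1/models | grep '"id"'   # should report "claude-3-haiku-20240307"
+```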
+
+#### Step 1: Deploy LLM Katan
+
+```bash
+# Deploy both GPT-3.5 and Claude simulators
+make kube-deploy-llm-katan-multi
+
+# Verify services are running
+kubectl get svc -n llm-katan-system
+# NAME               TYPE        CLUSTER-IP      PORT(S)
+# llm-katan-gpt35    ClusterIP   10.96.186.147   8000/TCP
+# llm-katan-claude   ClusterIP   10.96.119.98    8000/TCP
+```
+
+#### Step 2: Configure Semantic Router
+
+Update `config/config.yaml` to point to LLM Katan endpoints:
+
+```yaml
+# config/config.yaml
+
+vllm_endpoints:
+  - name: "gpt35-katan"
+    address: "llm-katan-gpt35.llm-katan-system"  # Kubernetes DNS
+    port: 8000
+    weight: 1
+
+  - name: "claude-katan"
+    address: "llm-katan-claude.llm-katan-system"
+    port: 8000
+    weight: 1
+
+model_config:
+  "gpt-3.5-turbo":
+    preferred_endpoints: ["gpt35-katan"]
+
+  "claude-3-haiku-20240307":
+    preferred_endpoints: ["claude-katan"]
+
+categories:
+  - name: coding
+    utterances:
+      - "write code"
+      - "debug"
+    model_scores:
+      "gpt-3.5-turbo": 0.9
+```
+
+#### Step 3: Deploy and Test
+
+```bash
+# Deploy Semantic Router (using Helm)
+helm install semantic-router deploy/helm/semantic-router \
+  -f config/config.yaml
+
+# Or run locally
+make run-router
+
+# Test routing (with the Semantic Router port-forwarded to 8080)
+curl -X POST http://localhost:8080/api/v1/route \
+  -H "Content-Type: application/json" \
+  -d '{
+    "text": "Write a Python function to sort a list",
+    "stream": false
+  }'
+```
+
+### Deployment Verification
+
+Use the automated verification script:
+
+```bash
+# Run comprehensive deployment checks (default: llm-katan-system namespace)
+./e2e-tests/llm-katan/deploy/kubernetes/verify-deployment.sh
+
+# Or specify namespace and service name
+./e2e-tests/llm-katan/deploy/kubernetes/verify-deployment.sh llm-katan-system llm-katan-gpt35
+./e2e-tests/llm-katan/deploy/kubernetes/verify-deployment.sh llm-katan-system llm-katan-claude
+```
+
+## Troubleshooting
+
+### Common Issues
+
+**Common pod errors:**
+
+- **OOMKilled (Exit Code 137)**: Pod exceeded memory limit during model loading
+  - Solution for minikube: Restart with more RAM: `minikube delete && minikube start --memory=16384 --cpus=4`
+  - Solution for manifests: Increase memory in `deployment.yaml` (current: 6Gi)
+- **ImagePullBackOff**: Image not available in cluster
+  - For kind: `kind load docker-image ghcr.io/vllm-project/semantic-router/llm-katan:latest --name llm-katan-test`
+  - For minikube: `minikube image load ghcr.io/vllm-project/semantic-router/llm-katan:latest`
+- **Init:CrashLoopBackOff**: Model download failed
+  - Check initContainer logs: `kubectl logs -n llm-katan-system -l app.kubernetes.io/name=llm-katan -c model-downloader`
+
+**Pod not starting:**
+
+```bash
+# Check pod status
+kubectl get pods -n llm-katan-system
+
+# Describe pod for events
+kubectl describe pod -n llm-katan-system -l app.kubernetes.io/name=llm-katan
+
+# Check initContainer logs (model download)
+kubectl logs -n llm-katan-system -l app.kubernetes.io/name=llm-katan -c model-downloader
+
+# Check main container logs
+kubectl logs -n llm-katan-system -l app.kubernetes.io/name=llm-katan -c llm-katan -f
+```
+
+**LLM Katan not responding:**
+
+```bash
+# Check deployment status
+kubectl get deployment -n llm-katan-system
+
+# Check service
+kubectl get svc -n llm-katan-system
+
+# Check if port-forward is active
+ps aux | grep "port-forward" | grep llm-katan
+
+# Test health endpoint
+kubectl port-forward -n llm-katan-system svc/llm-katan-gpt35 8000:8000 &
+curl http://localhost:8000/health
+```
+
+**PVC issues:**
+
+```bash
+# Check PVC status
+kubectl get pvc -n 
llm-katan-system + +# Check PVC details +kubectl describe pvc -n llm-katan-system + +# Check volume contents (if pod is running) +kubectl exec -n llm-katan-system -- ls -lah /cache/models/ +``` + +## Cleanup + +**Remove Specific Overlay:** + +```bash +# Remove gpt35 instance +kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35/ + +# Remove claude instance +kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/overlays/claude/ +``` + +**Remove All llm-katan Resources:** + +```bash +# Delete entire namespace (removes everything) +kubectl delete namespace llm-katan-system + +# Or delete base deployment +kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/base/ +``` + +**Cleanup Local Cluster:** + +```bash +# For kind +kind delete cluster --name llm-katan-test +# Or if using default cluster name +kind delete cluster + +# For minikube +minikube stop # Stop the cluster (preserves state) +minikube delete # Delete the cluster entirely +``` + +## CI/CD Integration + +### GitHub Actions Example + +Complete workflow with e2e tests: + +```yaml +name: LLM Katan E2E Tests + +on: + pull_request: + branches: [main] + push: + branches: [main] + +jobs: + test-deployment: + runs-on: ubuntu-latest + timeout-minutes: 30 + + steps: + - name: Checkout code + uses: actions/checkout@v3 + + - name: Set up Python + uses: actions/setup-python@v4 + with: + python-version: '3.11' + + - name: Install test dependencies + run: pip install pytest requests + + - name: Create kind cluster + run: make create-cluster KIND_CLUSTER_NAME=ci-test + + - name: Build and load Docker image + run: | + make docker-build-llm-katan + make kube-load-llm-katan-image KIND_CLUSTER_NAME=ci-test + + - name: Deploy LLM Katan (both models) + run: make kube-deploy-llm-katan-multi + + - name: Wait for deployments + run: | + make kube-test-llm-katan LLM_KATAN_OVERLAY=gpt35 + make kube-test-llm-katan LLM_KATAN_OVERLAY=claude + + - name: Run integration tests + run: | + # Port-forward in background + kubectl port-forward -n llm-katan-system svc/llm-katan-gpt35 8000:8000 & + kubectl port-forward -n llm-katan-system svc/llm-katan-claude 8001:8000 & + sleep 5 + + # Run e2e tests (if available) + # pytest e2e-tests/ -v + + # Or simple health check + curl -f http://localhost:8000/health + curl -f http://localhost:8001/health + + - name: Show logs on failure + if: failure() + run: | + kubectl get all -n llm-katan-system + kubectl logs -n llm-katan-system -l app=llm-katan-gpt35 --tail=100 + kubectl logs -n llm-katan-system -l app=llm-katan-claude --tail=100 + + - name: Cleanup + if: always() + run: | + make kube-undeploy-llm-katan LLM_KATAN_OVERLAY=all + make delete-cluster KIND_CLUSTER_NAME=ci-test +``` + +### GitLab CI Example + +```yaml +test-llm-katan: + stage: test + script: + - make create-cluster + - make docker-build-llm-katan + - make kube-load-llm-katan-image + - make kube-deploy-llm-katan-multi + - make kube-test-llm-katan + after_script: + - make delete-cluster + +``` + +## Quick Reference + +### Essential Make Commands (Recommended) + +**From repository root:** + +```bash +# Deployment +make kube-deploy-llm-katan-multi # Deploy both models +make kube-deploy-llm-katan-gpt35 # Deploy GPT-3.5 only +make kube-deploy-llm-katan-claude # Deploy Claude only + +# Status & Logs +make kube-status-llm-katan # Show all resources +make kube-logs-llm-katan # View logs (gpt35) +make kube-logs-llm-katan LLM_KATAN_OVERLAY=claude + +# Testing +make kube-test-llm-katan # Test gpt35 +make kube-port-forward-llm-katan # Access at 
localhost:8000 + +# Cleanup +make kube-undeploy-llm-katan LLM_KATAN_OVERLAY=gpt35 +make kube-undeploy-llm-katan LLM_KATAN_OVERLAY=all +``` + +### Direct kubectl Commands (For Advanced Use) + +**When you need more control:** + +```bash +# Deploy +kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35 +kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/claude + +# Status +kubectl get all,pvc -n llm-katan-system +kubectl get pods -n llm-katan-system -o wide +kubectl describe pod -n llm-katan-system -l app=llm-katan-gpt35 + +# Logs +kubectl logs -n llm-katan-system -l app=llm-katan-gpt35 -f +kubectl logs -n llm-katan-system -c model-downloader # Init container + +# Port-forward +kubectl port-forward -n llm-katan-system svc/llm-katan-gpt35 8000:8000 +kubectl port-forward -n llm-katan-system svc/llm-katan-claude 8001:8000 + +# Testing +kubectl exec -n llm-katan-system deployment/llm-katan-gpt35 -- curl localhost:8000/health + +# Cleanup +kubectl delete -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35 +kubectl delete namespace llm-katan-system +``` + +### Resource Specifications + +| Component | Value | +|-----------|-------| +| **Namespace** | `llm-katan-system` | +| **Service Port** | `8000` | +| **PVC Size** | `5Gi` | +| **CPU Request** | `1 core` | +| **CPU Limit** | `2 cores` | +| **Memory Request** | `3Gi` | +| **Memory Limit** | `6Gi` | +| **Startup Timeout** | `15 minutes` | + +### API Endpoints + +| Endpoint | Description | +|----------|-------------| +| `/health` | Health check | +| `/v1/models` | List available models | +| `/v1/chat/completions` | Chat completion (OpenAI compatible) | +| `/metrics` | Prometheus metrics | diff --git a/e2e-tests/llm-katan/deploy/kubernetes/base/deployment.yaml b/e2e-tests/llm-katan/deploy/kubernetes/base/deployment.yaml new file mode 100644 index 000000000..1931164b1 --- /dev/null +++ b/e2e-tests/llm-katan/deploy/kubernetes/base/deployment.yaml @@ -0,0 +1,144 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: llm-katan +spec: + selector: + matchLabels: {} + replicas: 1 + template: + metadata: + labels: {} + spec: + # Create a non-root user for security (matching Dockerfile) + securityContext: + fsGroup: 1000 + runAsUser: 1000 + runAsNonRoot: true + + initContainers: + # Pre-download model to cache for faster startup + # Uses lightweight python:3.11-slim image and checks if model exists before downloading + - name: model-downloader + image: python:3.11-slim + imagePullPolicy: IfNotPresent + securityContext: + runAsUser: 0 # Run as root to install packages + runAsNonRoot: false + allowPrivilegeEscalation: false + command: ["/bin/bash", "-c"] + args: + - | + set -e + + MODEL_ID="${YLLM_MODEL:-Qwen/Qwen3-0.6B}" + MODEL_DIR=$(basename "$MODEL_ID") + + mkdir -p /cache/models + cd /cache/models + + # Check if model already exists in PVC + if [ -d "$MODEL_DIR" ]; then + echo "Model $MODEL_ID already cached. Skipping download." + exit 0 + fi + + # Model not found, proceed with download + echo "Downloading model $MODEL_ID..." 
+ pip install --no-cache-dir huggingface_hub[cli] + hf download "$MODEL_ID" --local-dir "$MODEL_DIR" + env: + - name: YLLM_MODEL + value: "Qwen/Qwen3-0.6B" + - name: HF_HUB_CACHE + value: "/tmp/hf_cache" + volumeMounts: + - name: models-volume + mountPath: /cache/models + resources: + requests: + memory: "512Mi" + cpu: "250m" + limits: + memory: "1Gi" + cpu: "500m" + + containers: + - name: llm-katan + image: ghcr.io/vllm-project/semantic-router/llm-katan:latest + imagePullPolicy: IfNotPresent + + # Command is set via environment variables + # Default: llm-katan --model Qwen/Qwen3-0.6B --host 0.0.0.0 --port 8000 + + ports: + - name: http + containerPort: 8000 + protocol: TCP + + env: + # These can be overridden via ConfigMap in overlays + - name: YLLM_MODEL + value: "/cache/models/Qwen3-0.6B" # Local path to downloaded model + - name: YLLM_PORT + value: "8000" + - name: YLLM_HOST + value: "0.0.0.0" + - name: YLLM_BACKEND + value: "transformers" + - name: PYTHONUNBUFFERED + value: "1" + - name: PYTHONDONTWRITEBYTECODE + value: "1" + + volumeMounts: + - name: models-volume + mountPath: /cache/models + + livenessProbe: + httpGet: + path: /health + port: http + initialDelaySeconds: 15 + periodSeconds: 20 + timeoutSeconds: 5 + failureThreshold: 3 + + readinessProbe: + httpGet: + path: /health + port: http + initialDelaySeconds: 5 + periodSeconds: 10 + timeoutSeconds: 3 + failureThreshold: 3 + + startupProbe: + httpGet: + path: /health + port: http + initialDelaySeconds: 30 + periodSeconds: 15 + timeoutSeconds: 5 + failureThreshold: 60 # 15 minutes max startup time (for slow model downloads) + + resources: + requests: + memory: "3Gi" + cpu: "1" + limits: + memory: "6Gi" + cpu: "2" + + securityContext: + allowPrivilegeEscalation: false + readOnlyRootFilesystem: false # HuggingFace needs to write to cache + runAsNonRoot: true + capabilities: + drop: + - ALL + + volumes: + - name: models-volume + persistentVolumeClaim: + claimName: llm-katan-models diff --git a/e2e-tests/llm-katan/deploy/kubernetes/base/kustomization.yaml b/e2e-tests/llm-katan/deploy/kubernetes/base/kustomization.yaml new file mode 100644 index 000000000..53b95679c --- /dev/null +++ b/e2e-tests/llm-katan/deploy/kubernetes/base/kustomization.yaml @@ -0,0 +1,21 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +metadata: + name: llm-katan-base + +namespace: llm-katan-system + + +resources: + - namespace.yaml + - pvc.yaml + - deployment.yaml + - service.yaml + +# Images (can be overridden in overlays) +images: + - name: llm-katan + newName: ghcr.io/vllm-project/semantic-router/llm-katan + newTag: latest + diff --git a/e2e-tests/llm-katan/deploy/kubernetes/base/namespace.yaml b/e2e-tests/llm-katan/deploy/kubernetes/base/namespace.yaml new file mode 100644 index 000000000..f53e19f9a --- /dev/null +++ b/e2e-tests/llm-katan/deploy/kubernetes/base/namespace.yaml @@ -0,0 +1,4 @@ +apiVersion: v1 +kind: Namespace +metadata: + name: llm-katan-system diff --git a/e2e-tests/llm-katan/deploy/kubernetes/base/pvc.yaml b/e2e-tests/llm-katan/deploy/kubernetes/base/pvc.yaml new file mode 100644 index 000000000..ed12f2a5f --- /dev/null +++ b/e2e-tests/llm-katan/deploy/kubernetes/base/pvc.yaml @@ -0,0 +1,10 @@ +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: llm-katan-models +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 5Gi # Increased for model cache (~600MB model + overhead) diff --git a/e2e-tests/llm-katan/deploy/kubernetes/base/service.yaml 
b/e2e-tests/llm-katan/deploy/kubernetes/base/service.yaml new file mode 100644 index 000000000..a8cd3bfee --- /dev/null +++ b/e2e-tests/llm-katan/deploy/kubernetes/base/service.yaml @@ -0,0 +1,12 @@ +apiVersion: v1 +kind: Service +metadata: + name: llm-katan +spec: + type: ClusterIP + ports: + - name: http + port: 8000 + targetPort: http + protocol: TCP + diff --git a/e2e-tests/llm-katan/deploy/kubernetes/components/common/kustomization.yaml b/e2e-tests/llm-katan/deploy/kubernetes/components/common/kustomization.yaml new file mode 100644 index 000000000..5312fe4af --- /dev/null +++ b/e2e-tests/llm-katan/deploy/kubernetes/components/common/kustomization.yaml @@ -0,0 +1,10 @@ +apiVersion: kustomize.config.k8s.io/v1alpha1 +kind: Component + +# Common labels applied to all resources that use this component +labels: +- includeSelectors: true + pairs: + app.kubernetes.io/name: llm-katan + app.kubernetes.io/part-of: semantic-router-workspaces + app.kubernetes.io/managed-by: kustomize diff --git a/e2e-tests/llm-katan/deploy/kubernetes/overlays/claude/kustomization.yaml b/e2e-tests/llm-katan/deploy/kubernetes/overlays/claude/kustomization.yaml new file mode 100644 index 000000000..c9367b969 --- /dev/null +++ b/e2e-tests/llm-katan/deploy/kubernetes/overlays/claude/kustomization.yaml @@ -0,0 +1,42 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +metadata: + name: llm-katan-claude + +resources: + - ../../base + +components: + - ../../components/common + +nameSuffix: -claude + +patches: + - target: + kind: Deployment + name: llm-katan + patch: |- + - op: add + path: /spec/template/spec/containers/0/env/- + value: + name: YLLM_SERVED_MODEL_NAME + value: "claude-3-haiku-20240307" + - op: add + path: /spec/template/metadata/labels/model-alias + value: "claude-3-haiku" + - target: + kind: Service + name: llm-katan + patch: |- + - op: add + path: /metadata/labels/model-alias + value: "claude-3-haiku" + # Update PVC reference in deployment to match suffixed PVC name + - target: + kind: Deployment + name: llm-katan + patch: |- + - op: replace + path: /spec/template/spec/volumes/0/persistentVolumeClaim/claimName + value: llm-katan-models-claude diff --git a/e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35/kustomization.yaml b/e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35/kustomization.yaml new file mode 100644 index 000000000..3f714d60b --- /dev/null +++ b/e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35/kustomization.yaml @@ -0,0 +1,41 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +resources: + - ../../base + +components: + - ../../components/common + +nameSuffix: -gpt35 + +patches: + - target: + kind: Deployment + name: llm-katan + patch: |- + - op: add + path: /spec/template/spec/containers/0/env/- + value: + name: YLLM_SERVED_MODEL_NAME + value: "gpt-3.5-turbo" + - op: add + path: /spec/template/metadata/labels/model-alias + value: "gpt-3.5-turbo" + + - target: + kind: Service + name: llm-katan + patch: |- + - op: add + path: /metadata/labels/model-alias + value: "gpt-3.5-turbo" + + # Update PVC reference in deployment to match suffixed PVC name + - target: + kind: Deployment + name: llm-katan + patch: |- + - op: replace + path: /spec/template/spec/volumes/0/persistentVolumeClaim/claimName + value: llm-katan-models-gpt35 diff --git a/e2e-tests/llm-katan/deploy/kubernetes/verify-deployment.sh b/e2e-tests/llm-katan/deploy/kubernetes/verify-deployment.sh new file mode 100644 index 000000000..bdca079ca --- /dev/null +++ 
b/e2e-tests/llm-katan/deploy/kubernetes/verify-deployment.sh
@@ -0,0 +1,252 @@
+#!/bin/bash
+# Verification script for LLM Katan Kubernetes deployment
+# Usage: ./verify-deployment.sh [namespace] [service-name]
+
+set -e
+
+# Colors for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+BLUE='\033[0;34m'
+NC='\033[0m' # No Color
+
+# Default values
+NAMESPACE="${1:-llm-katan-system}"
+SERVICE="${2:-llm-katan}"
+PORT=8000
+
+# Functions
+log_info() {
+    echo -e "${BLUE}[INFO]${NC} $1"
+}
+
+log_success() {
+    echo -e "${GREEN}[SUCCESS]${NC} $1"
+}
+
+log_warning() {
+    echo -e "${YELLOW}[WARNING]${NC} $1"
+}
+
+log_error() {
+    echo -e "${RED}[ERROR]${NC} $1"
+}
+
+# Track overall status
+FAILED=0
+
+echo "=================================================="
+echo "LLM Katan Deployment Verification"
+echo "=================================================="
+log_info "Namespace: $NAMESPACE"
+log_info "Service: $SERVICE"
+echo ""
+
+# Check 1: Namespace exists
+log_info "Checking namespace..."
+if kubectl get namespace "$NAMESPACE" &> /dev/null; then
+    log_success "Namespace $NAMESPACE exists"
+else
+    log_error "Namespace $NAMESPACE not found"
+    FAILED=1
+fi
+echo ""
+
+# Check 2: Deployment exists
+log_info "Checking deployments..."
+if kubectl get deployment -n "$NAMESPACE" &> /dev/null; then
+    DEPLOYMENT_COUNT=$(kubectl get deployment -n "$NAMESPACE" -o name | wc -l)
+    log_success "Found $DEPLOYMENT_COUNT deployment(s)"
+    kubectl get deployment -n "$NAMESPACE"
+else
+    log_error "No deployments found"
+    FAILED=1
+fi
+echo ""
+
+# Check 3: Pods are running
+log_info "Checking pods..."
+POD_STATUS=$(kubectl get pods -n "$NAMESPACE" -o jsonpath='{.items[*].status.phase}' 2>/dev/null || echo "")
+if [ -z "$POD_STATUS" ]; then
+    log_error "No pods found"
+    FAILED=1
+else
+    RUNNING_PODS=$(echo "$POD_STATUS" | tr ' ' '\n' | grep -c "Running" || true)
+    TOTAL_PODS=$(echo "$POD_STATUS" | wc -w)
+
+    if [ "$RUNNING_PODS" -eq "$TOTAL_PODS" ] && [ "$RUNNING_PODS" -gt 0 ]; then
+        log_success "All $RUNNING_PODS/$TOTAL_PODS pods are running"
+        kubectl get pods -n "$NAMESPACE"
+    else
+        log_error "Only $RUNNING_PODS/$TOTAL_PODS pods are running"
+        kubectl get pods -n "$NAMESPACE"
+        FAILED=1
+    fi
+fi
+echo ""
+
+# Check 4: Services exist
+log_info "Checking services..."
+if kubectl get svc -n "$NAMESPACE" -o name | grep -q "$SERVICE"; then
+    log_success "Service $SERVICE exists"
+    kubectl get svc -n "$NAMESPACE" | grep "$SERVICE" || true
+else
+    log_error "Service $SERVICE not found"
+    FAILED=1
+fi
+echo ""
+
+# Check 5: PVC bound
+log_info "Checking PersistentVolumeClaims..."
+PVC_COUNT=$(kubectl get pvc -n "$NAMESPACE" -o name 2>/dev/null | wc -l)
+if [ "$PVC_COUNT" -gt 0 ]; then
+    BOUND_PVCS=$(kubectl get pvc -n "$NAMESPACE" -o jsonpath='{.items[*].status.phase}' 2>/dev/null | tr ' ' '\n' | grep -c "Bound" || true)
+    if [ "$BOUND_PVCS" -eq "$PVC_COUNT" ]; then
+        log_success "All $PVC_COUNT PVC(s) are bound"
+        kubectl get pvc -n "$NAMESPACE"
+    else
+        log_error "Only $BOUND_PVCS/$PVC_COUNT PVC(s) are bound"
+        kubectl get pvc -n "$NAMESPACE"
+        FAILED=1
+    fi
+else
+    log_warning "No PVCs found (optional)"
+fi
+echo ""
+
+# Check 6: ConfigMaps exist
+log_info "Checking ConfigMaps..."
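+# ConfigMaps are optional here: llm-katan reads its settings from env vars set
+# directly in deployment.yaml, so their absence is reported as a warning only.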
+CM_COUNT=$(kubectl get configmap -n "$NAMESPACE" -o name 2>/dev/null | wc -l)
+if [ "$CM_COUNT" -gt 0 ]; then
+    log_success "Found $CM_COUNT ConfigMap(s)"
+    kubectl get configmap -n "$NAMESPACE" -o name
+else
+    log_warning "No ConfigMaps found (may use default config)"
+fi
+echo ""
+
+# Check 7: Pod readiness
+log_info "Checking pod readiness..."
+READY_PODS=$(kubectl get pods -n "$NAMESPACE" -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}' 2>/dev/null | tr ' ' '\n' | grep -c "True" || true)
+TOTAL_PODS=$(kubectl get pods -n "$NAMESPACE" -o name 2>/dev/null | wc -l)
+if [ "$READY_PODS" -eq "$TOTAL_PODS" ] && [ "$READY_PODS" -gt 0 ]; then
+    log_success "All $READY_PODS/$TOTAL_PODS pods are ready"
+else
+    log_error "Only $READY_PODS/$TOTAL_PODS pods are ready"
+    FAILED=1
+fi
+echo ""
+
+# Check 8: Recent pod restarts
+log_info "Checking for pod restarts..."
+MAX_RESTARTS=$(kubectl get pods -n "$NAMESPACE" -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}' 2>/dev/null | tr ' ' '\n' | sort -rn | head -1)
+MAX_RESTARTS=${MAX_RESTARTS:-0}
+if [ "$MAX_RESTARTS" -eq 0 ]; then
+    log_success "No pod restarts detected"
+elif [ "$MAX_RESTARTS" -lt 3 ]; then
+    log_warning "Some pods have restarted (max: $MAX_RESTARTS times)"
+else
+    log_error "High restart count detected (max: $MAX_RESTARTS times)"
+    FAILED=1
+fi
+echo ""
+
+# Check 9: Endpoint connectivity (requires port-forward)
+log_info "Testing endpoint connectivity..."
+log_info "Setting up port-forward..."
+
+# Start port-forward in background
+kubectl port-forward -n "$NAMESPACE" "svc/$SERVICE" "$PORT:$PORT" &> /dev/null &
+PF_PID=$!
+sleep 3
+
+# Test health endpoint
+if curl -f -s -m 5 "http://localhost:$PORT/health" &> /dev/null; then
+    log_success "Health endpoint responding"
+
+    # Try to get actual response
+    HEALTH_RESPONSE=$(curl -s -m 5 "http://localhost:$PORT/health" 2>/dev/null || echo "{}")
+    log_info "Response: $HEALTH_RESPONSE"
+else
+    log_error "Health endpoint not responding"
+    FAILED=1
+fi
+
+# Test models endpoint
+log_info "Testing /v1/models endpoint..."
+if curl -f -s -m 5 "http://localhost:$PORT/v1/models" &> /dev/null; then
+    log_success "Models endpoint responding"
+    MODELS=$(curl -s -m 5 "http://localhost:$PORT/v1/models" 2>/dev/null | grep -o '"id":"[^"]*"' || echo "")
+    if [ -n "$MODELS" ]; then
+        log_info "Models: $MODELS"
+    fi
+else
+    log_error "Models endpoint not responding"
+    FAILED=1
+fi
+
+# Test metrics endpoint
+log_info "Testing /metrics endpoint..."
+if curl -f -s -m 5 "http://localhost:$PORT/metrics" &> /dev/null; then
+    log_success "Metrics endpoint responding"
+    METRICS_LINES=$(curl -s -m 5 "http://localhost:$PORT/metrics" 2>/dev/null | wc -l)
+    log_info "Metrics: $METRICS_LINES lines"
+else
+    log_warning "Metrics endpoint not responding (may not be enabled)"
+fi
+
+# Cleanup port-forward
+kill $PF_PID 2>/dev/null || true
+wait $PF_PID 2>/dev/null || true
+echo ""
+
+# Check 10: Resource usage (if metrics-server available)
+log_info "Checking resource usage..."
+if kubectl top pod -n "$NAMESPACE" &> /dev/null; then
+    log_success "Resource metrics available"
+    kubectl top pod -n "$NAMESPACE"
+else
+    log_warning "metrics-server not available (optional)"
+fi
+echo ""
+
+# Check 11: Recent logs for errors
+log_info "Checking recent logs for errors..."
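+# grep -ic prints a count (0 on no match) but exits non-zero when nothing matches;
+# the '|| true' guard below keeps 'set -e' from aborting the script on a clean log.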
+ERROR_COUNT=$(kubectl logs -n "$NAMESPACE" -l app.kubernetes.io/name=llm-katan --tail=100 2>/dev/null | grep -ic "error\|exception\|failed" || true)
+if [ "$ERROR_COUNT" -eq 0 ]; then
+    log_success "No errors in recent logs"
+else
+    log_warning "Found $ERROR_COUNT error messages in recent logs"
+    log_info "Recent errors:"
+    kubectl logs -n "$NAMESPACE" -l app.kubernetes.io/name=llm-katan --tail=100 2>/dev/null | grep -i "error\|exception\|failed" | tail -5 || true
+fi
+echo ""
+
+# Final summary
+echo "=================================================="
+echo "Verification Summary"
+echo "=================================================="
+
+if [ $FAILED -eq 0 ]; then
+    log_success "All critical checks passed!"
+    echo ""
+    log_info "Deployment is healthy and ready to use."
+    echo ""
+    log_info "Access the service:"
+    echo "  kubectl port-forward -n $NAMESPACE svc/$SERVICE $PORT:$PORT"
+    echo "  curl http://localhost:$PORT/health"
+    echo ""
+    exit 0
+else
+    log_error "Some checks failed!"
+    echo ""
+    log_info "Troubleshooting steps:"
+    echo "  1. Check pod logs: kubectl logs -n $NAMESPACE -l app.kubernetes.io/name=llm-katan"
+    echo "  2. Describe pods: kubectl describe pod -n $NAMESPACE -l app.kubernetes.io/name=llm-katan"
+    echo "  3. Check events: kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp'"
+    echo ""
+    exit 1
+fi
diff --git a/e2e-tests/llm-katan/llm_katan/config.py b/e2e-tests/llm-katan/llm_katan/config.py
index 1f91d8ac9..138ed9d92 100644
--- a/e2e-tests/llm-katan/llm_katan/config.py
+++ b/e2e-tests/llm-katan/llm_katan/config.py
@@ -31,15 +31,14 @@ def __post_init__(self):
 
         # Apply environment variable overrides
         self.model_name = os.getenv("YLLM_MODEL", self.model_name)
+        self.served_model_name = os.getenv("YLLM_SERVED_MODEL_NAME", self.served_model_name)
         self.port = int(os.getenv("YLLM_PORT", str(self.port)))
         self.backend = os.getenv("YLLM_BACKEND", self.backend)
         self.host = os.getenv("YLLM_HOST", self.host)
 
         # Validate backend
         if self.backend not in ["transformers", "vllm"]:
-            raise ValueError(
-                f"Invalid backend: {self.backend}. Must be 'transformers' or 'vllm'"
-            )
+            raise ValueError(f"Invalid backend: {self.backend}. Must be 'transformers' or 'vllm'")
 
     @property
     def device_auto(self) -> str:
diff --git a/tools/make/kube.mk b/tools/make/kube.mk
index 0ff5b34fd..1e97010eb 100644
--- a/tools/make/kube.mk
+++ b/tools/make/kube.mk
@@ -166,6 +166,135 @@ setup: create-cluster deploy ## Complete setup: create cluster and deploy
 cleanup: undeploy delete-cluster ## Complete cleanup: undeploy and delete cluster
 	@echo "$(GREEN)[SUCCESS]$(NC) Complete cleanup finished!"
 
+##@ LLM Katan Kubernetes
+
+# LLM Katan configuration
+LLM_KATAN_NAMESPACE ?= llm-katan-system
+LLM_KATAN_BASE_PATH ?= e2e-tests/llm-katan/deploy/kubernetes
+LLM_KATAN_OVERLAY ?= gpt35
+LLM_KATAN_IMAGE ?= $(DOCKER_REGISTRY)/llm-katan:$(DOCKER_TAG)
+
+.PHONY: kube-deploy-llm-katan kube-deploy-llm-katan-gpt35 kube-deploy-llm-katan-claude \
+    kube-undeploy-llm-katan kube-status-llm-katan kube-logs-llm-katan \
+    kube-port-forward-llm-katan kube-test-llm-katan kube-load-llm-katan-image \
+    kube-deploy-llm-katan-multi help-kube-llm-katan
+
+# Deploy llm-katan with specified overlay
+kube-deploy-llm-katan: ## Deploy llm-katan to cluster (OVERLAY=gpt35|claude, default: gpt35)
+	@echo "$(BLUE)[INFO]$(NC) Deploying llm-katan with overlay: $(LLM_KATAN_OVERLAY)"
+	@if ! 
kubectl cluster-info &>/dev/null; then \
+		echo "$(RED)[ERROR]$(NC) Kubernetes cluster is not accessible"; \
+		echo "$(BLUE)[INFO]$(NC) Run 'make create-cluster' first"; \
+		exit 1; \
+	fi
+	@echo "$(BLUE)[INFO]$(NC) Applying Kubernetes manifests..."
+	@kubectl apply -k $(LLM_KATAN_BASE_PATH)/overlays/$(LLM_KATAN_OVERLAY)
+	@echo "$(BLUE)[INFO]$(NC) Waiting for namespace to be ready..."
+	@kubectl wait --for=jsonpath='{.status.phase}'=Active namespace/$(LLM_KATAN_NAMESPACE) --timeout=60s || true
+	@echo "$(BLUE)[INFO]$(NC) Waiting for deployment to be ready..."
+	@kubectl wait --for=condition=Available deployment/llm-katan-$(LLM_KATAN_OVERLAY) \
+		-n $(LLM_KATAN_NAMESPACE) --timeout=600s || echo "$(YELLOW)[WARNING]$(NC) Deployment not ready yet, check status with: make kube-status-llm-katan"
+	@echo "$(GREEN)[SUCCESS]$(NC) LLM Katan deployment completed!"
+	@echo "$(BLUE)[INFO]$(NC) Deployment status:"
+	@kubectl get pods -n $(LLM_KATAN_NAMESPACE) -l app.kubernetes.io/name=llm-katan -o wide
+
+# Deploy llm-katan with gpt35 overlay
+kube-deploy-llm-katan-gpt35: ## Deploy llm-katan with GPT-3.5 overlay
+	@$(MAKE) kube-deploy-llm-katan LLM_KATAN_OVERLAY=gpt35
+	@echo "$(GREEN)[SUCCESS]$(NC) GPT-3.5 simulation deployed!"
+	@echo "$(BLUE)[INFO]$(NC) Test with: make kube-test-llm-katan LLM_KATAN_OVERLAY=gpt35"
+
+# Deploy llm-katan with claude overlay
+kube-deploy-llm-katan-claude: ## Deploy llm-katan with Claude overlay
+	@$(MAKE) kube-deploy-llm-katan LLM_KATAN_OVERLAY=claude
+	@echo "$(GREEN)[SUCCESS]$(NC) Claude simulation deployed!"
+	@echo "$(BLUE)[INFO]$(NC) Test with: make kube-test-llm-katan LLM_KATAN_OVERLAY=claude"
+
+# Deploy both overlays for multi-model testing
+kube-deploy-llm-katan-multi: ## Deploy both gpt35 and claude overlays
+	@echo "$(BLUE)[INFO]$(NC) Deploying multiple llm-katan instances..."
+	@$(MAKE) kube-deploy-llm-katan-gpt35
+	@echo ""
+	@$(MAKE) kube-deploy-llm-katan-claude
+	@echo ""
+	@echo "$(GREEN)[SUCCESS]$(NC) Multi-model deployment completed!"
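+# Both overlays share the llm-katan-system namespace; the per-overlay nameSuffix
+# keeps their Deployments, Services, and PVCs distinct.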
+	@echo "$(BLUE)[INFO]$(NC) Available models:"
+	@kubectl get pods -n $(LLM_KATAN_NAMESPACE) -o wide
+
+# Remove llm-katan from the cluster
+kube-undeploy-llm-katan: ## Remove llm-katan from cluster (OVERLAY=gpt35|claude|all, default: gpt35)
+	@echo "$(BLUE)[INFO]$(NC) Removing llm-katan overlay: $(LLM_KATAN_OVERLAY)"
+	@if [ "$(LLM_KATAN_OVERLAY)" = "all" ]; then \
+		echo "$(BLUE)[INFO]$(NC) Removing all llm-katan deployments..."; \
+		kubectl delete -k $(LLM_KATAN_BASE_PATH)/overlays/gpt35 --ignore-not-found=true; \
+		kubectl delete -k $(LLM_KATAN_BASE_PATH)/overlays/claude --ignore-not-found=true; \
+	else \
+		kubectl delete -k $(LLM_KATAN_BASE_PATH)/overlays/$(LLM_KATAN_OVERLAY) --ignore-not-found=true; \
+	fi
+	@echo "$(GREEN)[SUCCESS]$(NC) LLM Katan undeployment completed"
+
+# Show llm-katan deployment status
+kube-status-llm-katan: ## Show llm-katan deployment status
+	@echo "$(BLUE)[INFO]$(NC) LLM Katan deployment status"
+	@echo "$(BLUE)[INFO]$(NC) Namespace: $(LLM_KATAN_NAMESPACE)"
+	@echo ""
+	@echo "$(BLUE)[INFO]$(NC) Pods:"
+	@kubectl get pods -n $(LLM_KATAN_NAMESPACE) -o wide || echo "$(RED)[ERROR]$(NC) Cannot get pods"
+	@echo ""
+	@echo "$(BLUE)[INFO]$(NC) Services:"
+	@kubectl get services -n $(LLM_KATAN_NAMESPACE) || echo "$(RED)[ERROR]$(NC) Cannot get services"
+	@echo ""
+	@echo "$(BLUE)[INFO]$(NC) PVCs:"
+	@kubectl get pvc -n $(LLM_KATAN_NAMESPACE) || echo "$(RED)[ERROR]$(NC) Cannot get PVCs"
+	@echo ""
+	@echo "$(BLUE)[INFO]$(NC) Deployments:"
+	@kubectl get deployments -n $(LLM_KATAN_NAMESPACE) || echo "$(RED)[ERROR]$(NC) Cannot get deployments"
+
+# Show llm-katan logs
+kube-logs-llm-katan: ## Show llm-katan logs (OVERLAY=gpt35|claude, default: gpt35)
+	@echo "$(BLUE)[INFO]$(NC) Showing llm-katan logs for overlay: $(LLM_KATAN_OVERLAY)"
+	@kubectl logs -n $(LLM_KATAN_NAMESPACE) deployment/llm-katan-$(LLM_KATAN_OVERLAY) -f
+
+# Port forward llm-katan API
+kube-port-forward-llm-katan: ## Port forward llm-katan API (OVERLAY=gpt35|claude, PORT=8000)
+	@$(eval PORT ?= 8000)
+	@echo "$(BLUE)[INFO]$(NC) Port forwarding llm-katan API (overlay: $(LLM_KATAN_OVERLAY))"
+	@echo "$(YELLOW)[INFO]$(NC) Access API at: http://localhost:$(PORT)"
+	@echo "$(YELLOW)[INFO]$(NC) Health check: curl http://localhost:$(PORT)/health"
+	@echo "$(YELLOW)[INFO]$(NC) Models: curl http://localhost:$(PORT)/v1/models"
+	@echo "$(YELLOW)[INFO]$(NC) Press Ctrl+C to stop port forwarding"
+	@kubectl port-forward -n $(LLM_KATAN_NAMESPACE) svc/llm-katan-$(LLM_KATAN_OVERLAY) $(PORT):8000
+
+# Test llm-katan deployment
+kube-test-llm-katan: ## Test llm-katan deployment (OVERLAY=gpt35|claude, default: gpt35)
+	@echo "$(BLUE)[INFO]$(NC) Testing llm-katan deployment (overlay: $(LLM_KATAN_OVERLAY))"
+	@echo "$(BLUE)[INFO]$(NC) Checking pod status..."
+	@kubectl get pods -n $(LLM_KATAN_NAMESPACE) -l app.kubernetes.io/name=llm-katan -o wide
+	@echo ""
+	@echo "$(BLUE)[INFO]$(NC) Checking service..."
+	@kubectl get svc -n $(LLM_KATAN_NAMESPACE) llm-katan-$(LLM_KATAN_OVERLAY)
+	@echo ""
+	@echo "$(BLUE)[INFO]$(NC) Checking pod readiness..."
+	@kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=llm-katan \
+		-n $(LLM_KATAN_NAMESPACE) --timeout=60s || echo "$(RED)[ERROR]$(NC) Pod not ready"
+	@echo ""
+	@echo "$(BLUE)[INFO]$(NC) Testing API endpoint (requires port-forward in another terminal)..."
+ @echo "$(YELLOW)[INFO]$(NC) Run in another terminal: make kube-port-forward-llm-katan LLM_KATAN_OVERLAY=$(LLM_KATAN_OVERLAY)" + @echo "$(YELLOW)[INFO]$(NC) Then test with: curl http://localhost:8000/health" + @echo "$(GREEN)[SUCCESS]$(NC) Deployment test completed" + +# Load llm-katan image into kind cluster +kube-load-llm-katan-image: ## Load llm-katan Docker image into kind cluster + @echo "$(BLUE)[INFO]$(NC) Loading llm-katan Docker image into kind cluster" + @if ! kind get clusters | grep -q "^$(KIND_CLUSTER_NAME)$$"; then \ + echo "$(RED)[ERROR]$(NC) Cluster $(KIND_CLUSTER_NAME) does not exist"; \ + echo "$(BLUE)[INFO]$(NC) Run 'make create-cluster' first"; \ + exit 1; \ + fi + @echo "$(BLUE)[INFO]$(NC) Loading image: $(LLM_KATAN_IMAGE)" + @kind load docker-image $(LLM_KATAN_IMAGE) --name $(KIND_CLUSTER_NAME) + @echo "$(GREEN)[SUCCESS]$(NC) LLM Katan image loaded successfully" + # Help target help-kube: ## Show Kubernetes makefile help @echo "$(BLUE)Configuration variables:$(NC)"