
Conversation

@abcxff (Contributor) commented Oct 28, 2025

K8s self hosting setup

vercel bot commented Oct 28, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project              Deployment   Preview   Comments   Updated (UTC)
rivet-site           Ready        Preview   Comment    Nov 5, 2025 2:07am
rivetkit-serverless  Ready        Preview   Comment    Nov 5, 2025 2:07am

2 Skipped Deployments

Project              Deployment   Preview   Comments   Updated (UTC)
rivet-cloud          Ignored      Ignored              Nov 5, 2025 2:07am
rivet-inspector      Ignored      Ignored   Preview    Nov 5, 2025 2:07am

abcxff (Contributor, Author) commented Oct 28, 2025


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@abcxff abcxff changed the title from "feat: k8 run scripts" to "feat: k8s self hosting" Oct 28, 2025
@abcxff abcxff marked this pull request as ready for review October 28, 2025 00:42
claude bot commented Oct 28, 2025

PR Review: Kubernetes Run Scripts & Manifests

Thank you for this contribution! This adds comprehensive Kubernetes deployment support which is a valuable addition. Here's my detailed review:

🎯 Overall Assessment

Strengths:

  • Comprehensive K8s manifests with proper resource definitions
  • Well-structured documentation with quick start guides
  • Good separation of concerns (deployment, service, configmap, etc.)
  • Includes both development (k3d) and production guidance

Issues Found: Several critical issues need addressing before merge.


🔴 Critical Issues

1. Justfile Bug - docker-stop command incorrect

File: justfile:18-19

docker-stop:
	docker run -p 6420:6420 -e RIVET__AUTH__ADMIN_TOKEN=dev -e RUST_LOG=debug rivetkit/engine:local

Issue: This runs a new container instead of stopping one. Should be:

docker-stop:
	docker stop $(docker ps -q --filter ancestor=rivetkit/engine:local)

Or similar container stopping logic.


2. Security: Hardcoded Postgres Password in ConfigMap

File: k8s/engine/02-engine-configmap.yaml:6

"url": "postgresql://postgres:postgres@postgres:5432/rivet"

Issue: Password is hardcoded in the ConfigMap. Even though there's a warning in 11-postgres-secret.yaml, the engine config should reference the secret instead of hardcoding credentials.

Recommendation:

  • Use environment variable substitution or reference the secret directly
  • Or document clearly that users must manually edit this ConfigMap before deploying
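
As an illustration of the first option, the deployment could assemble the connection string from the Secret at runtime using Kubernetes dependent environment variables. This is a hedged sketch: it assumes the engine honors a RIVET__POSTGRES__URL environment variable override and that the Secret is named postgres-secret with a password key; neither name is confirmed by this PR.

```yaml
# Hypothetical container spec fragment: build the connection string from the
# Secret at pod start instead of hardcoding the password in the ConfigMap.
# $(POSTGRES_PASSWORD) is expanded by Kubernetes because the variable is
# defined earlier in the same env list.
env:
  - name: POSTGRES_PASSWORD
    valueFrom:
      secretKeyRef:
        name: postgres-secret   # assumed Secret name
        key: password           # assumed key
  - name: RIVET__POSTGRES__URL  # assumed override variable
    value: "postgresql://postgres:$(POSTGRES_PASSWORD)@postgres:5432/rivet"
```

With this pattern the password lives only in the Secret, so rotating it does not require editing the ConfigMap.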

3. Missing NATS Deployment

Files: scripts/run/k8s/engine.sh:42-55

Issue: The repository includes NATS manifests (07-nats-configmap.yaml, 08-nats-statefulset.yaml, 09-nats-service.yaml), but the script never applies them. The deployment script only applies the PostgreSQL and engine manifests, even though NATS appears to be required based on the architecture description in CLAUDE.md.

Recommendation: Either:

  1. Add NATS deployment steps to the script
  2. Remove NATS manifests if not needed for basic deployment
  3. Document that NATS is optional

4. Missing Singleton Deployment

File: scripts/run/k8s/engine.sh

Issue: The script doesn't deploy 06-rivet-engine-singleton-deployment.yaml. Based on the manifest, singleton services are separate from the main engine deployment, but the script omits this.

Question: Is the singleton deployment required? If yes, it should be deployed. If no, explain in documentation.


⚠️ Important Issues

5. HPA Not Applied

File: scripts/run/k8s/engine.sh

The 05-rivet-engine-hpa.yaml is never applied. While HPA might not be needed for local dev, it should be documented.


6. Resource Limits Seem High for Local Dev

Files:

  • k8s/engine/03-rivet-engine-deployment.yaml:57-62
  • k8s/engine/06-rivet-engine-singleton-deployment.yaml:56-61
resources:
  limits:
    cpu: 4000m
    memory: 8Gi
  requests:
    cpu: 2000m
    memory: 4Gi

Issue: These are production-grade resources. For local k3d development with 2 agents, this might be too aggressive. Consider:

  • Adding separate dev/prod manifest examples
  • Or documenting that users should adjust these for local dev

7. Service Account Not Used

File: k8s/engine/01-serviceaccount.yaml

Issue: A ServiceAccount is created but appears to have no special permissions. If it's not needed, remove it. If it is needed (for future RBAC), document what it's for.


8. Datacenter Label Placeholder

File: k8s/engine/00-namespace.yaml:6

datacenter: YOUR_DATACENTER_KEY

Issue: Placeholder value that users must replace. Either:

  • Document this clearly in DEPLOY.md
  • Or remove if not critical for basic deployment

💡 Suggestions & Best Practices

9. Startup Probe Timing

File: k8s/engine/03-rivet-engine-deployment.yaml:68

initialDelaySeconds: 30
periodSeconds: 10
failureThreshold: 30

Total startup time: 30s + (10s * 30) = 330s. This seems reasonable, but could be optimized after understanding typical startup times.


10. Missing ConfigMap Checksum Annotation

Files:

  • k8s/engine/03-rivet-engine-deployment.yaml:18
  • k8s/engine/06-rivet-engine-singleton-deployment.yaml:18
checksum/config: REPLACE_WITH_CONFIG_CHECKSUM

Issue: This is a placeholder that should be replaced dynamically (usually by tools like Helm). For raw manifests, consider:

  • Removing this annotation
  • Or documenting how to calculate/update it
  • Or using a tool to auto-generate it

11. Documentation: Missing Prerequisites

File: website/src/content/docs/self-hosting/kubernetes.mdx

The quick start jumps into YAML without mentioning:

  • Minimum Kubernetes version (mentioned in DEPLOY.md as v1.24+, but not in main docs)
  • Storage class requirements for PVCs
  • LoadBalancer considerations (works on managed cloud clusters; bare-metal clusters need a load-balancer implementation such as MetalLB)

12. Script: No Error Handling for Image Build

File: scripts/run/k8s/engine.sh:32

If the Docker build fails, set -euo pipefail (already set) will abort the script, but without a helpful message. Consider adding an explicit check that prints a clear error before exiting.


13. Cluster Autoscaler Annotation

File: k8s/engine/03-rivet-engine-deployment.yaml:19

cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

Question: Why is this set to false? This prevents cluster autoscaler from evicting pods, which might not be desired for scalable deployments. Document the reasoning or make it configurable.


14. PostgreSQL Production Warning

File: k8s/engine/12-postgres-statefulset.yaml

Suggestion: Add prominent warnings in documentation that:

  • This PostgreSQL deployment is for development/testing
  • Production should use managed databases (RDS, Cloud SQL, etc.)
  • No backups or high availability configured
  • Data will be lost if PVC is deleted

📝 Minor Issues

15. Inconsistent Service Type

  • 04-rivet-engine-service.yaml uses NodePort with specific port numbers (30420, 30421)
  • Documentation examples use LoadBalancer

This might confuse users. Clarify when to use which type.


16. NATS Configuration Hardcoded Replicas

File: k8s/engine/07-nats-configmap.yaml:17-20

routes = [
  nats://nats-0.nats.rivet-engine.svc.cluster.local:6222
  nats://nats-1.nats.rivet-engine.svc.cluster.local:6222
  nats://nats-2.nats.rivet-engine.svc.cluster.local:6222
]

If NATS replicas change from 3, this config breaks. Consider documenting this dependency.
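
One way to avoid the drift is to generate the route list from the replica count. A sketch (the hostnames come from the manifest; the script itself is hypothetical):

```shell
#!/bin/sh
# Hypothetical sketch: emit NATS route entries for a given replica count so
# the config stays in sync when the StatefulSet is scaled.
set -eu
REPLICAS=${1:-3}

echo "routes = ["
i=0
while [ "$i" -lt "$REPLICAS" ]; do
  echo "  nats://nats-${i}.nats.rivet-engine.svc.cluster.local:6222"
  i=$((i + 1))
done
echo "]"
```

The output could be templated into 07-nats-configmap.yaml by the deployment script.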


🧪 Test Coverage

Missing:

  • No mention of how to test the deployment
  • No validation scripts
  • Could add smoke tests in the deployment script

Suggestion: Add a test section to scripts/run/k8s/engine.sh that:

echo "Running smoke tests..."
curl -f http://localhost:6421/health || exit 1
echo "✓ Health check passed"

📚 Documentation

Good:

  • Clear step-by-step instructions
  • Examples are copy-paste ready
  • Includes both quick start and detailed guides

Needs:

  • Troubleshooting section
  • How to view logs
  • How to update/rollback deployments
  • What to do when things go wrong

🔒 Security Review

  1. ✅ Uses non-root images (postgres:17, nats:2.10-alpine)
  2. ⚠️ Hardcoded credentials in multiple places
  3. ⚠️ No network policies defined
  4. ⚠️ No pod security policies/standards
  5. ⚠️ No secrets encryption at rest mentioned

For production, recommend:

  • Implement NetworkPolicies
  • Use external secrets management (Vault, AWS Secrets Manager)
  • Enable PSP/PSS
  • Document TLS setup for external access
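
As a starting point for the NetworkPolicy recommendation, ingress to Postgres could be restricted to engine pods. A minimal sketch, assuming app: postgres and app: rivet-engine labels consistent with these manifests (label names are assumptions):

```yaml
# Hypothetical sketch: only pods labeled app=rivet-engine may reach Postgres.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgres-allow-engine-only
  namespace: rivet-engine
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: rivet-engine
      ports:
        - protocol: TCP
          port: 5432
```

Note that NetworkPolicies only take effect if the cluster's CNI plugin enforces them.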

✅ What Looks Good

  1. Proper use of StatefulSets for PostgreSQL and NATS
  2. Health checks properly configured
  3. Resource limits defined (though values may need tuning)
  4. Labels are consistent and well-structured
  5. Script error handling with set -euo pipefail
  6. Modular manifests - easy to understand and modify

🎬 Recommended Actions

Before Merge:

  1. ❗ Fix the docker-stop command in justfile
  2. ❗ Either deploy NATS in script or remove/document why it's not deployed
  3. ❗ Clarify singleton deployment requirements
  4. ❗ Address hardcoded password in engine ConfigMap
  5. ❗ Document the placeholder values that need replacement

Nice to Have:
6. Add troubleshooting section to docs
7. Add smoke tests to deployment script
8. Separate dev/prod resource configurations
9. Document when HPA and autoscaler annotations should be used


📊 Summary

Category         Status
Code Quality     ⚠️ Good with issues
Security         ⚠️ Needs attention
Documentation    ✅ Good
Test Coverage    ❌ Missing
Best Practices   ⚠️ Mostly followed

Recommendation: Request changes to address critical issues, especially the justfile bug and missing deployments in the script.

Great work overall! The foundation is solid and this will be a valuable addition once the issues are addressed.

@abcxff abcxff requested a review from NathanFlurry October 28, 2025 00:43
pkg-pr-new bot commented Oct 28, 2025

More templates

@rivetkit/actor

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/actor@3276

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3276

@rivetkit/core

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/core@3276

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3276

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3276

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3276

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3276

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3276

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3276

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3276

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3276

commit: 1d43bd0

@abcxff abcxff force-pushed the 10-26-feat_k8s_selfhosting branch from ca30bde to d981bc6 Compare October 28, 2025 01:05
claude bot commented Oct 28, 2025

PR Review: Kubernetes Self-Hosting Setup

Great work on adding Kubernetes deployment support! This is a solid foundation for self-hosting Rivet on Kubernetes. Below is my feedback organized by category.

🐛 Critical Issues

1. Justfile docker-stop command is incorrect (justfile:18-19)

The docker-stop command runs a container instead of stopping it:

docker-stop:
	docker run -p 6420:6420 -e RIVET__AUTH__ADMIN_TOKEN=dev -e RUST_LOG=debug rivetkit/engine:local

This should probably be docker stop or docker compose down, not docker run.

2. NATS manifests exist but are never deployed or referenced (k8s/engine/07-nats-configmap.yaml)

The NATS StatefulSet and configuration manifests are included in this PR, but:

  • The engine ConfigMap does not reference NATS
  • The deployment script (scripts/run/k8s/engine.sh) does not apply NATS manifests
  • No NATS configuration in the engine config

Either NATS should be removed from this PR, or it should be properly integrated into the engine configuration.

3. Placeholder value not replaced (k8s/engine/03-rivet-engine-deployment.yaml:18)

checksum/config: REPLACE_WITH_CONFIG_CHECKSUM

This placeholder needs to be either:

  • Removed if not used
  • Replaced with an actual checksum
  • Documented how users should replace it

Same issue in 06-rivet-engine-singleton-deployment.yaml:18.

🔒 Security Concerns

1. Hardcoded PostgreSQL password in version control (k8s/engine/11-postgres-secret.yaml)

While there is a comment saying to change it in production, storing even a default password in plaintext in the repo is not ideal. Consider providing documentation on creating the secret instead.

2. No authentication/authorization for NATS

The NATS cluster has no authentication configured, which could be a security risk in production environments.

3. ConfigMap contains sensitive connection string (k8s/engine/02-engine-configmap.yaml:6)

The PostgreSQL connection string includes credentials. This should reference the Secret instead and inject the password via environment variables.

4. No admin token configuration in K8s manifests

The documentation mentions setting RIVET__AUTH__ADMIN_TOKEN in production, but there is no Secret or ConfigMap showing how to do this in the K8s manifests.

⚠️ Potential Issues

1. Singleton deployment not applied in setup script (scripts/run/k8s/engine.sh)

The script deploys 03-rivet-engine-deployment.yaml but not 06-rivet-engine-singleton-deployment.yaml. Is this intentional? The singleton services might be required.

2. imagePullPolicy: Never is hardcoded (k8s/engine/03-rivet-engine-deployment.yaml:34)

This only works for local k3d development. For production, this should use IfNotPresent or Always. Consider having separate manifests or using Kustomize overlays for dev vs prod.

3. HPA configuration seems ineffective (k8s/engine/05-rivet-engine-hpa.yaml)

maxReplicas: 1
minReplicas: 1

This will not actually scale. Is this intentional for the initial release? If so, add a comment explaining why.

4. Resource requests are quite high (k8s/engine/03-rivet-engine-deployment.yaml:57-63)

requests:
  cpu: 2000m
  memory: 4Gi
limits:
  cpu: 4000m
  memory: 8Gi

These values might be too high for initial testing/small deployments. Consider documenting that users should adjust based on their needs.

5. No persistent storage for NATS

NATS is using a StatefulSet but has no volumeClaimTemplates. This means NATS data is not persisted. If this is intentional (because NATS is just for messaging), add a comment explaining it.

6. PostgreSQL storage class not specified (k8s/engine/12-postgres-statefulset.yaml:68)

Should specify storageClassName or document that the default storage class will be used.

📝 Documentation Issues

1. Incomplete datacenter configuration (k8s/engine/00-namespace.yaml:6)

datacenter: YOUR_DATACENTER_KEY

This placeholder is never explained in the documentation. Users will not know what to put here.

2. Inconsistent service type in docs (website/src/content/docs/self-hosting/kubernetes.mdx)

The documentation shows type: LoadBalancer but the actual manifest (k8s/engine/04-rivet-engine-service.yaml) uses type: NodePort. This should be consistent.

3. Malformed JSON in documentation

The engine configuration in the documentation appears to have JSON formatting issues. Verify the complete JSON is valid.

✅ Positive Aspects

  1. Good separation of concerns - Nice job splitting manifests into separate numbered files
  2. Comprehensive documentation - The README and DEPLOY.md are helpful
  3. Health checks properly configured - Startup, readiness, and liveness probes are all present
  4. Script organization - Moving Docker scripts to scripts/run/docker/ is a good cleanup
  5. Local development support - The k3d setup script is very useful

🎯 Recommendations

High Priority

  1. Fix the docker-stop command in justfile
  2. Either integrate NATS properly or remove it from this PR
  3. Replace placeholder values or document how to replace them
  4. Fix security issues (secrets, password handling)

Medium Priority

  1. Apply singleton deployment in the setup script or document why it is not needed
  2. Separate dev/prod configurations
  3. Document resource requirements and allow customization
  4. Add admin token Secret example

Low Priority

  1. Add storageClass specification for PostgreSQL
  2. Consider adding NetworkPolicies for better security
  3. Add examples for ingress configuration

Testing Recommendations

Based on CLAUDE.md guidelines, consider adding:

  • Integration tests that deploy to a k3d cluster
  • Validation that all manifests are valid YAML and apply successfully
  • Health check verification after deployment

Overall, this is solid work that adds valuable K8s support. The main concerns are around NATS integration, security hardening, and ensuring dev/prod configurations are clearly separated.

🤖 Generated with Claude Code

claude bot commented Oct 30, 2025

PR Review: Kubernetes Self-Hosting Setup

Thank you for this comprehensive K8s deployment implementation! This is a great addition for self-hosting. Here's my detailed review:

🎯 Overall Assessment

This PR successfully adds Kubernetes deployment capabilities with a well-structured approach separating main engine services from singleton services. The implementation is solid, but there are several areas that need attention before merging.


🔴 Critical Issues

1. Justfile Bug - Incorrect docker-stop Command

Location: justfile:18-19

The docker-stop command is incorrectly running the container instead of stopping it:

docker-stop:
    docker run -p 6420:6420 -e RIVET__AUTH__ADMIN_TOKEN=dev -e RUST_LOG=debug rivetkit/engine:local

This should be using docker stop or docker ps to identify and stop running containers. The current implementation will just start another container.

Suggested fix:

docker-stop:
    docker ps -q --filter ancestor=rivetkit/engine:local | xargs -r docker stop

2. Hardcoded Placeholder Not Replaced

Location: k8s/engine/03-rivet-engine-deployment.yaml:18 and k8s/engine/06-rivet-engine-singleton-deployment.yaml:18

The annotation contains a placeholder:

checksum/config: REPLACE_WITH_CONFIG_CHECKSUM

This should either be:

  • Removed if not being used
  • Or properly computed and injected (e.g., via a script or CI/CD pipeline)

The deployment script scripts/run/k8s/engine.sh doesn't replace this value, so pods won't restart when config changes.

3. Missing NATS Deployment

Location: scripts/run/k8s/engine.sh

The NATS manifests (07-09) exist but are never applied by the deployment script. NATS is a critical dependency for Rivet's service communication, yet the script skips deploying it entirely. This will cause the engine to fail at runtime.

Fix: Add NATS deployment to engine.sh:

# After postgres deployment
kubectl apply -f 07-nats-configmap.yaml
kubectl apply -f 08-nats-statefulset.yaml
kubectl apply -f 09-nats-service.yaml

# Wait for NATS
echo "Waiting for NATS to be ready..."
kubectl -n "${NAMESPACE}" wait --for=condition=ready pod -l app=nats --timeout=300s

⚠️ Security Concerns

1. Weak Default PostgreSQL Password

Location: k8s/engine/11-postgres-secret.yaml:9

While there's a comment warning to change it, using postgres as the default password is risky. Consider:

  • Using a generated random password
  • Or at minimum, making it more obvious this is insecure (e.g., INSECURE_DEV_PASSWORD)

2. Hardcoded Credentials in ConfigMap

Location: k8s/engine/02-engine-configmap.yaml:6

The postgres connection string contains hardcoded credentials:

"url": "postgresql://postgres:postgres@postgres:5432/rivet"

This should reference the secret instead of duplicating the password. Consider using environment variable substitution or a secret reference pattern.

3. Missing Admin Token Configuration

The engine requires RIVET__AUTH__ADMIN_TOKEN for admin access, but it's not configured in the deployments. Users won't be able to access admin APIs without this.

Recommendation: Add a secret for the admin token and reference it in both deployment manifests.
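
A minimal sketch of such a Secret and its injection (the RIVET__AUTH__ADMIN_TOKEN variable name appears elsewhere in this PR; the Secret name and key are assumptions):

```yaml
# Hypothetical Secret holding the admin token.
apiVersion: v1
kind: Secret
metadata:
  name: rivet-admin-token
  namespace: rivet-engine
type: Opaque
stringData:
  token: CHANGE_ME_BEFORE_DEPLOYING
---
# In both engine deployments' container specs, reference it:
# env:
#   - name: RIVET__AUTH__ADMIN_TOKEN
#     valueFrom:
#       secretKeyRef:
#         name: rivet-admin-token
#         key: token
```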


📊 Kubernetes Best Practices

1. Resource Requests Are Very High

Location: Multiple deployment files

Both deployments request:

  • cpu: 2000m (2 full cores)
  • memory: 4Gi

For a starting deployment, especially with HPA at 2-10 replicas, this is 4-20 cores and 8-40Gi memory. This may be excessive for initial testing. Consider:

  • Lower default requests: cpu: 500m, memory: 1Gi
  • Document expected resource usage
  • Let users tune based on actual workload

2. Missing PodDisruptionBudget

For production reliability, consider adding PodDisruptionBudgets for the main engine deployment to ensure availability during voluntary disruptions (node drains, updates, etc.).
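
A minimal sketch, assuming an app: rivet-engine label consistent with the deployment manifest (the label is an assumption):

```yaml
# Hypothetical PodDisruptionBudget: keep at least one engine pod running
# during voluntary disruptions such as node drains.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: rivet-engine
  namespace: rivet-engine
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: rivet-engine
```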

3. Missing Anti-Affinity Rules

The main engine deployment (2+ replicas) should use pod anti-affinity to spread replicas across nodes for better availability:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - rivet-engine
        topologyKey: kubernetes.io/hostname

4. NATS StatefulSet Missing Persistent Storage

Location: k8s/engine/08-nats-statefulset.yaml

NATS is configured as a StatefulSet but has no volumeClaimTemplates. This means:

  • Message persistence won't survive pod restarts
  • For JetStream features, you need persistent storage

Consider adding a PVC template or documenting that this is a non-persistent configuration.
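
If persistence is wanted, a volumeClaimTemplates block could be added to the StatefulSet. A sketch (claim name, mount path, and size are assumptions, not taken from this PR):

```yaml
# Hypothetical fragment for 08-nats-statefulset.yaml: persistent storage so
# JetStream data survives pod restarts. Mount nats-data at the JetStream
# store_dir configured in the NATS config.
volumeClaimTemplates:
  - metadata:
      name: nats-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
```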

5. Service Type Should Be Configurable

Location: k8s/engine/04-rivet-engine-service.yaml:11

The service uses NodePort type, which is fine for development but not ideal for production. Consider:

  • Using ClusterIP by default
  • Document how to expose via Ingress
  • Make service type configurable via documentation

🐛 Potential Bugs

1. Missing NATS Connection Configuration

The engine config in 02-engine-configmap.yaml only has postgres and topology config. Where's the NATS connection configuration? The engine likely needs to know how to connect to nats://nats.rivet-engine.svc.cluster.local:4222.

2. Singleton Service Arguments May Be Incorrect

Location: k8s/engine/06-rivet-engine-singleton-deployment.yaml:26-31

The args use:

- --services
- singleton
- --services
- api-peer

Repeating --services twice seems unusual. Verify if this is the correct flag syntax. Should it be comma-separated or is this a multi-flag pattern?


📝 Documentation Issues

1. Broken Markdown in Documentation

Location: website/src/content/docs/self-hosting/kubernetes.mdx

The diff shows malformed YAML/JSON in the documentation with incomplete structures:

config.jsonc: |
    ,
      "topology": 
        ]
      }
    }

This appears to be incomplete or corrupted content. Review the actual file to ensure it's properly formatted.

2. Missing NATS Documentation

While NATS manifests exist, there's no documentation about:

  • What NATS is used for
  • Whether it's required
  • How to scale or configure it
  • Troubleshooting NATS issues

3. Public URL Configuration Unclear

The config uses http://localhost:6420 for public_url, but this won't work in a K8s cluster. Document how users should set this for their actual domain/load balancer.


🎨 Code Quality

1. Shell Script Is Well-Structured

The scripts/run/k8s/engine.sh script is clean and follows good bash practices:

  • Uses set -euo pipefail
  • Has proper error checking
  • Clear output messages
  • Good use of variables

Great work here! Just needs the NATS deployment addition.

2. Consistent Labeling

Good use of consistent labels across resources (app, component, service). This helps with querying and management.

3. Health Checks Are Properly Configured

Startup, readiness, and liveness probes are all properly configured with reasonable timeouts. Good attention to detail!


✅ What's Done Well

  1. Clear separation of singleton vs. scalable services - This is architecturally sound
  2. HPA configuration - Good autoscaling setup with both CPU and memory metrics
  3. Comprehensive manifests - All necessary resources are included
  4. Development script - The k3d setup script is very helpful for local testing
  5. Documentation structure - Multiple README files help users at different levels

🔧 Recommendations

High Priority (Must Fix)

  1. Fix the docker-stop command in justfile
  2. Deploy NATS in the k8s deployment script
  3. Fix or remove the REPLACE_WITH_CONFIG_CHECKSUM placeholder
  4. Add NATS connection config to the engine ConfigMap
  5. Fix documentation formatting issues

Medium Priority (Should Fix)

  1. Add admin token secret configuration
  2. Lower default resource requests
  3. Move postgres password to secret reference in connection string
  4. Verify singleton service arguments syntax
  5. Document public_url configuration for production

Low Priority (Nice to Have)

  1. Add PodDisruptionBudget
  2. Add pod anti-affinity rules
  3. Add NATS persistent storage
  4. Make service type configurable via kustomize overlays
  5. Add more comprehensive NATS documentation

🎯 Summary

This is a solid foundation for Kubernetes self-hosting! The architecture is sound and the implementation follows many K8s best practices. However, there are critical issues (NATS not being deployed, justfile bug, config checksum placeholder) that must be addressed before this is production-ready.

The documentation additions are valuable, though they need some cleanup. Once the critical issues are resolved, this will be a great addition to Rivet's self-hosting options.

Verdict: Request changes for critical issues, but overall this is excellent work! 🚀

@abcxff abcxff force-pushed the 10-26-feat_k8s_selfhosting branch from 1d43bd0 to 25b2df9 Compare November 5, 2025 02:04
graphite-app bot (Contributor) commented Nov 5, 2025

Merge activity

  • Nov 5, 2:04 AM UTC: Graphite rebased this pull request, because this pull request is set to merge when ready.
  • Nov 6, 5:48 AM UTC: abcxff added this pull request to the Graphite merge queue.
  • Nov 6, 5:48 AM UTC: CI is running for this pull request on a draft pull request (#3396) due to your merge queue CI optimization settings.
  • Nov 6, 5:50 AM UTC: Merged by the Graphite merge queue via draft PR: #3396.

claude bot commented Nov 5, 2025

Pull Request Review: K8s Self Hosting Setup

Thanks for adding Kubernetes support for Rivet Engine! This is a substantial feature addition that will enable users to self-host on Kubernetes. Here's my detailed review:

🎯 Overall Assessment

This PR provides a solid foundation for Kubernetes deployments with good separation of concerns between main engine and singleton services. The documentation is comprehensive and the deployment scripts are well-structured.


🐛 Critical Issues

1. Bug in justfile - Wrong Command in docker-stop

Location: justfile:18-19

The docker-stop command is running docker run instead of docker stop. This will start a new container rather than stopping an existing one.


⚠️ Security Concerns

2. Hardcoded Password in Engine ConfigMap

Location: k8s/engine/02-engine-configmap.yaml:6

The PostgreSQL password is hardcoded in the connection string. This is less secure than using Kubernetes secrets and makes it harder to rotate credentials.

3. Weak Default Password

Location: k8s/engine/11-postgres-secret.yaml:9

While there's a comment about changing in production, consider adding a more prominent warning or generating a random password in the setup script.
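
For the second option, the setup script could generate the password and emit the Secret manifest itself rather than committing a default. A hedged sketch (Secret and key names are assumptions; pipe the output to kubectl apply -f -):

```shell
#!/bin/sh
# Hypothetical sketch: emit a Postgres Secret with a freshly generated
# password instead of shipping a default one in version control.
set -eu
PASSWORD=$(openssl rand -hex 24)

cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: postgres-secret
  namespace: rivet-engine
type: Opaque
stringData:
  password: ${PASSWORD}
EOF
```

The engine ConfigMap (or an env-var override) would then need to pick the password up from this Secret rather than hardcoding it.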


🔧 Production Readiness Issues

4. No NATS Deployment Applied in Setup Script

Location: scripts/run/k8s/engine.sh:42-57

The script doesn't deploy NATS (files 07-09), but the NATS manifests exist. Is NATS required or should the manifests be removed?

5. Missing Persistent Volume Configuration for NATS

Location: k8s/engine/08-nats-statefulset.yaml

The NATS StatefulSet doesn't define persistent storage, meaning NATS data will be lost on pod restart.

6. Resource Limits May Be Too High

Location: k8s/engine/03-rivet-engine-deployment.yaml:57-63

With 2 replicas requiring 4 CPUs and 8GB RAM minimum, this could prevent deployment on smaller clusters. Consider more conservative defaults or providing sizing profiles.


📝 Code Quality Issues

7. Placeholder Config Checksum

Location: k8s/engine/03-rivet-engine-deployment.yaml:18

The checksum/config annotation has REPLACE_WITH_CONFIG_CHECKSUM which will never be replaced, preventing automatic pod restarts on config changes.

8. Inconsistent Image Pull Policy

Location: k8s/engine/03-rivet-engine-deployment.yaml:34

imagePullPolicy: Never is hardcoded, which only works for local development. Production deployments will fail.


✅ What's Good

  1. Excellent separation of main engine and singleton services
  2. Good probe configuration with startup, readiness, and liveness probes
  3. Comprehensive documentation in both DEPLOY.md and README.md
  4. Well-structured script with proper error handling
  5. Clear namespace organization with proper labels
  6. ServiceAccount created for future RBAC
  7. Thoughtful HPA for main engine while keeping singleton at 1 replica

✨ Priority Recommendations

  1. HIGH: Fix justfile docker-stop bug
  2. HIGH: Clarify NATS deployment requirement
  3. MEDIUM: Security improvements (password handling)
  4. MEDIUM: Resource configuration guidance
  5. LOW: Additional documentation polish

Let me know if you need any clarification on these suggestions!

graphite-app bot pushed a commit that referenced this pull request Nov 6, 2025
K8s self hosting setup
@graphite-app graphite-app bot closed this Nov 6, 2025
@graphite-app graphite-app bot deleted the 10-26-feat_k8s_selfhosting branch November 6, 2025 05:50