Add prometheus metric tests based on a fictional electricity bidding market #1069
Walkthrough

Adds two new test fixtures (tests 160 and 161) with Kubernetes manifests, Prometheus configs, k6 load tests, validation scripts, and toolset definitions; updates the shared Python Flask test image to include prometheus-client and bumps its tag from 2.1 to 2.2; tweaks one checkout test prompt.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Runner as Test Runner
    participant K8s as Kubernetes
    participant Bidder as Bidder Service
    participant Prom as Prometheus
    participant K6 as k6 Job
    participant Script as Validation Script
    rect rgb(230, 240, 255)
        Note over Runner,K8s: Test 160 — deploy, load, validate
    end
    Runner->>K8s: create namespace app-160, apply Prometheus config
    Runner->>K8s: deploy bidder app (Deployment + Service)
    Runner->>K8s: wait for readiness
    Runner->>K8s: start k6 Job
    K6->>Bidder: send traffic (normal, NordPool surge, normal)
    Bidder->>Prom: expose metrics (/metrics)
    Runner->>Script: run run-bidder-check.sh
    Script->>Prom: query metrics (rates, traffic, bids/sec)
    Script-->>Runner: pass/fail result
    Runner->>K8s: cleanup namespace
```

```mermaid
sequenceDiagram
    participant Runner as Test Runner
    participant K8s as Kubernetes
    participant BidderV1 as Bidder v1
    participant BidderV2 as Bidder v2
    participant HPA as HorizontalPodAutoscaler
    participant Prom as Prometheus
    participant K6 as k6 Job
    participant Script as wait-for-scaling.sh
    rect rgb(230, 255, 230)
        Note over Runner,K8s: Test 161 — baseline, upgrade, observe scaling
    end
    Runner->>K8s: create namespace app-161, apply Prometheus config
    Runner->>K8s: deploy bidder v1 + HPA
    Runner->>K8s: start k6-v1 Job (100 rps)
    K6->>BidderV1: generate load (v1 ~50ms)
    BidderV1->>Prom: record metrics
    Runner->>K8s: rollout bidder v2
    Runner->>K8s: start k6-v2 Job (100 rps)
    K6->>BidderV2: generate load (v2 ~2s)
    BidderV2->>Prom: record metrics
    Runner->>Script: poll HPA / deployment until scale target
    Script-->>Runner: report scaling status
    Runner->>K8s: cleanup namespace
```
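The validation steps in both diagrams query Prometheus for request and bid rates. As an illustration of the kind of query a check script could issue, a PromQL sketch (the metric and label names here are assumptions, not taken from the fixtures):

```promql
# Per-second bid request rate over the last minute, broken down per pod
# (metric name bid_requests_total and label pod are hypothetical)
sum by (pod) (rate(bid_requests_total[1m]))
```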
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes

Pre-merge checks: ❌ Failed checks (1 warning), ✅ Passed checks (2 passed)
Actionable comments posted: 6
🧹 Nitpick comments (2)
tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidding_system.md (1)
2-2: Minor grammar improvement: use a hyphenated compound adjective. Line 2 uses "highest resolution" where "highest-resolution" would be more grammatically precise when modifying "possible".

```diff
- Problems with bid rate and traffic can change in a matter of seconds, use the highest resolution possible
+ Problems with bid rate and traffic can change in a matter of seconds, use the highest-resolution possible
```

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/hpa.yaml (1)

1-38: LGTM! Consider test-specific deployment naming. The HPA configuration is well-structured and correctly targets namespace `app-161`. Consider naming the target deployment `bidder-161` instead of just `bidder` to follow the convention of including test IDs in resource names and ensure uniqueness across concurrent test runs. This would require coordinating with `bidder-v2.yaml` (and `bidder-v1.yaml` if present) to rename the deployment from `bidder` to `bidder-161`.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- tests/llm/fixtures/shared/python-flask-otel/Dockerfile (2 hunks)
- tests/llm/fixtures/shared/python-flask-otel/build.sh (2 hunks)
- tests/llm/fixtures/test_ask_holmes/124_checkout_latency_prometheus/test_case.yaml (1 hunk)
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidder-app.yaml (1 hunk)
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidding_system.md (1 hunk)
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/prometheus-config.yaml (1 hunk)
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/run-bidder-check.sh (1 hunk)
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/test_case.yaml (1 hunk)
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/toolsets.yaml (1 hunk)
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v1.yaml (1 hunk)
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v2.yaml (1 hunk)
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/hpa.yaml (1 hunk)
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v1-traffic.yaml (1 hunk)
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v2-traffic.yaml (1 hunk)
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/prometheus-config.yaml (1 hunk)
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/test_case.yaml (1 hunk)
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/toolsets.yaml (1 hunk)
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/wait-for-scaling.sh (1 hunk)
🧰 Additional context used
📓 Path-based instructions (6)
tests/**
📄 CodeRabbit inference engine (CLAUDE.md)
Test files should mirror the source structure under tests/
Files:
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/toolsets.yaml
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/wait-for-scaling.sh
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/test_case.yaml
- tests/llm/fixtures/shared/python-flask-otel/Dockerfile
- tests/llm/fixtures/shared/python-flask-otel/build.sh
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/toolsets.yaml
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidding_system.md
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidder-app.yaml
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/prometheus-config.yaml
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v2.yaml
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/run-bidder-check.sh
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v2-traffic.yaml
- tests/llm/fixtures/test_ask_holmes/124_checkout_latency_prometheus/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/prometheus-config.yaml
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/hpa.yaml
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v1.yaml
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v1-traffic.yaml
tests/llm/**/*.yaml
📄 CodeRabbit inference engine (CLAUDE.md)
tests/llm/**/*.yaml: In eval manifests, ALWAYS use Kubernetes Secrets for scripts rather than inline manifests or ConfigMaps to prevent script exposure via kubectl describe
Eval resources must use neutral names, each test must use a dedicated namespace 'app-<testid>', and all pod names must be unique across tests
Files:
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/toolsets.yaml
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/toolsets.yaml
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidder-app.yaml
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/prometheus-config.yaml
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v2.yaml
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v2-traffic.yaml
- tests/llm/fixtures/test_ask_holmes/124_checkout_latency_prometheus/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/prometheus-config.yaml
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/hpa.yaml
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v1.yaml
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v1-traffic.yaml
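As a minimal sketch of the Secret-based pattern this guideline requires (resource and key names here are hypothetical, not from the fixtures), a script body can live in a Secret and be mounted into the pod, keeping it out of `kubectl describe` output:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: bidder-check-script   # hypothetical name
  namespace: app-160
type: Opaque
stringData:
  check.sh: |
    #!/bin/sh
    echo "script body stored in a Secret, not inline or in a ConfigMap"
```

The Deployment would then mount this Secret as a volume and execute the mounted file, rather than embedding the script text in the manifest.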
tests/**/toolsets.yaml
📄 CodeRabbit inference engine (CLAUDE.md)
In eval toolsets.yaml, ALL toolset-specific configuration must be under a 'config' field; do not place toolset config at the top level
Files:
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/toolsets.yaml
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/toolsets.yaml
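A hedged sketch of the shape this guideline describes, with all toolset-specific settings nested under `config` (the toolset name follows the prometheus/metrics toolset mentioned in this review; the `prometheus_url` key is an assumption for illustration):

```yaml
toolsets:
  prometheus/metrics:
    enabled: true
    config:
      # toolset-specific settings go here, never at the top level
      prometheus_url: http://prometheus.app-160.svc.cluster.local:9090
```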
{holmes/plugins/toolsets/**/*.{yaml,yml},tests/**/toolsets.yaml}
📄 CodeRabbit inference engine (CLAUDE.md)
Only the following top-level fields are valid in toolset YAML: enabled, name, description, additional_instructions, prerequisites, tools, docs_url, icon_url, installation_instructions, config, url (MCP toolsets only)
Files:
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/toolsets.yaml
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/toolsets.yaml
tests/llm/**/*.sh
📄 CodeRabbit inference engine (CLAUDE.md)
When scripting kubectl operations in evals, never use a bare 'kubectl wait' immediately after creating resources; use a retry loop to avoid race conditions
Files:
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/wait-for-scaling.sh
- tests/llm/fixtures/shared/python-flask-otel/build.sh
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/run-bidder-check.sh
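A minimal sketch of the retry-loop pattern this guideline calls for. A mock readiness check stands in for the real probe; in an eval script the polled command would be something like a `kubectl get` on the just-created resource (the function and file names here are illustrative, not from the fixtures):

```shell
#!/bin/sh
# Poll a command until it succeeds or a timeout elapses, instead of a bare
# 'kubectl wait' that can race with resource creation.
retry_until() {
  # $1 = timeout in seconds, $2 = poll interval in seconds, rest = command
  timeout=$1; interval=$2; shift 2
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}

# Mock check that succeeds on the third attempt (stands in for a kubectl probe).
count_file=$(mktemp)
echo 0 > "$count_file"
check() {
  n=$(cat "$count_file")
  n=$((n + 1))
  echo "$n" > "$count_file"
  [ "$n" -ge 3 ]
}

if retry_until 10 1 check; then
  echo "ready"
else
  echo "timed out"
fi
rm -f "$count_file"
```

The same `retry_until` wrapper can front any `kubectl get`/`kubectl rollout status` call, so a slow apiserver or not-yet-registered resource produces a retry instead of an immediate failure.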
tests/**/test_case.yaml
📄 CodeRabbit inference engine (CLAUDE.md)
Do NOT put toolset configuration directly in test_case.yaml; keep toolset config in a separate toolsets.yaml
Files:
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/124_checkout_latency_prometheus/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/test_case.yaml
🪛 Checkov (3.2.334)
tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidder-app.yaml
[medium] 150-210: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 150-210: Minimize the admission of root containers
(CKV_K8S_23)
[medium] 350-396: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 350-396: Minimize the admission of root containers
(CKV_K8S_23)
tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v2.yaml
[medium] 92-175: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 92-175: Minimize the admission of root containers
(CKV_K8S_23)
tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v2-traffic.yaml
[medium] 70-103: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 70-103: Minimize the admission of root containers
(CKV_K8S_23)
tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v1.yaml
[medium] 92-172: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 92-172: Minimize the admission of root containers
(CKV_K8S_23)
tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v1-traffic.yaml
[medium] 70-103: Containers should not run with allowPrivilegeEscalation
(CKV_K8S_20)
[medium] 70-103: Minimize the admission of root containers
(CKV_K8S_23)
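The CKV_K8S_20 and CKV_K8S_23 findings above are typically addressed with a container-level `securityContext`; a generic sketch (not taken from the flagged manifests):

```yaml
securityContext:
  runAsNonRoot: true               # CKV_K8S_23: minimize root containers
  allowPrivilegeEscalation: false  # CKV_K8S_20
```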
🪛 LanguageTool
tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidding_system.md
[uncategorized] ~2-~2: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ... change in a matter of seconds, use the highest resolution possible - The total bid request traffi...
(EN_COMPOUND_ADJECTIVE_INTERNAL)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Pre-commit checks
- GitHub Check: llm_evals
- GitHub Check: build
🔇 Additional comments (15)
tests/llm/fixtures/shared/python-flask-otel/build.sh (1)
9-9: ✅ Version bump and documentation update look good. The minor version bump (2.1 → 2.2) appropriately reflects the addition of Prometheus client support, and the informational echo statement clearly documents the new capability for users. The changes align well with the PR's Prometheus metric test objective.
Also applies to: 36-36
tests/llm/fixtures/shared/python-flask-otel/Dockerfile (1)
5-5: ✅ Prometheus client dependency addition is clean and correct. The prometheus-client 0.21.1 dependency is properly pinned and compatible with Python 3.11. The comment clearly explains its purpose for Prometheus-based tests, and the line continuation formatting is correct. This change directly supports the PR's test infrastructure expansion.
Also applies to: 16-17
tests/llm/fixtures/test_ask_holmes/124_checkout_latency_prometheus/test_case.yaml (2)
2-2: Prompt change aligns with test objectives. The user prompt now explicitly references the `/checkout` endpoint and namespace `app-124` as mentioned in the enriched summary. This clarification should help focus the LLM analysis on the intended target.

21-31: No issues found. Manifests comply with coding guidelines. Verification confirms that all referenced manifests properly handle configuration and secrets:

- `prometheus-config.yaml` appropriately uses a ConfigMap for static Prometheus configuration (not scripts)
- `shared/prometheus.yaml` uses standard Kubernetes patterns with no embedded scripts
- `holmes-all-in-one-fast.yaml` uses Secrets for sensitive data
- Namespace naming (`app-124`) follows the `app-<testid>` convention
- Toolset configuration is correctly isolated in a separate `toolsets.yaml` file
- External shell scripts are executed from the hook, not embedded in manifests
tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/toolsets.yaml (1)
1-9: LGTM! Toolsets configuration correctly places prometheus/metrics settings under a `config` field per guidelines.

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/toolsets.yaml (1)

1-9: LGTM! Same structure as test 160's toolsets, properly configured with a `config` field.

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/wait-for-scaling.sh (1)

1-42: Script correctly uses a retry loop (not bare kubectl wait). The polling implementation at lines 12-32 properly avoids the anti-pattern of bare `kubectl wait` and instead implements a robust retry loop with timeout handling.

tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/prometheus-config.yaml (1)

1-17: LGTM! Prometheus ConfigMap properly configured with dedicated namespace `app-160` and a correct in-cluster service discovery target.

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/prometheus-config.yaml (1)

1-15: LGTM! Prometheus ConfigMap properly configured with dedicated namespace `app-161` and correct service targeting.

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v2.yaml (2)

123-125: Verify image availability and pull strategy. The image uses `imagePullPolicy: IfNotPresent`, which will skip pulling if the image already exists locally. Ensure the correct version of `python-flask-otel:2.2` is available in the test environment, or consider using `Always` if you want to pull the latest version. Confirm that the image `me-west1-docker.pkg.dev/robusta-development/development/python-flask-otel:2.2` exists and includes Prometheus client support (used in line 10).

1-90: LGTM! Properly uses Kubernetes Secrets for script storage. The Flask app code is correctly stored in a Kubernetes Secret (`bidder-app-v2`) rather than inline in a ConfigMap or manifest, following the guideline for eval resources.

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v1-traffic.yaml (1)

1-103: Excellent use of Secrets for the k6 test script. The manifest correctly uses a Kubernetes Secret to store the k6 test script rather than embedding it inline in the Job, following security best practices for eval fixtures. The script is well-structured with proper error checks (status 200, version validation) and metrics tracking via k6's Rate metric. Namespace `app-161` and unique pod naming are properly scoped.

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v2-traffic.yaml (1)

1-103: Properly differentiated v2 load test with version-specific thresholds. The v2 test correctly uses a more lenient threshold (`p(95)<3000` vs v1's `200`) and an increased timeout (`10s` vs `5s`), reflecting the expected slower performance of v2.0. Version validation and namespace isolation are correctly maintained.

tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidder-app.yaml (1)

1-396: Smart use of initContainers with health checks to ensure dependencies. The bidder deployment correctly uses `initContainers` with curl-based health checks (lines 360-382) to poll for service readiness rather than relying on bare `kubectl wait` commands. This is a more robust pattern for ensuring ordering in test setup. The k6 script is properly stored in a Secret, and the Flask app implements the intentional NordPool bid-acceptance bug with clear thread-safe logic (the `nordpool_counter` and lock pattern at lines 48-97).

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v1.yaml (1)

1-186: Clean v1 deployment with a pre-built image and solid observability. The v1 bidder correctly uses a pre-built Python Flask image (rather than runtime pip install), includes startup/readiness/liveness probes, and populates Prometheus metrics with Downward API values (pod name, namespace) for proper k8s label hygiene. Processing time (`40-60ms` at line 55) appropriately differentiates v1's performance from v2. The 2-replica setup aligns with the HPA-driven scaling test scenario.
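The Downward API wiring praised above can be sketched as a generic container `env` fragment (this is an illustration of the mechanism, not copied from `bidder-v1.yaml`):

```yaml
env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: POD_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
```

The app can then read these environment variables at startup and attach them as labels when registering its Prometheus metrics.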
Resolved review comments:

- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/run-bidder-check.sh
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v2.yaml (2 comments)
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/test_case.yaml
- tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/wait-for-scaling.sh (outdated)
- tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidding_system.md
Looks great. Left one question out of curiosity, but no need to fix anything there. Great job.
No description provided.