
Conversation

@Sheeproid

No description provided.

@Sheeproid Sheeproid requested a review from aantn October 21, 2025 15:20
@Sheeproid Sheeproid enabled auto-merge (squash) October 21, 2025 15:20

coderabbitai bot commented Oct 21, 2025

Walkthrough

Adds two new test fixtures (tests 160 and 161) with Kubernetes manifests, Prometheus configs, k6 load tests, validation scripts, and toolset definitions; updates the shared Python Flask test image to include prometheus-client, bumping its tag from 2.1 to 2.2; and tweaks one checkout test prompt.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Base image & build**<br>`tests/llm/fixtures/shared/python-flask-otel/Dockerfile`, `tests/llm/fixtures/shared/python-flask-otel/build.sh` | Add prometheus-client to the pip install and bump the image tag from 2.1 to 2.2; update build output notes. |
| **Checkout latency test**<br>`tests/llm/fixtures/test_ask_holmes/124_checkout_latency_prometheus/test_case.yaml` | Clarify the user prompt to reference the /checkout endpoint in app-124. |
| **Electricity market bidding bug — app**<br>`tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidder-app.yaml` | Add a Flask bidder service (inline app) with Prometheus metrics and NordPool-specific bug logic (after 100 NordPool requests, always bid), plus a Deployment, Service, and k6 Job. |
| **Electricity market bidding bug — config & tests**<br>`tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidding_system.md`, `.../prometheus-config.yaml`, `.../run-bidder-check.sh`, `.../test_case.yaml`, `.../toolsets.yaml` | Add docs, a Prometheus ConfigMap (5s scrape/eval), a validation script that queries Prometheus and asserts rates/traffic, a test_case with setup/teardown, and toolsets enabling Prometheus metrics. |
| **Bidding version performance — apps (v1 & v2)**<br>`tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v1.yaml`, `.../bidder-v2.yaml` | Add two Secret-mounted Flask bidder variants (v1 fast, ~50ms; v2 slow, ~1.8–2.2s) exposing /healthz, /bid, and /metrics and registering Prometheus metrics; Deployments include probes and scrape annotations. |
| **Bidding version performance — infra & tests**<br>`tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/hpa.yaml`, `.../k6-v1-traffic.yaml`, `.../k6-v2-traffic.yaml`, `.../prometheus-config.yaml`, `.../test_case.yaml`, `.../toolsets.yaml`, `.../wait-for-scaling.sh` | Add an HPA (2–10 replicas, 50% CPU target), k6 Job Secrets and Jobs for v1/v2 load (100 req/s), a Prometheus ConfigMap (5s), a test_case with a staged upgrade workflow, toolsets config, and a scaling-wait helper script. |
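The NordPool-specific bug described in the walkthrough (after 100 NordPool requests the service always bids) boils down to a thread-safe counter gating the bid decision. A stdlib-only sketch of that logic — names like `BidDecider` and `should_bid` are illustrative, not the fixture's actual code:

```python
import threading

class BidDecider:
    """Sketch of the intentional bug: once a threshold of NordPool
    requests is crossed, every NordPool bid is accepted unconditionally."""

    def __init__(self, threshold=100):
        self.threshold = threshold
        self.nordpool_counter = 0      # counter-and-lock pattern noted in the review
        self.lock = threading.Lock()

    def should_bid(self, market, price, max_price):
        if market == "nordpool":
            with self.lock:
                self.nordpool_counter += 1
                if self.nordpool_counter > self.threshold:
                    return True        # bug: price check is skipped entirely
        return price <= max_price      # normal behaviour: bid only when affordable

decider = BidDecider(threshold=100)
# Before the threshold, an overpriced NordPool bid is rejected...
print(decider.should_bid("nordpool", price=120, max_price=100))  # False
for _ in range(100):
    decider.should_bid("nordpool", price=50, max_price=100)
# ...but after 100 more NordPool requests, every NordPool bid is accepted.
print(decider.should_bid("nordpool", price=120, max_price=100))  # True
```

This is the observable symptom the Prometheus validation script is meant to catch: the bid rate for NordPool traffic jumps to ~100% mid-run while other markets stay normal.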

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Runner as Test Runner
    participant K8s as Kubernetes
    participant Bidder as Bidder Service
    participant Prom as Prometheus
    participant K6 as k6 Job
    participant Script as Validation Script

    rect rgb(230, 240, 255)
    Note over Runner,K8s: Test 160 — deploy, load, validate
    end

    Runner->>K8s: create namespace app-160, apply Prometheus config
    Runner->>K8s: deploy bidder app (Deployment + Service)
    Runner->>K8s: wait for readiness
    Runner->>K8s: start k6 Job
    K6->>Bidder: send traffic (normal, NordPool surge, normal)
    Bidder->>Prom: expose metrics (/metrics)
    Runner->>Script: run run-bidder-check.sh
    Script->>Prom: query metrics (rates, traffic, bids/sec)
    Script-->>Runner: pass/fail result
    Runner->>K8s: cleanup namespace
```
```mermaid
sequenceDiagram
    participant Runner as Test Runner
    participant K8s as Kubernetes
    participant BidderV1 as Bidder v1
    participant BidderV2 as Bidder v2
    participant HPA as HorizontalPodAutoscaler
    participant Prom as Prometheus
    participant K6 as k6 Job
    participant Script as wait-for-scaling.sh

    rect rgb(230, 255, 230)
    Note over Runner,K8s: Test 161 — baseline, upgrade, observe scaling
    end

    Runner->>K8s: create namespace app-161, apply Prometheus config
    Runner->>K8s: deploy bidder v1 + HPA
    Runner->>K8s: start k6-v1 Job (100 rps)
    K6->>BidderV1: generate load (v1 ~50ms)
    BidderV1->>Prom: record metrics
    Runner->>K8s: rollout bidder v2
    Runner->>K8s: start k6-v2 Job (100 rps)
    K6->>BidderV2: generate load (v2 ~2s)
    BidderV2->>Prom: record metrics
    Runner->>Script: poll HPA / deployment until scale target
    Script-->>Runner: report scaling status
    Runner->>K8s: cleanup namespace
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Suggested labels

enhancement

Suggested reviewers

  • aantn
  • moshemorad

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description Check | ⚠️ Warning | No pull request description was provided by the author. While the check is intended to be lenient and only fail when descriptions are completely off-topic, the pass criterion requires that a description be "related in some way to the changeset"; an empty or missing description cannot satisfy this, as there is no content to establish any relationship to the changes. | Add a pull request description that outlines the purpose and scope of these new Prometheus metric tests: what scenarios are added (tests 160 and 161), what they test (Prometheus metrics in the electricity bidding market scenario), and any relevant context for reviewers. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title "Add prometheus metric tests based on a fictional electricity bidding market" directly and accurately describes the main changes: multiple new fixtures and test cases (tests 160 and 161) designed to test Prometheus metrics within a fictional electricity bidding market scenario, with corresponding Kubernetes manifests, scripts, and configurations. The title is clear, specific, and concise. |
| Docstring Coverage | ✅ Passed | No functions found in the changes; docstring coverage check skipped. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0f04277 and 1fea326.

📒 Files selected for processing (1)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/wait-for-scaling.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/wait-for-scaling.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Pre-commit checks
  • GitHub Check: llm_evals
  • GitHub Check: build



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (2)
tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidding_system.md (1)

2-2: Minor grammar improvement: use hyphenated compound adjective.

Line 2 uses "highest resolution" where "highest-resolution" would be more grammatically precise when modifying "possible".

```diff
- Problems with bid rate and traffic can change in a matter of seconds, use the highest resolution possible
+ Problems with bid rate and traffic can change in a matter of seconds, use the highest-resolution possible
```
tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/hpa.yaml (1)

1-38: LGTM! Consider test-specific deployment naming.

The HPA configuration is well-structured and correctly targets namespace app-161. Consider naming the target deployment bidder-161 instead of just bidder to follow the convention of including test IDs in resource names and ensure uniqueness across concurrent test runs.

This would require coordinating with bidder-v2.yaml (and bidder-v1.yaml if present) to rename the deployment from bidder to bidder-161.
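For context on what this HPA is expected to do when v2's ~2s handler saturates CPU: the autoscaler's core formula is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the HPA's bounds. A minimal sketch using the 2–10 replica, 50%-CPU-target configuration from this test:

```python
import math

def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct=50,
                     min_replicas=2, max_replicas=10):
    """Core HPA scaling rule: ceil(current * observed / target),
    clamped to the min/max replica bounds configured on the HPA."""
    desired = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, desired))

# Light load near the target barely moves the replica count...
print(desired_replicas(2, 55))    # 3
# ...while v2's slow handler pushing pods far past target scales out hard.
print(desired_replicas(2, 180))   # 8
print(desired_replicas(8, 200))   # 10 (clamped to maxReplicas)
```

The CPU percentages here are hypothetical illustrations; the real values depend on the pod resource requests and k6 load in the fixture.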

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c34af2d and 0f04277.

📒 Files selected for processing (18)
  • tests/llm/fixtures/shared/python-flask-otel/Dockerfile (2 hunks)
  • tests/llm/fixtures/shared/python-flask-otel/build.sh (2 hunks)
  • tests/llm/fixtures/test_ask_holmes/124_checkout_latency_prometheus/test_case.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidder-app.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidding_system.md (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/prometheus-config.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/run-bidder-check.sh (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/test_case.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/toolsets.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v1.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v2.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/hpa.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v1-traffic.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v2-traffic.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/prometheus-config.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/test_case.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/toolsets.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/wait-for-scaling.sh (1 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
tests/**

📄 CodeRabbit inference engine (CLAUDE.md)

Test files should mirror the source structure under tests/

Files:

  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/toolsets.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/wait-for-scaling.sh
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/test_case.yaml
  • tests/llm/fixtures/shared/python-flask-otel/Dockerfile
  • tests/llm/fixtures/shared/python-flask-otel/build.sh
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/toolsets.yaml
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidding_system.md
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidder-app.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/prometheus-config.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v2.yaml
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/run-bidder-check.sh
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v2-traffic.yaml
  • tests/llm/fixtures/test_ask_holmes/124_checkout_latency_prometheus/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/prometheus-config.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/hpa.yaml
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v1.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v1-traffic.yaml
tests/llm/**/*.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

tests/llm/**/*.yaml: In eval manifests, ALWAYS use Kubernetes Secrets for scripts rather than inline manifests or ConfigMaps to prevent script exposure via kubectl describe
Eval resources must use neutral names, each test must use a dedicated namespace 'app-', and all pod names must be unique across tests

Files:

  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/toolsets.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/toolsets.yaml
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidder-app.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/prometheus-config.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v2.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v2-traffic.yaml
  • tests/llm/fixtures/test_ask_holmes/124_checkout_latency_prometheus/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/prometheus-config.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/hpa.yaml
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v1.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v1-traffic.yaml
tests/**/toolsets.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

In eval toolsets.yaml, ALL toolset-specific configuration must be under a 'config' field; do not place toolset config at the top level

Files:

  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/toolsets.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/toolsets.yaml
{holmes/plugins/toolsets/**/*.{yaml,yml},tests/**/toolsets.yaml}

📄 CodeRabbit inference engine (CLAUDE.md)

Only the following top-level fields are valid in toolset YAML: enabled, name, description, additional_instructions, prerequisites, tools, docs_url, icon_url, installation_instructions, config, url (MCP toolsets only)

Files:

  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/toolsets.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/toolsets.yaml
tests/llm/**/*.sh

📄 CodeRabbit inference engine (CLAUDE.md)

When scripting kubectl operations in evals, never use a bare 'kubectl wait' immediately after creating resources; use a retry loop to avoid race conditions

Files:

  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/wait-for-scaling.sh
  • tests/llm/fixtures/shared/python-flask-otel/build.sh
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/run-bidder-check.sh
tests/**/test_case.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

Do NOT put toolset configuration directly in test_case.yaml; keep toolset config in a separate toolsets.yaml

Files:

  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/124_checkout_latency_prometheus/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/test_case.yaml
🪛 Checkov (3.2.334)
tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidder-app.yaml

[medium] 150-210: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 150-210: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 350-396: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 350-396: Minimize the admission of root containers

(CKV_K8S_23)

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v2.yaml

[medium] 92-175: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 92-175: Minimize the admission of root containers

(CKV_K8S_23)

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v2-traffic.yaml

[medium] 70-103: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 70-103: Minimize the admission of root containers

(CKV_K8S_23)

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v1.yaml

[medium] 92-172: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 92-172: Minimize the admission of root containers

(CKV_K8S_23)

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v1-traffic.yaml

[medium] 70-103: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 70-103: Minimize the admission of root containers

(CKV_K8S_23)

🪛 LanguageTool
tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidding_system.md

[uncategorized] ~2-~2: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ... change in a matter of seconds, use the highest resolution possible - The total bid request traffi...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Pre-commit checks
  • GitHub Check: llm_evals
  • GitHub Check: build
🔇 Additional comments (15)
tests/llm/fixtures/shared/python-flask-otel/build.sh (1)

9-9: ✅ Version bump and documentation update look good.

The minor version bump (2.1 → 2.2) appropriately reflects the addition of Prometheus client support, and the informational echo statement clearly documents the new capability for users. The changes align well with the PR's Prometheus metric test objective.

Also applies to: 36-36

tests/llm/fixtures/shared/python-flask-otel/Dockerfile (1)

5-5: ✅ Prometheus client dependency addition is clean and correct.

The prometheus-client 0.21.1 dependency is properly pinned and compatible with Python 3.11. The comment clearly explains its purpose for Prometheus-based tests, and the line continuation formatting is correct. This change directly supports the PR's test infrastructure expansion.

Also applies to: 16-17

tests/llm/fixtures/test_ask_holmes/124_checkout_latency_prometheus/test_case.yaml (2)

2-2: Prompt change aligns with test objectives.

The user prompt now explicitly references the /checkout endpoint and namespace app-124 as mentioned in the enriched summary. This clarification should help focus the LLM analysis on the intended target.


21-31: No issues found. Manifests comply with coding guidelines.

Verification confirms that all referenced manifests properly handle configuration and secrets:

  • prometheus-config.yaml appropriately uses ConfigMap for static Prometheus configuration (not scripts)
  • shared/prometheus.yaml uses standard Kubernetes patterns with no embedded scripts
  • holmes-all-in-one-fast.yaml uses Secrets for sensitive data
  • Namespace naming (app-124) follows the app-<testid> convention
  • Toolset configuration is correctly isolated in a separate toolsets.yaml file
  • External shell scripts are executed from the hook, not embedded in manifests
tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/toolsets.yaml (1)

1-9: LGTM!

Toolsets configuration correctly places prometheus/metrics settings under a config field per guidelines.

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/toolsets.yaml (1)

1-9: LGTM!

Same structure as test 160's toolsets, properly configured with config field.

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/wait-for-scaling.sh (1)

1-42: Script correctly uses retry loop (not bare kubectl wait).

The polling implementation at lines 12-32 properly avoids the anti-pattern of bare kubectl wait and instead implements a robust retry loop with timeout handling.
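The retry-loop pattern this guideline asks for — polling instead of racing a bare `kubectl wait` against resources that may not exist yet — can be sketched generically. The command and timings below are illustrative, not the fixture's actual script:

```python
import subprocess
import time

def wait_for(check_cmd, timeout=120, interval=5):
    """Poll a command until it exits 0, rather than issuing a single
    'kubectl wait' that fails fast if the resource isn't created yet."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if subprocess.run(check_cmd, capture_output=True).returncode == 0:
            return True
        time.sleep(interval)
    return False

# Illustrative kubectl usage (hypothetical deployment name):
# wait_for(["kubectl", "-n", "app-161", "rollout", "status",
#           "deployment/bidder", "--timeout=1s"])
print(wait_for(["true"], timeout=5, interval=0.1))  # True
```

The key design point is that each iteration re-runs the full check, so a resource created a few seconds after the script starts is still picked up instead of causing an immediate failure.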

tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/prometheus-config.yaml (1)

1-17: LGTM!

Prometheus ConfigMap properly configured with dedicated namespace app-160 and correct in-cluster service discovery target.
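With this 5s scrape interval, the validation script's "bids/sec"-style assertions correspond to PromQL `rate()` over a counter. Conceptually (a stdlib-only sketch of the semantics, not the fixture's actual query code):

```python
def per_second_rate(samples):
    """Simplified take on PromQL rate(): the average per-second increase
    of a monotonically increasing counter across (timestamp, value) samples."""
    (t0, v0) = samples[0]
    (t1, v1) = samples[-1]
    return (v1 - v0) / (t1 - t0)

# A bid counter scraped every 5s: 100 -> 150 -> 210 total bids over 10s.
print(per_second_rate([(0, 100), (5, 150), (10, 210)]))  # 11.0
```

The short scrape interval matters here: `rate()` needs at least two samples inside its lookup window, so a 5s scrape lets the script use tight windows and catch bid-rate changes that, per the bidding_system.md docs, can occur in a matter of seconds.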

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/prometheus-config.yaml (1)

1-15: LGTM!

Prometheus ConfigMap properly configured with dedicated namespace app-161 and correct service targeting.

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v2.yaml (2)

123-125: Verify image availability and pull strategy.

The image uses imagePullPolicy: IfNotPresent, which will skip pulling if the image already exists locally. Ensure the correct version of python-flask-otel:2.2 is available in the test environment, or consider using Always if you want to pull the latest version.

Confirm that the image me-west1-docker.pkg.dev/robusta-development/development/python-flask-otel:2.2 exists and includes Prometheus client support (used at line 10).


1-90: LGTM! Properly uses Kubernetes Secrets for script storage.

The Flask app code is correctly stored in a Kubernetes Secret (bidder-app-v2) rather than inline in ConfigMap or manifest, following the guideline for eval resources.

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v1-traffic.yaml (1)

1-103: Excellent use of Secrets for k6 test script.

The manifest correctly uses a Kubernetes Secret to store the k6 test script rather than embedding it inline in the Job, following security best practices for eval fixtures. The script is well-structured with proper error checks (status 200, version validation) and metrics tracking via k6's Rate metric. Namespace app-161 and unique pod naming are properly scoped.

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v2-traffic.yaml (1)

1-103: Properly differentiated v2 load test with version-specific thresholds.

The v2 test correctly uses a more lenient threshold (p(95)<3000 vs v1's 200) and increased timeout (10s vs 5s), reflecting the expected slower performance of v2.0. Version validation and namespace isolation are correctly maintained.
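The p(95) thresholds being compared here are 95th-percentile latency cutoffs. As a quick illustration of what such a threshold asserts, a nearest-rank percentile over latency samples looks like this (the sample latencies are made up to match v1's ~50ms and v2's ~2s profiles):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: roughly what a k6 p(95) threshold evaluates."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical request latencies in ms.
v1 = [48, 52, 50, 49, 55, 51, 47, 53, 50, 60]
v2 = [1850, 2100, 1900, 2200, 2050, 1950, 2150, 1800, 2000, 2120]
print(percentile(v1, 95) < 200)    # True  -> v1 passes p(95)<200
print(percentile(v2, 95) < 3000)   # True  -> v2 passes its looser p(95)<3000
print(percentile(v2, 95) < 200)    # False -> v2 would fail v1's threshold
```

This is why the looser v2 threshold is the right call for the fixture: the test's goal is for v2 to run (slowly) and trigger scaling, not for k6 to abort on latency.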

tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidder-app.yaml (1)

1-396: Smart use of initContainers with health checks to ensure dependencies.

The bidder deployment correctly uses initContainers with curl-based health checks (lines 360-382) to poll for service readiness rather than relying on bare kubectl wait commands. This is a more robust pattern for ensuring ordering in test setup. The k6 script is properly stored in a Secret, and the Flask app implements the intentional NordPool bid-acceptance bug with clear thread-safe logic (the nordpool_counter and lock pattern at lines 48-97).

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v1.yaml (1)

1-186: Clean v1 deployment with pre-built image and solid observability.

The v1 bidder correctly uses a pre-built Python Flask image (rather than runtime pip install), includes startup/readiness/liveness probes, and populates Prometheus metrics with Downward API values (pod name, namespace) for proper k8s label hygiene. Processing time (40-60ms at line 55) appropriately differentiates v1's performance from v2. The 2-replica setup aligns with the HPA-driven scaling test scenario.

aantn previously approved these changes Oct 22, 2025

@aantn aantn left a comment


Looks great. Left one question out of curiosity, but no need to fix anything there. Great job.

@github-actions

Results of HolmesGPT evals

  • ask_holmes: 34/35 test cases were successful, 0 regressions, 1 setup failure
| Test suite | Test case | Status |
| --- | --- | --- |
| ask | 01_how_many_pods | ✅ |
| ask | 02_what_is_wrong_with_pod | ✅ |
| ask | 04_related_k8s_events | ✅ |
| ask | 05_image_version | ✅ |
| ask | 09_crashpod | ✅ |
| ask | 10_image_pull_backoff | ✅ |
| ask | 110_k8s_events_image_pull | ✅ |
| ask | 11_init_containers | ✅ |
| ask | 13a_pending_node_selector_basic | ✅ |
| ask | 14_pending_resources | ✅ |
| ask | 15_failed_readiness_probe | ✅ |
| ask | 17_oom_kill | ✅ |
| ask | 19_detect_missing_app_details | ✅ |
| ask | 20_long_log_file_search | ✅ |
| ask | 24_misconfigured_pvc | ✅ |
| ask | 24a_misconfigured_pvc_basic | ✅ |
| ask | 28_permissions_error | 🚧 |
| ask | 39_failed_toolset | ✅ |
| ask | 41_setup_argo | ✅ |
| ask | 42_dns_issues_steps_new_tools | ✅ |
| ask | 43_current_datetime_from_prompt | ✅ |
| ask | 45_fetch_deployment_logs_simple | ✅ |
| ask | 51_logs_summarize_errors | ✅ |
| ask | 53_logs_find_term | ✅ |
| ask | 54_not_truncated_when_getting_pods | ✅ |
| ask | 59_label_based_counting | ✅ |
| ask | 60_count_less_than | ✅ |
| ask | 61_exact_match_counting | ✅ |
| ask | 63_fetch_error_logs_no_errors | ✅ |
| ask | 79_configmap_mount_issue | ✅ |
| ask | 83_secret_not_found | ✅ |
| ask | 86_configmap_like_but_secret | ✅ |
| ask | 93_calling_datadog[0] | ✅ |
| ask | 93_calling_datadog[1] | ✅ |
| ask | 93_calling_datadog[2] | ✅ |

Legend

  • ✅ the test was successful
  • ➖ the test was skipped
  • ⚠️ the test failed but is known to be flaky or known to fail
  • 🚧 the test had a setup failure (not a code regression)
  • 🔧 the test failed due to mock data issues (not a code regression)
  • 🚫 the test was throttled by API rate limits/overload
  • ❌ the test failed and should be fixed before merging the PR

@Sheeproid Sheeproid merged commit 2fa86b9 into master Oct 23, 2025
8 checks passed
@Sheeproid Sheeproid deleted the eval-market branch October 23, 2025 07:19
3 participants