
Conversation

@Sheeproid

No description provided.

@Sheeproid Sheeproid requested a review from aantn October 21, 2025 15:20
@Sheeproid Sheeproid enabled auto-merge (squash) October 21, 2025 15:20

coderabbitai bot commented Oct 21, 2025

Walkthrough

Adds two new test fixtures (tests 160 and 161) with Kubernetes manifests, Prometheus configs, k6 load tests, validation scripts, and toolset definitions; updates the shared Python Flask test image to include prometheus-client, bumping its tag from 2.1 to 2.2; and tweaks one checkout test prompt.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Base image & build**<br>`tests/llm/fixtures/shared/python-flask-otel/Dockerfile`, `tests/llm/fixtures/shared/python-flask-otel/build.sh` | Add prometheus-client to the pip install and bump the image tag from 2.1 to 2.2; update build output notes. |
| **Checkout latency test**<br>`tests/llm/fixtures/test_ask_holmes/124_checkout_latency_prometheus/test_case.yaml` | Clarify the user prompt to reference the /checkout endpoint in app-124. |
| **Electricity market bidding bug — app**<br>`tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidder-app.yaml` | Add a Flask bidder service (inline app) with Prometheus metrics and NordPool-specific bug logic (after 100 NordPool requests, always bid), plus a Deployment, Service, and k6 Job. |
| **Electricity market bidding bug — config & tests**<br>`tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidding_system.md`, `.../prometheus-config.yaml`, `.../run-bidder-check.sh`, `.../test_case.yaml`, `.../toolsets.yaml` | Add docs, a Prometheus ConfigMap (5s scrape/eval), a validation script that queries Prometheus and asserts rates/traffic, a test_case with setup/teardown, and toolsets enabling Prometheus metrics. |
| **Bidding version performance — apps (v1 & v2)**<br>`tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v1.yaml`, `.../bidder-v2.yaml` | Add two Secret-mounted Flask bidder variants (v1 fast, ~50ms; v2 slow, ~1.8–2.2s) exposing /healthz, /bid, and /metrics and registering Prometheus metrics; Deployments include probes and scrape annotations. |
| **Bidding version performance — infra & tests**<br>`tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/hpa.yaml`, `.../k6-v1-traffic.yaml`, `.../k6-v2-traffic.yaml`, `.../prometheus-config.yaml`, `.../test_case.yaml`, `.../toolsets.yaml`, `.../wait-for-scaling.sh` | Add an HPA (2–10 replicas, 50% CPU target), k6 Job Secrets and Jobs for v1/v2 load (100 req/s), a Prometheus ConfigMap (5s), a test_case with a staged upgrade workflow, toolsets config, and a scaling-wait helper script. |
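The NordPool-specific bug described in the walkthrough (after 100 NordPool requests the service always bids) boils down to a thread-safe counter gating the bid decision. A stdlib-only sketch of that logic — names like `BidDecider` and `should_bid` are illustrative, not the fixture's actual code:

```python
import threading

class BidDecider:
    """Sketch of the intentional bug: once a threshold of NordPool
    requests is crossed, every NordPool bid is accepted unconditionally."""

    def __init__(self, threshold=100):
        self.threshold = threshold
        self.nordpool_counter = 0      # counter-and-lock pattern noted in the review
        self.lock = threading.Lock()

    def should_bid(self, market, price, max_price):
        if market == "nordpool":
            with self.lock:
                self.nordpool_counter += 1
                if self.nordpool_counter > self.threshold:
                    return True        # bug: price check is skipped entirely
        return price <= max_price      # normal behaviour: bid only when affordable

decider = BidDecider(threshold=100)
# Before the threshold, an overpriced NordPool bid is rejected...
print(decider.should_bid("nordpool", price=120, max_price=100))  # False
for _ in range(100):
    decider.should_bid("nordpool", price=50, max_price=100)
# ...but after 100 more NordPool requests, every NordPool bid is accepted.
print(decider.should_bid("nordpool", price=120, max_price=100))  # True
```

This is the observable symptom the Prometheus validation script is meant to catch: the bid rate for NordPool traffic jumps to ~100% mid-run while other markets stay normal.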

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Runner as Test Runner
    participant K8s as Kubernetes
    participant Bidder as Bidder Service
    participant Prom as Prometheus
    participant K6 as k6 Job
    participant Script as Validation Script

    rect rgb(230, 240, 255)
    Note over Runner,K8s: Test 160 — deploy, load, validate
    end

    Runner->>K8s: create namespace app-160, apply Prometheus config
    Runner->>K8s: deploy bidder app (Deployment + Service)
    Runner->>K8s: wait for readiness
    Runner->>K8s: start k6 Job
    K6->>Bidder: send traffic (normal, NordPool surge, normal)
    Bidder->>Prom: expose metrics (/metrics)
    Runner->>Script: run run-bidder-check.sh
    Script->>Prom: query metrics (rates, traffic, bids/sec)
    Script-->>Runner: pass/fail result
    Runner->>K8s: cleanup namespace
```
```mermaid
sequenceDiagram
    participant Runner as Test Runner
    participant K8s as Kubernetes
    participant BidderV1 as Bidder v1
    participant BidderV2 as Bidder v2
    participant HPA as HorizontalPodAutoscaler
    participant Prom as Prometheus
    participant K6 as k6 Job
    participant Script as wait-for-scaling.sh

    rect rgb(230, 255, 230)
    Note over Runner,K8s: Test 161 — baseline, upgrade, observe scaling
    end

    Runner->>K8s: create namespace app-161, apply Prometheus config
    Runner->>K8s: deploy bidder v1 + HPA
    Runner->>K8s: start k6-v1 Job (100 rps)
    K6->>BidderV1: generate load (v1 ~50ms)
    BidderV1->>Prom: record metrics
    Runner->>K8s: rollout bidder v2
    Runner->>K8s: start k6-v2 Job (100 rps)
    K6->>BidderV2: generate load (v2 ~2s)
    BidderV2->>Prom: record metrics
    Runner->>Script: poll HPA / deployment until scale target
    Script-->>Runner: report scaling status
    Runner->>K8s: cleanup namespace
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Suggested labels

enhancement

Suggested reviewers

  • aantn
  • moshemorad

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description Check | ⚠️ Warning | No pull request description was provided by the author. While the check is intended to be lenient and only fail when descriptions are completely off-topic, the pass criterion requires that a description be "related in some way to the changeset"; an empty or missing description cannot satisfy this, as there is no content to establish any relationship to the changes. | Add a pull request description that outlines the purpose and scope of these new Prometheus metric tests: what scenarios are added (tests 160 and 161), what they test (Prometheus metrics in the electricity bidding market scenario), and any relevant context for reviewers. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title "Add prometheus metric tests based on a fictional electricity bidding market" directly and accurately describes the main changes: multiple new fixtures and test cases (tests 160 and 161) designed to test Prometheus metrics within a fictional electricity bidding market scenario, with corresponding Kubernetes manifests, scripts, and configurations. The title is clear, specific, and concise. |
| Docstring Coverage | ✅ Passed | No functions found in the changes; docstring coverage check skipped. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0f04277 and 1fea326.

📒 Files selected for processing (1)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/wait-for-scaling.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/wait-for-scaling.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Pre-commit checks
  • GitHub Check: llm_evals
  • GitHub Check: build



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (2)
tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidding_system.md (1)

2-2: Minor grammar improvement: use hyphenated compound adjective.

Line 2 uses "highest resolution" where "highest-resolution" would be more grammatically precise when modifying "possible".

```diff
- Problems with bid rate and traffic can change in a matter of seconds, use the highest resolution possible
+ Problems with bid rate and traffic can change in a matter of seconds, use the highest-resolution possible
```
tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/hpa.yaml (1)

1-38: LGTM! Consider test-specific deployment naming.

The HPA configuration is well-structured and correctly targets namespace app-161. Consider naming the target deployment bidder-161 instead of just bidder to follow the convention of including test IDs in resource names and ensure uniqueness across concurrent test runs.

This would require coordinating with bidder-v2.yaml (and bidder-v1.yaml if present) to rename the deployment from bidder to bidder-161.
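For context on what this HPA is expected to do when v2's ~2s handler saturates CPU: the autoscaler's core formula is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the HPA's bounds. A minimal sketch using the 2–10 replica, 50%-CPU-target configuration from this test:

```python
import math

def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct=50,
                     min_replicas=2, max_replicas=10):
    """Core HPA scaling rule: ceil(current * observed / target),
    clamped to the min/max replica bounds configured on the HPA."""
    desired = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, desired))

# Light load near the target barely moves the replica count...
print(desired_replicas(2, 55))    # 3
# ...while v2's slow handler pushing pods far past target scales out hard.
print(desired_replicas(2, 180))   # 8
print(desired_replicas(8, 200))   # 10 (clamped to maxReplicas)
```

The CPU percentages here are hypothetical illustrations; the real values depend on the pod resource requests and k6 load in the fixture.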

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c34af2d and 0f04277.

📒 Files selected for processing (18)
  • tests/llm/fixtures/shared/python-flask-otel/Dockerfile (2 hunks)
  • tests/llm/fixtures/shared/python-flask-otel/build.sh (2 hunks)
  • tests/llm/fixtures/test_ask_holmes/124_checkout_latency_prometheus/test_case.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidder-app.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidding_system.md (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/prometheus-config.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/run-bidder-check.sh (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/test_case.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/toolsets.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v1.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v2.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/hpa.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v1-traffic.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v2-traffic.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/prometheus-config.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/test_case.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/toolsets.yaml (1 hunks)
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/wait-for-scaling.sh (1 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
tests/**

📄 CodeRabbit inference engine (CLAUDE.md)

Test files should mirror the source structure under tests/

Files:

  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/toolsets.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/wait-for-scaling.sh
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/test_case.yaml
  • tests/llm/fixtures/shared/python-flask-otel/Dockerfile
  • tests/llm/fixtures/shared/python-flask-otel/build.sh
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/toolsets.yaml
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidding_system.md
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidder-app.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/prometheus-config.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v2.yaml
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/run-bidder-check.sh
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v2-traffic.yaml
  • tests/llm/fixtures/test_ask_holmes/124_checkout_latency_prometheus/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/prometheus-config.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/hpa.yaml
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v1.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v1-traffic.yaml
tests/llm/**/*.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

tests/llm/**/*.yaml: In eval manifests, ALWAYS use Kubernetes Secrets for scripts rather than inline manifests or ConfigMaps to prevent script exposure via kubectl describe
Eval resources must use neutral names, each test must use a dedicated namespace 'app-', and all pod names must be unique across tests

Files:

  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/toolsets.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/toolsets.yaml
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidder-app.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/prometheus-config.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v2.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v2-traffic.yaml
  • tests/llm/fixtures/test_ask_holmes/124_checkout_latency_prometheus/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/prometheus-config.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/hpa.yaml
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v1.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v1-traffic.yaml
tests/**/toolsets.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

In eval toolsets.yaml, ALL toolset-specific configuration must be under a 'config' field; do not place toolset config at the top level

Files:

  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/toolsets.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/toolsets.yaml
{holmes/plugins/toolsets/**/*.{yaml,yml},tests/**/toolsets.yaml}

📄 CodeRabbit inference engine (CLAUDE.md)

Only the following top-level fields are valid in toolset YAML: enabled, name, description, additional_instructions, prerequisites, tools, docs_url, icon_url, installation_instructions, config, url (MCP toolsets only)

Files:

  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/toolsets.yaml
  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/toolsets.yaml
tests/llm/**/*.sh

📄 CodeRabbit inference engine (CLAUDE.md)

When scripting kubectl operations in evals, never use a bare 'kubectl wait' immediately after creating resources; use a retry loop to avoid race conditions

Files:

  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/wait-for-scaling.sh
  • tests/llm/fixtures/shared/python-flask-otel/build.sh
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/run-bidder-check.sh
tests/**/test_case.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

Do NOT put toolset configuration directly in test_case.yaml; keep toolset config in a separate toolsets.yaml

Files:

  • tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/124_checkout_latency_prometheus/test_case.yaml
  • tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/test_case.yaml
🪛 Checkov (3.2.334)
tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidder-app.yaml

[medium] 150-210: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 150-210: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 350-396: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 350-396: Minimize the admission of root containers

(CKV_K8S_23)

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v2.yaml

[medium] 92-175: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 92-175: Minimize the admission of root containers

(CKV_K8S_23)

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v2-traffic.yaml

[medium] 70-103: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 70-103: Minimize the admission of root containers

(CKV_K8S_23)

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v1.yaml

[medium] 92-172: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 92-172: Minimize the admission of root containers

(CKV_K8S_23)

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v1-traffic.yaml

[medium] 70-103: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 70-103: Minimize the admission of root containers

(CKV_K8S_23)

🪛 LanguageTool
tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidding_system.md

[uncategorized] ~2-~2: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ... change in a matter of seconds, use the highest resolution possible - The total bid request traffi...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Pre-commit checks
  • GitHub Check: llm_evals
  • GitHub Check: build
🔇 Additional comments (15)
tests/llm/fixtures/shared/python-flask-otel/build.sh (1)

9-9: ✅ Version bump and documentation update look good.

The minor version bump (2.1 → 2.2) appropriately reflects the addition of Prometheus client support, and the informational echo statement clearly documents the new capability for users. The changes align well with the PR's Prometheus metric test objective.

Also applies to: 36-36

tests/llm/fixtures/shared/python-flask-otel/Dockerfile (1)

5-5: ✅ Prometheus client dependency addition is clean and correct.

The prometheus-client 0.21.1 dependency is properly pinned and compatible with Python 3.11. The comment clearly explains its purpose for Prometheus-based tests, and the line continuation formatting is correct. This change directly supports the PR's test infrastructure expansion.

Also applies to: 16-17

tests/llm/fixtures/test_ask_holmes/124_checkout_latency_prometheus/test_case.yaml (2)

2-2: Prompt change aligns with test objectives.

The user prompt now explicitly references the /checkout endpoint and namespace app-124 as mentioned in the enriched summary. This clarification should help focus the LLM analysis on the intended target.


21-31: No issues found. Manifests comply with coding guidelines.

Verification confirms that all referenced manifests properly handle configuration and secrets:

  • prometheus-config.yaml appropriately uses ConfigMap for static Prometheus configuration (not scripts)
  • shared/prometheus.yaml uses standard Kubernetes patterns with no embedded scripts
  • holmes-all-in-one-fast.yaml uses Secrets for sensitive data
  • Namespace naming (app-124) follows the app-<testid> convention
  • Toolset configuration is correctly isolated in a separate toolsets.yaml file
  • External shell scripts are executed from the hook, not embedded in manifests
tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/toolsets.yaml (1)

1-9: LGTM!

Toolsets configuration correctly places prometheus/metrics settings under a config field per guidelines.

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/toolsets.yaml (1)

1-9: LGTM!

Same structure as test 160's toolsets, properly configured with config field.

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/wait-for-scaling.sh (1)

1-42: Script correctly uses retry loop (not bare kubectl wait).

The polling implementation at lines 12-32 properly avoids the anti-pattern of bare kubectl wait and instead implements a robust retry loop with timeout handling.
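The retry-loop pattern this guideline asks for — polling instead of racing a bare `kubectl wait` against resources that may not exist yet — can be sketched generically. The command and timings below are illustrative, not the fixture's actual script:

```python
import subprocess
import time

def wait_for(check_cmd, timeout=120, interval=5):
    """Poll a command until it exits 0, rather than issuing a single
    'kubectl wait' that fails fast if the resource isn't created yet."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if subprocess.run(check_cmd, capture_output=True).returncode == 0:
            return True
        time.sleep(interval)
    return False

# Illustrative kubectl usage (hypothetical deployment name):
# wait_for(["kubectl", "-n", "app-161", "rollout", "status",
#           "deployment/bidder", "--timeout=1s"])
print(wait_for(["true"], timeout=5, interval=0.1))  # True
```

The key design point is that each iteration re-runs the full check, so a resource created a few seconds after the script starts is still picked up instead of causing an immediate failure.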

tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/prometheus-config.yaml (1)

1-17: LGTM!

Prometheus ConfigMap properly configured with dedicated namespace app-160 and correct in-cluster service discovery target.
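With this 5s scrape interval, the validation script's "bids/sec"-style assertions correspond to PromQL `rate()` over a counter. Conceptually (a stdlib-only sketch of the semantics, not the fixture's actual query code):

```python
def per_second_rate(samples):
    """Simplified take on PromQL rate(): the average per-second increase
    of a monotonically increasing counter across (timestamp, value) samples."""
    (t0, v0) = samples[0]
    (t1, v1) = samples[-1]
    return (v1 - v0) / (t1 - t0)

# A bid counter scraped every 5s: 100 -> 150 -> 210 total bids over 10s.
print(per_second_rate([(0, 100), (5, 150), (10, 210)]))  # 11.0
```

The short scrape interval matters here: `rate()` needs at least two samples inside its lookup window, so a 5s scrape lets the script use tight windows and catch bid-rate changes that, per the bidding_system.md docs, can occur in a matter of seconds.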

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/prometheus-config.yaml (1)

1-15: LGTM!

Prometheus ConfigMap properly configured with dedicated namespace app-161 and correct service targeting.

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v2.yaml (2)

123-125: Verify image availability and pull strategy.

The image uses imagePullPolicy: IfNotPresent, which will skip pulling if the image already exists locally. Ensure the correct version of python-flask-otel:2.2 is available in the test environment, or consider using Always if you want to pull the latest version.

Confirm that the image me-west1-docker.pkg.dev/robusta-development/development/python-flask-otel:2.2 exists and includes Prometheus client support (used at line 10).


1-90: LGTM! Properly uses Kubernetes Secrets for script storage.

The Flask app code is correctly stored in a Kubernetes Secret (bidder-app-v2) rather than inline in ConfigMap or manifest, following the guideline for eval resources.

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v1-traffic.yaml (1)

1-103: Excellent use of Secrets for k6 test script.

The manifest correctly uses a Kubernetes Secret to store the k6 test script rather than embedding it inline in the Job, following security best practices for eval fixtures. The script is well-structured with proper error checks (status 200, version validation) and metrics tracking via k6's Rate metric. Namespace app-161 and unique pod naming are properly scoped.

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/k6-v2-traffic.yaml (1)

1-103: Properly differentiated v2 load test with version-specific thresholds.

The v2 test correctly uses a more lenient threshold (p(95)<3000 vs v1's 200) and increased timeout (10s vs 5s), reflecting the expected slower performance of v2.0. Version validation and namespace isolation are correctly maintained.
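The p(95) thresholds being compared here are 95th-percentile latency cutoffs. As a quick illustration of what such a threshold asserts, a nearest-rank percentile over latency samples looks like this (the sample latencies are made up to match v1's ~50ms and v2's ~2s profiles):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: roughly what a k6 p(95) threshold evaluates."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical request latencies in ms.
v1 = [48, 52, 50, 49, 55, 51, 47, 53, 50, 60]
v2 = [1850, 2100, 1900, 2200, 2050, 1950, 2150, 1800, 2000, 2120]
print(percentile(v1, 95) < 200)    # True  -> v1 passes p(95)<200
print(percentile(v2, 95) < 3000)   # True  -> v2 passes its looser p(95)<3000
print(percentile(v2, 95) < 200)    # False -> v2 would fail v1's threshold
```

This is why the looser v2 threshold is the right call for the fixture: the test's goal is for v2 to run (slowly) and trigger scaling, not for k6 to abort on latency.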

tests/llm/fixtures/test_ask_holmes/160_electricity_market_bidding_bug/bidder-app.yaml (1)

1-396: Smart use of initContainers with health checks to ensure dependencies.

The bidder deployment correctly uses initContainers with curl-based health checks (lines 360-382) to poll for service readiness rather than relying on bare kubectl wait commands. This is a more robust pattern for ensuring ordering in test setup. The k6 script is properly stored in a Secret, and the Flask app implements the intentional NordPool bid-acceptance bug with clear thread-safe logic (the nordpool_counter and lock pattern at lines 48-97).

tests/llm/fixtures/test_ask_holmes/161_bidding_version_performance/bidder-v1.yaml (1)

1-186: Clean v1 deployment with pre-built image and solid observability.

The v1 bidder correctly uses a pre-built Python Flask image (rather than runtime pip install), includes startup/readiness/liveness probes, and populates Prometheus metrics with Downward API values (pod name, namespace) for proper k8s label hygiene. Processing time (40-60ms at line 55) appropriately differentiates v1's performance from v2. The 2-replica setup aligns with the HPA-driven scaling test scenario.

aantn previously approved these changes Oct 22, 2025

@aantn aantn left a comment


Looks great. Left one question out of curiosity, but no need to fix anything there. Great job.

@github-actions

Results of HolmesGPT evals

  • ask_holmes: 34/35 test cases were successful, 0 regressions, 1 setup failure
| Test suite | Test case | Status |
| --- | --- | --- |
| ask | 01_how_many_pods | ✅ |
| ask | 02_what_is_wrong_with_pod | ✅ |
| ask | 04_related_k8s_events | ✅ |
| ask | 05_image_version | ✅ |
| ask | 09_crashpod | ✅ |
| ask | 10_image_pull_backoff | ✅ |
| ask | 110_k8s_events_image_pull | ✅ |
| ask | 11_init_containers | ✅ |
| ask | 13a_pending_node_selector_basic | ✅ |
| ask | 14_pending_resources | ✅ |
| ask | 15_failed_readiness_probe | ✅ |
| ask | 17_oom_kill | ✅ |
| ask | 19_detect_missing_app_details | ✅ |
| ask | 20_long_log_file_search | ✅ |
| ask | 24_misconfigured_pvc | ✅ |
| ask | 24a_misconfigured_pvc_basic | ✅ |
| ask | 28_permissions_error | 🚧 |
| ask | 39_failed_toolset | ✅ |
| ask | 41_setup_argo | ✅ |
| ask | 42_dns_issues_steps_new_tools | ✅ |
| ask | 43_current_datetime_from_prompt | ✅ |
| ask | 45_fetch_deployment_logs_simple | ✅ |
| ask | 51_logs_summarize_errors | ✅ |
| ask | 53_logs_find_term | ✅ |
| ask | 54_not_truncated_when_getting_pods | ✅ |
| ask | 59_label_based_counting | ✅ |
| ask | 60_count_less_than | ✅ |
| ask | 61_exact_match_counting | ✅ |
| ask | 63_fetch_error_logs_no_errors | ✅ |
| ask | 79_configmap_mount_issue | ✅ |
| ask | 83_secret_not_found | ✅ |
| ask | 86_configmap_like_but_secret | ✅ |
| ask | 93_calling_datadog[0] | ✅ |
| ask | 93_calling_datadog[1] | ✅ |
| ask | 93_calling_datadog[2] | ✅ |

Legend

  • ✅ the test was successful
  • ➖ the test was skipped
  • ⚠️ the test failed but is known to be flaky or known to fail
  • 🚧 the test had a setup failure (not a code regression)
  • 🔧 the test failed due to mock data issues (not a code regression)
  • 🚫 the test was throttled by API rate limits/overload
  • ❌ the test failed and should be fixed before merging the PR

@Sheeproid Sheeproid merged commit 2fa86b9 into master Oct 23, 2025
8 checks passed
@Sheeproid Sheeproid deleted the eval-market branch October 23, 2025 07:19
3 participants