Continuous Benchmarking #1033
Conversation
PR Reviewer Guide 🔍
Here are some key observations to aid the review process:
High-level Suggestion
The current benchmark data update process in cont-bench.yml is complex and fragile, using both a GitHub Action and manual sed commands. This should be consolidated into a single, robust mechanism, ideally by extending the Python toolchain to handle data transformations before pushing. [High-level, importance: 9]
Solution Walkthrough:
Before:

```yaml
# in .github/workflows/cont-bench.yml
# ...
- name: Post Benchmark Results
  uses: benchmark-action/github-action-benchmark@v1
  with:
    # ... (configuration)
    auto-push: true
    gh-repository: 'github.com/MFlowCode/MFlowCode.github.io'
    benchmark-data-dir-path: 'documentation'
- name: Polishing Results
  run: |
    # ...
    git clone "${{ secrets.DOC_PUSH_URL }}" ../www
    sed -i 's/"unit": "s\/iter"/"unit": "s"/g; s/cpu: /runtime: /g; ...' ../www/documentation/data.js
    sed -i "s/${GITHUB_SHA::7}/$TAG/g" ../www/documentation/data.js
    git -C ../www add -A
    git -C ../www commit -m "..." || true
    git -C ../www push
```
After:

```yaml
# in .github/workflows/cont-bench.yml
# ...
- name: Post and Polish Benchmark Results
  uses: benchmark-action/github-action-benchmark@v1
  with:
    # ... (configuration)
    auto-push: true
    gh-repository: 'github.com/MFlowCode/MFlowCode.github.io'
    benchmark-data-dir-path: 'documentation'
    # Use a script to post-process the benchmark output before it's committed
    # This script would be part of the MFC toolchain
    post-process-script: './mfc.sh bench_polish_results'
```

```python
# in toolchain/mfc/bench.py
def polish_results():
    # Load data.js
    # Perform transformations in Python (robustly),
    # e.g., replace units, keys, and tags
    # Save the modified data.js
```

The action would then commit and push this polished file.
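As a concrete illustration, here is a minimal sketch of such a helper, assuming data.js is the file written by github-action-benchmark and that the transformations simply mirror the existing sed rules; the function signature and path handling are hypothetical and not part of this PR:

```python
# Hypothetical sketch (not part of this PR): the sed rules expressed in pure Python.
from pathlib import Path

def polish_results(data_js_path: str, short_sha: str, tag: str) -> None:
    """Rewrite units/labels in data.js and replace the short SHA with a release tag."""
    path = Path(data_js_path)
    text = path.read_text()

    # Same substitutions as the current sed invocation, applied in one place.
    replacements = {
        '"unit": "s/iter"': '"unit": "s"',
        'cpu: ': 'runtime: ',
        'iterations: ': 'time steps: ',
        short_sha: tag,
    }
    for old, new in replacements.items():
        text = text.replace(old, new)

    path.write_text(text)
```

Plain string replacement keeps the behavior identical to the current sed commands while giving one place to extend with real JSON parsing later.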
```yaml
- name: Polishing Results
  run: |
    TAG="${{ github.event.inputs.tag || github.ref_name }}"
    echo "tag=$TAG" >> $GITHUB_OUTPUT
    echo "TAG=$TAG" >> $GITHUB_ENV
    set +e
    git ls-remote "${{ secrets.DOC_PUSH_URL }}" -q
    if [ "$?" -ne "0" ]; then exit 0; fi
    set -e
    git config --global user.name 'MFC Action'
    git config --global user.email '<>'
    git clone "${{ secrets.DOC_PUSH_URL }}" ../www
    sed -i 's/"unit": "s\/iter"/"unit": "s"/g; s/cpu: /runtime: /g; s/iterations: /time steps: /g' ../www/documentation/data.js
    sed -i "s/${GITHUB_SHA::7}/$TAG/g" ../www/documentation/data.js
    git -C ../www add -A
    git -C ../www commit -m "Docs @ ${GITHUB_SHA::7}" || true
    git -C ../www push
```
Suggestion: To fix a race condition in the Polishing Results step, implement a retry loop for the git push command. Before pushing, pull the latest changes with rebase to handle concurrent updates from other jobs. [possible issue, importance: 8]
Suggested change:

```yaml
- name: Polishing Results
  run: |
    TAG="${{ github.event.inputs.tag || github.ref_name }}"
    echo "tag=$TAG" >> $GITHUB_OUTPUT
    echo "TAG=$TAG" >> $GITHUB_ENV
    set +e
    git ls-remote "${{ secrets.DOC_PUSH_URL }}" -q
    if [ "$?" -ne "0" ]; then exit 0; fi
    set -e
    git config --global user.name 'MFC Action'
    git config --global user.email '<>'
    git clone "${{ secrets.DOC_PUSH_URL }}" ../www
    sed -i 's/"unit": "s\/iter"/"unit": "s"/g; s/cpu: /runtime: /g; s/iterations: /time steps: /g' ../www/documentation/data.js
    sed -i "s/${GITHUB_SHA::7}/$TAG/g" ../www/documentation/data.js
    git -C ../www add -A
    git -C ../www commit -m "Docs @ ${GITHUB_SHA::7}" || true
    for i in {1..5}; do
      git -C ../www pull --rebase && git -C ../www push && break
      sleep 5
    done
```
```python
try:
    with open(ARG("file"), 'r') as f:
        data_json = json.load(f)
    cb_test = ARG("name")
    if "entries" in data_json and cb_test in data_json["entries"]:
        benchmark_runs = data_json["entries"][cb_test]
        case_times = {}
        for run in benchmark_runs:
            for benches in run["benches"]:
                case_name = benches.get("name")
                grind_value = benches.get("value")
                if (case_name is None or case_name not in case_times) or grind_value is None:
                    case_times[case_name] = []
                case_times[case_name].append(grind_value)
        for case_name, values in case_times.items():
            avg = sum(values) / len(values)
            cb_stats[case_name] = {"avg": avg, "count": len(values)}
        cons.print(f"[bold]Loaded cb data for test: [bold]{cb_test}[/bold] ({len(cb_stats)} cases)[/bold]")
    else:
        cons.print(f"[bold yellow]Warning[/bold yellow]: Test '[bold]{cb_test}[/bold]' not found in data file.")
except Exception as e:
    cons.print(f"[bold yellow]Warning[/bold yellow]: Could not load data file: {e}")
```
Suggestion: Fix the benchmark averaging logic by initializing the case_times[case_name] list only when a case_name is first encountered. This prevents overwriting historical data and ensures the average is calculated correctly across all runs. [possible issue, importance: 9]
Suggested change:

```python
try:
    with open(ARG("file"), 'r') as f:
        data_json = json.load(f)
    cb_test = ARG("name")
    if "entries" in data_json and cb_test in data_json["entries"]:
        benchmark_runs = data_json["entries"][cb_test]
        case_times = {}
        for run in benchmark_runs:
            for benches in run["benches"]:
                case_name = benches.get("name")
                grind_value = benches.get("value")
                if case_name is None or grind_value is None:
                    continue
                if case_name not in case_times:
                    case_times[case_name] = []
                case_times[case_name].append(grind_value)
        for case_name, values in case_times.items():
            if not values:
                continue
            avg = sum(values) / len(values)
            cb_stats[case_name] = {"avg": avg, "count": len(values)}
        cons.print(f"[bold]Loaded cb data for test: [bold]{cb_test}[/bold] ({len(cb_stats)} cases)[/bold]")
    else:
        cons.print(f"[bold yellow]Warning[/bold yellow]: Test '[bold]{cb_test}[/bold]' not found in data file.")
except Exception as e:
    cons.print(f"[bold yellow]Warning[/bold yellow]: Could not load data file: {e}")
```
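As a quick check of the corrected aggregation, here is a small self-contained example with fabricated sample data in the same shape as the data.js entries, showing that values for the same case accumulate across runs instead of being reset:

```python
# Fabricated sample data, shaped like data_json["entries"][<test>] above.
benchmark_runs = [
    {"benches": [{"name": "case_a", "value": 1.0}, {"name": "case_b", "value": 0.8}]},
    {"benches": [{"name": "case_a", "value": 1.5}, {"name": "case_b", "value": None}]},
]

case_times = {}
for run in benchmark_runs:
    for bench in run["benches"]:
        name, value = bench.get("name"), bench.get("value")
        if name is None or value is None:   # skip incomplete entries instead of resetting
            continue
        case_times.setdefault(name, []).append(value)

cb_stats = {name: {"avg": sum(v) / len(v), "count": len(v)} for name, v in case_times.items()}
print(cb_stats)
# {'case_a': {'avg': 1.25, 'count': 2}, 'case_b': {'avg': 0.8, 'count': 1}}
```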
Pull Request Overview
This PR implements continuous benchmarking infrastructure for MFC, allowing comparison of benchmark results against historical data. The system integrates with GitHub Actions to track performance over time and detect regressions.
Key changes:
- Added functionality to load and compare benchmark results against historical continuous benchmarking data stored in data.js
- Enhanced the bench_diff command to accept optional file and test name parameters for historical comparison
- Created a new continuous benchmarking workflow that runs benchmarks and stores results
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| toolchain/mfc/bench.py | Added logic to load historical benchmark data from JSON and display comparisons in the diff output |
| toolchain/mfc/args.py | Added --file and --name arguments to bench_diff command; adjusted name slugification logic |
| data.js | New JSON file containing historical benchmark data for CPU and GPU tests |
| bench.yaml | Example benchmark results file in YAML format |
| .github/workflows/docs.yml | Added restoration of data.js file to preserve benchmark history |
| .github/workflows/cont-bench.yml | New workflow for continuous benchmarking automation |
| .github/workflows/bench.yml | Updated to fetch and use historical data.js for comparisons |
```python
try:
    with open(ARG("file"), 'r') as f:
        data_json = json.load(f)
    cb_test = ARG("name")
    if "entries" in data_json and cb_test in data_json["entries"]:
        benchmark_runs = data_json["entries"][cb_test]
        case_times = {}
        for run in benchmark_runs:
            for benches in run["benches"]:
                case_name = benches.get("name")
                grind_value = benches.get("value")
                if (case_name is None or case_name not in case_times) or grind_value is None:
                    case_times[case_name] = []
                case_times[case_name].append(grind_value)
        for case_name, values in case_times.items():
            avg = sum(values) / len(values)
            cb_stats[case_name] = {"avg": avg, "count": len(values)}
        cons.print(f"[bold]Loaded cb data for test: [bold]{cb_test}[/bold] ({len(cb_stats)} cases)[/bold]")
    else:
        cons.print(f"[bold yellow]Warning[/bold yellow]: Test '[bold]{cb_test}[/bold]' not found in data file.")
except Exception as e:
```
Copilot AI · Nov 10, 2025
When ARG('file') returns None (when --file is not provided), this will raise a TypeError: 'expected str, bytes or os.PathLike object, not NoneType'. The try-except catches this but silently continues. Consider checking if ARG('file') is None before attempting to open the file, or use ARG('file', None) and explicitly check for None to make the intent clearer.
Suggested change:

```python
cb_file = ARG("file")
if cb_file is not None:
    try:
        with open(cb_file, 'r') as f:
            data_json = json.load(f)
        cb_test = ARG("name")
        if "entries" in data_json and cb_test in data_json["entries"]:
            benchmark_runs = data_json["entries"][cb_test]
            case_times = {}
            for run in benchmark_runs:
                for benches in run["benches"]:
                    case_name = benches.get("name")
                    grind_value = benches.get("value")
                    if (case_name is None or case_name not in case_times) or grind_value is None:
                        case_times[case_name] = []
                    case_times[case_name].append(grind_value)
            for case_name, values in case_times.items():
                avg = sum(values) / len(values)
                cb_stats[case_name] = {"avg": avg, "count": len(values)}
            cons.print(f"[bold]Loaded cb data for test: [bold]{cb_test}[/bold] ({len(cb_stats)} cases)[/bold]")
        else:
            cons.print(f"[bold yellow]Warning[/bold yellow]: Test '[bold]{cb_test}[/bold]' not found in data file.")
    except Exception as e:
        cons.print(f"[bold yellow]Warning[/bold yellow]: Could not load cb data file: {e}")
else:
    cons.print(f"[bold yellow]Warning[/bold yellow]: No --file argument provided; skipping cb data load.")
```
```python
if (case_name is None or case_name not in case_times) or grind_value is None:
    case_times[case_name] = []
case_times[case_name].append(grind_value)
```
Copilot AI · Nov 10, 2025
The logic is inverted. When case_name is None or grind_value is None, you're initializing an empty list. However, you should skip adding the value or handle the None case properly. If case_name is None, using it as a dictionary key will cause issues. The condition should be 'if case_name is not None and grind_value is not None' to append values, or explicitly handle None cases.
Suggested change:

```python
if case_name is not None and grind_value is not None:
    if case_name not in case_times:
        case_times[case_name] = []
    case_times[case_name].append(grind_value)
```
```python
offset_pct = (rhs_grind - avg) / avg * 100
color = "red" if (offset_pct) > 0 else "yellow" if (offset_pct) == 0 else "green"
```
Copilot AI · Nov 10, 2025
The parentheses around 'offset_pct' are unnecessary and the comparison semantics are unclear. Red for positive offset means slower performance (higher grind time), but this should be documented. Consider adding a comment explaining that positive offset means degradation, or use more descriptive variable names like 'performance_degradation_pct'.
Suggested change:

```python
# Positive offset_pct means grind time is higher than average (performance degradation)
offset_pct = (rhs_grind - avg) / avg * 100
color = "red" if offset_pct > 0 else "yellow" if offset_pct == 0 else "green"
```
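To make the sign convention concrete, a tiny worked example with made-up numbers:

```python
# Made-up numbers: a positive offset means the PR's grind time exceeds the historical average.
avg, rhs_grind = 10.0, 10.5
offset_pct = (rhs_grind - avg) / avg * 100   # +5.0
color = "red" if offset_pct > 0 else "yellow" if offset_pct == 0 else "green"
print(f"{offset_pct:+.1f}% -> {color}")      # prints "+5.0% -> red", i.e. a slowdown
```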
```yaml
    wait %1 && wait %2
- name: Bench (Master v. PR)
  run: |
    (cd pr && bash .github/workflows/${{ matrix.cluster }}/submit-bench.sh .github/workflows/${{ matrix.cluster }}/bench.sh ${{ matrix.device }} ${{ matrix.interface }}) &
    (cd master && bash .github/workflows/${{ matrix.cluster }}/submit-bench.sh .github/workflows/${{ matrix.cluster }}/bench.sh ${{ matrix.device }} ${{ matrix.interface }}) &
    wait %1 && wait %2
```
Copilot AI · Nov 10, 2025
The 'wait %1 && wait %2' construct relies on bash job-control jobspecs, which can behave unexpectedly in non-interactive CI shells, and the '&&' short-circuits: if the first job exits non-zero, the second is never waited on. It is more robust to use 'wait' without arguments to wait for all background jobs, or to capture the process IDs and wait on them explicitly.
Suggested change:

```yaml
    wait
- name: Bench (Master v. PR)
  run: |
    (cd pr && bash .github/workflows/${{ matrix.cluster }}/submit-bench.sh .github/workflows/${{ matrix.cluster }}/bench.sh ${{ matrix.device }} ${{ matrix.interface }}) &
    (cd master && bash .github/workflows/${{ matrix.cluster }}/submit-bench.sh .github/workflows/${{ matrix.cluster }}/bench.sh ${{ matrix.device }} ${{ matrix.interface }}) &
    wait
```
```yaml
  run: |
    (cd pr && bash .github/workflows/${{ matrix.cluster }}/submit-bench.sh .github/workflows/${{ matrix.cluster }}/bench.sh ${{ matrix.device }} ${{ matrix.interface }}) &
    (cd master && bash .github/workflows/${{ matrix.cluster }}/submit-bench.sh .github/workflows/${{ matrix.cluster }}/bench.sh ${{ matrix.device }} ${{ matrix.interface }}) &
    wait %1 && wait %2
```
Copilot AI · Nov 10, 2025
Same issue as above: 'wait %1 && wait %2' is fragile here. Use 'wait' alone or capture PIDs and wait for them explicitly.
Suggested change: replace 'wait %1 && wait %2' with a plain 'wait'.
```yaml
- name: Prep Benchmark Results
  run: |
    # should fill in the template in pr/cont-bench-template.json
```
Copilot AI · Nov 10, 2025
This step contains only a TODO comment and performs no actual work. Either implement the functionality or remove the step if it's not needed yet. Leaving empty placeholder steps in production workflows can be confusing.
Suggested change: remove the placeholder 'Prep Benchmark Results' step until it is implemented.
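If the step is kept instead, one possible shape for it is sketched below. This is purely an illustration: the real structure of pr/cont-bench-template.json is not shown in this PR, so the entry layout (a github-action-benchmark style list of name/unit/value objects) and the function name are assumptions.

```python
# Hypothetical sketch only: pr/cont-bench-template.json's actual structure is not part of this PR.
import json

def fill_template(template_path: str, output_path: str, results: dict) -> None:
    """Insert measured grind times into a template assumed to be a list of
    {"name": ..., "unit": ..., "value": ...} entries."""
    with open(template_path) as f:
        entries = json.load(f)

    for entry in entries:
        name = entry.get("name")
        if name in results:           # leave unknown entries untouched
            entry["value"] = results[name]

    with open(output_path, "w") as f:
        json.dump(entries, f, indent=2)
```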
```python
elif args["command"] == "bench_diff" and len(args["name"]) > 0:
    args["name"] = " ".join(args["name"])
```
Copilot AI · Nov 10, 2025
If args['name'] is not a list (e.g., if it has a default value of empty list or None), this will fail. Additionally, accessing args['name'] without checking if the key exists could raise KeyError. Consider using args.get('name', []) and verifying it's a list before calling len() and join().
Suggested change:

```python
elif args["command"] == "bench_diff":
    name_val = args.get("name", [])
    if isinstance(name_val, list) and len(name_val) > 0:
        args["name"] = " ".join(name_val)
```
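For context on why the value is a list at all, here is a minimal sketch of the argparse behavior, assuming bench_diff declares --name with nargs='*' (the actual declaration in toolchain/mfc/args.py is not shown in this diff and may differ):

```python
# Minimal sketch; the real parser lives in toolchain/mfc/args.py and may differ.
import argparse

parser = argparse.ArgumentParser(prog="bench_diff")
parser.add_argument("-n", "--name", nargs="*", default=[])  # assumed declaration

args = vars(parser.parse_args(["--name", "CPU", "Benchmarks"]))
print(args["name"])                 # ['CPU', 'Benchmarks'], a list, hence the join below
name_val = args.get("name", [])
if isinstance(name_val, list) and name_val:
    args["name"] = " ".join(name_val)
print(args["name"])                 # 'CPU Benchmarks'
```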
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##           master    #1033   +/-   ##
=======================================
  Coverage   46.02%   46.02%
=======================================
  Files          67       67
  Lines       13437    13437
  Branches     1550     1550
=======================================
  Hits         6185     6185
  Misses       6362     6362
  Partials      890      890
```

☔ View full report in Codecov by Sentry.
User description
Description
Enhanced Framework to Continuously Benchmark MFC.
More details to be added.
Initial PR: #936
PR Type
Enhancement
Description
- Add continuous benchmarking workflow triggered on PR reviews
- Enhance bench_diff command with historical data comparison
- Support loading benchmark data from JSON file for trend analysis
- Improve argument parsing for bench_diff with file and test name options
- Preserve data.js during documentation updates
File Walkthrough
| File | Description |
|---|---|
| toolchain/mfc/args.py | Add bench_diff arguments for data file and test name: -f/--file to specify the data file path, -n/--name to specify the test name |
| toolchain/mfc/bench.py | Integrate historical data loading and comparison |
| .github/workflows/bench.yml | Enable continuous benchmarking in bench workflow |
| .github/workflows/cont-bench.yml | Create continuous benchmarking workflow |
| .github/workflows/docs.yml | Preserve benchmark data during doc updates |
| bench.yaml | Add sample benchmark results file |
| data.js | Add historical benchmark data file |