Continuous Benchmarking #1033
Conversation
PR Reviewer Guide 🔍
Here are some key observations to aid the review process:
High-level Suggestion
The current benchmark data update process in cont-bench.yml is complex and fragile, using both a GitHub Action and manual sed commands. This should be consolidated into a single, robust mechanism, ideally by extending the Python toolchain to handle data transformations before pushing. [High-level, importance: 9]
Solution Walkthrough:
Before:

```yaml
# in .github/workflows/cont-bench.yml
# ...
- name: Post Benchmark Results
  uses: benchmark-action/github-action-benchmark@v1
  with:
    # ... (configuration)
    auto-push: true
    gh-repository: 'github.com/MFlowCode/MFlowCode.github.io'
    benchmark-data-dir-path: 'documentation'
- name: Polishing Results
  run: |
    # ...
    git clone "${{ secrets.DOC_PUSH_URL }}" ../www
    sed -i 's/"unit": "s\/iter"/"unit": "s"/g; s/cpu: /runtime: /g; ...' ../www/documentation/data.js
    sed -i "s/${GITHUB_SHA::7}/$TAG/g" ../www/documentation/data.js
    git -C ../www add -A
    git -C ../www commit -m "..." || true
    git -C ../www push
```
After:

```yaml
# in .github/workflows/cont-bench.yml
# ...
- name: Post and Polish Benchmark Results
  uses: benchmark-action/github-action-benchmark@v1
  with:
    # ... (configuration)
    auto-push: true
    gh-repository: 'github.com/MFlowCode/MFlowCode.github.io'
    benchmark-data-dir-path: 'documentation'
    # Use a script to post-process the benchmark output before it's committed
    # This script would be part of the MFC toolchain
    post-process-script: './mfc.sh bench_polish_results'
```

```python
# in toolchain/mfc/bench.py
def polish_results():
    # Load data.js
    # Perform transformations in Python (robustly),
    # e.g., replace units, keys, and tags
    # Save the modified data.js
```

The action would then commit and push this polished file.
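As a concrete illustration, here is a minimal sketch of such a helper, assuming data.js is the file written by github-action-benchmark and that the transformations simply mirror the existing sed rules; the function signature and path handling are hypothetical and not part of this PR:

```python
# Hypothetical sketch (not part of this PR): the sed rules expressed in pure Python.
from pathlib import Path

def polish_results(data_js_path: str, short_sha: str, tag: str) -> None:
    """Rewrite units/labels in data.js and replace the short SHA with a release tag."""
    path = Path(data_js_path)
    text = path.read_text()

    # Same substitutions as the current sed invocation, applied in one place.
    replacements = {
        '"unit": "s/iter"': '"unit": "s"',
        'cpu: ': 'runtime: ',
        'iterations: ': 'time steps: ',
        short_sha: tag,
    }
    for old, new in replacements.items():
        text = text.replace(old, new)

    path.write_text(text)
```

Plain string replacement keeps the behavior identical to the current sed commands while giving one place to extend with real JSON parsing later.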
```yaml
- name: Polishing Results
  run: |
    TAG="${{ github.event.inputs.tag || github.ref_name }}"
    echo "tag=$TAG" >> $GITHUB_OUTPUT
    echo "TAG=$TAG" >> $GITHUB_ENV
    set +e
    git ls-remote "${{ secrets.DOC_PUSH_URL }}" -q
    if [ "$?" -ne "0" ]; then exit 0; fi
    set -e
    git config --global user.name 'MFC Action'
    git config --global user.email '<>'
    git clone "${{ secrets.DOC_PUSH_URL }}" ../www
    sed -i 's/"unit": "s\/iter"/"unit": "s"/g; s/cpu: /runtime: /g; s/iterations: /time steps: /g' ../www/documentation/data.js
    sed -i "s/${GITHUB_SHA::7}/$TAG/g" ../www/documentation/data.js
    git -C ../www add -A
    git -C ../www commit -m "Docs @ ${GITHUB_SHA::7}" || true
    git -C ../www push
```
Suggestion: To fix a race condition in the Polishing Results step, implement a retry loop for the git push command. Before pushing, pull the latest changes with rebase to handle concurrent updates from other jobs. [possible issue, importance: 8]
Suggested change:

```yaml
- name: Polishing Results
  run: |
    TAG="${{ github.event.inputs.tag || github.ref_name }}"
    echo "tag=$TAG" >> $GITHUB_OUTPUT
    echo "TAG=$TAG" >> $GITHUB_ENV
    set +e
    git ls-remote "${{ secrets.DOC_PUSH_URL }}" -q
    if [ "$?" -ne "0" ]; then exit 0; fi
    set -e
    git config --global user.name 'MFC Action'
    git config --global user.email '<>'
    git clone "${{ secrets.DOC_PUSH_URL }}" ../www
    sed -i 's/"unit": "s\/iter"/"unit": "s"/g; s/cpu: /runtime: /g; s/iterations: /time steps: /g' ../www/documentation/data.js
    sed -i "s/${GITHUB_SHA::7}/$TAG/g" ../www/documentation/data.js
    git -C ../www add -A
    git -C ../www commit -m "Docs @ ${GITHUB_SHA::7}" || true
    for i in {1..5}; do
      git -C ../www pull --rebase && git -C ../www push && break
      sleep 5
    done
```
```python
try:
    with open(ARG("file"), 'r') as f:
        data_json = json.load(f)
    cb_test = ARG("name")
    if "entries" in data_json and cb_test in data_json["entries"]:
        benchmark_runs = data_json["entries"][cb_test]
        case_times = {}
        for run in benchmark_runs:
            for benches in run["benches"]:
                case_name = benches.get("name")
                grind_value = benches.get("value")
                if (case_name is None or case_name not in case_times) or grind_value is None:
                    case_times[case_name] = []
                case_times[case_name].append(grind_value)
        for case_name, values in case_times.items():
            avg = sum(values) / len(values)
            cb_stats[case_name] = {"avg": avg, "count": len(values)}
        cons.print(f"[bold]Loaded cb data for test: [bold]{cb_test}[/bold] ({len(cb_stats)} cases)[/bold]")
    else:
        cons.print(f"[bold yellow]Warning[/bold yellow]: Test '[bold]{cb_test}[/bold]' not found in data file.")
except Exception as e:
    cons.print(f"[bold yellow]Warning[/bold yellow]: Could not load data file: {e}")
```
Suggestion: Fix the benchmark averaging logic by initializing the case_times[case_name] list only when a case_name is first encountered. This prevents overwriting historical data and ensures the average is calculated correctly across all runs. [possible issue, importance: 9]
Suggested change:

```python
try:
    with open(ARG("file"), 'r') as f:
        data_json = json.load(f)
    cb_test = ARG("name")
    if "entries" in data_json and cb_test in data_json["entries"]:
        benchmark_runs = data_json["entries"][cb_test]
        case_times = {}
        for run in benchmark_runs:
            for benches in run["benches"]:
                case_name = benches.get("name")
                grind_value = benches.get("value")
                if case_name is None or grind_value is None:
                    continue
                if case_name not in case_times:
                    case_times[case_name] = []
                case_times[case_name].append(grind_value)
        for case_name, values in case_times.items():
            if not values:
                continue
            avg = sum(values) / len(values)
            cb_stats[case_name] = {"avg": avg, "count": len(values)}
        cons.print(f"[bold]Loaded cb data for test: [bold]{cb_test}[/bold] ({len(cb_stats)} cases)[/bold]")
    else:
        cons.print(f"[bold yellow]Warning[/bold yellow]: Test '[bold]{cb_test}[/bold]' not found in data file.")
except Exception as e:
    cons.print(f"[bold yellow]Warning[/bold yellow]: Could not load data file: {e}")
```
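As a quick check of the corrected aggregation, here is a small self-contained example with fabricated sample data in the same shape as the data.js entries, showing that values for the same case accumulate across runs instead of being reset:

```python
# Fabricated sample data, shaped like data_json["entries"][<test>] above.
benchmark_runs = [
    {"benches": [{"name": "case_a", "value": 1.0}, {"name": "case_b", "value": 0.8}]},
    {"benches": [{"name": "case_a", "value": 1.5}, {"name": "case_b", "value": None}]},
]

case_times = {}
for run in benchmark_runs:
    for bench in run["benches"]:
        name, value = bench.get("name"), bench.get("value")
        if name is None or value is None:   # skip incomplete entries instead of resetting
            continue
        case_times.setdefault(name, []).append(value)

cb_stats = {name: {"avg": sum(v) / len(v), "count": len(v)} for name, v in case_times.items()}
print(cb_stats)
# {'case_a': {'avg': 1.25, 'count': 2}, 'case_b': {'avg': 0.8, 'count': 1}}
```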
Pull Request Overview
This PR implements continuous benchmarking infrastructure for MFC, allowing comparison of benchmark results against historical data. The system integrates with GitHub Actions to track performance over time and detect regressions.
Key changes:
- Added functionality to load and compare benchmark results against historical continuous benchmarking data stored in data.js
- Enhanced the bench_diff command to accept optional file and test name parameters for historical comparison
- Created a new continuous benchmarking workflow that runs benchmarks and stores results
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| toolchain/mfc/bench.py | Added logic to load historical benchmark data from JSON and display comparisons in the diff output |
| toolchain/mfc/args.py | Added --file and --name arguments to bench_diff command; adjusted name slugification logic |
| data.js | New JSON file containing historical benchmark data for CPU and GPU tests |
| bench.yaml | Example benchmark results file in YAML format |
| .github/workflows/docs.yml | Added restoration of data.js file to preserve benchmark history |
| .github/workflows/cont-bench.yml | New workflow for continuous benchmarking automation |
| .github/workflows/bench.yml | Updated to fetch and use historical data.js for comparisons |
```python
try:
    with open(ARG("file"), 'r') as f:
        data_json = json.load(f)
    cb_test = ARG("name")
    if "entries" in data_json and cb_test in data_json["entries"]:
        benchmark_runs = data_json["entries"][cb_test]
        case_times = {}
        for run in benchmark_runs:
            for benches in run["benches"]:
                case_name = benches.get("name")
                grind_value = benches.get("value")
                if (case_name is None or case_name not in case_times) or grind_value is None:
                    case_times[case_name] = []
                case_times[case_name].append(grind_value)
        for case_name, values in case_times.items():
            avg = sum(values) / len(values)
            cb_stats[case_name] = {"avg": avg, "count": len(values)}
        cons.print(f"[bold]Loaded cb data for test: [bold]{cb_test}[/bold] ({len(cb_stats)} cases)[/bold]")
    else:
        cons.print(f"[bold yellow]Warning[/bold yellow]: Test '[bold]{cb_test}[/bold]' not found in data file.")
except Exception as e:
```
Copilot AI · Nov 10, 2025
When ARG('file') returns None (when --file is not provided), this will raise a TypeError: 'expected str, bytes or os.PathLike object, not NoneType'. The try-except catches this but silently continues. Consider checking if ARG('file') is None before attempting to open the file, or use ARG('file', None) and explicitly check for None to make the intent clearer.
Suggested change:

```python
cb_file = ARG("file")
if cb_file is not None:
    try:
        with open(cb_file, 'r') as f:
            data_json = json.load(f)
        cb_test = ARG("name")
        if "entries" in data_json and cb_test in data_json["entries"]:
            benchmark_runs = data_json["entries"][cb_test]
            case_times = {}
            for run in benchmark_runs:
                for benches in run["benches"]:
                    case_name = benches.get("name")
                    grind_value = benches.get("value")
                    if (case_name is None or case_name not in case_times) or grind_value is None:
                        case_times[case_name] = []
                    case_times[case_name].append(grind_value)
            for case_name, values in case_times.items():
                avg = sum(values) / len(values)
                cb_stats[case_name] = {"avg": avg, "count": len(values)}
            cons.print(f"[bold]Loaded cb data for test: [bold]{cb_test}[/bold] ({len(cb_stats)} cases)[/bold]")
        else:
            cons.print(f"[bold yellow]Warning[/bold yellow]: Test '[bold]{cb_test}[/bold]' not found in data file.")
    except Exception as e:
        cons.print(f"[bold yellow]Warning[/bold yellow]: Could not load cb data file: {e}")
else:
    cons.print(f"[bold yellow]Warning[/bold yellow]: No --file argument provided; skipping cb data load.")
```
```python
if (case_name is None or case_name not in case_times) or grind_value is None:
    case_times[case_name] = []
case_times[case_name].append(grind_value)
```
Copilot AI · Nov 10, 2025
The logic is inverted. When case_name is None or grind_value is None, you're initializing an empty list. However, you should skip adding the value or handle the None case properly. If case_name is None, using it as a dictionary key will cause issues. The condition should be 'if case_name is not None and grind_value is not None' to append values, or explicitly handle None cases.
Suggested change:

```python
if case_name is not None and grind_value is not None:
    if case_name not in case_times:
        case_times[case_name] = []
    case_times[case_name].append(grind_value)
```
```python
offset_pct = (rhs_grind - avg) / avg * 100
color = "red" if (offset_pct) > 0 else "yellow" if (offset_pct) == 0 else "green"
```
Copilot AI · Nov 10, 2025
The parentheses around 'offset_pct' are unnecessary and the comparison semantics are unclear. Red for positive offset means slower performance (higher grind time), but this should be documented. Consider adding a comment explaining that positive offset means degradation, or use more descriptive variable names like 'performance_degradation_pct'.
Suggested change:

```python
# Positive offset_pct means grind time is higher than average (performance degradation)
offset_pct = (rhs_grind - avg) / avg * 100
color = "red" if offset_pct > 0 else "yellow" if offset_pct == 0 else "green"
```
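To make the sign convention concrete, a tiny worked example with made-up numbers:

```python
# Made-up numbers: a positive offset means the PR's grind time exceeds the historical average.
avg, rhs_grind = 10.0, 10.5
offset_pct = (rhs_grind - avg) / avg * 100   # +5.0
color = "red" if offset_pct > 0 else "yellow" if offset_pct == 0 else "green"
print(f"{offset_pct:+.1f}% -> {color}")      # prints "+5.0% -> red", i.e. a slowdown
```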
```yaml
    wait %1 && wait %2
- name: Bench (Master v. PR)
  run: |
    (cd pr && bash .github/workflows/${{ matrix.cluster }}/submit-bench.sh .github/workflows/${{ matrix.cluster }}/bench.sh ${{ matrix.device }} ${{ matrix.interface }}) &
    (cd master && bash .github/workflows/${{ matrix.cluster }}/submit-bench.sh .github/workflows/${{ matrix.cluster }}/bench.sh ${{ matrix.device }} ${{ matrix.interface }}) &
    wait %1 && wait %2
```
Copilot AI · Nov 10, 2025
The 'wait %1 && wait %2' construct relies on bash job-control jobspecs, which can behave unexpectedly in non-interactive CI shells, and the '&&' short-circuits: if the first job exits non-zero, the second is never waited on. It is more robust to use 'wait' without arguments to wait for all background jobs, or to capture the process IDs and wait on them explicitly.
Suggested change:

```yaml
    wait
- name: Bench (Master v. PR)
  run: |
    (cd pr && bash .github/workflows/${{ matrix.cluster }}/submit-bench.sh .github/workflows/${{ matrix.cluster }}/bench.sh ${{ matrix.device }} ${{ matrix.interface }}) &
    (cd master && bash .github/workflows/${{ matrix.cluster }}/submit-bench.sh .github/workflows/${{ matrix.cluster }}/bench.sh ${{ matrix.device }} ${{ matrix.interface }}) &
    wait
```
```yaml
  run: |
    (cd pr && bash .github/workflows/${{ matrix.cluster }}/submit-bench.sh .github/workflows/${{ matrix.cluster }}/bench.sh ${{ matrix.device }} ${{ matrix.interface }}) &
    (cd master && bash .github/workflows/${{ matrix.cluster }}/submit-bench.sh .github/workflows/${{ matrix.cluster }}/bench.sh ${{ matrix.device }} ${{ matrix.interface }}) &
    wait %1 && wait %2
```
Copilot AI · Nov 10, 2025
Same issue as above: 'wait %1 && wait %2' is fragile here. Use 'wait' alone or capture PIDs and wait for them explicitly.
Suggested change: replace 'wait %1 && wait %2' with a plain 'wait'.
```yaml
- name: Prep Benchmark Results
  run: |
    # should fill in the template in pr/cont-bench-template.json
```
Copilot AI · Nov 10, 2025
This step contains only a TODO comment and performs no actual work. Either implement the functionality or remove the step if it's not needed yet. Leaving empty placeholder steps in production workflows can be confusing.
Suggested change: remove the placeholder 'Prep Benchmark Results' step until it is implemented.
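If the step is kept instead, one possible shape for it is sketched below. This is purely an illustration: the real structure of pr/cont-bench-template.json is not shown in this PR, so the entry layout (a github-action-benchmark style list of name/unit/value objects) and the function name are assumptions.

```python
# Hypothetical sketch only: pr/cont-bench-template.json's actual structure is not part of this PR.
import json

def fill_template(template_path: str, output_path: str, results: dict) -> None:
    """Insert measured grind times into a template assumed to be a list of
    {"name": ..., "unit": ..., "value": ...} entries."""
    with open(template_path) as f:
        entries = json.load(f)

    for entry in entries:
        name = entry.get("name")
        if name in results:           # leave unknown entries untouched
            entry["value"] = results[name]

    with open(output_path, "w") as f:
        json.dump(entries, f, indent=2)
```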
```python
elif args["command"] == "bench_diff" and len(args["name"]) > 0:
    args["name"] = " ".join(args["name"])
```
Copilot AI · Nov 10, 2025
If args['name'] is not a list (e.g., if it has a default value of empty list or None), this will fail. Additionally, accessing args['name'] without checking if the key exists could raise KeyError. Consider using args.get('name', []) and verifying it's a list before calling len() and join().
Suggested change:

```python
elif args["command"] == "bench_diff":
    name_val = args.get("name", [])
    if isinstance(name_val, list) and len(name_val) > 0:
        args["name"] = " ".join(name_val)
```
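For context on why the value is a list at all, here is a minimal sketch of the argparse behavior, assuming bench_diff declares --name with nargs='*' (the actual declaration in toolchain/mfc/args.py is not shown in this diff and may differ):

```python
# Minimal sketch; the real parser lives in toolchain/mfc/args.py and may differ.
import argparse

parser = argparse.ArgumentParser(prog="bench_diff")
parser.add_argument("-n", "--name", nargs="*", default=[])  # assumed declaration

args = vars(parser.parse_args(["--name", "CPU", "Benchmarks"]))
print(args["name"])                 # ['CPU', 'Benchmarks'], a list, hence the join below
name_val = args.get("name", [])
if isinstance(name_val, list) and name_val:
    args["name"] = " ".join(name_val)
print(args["name"])                 # 'CPU Benchmarks'
```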
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##           master    #1033   +/-   ##
=======================================
  Coverage   46.02%   46.02%
=======================================
  Files          67       67
  Lines       13437    13437
  Branches     1550     1550
=======================================
  Hits         6185     6185
  Misses       6362     6362
  Partials      890      890
```

☔ View full report in Codecov by Sentry.
User description
Description
Enhanced Framework to Continuously Benchmark MFC.
More details to be added.
Initial PR: #936
PR Type
Enhancement
Description
- Add continuous benchmarking workflow triggered on PR reviews
- Enhance bench_diff command with historical data comparison
- Support loading benchmark data from JSON file for trend analysis
- Improve argument parsing for bench_diff with file and test name options
- Preserve data.js during documentation updates
File Walkthrough
| File | Description |
|---|---|
| toolchain/mfc/args.py | Add bench_diff arguments for data file and test name: -f/--file to specify the data file path, -n/--name to specify the test name |
| toolchain/mfc/bench.py | Integrate historical data loading and comparison |
| .github/workflows/bench.yml | Enable continuous benchmarking in bench workflow |
| .github/workflows/cont-bench.yml | Create continuous benchmarking workflow |
| .github/workflows/docs.yml | Preserve benchmark data during doc updates |
| bench.yaml | Add sample benchmark results file |
| data.js | Add historical benchmark data file |