chore: extract comparison into separate tool #2632
Conversation
```diff
 case (a: Array[_], b: Array[_]) =>
   a.length == b.length && a.zip(b).forall(x => same(x._1, x._2))
-case (a: WrappedArray[_], b: WrappedArray[_]) =>
+case (a: mutable.WrappedArray[_], b: mutable.WrappedArray[_]) =>
```
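The array branch quoted above recurses into each pair of elements. A minimal Python sketch of the same idea, for illustration only (the helper name `same` follows the Scala snippet; the rest is an assumption, not the PR's code):

```python
def same(a, b):
    # Mirror the Array/WrappedArray cases: when both sides are list-like,
    # require equal length and element-wise equality, recursing so that
    # nested arrays are also compared structurally.
    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        return len(a) == len(b) and all(same(x, y) for x, y in zip(a, b))
    # Fall back to plain equality for scalar values.
    return a == b
```

The recursion is what makes nested array columns compare structurally rather than by reference.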
Moved it from #2614.
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #2632      +/-   ##
============================================
+ Coverage   56.12%   59.21%      +3.08%
- Complexity    976     1449        +473
============================================
  Files         119      147         +28
  Lines       11743    13755       +2012
  Branches     2251     2365        +114
============================================
+ Hits         6591     8145       +1554
- Misses       4012     4387        +375
- Partials     1140     1223         +83
```

View full report in Codecov by Sentry.
I don't think that we should have a combined fuzz-testing-and-tpc-benchmark tool. They serve quite different purposes. I think it would be better to move the DataFrame comparison logic into a shared class somewhere and then update our benchmarking tool to be able to use it. This probably means that we need to convert our benchmark script from Python to Scala. |
Another option would be to update the existing Python benchmark script to save query results to Parquet, and then implement a command-line tool for comparing the Parquet files produced from the Spark and Comet runs. |
I created #2640 to add a new option to the benchmark script, to write query results to Parquet. |
Right, this option looks better IMO, since we can have a command-line utility similar to the fuzzer and reuse the comparison logic. We still need this PR in some form, as it includes refactoring that makes the comparison reusable.
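The flow being discussed, comparing query results that the Spark and Comet runs wrote out separately, could be sketched roughly as follows. This is a hedged sketch operating on already-collected rows (the Parquet reading is left out so it stays self-contained); `compare_results` and its tolerance parameter are illustrative names, not the PR's actual API:

```python
def compare_results(spark_rows, comet_rows, epsilon=1e-9):
    """Compare two query result sets row by row.

    Rows are sorted first because Parquet output carries no ordering
    guarantee; floats are compared with a small tolerance since Comet
    and Spark may differ in the last bits of floating-point results.
    """
    if len(spark_rows) != len(comet_rows):
        return False
    left_sorted = sorted(map(tuple, spark_rows))
    right_sorted = sorted(map(tuple, comet_rows))
    for left, right in zip(left_sorted, right_sorted):
        if len(left) != len(right):
            return False
        for a, b in zip(left, right):
            if isinstance(a, float) and isinstance(b, float):
                if abs(a - b) > epsilon:
                    return False
            elif a != b:
                return False
    return True
```

Sorting before comparing is the key design choice here: without an ORDER BY, the two engines are free to emit rows in different orders.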
Force-pushed from 9cee835 to f381c3d.
```diff
 output_path = f"{write_path}/q{query}"
 df.coalesce(1).write.mode("overwrite").parquet(output_path)
 print(f"Query {query} results written to {output_path}")
+if len(df.columns) > 0:
```
Spark complains when saving a DataFrame with an empty schema; this can happen for DDL statements, which came up in the TPC query sets.
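The guard described in the comment above amounts to a simple pre-write check. A plain-Python stand-in (the helper name is hypothetical; in the real script the check is inlined on the DataFrame's columns):

```python
def should_write(columns):
    # DDL statements (e.g. CREATE VIEW) produce a result with an empty
    # schema; Spark raises an error when such a DataFrame is written to
    # Parquet, so the write is skipped in that case.
    return len(columns) > 0
```

For a regular query result with columns the write proceeds; for a DDL statement's empty schema it is skipped.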
```scala
  verify()
}

object ComparisonToolMain {
```
Could this just be named ComparisonTool?
```scala
// Read Comet parquet files
val cometDf = spark.read.parquet(cometSubfolderPath.getAbsolutePath)
val cometRows = cometDf.collect()
val cometPlan = cometDf.queryExecution.executedPlan.toString
```
I'm not sure why we need to do anything with the plans for reading the Parquet files. Shouldn't we just be comparing the data in the Parquet files?
True, the comparison has nothing to do with the plans. The plans only need to be displayed when an assertion fails down the road. Let me think about whether I can get rid of it.
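In other words, the plans are diagnostics attached to the failure message, not inputs to the comparison. A minimal sketch of that separation (the function and its signature are illustrative, not the tool's actual code):

```python
def assert_same(spark_rows, comet_rows, spark_plan, comet_plan):
    # The plans play no part in the comparison itself; they are attached
    # to the error message only when the assertion fails, so the user can
    # see which physical plans produced the mismatching results.
    if spark_rows != comet_rows:
        raise AssertionError(
            f"Results differ.\nSpark plan:\n{spark_plan}\nComet plan:\n{comet_plan}"
        )
```

On success nothing is printed; only on mismatch do the two plans appear in the raised error.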
LGTM. Thanks @comphead
Which issue does this PR close?
Related: #2614, #2611.
Rationale for this change
Extract the comparison into a separate tool that can run against already-generated Comet and Spark results. Added schema comparison and fixed minor bugs in the runner.
What changes are included in this PR?
How are these changes tested?