Fix ranking calculation for ties. #174
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@           Coverage Diff           @@
##              main     #174   +/-  ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           10        10
  Lines         2047      2047
  Branches       513       513
=========================================
  Hits          2047      2047
```

☔ View full report in Codecov by Sentry.
Sorry for the delay in responding. This was intentional, or rather a deliberate mistake on my side; I am already aware of it. However, accepting this PR will reduce accuracy on our benchmarks. I know it seems counterintuitive, but I'm trying to work out what's causing this. Feel free to run the benchmarks with and without the "fix" to compare. Maybe you can come up with something interesting.
Please run the benchmarks to make sure it's not reducing accuracy.
@vivekjoshy I looked at the benchmarks
PR benchmark

| Model | Margin | main | PR | Difference |
|---|---|---|---|---|
| PlackettLuce | 0.0 | 1377/2154 | 1375/2154 | -2 |
| PlackettLuce | 1.0 | 1391/2154 | 1394/2154 | +3 |
| ThurstoneMostellerPart | 0.0 | 1359/2154 | 1290/2154 | -69 |
| ThurstoneMostellerPart | 1.0 | 1382/2154 | 1314/2154 | -68 |
| ThurstoneMostellerFull | 0.0 | 1459/2154 | 1407/2154 | -52 |
| ThurstoneMostellerFull | 1.0 | 1450/2154 | 1404/2154 | -46 |
| BradleyTerryFull | 0.0 | 1415/2154 | 1342/2154 | -73 |
| BradleyTerryFull | 1.0 | 1440/2154 | 1378/2154 | -62 |
| BradleyTerryPart | 0.0 | 1415/2154 | 1342/2154 | -73 |
| BradleyTerryPart | 1.0 | 1440/2154 | 1378/2154 | -62 |
Evaluation
Actually, PlackettLuce with a margin got better :), but yes, I agree the results are worse overall.
Why is this happening?
I looked at the data in `overwatch.jsonl` and can see that there are 44091 matches labelled with a `WIN` or `LOSS`, and 886 matches labelled as a `DRAW`. However, there are also matches which are labelled as a `WIN` or `LOSS` but actually have a tied score (`blue_won_fights` vs `red_won_fights`). Thus, there are actually 3042 matches with equal fight scores, although only 886 of them are labelled as a draw.
This is causing the difference in results.
Reproduction
```python
import jsonlines as jl
import polars as pl

# Load the benchmark data and keep only matches that have composition data,
# copying the per-team fight counts into top-level score columns.
data = list(jl.open("benchmark/data/overwatch.jsonl").iter())

actual_data = []
for entry in data:
    if entry["compositions"]:
        entry["blue_score"] = entry["compositions"]["blue_won_fights"]
        entry["red_score"] = entry["compositions"]["red_won_fights"]
        actual_data.append(entry)

df = pl.DataFrame(actual_data)
print(df["result"].value_counts())  # LOSS = 19551, DRAW = 886, WIN = 24540
print(df.filter(pl.col("blue_score") == pl.col("red_score")).height)  # 3042
```
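As a follow-up sketch (reusing `df` from the script above, and assuming the `result` column holds the string labels shown), the overlap can be broken down like this:

```python
# Among matches with equal fight scores, count how many are not labelled DRAW.
tied = df.filter(pl.col("blue_score") == pl.col("red_score"))
mislabelled = tied.filter(pl.col("result") != "DRAW")
print(tied.height)         # 3042 matches with tied scores
print(mislabelled.height)  # 3042 - 886 = 2156 labelled WIN/LOSS despite a tie
```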
Thank you for your valuable work. @all-contributors please add @xelandernt for code and bug
I've put up a pull request to add @xelandernt! 🎉
Description of Changes
`0` is treated as "falsy" by Python, which results in an incorrect ranking calculation in `model._calculate_rankings()`.
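For illustration, here is a minimal, hypothetical sketch of the pitfall (not the library's actual `_calculate_rankings()` code): a truthiness check silently drops a legitimate rank of `0`, while an explicit `is not None` check preserves it.

```python
# Hypothetical example of the falsy-zero pitfall; not openskill.py's actual code.
def normalize_ranks_buggy(ranks):
    # BUG: `r if r else i` treats rank 0 as "missing" because 0 is falsy.
    return [r if r else i for i, r in enumerate(ranks)]

def normalize_ranks_fixed(ranks):
    # FIX: test for None explicitly so a legitimate rank of 0 survives.
    return [r if r is not None else i for i, r in enumerate(ranks)]

print(normalize_ranks_buggy([0, 0, 1]))  # [0, 1, 1] -- the tie for first is lost
print(normalize_ranks_fixed([0, 0, 1]))  # [0, 0, 1] -- the tie is preserved
```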
Issue(s) Resolved
Fixes #173
Affirmation
By submitting this Pull Request or typing my (user)name below,
I affirm the Developer Certificate of Origin
with respect to all commits and content included in this PR,
and understand I am releasing the same under openskill.py's MIT license.
I certify the above statement is true and correct: xelandernt