Skip to content

Conversation

xelandernt
Copy link
Contributor

Description of Changes

  • 0 is treated as "falsy" by python, this results in an incorrect ranking calculation in model._calculate_rankings()
  • Wrote a test to ensure draws between players with the same rating do not affect mu

Issue(s) Resolved

Fixes #173

Affirmation

By submitting this Pull Request or typing my (user)name below,
I affirm the Developer Certificate of Origin
with respect to all commits and content included in this PR,
and understand I am releasing the same under openskill.py's MIT license.

I certify the above statement is true and correct: xelandernt

@codecov
Copy link

codecov bot commented Sep 29, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (11ac382) to head (07e5391).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #174   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           10        10           
  Lines         2047      2047           
  Branches       513       513           
=========================================
  Hits          2047      2047           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@vivekjoshy
Copy link
Owner

Sorry for the delay in responding. But this was intentional, or a mistake on purpose from my side. I am actually aware of this already. However, accepting this PR will reduce accuracy on our benchmarks. I know it seems counter intuitive, but I'm trying to square what's causing this.

Feel free to run the benchmarks with and without the "fix" to compare. Maybe you can come up with something interesting.

@vivekjoshy vivekjoshy self-requested a review October 2, 2025 14:29
Copy link
Owner

@vivekjoshy vivekjoshy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please run the benchmarks to make sure it's not reducing accuracy.

@xelandernt
Copy link
Contributor Author

xelandernt commented Oct 3, 2025

@vivekjoshy I looked at the benchmarks
For me the results are as follows:

  • draw.ipynb stays the same (tested only Plackettluce)
  • rank.ipynb stays the same (tested only Plackettluce)
  • win.ipynb stays the same (tested only Plackettluce)
  • benchmark.py I have different results than on the main branch

PR benchmark

Fold 1/3
PlackettLuce (margin=0.0): 64.12% accuracy, 1.97s
PlackettLuce (margin=1.0): 65.21% accuracy, 1.62s
ThurstoneMostellerPart (margin=0.0): 61.12% accuracy, 1.60s
ThurstoneMostellerPart (margin=1.0): 61.66% accuracy, 1.63s
ThurstoneMostellerFull (margin=0.0): 64.67% accuracy, 1.62s
ThurstoneMostellerFull (margin=1.0): 64.26% accuracy, 1.63s
BradleyTerryFull (margin=0.0): 62.62% accuracy, 1.61s
BradleyTerryFull (margin=1.0): 63.85% accuracy, 1.61s
BradleyTerryPart (margin=0.0): 62.62% accuracy, 1.56s
BradleyTerryPart (margin=1.0): 63.85% accuracy, 1.58s

Fold 2/3
PlackettLuce (margin=0.0): 63.02% accuracy, 1.65s
PlackettLuce (margin=1.0): 63.16% accuracy, 1.63s
ThurstoneMostellerPart (margin=0.0): 59.14% accuracy, 1.61s
ThurstoneMostellerPart (margin=1.0): 60.53% accuracy, 1.62s
ThurstoneMostellerFull (margin=0.0): 63.99% accuracy, 1.61s
ThurstoneMostellerFull (margin=1.0): 64.40% accuracy, 1.62s
BradleyTerryFull (margin=0.0): 60.25% accuracy, 1.58s
BradleyTerryFull (margin=1.0): 62.88% accuracy, 1.56s
BradleyTerryPart (margin=0.0): 60.25% accuracy, 1.62s
BradleyTerryPart (margin=1.0): 62.88% accuracy, 1.66s

Fold 3/3
PlackettLuce (margin=0.0): 64.38% accuracy, 1.74s
PlackettLuce (margin=1.0): 65.81% accuracy, 1.73s
ThurstoneMostellerPart (margin=0.0): 59.37% accuracy, 1.72s
ThurstoneMostellerPart (margin=1.0): 60.80% accuracy, 1.63s
ThurstoneMostellerFull (margin=0.0): 67.38% accuracy, 1.63s
ThurstoneMostellerFull (margin=1.0): 66.95% accuracy, 1.65s
BradleyTerryFull (margin=0.0): 64.09% accuracy, 1.57s
BradleyTerryFull (margin=1.0): 65.24% accuracy, 1.57s
BradleyTerryPart (margin=0.0): 64.09% accuracy, 1.58s
BradleyTerryPart (margin=1.0): 65.24% accuracy, 1.65s

Benchmark Results:
               Rating System Benchmark Results - Margin Comparison               
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Model                  ┃ Margin ┃ Accuracy       ┃ Predictions ┃ Avg Time (s) ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ PlackettLuce           │ 0.0    │ 63.83% ± 0.59% │ 1375/2154   │ 1.79         │
│ PlackettLuce           │ 1.0    │ 64.72% ± 1.14% │ 1394/2154   │ 1.66         │
│ ThurstoneMostellerPart │ 0.0    │ 59.89% ± 0.88% │ 1290/2154   │ 1.65         │
│ ThurstoneMostellerPart │ 1.0    │ 61.00% ± 0.48% │ 1314/2154   │ 1.63         │
│ ThurstoneMostellerFull │ 0.0    │ 65.32% ± 1.47% │ 1407/2154   │ 1.62         │
│ ThurstoneMostellerFull │ 1.0    │ 65.18% ± 1.24% │ 1404/2154   │ 1.64         │
│ BradleyTerryFull       │ 0.0    │ 62.30% ± 1.58% │ 1342/2154   │ 1.59         │
│ BradleyTerryFull       │ 1.0    │ 63.97% ± 0.97% │ 1378/2154   │ 1.58         │
│ BradleyTerryPart       │ 0.0    │ 62.30% ± 1.58% │ 1342/2154   │ 1.59         │
│ BradleyTerryPart       │ 1.0    │ 63.97% ± 0.97% │ 1378/2154   │ 1.63         │
└────────────────────────┴────────┴────────────────┴─────────────┴──────────────┘


                      Margin Impact Analysis                       
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Model                  ┃ Accuracy Difference ┃ Speed Difference ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ PlackettLuce           │ +0.88% (better)     │ 6.99% faster     │
│ ThurstoneMostellerPart │ +1.11% (better)     │ 1.23% faster     │
│ ThurstoneMostellerFull │ -0.14% (worse)      │ 0.85% slower     │
│ BradleyTerryFull       │ +1.67% (better)     │ 0.42% faster     │
│ BradleyTerryPart       │ +1.67% (better)     │ 2.75% slower     │
└────────────────────────┴─────────────────────┴──────────────────┘

main Benchmark

Fold 1/3
PlackettLuce (margin=0.0): 64.12% accuracy, 1.96s
PlackettLuce (margin=1.0): 65.21% accuracy, 1.71s
ThurstoneMostellerPart (margin=0.0): 64.26% accuracy, 1.60s
ThurstoneMostellerPart (margin=1.0): 64.67% accuracy, 1.61s
ThurstoneMostellerFull (margin=0.0): 67.94% accuracy, 1.60s
ThurstoneMostellerFull (margin=1.0): 67.53% accuracy, 1.60s
BradleyTerryFull (margin=0.0): 65.89% accuracy, 1.54s
BradleyTerryFull (margin=1.0): 67.39% accuracy, 1.55s
BradleyTerryPart (margin=0.0): 65.89% accuracy, 1.55s
BradleyTerryPart (margin=1.0): 67.39% accuracy, 1.55s

Fold 2/3
PlackettLuce (margin=0.0): 63.16% accuracy, 1.64s
PlackettLuce (margin=1.0): 62.88% accuracy, 1.64s
ThurstoneMostellerPart (margin=0.0): 61.91% accuracy, 1.65s
ThurstoneMostellerPart (margin=1.0): 63.16% accuracy, 1.61s
ThurstoneMostellerFull (margin=0.0): 65.93% accuracy, 1.65s
ThurstoneMostellerFull (margin=1.0): 66.07% accuracy, 1.61s
BradleyTerryFull (margin=0.0): 64.40% accuracy, 1.59s
BradleyTerryFull (margin=1.0): 65.10% accuracy, 1.56s
BradleyTerryPart (margin=0.0): 64.40% accuracy, 1.58s
BradleyTerryPart (margin=1.0): 65.10% accuracy, 1.56s

Fold 3/3
PlackettLuce (margin=0.0): 64.52% accuracy, 1.64s
PlackettLuce (margin=1.0): 65.67% accuracy, 1.65s
ThurstoneMostellerPart (margin=0.0): 63.09% accuracy, 1.64s
ThurstoneMostellerPart (margin=1.0): 64.66% accuracy, 1.66s
ThurstoneMostellerFull (margin=0.0): 69.38% accuracy, 1.61s
ThurstoneMostellerFull (margin=1.0): 68.38% accuracy, 1.82s
BradleyTerryFull (margin=0.0): 66.81% accuracy, 1.58s
BradleyTerryFull (margin=1.0): 68.10% accuracy, 1.57s
BradleyTerryPart (margin=0.0): 66.81% accuracy, 1.57s
BradleyTerryPart (margin=1.0): 68.10% accuracy, 1.57s

Benchmark Results:
               Rating System Benchmark Results - Margin Comparison               
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Model                  ┃ Margin ┃ Accuracy       ┃ Predictions ┃ Avg Time (s) ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ PlackettLuce           │ 0.0    │ 63.93% ± 0.57% │ 1377/2154   │ 1.75         │
│ PlackettLuce           │ 1.0    │ 64.58% ± 1.22% │ 1391/2154   │ 1.67         │
│ ThurstoneMostellerPart │ 0.0    │ 63.09% ± 0.96% │ 1359/2154   │ 1.63         │
│ ThurstoneMostellerPart │ 1.0    │ 64.16% ± 0.71% │ 1382/2154   │ 1.63         │
│ ThurstoneMostellerFull │ 0.0    │ 67.73% ± 1.42% │ 1459/2154   │ 1.62         │
│ ThurstoneMostellerFull │ 1.0    │ 67.32% ± 0.96% │ 1450/2154   │ 1.68         │
│ BradleyTerryFull       │ 0.0    │ 65.69% ± 0.99% │ 1415/2154   │ 1.57         │
│ BradleyTerryFull       │ 1.0    │ 66.85% ± 1.28% │ 1440/2154   │ 1.56         │
│ BradleyTerryPart       │ 0.0    │ 65.69% ± 0.99% │ 1415/2154   │ 1.56         │
│ BradleyTerryPart       │ 1.0    │ 66.85% ± 1.28% │ 1440/2154   │ 1.56         │
└────────────────────────┴────────┴────────────────┴─────────────┴──────────────┘


                      Margin Impact Analysis                       
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Model                  ┃ Accuracy Difference ┃ Speed Difference ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ PlackettLuce           │ +0.65% (better)     │ 4.63% faster     │
│ ThurstoneMostellerPart │ +1.07% (better)     │ 0.33% faster     │
│ ThurstoneMostellerFull │ -0.42% (worse)      │ 3.79% slower     │
│ BradleyTerryFull       │ +1.16% (better)     │ 0.73% faster     │
│ BradleyTerryPart       │ +1.16% (better)     │ 0.01% faster     │
└────────────────────────┴─────────────────────┴──────────────────┘

Results

Model Margin main PR Difference
PlackettLuce 0.0 1377/2154 1375/2154 -2
PlackettLuce 1.0 1391/2154 1394/2154 +3
ThurstoneMostellerPart 0.0 1359/2154 1290/2154 -69
ThurstoneMostellerPart 1.0 1382/2154 1314/2154 -68
ThurstoneMostellerFull 0.0 1459/2154 1407/2154 -52
ThurstoneMostellerFull 1.0 1450/2154 1404/2154 -46
BradleyTerryFull 0.0 1415/2154 1342/2154 -73
BradleyTerryFull 1.0 1440/2154 1378/2154 -62
BradleyTerryPart 0.0 1415/2154 1342/2154 -73
BradleyTerryPart 1.0 1440/2154 1378/2154 -62

Evaluation

Actually PlackettLuce with a margin got better :), but yes I agree the results are worse overall.

Why is this happening?

I looked at the data in overwatch.jsonl and can see that by score there are 44091 matches with a labelled with a WIN or LOSS, and 866 matches labelled as a DRAW, however there are also matches which are labelled as a WIN or LOSS but actually have a tied score (blue_won_fights vs red_won_fights). Thus, there are actually 3042 matches with equal fights scores, although 886 were actually a draw.

This is causing the difference in results.

Reproduction

import jsonlines as jl
import polars as pl
data = list(jl.open("benchmark/data/overwatch.jsonl").iter())
actual_data = []
for entry in data:
    if entry["compositions"]:
        entry["blue_score"] = entry["compositions"]["blue_won_fights"]
        entry["red_score"] = entry["compositions"]["red_won_fights"]
        actual_data.append(entry)
df = pl.DataFrame(actual_data)
print(df["result"].value_counts()) # LOSS = 19551, DRAW  = 886, WIN = 24540
print(df.filter(pl.col("blue_score") == pl.col("red_score")).height) # 3042

@vivekjoshy
Copy link
Owner

Thank you for your valuable work.

@all-contributors please add @xelandernt for code and bug

@allcontributors
Copy link
Contributor

@vivekjoshy

I've put up a pull request to add @xelandernt! 🎉

@vivekjoshy vivekjoshy merged commit 5e918f5 into vivekjoshy:main Oct 3, 2025
43 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ties give unexpected results

2 participants