Fix ranking calculation for ties. #174

xelandernt · 2025-09-29T06:14:49Z

Description of Changes

0 is treated as "falsy" by python, this results in an incorrect ranking calculation in model._calculate_rankings()
Wrote a test to ensure draws between players with the same rating do not affect mu

Issue(s) Resolved

Fixes #173

Affirmation

By submitting this Pull Request or typing my (user)name below,
I affirm the Developer Certificate of Origin
with respect to all commits and content included in this PR,
and understand I am releasing the same under openskill.py's MIT license.

I certify the above statement is true and correct: xelandernt

codecov · 2025-09-29T07:03:00Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (11ac382) to head (07e5391).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff            @@
##              main      #174   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           10        10           
  Lines         2047      2047           
  Branches       513       513           
=========================================
  Hits          2047      2047

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

vivekjoshy · 2025-10-02T14:27:57Z

Sorry for the delay in responding. But this was intentional, or a mistake on purpose from my side. I am actually aware of this already. However, accepting this PR will reduce accuracy on our benchmarks. I know it seems counter intuitive, but I'm trying to square what's causing this.

Feel free to run the benchmarks with and without the "fix" to compare. Maybe you can come up with something interesting.

vivekjoshy

Please run the benchmarks to make sure it's not reducing accuracy.

openskill/models/weng_lin/bradley_terry_full.py

xelandernt · 2025-10-03T16:26:46Z

@vivekjoshy I looked at the benchmarks
For me the results are as follows:

draw.ipynb stays the same (tested only Plackettluce)
rank.ipynb stays the same (tested only Plackettluce)
win.ipynb stays the same (tested only Plackettluce)
benchmark.py I have different results than on the main branch

PR benchmark

Fold 1/3
PlackettLuce (margin=0.0): 64.12% accuracy, 1.97s
PlackettLuce (margin=1.0): 65.21% accuracy, 1.62s
ThurstoneMostellerPart (margin=0.0): 61.12% accuracy, 1.60s
ThurstoneMostellerPart (margin=1.0): 61.66% accuracy, 1.63s
ThurstoneMostellerFull (margin=0.0): 64.67% accuracy, 1.62s
ThurstoneMostellerFull (margin=1.0): 64.26% accuracy, 1.63s
BradleyTerryFull (margin=0.0): 62.62% accuracy, 1.61s
BradleyTerryFull (margin=1.0): 63.85% accuracy, 1.61s
BradleyTerryPart (margin=0.0): 62.62% accuracy, 1.56s
BradleyTerryPart (margin=1.0): 63.85% accuracy, 1.58s

Fold 2/3
PlackettLuce (margin=0.0): 63.02% accuracy, 1.65s
PlackettLuce (margin=1.0): 63.16% accuracy, 1.63s
ThurstoneMostellerPart (margin=0.0): 59.14% accuracy, 1.61s
ThurstoneMostellerPart (margin=1.0): 60.53% accuracy, 1.62s
ThurstoneMostellerFull (margin=0.0): 63.99% accuracy, 1.61s
ThurstoneMostellerFull (margin=1.0): 64.40% accuracy, 1.62s
BradleyTerryFull (margin=0.0): 60.25% accuracy, 1.58s
BradleyTerryFull (margin=1.0): 62.88% accuracy, 1.56s
BradleyTerryPart (margin=0.0): 60.25% accuracy, 1.62s
BradleyTerryPart (margin=1.0): 62.88% accuracy, 1.66s

Fold 3/3
PlackettLuce (margin=0.0): 64.38% accuracy, 1.74s
PlackettLuce (margin=1.0): 65.81% accuracy, 1.73s
ThurstoneMostellerPart (margin=0.0): 59.37% accuracy, 1.72s
ThurstoneMostellerPart (margin=1.0): 60.80% accuracy, 1.63s
ThurstoneMostellerFull (margin=0.0): 67.38% accuracy, 1.63s
ThurstoneMostellerFull (margin=1.0): 66.95% accuracy, 1.65s
BradleyTerryFull (margin=0.0): 64.09% accuracy, 1.57s
BradleyTerryFull (margin=1.0): 65.24% accuracy, 1.57s
BradleyTerryPart (margin=0.0): 64.09% accuracy, 1.58s
BradleyTerryPart (margin=1.0): 65.24% accuracy, 1.65s

Benchmark Results:
               Rating System Benchmark Results - Margin Comparison               
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Model                  ┃ Margin ┃ Accuracy       ┃ Predictions ┃ Avg Time (s) ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ PlackettLuce           │ 0.0    │ 63.83% ± 0.59% │ 1375/2154   │ 1.79         │
│ PlackettLuce           │ 1.0    │ 64.72% ± 1.14% │ 1394/2154   │ 1.66         │
│ ThurstoneMostellerPart │ 0.0    │ 59.89% ± 0.88% │ 1290/2154   │ 1.65         │
│ ThurstoneMostellerPart │ 1.0    │ 61.00% ± 0.48% │ 1314/2154   │ 1.63         │
│ ThurstoneMostellerFull │ 0.0    │ 65.32% ± 1.47% │ 1407/2154   │ 1.62         │
│ ThurstoneMostellerFull │ 1.0    │ 65.18% ± 1.24% │ 1404/2154   │ 1.64         │
│ BradleyTerryFull       │ 0.0    │ 62.30% ± 1.58% │ 1342/2154   │ 1.59         │
│ BradleyTerryFull       │ 1.0    │ 63.97% ± 0.97% │ 1378/2154   │ 1.58         │
│ BradleyTerryPart       │ 0.0    │ 62.30% ± 1.58% │ 1342/2154   │ 1.59         │
│ BradleyTerryPart       │ 1.0    │ 63.97% ± 0.97% │ 1378/2154   │ 1.63         │
└────────────────────────┴────────┴────────────────┴─────────────┴──────────────┘


                      Margin Impact Analysis                       
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Model                  ┃ Accuracy Difference ┃ Speed Difference ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ PlackettLuce           │ +0.88% (better)     │ 6.99% faster     │
│ ThurstoneMostellerPart │ +1.11% (better)     │ 1.23% faster     │
│ ThurstoneMostellerFull │ -0.14% (worse)      │ 0.85% slower     │
│ BradleyTerryFull       │ +1.67% (better)     │ 0.42% faster     │
│ BradleyTerryPart       │ +1.67% (better)     │ 2.75% slower     │
└────────────────────────┴─────────────────────┴──────────────────┘

`main` Benchmark

Fold 1/3
PlackettLuce (margin=0.0): 64.12% accuracy, 1.96s
PlackettLuce (margin=1.0): 65.21% accuracy, 1.71s
ThurstoneMostellerPart (margin=0.0): 64.26% accuracy, 1.60s
ThurstoneMostellerPart (margin=1.0): 64.67% accuracy, 1.61s
ThurstoneMostellerFull (margin=0.0): 67.94% accuracy, 1.60s
ThurstoneMostellerFull (margin=1.0): 67.53% accuracy, 1.60s
BradleyTerryFull (margin=0.0): 65.89% accuracy, 1.54s
BradleyTerryFull (margin=1.0): 67.39% accuracy, 1.55s
BradleyTerryPart (margin=0.0): 65.89% accuracy, 1.55s
BradleyTerryPart (margin=1.0): 67.39% accuracy, 1.55s

Fold 2/3
PlackettLuce (margin=0.0): 63.16% accuracy, 1.64s
PlackettLuce (margin=1.0): 62.88% accuracy, 1.64s
ThurstoneMostellerPart (margin=0.0): 61.91% accuracy, 1.65s
ThurstoneMostellerPart (margin=1.0): 63.16% accuracy, 1.61s
ThurstoneMostellerFull (margin=0.0): 65.93% accuracy, 1.65s
ThurstoneMostellerFull (margin=1.0): 66.07% accuracy, 1.61s
BradleyTerryFull (margin=0.0): 64.40% accuracy, 1.59s
BradleyTerryFull (margin=1.0): 65.10% accuracy, 1.56s
BradleyTerryPart (margin=0.0): 64.40% accuracy, 1.58s
BradleyTerryPart (margin=1.0): 65.10% accuracy, 1.56s

Fold 3/3
PlackettLuce (margin=0.0): 64.52% accuracy, 1.64s
PlackettLuce (margin=1.0): 65.67% accuracy, 1.65s
ThurstoneMostellerPart (margin=0.0): 63.09% accuracy, 1.64s
ThurstoneMostellerPart (margin=1.0): 64.66% accuracy, 1.66s
ThurstoneMostellerFull (margin=0.0): 69.38% accuracy, 1.61s
ThurstoneMostellerFull (margin=1.0): 68.38% accuracy, 1.82s
BradleyTerryFull (margin=0.0): 66.81% accuracy, 1.58s
BradleyTerryFull (margin=1.0): 68.10% accuracy, 1.57s
BradleyTerryPart (margin=0.0): 66.81% accuracy, 1.57s
BradleyTerryPart (margin=1.0): 68.10% accuracy, 1.57s

Benchmark Results:
               Rating System Benchmark Results - Margin Comparison               
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Model                  ┃ Margin ┃ Accuracy       ┃ Predictions ┃ Avg Time (s) ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ PlackettLuce           │ 0.0    │ 63.93% ± 0.57% │ 1377/2154   │ 1.75         │
│ PlackettLuce           │ 1.0    │ 64.58% ± 1.22% │ 1391/2154   │ 1.67         │
│ ThurstoneMostellerPart │ 0.0    │ 63.09% ± 0.96% │ 1359/2154   │ 1.63         │
│ ThurstoneMostellerPart │ 1.0    │ 64.16% ± 0.71% │ 1382/2154   │ 1.63         │
│ ThurstoneMostellerFull │ 0.0    │ 67.73% ± 1.42% │ 1459/2154   │ 1.62         │
│ ThurstoneMostellerFull │ 1.0    │ 67.32% ± 0.96% │ 1450/2154   │ 1.68         │
│ BradleyTerryFull       │ 0.0    │ 65.69% ± 0.99% │ 1415/2154   │ 1.57         │
│ BradleyTerryFull       │ 1.0    │ 66.85% ± 1.28% │ 1440/2154   │ 1.56         │
│ BradleyTerryPart       │ 0.0    │ 65.69% ± 0.99% │ 1415/2154   │ 1.56         │
│ BradleyTerryPart       │ 1.0    │ 66.85% ± 1.28% │ 1440/2154   │ 1.56         │
└────────────────────────┴────────┴────────────────┴─────────────┴──────────────┘


                      Margin Impact Analysis                       
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Model                  ┃ Accuracy Difference ┃ Speed Difference ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ PlackettLuce           │ +0.65% (better)     │ 4.63% faster     │
│ ThurstoneMostellerPart │ +1.07% (better)     │ 0.33% faster     │
│ ThurstoneMostellerFull │ -0.42% (worse)      │ 3.79% slower     │
│ BradleyTerryFull       │ +1.16% (better)     │ 0.73% faster     │
│ BradleyTerryPart       │ +1.16% (better)     │ 0.01% faster     │
└────────────────────────┴─────────────────────┴──────────────────┘

Results

Model	Margin	`main`	PR	Difference
PlackettLuce	0.0	1377/2154	1375/2154	-2
PlackettLuce	1.0	1391/2154	1394/2154	+3
ThurstoneMostellerPart	0.0	1359/2154	1290/2154	-69
ThurstoneMostellerPart	1.0	1382/2154	1314/2154	-68
ThurstoneMostellerFull	0.0	1459/2154	1407/2154	-52
ThurstoneMostellerFull	1.0	1450/2154	1404/2154	-46
BradleyTerryFull	0.0	1415/2154	1342/2154	-73
BradleyTerryFull	1.0	1440/2154	1378/2154	-62
BradleyTerryPart	0.0	1415/2154	1342/2154	-73
BradleyTerryPart	1.0	1440/2154	1378/2154	-62

Evaluation

Actually PlackettLuce with a margin got better :), but yes I agree the results are worse overall.

Why is this happening?

I looked at the data in overwatch.jsonl and can see that by score there are 44091 matches with a labelled with a WIN or LOSS, and 866 matches labelled as a DRAW, however there are also matches which are labelled as a WIN or LOSS but actually have a tied score (blue_won_fights vs red_won_fights). Thus, there are actually 3042 matches with equal fights scores, although 886 were actually a draw.

This is causing the difference in results.

Reproduction

import jsonlines as jl
import polars as pl
data = list(jl.open("benchmark/data/overwatch.jsonl").iter())
actual_data = []
for entry in data:
    if entry["compositions"]:
        entry["blue_score"] = entry["compositions"]["blue_won_fights"]
        entry["red_score"] = entry["compositions"]["red_won_fights"]
        actual_data.append(entry)
df = pl.DataFrame(actual_data)
print(df["result"].value_counts()) # LOSS = 19551, DRAW  = 886, WIN = 24540
print(df.filter(pl.col("blue_score") == pl.col("red_score")).height) # 3042

vivekjoshy · 2025-10-03T16:55:11Z

Thank you for your valuable work.

@all-contributors please add @xelandernt for code and bug

allcontributors · 2025-10-03T16:55:21Z

@vivekjoshy

I've put up a pull request to add @xelandernt! 🎉

Fix ranking calculation for ties.

07e5391

xelandernt force-pushed the main branch from aae701a to 07e5391 Compare September 29, 2025 06:18

vivekjoshy self-requested a review October 2, 2025 14:29

vivekjoshy reviewed Oct 2, 2025

View reviewed changes

openskill/models/weng_lin/bradley_terry_full.py Show resolved Hide resolved

allcontributors bot mentioned this pull request Oct 3, 2025

docs: add xelandernt as a contributor for code, and bug #175

Merged

vivekjoshy merged commit 5e918f5 into vivekjoshy:main Oct 3, 2025
43 checks passed

xelandernt mentioned this pull request Oct 7, 2025

Fix PlackettLuce ties. #176

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Fix ranking calculation for ties. #174

Fix ranking calculation for ties. #174

Uh oh!

xelandernt commented Sep 29, 2025

Uh oh!

codecov bot commented Sep 29, 2025 •

edited

Loading

Uh oh!

vivekjoshy commented Oct 2, 2025

Uh oh!

vivekjoshy left a comment

Uh oh!

Uh oh!

xelandernt commented Oct 3, 2025 •

edited

Loading

Uh oh!

vivekjoshy commented Oct 3, 2025

Uh oh!

allcontributors bot commented Oct 3, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Fix ranking calculation for ties. #174

Fix ranking calculation for ties. #174

Uh oh!

Conversation

xelandernt commented Sep 29, 2025

Description of Changes

Issue(s) Resolved

Affirmation

Uh oh!

codecov bot commented Sep 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

vivekjoshy commented Oct 2, 2025

Uh oh!

vivekjoshy left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

xelandernt commented Oct 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR benchmark

main Benchmark

Results

Evaluation

Why is this happening?

Reproduction

Uh oh!

vivekjoshy commented Oct 3, 2025

Uh oh!

allcontributors bot commented Oct 3, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov bot commented Sep 29, 2025 •

edited

Loading

xelandernt commented Oct 3, 2025 •

edited

Loading

`main` Benchmark