Invalid ability estimates on real-world data, caused by numerical instability

Dear Jonas,

Thanks again for your great package, and for making it open source! I have encountered an issue with `adaptivetesting` on our real-world data, in which numerical overflows seem to lead to incorrectly computed ability estimates. I have created a fully reproducible example, implemented as a unit test for your package: [test_realworld.py](https://github.com/condecon/adaptivetesting/commit/df77121c5fc8376bf97c114490991d029a5ec97f)

To run it yourself:

```sh
cd
git clone https://github.com/dfsp-spirit/adaptivetesting.git adaptivetesting-ts
cd adaptivetesting-ts/
git checkout illustrate_issue
git checkout df77121      # Illustrates the broken state.
uv sync --editable .
uv run python -m unittest
```

If you run it, you will get output like this:
```shell
uv run python -m unittest
........Expected percentage correct: 49.2%
Actual percentage correct: 52.0%
.0.0667121752511823
.First 5 items - Expected probabilities for ability=0:
Item 0: a=1.051, b=-0.560, c=0.060, d=0.814 -> P=0.545
Item 1: a=0.994, b=-0.230, c=0.241, d=0.805 -> P=0.555
Item 2: a=0.991, b=1.559, c=0.150, d=0.898 -> P=0.282
Item 3: a=1.274, b=0.070, c=0.129, d=0.817 -> P=0.457
Item 4: a=0.955, b=0.129, c=0.101, d=0.883 -> P=0.468
.......................Item ID: S0811, Correct Answer: diff, User Answer: same. Score: 0
After item #1 with ID S001: estimated ability and standard error: -0.13013013013013008, 0.9785610828814044
Item ID: S049, Correct Answer: diff, User Answer: same. Score: 0
After item #2 with ID S003: estimated ability and standard error: -0.3303303303303302, 0.3249376308979352
Item ID: S007, Correct Answer: diff, User Answer: same. Score: 0
After item #3 with ID S005: estimated ability and standard error: -0.39039039039039025, 0.26918390077873305
Item ID: S075, Correct Answer: diff, User Answer: same. Score: 0
After item #4 with ID S007: estimated ability and standard error: -0.4904904904904903, 0.21795263455719996
Item ID: S003, Correct Answer: diff, User Answer: same. Score: 0
/home/ts/develop_mpiae/adaptivetesting_myfork/adaptivetesting/math/estimators/__functions/__estimators.py:28: RuntimeWarning: overflow encountered in exp
  value = c + (d - c) * (np.exp(a * (mu - b))) / \
/home/ts/develop_mpiae/adaptivetesting_myfork/adaptivetesting/math/estimators/__functions/__estimators.py:29: RuntimeWarning: overflow encountered in exp
  (1 + np.exp(a * (mu - b)))
/home/ts/develop_mpiae/adaptivetesting_myfork/adaptivetesting/math/estimators/__functions/__estimators.py:28: RuntimeWarning: invalid value encountered in divide
  value = c + (d - c) * (np.exp(a * (mu - b))) / \
/home/ts/develop_mpiae/adaptivetesting_myfork/adaptivetesting/math/estimators/__functions/__estimators.py:28: RuntimeWarning: invalid value encountered in scalar divide
  value = c + (d - c) * (np.exp(a * (mu - b))) / \
After item #5 with ID S009: estimated ability and standard error: 9.77977977977978, nan
Item ID: S192, Correct Answer: same, User Answer: same. Score: 1
After item #6 with ID S013: estimated ability and standard error: 9.77977977977978, nan
Item ID: S0908, Correct Answer: same, User Answer: same. Score: 1
After item #7 with ID S015: estimated ability and standard error: 9.77977977977978, nan
Item ID: S0712, Correct Answer: same, User Answer: same. Score: 1
After item #8 with ID S017: estimated ability and standard error: 9.77977977977978, nan
// many more lines omitted here
After item #137 with ID S1410: estimated ability and standard error: 9.77977977977978, nan
E......................
======================================================================
ERROR: test_our_issue (adaptivetesting.tests.test_realworld.TestRealWorld.test_our_issue)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ts/develop_mpiae/adaptivetesting_myfork/adaptivetesting/tests/test_realworld.py", line 71, in test_our_issue
    adaptive_test.run_test_once()
  File "/home/ts/develop_mpiae/adaptivetesting_myfork/adaptivetesting/implementations/__test_assembler.py", line 231, in run_test_once
    return super().run_test_once()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ts/develop_mpiae/adaptivetesting_myfork/adaptivetesting/models/__adaptive_test.py", line 150, in run_test_once
    item = self.get_next_item()
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ts/develop_mpiae/adaptivetesting_myfork/adaptivetesting/implementations/__test_assembler.py", line 157, in get_next_item
    item = self.__item_selector(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ts/develop_mpiae/adaptivetesting_myfork/adaptivetesting/math/item_selection/__maximum_information_criterion.py", line 53, in maximum_information_criterion
    raise ItemSelectionException("No appropriate item could be selected.")
adaptivetesting.models.__item_selection_exception.ItemSelectionException: No appropriate item could be selected.
----------------------------------------------------------------------
Ran 56 tests in 6.019s
FAILED (errors=1)
```

Note the `RuntimeWarning`s, and that from item #5 onward the ability estimate is stuck at `9.77977977977978` with SE=`nan`. Both stay in this state until the run aborts with the `ItemSelectionException`.

The [second commit in the same branch](https://github.com/condecon/adaptivetesting/commit/dfb1fb43f595c3fc9b258f7b5b7d5abb955ab251) fixes this by using numerically more stable math, which avoids the runtime warnings and thus the wrong ability estimate (commands continued from above):

```sh
git checkout dfb1fb43f595c3fc9b258f7b5b7d5abb955ab251
uv run python -m unittest
```

This shows the expected behavior:

```sh
uv run python -m unittest
........Expected percentage correct: 49.2%
Actual percentage correct: 52.0%
.0.06671217525215464
.First 5 items - Expected probabilities for ability=0:
Item 0: a=1.051, b=-0.560, c=0.060, d=0.814 -> P=0.545
Item 1: a=0.994, b=-0.230, c=0.241, d=0.805 -> P=0.555
Item 2: a=0.991, b=1.559, c=0.150, d=0.898 -> P=0.282
Item 3: a=1.274, b=0.070, c=0.129, d=0.817 -> P=0.457
Item 4: a=0.955, b=0.129, c=0.101, d=0.883 -> P=0.468
.......................Item ID: S0811, Correct Answer: diff, User Answer: same. Score: 0
After item #1 with ID S001: estimated ability and standard error: -0.13013013013013008, 0.9785610828814044
Item ID: S049, Correct Answer: diff, User Answer: same. Score: 0
After item #2 with ID S003: estimated ability and standard error: -0.3303303303303302, 0.3249376308979352
Item ID: S007, Correct Answer: diff, User Answer: same. Score: 0
After item #3 with ID S005: estimated ability and standard error: -0.39039039039039025, 0.26918390077873305
Item ID: S075, Correct Answer: diff, User Answer: same. Score: 0
After item #4 with ID S007: estimated ability and standard error: -0.4904904904904903, 0.21795263455719996
Item ID: S003, Correct Answer: diff, User Answer: same. Score: 0
After item #5 with ID S009: estimated ability and standard error: -0.5705705705705704, 0.1670767407810938
Item ID: S081, Correct Answer: diff, User Answer: same. Score: 0
After item #6 with ID S013: estimated ability and standard error: -0.6106106106106104, 0.22674237978014508
Item ID: S151, Correct Answer: same, User Answer: same. Score: 1
After item #7 with ID S015: estimated ability and standard error: -0.5705705705705704, 0.0647880110496236
Item ID: S065, Correct Answer: diff, User Answer: same. Score: 0
After item #8 with ID S017: estimated ability and standard error: -0.5705705705705704, 0.06046497163058291
Item ID: S083, Correct Answer: diff, User Answer: same. Score: 0
After item #9 with ID S019: estimated ability and standard error: -0.5705705705705704, 0.05724104458744014
Item ID: S023, Correct Answer: diff, User Answer: same. Score: 0
After item #10 with ID S021: estimated ability and standard error: -0.6706706706706704, 0.03326211160134737
Item ID: S0406, Correct Answer: diff, User Answer: same. Score: 0
// Many lines omitted here
Item ID: S1401, Correct Answer: same, User Answer: same. Score: 1
After item #138 with ID S1411: estimated ability and standard error: -1.3113113113113108, 0.017204674401946778
.......................
----------------------------------------------------------------------
Ran 56 tests in 11.572s
OK
```

I think the problem has two parts. In `probability_y1()`, `np.exp(a * (mu - b))` can overflow when the exponent becomes very large. Once `np.exp` returns `inf`, the quotient becomes `inf / inf`, which evaluates to `nan` (the "invalid value encountered in divide" warnings above). In `item_information_function()`, when `p_y1` approaches 0 or 1, the denominator `p_y1 * (1 - p_y1)` approaches 0, so the division by a very small number can overflow as well.
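To illustrate the idea, here is a minimal sketch of how both functions could be written so that `np.exp` never sees a large positive argument. It is not the code from the second commit. The function signatures, the helper name `stable_logistic`, and the tolerance `eps` are assumptions made for this example, and the naive version only mirrors the expression quoted in the warnings above.

```python
import numpy as np


def probability_y1_naive(mu, a, b, c, d):
    """4PL response probability in the form quoted in the warnings above."""
    z = a * (mu - b)
    return c + (d - c) * np.exp(z) / (1 + np.exp(z))


def stable_logistic(z):
    """Logistic function exp(z) / (1 + exp(z)) without overflow (hypothetical helper)."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))  # argument is <= 0, so e lies in [0, 1]
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))


def probability_y1_stable(mu, a, b, c, d):
    """Same 4PL probability, built on the overflow-free logistic."""
    return c + (d - c) * stable_logistic(a * (mu - b))


def item_information_stable(mu, a, b, c, d, eps=1e-12):
    """4PL item information P'(mu)^2 / (P * (1 - P)) with a guarded denominator.

    eps is a hypothetical tolerance chosen for this sketch.
    """
    p = probability_y1_stable(mu, a, b, c, d)
    slope = a * (p - c) * (d - p) / (d - c)    # dP/dmu for the 4PL model
    denom = np.clip(p * (1.0 - p), eps, None)  # never divide by (almost) zero
    return slope**2 / denom


if __name__ == "__main__":
    # Parameters of "Item 0" from the test output above.
    item = dict(a=1.051, b=-0.560, c=0.060, d=0.814)
    with np.errstate(over="ignore", invalid="ignore"):
        print(probability_y1_naive(1000.0, **item))  # nan, from inf / inf
    print(probability_y1_stable(1000.0, **item))     # tends to the upper asymptote d
    print(probability_y1_stable(0.0, **item))        # matches P=0.545 in the log
```

For moderate exponents the two probability functions agree, so swapping in the stable form should not change results where the naive form already worked. The clipping in the information function only matters when `c = 0` or `d = 1`, because otherwise the probability stays strictly inside `(c, d)`.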

One more note: there may be more numerical stability issues of this kind hiding in other math functions. To rescue our use case, changing `item_information_function()` and `probability_y1()` as done in the second commit was sufficient, but other data may expose more.



