[fix]: fixes balanced subsampling bug in data/emnist.py #84

mariovas3 · 2024-03-20T18:29:37Z

Account for the y labels being offset by NUM_SPECIAL_TOKENS when calling np.bincount in the EMNIST balanced subsampling.

The offsetting is found here:

fsdl-text-recognizer-2022/text_recognizer/data/emnist.py

Line 104 in ac59bfe

y_train = data["dataset"]["train"][0, 0]["labels"][0, 0] + NUM_SPECIAL_TOKENS

and here:

fsdl-text-recognizer-2022/text_recognizer/data/emnist.py

Line 106 in ac59bfe

y_test = data["dataset"]["test"][0, 0]["labels"][0, 0] + NUM_SPECIAL_TOKENS

np.bincount counts every integer from 0 up to the maximum label, so for offset labels it prepends zeros for the values from 0 to y_min_element-1 that never occur. If these zeros are not excluded, they bias the mean class count downward, and the balanced dataset ends up with fewer samples.

Example bug:

>>> import numpy as np
>>> y = np.array([0, 1, 0, 2, 1])
>>> np.bincount(y)
array([2, 2, 1])
>>> NUM_SPECIAL_TOKENS = 4
>>> np.bincount(y + NUM_SPECIAL_TOKENS)
array([0, 0, 0, 0, 2, 2, 1])
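
To make the effect on the mean concrete, here is a minimal sketch that reuses the toy labels from the example above. It is not the code from data/emnist.py. The variable names are illustrative, and dropping the leading entries (or counting with np.unique) is just one possible way to exclude the padded zeros, assuming every label is at least NUM_SPECIAL_TOKENS after the offset.

```python
import numpy as np

NUM_SPECIAL_TOKENS = 4  # same toy value as in the example above
y = np.array([0, 1, 0, 2, 1]) + NUM_SPECIAL_TOKENS  # offset labels: [4, 5, 4, 6, 5]

# Naive count: includes zero entries for the unused labels 0..3.
counts_with_padding = np.bincount(y)      # array([0, 0, 0, 0, 2, 2, 1])
biased_mean = counts_with_padding.mean()  # 5 / 7, pulled down by the zeros

# One possible correction: drop the first NUM_SPECIAL_TOKENS entries.
counts = np.bincount(y)[NUM_SPECIAL_TOKENS:]  # array([2, 2, 1])
unbiased_mean = counts.mean()                 # 5 / 3

# Alternative that needs no knowledge of the offset: count only observed labels.
_, unique_counts = np.unique(y, return_counts=True)  # array([2, 2, 1])

print(int(biased_mean), int(unbiased_mean))  # 0 1
```

In this toy case, truncating the biased mean to an integer gives 0 samples per class, while the corrected mean gives 1.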


mariovas3 mentioned this pull request Mar 22, 2024

[bug]: np.bincount prepends zeros in data/emnist.py #85

Open
