Skip to content

Conversation

@pandeconscious
Copy link
Contributor

@pandeconscious pandeconscious commented Oct 27, 2025

This PR was picked up and inspired from https://github.com/pandas-dev/pandas/pull/60493/files since that PR became stale almost a year ago

@pandeconscious pandeconscious changed the title BUG: ordered categorical data now calculates right kendall/spearman correlations BUG: for ordered categorical data implements correct computation of kendall/spearman correlations Oct 27, 2025
@pandeconscious pandeconscious marked this pull request as ready for review November 5, 2025 14:29
self,
method,
):
pytest.importorskip("scipy")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless you are going to use the import, you can just add this as a @td.skip_if_no("scipy") decorator to the test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, fixed

tm.assert_almost_equal(df.transpose().corr(method=my_corr), expected)

@pytest.mark.parametrize("method", ["kendall", "spearman"])
def test_corr_rank_ordered_categorical(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is pretty long, to the point where its unclear what its intent is. Maybe its worth breaking up into a few tests? Or adding parameterization?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

cols_convert = categ.loc[:, categ.agg(lambda x: x.cat.ordered)].columns

if len(cols_convert) > 0:
data = self.copy(deep=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit wary of taking an entire copy of the dataframe in instances where there might be ordered categoricals; that's a potentially large performance hit, and the usage of this seems pretty niche

I see @rhshadrach commented on the original issue, so lets see what his thoughts are

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deep=False shouldn't be large as it doesn't copy the underlying data, but agreed we should measure the performance here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhshadrach are you suggesting an asv benchmark or to profile it and paste the results in the description of the PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For benchmarking, we don't have any ASVs that hit this case. You can just setup an example that hits this case and use timeit to compare this PR to main. Aim for 10-100ms in runtime so we aren't merely benchmarking overhead. If you want any assistance in setting this up, just let me know.

cols_convert = categ.loc[:, categ.agg(lambda x: x.cat.ordered)].columns

if len(cols_convert) > 0:
data = self.copy(deep=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deep=False shouldn't be large as it doesn't copy the underlying data, but agreed we should measure the performance here.

Comment on lines 12026 to 12028
data[cols_convert] = data[cols_convert].transform(
lambda x: x.cat.codes.replace(-1, np.nan)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will fail when a DataFrame has duplicate column names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for catching this, fixing this!

@pandeconscious pandeconscious marked this pull request as draft November 7, 2025 21:03
@pandeconscious pandeconscious marked this pull request as ready for review November 17, 2025 14:33

return correl

def _transform_ord_cat_cols_to_coded_cols(self) -> DataFrame:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify this a bit and make it more performant.

result = self
made_copy = False
for idx, dtype in enumerate(self.dtypes):
    if not dtype == "category" or not dtype.ordered:
        continue
    col = result._ixs(idx, axis=1)
    if not made_copy:
        made_copy = True
        result = result.copy(deep=False)
    result._iset_item(idx, col.cat.codes.replace(-1, np.nan))
return result

Can you move this to pandas.core.methods.corr (this file does not yet exist) and make it take a DataFrame as input - we can move the remaining parts of the implementation in a later PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: spearman correlation doesn't work on non-numeric data

3 participants