BUG: for ordered categorical data implements correct computation of kendall/spearman correlations #62880

pandeconscious · 2025-10-27T13:44:55Z

closes BUG: spearman correlation doesn't work on non-numeric data #60306
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This PR picks up and builds on https://github.com/pandas-dev/pandas/pull/60493/files, since that PR went stale almost a year ago.
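
For readers unfamiliar with the idea, here is a minimal sketch of what rank correlation on ordered categoricals amounts to. It is an illustration written against the public pandas API, not the code in this PR: the column names and data are made up, and the conversion is done by hand before calling `DataFrame.corr`.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: one ordered categorical column, one numeric column.
levels = ["low", "medium", "high"]
df = pd.DataFrame(
    {
        "size": pd.Categorical(
            ["low", "high", "medium", "low", None, "high"],
            categories=levels,
            ordered=True,
        ),
        "score": [1.0, 9.0, 5.0, 2.0, 3.0, 8.0],
    }
)

# The integer codes follow the declared category order; -1 marks a missing value,
# so it is replaced by NaN to keep it out of the ranking.
codes = df["size"].cat.codes.astype("float64").replace(-1.0, np.nan)

# Rank correlations computed on the coded column (manual conversion, assumption:
# this mirrors the intent of the PR, not its exact implementation).
coded = df.assign(size=codes)
spearman = coded.corr(method="spearman")
kendall = coded.corr(method="kendall")
print(spearman.loc["size", "score"], kendall.loc["size", "score"])
```

Because both statistics depend only on ranks, any order-preserving coding of the categories gives the same result; the integer codes are simply the most convenient one.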

WillAyd · 2025-11-05T18:35:29Z

pandas/tests/frame/methods/test_cov_corr.py

+        self,
+        method,
+    ):
+        pytest.importorskip("scipy")


Unless you are going to use the import, you can just add this as a @td.skip_if_no("scipy") decorator to the test

thanks, fixed
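
To illustrate the two styles discussed here, a small sketch assuming pytest and pandas' internal `pandas.util._test_decorators` module are importable. The test names and bodies are hypothetical, and the second test assumes that `Series.corr(method="spearman")` relies on scipy internally.

```python
import pandas as pd
import pandas.util._test_decorators as td
import pytest


def test_uses_the_imported_module():
    # importorskip returns the module, which is useful when the test body
    # calls into it directly.
    stats = pytest.importorskip("scipy.stats")
    rho = stats.spearmanr([1, 2, 3, 4], [1, 3, 2, 4])[0]
    assert -1 <= rho <= 1


@td.skip_if_no("scipy")
def test_needs_scipy_only_indirectly():
    # scipy is never referenced here, so a decorator states the requirement
    # without leaving an unused import result in the body.
    left = pd.Series([1, 2, 3, 4])
    right = pd.Series([1, 3, 2, 4])
    assert -1 <= left.corr(right, method="spearman") <= 1
```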

WillAyd · 2025-11-05T18:36:17Z

pandas/tests/series/methods/test_cov_corr.py

        tm.assert_almost_equal(df.transpose().corr(method=my_corr), expected)
+
+    @pytest.mark.parametrize("method", ["kendall", "spearman"])
+    def test_corr_rank_ordered_categorical(


This test is pretty long, to the point where it's unclear what its intent is. Maybe it's worth breaking up into a few tests? Or adding parameterization?

WillAyd · 2025-11-05T19:20:24Z

pandas/core/frame.py

+        cols_convert = categ.loc[:, categ.agg(lambda x: x.cat.ordered)].columns
+
+        if len(cols_convert) > 0:
+            data = self.copy(deep=False)


I'm a bit wary of taking an entire copy of the dataframe in instances where there might be ordered categoricals; that's a potentially large performance hit, and the usage of this seems pretty niche

I see @rhshadrach commented on the original issue, so let's see what his thoughts are

deep=False shouldn't be large as it doesn't copy the underlying data, but agreed we should measure the performance here.

@rhshadrach are you suggesting adding an asv benchmark, or profiling it and pasting the results in the description of the PR?

For benchmarking, we don't have any ASVs that hit this case. You can just set up an example that hits this case and use timeit to compare this PR to main. Aim for 10-100ms in runtime so we aren't merely benchmarking overhead. If you want any assistance in setting this up, just let me know.
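
A possible starting point for such a measurement, offered as a sketch rather than the benchmark that was actually run. Comparing this PR to main needs two checkouts, so this snippet only times pieces that run on a released pandas: the shallow copy plus code conversion, and the Spearman correlation on the already-coded frame. The frame shape, column names, and the `convert` and `best_ms` helpers are all assumptions made for illustration.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 100_000  # adjust until the correlation takes roughly 10-100 ms
levels = ["low", "medium", "high"]
df = pd.DataFrame(
    {
        "grade": pd.Categorical(
            rng.choice(levels, size=n_rows), categories=levels, ordered=True
        ),
        "x": rng.standard_normal(n_rows),
        "y": rng.standard_normal(n_rows),
    }
)


def convert(frame: pd.DataFrame) -> pd.DataFrame:
    """Shallow-copy the frame and swap the ordered categorical for its codes."""
    out = frame.copy(deep=False)
    out["grade"] = frame["grade"].cat.codes.astype("float64").replace(-1.0, np.nan)
    return out


def best_ms(func, number: int = 3, repeat: int = 3) -> float:
    """Best-of-`repeat` average runtime per call, in milliseconds."""
    return 1000 * min(timeit.repeat(func, number=number, repeat=repeat)) / number


coded = convert(df)
convert_ms = best_ms(lambda: convert(df))
corr_ms = best_ms(lambda: coded.corr(method="spearman"))
print(f"convert: {convert_ms:.2f} ms, corr: {corr_ms:.2f} ms")
```

If the conversion time is small next to the correlation time, the shallow copy is unlikely to matter in practice; the same `best_ms` helper can be pointed at `df.corr(method="spearman")` on each branch once both are built.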

rhshadrach · 2025-11-06T18:46:00Z

pandas/core/frame.py

+        cols_convert = categ.loc[:, categ.agg(lambda x: x.cat.ordered)].columns
+
+        if len(cols_convert) > 0:
+            data = self.copy(deep=False)


deep=False shouldn't be large as it doesn't copy the underlying data, but agreed we should measure the performance here.

rhshadrach · 2025-11-06T18:51:24Z

pandas/core/frame.py

+            data[cols_convert] = data[cols_convert].transform(
+                lambda x: x.cat.codes.replace(-1, np.nan)
+            )


I think this will fail when a DataFrame has duplicate column names.

thanks for catching this, fixing this!
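
A short sketch of why label-based selection is ambiguous with duplicate column names, using made-up data. Selecting by label returns every column carrying that label, while positional selection returns exactly one.

```python
import pandas as pd

levels = ["low", "medium", "high"]
df = pd.DataFrame(
    {
        "a": pd.Categorical(["low", "high", "medium"], categories=levels, ordered=True),
        "b": [1.0, 3.0, 2.0],
    }
)
# Give both columns the same label to reproduce the duplicate-name situation.
df.columns = ["x", "x"]

by_label = df["x"]           # a DataFrame holding both "x" columns
by_position = df.iloc[:, 0]  # a Series holding only the categorical column

print(type(by_label).__name__, by_label.shape)
print(type(by_position).__name__, by_position.cat.ordered)
```

Since the label also matches the numeric column, any code that assumes `df[label]` is a single categorical Series (for example by calling `.cat` on it) cannot be relied on here.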

rhshadrach · 2025-11-18T02:32:24Z

pandas/core/frame.py


        return correl

+    def _transform_ord_cat_cols_to_coded_cols(self) -> DataFrame:


I think we can simplify this a bit and make it more performant.

    result = self
    made_copy = False
    for idx, dtype in enumerate(self.dtypes):
        if not dtype == "category" or not dtype.ordered:
            continue
        col = result._ixs(idx, axis=1)
        if not made_copy:
            made_copy = True
            result = result.copy(deep=False)
        result._iset_item(idx, col.cat.codes.replace(-1, np.nan))
    return result

Can you move this to pandas.core.methods.corr (this file does not yet exist) and make it take a DataFrame as input - we can move the remaining parts of the implementation in a later PR.
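
For illustration, a standalone variant of the suggestion above that takes a DataFrame as input. This is a sketch, not the code that was merged: the function name is hypothetical, and it swaps the private `_ixs` and `_iset_item` calls for the public `iloc` and `DataFrame.isetitem` so that it runs outside the pandas codebase.

```python
import numpy as np
import pandas as pd


def transform_ordered_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    """Return df with ordered categorical columns replaced by float codes.

    Hypothetical helper: works by position, so duplicate column names are fine,
    and only makes a shallow copy once a column actually needs converting.
    """
    result = df
    made_copy = False
    for idx, dtype in enumerate(df.dtypes):
        if not isinstance(dtype, pd.CategoricalDtype) or not dtype.ordered:
            continue
        if not made_copy:
            result = df.copy(deep=False)
            made_copy = True
        codes = df.iloc[:, idx].cat.codes.astype("float64")
        # -1 is the code for a missing value; turn it into NaN.
        result.isetitem(idx, codes.replace(-1.0, np.nan))
    return result


# Usage with made-up data: one ordered, one unordered categorical, one numeric.
frame = pd.DataFrame(
    {
        "grade": pd.Categorical(
            ["low", None, "high"], categories=["low", "medium", "high"], ordered=True
        ),
        "tag": pd.Categorical(["a", "b", "a"]),
        "value": [1.0, 2.0, 3.0],
    }
)
converted = transform_ordered_categoricals(frame)
untouched = transform_ordered_categoricals(frame[["value"]])
print(converted.dtypes)
```

Frames without ordered categoricals are returned as-is, which keeps the common all-numeric path free of any copy.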

pandeconscious added 7 commits October 23, 2025 10:46

init commit kendall spearman ordinal cats

1f8c628

Merge branch 'pandas-dev:main' into ordered_cat_corr

906f1e4

series test update and fixes

497dc7e

cat desc longer in tests

583aca6

testing frame corr

e069810

pre commit fixes v2

b90726f

cleanup

65a506c

pandeconscious changed the title ~~BUG: ordered categorical data now calculates right kendall/spearman correlations~~ BUG: for ordered categorical data implements correct computation of kendall/spearman correlations Oct 27, 2025

pandeconscious added 5 commits November 4, 2025 15:00

Merge branch 'pandas-dev:main' into ordered_cat_corr

ab3b8b9

test import scipy fix

e93ed83

rst sorting autofix

ec4d97e

Merge branch 'pandas-dev:main' into ordered_cat_corr

ebfc3b0

Merge branch 'pandas-dev:main' into ordered_cat_corr

8cfacef

pandeconscious marked this pull request as ready for review November 5, 2025 14:29

pandeconscious mentioned this pull request Nov 5, 2025

BUG: spearman correlation doesn't work on non-numeric data #60306

Open

1 task

WillAyd requested changes Nov 5, 2025


rhshadrach requested changes Nov 6, 2025


pandeconscious marked this pull request as draft November 7, 2025 21:03

pandeconscious added 6 commits November 12, 2025 12:07

Merge branch 'pandas-dev:main' into ordered_cat_corr

7ef7fb2

refactor

588808a

fix dtype for duplicates

c484552

Merge branch 'pandas-dev:main' into ordered_cat_corr

216475c

clean up

e997747

Merge branch 'pandas-dev:main' into ordered_cat_corr

4184167

pandeconscious marked this pull request as ready for review November 17, 2025 14:33

pandeconscious requested review from WillAyd and rhshadrach November 17, 2025 14:35

rhshadrach requested changes Nov 18, 2025


