-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
BUG: for ordered categorical data implements correct computation of kendall/spearman correlations #62880
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| self, | ||
| method, | ||
| ): | ||
| pytest.importorskip("scipy") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless you are going to use the import, you can just add this as a @td.skip_if_no("scipy") decorator to the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, fixed
| tm.assert_almost_equal(df.transpose().corr(method=my_corr), expected) | ||
|
|
||
| @pytest.mark.parametrize("method", ["kendall", "spearman"]) | ||
| def test_corr_rank_ordered_categorical( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is pretty long, to the point where its unclear what its intent is. Maybe its worth breaking up into a few tests? Or adding parameterization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
pandas/core/frame.py
Outdated
| cols_convert = categ.loc[:, categ.agg(lambda x: x.cat.ordered)].columns | ||
|
|
||
| if len(cols_convert) > 0: | ||
| data = self.copy(deep=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit wary of taking an entire copy of the dataframe in instances where there might be ordered categoricals; that's a potentially large performance hit, and the usage of this seems pretty niche
I see @rhshadrach commented on the original issue, so lets see what his thoughts are
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deep=False shouldn't be large as it doesn't copy the underlying data, but agreed we should measure the performance here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhshadrach are you suggesting an asv benchmark or to profile it and paste the results in the description of the PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For benchmarking, we don't have any ASVs that hit this case. You can just setup an example that hits this case and use timeit to compare this PR to main. Aim for 10-100ms in runtime so we aren't merely benchmarking overhead. If you want any assistance in setting this up, just let me know.
pandas/core/frame.py
Outdated
| cols_convert = categ.loc[:, categ.agg(lambda x: x.cat.ordered)].columns | ||
|
|
||
| if len(cols_convert) > 0: | ||
| data = self.copy(deep=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deep=False shouldn't be large as it doesn't copy the underlying data, but agreed we should measure the performance here.
pandas/core/frame.py
Outdated
| data[cols_convert] = data[cols_convert].transform( | ||
| lambda x: x.cat.codes.replace(-1, np.nan) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will fail when a DataFrame has duplicate column names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for catching this, fixing this!
|
|
||
| return correl | ||
|
|
||
| def _transform_ord_cat_cols_to_coded_cols(self) -> DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can simplify this a bit and make it more performant.
result = self
made_copy = False
for idx, dtype in enumerate(self.dtypes):
if not dtype == "category" or not dtype.ordered:
continue
col = result._ixs(idx, axis=1)
if not made_copy:
made_copy = True
result = result.copy(deep=False)
result._iset_item(idx, col.cat.codes.replace(-1, np.nan))
return resultCan you move this to pandas.core.methods.corr (this file does not yet exist) and make it take a DataFrame as input - we can move the remaining parts of the implementation in a later PR.
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.This PR was picked up and inspired from https://github.com/pandas-dev/pandas/pull/60493/files since that PR became stale almost a year ago