DOC: fix doctests for pandas/core/strings/accessor.py for new string dtype #61908
I noticed that some of the docstrings highlight differences between returning Other files may have similar cases, though I haven't done a full check yet.
Good catch. That was an intentional change, see #54805 / #59616, so it is fine to update the docstrings here while updating them to use the string dtype.
 >>> ind = pd.Index(["Mouse", "dog", "house and parrot", "23.0", np.nan])
 >>> ind.str.contains("23", regex=False)
-Index([False, False, False, True, nan], dtype='object')
+array([False, False, False, True, False])
The fact that it changes here from Index to array does not seem to be intentional, though. Will look into that.
Apparently all Index operations that return a boolean result (e.g. also pd.Index([1, 2, 3]) == 2) use a numpy bool array, not an Index object. So this change is then "expected" given that we decided to return a bool dtype instead of the original object dtype from this operation (because the NaN now propagates as False).
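A minimal sketch of the comparison mentioned in this comment, for readers who want to check it locally. It only uses the expression quoted above; the variable names `idx` and `mask` are illustrative.

```python
import numpy as np
import pandas as pd

# Comparing an Index against a scalar, as in the comment above.
idx = pd.Index([1, 2, 3])
mask = idx == 2

# The result is a plain NumPy boolean array rather than an Index.
print(type(mask).__name__, mask.dtype)
print(mask.tolist())
```

This is consistent with the doctest output switching from `Index(...)` to `array(...)` once the result is bool dtype.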
Just curious, why did we decide to propagate NaN as False in these boolean operations?
See the links I mentioned above in #61908 (comment).
The main reason is that the current "object-dtype with NaN" is not that useful in practice. For example, it means that boolean filtering like ser[ser.str.contains("B")] only works as long as ser does not contain missing values.
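The filtering problem described here can be sketched as follows. This is an illustration under assumptions, not the exact example from the linked issues: the series contents are made up, and the object-dtype path is wrapped in `try`/`except` because its behaviour may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: an object-dtype Series with one missing value.
ser = pd.Series(["A", "B", np.nan], dtype=object)

# With object dtype the missing value is expected to stay missing in the
# mask, which typically makes boolean indexing raise a ValueError.
raw_mask = ser.str.contains("B")
try:
    ser[raw_mask]
    raw_filter_worked = True
except ValueError:
    raw_filter_worked = False

# Passing na=False fills the missing entry, giving a clean bool mask.
safe_mask = ser.str.contains("B", na=False)
filtered = ser[safe_mask]

print(raw_mask.dtype, raw_filter_worked)
print(safe_mask.dtype, filtered.tolist())
```

Propagating NaN as False by default, as discussed above, is meant to make the first form usable without the explicit `na=False`.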
-4 NaN
-dtype: object
+4 False
+dtype: bool

 Specifying `na` to be `False` instead of `NaN` replaces NaN values
I realized that this example is now also a bit outdated (we could still show it when starting with object dtype, or show it with filling with another value), but that is also fine to keep for another PR
-> #61960
@@ -3640,7 +3640,7 @@ def casefold(self):
 >>> s3 = pd.Series(['23', '³', '⅕', ''])
 >>> s3.str.isdigit()
 0 True
-1 True
+1 False
This is a behaviour change we should actually fix, see #61466
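For reference, a short sketch of how Python's built-in string methods classify the characters in this doctest. It assumes the previous docstring output mirrored `str.isdigit`; it does not exercise pandas or the new string dtype at all.

```python
# The four sample strings from the doctest above.
samples = ["23", "³", "⅕", ""]

# Map each string to (isdigit, isdecimal, isnumeric) using built-in str methods.
results = {s: (s.isdigit(), s.isdecimal(), s.isnumeric()) for s in samples}

for text, flags in results.items():
    print(repr(text), flags)

# '³' (superscript three) counts as a digit but not as a decimal.
# '⅕' (vulgar fraction) is numeric only.
```

Under that assumption, the superscript character is the one case where the new output (`False`) departs from the built-in result (`True`).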
@arthurlw thanks for the PR!
Going to merge this, so we can enable the doctests again.
Thanks @arthurlw!
- closes #xxxx (Replace xxxx with the GitHub issue number)
- Added type annotations to new arguments/methods/functions.
- Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Part of #61886