Commit a91c50a

Backport PR #62147 on branch 2.3.x (DOC: correct and rewrite string migration section on astype(str)) (#62151)
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
1 parent f7a2cfd commit a91c50a

1 file changed

+45
-20
lines changed

doc/source/user_guide/migration-3-strings.rst

Lines changed: 45 additions & 20 deletions
@@ -188,6 +188,14 @@ let pandas do the inference. But if you want to be specific, you can specify the
188188
This is actually compatible with pandas 2.x as well, since in pandas < 3,
189189
``dtype="str"`` was essentially treated as an alias for object dtype.
190190

191+
.. attention::
192+
193+
While using ``dtype="str"`` in constructors is compatible with pandas 2.x,
194+
specifying it as the dtype in :meth:`~Series.astype` runs into the issue
195+
of also stringifying missing values in pandas 2.x. See the section
196+
:ref:`string_migration_guide-astype_str` for more details.
197+
198+
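As a minimal sketch of the constructor case described in the note above (the sample values are illustrative assumptions, not taken from the guide), passing ``dtype="str"`` when building a Series should keep the missing value missing on both pandas 2.x and 3.x; only the resulting dtype differs:

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption for this sketch): one string, one missing value.
ser = pd.Series(["a", np.nan], dtype="str")

# The missing value stays missing instead of becoming the text "nan".
missing_mask = ser.isna().tolist()
print(missing_mask)

# The dtype is object on pandas 2.x and the new string dtype on pandas 3.x.
print(ser.dtype)
```

The same ``dtype="str"`` passed to ``astype`` on pandas 2.x does not give this guarantee, which is the point of the note.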
191199
The missing value sentinel is now always NaN
192200
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
193201

@@ -310,52 +318,69 @@ case.
310318
Notable bug fixes
311319
~~~~~~~~~~~~~~~~~
312320

321+
.. _string_migration_guide-astype_str:
322+
313323
``astype(str)`` preserving missing values
314324
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
315325

316-
This is a long standing "bug" or misfeature, as discussed in https://github.com/pandas-dev/pandas/issues/25353.
326+
The stringifying of missing values is a long standing "bug" or misfeature, as
327+
discussed in https://github.com/pandas-dev/pandas/issues/25353, but fixing it
328+
introduces a significant behaviour change.
317329

318-
With pandas < 3, when using ``astype(str)`` (using the built-in :func:`str`, not
319-
``astype("str")``!), the operation would convert every element to a string,
320-
including the missing values:
330+
With pandas < 3, when using ``astype(str)`` or ``astype("str")``, the operation
331+
would convert every element to a string, including the missing values:
321332

322333
.. code-block:: python
323334
324335
# OLD behavior in pandas < 3
325-
>>> ser = pd.Series(["a", np.nan], dtype=object)
336+
>>> ser = pd.Series([1.5, np.nan])
326337
>>> ser
327-
0 a
338+
0 1.5
328339
1 NaN
329-
dtype: object
330-
>>> ser.astype(str)
331-
0 a
340+
dtype: float64
341+
>>> ser.astype("str")
342+
0 1.5
332343
1 nan
333344
dtype: object
334-
>>> ser.astype(str).to_numpy()
335-
array(['a', 'nan'], dtype=object)
345+
>>> ser.astype("str").to_numpy()
346+
array(['1.5', 'nan'], dtype=object)
336347
337348
Note how ``NaN`` (``np.nan``) was converted to the string ``"nan"``. This was
338349
not the intended behavior, and it was inconsistent with how other dtypes handled
339350
missing values.
340351

341-
With pandas 3, this behavior has been fixed, and now ``astype(str)`` is an alias
342-
for ``astype("str")``, i.e. casting to the new string dtype, which will preserve
343-
the missing values:
352+
With pandas 3, this behavior has been fixed, and now ``astype("str")`` will cast
353+
to the new string dtype, which preserves the missing values:
344354

345355
.. code-block:: python
346356
347357
# NEW behavior in pandas 3
348358
>>> pd.options.future.infer_string = True
349-
>>> ser = pd.Series(["a", np.nan], dtype=object)
350-
>>> ser.astype(str)
351-
0 a
359+
>>> ser = pd.Series([1.5, np.nan])
360+
>>> ser.astype("str")
361+
0 1.5
352362
1 NaN
353363
dtype: str
354-
>>> ser.astype(str).values
355-
array(['a', nan], dtype=object)
364+
>>> ser.astype("str").to_numpy()
365+
array(['1.5', nan], dtype=object)
356366
357367
If you want to preserve the old behaviour of converting every object to a
358-
string, you can use ``ser.map(str)`` instead.
368+
string, you can use ``ser.map(str)`` instead. If you want to do such a conversion
369+
while preserving the missing values in a way that works with both pandas 2.x and
370+
3.x, you can use ``ser.map(str, na_action="ignore")`` (for pandas 3.x only, you
371+
can do ``ser.astype("str")``).
372+
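The two ``map`` variants mentioned above can be compared side by side. This is a small sketch that reuses the float Series from the earlier examples; the variable names are illustrative only:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.5, np.nan])

# Stringify every element, including the missing value
# (this reproduces the old astype(str) result).
everything = ser.map(str).tolist()

# Stringify only the non-missing elements; the missing value is passed through.
preserved = ser.map(str, na_action="ignore").tolist()

print(everything)
print(preserved)
```

Because ``na_action="ignore"`` skips missing values before calling ``str``, this form should behave the same way on pandas 2.x and 3.x.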
373+
If you want to convert to object or string dtype for pandas 2.x and 3.x,
374+
respectively, without needing to stringify each individual element, you will
375+
have to use a conditional check on the pandas version.
376+
For example, to convert a categorical Series with string categories to its
377+
dense non-categorical version with object or string dtype:
378+
379+
.. code-block:: python
380+
381+
>>> import numpy as np
>>> import pandas as pd
382+
>>> ser = pd.Series(["a", np.nan], dtype="category")
383+
>>> ser.astype(object if pd.__version__ < "3" else "str")
359384
360385
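The string comparison ``pd.__version__ < "3"`` is lexicographic. That is enough to separate 2.x from 3.x, but it would misclassify a hypothetical version such as ``"10.0"``. A slightly more defensive sketch compares the major version as an integer; the helper name ``astype_target`` is hypothetical and not part of pandas:

```python
def astype_target(pandas_version):
    """Pick the dtype to pass to astype() for a given pandas version string.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    # Take the leading component, e.g. "2" from "2.3.1" or "3" from "3.0.0rc0".
    major = int(pandas_version.split(".")[0])
    return object if major < 3 else "str"


print(astype_target("2.3.1"))
print(astype_target("3.0.0"))
```

With pandas imported as ``pd``, this could be used as ``ser.astype(astype_target(pd.__version__))``.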
361386
``prod()`` raising for string data
