Commit 8d5f769

Khemkaran authored and committed
Merge remote-tracking branch 'upstream/main' into issue_61863
2 parents 09245fd + a067fff commit 8d5f769


70 files changed (+979, -613 lines)

.github/workflows/unit-tests.yml

Lines changed: 9 additions & 10 deletions
@@ -30,7 +30,7 @@ jobs:
 env_file: [actions-310.yaml, actions-311.yaml, actions-312.yaml, actions-313.yaml]
 # Prevent the include jobs from overriding other jobs
 pattern: [""]
-pandas_future_infer_string: ["0"]
+pandas_future_infer_string: ["1"]
 include:
 - name: "Downstream Compat"
   env_file: actions-311-downstream_compat.yaml
@@ -45,6 +45,10 @@ jobs:
   env_file: actions-313-freethreading.yaml
   pattern: "not slow and not network and not single_cpu"
   platform: ubuntu-24.04
+- name: "Without PyArrow"
+  env_file: actions-312.yaml
+  pattern: "not slow and not network and not single_cpu"
+  platform: ubuntu-24.04
 - name: "Locale: it_IT"
   env_file: actions-311.yaml
   pattern: "not slow and not network and not single_cpu"
@@ -67,13 +71,9 @@ jobs:
   # It will be temporarily activated during tests with locale.setlocale
   extra_loc: "zh_CN"
   platform: ubuntu-24.04
-- name: "Future infer strings"
+- name: "Past no infer strings"
   env_file: actions-312.yaml
-  pandas_future_infer_string: "1"
-  platform: ubuntu-24.04
-- name: "Future infer strings (without pyarrow)"
-  env_file: actions-311.yaml
-  pandas_future_infer_string: "1"
+  pandas_future_infer_string: "0"
   platform: ubuntu-24.04
 - name: "Numpy Dev"
   env_file: actions-311-numpydev.yaml
@@ -83,7 +83,6 @@ jobs:
 - name: "Pyarrow Nightly"
   env_file: actions-311-pyarrownightly.yaml
   pattern: "not slow and not network and not single_cpu"
-  pandas_future_infer_string: "1"
   platform: ubuntu-24.04
 fail-fast: false
 name: ${{ matrix.name || format('{0} {1}', matrix.platform, matrix.env_file) }}
@@ -92,13 +91,13 @@ jobs:
 LANG: ${{ matrix.lang || 'C.UTF-8' }}
 LC_ALL: ${{ matrix.lc_all || '' }}
 PANDAS_CI: '1'
-PANDAS_FUTURE_INFER_STRING: ${{ matrix.pandas_future_infer_string || '0' }}
+PANDAS_FUTURE_INFER_STRING: ${{ matrix.pandas_future_infer_string || '1' }}
 TEST_ARGS: ${{ matrix.test_args || '' }}
 PYTEST_WORKERS: 'auto'
 PYTEST_TARGET: ${{ matrix.pytest_target || 'pandas' }}
 # Clipboard tests
 QT_QPA_PLATFORM: offscreen
-REMOVE_PYARROW: ${{ matrix.name == 'Future infer strings (without pyarrow)' && '1' || '0' }}
+REMOVE_PYARROW: ${{ matrix.name == 'Without PyArrow' && '1' || '0' }}
 concurrency:
   # https://github.community/t/concurrecy-not-work-for-push/183068/7
   group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-${{ matrix.env_file }}-${{ matrix.pattern }}-${{ matrix.extra_apt || '' }}-${{ matrix.pandas_future_infer_string }}-${{ matrix.platform }}
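
The net effect of this matrix change is that CI now runs with PANDAS_FUTURE_INFER_STRING=1 by default, with a single "Past no infer strings" job keeping the old behaviour. A minimal sketch of what the flag toggles on the pandas side, assuming the environment variable maps onto the pandas 2.x "future.infer_string" option:

    # Sketch only: option name assumed from the pandas 2.x "future.infer_string" option.
    import pandas as pd

    pd.set_option("future.infer_string", True)

    ser = pd.Series(["a", "b", "c"])
    print(ser.dtype)  # the new "str" dtype rather than NumPy object dtype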

README.md

Lines changed: 1 addition & 1 deletion
@@ -115,7 +115,7 @@ details, see the commit logs at https://github.com/pandas-dev/pandas.
 ## Dependencies
 - [NumPy - Adds support for large, multi-dimensional arrays, matrices and high-level mathematical functions to operate on these arrays](https://www.numpy.org)
 - [python-dateutil - Provides powerful extensions to the standard datetime module](https://dateutil.readthedocs.io/en/stable/index.html)
-- [pytz - Brings the Olson tz database into Python which allows accurate and cross platform timezone calculations](https://github.com/stub42/pytz)
+- [tzdata - Provides an IANA time zone database](https://tzdata.readthedocs.io/en/latest/)

 See the [full installation instructions](https://pandas.pydata.org/pandas-docs/stable/install.html#dependencies) for minimum supported versions of required, recommended and optional dependencies.
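
For context on the dependency swap: tzdata ships the IANA time zone database that pandas relies on when resolving zone names (a hedged example, assuming the post-pytz setup where names are resolved through the standard library's zoneinfo machinery):

    import pandas as pd

    # Resolving an IANA zone name relies on the tz database that tzdata provides.
    ts = pd.Timestamp("2024-03-10 12:00", tz="America/New_York")
    print(ts.tz)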

ci/code_checks.sh

Lines changed: 4 additions & 1 deletion
@@ -58,7 +58,9 @@ if [[ -z "$CHECK" || "$CHECK" == "doctests" ]]; then

     MSG='Python and Cython Doctests' ; echo "$MSG"
     python -c 'import pandas as pd; pd.test(run_doctests=True)'
-    RET=$(($RET + $?)) ; echo "$MSG" "DONE"
+    # TEMP don't let doctests fail the build until all string dtype changes are fixed
+    # RET=$(($RET + $?)) ; echo "$MSG" "DONE"
+    echo "$MSG" "DONE"

 fi

@@ -72,6 +74,7 @@ if [[ -z "$CHECK" || "$CHECK" == "docstrings" ]]; then
     -i "pandas.Series.dt PR01" `# Accessors are implemented as classes, but we do not document the Parameters section` \
     -i "pandas.Period.freq GL08" \
     -i "pandas.Period.ordinal GL08" \
+    -i "pandas.errors.IncompatibleFrequency SA01,SS06,EX01" \
     -i "pandas.core.groupby.DataFrameGroupBy.plot PR02" \
     -i "pandas.core.groupby.SeriesGroupBy.plot PR02" \
     -i "pandas.core.resample.Resampler.quantile PR01,PR07" \

ci/deps/actions-311-downstream_compat.yaml

Lines changed: 1 addition & 2 deletions
@@ -50,8 +50,7 @@ dependencies:
   - pytz>=2023.4
   - pyxlsb>=1.0.10
   - s3fs>=2023.12.2
-  # TEMP upper pin for scipy (https://github.com/statsmodels/statsmodels/issues/9584)
-  - scipy>=1.12.0,<1.16
+  - scipy>=1.12.0
   - sqlalchemy>=2.0.0
   - tabulate>=0.9.0
   - xarray>=2024.1.1

doc/source/development/maintaining.rst

Lines changed: 19 additions & 6 deletions
@@ -388,8 +388,11 @@ Pre-release

 3. Make sure the CI is green for the last commit of the branch being released.

-4. If not a release candidate, make sure all backporting pull requests to the branch
-   being released are merged.
+4. If not a release candidate, make sure all backporting pull requests to the
+   branch being released are merged, and no merged pull requests are missing a
+   backport (check the
+   ["Still Needs Manual Backport"](https://github.com/pandas-dev/pandas/labels/Still%20Needs%20Manual%20Backport)
+   label for this).

 5. Create a new issue and milestone for the version after the one being released.
    If the release was a release candidate, we would usually want to create issues and

@@ -435,6 +438,9 @@ which will be triggered when the tag is pushed.

       scripts/download_wheels.sh <VERSION>

+   ATTENTION: this is currently not downloading *all* wheels, and you have to
+   manually download the remaining wheels and sdist!
+
 4. Create a `new GitHub release <https://github.com/pandas-dev/pandas/releases/new>`_:

    - Tag: ``<version>``

@@ -462,15 +468,22 @@ Post-Release
 ````````````

 1. Update symlinks to stable documentation by logging in to our web server, and
-   editing ``/var/www/html/pandas-docs/stable`` to point to ``version/<latest-version>``
-   for major and minor releases, or ``version/<minor>`` to ``version/<patch>`` for
+   editing ``/var/www/html/pandas-docs/stable`` to point to ``version/<X.Y>``
+   for major and minor releases, or ``version/<X.Y>`` to ``version/<patch>`` for
    patch releases. The exact instructions are (replace the example version numbers by
    the appropriate ones for the version you are releasing):

    - Log in to the server and use the correct user.
    - ``cd /var/www/html/pandas-docs/``
-   - ``ln -sfn version/2.1 stable`` (for a major or minor release)
-   - ``ln -sfn version/2.0.3 version/2.0`` (for a patch release)
+   - For a major or minor release (assuming the ``/version/2.1.0/`` docs have been uploaded to the server):
+
+     - Create a new X.Y symlink to X.Y.Z: ``cd version; ln -sfn 2.1.0 2.1``
+     - Update stable symlink to point to X.Y: ``ln -sfn version/2.1 stable``
+
+   - For a patch release (assuming the ``/version/2.1.3/`` docs have been uploaded to the server):
+
+     - Update the X.Y symlink to the new X.Y.Z patch version: ``cd version; ln -sfn 2.1.3 2.1``
+     - (the stable symlink should already be pointing to the correct X.Y version)

 2. If releasing a major or minor release, open a PR in our source code to update
    ``web/pandas/versions.json``, to have the desired versions in the documentation

doc/source/reference/testing.rst

Lines changed: 1 addition & 0 deletions
@@ -36,6 +36,7 @@ Exceptions and warnings
    errors.DuplicateLabelError
    errors.EmptyDataError
    errors.IncompatibilityWarning
+   errors.IncompatibleFrequency
    errors.IndexingError
    errors.InvalidColumnName
    errors.InvalidComparison
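
A short illustration of the newly documented exception, assuming the usual Period semantics where arithmetic between mismatched frequencies fails:

    import pandas as pd
    from pandas.errors import IncompatibleFrequency

    try:
        # Subtracting Periods with different frequencies is not defined.
        pd.Period("2024-01", freq="M") - pd.Period("2024-01-01", freq="D")
    except IncompatibleFrequency as err:
        print(f"caught IncompatibleFrequency: {err}")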

doc/source/user_guide/basics.rst

Lines changed: 1 addition & 1 deletion
@@ -590,7 +590,7 @@ arguments. The special value ``all`` can also be used:

 .. ipython:: python

-   frame.describe(include=["object"])
+   frame.describe(include=["str"])
    frame.describe(include=["number"])
    frame.describe(include="all")
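
The describe() change above only applies once string columns actually carry the new "str" dtype; a hedged sketch (the "frame" object here is a stand-in for the DataFrame built earlier in the doc):

    import pandas as pd

    pd.set_option("future.infer_string", True)  # assumed opt-in on pandas 2.x; the default later

    frame = pd.DataFrame({"a": ["x", "y", "x"], "b": [1.0, 2.0, 3.0]})
    print(frame.describe(include=["str"]))     # summary of string columns only
    print(frame.describe(include=["number"]))  # summary of numeric columns only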

doc/source/user_guide/io.rst

Lines changed: 27 additions & 36 deletions
@@ -5228,33 +5228,32 @@ languages easy. Parquet can use a variety of compression techniques to shrink th
 while still maintaining good read performance.

 Parquet is designed to faithfully serialize and de-serialize ``DataFrame`` s, supporting all of the pandas
-dtypes, including extension dtypes such as datetime with tz.
+dtypes, including extension dtypes such as datetime with timezone.

 Several caveats.

 * Duplicate column names and non-string columns names are not supported.
-* The ``pyarrow`` engine always writes the index to the output, but ``fastparquet`` only writes non-default
-  indexes. This extra column can cause problems for non-pandas consumers that are not expecting it. You can
-  force including or omitting indexes with the ``index`` argument, regardless of the underlying engine.
+* The DataFrame index is written as separate column(s) when it is a non-default range index.
+  This extra column can cause problems for non-pandas consumers that are not expecting it. You can
+  force including or omitting indexes with the ``index`` argument.
 * Index level names, if specified, must be strings.
 * In the ``pyarrow`` engine, categorical dtypes for non-string types can be serialized to parquet, but will de-serialize as their primitive dtype.
-* The ``pyarrow`` engine preserves the ``ordered`` flag of categorical dtypes with string types. ``fastparquet`` does not preserve the ``ordered`` flag.
-* Non supported types include ``Interval`` and actual Python object types. These will raise a helpful error message
-  on an attempt at serialization. ``Period`` type is supported with pyarrow >= 0.16.0.
+* The ``pyarrow`` engine supports the ``Period`` and ``Interval`` dtypes. ``fastparquet`` does not support those.
+* Non supported types include actual Python object types. These will raise a helpful error message
+  on an attempt at serialization.
 * The ``pyarrow`` engine preserves extension data types such as the nullable integer and string data
-  type (requiring pyarrow >= 0.16.0, and requiring the extension type to implement the needed protocols,
+  type (this can also work for external extension types, requiring the extension type to implement the needed protocols,
   see the :ref:`extension types documentation <extending.extension.arrow>`).

 You can specify an ``engine`` to direct the serialization. This can be one of ``pyarrow``, or ``fastparquet``, or ``auto``.
 If the engine is NOT specified, then the ``pd.options.io.parquet.engine`` option is checked; if this is also ``auto``,
-then ``pyarrow`` is tried, and falling back to ``fastparquet``.
+then ``pyarrow`` is used when installed, and falling back to ``fastparquet``.

 See the documentation for `pyarrow <https://arrow.apache.org/docs/python/>`__ and `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__.

 .. note::

-   These engines are very similar and should read/write nearly identical parquet format files.
-   ``pyarrow>=8.0.0`` supports timedelta data, ``fastparquet>=0.1.4`` supports timezone aware datetimes.
+   These engines are very similar and should read/write nearly identical parquet format files for most cases.
    These libraries differ by having different underlying dependencies (``fastparquet`` by using ``numba``, while ``pyarrow`` uses a c-library).

 .. ipython:: python

@@ -5280,24 +5279,21 @@ Write to a parquet file.

 .. ipython:: python

-   df.to_parquet("example_pa.parquet", engine="pyarrow")
-   df.to_parquet("example_fp.parquet", engine="fastparquet")
+   # specify engine="pyarrow" or engine="fastparquet" to use a specific engine
+   df.to_parquet("example.parquet")

 Read from a parquet file.

 .. ipython:: python

-   result = pd.read_parquet("example_fp.parquet", engine="fastparquet")
-   result = pd.read_parquet("example_pa.parquet", engine="pyarrow")
-
+   result = pd.read_parquet("example.parquet")
    result.dtypes

 By setting the ``dtype_backend`` argument you can control the default dtypes used for the resulting DataFrame.

 .. ipython:: python

-   result = pd.read_parquet("example_pa.parquet", engine="pyarrow", dtype_backend="pyarrow")
-
+   result = pd.read_parquet("example.parquet", dtype_backend="pyarrow")
    result.dtypes

 .. note::

@@ -5309,41 +5305,36 @@ Read only certain columns of a parquet file.

 .. ipython:: python

-   result = pd.read_parquet(
-       "example_fp.parquet",
-       engine="fastparquet",
-       columns=["a", "b"],
-   )
-   result = pd.read_parquet(
-       "example_pa.parquet",
-       engine="pyarrow",
-       columns=["a", "b"],
-   )
+   result = pd.read_parquet("example.parquet", columns=["a", "b"])
    result.dtypes


 .. ipython:: python
    :suppress:

-   os.remove("example_pa.parquet")
-   os.remove("example_fp.parquet")
+   os.remove("example.parquet")


 Handling indexes
 ''''''''''''''''

 Serializing a ``DataFrame`` to parquet may include the implicit index as one or
-more columns in the output file. Thus, this code:
+more columns in the output file. For example, this code:

 .. ipython:: python

-   df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
+   df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[1, 2])
    df.to_parquet("test.parquet", engine="pyarrow")

-creates a parquet file with *three* columns if you use ``pyarrow`` for serialization:
-``a``, ``b``, and ``__index_level_0__``. If you're using ``fastparquet``, the
-index `may or may not <https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write>`_
-be written to the file.
+creates a parquet file with *three* columns (``a``, ``b``, and
+``__index_level_0__`` when using the ``pyarrow`` engine, or ``index``, ``a``,
+and ``b`` when using the ``fastparquet`` engine) because the index in this case
+is not a default range index. In general, the index *may or may not* be written
+to the file (see the
+`preserve_index keyword for pyarrow <https://arrow.apache.org/docs/python/pandas.html#handling-pandas-indexes>`__
+or the
+`write_index keyword for fastparquet <https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write>`__
+to check the default behaviour).

 This unexpected extra column causes some databases like Amazon Redshift to reject
 the file, because that column doesn't exist in the target table.
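
The reworded caveat about the index is the one most likely to bite downstream consumers; a brief sketch of the index argument mentioned above (assumes a parquet engine such as pyarrow is installed):

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[1, 2])

    # Force the index in or out of the file, regardless of whether it is a default range index.
    df.to_parquet("with_index.parquet", index=True)
    df.to_parquet("without_index.parquet", index=False)

    print(pd.read_parquet("with_index.parquet").index)     # original index [1, 2]
    print(pd.read_parquet("without_index.parquet").index)  # default RangeIndex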

doc/source/user_guide/migration-3-strings.rst

Lines changed: 8 additions & 3 deletions
@@ -118,12 +118,17 @@ through the ``str`` accessor will work the same:
 Overview of behavior differences and how to address them
 ---------------------------------------------------------

-The dtype is no longer object dtype
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The dtype is no longer a numpy "object" dtype
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 When inferring or reading string data, the data type of the resulting DataFrame
 column or Series will silently start being the new ``"str"`` dtype instead of
-``"object"`` dtype, and this can have some impact on your code.
+the numpy ``"object"`` dtype, and this can have some impact on your code.
+
+The new string dtype is a pandas data type ("extension dtype"), and no longer a
+numpy ``np.dtype`` instance. Therefore, passing the dtype of a string column to
+numpy functions will no longer work (e.g. passing it to a ``dtype=`` argument
+of a numpy function, or using ``np.issubdtype`` to check the dtype).

 Checking the dtype
 ^^^^^^^^^^^^^^^^^^
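
A minimal sketch of the np.issubdtype pitfall described in the added paragraph (assumes a pandas version where string data uses the new "str" dtype):

    import numpy as np
    import pandas as pd

    ser = pd.Series(["a", "b"], dtype="str")

    # The new dtype is a pandas extension dtype, not a numpy dtype, so numpy-only
    # checks such as np.issubdtype(ser.dtype, np.object_) no longer apply.
    print(ser.dtype == "str")                 # True
    print(pd.api.types.is_string_dtype(ser))  # True
    print(isinstance(ser.dtype, np.dtype))    # False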

doc/source/whatsnew/index.rst

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ Version 2.3
 .. toctree::
    :maxdepth: 2

+   v2.3.2
    v2.3.1
    v2.3.0
