
Commit e0f1f37

Merge branch 'main' into issue_61863
2 parents: 17fcfc1 + d5eab1b

File tree: 11 files changed (+88 −81 lines)


doc/source/user_guide/io.rst

Lines changed: 27 additions & 36 deletions
@@ -5228,33 +5228,32 @@ languages easy. Parquet can use a variety of compression techniques to shrink th
 while still maintaining good read performance.
 
 Parquet is designed to faithfully serialize and de-serialize ``DataFrame`` s, supporting all of the pandas
-dtypes, including extension dtypes such as datetime with tz.
+dtypes, including extension dtypes such as datetime with timezone.
 
 Several caveats.
 
 * Duplicate column names and non-string columns names are not supported.
-* The ``pyarrow`` engine always writes the index to the output, but ``fastparquet`` only writes non-default
-  indexes. This extra column can cause problems for non-pandas consumers that are not expecting it. You can
-  force including or omitting indexes with the ``index`` argument, regardless of the underlying engine.
+* The DataFrame index is written as separate column(s) when it is a non-default range index.
+  This extra column can cause problems for non-pandas consumers that are not expecting it. You can
+  force including or omitting indexes with the ``index`` argument.
 * Index level names, if specified, must be strings.
 * In the ``pyarrow`` engine, categorical dtypes for non-string types can be serialized to parquet, but will de-serialize as their primitive dtype.
-* The ``pyarrow`` engine preserves the ``ordered`` flag of categorical dtypes with string types. ``fastparquet`` does not preserve the ``ordered`` flag.
-* Non supported types include ``Interval`` and actual Python object types. These will raise a helpful error message
-  on an attempt at serialization. ``Period`` type is supported with pyarrow >= 0.16.0.
+* The ``pyarrow`` engine supports the ``Period`` and ``Interval`` dtypes. ``fastparquet`` does not support those.
+* Non supported types include actual Python object types. These will raise a helpful error message
+  on an attempt at serialization.
 * The ``pyarrow`` engine preserves extension data types such as the nullable integer and string data
-  type (requiring pyarrow >= 0.16.0, and requiring the extension type to implement the needed protocols,
+  type (this can also work for external extension types, requiring the extension type to implement the needed protocols,
   see the :ref:`extension types documentation <extending.extension.arrow>`).
 
 You can specify an ``engine`` to direct the serialization. This can be one of ``pyarrow``, or ``fastparquet``, or ``auto``.
 If the engine is NOT specified, then the ``pd.options.io.parquet.engine`` option is checked; if this is also ``auto``,
-then ``pyarrow`` is tried, and falling back to ``fastparquet``.
+then ``pyarrow`` is used when installed, and falling back to ``fastparquet``.
 
 See the documentation for `pyarrow <https://arrow.apache.org/docs/python/>`__ and `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__.
 
 .. note::
 
-   These engines are very similar and should read/write nearly identical parquet format files.
-   ``pyarrow>=8.0.0`` supports timedelta data, ``fastparquet>=0.1.4`` supports timezone aware datetimes.
+   These engines are very similar and should read/write nearly identical parquet format files for most cases.
   These libraries differ by having different underlying dependencies (``fastparquet`` by using ``numba``, while ``pyarrow`` uses a c-library).
 
 .. ipython:: python
@@ -5280,24 +5279,21 @@ Write to a parquet file.
 
 .. ipython:: python
 
-   df.to_parquet("example_pa.parquet", engine="pyarrow")
-   df.to_parquet("example_fp.parquet", engine="fastparquet")
+   # specify engine="pyarrow" or engine="fastparquet" to use a specific engine
+   df.to_parquet("example.parquet")
 
 Read from a parquet file.
 
 .. ipython:: python
 
-   result = pd.read_parquet("example_fp.parquet", engine="fastparquet")
-   result = pd.read_parquet("example_pa.parquet", engine="pyarrow")
-
+   result = pd.read_parquet("example.parquet")
    result.dtypes
 
 By setting the ``dtype_backend`` argument you can control the default dtypes used for the resulting DataFrame.
 
 .. ipython:: python
 
-   result = pd.read_parquet("example_pa.parquet", engine="pyarrow", dtype_backend="pyarrow")
-
+   result = pd.read_parquet("example.parquet", dtype_backend="pyarrow")
    result.dtypes
 
 .. note::
@@ -5309,41 +5305,36 @@ Read only certain columns of a parquet file.
 
 .. ipython:: python
 
-   result = pd.read_parquet(
-       "example_fp.parquet",
-       engine="fastparquet",
-       columns=["a", "b"],
-   )
-   result = pd.read_parquet(
-       "example_pa.parquet",
-       engine="pyarrow",
-       columns=["a", "b"],
-   )
+   result = pd.read_parquet("example.parquet", columns=["a", "b"])
    result.dtypes
 
 
 .. ipython:: python
    :suppress:
 
-   os.remove("example_pa.parquet")
-   os.remove("example_fp.parquet")
+   os.remove("example.parquet")
 
 
 Handling indexes
 ''''''''''''''''
 
 Serializing a ``DataFrame`` to parquet may include the implicit index as one or
-more columns in the output file. Thus, this code:
+more columns in the output file. For example, this code:
 
 .. ipython:: python
 
-   df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
+   df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[1, 2])
    df.to_parquet("test.parquet", engine="pyarrow")
 
-creates a parquet file with *three* columns if you use ``pyarrow`` for serialization:
-``a``, ``b``, and ``__index_level_0__``. If you're using ``fastparquet``, the
-index `may or may not <https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write>`_
-be written to the file.
+creates a parquet file with *three* columns (``a``, ``b``, and
+``__index_level_0__`` when using the ``pyarrow`` engine, or ``index``, ``a``,
+and ``b`` when using the ``fastparquet`` engine) because the index in this case
+is not a default range index. In general, the index *may or may not* be written
+to the file (see the
+`preserve_index keyword for pyarrow <https://arrow.apache.org/docs/python/pandas.html#handling-pandas-indexes>`__
+or the
+`write_index keyword for fastparquet <https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write>`__
+to check the default behaviour).
 
 This unexpected extra column causes some databases like Amazon Redshift to reject
 the file, because that column doesn't exist in the target table.
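A minimal sketch of the index behaviour the updated docs describe, assuming pandas and pyarrow are installed (the file name mirrors the docs example; this snippet is not part of the commit):

```python
import pandas as pd
import pyarrow.parquet as pq

# A non-default index is written as an extra column by the pyarrow engine
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[1, 2])
df.to_parquet("test.parquet", engine="pyarrow")
print(pq.read_schema("test.parquet").names)  # ['a', 'b', '__index_level_0__']

# Pass index=False to omit it for consumers (e.g. Amazon Redshift)
# that reject columns missing from the target table
df.to_parquet("test.parquet", engine="pyarrow", index=False)
print(pq.read_schema("test.parquet").names)  # ['a', 'b']
```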

doc/source/whatsnew/v3.0.0.rst

Lines changed: 2 additions & 0 deletions
@@ -414,6 +414,7 @@ Other API changes
 - Index set operations (like union or intersection) will now ignore the dtype of
   an empty ``RangeIndex`` or empty ``Index`` with object dtype when determining
   the dtype of the resulting Index (:issue:`60797`)
+- Comparison operations between :class:`Index` and :class:`Series` now consistently return :class:`Series` regardless of which object is on the left or right (:issue:`36759`)
 - Numpy functions like ``np.isinf`` that return a bool dtype when called on a :class:`Index` object now return a bool-dtype :class:`Index` instead of ``np.ndarray`` (:issue:`52676`)
 
 .. ---------------------------------------------------------------------------
@@ -717,6 +718,7 @@ Datetimelike
 Timedelta
 ^^^^^^^^^
 - Accuracy improvement in :meth:`Timedelta.to_pytimedelta` to round microseconds consistently for large nanosecond based Timedelta (:issue:`57841`)
+- Bug in :class:`Timedelta` constructor failing to raise when passed an invalid keyword (:issue:`53801`)
 - Bug in :meth:`DataFrame.cumsum` which was raising ``IndexError`` if dtype is ``timedelta64[ns]`` (:issue:`57956`)
 
 Timezones

pandas/_libs/tslibs/timedeltas.pyx

Lines changed: 14 additions & 10 deletions
@@ -2006,6 +2006,20 @@ class Timedelta(_Timedelta):
                            "milliseconds", "microseconds", "nanoseconds"}
 
     def __new__(cls, object value=_no_input, unit=None, **kwargs):
+        unsupported_kwargs = set(kwargs)
+        unsupported_kwargs.difference_update(cls._req_any_kwargs_new)
+        if unsupported_kwargs or (
+            value is _no_input and
+            not cls._req_any_kwargs_new.intersection(kwargs)
+        ):
+            raise ValueError(
+                # GH#53801
+                "cannot construct a Timedelta from the passed arguments, "
+                "allowed keywords are "
+                "[weeks, days, hours, minutes, seconds, "
+                "milliseconds, microseconds, nanoseconds]"
+            )
+
         if value is _no_input:
             if not len(kwargs):
                 raise ValueError("cannot construct a Timedelta without a "
@@ -2014,16 +2028,6 @@ class Timedelta(_Timedelta):
 
         kwargs = {key: _to_py_int_float(kwargs[key]) for key in kwargs}
 
-        unsupported_kwargs = set(kwargs)
-        unsupported_kwargs.difference_update(cls._req_any_kwargs_new)
-        if unsupported_kwargs or not cls._req_any_kwargs_new.intersection(kwargs):
-            raise ValueError(
-                "cannot construct a Timedelta from the passed arguments, "
-                "allowed keywords are "
-                "[weeks, days, hours, minutes, seconds, "
-                "milliseconds, microseconds, nanoseconds]"
-            )
-
         # GH43764, convert any input to nanoseconds first and then
         # create the timedelta. This ensures that any potential
         # nanosecond contributions from kwargs parsed as floats
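The validation now runs before the `value is _no_input` branch, so an unknown keyword raises even when a positional value is given. A quick sketch of the intended behaviour (GH#53801), not part of the commit; the typo keyword is illustrative:

```python
import pandas as pd

# Supported keywords still construct a Timedelta as before
print(pd.Timedelta(days=1, hours=2))  # 1 days 02:00:00

# An unsupported keyword now raises up front, even alongside a value;
# previously this path could silently ignore the typo
try:
    pd.Timedelta("1 day", dayz=2)  # "dayz" is a deliberate typo
except ValueError as err:
    print(err)
```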

pandas/core/arrays/datetimelike.py

Lines changed: 9 additions & 2 deletions
@@ -1486,7 +1486,8 @@ def __rsub__(self, other):
             # GH#19959 datetime - datetime is well-defined as timedelta,
             # but any other type - datetime is not well-defined.
             raise TypeError(
-                f"cannot subtract {type(self).__name__} from {type(other).__name__}"
+                f"cannot subtract {type(self).__name__} from "
+                f"{type(other).__name__}[{other.dtype}]"
             )
         elif isinstance(self.dtype, PeriodDtype) and lib.is_np_dtype(other_dtype, "m"):
             # TODO: Can we simplify/generalize these cases at all?
@@ -1495,8 +1496,14 @@ def __rsub__(self, other):
             self = cast("TimedeltaArray", self)
             return (-self) + other
 
+        flipped = self - other
+        if flipped.dtype.kind == "M":
+            # GH#59571 give a more helpful exception message
+            raise TypeError(
+                f"cannot subtract {type(self).__name__} from {type(other).__name__}"
+            )
         # We get here with e.g. datetime objects
-        return -(self - other)
+        return -flipped
 
     def __iadd__(self, other) -> Self:
         result = self + other
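A sketch of how the two improved messages surface to users, assuming a build with this commit (not an authoritative transcript):

```python
import numpy as np
import pandas as pd

dti = pd.date_range("2016-01-01", periods=3)

# timedelta64 ndarray - datetime array: the message now names the
# left operand's dtype, e.g. "... from ndarray[timedelta64[ns]]"
try:
    np.array([np.timedelta64(1, "D")] * 3) - dti
except TypeError as err:
    print(err)

# Timedelta - datetime array hits the new GH#59571 branch: the flipped
# result (self - other) is datetime-like, so the subtraction is rejected
try:
    pd.Timedelta("1 day") - dti
except TypeError as err:
    print(err)  # cannot subtract DatetimeArray from Timedelta
```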

pandas/core/ops/common.py

Lines changed: 5 additions & 11 deletions
@@ -56,20 +56,14 @@ def _unpack_zerodim_and_defer(method: F, name: str) -> F:
     -------
     method
     """
-    stripped_name = name.removeprefix("__").removesuffix("__")
-    is_cmp = stripped_name in {"eq", "ne", "lt", "le", "gt", "ge"}
 
     @wraps(method)
     def new_method(self, other):
-        if is_cmp and isinstance(self, ABCIndex) and isinstance(other, ABCSeries):
-            # For comparison ops, Index does *not* defer to Series
-            pass
-        else:
-            prio = getattr(other, "__pandas_priority__", None)
-            if prio is not None:
-                if prio > self.__pandas_priority__:
-                    # e.g. other is DataFrame while self is Index/Series/EA
-                    return NotImplemented
+        prio = getattr(other, "__pandas_priority__", None)
+        if prio is not None:
+            if prio > self.__pandas_priority__:
+                # e.g. other is DataFrame while self is Index/Series/EA
+                return NotImplemented
 
         other = item_from_zerodim(other)
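For context, a minimal sketch of the user-visible effect (not part of the commit): with the Index-vs-Series special case removed, deference is governed solely by `__pandas_priority__`, so comparisons return a Series from either side (GH#36759).

```python
import pandas as pd

idx = pd.Index([1, 2, 3])
ser = pd.Series([1, 2, 4])

# Series has a higher __pandas_priority__ than Index, so Index.__eq__
# now returns NotImplemented and Python falls back to Series.__eq__;
# previously idx == ser returned a numpy array
print(type(idx == ser).__name__)  # Series
print(type(ser == idx).__name__)  # Series
```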

pandas/tests/arithmetic/common.py

Lines changed: 2 additions & 7 deletions
@@ -111,24 +111,19 @@ def xbox2(x):
             return x.astype(bool)
         return x
 
-    # rev_box: box to use for reversed comparisons
-    rev_box = xbox
-    if isinstance(right, Index) and isinstance(left, Series):
-        rev_box = np.array
-
     result = xbox2(left == right)
     expected = xbox(np.zeros(result.shape, dtype=np.bool_))
 
     tm.assert_equal(result, expected)
 
     result = xbox2(right == left)
-    tm.assert_equal(result, rev_box(expected))
+    tm.assert_equal(result, xbox(expected))
 
     result = xbox2(left != right)
     tm.assert_equal(result, ~expected)
 
     result = xbox2(right != left)
-    tm.assert_equal(result, rev_box(~expected))
+    tm.assert_equal(result, xbox(~expected))
 
     msg = "|".join(
         [

pandas/tests/arithmetic/test_datetime64.py

Lines changed: 20 additions & 10 deletions
@@ -770,11 +770,18 @@ def test_dti_cmp_tdi_tzawareness(self, other):
 
         result = dti == other
         expected = np.array([False] * 10)
-        tm.assert_numpy_array_equal(result, expected)
+        if isinstance(other, Series):
+            tm.assert_series_equal(result, Series(expected, index=other.index))
+        else:
+            tm.assert_numpy_array_equal(result, expected)
 
         result = dti != other
         expected = np.array([True] * 10)
-        tm.assert_numpy_array_equal(result, expected)
+        if isinstance(other, Series):
+            tm.assert_series_equal(result, Series(expected, index=other.index))
+        else:
+            tm.assert_numpy_array_equal(result, expected)
+
         msg = "Invalid comparison between"
         with pytest.raises(TypeError, match=msg):
             dti < other
@@ -955,7 +962,12 @@ def test_dt64arr_add_sub_td64ndarray(self, tz_naive_fixture, box_with_array):
 
         result = dtarr - tdarr
         tm.assert_equal(result, expected)
-        msg = "cannot subtract|(bad|unsupported) operand type for unary"
+        msg = "|".join(
+            [
+                "cannot subtract DatetimeArray from ndarray",
+                "cannot subtract a datelike from a TimedeltaArray",
+            ]
+        )
         with pytest.raises(TypeError, match=msg):
             tdarr - dtarr
 
@@ -1272,7 +1284,7 @@ def test_dt64arr_series_sub_tick_DateOffset(self, box_with_array):
 
         result2 = -pd.offsets.Second(5) + ser
         tm.assert_equal(result2, expected)
-        msg = "(bad|unsupported) operand type for unary"
+        msg = "cannot subtract DatetimeArray from Second"
         with pytest.raises(TypeError, match=msg):
             pd.offsets.Second(5) - ser
 
@@ -1317,9 +1329,7 @@ def test_dti_add_tick_tzaware(self, tz_aware_fixture, box_with_array):
         roundtrip = offset - scalar
         tm.assert_equal(roundtrip, dates)
 
-        msg = "|".join(
-            ["bad operand type for unary -", "cannot subtract DatetimeArray"]
-        )
+        msg = "cannot subtract DatetimeArray from"
         with pytest.raises(TypeError, match=msg):
             scalar - dates
 
@@ -1378,7 +1388,7 @@ def test_dt64arr_add_sub_relativedelta_offsets(self, box_with_array, unit):
         expected = DatetimeIndex([x - off for x in vec_items]).as_unit(exp_unit)
         expected = tm.box_expected(expected, box_with_array)
         tm.assert_equal(expected, vec - off)
-        msg = "(bad|unsupported) operand type for unary"
+        msg = "cannot subtract DatetimeArray from"
         with pytest.raises(TypeError, match=msg):
             off - vec
 
@@ -1494,7 +1504,7 @@ def test_dt64arr_add_sub_DateOffsets(
         expected = DatetimeIndex([offset + x for x in vec_items]).as_unit(unit)
         expected = tm.box_expected(expected, box_with_array)
         tm.assert_equal(expected, offset + vec)
-        msg = "(bad|unsupported) operand type for unary"
+        msg = "cannot subtract DatetimeArray from"
         with pytest.raises(TypeError, match=msg):
             offset - vec
 
@@ -1983,7 +1993,7 @@ def test_operators_datetimelike_with_timezones(self):
         result = dt1 - td1[0]
         exp = (dt1.dt.tz_localize(None) - td1[0]).dt.tz_localize(tz)
         tm.assert_series_equal(result, exp)
-        msg = "(bad|unsupported) operand type for unary"
+        msg = "cannot subtract DatetimeArray from"
         with pytest.raises(TypeError, match=msg):
             td1[0] - dt1

pandas/tests/arithmetic/test_timedelta64.py

Lines changed: 1 addition & 1 deletion
@@ -320,7 +320,7 @@ def test_subtraction_ops(self):
         with pytest.raises(TypeError, match=msg):
             td - dt
 
-        msg = "(bad|unsupported) operand type for unary"
+        msg = "cannot subtract DatetimeArray from Timedelta"
         with pytest.raises(TypeError, match=msg):
             td - dti

pandas/tests/indexes/multi/test_equivalence.py

Lines changed: 2 additions & 2 deletions
@@ -64,8 +64,8 @@ def test_equals_op(idx):
     with pytest.raises(ValueError, match="Lengths must match"):
         index_a == series_b
 
-    tm.assert_numpy_array_equal(index_a == series_a, expected1)
-    tm.assert_numpy_array_equal(index_a == series_c, expected2)
+    tm.assert_series_equal(index_a == series_a, Series(expected1))
+    tm.assert_series_equal(index_a == series_c, Series(expected2))
 
     # cases where length is 1 for one of them
     with pytest.raises(ValueError, match="Lengths must match"):

pandas/tests/indexes/test_old_base.py

Lines changed: 2 additions & 2 deletions
@@ -560,8 +560,8 @@ def test_equals_op(self, simple_index):
         with pytest.raises(ValueError, match=msg):
             index_a == series_b
 
-        tm.assert_numpy_array_equal(index_a == series_a, expected1)
-        tm.assert_numpy_array_equal(index_a == series_c, expected2)
+        tm.assert_series_equal(index_a == series_a, Series(expected1))
+        tm.assert_series_equal(index_a == series_c, Series(expected2))
 
         # cases where length is 1 for one of them
         with pytest.raises(ValueError, match="Lengths must match"):
