@Yicong-Huang
Contributor

What changes were proposed in this pull request?

This PR separates SQL_GROUPED_AGG_PANDAS_UDF and SQL_WINDOW_AGG_PANDAS_UDF into a dedicated serializer ArrowStreamAggPandasUDFSerializer, aligning with the existing ArrowStreamAggArrowUDFSerializer architecture.

Why are the changes needed?

  1. Input/Output type differences: Aggregation UDFs (SQL_GROUPED_AGG_PANDAS_UDF and SQL_WINDOW_AGG_PANDAS_UDF) have different input/output types compared to grouped map UDFs:

    • Aggregation UDFs: Input is pd.Series (entire group/partition), output is scalar
    • Grouped map UDFs: Input is (keys, vals) where vals is pd.DataFrame, output is pd.DataFrame
  2. Multi-UDF support: Aggregation UDFs support multiple UDFs in a single projection/aggregation, while grouped map UDFs do not.
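The two differences above can be sketched in plain pandas, without Spark. This is only an illustration of the function shapes, not the serializer's implementation: `mean_agg`, `max_agg`, and `center_map` are hypothetical user functions, and the loop over `groupby` is a stand-in for whatever the worker actually does.

```python
import pandas as pd


# Hypothetical aggregation-style functions: a whole group arrives as a
# pd.Series and a single scalar is returned.
def mean_agg(v: pd.Series) -> float:
    return float(v.mean())


def max_agg(v: pd.Series) -> float:
    return float(v.max())


# Hypothetical grouped-map-style function: receives the key and the group's
# pd.DataFrame, and returns a pd.DataFrame.
def center_map(key, pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())


df = pd.DataFrame({"id": [1, 1, 2, 2, 2], "v": [1.0, 2.0, 3.0, 5.0, 10.0]})

# Aggregation: several functions can be evaluated over the same group,
# each contributing one scalar column to the output row.
agg_rows = []
for key, pdf in df.groupby("id"):
    agg_rows.append(
        {"id": key, "mean_v": mean_agg(pdf["v"]), "max_v": max_agg(pdf["v"])}
    )
agg_result = pd.DataFrame(agg_rows)

# Grouped map: exactly one function per group, DataFrame in and DataFrame out.
map_result = pd.concat(
    [center_map(key, pdf) for key, pdf in df.groupby("id")],
    ignore_index=True,
)
```

Here `agg_result` has one row per group with one column per aggregation function, while `map_result` keeps the row count of the input. That shape mismatch is the motivation the description gives for a separate serializer.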

Does this PR introduce any user-facing change?

No. This is an internal refactoring that does not change the public API or behavior. The serialization logic remains functionally equivalent.

How was this patch tested?

All existing tests continue to pass, and a new multi-UDF test (test_pandas_udf_window.py::WindowPandasUDFTests::test_multiple_udfs) was added.

Was this patch authored or co-authored using generative AI tooling?

No

@dongjoon-hyun changed the title from "[SPARK-54531] Introduce ArrowStreamAggPandasUDFSerializer" to "[SPARK-54531][PYTHON] Introduce ArrowStreamAggPandasUDFSerializer" on Nov 26, 2025
@Yicong-Huang changed the title from "[SPARK-54531][PYTHON] Introduce ArrowStreamAggPandasUDFSerializer" to "[SPARK-54531][CORE][PYTHON][SQL] Introduce ArrowStreamAggPandasUDFSerializer" on Nov 26, 2025
@zhengruifeng changed the title from "[SPARK-54531][CORE][PYTHON][SQL] Introduce ArrowStreamAggPandasUDFSerializer" to "[SPARK-54531][PYTHON] Introduce ArrowStreamAggPandasUDFSerializer" on Nov 27, 2025
@zhengruifeng
Contributor

remove [CORE] and [SQL] from the title, since it's not related to spark core or sql

@Yicong-Huang
Contributor Author

remove [CORE] and [SQL] from the title, since it's not related to spark core or sql

I added them according to the labels added by GitHub Actions. Are they accurate?
[Screenshot 2025-11-26 at 8:41:57 PM]

@zhengruifeng
Contributor

remove [CORE] and [SQL] from the title, since it's not related to spark core or sql

I added them according to the labels added by GitHub Actions. Are they accurate?

they are not very accurate

@zhengruifeng
Contributor

merged to master
