
[Data] Add fillna and dropna functions #54844


Open: wants to merge 3 commits into base: master


soffer-anyscale
Contributor

Why are these changes needed?

Null handling is a basic, common ETL requirement that other data frameworks already support. Ray Data needs the same null-handling features for feature parity and for ML dataset preprocessing.

This PR adds ds.fillna and ds.dropna, modeled after their pandas and PySpark counterparts.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
soffer-anyscale requested a review from a team as a code owner July 22, 2025 22:55

soffer-anyscale changed the title from "inital commit" to "[Data] Add fillna and dropna functions" Jul 22, 2025
soffer-anyscale requested a review from Copilot July 22, 2025 22:56
soffer-anyscale added the "data" (Ray Data-related issues) label Jul 22, 2025
Contributor

Copilot AI left a comment


Pull Request Overview

This PR adds two essential null handling functions (fillna and dropna) to Ray Data, bringing it to feature parity with pandas and PySpark for basic ETL operations and ML dataset preprocessing.

Key changes include:

  • Implementation of fillna method to replace missing values with scalar or column-specific values
  • Implementation of dropna method to remove rows containing missing values with flexible filtering options
  • Comprehensive test suites covering edge cases and different data types
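
To make the semantics in the list above concrete, here is a minimal pure-Python sketch over rows represented as dicts. It is not the PR's implementation (which, per the file summary, is PyArrow-based). The helper `is_missing`, the choice to treat both None and NaN as missing, and the `how` / `subset` parameter names (borrowed from pandas' `dropna`) are assumptions made for illustration; the actual Ray Data signatures may differ.

```python
import math
from typing import Any, Dict, Iterable, List, Optional

Row = Dict[str, Any]


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fillna(rows: Iterable[Row], value: Any) -> List[Row]:
    """Replace missing cells with a scalar, or with per-column values from a dict."""
    filled = []
    for row in rows:
        new_row = {}
        for column, cell in row.items():
            if is_missing(cell):
                if isinstance(value, dict):
                    # Columns absent from the mapping are left unchanged.
                    cell = value.get(column, cell)
                else:
                    cell = value
            new_row[column] = cell
        filled.append(new_row)
    return filled


def dropna(
    rows: Iterable[Row],
    how: str = "any",
    subset: Optional[List[str]] = None,
) -> List[Row]:
    """Drop rows where any (or all) of the inspected columns are missing."""
    if how not in ("any", "all"):
        raise ValueError(f"how must be 'any' or 'all', got {how!r}")
    kept = []
    for row in rows:
        columns = subset if subset is not None else list(row)
        # A column that is absent from the row counts as missing here.
        flags = [is_missing(row.get(column)) for column in columns]
        drop = any(flags) if how == "any" else (bool(flags) and all(flags))
        if not drop:
            kept.append(row)
    return kept


# Hypothetical sample data.
rows = [
    {"id": 1, "score": 0.5, "label": "a"},
    {"id": 2, "score": None, "label": None},
    {"id": 3, "score": float("nan"), "label": "c"},
]

print(fillna(rows, 0)[1])                    # {'id': 2, 'score': 0, 'label': 0}
print(fillna(rows, {"score": -1.0})[1])      # {'id': 2, 'score': -1.0, 'label': None}
print([r["id"] for r in dropna(rows)])       # [1]
print([r["id"] for r in dropna(rows, how="all", subset=["score", "label"])])  # [1, 3]
```

Returning new row dicts instead of mutating the input keeps the sketch side-effect free, which matches how dataset transformations are usually expected to behave.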

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| python/ray/data/dataset.py | Adds public API methods fillna and dropna with comprehensive documentation and examples |
| python/ray/data/_internal/logical/operators/map_operator.py | Implements logical operators FillNa and DropNa for the execution framework |
| python/ray/data/_internal/planner/plan_udf_map_op.py | Implements planning functions with PyArrow-based transformations for null handling |
| python/ray/data/_internal/planner/planner.py | Registers the new logical operators with their planning functions |
| python/ray/data/tests/test_fillna.py | Comprehensive test suite for fillna functionality |
| python/ray/data/tests/test_dropna.py | Comprehensive test suite for dropna functionality |
| python/ray/data/BUILD | Adds build targets for the new test files |

Comments suppressed due to low confidence (6)

python/ray/data/tests/test_fillna.py:105

  • [nitpick] The variable name 'i' is not used in the loop body. Consider using '_' instead to indicate it's unused.
    for i, (actual, exp) in enumerate(zip(rows, expected)):

python/ray/data/tests/test_dropna.py:53

  • [nitpick] The variable name 'i' is not used in the loop body. Consider using '_' instead to indicate it's unused.
    for i, (actual, exp) in enumerate(zip(rows, expected)):

python/ray/data/tests/test_fillna.py:192

  • [nitpick] The variable name 'i' is not used in the loop body. Consider using '_' instead to indicate it's unused.
    for i, (actual, exp) in enumerate(zip(rows, expected)):

python/ray/data/tests/test_dropna.py:104

  • [nitpick] The variable name 'i' is not used in the loop body. Consider using '_' instead to indicate it's unused.
    for i, (actual, exp) in enumerate(zip(rows, expected)):

python/ray/data/tests/test_fillna.py:220

  • [nitpick] The variable name 'i' is not used in the loop body. Consider using '_' instead to indicate it's unused.
    for i, (actual, exp) in enumerate(zip(rows, expected)):

python/ray/data/tests/test_dropna.py:208

  • [nitpick] The variable name 'i' is not used in the loop body. Consider using '_' instead to indicate it's unused.
    for i, (actual, exp) in enumerate(zip(rows, expected)):
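
The six nitpicks above share one pattern, so a single sketch may cover them. The loop bodies in the PR's tests are not shown here, so the helpers below (`rows_match`, `row_mismatches`) and the sample data are hypothetical. The first variant drops `enumerate` entirely when the index is unused, which is arguably cleaner than renaming `i` to `_`; the second keeps `enumerate` and puts the index to work in the failure message.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def rows_match(rows: List[Row], expected: List[Row]) -> bool:
    """Compare row by row without an index; no unused loop variable remains."""
    # zip() stops at the shorter input, so check lengths explicitly.
    if len(rows) != len(expected):
        return False
    for actual, exp in zip(rows, expected):
        if actual != exp:
            return False
    return True


def row_mismatches(rows: List[Row], expected: List[Row]) -> List[str]:
    """Keep enumerate, but use the index to report which row differs."""
    return [
        f"row {i}: expected {exp!r}, got {actual!r}"
        for i, (actual, exp) in enumerate(zip(rows, expected))
        if actual != exp
    ]


# Hypothetical data standing in for a test's actual and expected rows.
rows = [{"id": 1, "value": 0}, {"id": 2, "value": 5}]
expected = [{"id": 1, "value": 0}, {"id": 2, "value": 7}]

print(rows_match(rows, expected))      # False
print(row_mismatches(rows, expected))  # one message, for row 1
```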
