- 
                Notifications
    
You must be signed in to change notification settings  - Fork 669
 
FIX-#4479: Prevent users from using a local filepath when performing a distributed write #4484
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…n performing a distributed write Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
          Codecov Report
 @@            Coverage Diff             @@
##           master    #4484      +/-   ##
==========================================
- Coverage   86.82%   78.21%   -8.62%     
==========================================
  Files         222      230       +8     
  Lines       18038    23692    +5654     
==========================================
+ Hits        15662    18530    +2868     
- Misses       2376     5162    +2786     
 📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RehanSD thanks for the change. I have some style suggestions.
        
          
                modin/core/execution/ray/implementations/pandas_on_ray/io/io.py
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                modin/core/execution/ray/implementations/pandas_on_ray/io/io.py
              
                Outdated
          
            Show resolved
            Hide resolved
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this have to be added for to_sql as well?
| if not cls._to_csv_check_support(kwargs): | ||
| return RayIO.to_csv(qc, **kwargs) | ||
| 
               | 
          ||
| if len(ray.nodes()) > 1: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be moved to _to_csv_check_support, ditto for parquet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, but _to_csv_check_support currently just checks if we can do a parallel write, so it would be a bit of a paradigm shift to raise an error here, unless you're suggesting we return false and default to a serial write on the head node if a local path is passed?
| if S3_ADDRESS_REGEX.match(path_or_buf) is not None or "://" in path_or_buf: | ||
| return False # S3 or network path. | ||
| if isinstance(path_or_buf, str) or isinstance(path_or_buf, pathlib.PurePath): | ||
| return os.path.exists(path_or_buf) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code as is will not warn users for the common case where they try to write to a non-existent path, i.e. when they want to create and write to a new file. We should warn even if the path doesn't exist locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to check device_id instead! I originally did this because apparently os.path.exists returns false for networked file paths. I believe it should now work for both non-existing paths and networked paths.
Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru>
Co-authored-by: Mahesh Vashishtha <mvashishtha@users.noreply.github.com>
Co-authored-by: Mahesh Vashishtha <mvashishtha@users.noreply.github.com>
Co-authored-by: Mahesh Vashishtha <mvashishtha@users.noreply.github.com>
| 
           We don't need to do this for   | 
    
Signed-off-by: Rehan Durrani <rehan@ponder.io>
…tent local paths Signed-off-by: Rehan Durrani <rehan@ponder.io>
| 
           This pull request introduces 1 alert when merging dcadcf4 into 60518ca - view on LGTM.com new alerts: 
  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RehanSD thanks for the changes. I have two minor suggestions. Do we have ray cluster unit tests to which we can add a case for this? Otherwise, have you tested this manually?
Co-authored-by: Mahesh Vashishtha <mvashishtha@users.noreply.github.com>
Co-authored-by: Mahesh Vashishtha <mvashishtha@users.noreply.github.com>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
| 
           This pull request introduces 1 alert when merging 0608f9e into 958c26a - view on LGTM.com new alerts: 
  | 
    
        
          
                modin/core/execution/ray/implementations/pandas_on_ray/io/io.py
              
                Outdated
          
            Show resolved
            Hide resolved
        
      Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani <rehan@ponder.io>
Signed-off-by: Rehan Durrani rehan@ponder.io
What do these changes do?
Prevent users from specifying a local file path on a distributed write.
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.pyblack --check modin/ asv_bench/benchmarks scripts/doc_checker.pygit commit -sdocs/development/architecture.rstis up-to-date