I am an economist who does a lot of work with python polars, but much of the upstream data I use is in SAS.
Thanks for this package! I forked/extended it to read sas data to an arrow sink to pass to polars, in polars_readstat. It scales much better than readstat for batch reading very large data sets (for SAS, as in 100 million rows) and doesn't suffer from some of the shortcomings of the pandas sas reader.
If you had the time and inclination, I'd love to get your feedback on the changes I made (or mostly Claude made on the cpp side...). I am super out of my element with C++ (I work in python and lots of statistical languages like R and Stata, with some background in C# and beginner's knowledge of Rust). The relevant changes to the code are:
- Arrow sink
  - This reads the data to an arrow RecordBatch
  - I had to add iconv to map string values to utf-8, since arrow and polars only accept utf-8 strings (without this, my code would crash on a string like ÿ)
  - For windows, I use a local version of iconv to match the version used in readstat/readstat-rs in my package (rather than using conan, which presumably would have been simpler here)
- Decompression fix for large binary compressed files with very, very specific patterns of repeated values
  - Think many columns of mostly repeated values like "-1"
  - This didn't show up, as far as I could tell, in the test data, but did in some of the actual data I work with. I created an MRE file of meaningless data that replicates it and tested against that. It was actually kind of hard to create one that worked, since it depends on the number of columns and the types of repeated values.
- Arrow ffi (.h file, .cpp file) so that I can pass the data zero copy from C++ -> rust polars -> python polars
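To illustrate the utf-8 point from the Arrow sink item, here is a minimal Python sketch of why a value like ÿ breaks without transcoding. It is not the C++/iconv code from the fork. The helper names `is_valid_utf8` and `to_utf8` are made up for this example, and it assumes the source column is Latin-1 encoded, which will not be true for every SAS file.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes can be decoded as utf-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(raw: bytes, source_encoding: str = "latin-1") -> bytes:
    """Transcode bytes from an assumed source encoding to utf-8.

    This plays the role iconv plays on the C++ side; the Latin-1 default
    is an assumption for this sketch, not something read from the file.
    """
    return raw.decode(source_encoding).encode("utf-8")


# 'ÿ' is the single byte 0xFF in Latin-1, which is never valid in utf-8.
raw = b"\xff"
print(is_valid_utf8(raw))           # False: a utf-8-only consumer rejects this
print(is_valid_utf8(to_utf8(raw)))  # True: after transcoding it is two bytes, 0xC3 0xBF
```

Pure ASCII values pass through unchanged, so the transcoding only changes strings that contain bytes above 0x7F.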
I very much understand if you do not have the time or interest in helping/looking things over. Either way, thanks again!