Commit 5773b7f
perf: iterate over generators when writing datafiles to reduce memory pressure (apache#2671)
# Rationale for this change
When writing to partitioned tables, there is a large memory spike when
the partitions are computed because we `.combine_chunks()` on the new
partitioned arrow tables and we materialize the entire list of
partitions before writing data files.
This PR switches the partition computation to a generator to avoid
materializing all the partitions in memory at once, reducing the memory
overhead of writing to partitioned tables.
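To illustrate the idea, here is a minimal stdlib-only sketch, not the actual PyIceberg code. The names `iter_partitions` and `write_partitions` are hypothetical, and plain Python dicts stand in for Arrow tables. The point is only that the writer pulls one partition at a time from a generator instead of receiving a fully built list.

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]


def iter_partitions(rows: Iterable[Row], key: str) -> Iterator[Tuple[object, List[Row]]]:
    """Hypothetical helper: yield (partition_value, rows) one partition at a time."""
    # Sorting makes rows with the same key contiguous, so each group is one partition.
    ordered = sorted(rows, key=itemgetter(key))
    for value, group in groupby(ordered, key=itemgetter(key)):
        # Only this partition's row list is built here; earlier ones can be
        # garbage collected once the consumer drops them.
        yield value, list(group)


def write_partitions(partitions: Iterable[Tuple[object, List[Row]]], key: str) -> List[str]:
    """Hypothetical writer: consumes partitions lazily and returns fake file names."""
    written = []
    for value, part in partitions:
        written.append(f"{key}={value}/{len(part)}-rows")
    return written


rows = [
    {"region": "eu", "v": 1},
    {"region": "us", "v": 2},
    {"region": "eu", "v": 3},
]
files = write_partitions(iter_partitions(rows, "region"), "region")
print(files)
```

Tests that need every partition at once can still wrap the generator in `list(...)`, which matches how the existing tests were updated.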
## Are these changes tested?
No new tests. The tests using this method were updated to consume the
generator as a list.
However, in my personal use case, I am using
`pa.total_allocated_bytes()` to determine memory allocation before and
after the write and see the following across 5 writes of ~128 MB:
| Run | Original Impl (Before Write) | Original Impl (After Write) | Iters (Before Write) | Iters (After Write) |
|---|---|---|---|---|
| 1 | 29.31 MB | 151.62 MB | 28.38 MB | 30.40 MB |
| 2 | 27.74 MB | 151.62 MB | 28.85 MB | 30.36 MB |
| 3 | 28.81 MB | 151.62 MB | 28.52 MB | 31.29 MB |
| 4 | 28.71 MB | 151.62 MB | 29.27 MB | 30.64 MB |
| 5 | 28.60 MB | 151.61 MB | 28.29 MB | 31.11 MB |
The overhead of the original implementation scales with the size of the
write: to write a 3 GB Arrow table to a partitioned table, I need at
least 6 GB of RAM.
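For anyone who wants to reproduce this kind of measurement, a rough sketch follows. The helper name `allocation_before_after_mb` is hypothetical, and dividing by 1024 * 1024 is an assumption about how the MB figures above were computed. The allocation counter is passed in as a callable so the helper works with `pa.total_allocated_bytes` or any other counter.

```python
from typing import Callable, Tuple

MB = 1024 * 1024  # assumption: binary megabytes


def allocation_before_after_mb(
    action: Callable[[], object],
    allocated_bytes: Callable[[], int],
) -> Tuple[float, float]:
    """Hypothetical helper: read the counter, run the action, read the counter again."""
    before = allocated_bytes()
    action()
    after = allocated_bytes()
    return before / MB, after / MB


# Demo with a fake counter that reports 10 MB, then 12 MB.
readings = iter([10 * MB, 12 * MB])
before_mb, after_mb = allocation_before_after_mb(lambda: None, lambda: next(readings))
print(f"before={before_mb:.2f} MB after={after_mb:.2f} MB")
```

With PyArrow installed, the counter would be `pa.total_allocated_bytes` and the action would be the write itself, for example a lambda that appends the Arrow table to the partitioned table.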
## Are there any user-facing changes?
No.

1 parent 8878b2c commit 5773b7f
2 files changed: +21 -28 lines changed

First file, changed hunks (original line numbers): 2790-2819, 2824-2830, 2852-2859, 2880-2891
Second file, changed hunks (original line numbers): 2479-2485, 2518-2524, 2550-2556, 2621-2627