
Conversation

ding-young
Contributor

@ding-young ding-young commented Jul 18, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

  1. In each benchmark (clickbench, tpch, sort-tpch, imdb, h2o), we now call print_memory_stats() to print memory usage statistics via mimalloc. (This only works when compiled with --features mimalloc_extended; a sketch of such a helper is shown after this list.)

  2. A new utility mem_profile is added. It builds dfbench with the mimalloc_extended feature enabled, and then runs each benchmark query in a separate subprocess to collect memory stats. The utility captures the subprocess’s stdout and summarizes the results for all queries.
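For reference, here is a minimal sketch of what such a print_memory_stats() helper could look like, assuming the mimalloc_extended feature forwards to libmimalloc-sys's extended feature (which exposes mimalloc's mi_process_info). The fields printed here are illustrative, not the PR's actual output format.

```rust
// Sketch only: assumes libmimalloc-sys is built with its `extended` feature,
// which exposes `mi_process_info` from mimalloc's C API.
use libmimalloc_sys::mi_process_info;

pub fn print_memory_stats() {
    let (mut elapsed, mut user, mut system) = (0usize, 0usize, 0usize);
    let (mut current_rss, mut peak_rss) = (0usize, 0usize);
    let (mut current_commit, mut peak_commit, mut page_faults) = (0usize, 0usize, 0usize);
    unsafe {
        mi_process_info(
            &mut elapsed,
            &mut user,
            &mut system,
            &mut current_rss,
            &mut peak_rss,
            &mut current_commit,
            &mut peak_commit,
            &mut page_faults,
        );
    }
    // Printed on stdout so a wrapping process (mem_profile) can capture and parse it.
    println!("peak_rss_bytes: {peak_rss}");
    println!("peak_commit_bytes: {peak_commit}");
    println!("page_faults: {page_faults}");
}
```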

Are these changes tested?

Yes. Via
(Updated)

# Run all queries
cargo run --profile release-nonlto --bin mem_profile -- tpch --path benchmarks/data/tpch_sf1 --partitions 4 --format parquet
cargo run --profile release-nonlto --bin mem_profile -- h2o
cargo run --profile release-nonlto --bin mem_profile -- clickbench
cargo run --profile release-nonlto --bin mem_profile -- imdb --path benchmarks/data/imdb/
cargo run --profile release-nonlto --bin mem_profile -- sort-tpch --path benchmarks/data/tpch_sf1 --partitions 4

and

# Run a specific query
cargo run --profile release-nonlto --bin mem_profile -- tpch --path benchmarks/data/tpch_sf1 --partitions 4 --format parquet --query 1

# Set the benchmark (dfbench) build profile to release-nonlto

cargo run --profile release-nonlto --bin mem_profile -- --bench-profile release-nonlto tpch --path benchmarks/data/tpch_sf1 --partitions 4 --format parquet --query 1

Are there any user-facing changes?

Comment on lines 325 to 335
# Profiling Memory Stats for each benchmark query
The `mem_profile` program wraps benchmark execution to measure memory usage statistics, such as peak RSS. It runs each benchmark query in a separate subprocess, capturing the child process’s stdout to print structured output.

Subcommands supported by `mem_profile` are a subset of those in `dfbench`.
Currently supported benchmarks: Clickbench, H2o, Imdb, SortTpch, Tpch.

Before running benchmarks, `mem_profile` automatically compiles the benchmark binary (`dfbench`) using `cargo build` with the same cargo profile (e.g., --release) as mem_profile itself. By prebuilding the binary and running each query in a separate process, we can ensure accurate memory statistics.

Currently, `mem_profile` only supports `mimalloc` as the memory allocator, since it relies on `mimalloc`'s API to collect memory statistics.

Because it runs the compiled binary directly from the target directory, make sure your working directory is the top-level datafusion/ directory, where the target/ directory is also located.
Contributor Author


Here's more description about this utility and supported metrics.
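As a rough illustration of the wrapping approach described in the README excerpt above, the sketch below prebuilds `dfbench` and runs a single query in a child process while capturing its stdout. The exact cargo arguments, dfbench arguments, and the `peak_rss` filter are illustrative assumptions, not the actual mem_profile implementation.

```rust
// Sketch only: mirrors the build-then-spawn flow described above.
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Hypothetical; the real tool derives this from its own cargo profile / --bench-profile.
    let profile = "release-nonlto";

    // 1. Prebuild dfbench with the mimalloc_extended feature so stats are printed.
    //    (The exact cargo invocation, e.g. package selection, is an assumption.)
    let status = Command::new("cargo")
        .args(["build", "--profile", profile, "--bin", "dfbench"])
        .args(["--features", "mimalloc_extended"])
        .status()?;
    assert!(status.success(), "failed to build dfbench");

    // 2. Run one benchmark query in a separate child process and capture its stdout.
    let target_dir =
        std::env::var("CARGO_TARGET_DIR").unwrap_or_else(|_| "target".to_string());
    let output = Command::new(format!("{target_dir}/{profile}/dfbench"))
        .args(["tpch", "--path", "benchmarks/data/tpch_sf1", "--query", "1"])
        .output()?;

    // 3. The child prints memory stats on stdout; summarize them here.
    let stdout = String::from_utf8_lossy(&output.stdout);
    for line in stdout.lines().filter(|l| l.contains("peak_rss")) {
        println!("captured: {line}");
    }
    Ok(())
}
```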

@ding-young ding-young marked this pull request as ready for review July 25, 2025 03:19
@ding-young
Contributor Author

@2010YOUY01 This is ready for review :) I would love to hear your feedback.

Contributor

@2010YOUY01 2010YOUY01 left a comment


Thank you, this is great! Using the internal metrics provided by mimalloc also lets us check additional information inside mimalloc, if someone is interested in that in the future.

I have several suggestions

  1. The --profile option currently seems to apply only to the mem_profile binary, while the dfbench binary uses the default --profile release option and takes a long time to compile. Is it possible to make that configurable as well?
cargo run --profile release-nonlto --bin mem_profile -- tpch --query 1 --path benchmarks/data/tpch_sf1 --partitions 4 --format parquet
...
 Finished `release-nonlto` profile [optimized] target(s) in 2m 17s
     Running `target/release-nonlto/mem_profile tpch --query 1 --path benchmarks/data/tpch_sf1 --partitions 4 --format parquet`
...
 Compiling datafusion-benchmarks v49.0.0 (/Users/yongting/Code/datafusion/benchmarks)
    Finished `release` profile [optimized] target(s) in 7m 42s
  2. I suggest adding an e2e test for this feature. You can use the ci feature flag so it can be run in CI when the benchmark data is ready (see the sketch after this list).
    Here is an example:

    #[cfg(feature = "ci")]

  3. Open issues to integrate this into the ./bench.sh and ./bench.sh compare branch-a branch-b utilities.
    Ideally, it could be run like

MEM_PROFILE=true ./bench.sh run tpch
./bench.sh compare branch-a branch-b

And the result would compare both time and memory usage side by side.
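To illustrate the ci-gating suggestion above, here is a minimal sketch of how such an end-to-end test could be wired up. The test name, binary path handling, and output check are illustrative assumptions, not the test that was eventually added to the PR.

```rust
// Sketch only: compiled and run only when the `ci` feature is enabled,
// e.g. `cargo test --features ci`. Names and assertions are illustrative.
#[cfg(feature = "ci")]
#[test]
fn mem_profile_smoke_test() {
    use std::process::Command;

    let target_dir =
        std::env::var("CARGO_TARGET_DIR").unwrap_or_else(|_| "target".to_string());
    let output = Command::new(format!("{target_dir}/release/mem_profile"))
        .args(["tpch", "--query", "6", "--path", "benchmarks/data/tpch_sf1"])
        .output()
        .expect("failed to run mem_profile");

    assert!(output.status.success());
    // Expect some memory statistic to appear in the summarized output.
    let stdout = String::from_utf8_lossy(&output.stdout);
    assert!(stdout.to_lowercase().contains("rss"));
}
```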


Currently, `mem_profile` only supports `mimalloc` as the memory allocator, since it relies on `mimalloc`'s API to collect memory statistics.

Because it runs the compiled binary directly from the target directory, make sure your working directory is the top-level datafusion/ directory, where the target/ directory is also located.
Contributor


I suggest adding it (that it should be run under datafusion/) to the error message.

Currently, the following panic is triggered if it's run from a different directory.

Benchmark binary built successfully.

thread 'main' panicked at benchmarks/src/bin/mem_profile.rs:152:10:
Failed to start benchmark: Os { code: 2, kind: NotFound, message: "No such file or directory" }

@ding-young ding-young force-pushed the memory-profiling branch 2 times, most recently from 31747e7 to 796b79b Compare July 28, 2025 06:06
Comment on lines +146 to +152
let command = format!("{target_dir}/{profile}/dfbench");
// Check whether benchmark binary exists
if !Path::new(&command).exists() {
    panic!(
        "Benchmark binary not found: `{command}`\nRun this command from the top-level `datafusion/` directory so `target/{profile}/dfbench` can be found.",
    );
}
Contributor Author


Added the check with an error message (reply to #16814 (comment))

Comment on lines +309 to +312
#[test]
// This test checks whether `mem_profile` runs successfully and produces expected output
// using TPC-H query 6 (which runs quickly).
fn mem_profile_e2e_tpch_q6() -> Result<()> {
Contributor Author


e2e test added here

Comment on lines +144 to +146
let target_dir =
    env::var("CARGO_TARGET_DIR").unwrap_or_else(|_| "target".to_string());
let command = format!("{target_dir}/{profile}/dfbench");
Contributor Author


Also updated to respect CARGO_TARGET_DIR so the CI test can run (since cargo test's working directory may not be the top-level directory).

Contributor

@2010YOUY01 2010YOUY01 left a comment


LGTM, thanks again!

Just one nit: when a new benchmark is added, we have to update mem_profile.rs to add memory profiling support; it would be better to indicate that in the error message.

 cargo run --profile release-nonlto --bin mem_profile -- --bench-profile release-nonlto nlj

error: Found argument 'nlj' which wasn't expected, or isn't valid in this context

Suggested: error: 'nlj' is not supported for memory profiling. If this is a new benchmark, please update the memory profiling benchmarks to include it.
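One hypothetical way to surface such a friendlier message, assuming the attempted benchmark name is available as a string; the constant and function below are illustrative and not part of the actual mem_profile.rs (which parses subcommands with clap).

```rust
// Sketch only: map an unsupported benchmark name to the suggested error text.
const SUPPORTED: &[&str] = &["clickbench", "h2o", "imdb", "sort-tpch", "tpch"];

fn print_unsupported_hint(name: &str) {
    if !SUPPORTED.contains(&name) {
        eprintln!(
            "error: '{name}' is not supported for memory profiling. \
             If this is a new benchmark, please update mem_profile.rs to include it. \
             Supported benchmarks: {}",
            SUPPORTED.join(", ")
        );
    }
}

fn main() {
    print_unsupported_hint("nlj");
}
```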

@ding-young
Contributor Author

@2010YOUY01 Thank you! I mentioned your suggestion in a follow-up issue, and I'll address it when I add support for the nlj benchmark :)

@UBarney
Contributor

UBarney commented Jul 28, 2025

Personally, I prefer to use time -v dfbench xxxxx to get memory information. The main advantage is that it doesn't require any extra compilation of dfbench.

@2010YOUY01
Contributor

Personally, I prefer to use time -v dfbench xxxxx to get memory information. The main advantage is that it doesn't require any extra compilation of dfbench.

Another advantage of this external profiling approach is that we can compare different memory allocators side by side (e.g., jemalloc and mimalloc). I'm interested in doing this because I've heard that mimalloc is not optimized for the most efficient memory usage—for example, if the application only needs 1GB of memory, it might hold >> 1GB to make allocation faster. Additionally, jemalloc seems to have a tuning option to optimize for space, which would be great in memory-limited cases.

However, this PR's approach allows us to inspect more internal mimalloc metrics, which I find interesting too.

If there's another external profiling implementation, I'm happy to review it as well.

@ding-young
Contributor Author

ding-young commented Jul 28, 2025

Personally, I prefer to use time -v dfbench xxxxx to get memory information. The main advantage is that it doesn't require any extra compilation of dfbench.

@UBarney Thanks for the feedback!

I actually considered writing a Python script (+ time -v or reading /proc/stat) instead, for the same reason you mentioned. But I eventually chose this approach because I wanted to make sure that the utility user compiles the benchmark binary ahead of time, and for the reason @2010YOUY01 mentioned (although max RSS will be our primary focus).

Anyway, I agree the double compilation step is not so desirable. If you think it is too painful, I’m definitely open to alternative ideas — for example, rewriting mem_profile.rs as a Python script.

@UBarney
Contributor

UBarney commented Jul 30, 2025

Another advantage of this external profiling approach is that we can compare different memory allocators side by side (e.g., jemalloc and mimalloc).

From what I can see, the new utility in this PR is designed to get RSS statistics specifically from binaries using mimalloc. If that's the case, we wouldn't be able to use it to perform a side-by-side RSS comparison between different memory allocators (like jemalloc and mimalloc). This Python-based approach was capable of this.

It also seems we can already compare the timing of different allocators with the bench.sh script by setting this environment variable

CARGO_COMMAND=${CARGO_COMMAND:-"cargo run --release"}

(e.g. CARGO_COMMAND='cargo run --no-default-features --features snmalloc' ./benchmarks/bench.sh run tpch10)

@2010YOUY01 2010YOUY01 merged commit 853eee0 into apache:main Jul 31, 2025
29 checks passed
@2010YOUY01
Contributor

Thanks again @ding-young and @UBarney

Standing-Man pushed a commit to Standing-Man/datafusion that referenced this pull request Aug 4, 2025
* add benchmark utility to profile memory usage

* get memory stats from mimalloc, not procfs

* support more benchmarks

* update benchmarks/README and refactor

* fix sort-tpch output format & taplo format

* add e2e test & comments

* update description to major page fault

---------

Co-authored-by: Yongting You <2010youy01@gmail.com>

Successfully merging this pull request may close these issues.

Support memory profiling in benchmarks