Skip to content

Conversation

@godnight10061
Copy link
Contributor

@godnight10061 godnight10061 commented Jan 7, 2026

This PR tries to reduce time_elapsed_opening for Vortex scans (observed vs Parquet).

Changes:

  • vortex-file: shrink default footer initial read from 1MiB to MAX_POSTSCRIPT_SIZE + EOF_SIZE (~64KiB) and add a regression test.
  • vortex-scan: make ScanBuilder::into_stream() lazy (defer prepare() / split registration until first poll) and add a unit test to ensure stream construction has no split-planning side effects.
  • vortex-datafusion: expose the footer initial read size as a format option (footer_initial_read_size_bytes) and plumb it into VortexOpenOptions::with_initial_read_size.

Notes:

  • Scan planning errors now surface on first poll instead of during into_stream() construction.
  • If the footer/schema/layout don’t fit in the initial window, read_footer will issue additional reads as before.

Tests:

  • cargo +nightly fmt --all --check
  • cargo clippy -p vortex-datafusion --all-targets --all-features -- -D warnings
  • cargo test --locked -p vortex-file -p vortex-scan -p vortex-io
  • cargo test --locked -p vortex-datafusion

Related: #4677

@AdamGS
Copy link
Contributor

AdamGS commented Jan 7, 2026

I think that for many modern setups, 1Mb and 64KB are pretty close in terms of latency, maybe the takeaway here is to expose that as a config for the DataFusion integration?
The deferred scan stream looking very promising, I'll give it a deeper look tomorrow.

@AdamGS AdamGS assigned AdamGS and unassigned AdamGS Jan 7, 2026
@AdamGS AdamGS self-requested a review January 7, 2026 17:06
@codecov
Copy link

codecov bot commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 70.96774% with 45 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.91%. Comparing base (be977c0) to head (57cc0af).
⚠️ Report is 1 commits behind head on develop.

Files with missing lines Patch % Lines
vortex-scan/src/scan_builder.rs 42.10% 44 Missing ⚠️
vortex-file/src/open.rs 98.18% 1 Missing ⚠️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@joseph-isaacs joseph-isaacs added the benchmark Run benchmarks on this branch label Jan 7, 2026
@AdamGS
Copy link
Contributor

AdamGS commented Jan 7, 2026

Seems like running benchmarks from forks is still painful, I'm working on fixing it so we can get some numbers here.

@AdamGS AdamGS added performance Release label indicating an improvement to performance benchmark-sql and removed benchmark Run benchmarks on this branch labels Jan 7, 2026
@godnight10061
Copy link
Contributor Author

godnight10061 commented Jan 8, 2026

FYI: 'Rust (semver checks)' was failing due to cargo-semver-checks enabling all features, which pulls in the est-harness feature. That feature exports stest_reuse #[template] #[export]-generated hashed macro_rules! names (e.g. search_sorted_conformance_), and semver-checks flags them as removed/renamed vs the baseline crate.

This PR updates the semver job to use eature-group: default-features so the semver surface matches the supported public API and avoids those false positives.

Because this touches .github/workflows, GitHub marks the latest CI run as action_required until a maintainer approves running workflows for this fork PR.

@AdamGS
Copy link
Contributor

AdamGS commented Jan 8, 2026

we're working on the semver check elsewhere, it's not a required check anyway, more of a general indication

@godnight10061
Copy link
Contributor Author

Thanks! I'll turn the initial read size into a config as suggested. Really appreciate you handling the benchmark infra for forks—looking forward to your feedback tomorrow!

Comment on lines 367 to 370
with:
# Avoid enabling test-only feature flags (e.g. `test-harness`) that export unstable
# procedural-macro-generated items and create false-positive semver diffs.
feature-group: default-features
Copy link
Contributor

@AdamGS AdamGS Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets remove this from this PR, we might refactor that whole macro or use an upstream fix

@AdamGS
Copy link
Contributor

AdamGS commented Jan 8, 2026

You can see the benchmark results in the run summary, seems to make more of a difference for DuckDB but I'll take that, I expect it to also make a difference for DataFusion or at least for that metric, I'll try it out soon.

@AdamGS
Copy link
Contributor

AdamGS commented Jan 8, 2026

A small local test I ran showed very nice improvements to time_elapsed_opening which is cool.

…egression test

Signed-off-by: godnight10061 <godnight10061@users.noreply.github.com>
Signed-off-by: godnight10061 <godnight10061@users.noreply.github.com>
Signed-off-by: godnight10061 <godnight10061@users.noreply.github.com>
Signed-off-by: godnight10061 <godnight10061@users.noreply.github.com>
Signed-off-by: godnight10061 <godnight10061@users.noreply.github.com>
Signed-off-by: godnight10061 <godnight10061@users.noreply.github.com>
@godnight10061 godnight10061 force-pushed the fix/4677-initial-read-size branch from 1dcf3ae to 4856cf1 Compare January 8, 2026 16:13
Copy link
Contributor

@AdamGS AdamGS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@AdamGS AdamGS enabled auto-merge (squash) January 8, 2026 17:12
@AdamGS AdamGS merged commit f9c3c20 into vortex-data:develop Jan 8, 2026
47 of 48 checks passed
@connortsui20
Copy link
Contributor

There's a noticeable improvement in several of our benchmarks because of this which is nice! Though I did notice regressions in Clickbench Q6 and Q23 that someone should probably investigate?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Release label indicating an improvement to performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants