Conversation

HuangruiChu

Fixes #2307, following the instructions in #2307 (comment).

This PR adds a pre-scan step to detect the true last non-empty row and column (including merged cells), then uses those bounds to limit all subsequent scans. This avoids iterating over millions of empty cells and dramatically improves performance on large, sparse sheets.
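A minimal sketch of the kind of pre-scan described here, assuming an openpyxl Worksheet; the function name mirrors the PR's _find_true_data_bounds, but the body is illustrative, not the PR's actual code. The pre-scan itself is still one pass over the sheet's reported extent, but the repeated downstream scans can then be capped at its result.

```python
from openpyxl.worksheet.worksheet import Worksheet


def _find_true_data_bounds(sheet: Worksheet) -> tuple[int, int]:
    """Single pass to find the last row/column holding data,
    including cells covered only by merged ranges."""
    max_row, max_col = 0, 0
    for row in sheet.iter_rows():
        for cell in row:
            if cell.value is not None:
                max_row = max(max_row, cell.row)
                max_col = max(max_col, cell.column)
    # A merged range can extend beyond the last cell that carries a value.
    for rng in sheet.merged_cells.ranges:
        max_row = max(max_row, rng.max_row)
        max_col = max(max_col, rng.max_col)
    return max_row, max_col
```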

Issue resolved by this Pull Request:
Resolves #2307

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Fix docling-project#2307, Follow the instruction of docling-project#2307 (comment).

Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>
Contributor

github-actions bot commented Oct 7, 2025

DCO Check Passed

Thanks @HuangruiChu, all your commits are properly signed off. 🎉


dosubot bot commented Oct 7, 2025

Related Documentation

Checked 3 published document(s). No updates required.



mergify bot commented Oct 7, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewers for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@HuangruiChu HuangruiChu changed the title Update msexcel_backend.py fix: add a pre-scan step to detect the true last non-empty row/column and limit the scan range accordingly Oct 8, 2025
@HuangruiChu HuangruiChu marked this pull request as draft October 8, 2025 13:22
Fix error

Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>
@HuangruiChu HuangruiChu marked this pull request as ready for review October 8, 2025 13:26
@HuangruiChu HuangruiChu closed this Oct 8, 2025
@HuangruiChu
Author

Need to fix the failing pre-commit check: "Ruff formatter ... Failed".

@HuangruiChu HuangruiChu reopened this Oct 8, 2025
Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>

codecov bot commented Oct 8, 2025

Codecov Report

❌ Patch coverage is 4.54545% with 21 lines in your changes missing coverage. Please review.

Files with missing lines             Patch %   Lines
docling/backend/msexcel_backend.py   4.54%     21 Missing ⚠️


…99@gmail.com>)

Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>
@cau-git
Contributor

cau-git commented Oct 10, 2025

@HuangruiChu thanks for your contribution. Can you please restore the test_backend_msexcel unit test, which was apparently deleted in this PR?

Contributor

@ceberam ceberam left a comment


Thanks @HuangruiChu for your contribution. Here are some comments:

  • Did you test and have any evidence that this PR solves the intended issue #2307? I tested the PR against large spreadsheets with empty cells, and the processing time was more than double that of the current implementation.
  • The new function _find_true_data_bounds is computationally expensive, but since it limits the scan range in _find_data_tables, it may pay off overall. However, it should also be leveraged in _find_table_bottom and _find_table_right, which still perform unbounded scans; this could explain the increased processing time in this PR. (A sketch of the idea follows this list.)
  • Restore test_backend_msexcel.py so that the regression tests on this backend pass, as pointed out by @cau-git.
  • Remove unnecessary comments such as # Example integration in MsExcelDocumentBackend._find_data_tables.
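As a concrete illustration of the second bullet, here is a minimal sketch of a bounded column walk. The signature is hypothetical and the real _find_table_bottom in msexcel_backend.py differs; the point is only the substitution of sheet.max_row with the pre-computed bound.

```python
from openpyxl.worksheet.worksheet import Worksheet


def _find_table_bottom(sheet: Worksheet, start_row: int, col: int, max_data_row: int) -> int:
    """Walk down a column from start_row, stopping at the first empty cell,
    but never scanning past max_data_row (instead of sheet.max_row)."""
    row = start_row
    while row <= max_data_row and sheet.cell(row=row, column=col).value is not None:
        row += 1
    return row - 1  # last non-empty row of the table in this column
```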

@HuangruiChu
Author

HuangruiChu commented Oct 12, 2025

Thank you for your reply. Sorry, I mistakenly deleted the contents of test_backend_msexcel.py. I was planning to add a test for _find_true_data_bounds under test_backend_msexcel.py.

  1. "it should also be leveraged in _find_table_bottom and _find_table_right." Thank you for reminding me I am working on updating this.
  2. "Remove some unnecessary comments." Sure, I will remove that.
  3. "Did you test and have any evidence that this PR solves the intended issue Large spreadsheets - very slow processing #2307 ?" I create a version "test-02.xlsx" where the max_row is close to 1M by copying "test-01.xlsx" and adding a format at A1048496 at datasheet1.

@ceberam Here are the processing times for the whole datasheet1, after repeating the conversion 100 times to measure the real processing time for this corner case (a harness of this kind is sketched below):

  • With find_true_data_boundary: mean 2.845316 s, std 0.364838 s.
  • Without find_true_data_boundary: mean 2.635403 s, std 0.215835 s.
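For reference, numbers like these can be obtained with a harness along the following lines; this sketch assumes docling's public DocumentConverter entry point and the test-02.xlsx file described above — the exact measurement setup used in this thread is not shown in the PR.

```python
import statistics
import time

from docling.document_converter import DocumentConverter


def benchmark(path: str, runs: int = 100) -> tuple[float, float]:
    """Convert the workbook `runs` times and return mean/std wall time in seconds."""
    converter = DocumentConverter()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        converter.convert(path)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)


mean, std = benchmark("test-02.xlsx")
print(f"Mean: {mean:.6f} s, Std: {std:.6f} s")
```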

After reading the full code of _find_table_bottom, _find_table_right, _find_table_bounds, and _find_data_tables, I don't think _find_true_data_bounds helps reduce the actual time the code spends finding the table data.

One thing that really would help is adding early stopping: after some number of rows with no data (say, 1000), assume the sheet is empty for the rest. It could be risky for sheets that have tables far apart from each other (see the sketch below).
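A minimal sketch of that heuristic, with an assumed threshold of 1000 consecutive empty rows; as noted, sheets with tables separated by a larger gap would be cut off.

```python
EMPTY_ROW_LIMIT = 1000  # assumed threshold; tables after a larger gap are missed


def find_last_row_with_early_stop(sheet) -> int:
    """Scan rows top-down and stop after EMPTY_ROW_LIMIT consecutive
    empty rows, assuming no data follows."""
    last_data_row = 0
    empty_streak = 0
    for row in sheet.iter_rows():
        if any(cell.value is not None for cell in row):
            last_data_row = row[0].row
            empty_streak = 0
        else:
            empty_streak += 1
            if empty_streak >= EMPTY_ROW_LIMIT:
                break
    return last_data_row
```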

@dolfim-ibm dolfim-ibm assigned ceberam and unassigned PeterStaar-IBM Oct 13, 2025
@ceberam
Contributor

ceberam commented Oct 15, 2025

After reading the full code of _find_table_bottom, _find_table_right, _find_table_bounds, and _find_data_tables, I don't think _find_true_data_bounds helps reduce the actual time the code spends finding the table data.

Some methods still use sheet.max_row or sheet.max_column instead of the calculated bounds. Let me suggest a small refactoring of those methods (sketched below). I still believe that having the actual data bounds can reduce the overall time.
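To make the suggestion concrete, a minimal sketch of the substitution; the loop shape is illustrative, not the backend's actual code.

```python
def iter_bounded_cells(sheet, bounds: tuple[int, int]):
    """Yield cells only up to the pre-computed data bounds, rather than
    sheet.max_row / sheet.max_column, which can be inflated on sparse sheets."""
    max_data_row, max_data_col = bounds
    for ri in range(1, max_data_row + 1):
        for rj in range(1, max_data_col + 1):
            yield sheet.cell(row=ri, column=rj)
```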

One thing that really would help is adding early stopping: after some number of rows with no data (say, 1000), assume the sheet is empty for the rest. It could be risky for sheets that have tables far apart from each other.

I would still try to optimize the current code as suggested above, but keep this heuristic approach as an alternative if that fails.

Would you mind giving me write access to your fork? I would then add a commit with my suggestion.
In the meantime, could you please resolve the existing conflict with tests/test_backend_msexcel.py?

Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>