Conversation

HuangruiChu

Fixes #2307, following the instructions in #2307 (comment).

This PR adds a pre-scan step to detect the true last non-empty row and column (including merged cells), then uses those bounds to limit all subsequent scans. This avoids iterating over millions of empty cells and dramatically improves performance on large, sparse sheets.
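A minimal sketch of the kind of pre-scan described here, assuming an openpyxl Worksheet; the function name mirrors the PR's _find_true_data_bounds, but the body is illustrative, not the PR's actual code. The pre-scan itself is still one pass over the sheet's reported extent, but the repeated downstream scans can then be capped at its result.

```python
from openpyxl.worksheet.worksheet import Worksheet


def _find_true_data_bounds(sheet: Worksheet) -> tuple[int, int]:
    """Single pass to find the last row/column holding data,
    including cells covered only by merged ranges."""
    max_row, max_col = 0, 0
    for row in sheet.iter_rows():
        for cell in row:
            if cell.value is not None:
                max_row = max(max_row, cell.row)
                max_col = max(max_col, cell.column)
    # A merged range can extend beyond the last cell that carries a value.
    for rng in sheet.merged_cells.ranges:
        max_row = max(max_row, rng.max_row)
        max_col = max(max_col, rng.max_col)
    return max_row, max_col
```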

Issue resolved by this Pull Request:
Resolves #2307

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Fix docling-project#2307, Follow the instruction of docling-project#2307 (comment).

Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>
Contributor

github-actions bot commented Oct 7, 2025

DCO Check Passed

Thanks @HuangruiChu, all your commits are properly signed off. 🎉


dosubot bot commented Oct 7, 2025

Related Documentation

Checked 3 published document(s). No updates required.



mergify bot commented Oct 7, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewers for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@HuangruiChu HuangruiChu changed the title Update msexcel_backend.py fix: add a pre-scan step to detect the true last non-empty row/column and limit the scan range accordingly Oct 8, 2025
@HuangruiChu HuangruiChu marked this pull request as draft October 8, 2025 13:22
Fix error

Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>
@HuangruiChu HuangruiChu marked this pull request as ready for review October 8, 2025 13:26
@HuangruiChu HuangruiChu closed this Oct 8, 2025
@HuangruiChu
Author

Need to fix the failing pre-commit check: "Ruff formatter ... Failed".

@HuangruiChu HuangruiChu reopened this Oct 8, 2025
Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>

codecov bot commented Oct 8, 2025

Codecov Report

❌ Patch coverage is 4.54545% with 21 lines in your changes missing coverage. Please review.

Files with missing lines             Patch %   Lines
docling/backend/msexcel_backend.py   4.54%     21 Missing ⚠️


…99@gmail.com>)

Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>
@cau-git
Contributor

cau-git commented Oct 10, 2025

@HuangruiChu thanks for your contribution. Can you please restore the test_backend_msexcel unit test, which was apparently deleted in this PR?

Contributor

@ceberam ceberam left a comment


Thanks @HuangruiChu for your contribution. Here are some comments:

  • Did you test and have any evidence that this PR solves the intended issue #2307? I tested the PR against large spreadsheets with empty cells, and the processing time was more than double that of the current implementation.
  • The new function _find_true_data_bounds is computationally expensive, but since it limits the scan range in _find_data_tables, it may pay off overall. However, it should also be leveraged in _find_table_bottom and _find_table_right, which still perform unbounded scans; this could explain the increased processing time in this PR. (A sketch of the idea follows this list.)
  • Restore test_backend_msexcel.py so that the regression tests on this backend pass, as pointed out by @cau-git.
  • Remove unnecessary comments such as # Example integration in MsExcelDocumentBackend._find_data_tables.
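As a concrete illustration of the second bullet, here is a minimal sketch of a bounded column walk. The signature is hypothetical and the real _find_table_bottom in msexcel_backend.py differs; the point is only the substitution of sheet.max_row with the pre-computed bound.

```python
from openpyxl.worksheet.worksheet import Worksheet


def _find_table_bottom(sheet: Worksheet, start_row: int, col: int, max_data_row: int) -> int:
    """Walk down a column from start_row, stopping at the first empty cell,
    but never scanning past max_data_row (instead of sheet.max_row)."""
    row = start_row
    while row <= max_data_row and sheet.cell(row=row, column=col).value is not None:
        row += 1
    return row - 1  # last non-empty row of the table in this column
```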

@HuangruiChu
Author

HuangruiChu commented Oct 12, 2025

Thank you for your reply. Sorry, I mistakenly deleted the contents of test_backend_msexcel.py. I was planning to add a test for _find_true_data_bounds under test_backend_msexcel.py.

  1. "it should also be leveraged in _find_table_bottom and _find_table_right." Thank you for reminding me I am working on updating this.
  2. "Remove some unnecessary comments." Sure, I will remove that.
  3. "Did you test and have any evidence that this PR solves the intended issue Large spreadsheets - very slow processing #2307 ?" I create a version "test-02.xlsx" where the max_row is close to 1M by copying "test-01.xlsx" and adding a format at A1048496 at datasheet1.

@ceberam Here are the processing times for the whole datasheet1, after repeating the conversion 100 times to measure the real processing time for this corner case (a harness of this kind is sketched below):

  • With find_true_data_boundary: mean 2.845316 s, std 0.364838 s.
  • Without find_true_data_boundary: mean 2.635403 s, std 0.215835 s.
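For reference, numbers like these can be obtained with a harness along the following lines; this sketch assumes docling's public DocumentConverter entry point and the test-02.xlsx file described above — the exact measurement setup used in this thread is not shown in the PR.

```python
import statistics
import time

from docling.document_converter import DocumentConverter


def benchmark(path: str, runs: int = 100) -> tuple[float, float]:
    """Convert the workbook `runs` times and return mean/std wall time in seconds."""
    converter = DocumentConverter()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        converter.convert(path)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)


mean, std = benchmark("test-02.xlsx")
print(f"Mean: {mean:.6f} s, Std: {std:.6f} s")
```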

After reading the full code of _find_table_bottom, _find_table_right, _find_table_bounds, and _find_data_tables, I don't think _find_true_data_bounds helps reduce the actual time the code spends finding the table data.

One thing that really would help is adding early stopping: after some number of rows with no data (say, 1000), assume the sheet is empty for the rest. It could be risky for sheets that have tables far apart from each other (see the sketch below).
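A minimal sketch of that heuristic, with an assumed threshold of 1000 consecutive empty rows; as noted, sheets with tables separated by a larger gap would be cut off.

```python
EMPTY_ROW_LIMIT = 1000  # assumed threshold; tables after a larger gap are missed


def find_last_row_with_early_stop(sheet) -> int:
    """Scan rows top-down and stop after EMPTY_ROW_LIMIT consecutive
    empty rows, assuming no data follows."""
    last_data_row = 0
    empty_streak = 0
    for row in sheet.iter_rows():
        if any(cell.value is not None for cell in row):
            last_data_row = row[0].row
            empty_streak = 0
        else:
            empty_streak += 1
            if empty_streak >= EMPTY_ROW_LIMIT:
                break
    return last_data_row
```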

@dolfim-ibm dolfim-ibm assigned ceberam and unassigned PeterStaar-IBM Oct 13, 2025
@ceberam
Contributor

ceberam commented Oct 15, 2025

After reading the full code of _find_table_bottom, _find_table_right, _find_table_bounds, and _find_data_tables, I don't think _find_true_data_bounds helps reduce the actual time the code spends finding the table data.

Some methods still use sheet.max_row or sheet.max_column instead of the calculated bounds. Let me suggest a small refactoring of those methods (sketched below). I still believe that having the actual data bounds can reduce the overall time.
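To make the suggestion concrete, a minimal sketch of the substitution; the loop shape is illustrative, not the backend's actual code.

```python
def iter_bounded_cells(sheet, bounds: tuple[int, int]):
    """Yield cells only up to the pre-computed data bounds, rather than
    sheet.max_row / sheet.max_column, which can be inflated on sparse sheets."""
    max_data_row, max_data_col = bounds
    for ri in range(1, max_data_row + 1):
        for rj in range(1, max_data_col + 1):
            yield sheet.cell(row=ri, column=rj)
```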

One thing that really would help is adding early stopping: after some number of rows with no data (say, 1000), assume the sheet is empty for the rest. It could be risky for sheets that have tables far apart from each other.

I would still try to optimize the current code as suggested above, but keep this heuristic approach as an alternative if that fails.

Would you mind giving me write access to your fork? I would then add a commit with my suggestion.
In the meantime, could you please resolve the existing conflict with tests/test_backend_msexcel.py?

Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>