fix: add a pre-scan step to detect the true last non-empty row/column and limit the scan range accordingly #2404
Conversation
Fixes docling-project#2307, following the instruction in docling-project#2307 (comment). Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>
✅ DCO Check Passed. Thanks @HuangruiChu, all your commits are properly signed off. 🎉
Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewers for test updates — This rule is failing. When test data is updated, we require two reviewers.
🟢 Enforce conventional commit — Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Fix error Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>
Need to fix the "Ruff formatter...........................................................Failed" pre-commit check.
Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>
Codecov Report❌ Patch coverage is
…99@gmail.com>) Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>
@HuangruiChu thanks for your contribution. Can you please restore the test_backend_msexcel.py file?
Thanks @HuangruiChu for your contribution. Here are some comments:
- Did you test and have any evidence that this PR solves the intended issue #2307? I tested the PR against large spreadsheets with empty cells, and the processing time was more than double compared to the current implementation.
- The new function _find_true_data_bounds is computationally expensive, but it limits the scan range in _find_data_tables, so overall it may pay off. However, it should also be leveraged in _find_table_bottom and _find_table_right, which still have some unbounded scans. This could explain the increase in processing time in this PR.
- Restore test_backend_msexcel.py to ensure regression tests on this backend pass, as pointed out by @cau-git.
- Remove some unnecessary comments like "# Example integration in MsExcelDocumentBackend._find_data_tables".
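To illustrate the second point above, here is a minimal sketch of how a precomputed row bound could cap a downward scan in a helper like _find_table_bottom. The function name, signature, and the plain list-of-lists grid are assumptions for illustration, not docling's actual code.

```python
def find_table_bottom(grid, start_row, col, max_row_bound):
    """Walk down one column until the first empty cell, never past the bound.

    grid is a hypothetical stand-in for the worksheet: a list of rows,
    where None marks an empty cell. max_row_bound is the pre-scanned
    last-data row, so the loop cannot wander into trailing empty rows.
    """
    row = start_row
    while row < max_row_bound and grid[row][col] is not None:
        row += 1
    return row  # first empty row at/below start_row, capped by the bound


grid = [
    ["a", "b"],
    ["c", None],
    [None, None],
    [None, None],
]
# With the bound, the scan stops at row 2 even if the sheet nominally
# reports thousands of trailing empty rows.
assert find_table_bottom(grid, 0, 0, max_row_bound=2) == 2
assert find_table_bottom(grid, 0, 1, max_row_bound=2) == 1
```

Without such a bound, the same loop would have to touch every trailing empty cell the worksheet reports, which is the unbounded-scan cost the review describes.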
Thank you for your reply. Sorry, I mistakenly deleted the content of test_backend_msexcel.py. I was planning to add a test for _find_true_data_bounds under test_backend_msexcel.py.
@ceberam I measured the processing time for the whole datasheet1, repeating the process 100 times to get a reliable timing for the corner case. After reading the full code of _find_table_bottom, _find_table_right, _find_table_bounds, and _find_data_tables, I don't think _find_true_data_bounds helps reduce the actual time the code spends finding the table data. One thing that would really help is early stopping: after some number of rows with no data, e.g. 1000, assume the rest of the sheet is empty. It could be risky for sheets that have tables far away from each other.
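The early-stopping heuristic described above could be sketched as follows. The threshold, function name, and grid representation are assumptions; this is not the PR's code.

```python
EMPTY_ROW_LIMIT = 1000  # consecutive empty rows before assuming the rest is empty


def last_data_row(rows, limit=EMPTY_ROW_LIMIT):
    """Return the 0-based index of the last non-empty row, or -1 if none.

    rows is a hypothetical stand-in for worksheet rows (None = empty cell).
    The scan bails out after `limit` consecutive empty rows, which is the
    risky part: a table sitting more than `limit` rows below the previous
    one would be missed.
    """
    last = -1
    empty_streak = 0
    for i, row in enumerate(rows):
        if any(cell is not None for cell in row):
            last = i
            empty_streak = 0
        else:
            empty_streak += 1
            if empty_streak >= limit:
                break  # give up: assume only empty rows remain
    return last


rows = [[1], [None], [2]] + [[None]] * 5000
assert last_data_row(rows) == 2  # stops after 1000 empty rows, not 5000
```

The trade-off is exactly the one noted in the comment: the cutoff turns a worst-case full-sheet scan into a bounded one, at the cost of missing data placed beyond the empty-row budget.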
Some methods still use
I would still try to optimize the current code as suggested above, but keep this heuristic approach as an alternative if that fails. Would you mind giving me write access to your fork? I would then add a commit with my suggestion.
Signed-off-by: Richard (Huangrui) Chu <65276824+HuangruiChu@users.noreply.github.com>
Fixes #2307, following the instruction in #2307 (comment).
Add a pre-scan step to detect the true last non-empty row and column (including merged cells), then use those bounds to limit all subsequent scans. This avoids iterating over millions of empty cells and dramatically improves performance on large, sparse sheets.
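The pre-scan idea can be sketched in a few lines: one pass over the cells (plus the merged ranges) yields the true last non-empty row and column, which then caps every later scan. A plain grid and 0-based (min_row, min_col, max_row, max_col) tuples stand in for the openpyxl worksheet and its merged ranges; the function name mirrors the PR's _find_true_data_bounds but is a simplified assumption, not the actual implementation.

```python
def find_true_data_bounds(grid, merged_ranges=()):
    """Return (max_row, max_col) of real data, 0-based; (-1, -1) if empty."""
    max_row = max_col = -1
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell is not None:
                max_row = max(max_row, r)
                max_col = max(max_col, c)
    # Merged ranges can extend past the last cell that actually holds a
    # value, so widen the bounds to cover them (simplification: every
    # merged range is treated as occupied).
    for _r0, _c0, r1, c1 in merged_ranges:
        max_row = max(max_row, r1)
        max_col = max(max_col, c1)
    return max_row, max_col


grid = [["x", None], [None, "y"], [None, None]]
assert find_true_data_bounds(grid) == (1, 1)
# A merged range reaching row 2 extends the bound:
assert find_true_data_bounds(grid, [(1, 1, 2, 1)]) == (2, 1)
```

The pre-scan is itself O(rows × cols), so it only pays off when the bounds it produces let the table-detection helpers skip far more cells than the pre-scan visits, which matches the discussion in the review above.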
Issue resolved by this Pull Request:
Resolves #2307
Checklist: