@cau-git
Contributor

@cau-git cau-git commented Nov 13, 2025

  • Aligns the CVAT code with the modern CVAT-to-DoclingDocument converter
  • Updates the test cases and usages
  • Massively reduces the disk space required for CVAT folders (page images are no longer duplicated across multiple directories, and the DoclingDocument JSON files no longer embed encoded images)

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
…n_groundtruth

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Replace the legacy CVAT annotation processing code (~1400 lines) with the
modern convert_cvat_folder_to_docling() converter in CvatDatasetBuilder.
This significantly simplifies the codebase and aligns it with the modern
CVAT converter architecture.

- Remove legacy annotation parsing and document creation methods
- Use convert_cvat_folder_to_docling() for CVAT-to-Docling conversion
- Improve path handling in CvatPreannotationBuilder for moved datasets
- Remove unused original and original_prediction storage in json_dataset_joiner
- Update test data annotations to match new converter output

BREAKING CHANGE: CvatDatasetBuilder now requires the modern CVAT folder
structure and uses convert_cvat_folder_to_docling() internally.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
@mergify

mergify bot commented Nov 13, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
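For anyone who wants to check a title locally before pushing, here is a minimal sketch that applies the same pattern with Python's `re` module. It assumes the `~=` operator behaves like a regex search against the title; the helper name `is_conventional_title` is made up for illustration and is not part of Mergify or this repository.

```python
import re

# Pattern copied verbatim from the merge-protection rule above.
TITLE_PATTERN = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional_title(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix.

    Hypothetical helper: assumes the rule is evaluated as a regex search.
    """
    return TITLE_PATTERN.search(title) is not None


if __name__ == "__main__":
    candidates = [
        "feat: Remove legacy CvatDatasetBuilder code, use modernized code",
        "fix(cvat)!: handle moved datasets",
        "Remove legacy CvatDatasetBuilder code",
    ]
    for candidate in candidates:
        print(is_conventional_title(candidate), candidate)
```

Note that the pattern only constrains the prefix (type, optional scope, optional `!`, then a colon), so a typo later in the title still passes the rule.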

@github-actions
Contributor

github-actions bot commented Nov 13, 2025

DCO Check Passed

Thanks @cau-git, all your commits are properly signed off. 🎉

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
@cau-git cau-git changed the title from "feat: Remove legcay CvatDatasetBuilder code, use modernized code" to "feat: Remove legacy CvatDatasetBuilder code, use modernized code" Nov 13, 2025
- Refactor cvat_deliveries_to_hf to build single combined dataset with subset tags
- Add tags field to DatasetRecord for subset tracking
- Add page counting to annotation task creation
- Support multipage TIFF and additional image formats (BMP, GIF)
- Add configurable JSON directory names to pipeline
- Fix caption/footnote target detection logic in CVAT converter

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
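
To illustrate the idea of one combined dataset with subset tags, here is a rough sketch. It is not the code from this commit: `TaggedRecord`, `combine_subsets`, and `select_subset` are hypothetical names, and the real `DatasetRecord` has a different and much larger schema. The only thing taken from the commit message is the concept of a `tags` field used to track which subset a record came from.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TaggedRecord:
    """Hypothetical stand-in for a dataset record; not the real DatasetRecord."""

    doc_id: str
    tags: List[str] = field(default_factory=list)


def combine_subsets(subsets: Dict[str, List[str]]) -> List[TaggedRecord]:
    """Merge per-subset document ids into one list, tagging each record
    with the name of the subset it came from."""
    combined: List[TaggedRecord] = []
    for subset_name, doc_ids in subsets.items():
        for doc_id in doc_ids:
            combined.append(TaggedRecord(doc_id=doc_id, tags=[subset_name]))
    return combined


def select_subset(records: List[TaggedRecord], tag: str) -> List[TaggedRecord]:
    """Recover one subset from the combined dataset by filtering on its tag."""
    return [record for record in records if tag in record.tags]


if __name__ == "__main__":
    records = combine_subsets(
        {"delivery_a": ["doc1", "doc2"], "delivery_b": ["doc3"]}
    )
    print([record.doc_id for record in select_subset(records, "delivery_a")])
```

Keeping everything in a single dataset and filtering by tag avoids maintaining one dataset per delivery, at the cost of having to filter whenever a single subset is needed.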