
Table Evaluation Datasets #42

@nassarofficial

Description

Here are the tasks required to put a complete, less confusing set of datasets out in the space and to evaluate Docling's table models for publication or production.

1. Clean Wiki Datasets Content from HTML Tags

  • Goal: Ensure that the tables extracted from the Wiki datasets are free of unnecessary HTML tags, leaving cleaner text-based content.
  • Approach: Strip out <table>, <tr>, <td>, and other HTML-related tags. Preserve the table structure while removing inline styles, scripts, and unwanted metadata. Normalize spacing and special characters to improve readability and usability (see the sketch after this list).
  • Outcome: A cleaner, structured dataset that retains the meaningful tabular information without HTML noise, in a format similar to FTN, PTN, P1M, and STN.
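
A minimal sketch of the cleaning step, assuming the Wiki tables are available as raw HTML strings (the function name and I/O format below are illustrative, not fixed by this issue):

```python
import re
import unicodedata

from bs4 import BeautifulSoup  # pip install beautifulsoup4


def clean_wiki_table(html: str) -> list[list[str]]:
    """Strip HTML noise from a <table> and return rows of plain-text cells."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop scripts, styles, and other non-content nodes entirely.
    for tag in soup(["script", "style"]):
        tag.decompose()

    rows = []
    for tr in soup.find_all("tr"):
        cells = []
        for cell in tr.find_all(["td", "th"]):
            text = cell.get_text(separator=" ", strip=True)
            # Normalize special characters and collapse repeated whitespace.
            text = unicodedata.normalize("NFKC", text)
            text = re.sub(r"\s+", " ", text)
            cells.append(text)
        if cells:
            rows.append(cells)
    return rows
```

The row/cell nesting keeps the table structure, while all tags, inline styles, and scripts are dropped.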

2. Clean Complete Original P1M, PubTabNet, and FinTabNet on Hugging Face (HF)

  • Goal: Preprocess and clean the large-scale table datasets to improve quality and usability, at high DPI if possible. High-DPI exports let us evaluate other methods that depend on high DPI, and we can simply downsample when a lower DPI is needed.
  • Approach: Re-export FinTabNet and PubTabNet from source. P1M might be tricky; we need to investigate whether the original PDFs can be retrieved (see the rendering sketch after this list).
  • Outcome: Complete, clean test sets at high DPI that make it easier to compare against other methods at different DPIs.
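
A minimal sketch of the re-export step, assuming we have the original PDFs on disk and render them with pdf2image (the DPI value and paths are placeholders):

```python
from pathlib import Path

from pdf2image import convert_from_path  # pip install pdf2image (requires poppler)

TARGET_DPI = 300  # placeholder; pick the highest DPI we can afford to store
OUT_DIR = Path("fintabnet_highdpi")
OUT_DIR.mkdir(exist_ok=True)

for pdf_path in Path("fintabnet_pdfs").glob("*.pdf"):
    # Render every page at the target DPI; downsampling later is cheap,
    # upsampling from a low-DPI export is not.
    pages = convert_from_path(str(pdf_path), dpi=TARGET_DPI)
    for i, page in enumerate(pages):
        page.save(OUT_DIR / f"{pdf_path.stem}_p{i}.png")
```

Storing the rasterized pages once at high DPI lets us generate any lower-DPI variant for comparisons without touching the source PDFs again.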

3. Docling Table Dataset

  • Goal: Combine all of our datasets, together with the extra enriched or annotated cell types.
  • Approach: Collect all the samples we have for the training and validation splits. For test, include all test sets but select a representative sample from P1M (see the sketch after this list).
  • Outcome: This dataset will be used to evaluate the performance of Docling's table models for internal and external development.
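
A minimal sketch of assembling the combined dataset with the Hugging Face datasets library, assuming the cleaned sets share a schema and live under hypothetical repo names:

```python
from datasets import DatasetDict, concatenate_datasets, load_dataset

SOURCES = ["org/fintabnet-clean", "org/pubtabnet-clean", "org/p1m-clean"]  # placeholders

splits = {}
for split in ("train", "validation"):
    splits[split] = concatenate_datasets(
        [load_dataset(repo, split=split) for repo in SOURCES]
    )

# Test: keep all test sets, but subsample P1M to a representative size.
test_parts = []
for repo in SOURCES:
    ds = load_dataset(repo, split="test")
    if "p1m" in repo:
        ds = ds.shuffle(seed=42).select(range(1_000))  # sample size is illustrative
    test_parts.append(ds)
splits["test"] = concatenate_datasets(test_parts)

DatasetDict(splits).push_to_hub("org/docling-table-dataset")  # placeholder repo name
```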
