
Table Evaluation Datasets #42

@nassarofficial

Description

Here are the tasks required to put a complete, less confusing set of datasets out in the space and to evaluate Docling's table models for publication or production.

1. Clean Wiki Datasets Content from HTML Tags

  • Goal: Ensure that the tables extracted from the Wiki datasets are free of unnecessary HTML tags, leaving cleaner text-based content.
  • Approach: Strip out <table>, <tr>, <td>, and other HTML-related tags. Preserve the table structure while removing inline styles, scripts, and unwanted metadata. Normalize spacing and special characters to improve readability and usability (see the sketch after this list).
  • Outcome: A cleaner, structured dataset that retains the meaningful tabular information without HTML noise, in a format similar to FTN, PTN, P1M, and STN.
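
A minimal sketch of the cleaning step, assuming the Wiki tables are available as raw HTML strings (the function name and I/O format below are illustrative, not fixed by this issue):

```python
import re
import unicodedata

from bs4 import BeautifulSoup  # pip install beautifulsoup4


def clean_wiki_table(html: str) -> list[list[str]]:
    """Strip HTML noise from a <table> and return rows of plain-text cells."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop scripts, styles, and other non-content nodes entirely.
    for tag in soup(["script", "style"]):
        tag.decompose()

    rows = []
    for tr in soup.find_all("tr"):
        cells = []
        for cell in tr.find_all(["td", "th"]):
            text = cell.get_text(separator=" ", strip=True)
            # Normalize special characters and collapse repeated whitespace.
            text = unicodedata.normalize("NFKC", text)
            text = re.sub(r"\s+", " ", text)
            cells.append(text)
        if cells:
            rows.append(cells)
    return rows
```

The row/cell nesting keeps the table structure, while all tags, inline styles, and scripts are dropped.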

2. Clean Complete Original P1M, PubTabNet, and FinTabNet on Hugging Face (HF)

  • Goal: Preprocess and clean the large-scale table datasets to improve quality and usability, at high DPI if possible. High-DPI exports let us evaluate other methods that depend on high DPI, and we can simply downsample when a lower DPI is needed.
  • Approach: Re-export FinTabNet and PubTabNet from source. P1M might be tricky; we need to investigate whether the original PDFs can be retrieved (see the rendering sketch after this list).
  • Outcome: Complete, clean test sets at high DPI that make it easier to compare against other methods at different DPIs.
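
A minimal sketch of the re-export step, assuming we have the original PDFs on disk and render them with pdf2image (the DPI value and paths are placeholders):

```python
from pathlib import Path

from pdf2image import convert_from_path  # pip install pdf2image (requires poppler)

TARGET_DPI = 300  # placeholder; pick the highest DPI we can afford to store
OUT_DIR = Path("fintabnet_highdpi")
OUT_DIR.mkdir(exist_ok=True)

for pdf_path in Path("fintabnet_pdfs").glob("*.pdf"):
    # Render every page at the target DPI; downsampling later is cheap,
    # upsampling from a low-DPI export is not.
    pages = convert_from_path(str(pdf_path), dpi=TARGET_DPI)
    for i, page in enumerate(pages):
        page.save(OUT_DIR / f"{pdf_path.stem}_p{i}.png")
```

Storing the rasterized pages once at high DPI lets us generate any lower-DPI variant for comparisons without touching the source PDFs again.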

3. Docling Table Dataset

  • Goal: Combine all of our datasets, together with the extra enriched or annotated cell types.
  • Approach: Collect all the samples we have for the training and validation splits. For test, include all test sets but select a representative sample from P1M (see the sketch after this list).
  • Outcome: This dataset will be used to evaluate the performance of Docling's table models for internal and external development.
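
A minimal sketch of assembling the combined dataset with the Hugging Face datasets library, assuming the cleaned sets share a schema and live under hypothetical repo names:

```python
from datasets import DatasetDict, concatenate_datasets, load_dataset

SOURCES = ["org/fintabnet-clean", "org/pubtabnet-clean", "org/p1m-clean"]  # placeholders

splits = {}
for split in ("train", "validation"):
    splits[split] = concatenate_datasets(
        [load_dataset(repo, split=split) for repo in SOURCES]
    )

# Test: keep all test sets, but subsample P1M to a representative size.
test_parts = []
for repo in SOURCES:
    ds = load_dataset(repo, split="test")
    if "p1m" in repo:
        ds = ds.shuffle(seed=42).select(range(1_000))  # sample size is illustrative
    test_parts.append(ds)
splits["test"] = concatenate_datasets(test_parts)

DatasetDict(splits).push_to_hub("org/docling-table-dataset")  # placeholder repo name
```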
