
Allow the direct evaluation of externally provided DocTags and DoclingDocument JSON files without requiring an HF parquet prediction dataset #112

@nikos-livathinos

Description

The current design of docling-eval assumes the following workflow:

  1. create-gt: Create a Ground Truth dataset in HF parquet format.
  2. create-eval: Create a prediction dataset in HF parquet format that contains the predictions and
    the ground truth data from step 1.
  3. evaluate: Run evaluations on the prediction dataset created in step 2.

If the predictions already exist in lossless formats such as DocTags or DoclingDocument JSON files, the workflow above can still be used via the FileProvider. However, this imposes unnecessary overhead:

  • It requires additional storage space to save the prediction parquet dataset.
  • Significant time is spent on I/O to save the prediction dataset.
    • A quick runtime benchmark shows that 15% of the time goes to converting DocTags files into
      DoclingDocument objects and 85% to dumping the shards of the created prediction dataset.

An improved design should allow the direct evaluation of DocTags/JSON files without needing to dump a prediction dataset to disk.

One approach could be:

  1. The user places the .dt or .json files in a directory.
  2. Each .dt / .json file follows the naming convention <document_id>.dt or <document_id>.json.
    • document_id must match the document_id column of the GT dataset.
  3. All evaluators accept an optional parameter external_predictions_path. If present:
    • Each GT document is matched to a DocTags/JSON file by its document_id.
    • A DocTags file is converted on the fly into a DoclingDocument object; a JSON file is
      deserialized into a DoclingDocument (see the sketch after this list).
    • The evaluation then runs between the GT-sourced document and the prediction document.
  4. The CLI of the evaluate command is expanded accordingly to receive an optional parameter
    --external-predictions-path.
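
A minimal sketch of the on-the-fly loading in step 3, assuming docling-core's DocTagsDocument.from_doctags_and_image_pairs, DoclingDocument.load_from_doctags, and the pydantic model_validate_json; the helper load_external_prediction and the directory layout are the hypothetical ones proposed above, not existing docling-eval code:

```python
from pathlib import Path
from typing import Optional

from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument


def load_external_prediction(
    external_predictions_path: Path, document_id: str
) -> Optional[DoclingDocument]:
    """Match a GT document_id to <document_id>.json or <document_id>.dt."""
    json_file = external_predictions_path / f"{document_id}.json"
    dt_file = external_predictions_path / f"{document_id}.dt"

    if json_file.is_file():
        # DoclingDocument is a pydantic model, so the JSON deserializes directly.
        return DoclingDocument.model_validate_json(json_file.read_text())

    if dt_file.is_file():
        # Convert the DocTags on the fly into a DoclingDocument.
        # Passing None as the page image is assumed acceptable here; supply
        # the real page image if one is available.
        doctags_doc = DocTagsDocument.from_doctags_and_image_pairs(
            [dt_file.read_text()], [None]
        )
        return DoclingDocument.load_from_doctags(
            doctags_doc, document_name=document_id
        )

    return None  # no external prediction exists for this document_id
```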

Notice: This design also makes it possible to parallelize the evaluation by comparing batches of GT/predicted documents concurrently, as sketched below.
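
A rough sketch of such batched concurrency using the standard-library ProcessPoolExecutor; evaluate_pair is a hypothetical placeholder for an evaluator's per-document comparison, not an existing docling-eval function, and the documents must be picklable (pydantic models are):

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Iterable, List, Tuple

from docling_core.types.doc import DoclingDocument


def evaluate_pair(pair: Tuple[DoclingDocument, DoclingDocument]) -> dict:
    gt_doc, pred_doc = pair
    # Placeholder: call the evaluator's actual per-document comparison here.
    return {"document_id": gt_doc.name, "score": 0.0}


def evaluate_in_batches(
    pairs: Iterable[Tuple[DoclingDocument, DoclingDocument]],
    max_workers: int = 8,
) -> List[dict]:
    # Each worker process compares one GT/prediction pair; chunksize groups
    # the submissions into batches to amortize inter-process overhead.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_pair, pairs, chunksize=16))
```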
