
Allow the direct evaluation of externally provided DocTags and DoclingDocument JSON files without requiring an HF parquet prediction dataset #112

@nikos-livathinos

Description

The current design of docling-eval assumes the following workflow:

  1. create-gt: Create a Ground Truth dataset in HF parquet format.
  2. create-eval: Create a prediction dataset in HF parquet format that contains the predictions and
    the ground truth data from step 1.
  3. evaluate: Run evaluations on the prediction dataset created in step 2.

If the predictions already exist in lossless formats such as DocTags or DoclingDocument JSON files, the workflow above can still be used via the FileProvider. However, this imposes unnecessary overhead:

  • It requires additional storage space to save the prediction parquet dataset.
  • Significant time is spent on I/O to save the prediction dataset.
    • A quick runtime benchmark shows that 15% of the time goes to converting DocTags files into
      DoclingDocument objects and 85% to dumping the shards of the created prediction dataset.

An improved design should allow the direct evaluation of DocTags/JSON files without needing to dump a prediction dataset to disk.

One approach could be:

  1. The user places the .dt or .json files in a directory.
  2. Each .dt / .json file follows the naming convention <document_id>.dt or <document_id>.json.
    • document_id must match the document_id column of the GT dataset.
  3. All evaluators accept an optional parameter external_predictions_path. If present:
    • Each GT document is matched to a DocTags/JSON file by its document_id.
    • A DocTags file is converted on the fly into a DoclingDocument object; a JSON file is
      deserialized into a DoclingDocument (see the sketch after this list).
    • The evaluation then runs between the GT-sourced document and the prediction document.
  4. The CLI of the evaluate command is expanded accordingly to receive an optional parameter
    --external-predictions-path.
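
A minimal sketch of the on-the-fly loading in step 3, assuming docling-core's DocTagsDocument.from_doctags_and_image_pairs, DoclingDocument.load_from_doctags, and the pydantic model_validate_json; the helper load_external_prediction and the directory layout are the hypothetical ones proposed above, not existing docling-eval code:

```python
from pathlib import Path
from typing import Optional

from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument


def load_external_prediction(
    external_predictions_path: Path, document_id: str
) -> Optional[DoclingDocument]:
    """Match a GT document_id to <document_id>.json or <document_id>.dt."""
    json_file = external_predictions_path / f"{document_id}.json"
    dt_file = external_predictions_path / f"{document_id}.dt"

    if json_file.is_file():
        # DoclingDocument is a pydantic model, so the JSON deserializes directly.
        return DoclingDocument.model_validate_json(json_file.read_text())

    if dt_file.is_file():
        # Convert the DocTags on the fly into a DoclingDocument.
        # Passing None as the page image is assumed acceptable here; supply
        # the real page image if one is available.
        doctags_doc = DocTagsDocument.from_doctags_and_image_pairs(
            [dt_file.read_text()], [None]
        )
        return DoclingDocument.load_from_doctags(
            doctags_doc, document_name=document_id
        )

    return None  # no external prediction exists for this document_id
```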

Notice: This design also makes it possible to parallelize the evaluation by comparing batches of GT/predicted documents concurrently, as sketched below.
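
A rough sketch of such batched concurrency using the standard-library ProcessPoolExecutor; evaluate_pair is a hypothetical placeholder for an evaluator's per-document comparison, not an existing docling-eval function, and the documents must be picklable (pydantic models are):

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Iterable, List, Tuple

from docling_core.types.doc import DoclingDocument


def evaluate_pair(pair: Tuple[DoclingDocument, DoclingDocument]) -> dict:
    gt_doc, pred_doc = pair
    # Placeholder: call the evaluator's actual per-document comparison here.
    return {"document_id": gt_doc.name, "score": 0.0}


def evaluate_in_batches(
    pairs: Iterable[Tuple[DoclingDocument, DoclingDocument]],
    max_workers: int = 8,
) -> List[dict]:
    # Each worker process compares one GT/prediction pair; chunksize groups
    # the submissions into batches to amortize inter-process overhead.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_pair, pairs, chunksize=16))
```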
