
Conversation

@sromoam commented Oct 17, 2025

Issue #, if available:
This is a demonstration of using the new CLI tooling in conjunction with the RealKIE FCC dataset to perform benchmarking with the Stickler evaluation library.

Description of changes:
This includes a config for the FCC invoices and a Stickler evaluation configuration.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

```
config_library/pattern-2/fcc-invoices/
├── README.md            # This file
├── config.yaml          # Base IDP configuration
├── fcc_configured.yaml  # Deployed stack configuration
```

What's the difference between `fcc_configured.yaml` and `config.yaml`?

## Sample Data

Sample documents are located in `samples/fcc-invoices/`:
- 3 sample PDF invoices

I don't see the sample docs in this PR..

```
... --output-dir evaluation_output
```

**Note**: `sample_labels_3.csv` contains ground truth for the 3 sample documents. For full-dataset evaluation, use `sr_refactor_labels_5_5_25.csv`.
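For a full-dataset run, presumably only the CSV changes; the sketch below mirrors the flags of the sample run shown later in this thread, with a placeholder results directory:

```
python bulk_evaluate_fcc_invoices_simple.py \
  --results-dir <path-to-inference-results> \
  --csv-path sr_refactor_labels_5_5_25.csv \
  --config-path stickler_config.json \
  --output-dir evaluation_output
```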

Hmm.. so your ground truth format isn't the same as the inference results format?
How will this work with the accelerator's HITL review -> Save as Evaluation Baseline flow, which basically copies the verified inference results to be used as ground truth in the future? Let's discuss how to align the ground truth format with the inference results format.


Actually, strike that.. I see from your code that Stickler itself isn't using this CSV format; only your local script is, and it saves the GT in the correct inference_results format before running Stickler. So we're good.
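For readers following along, a minimal sketch of that conversion step (the CSV column names and the per-document JSON layout here are assumptions; `bulk_evaluate_fcc_invoices_simple.py` in this PR is the authoritative version):

```
import csv
import json
from collections import defaultdict
from pathlib import Path

def ground_truth_csv_to_inference_format(csv_path: str, out_dir: str) -> None:
    """Regroup flat ground-truth CSV rows into one JSON file per document,
    mirroring the inference_results layout Stickler consumes.
    Column names (document_id, field, value) are assumptions."""
    fields_by_doc: dict[str, dict[str, str]] = defaultdict(dict)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            fields_by_doc[row["document_id"]][row["field"]] = row["value"]

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for doc_id, fields in fields_by_doc.items():
        (out / f"{doc_id}.json").write_text(json.dumps(fields, indent=2))
```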

- Direct integration with SticklerEvaluationService
- Same accurate results

**Expected output:**

My output looks pretty different..

```
$ python bulk_evaluate_fcc_invoices_simple.py --results-dir ../../../fcc_results/cli-batch-20251030-211652 --csv-path sample_labels_3.csv --config-path stickler_config.json --output-dir evaluation_output
================================================================================
BULK FCC INVOICE EVALUATION
================================================================================

📋 Loading Stickler config from stickler_config.json...
✓ Service initialized

📊 Loading ground truth from sample_labels_3.csv...
✓ Loaded 3 documents with ground truth

📁 Loading inference results from ../../../fcc_results/cli-batch-20251030-211652...
✓ Loaded 3 inference results

⚙️  Evaluating documents...
✓ Completed evaluation of 3 documents

================================================================================
AGGREGATED RESULTS
================================================================================

📊 Summary: 3 processed, 0 errors

📈 Overall Metrics:
  Precision: 0.0000
  Recall:    0.0000
  F1 Score:  0.0000
  Accuracy:  0.0741

  Confusion Matrix:
    TP:      0  |  FP:      0
    FN:     25  |  TN:      2
    FP1:      0  |  FP2:      0

📋 Field-Level Metrics (Top 10):
  Field                                     Precision     Recall         F1
  ---------------------------------------- ---------- ---------- ----------
  agency                                       0.0000     0.0000     0.0000
  advertiser                                   0.0000     0.0000     0.0000
  gross_total                                  0.0000     0.0000     0.0000
  net_amount_due                               0.0000     0.0000     0.0000
  line_item__description                       0.0000     0.0000     0.0000
  line_item__days                              0.0000     0.0000     0.0000
  line_item__rate                              0.0000     0.0000     0.0000
  line_item__start_date                        0.0000     0.0000     0.0000
  line_item__end_date                          0.0000     0.0000     0.0000

💾 Results saved to evaluation_output
================================================================================
```
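As a sanity check, the aggregate numbers are at least internally consistent with the confusion matrix: every expected field landed as a false negative, so precision and recall are zero and accuracy is (TP+TN)/total = 2/27 ≈ 0.0741. Using the standard metric definitions, with zero denominators reported as 0.0 (which appears to match the tool's convention):

```
tp, fp, fn, tn = 0, 0, 25, 2
precision = tp / (tp + fp) if tp + fp else 0.0  # no positive predictions at all
recall = tp / (tp + fn) if tp + fn else 0.0
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"P={precision:.4f} R={recall:.4f} Acc={accuracy:.4f}")
# -> P=0.0000 R=0.0000 Acc=0.0741
```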


This doesn't look like the (new JSON-Schema) accelerator config..
Is Stickler ready yet to map to the new config? E.g., PR awslabs/stickler#20?
I'm thinking I need that new version of Stickler, tested with our new config, to proceed, right?
