FCC Benchmark Demo [DRAFT] #88
Conversation
> config_library/pattern-2/fcc-invoices/
> ├── README.md             # This file
> ├── config.yaml           # Base IDP configuration
> ├── fcc_configured.yaml   # Deployed stack configuration
What's the difference between `fcc_configured.yaml` and `config.yaml`?
> ## Sample Data
>
> Sample documents are located in `samples/fcc-invoices/`:
> - 3 sample PDF invoices
I don't see the sample docs in this PR..
> ```
> --output-dir evaluation_output
> ```
>
> **Note**: The `sample_labels_3.csv` file contains ground truth for the 3 sample documents. For full-dataset evaluation, use `sr_refactor_labels_5_5_25.csv`.
Hmm.. So your ground truth format isn't the same as the inference results format?
How will this work in the accelerator for HITL review -> Save as Evaluation Baseline, which basically copies the verified inference results to be used as ground truth in the future? Let's discuss how to align the ground truth format with the inference results format.
Actually, strike that.. I see from your code that Stickler itself isn't using this CSV format, only your local script, which saves the GT in the correct inference_results format before running Stickler, so we're good.
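For reference, the conversion step described above could look roughly like the sketch below. The CSV column names (`document_id`, `field_name`, `value`), the per-document grouping, and the output JSON shape are all assumptions for illustration only; the local script in this PR is the actual source of truth.

```python
"""Hypothetical sketch of converting CSV ground truth into inference_results-shaped
JSON files so Stickler compares identical formats on both sides. Column names and
output layout are assumptions, not taken from this PR."""
import csv
import json
from collections import defaultdict
from pathlib import Path

docs = defaultdict(dict)
with open("sample_labels_3.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Assumed columns: document_id, field_name, value
        docs[row["document_id"]][row["field_name"]] = row["value"]

out_dir = Path("ground_truth_as_results")
out_dir.mkdir(exist_ok=True)
for doc_id, fields in docs.items():
    # Mirror the assumed shape of an inference result (flat field -> value map).
    (out_dir / f"{doc_id}.json").write_text(json.dumps(fields, indent=2))
```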
> - Direct integration with SticklerEvaluationService
> - Same accurate results
>
> **Expected output:**
My output looks pretty different..
```
$ python bulk_evaluate_fcc_invoices_simple.py --results-dir ../../../fcc_results/cli-batch-20251030-211652 --csv-path sample_labels_3.csv --config-path stickler_config.json --output-dir evaluation_output
================================================================================
BULK FCC INVOICE EVALUATION
================================================================================
📋 Loading Stickler config from stickler_config.json...
✓ Service initialized
📊 Loading ground truth from sample_labels_3.csv...
✓ Loaded 3 documents with ground truth
📁 Loading inference results from ../../../fcc_results/cli-batch-20251030-211652...
✓ Loaded 3 inference results
⚙️  Evaluating documents...
✓ Completed evaluation of 3 documents
================================================================================
AGGREGATED RESULTS
================================================================================
📊 Summary: 3 processed, 0 errors
📈 Overall Metrics:
  Precision: 0.0000
  Recall:    0.0000
  F1 Score:  0.0000
  Accuracy:  0.0741
  Confusion Matrix:
    TP:      0  |  FP:      0
    FN:     25  |  TN:      2
    FP1:      0  |  FP2:      0
📋 Field-Level Metrics (Top 10):
  Field                                     Precision     Recall         F1
  ---------------------------------------- ---------- ---------- ----------
  agency                                       0.0000     0.0000     0.0000
  advertiser                                   0.0000     0.0000     0.0000
  gross_total                                  0.0000     0.0000     0.0000
  net_amount_due                               0.0000     0.0000     0.0000
  line_item__description                       0.0000     0.0000     0.0000
  line_item__days                              0.0000     0.0000     0.0000
  line_item__rate                              0.0000     0.0000     0.0000
  line_item__start_date                        0.0000     0.0000     0.0000
  line_item__end_date                          0.0000     0.0000     0.0000
💾 Results saved to evaluation_output
================================================================================
~/projects/idp-pr88/config_library/pattern-2/fcc-invoices (sr/fcc_benchmark)$
```
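For what it's worth, a quick way to sanity-check an all-zero run like this is to compare the field names the evaluator is configured to score against the keys actually present in one inference result. This is a diagnostic sketch only; the JSON layouts assumed below (a top-level `attributes` list in `stickler_config.json`, a flat field/value map per result file) are guesses, not taken from this PR.

```python
import json
from pathlib import Path

# Diagnostic sketch -- both file layouts below are assumptions.
config = json.loads(Path("stickler_config.json").read_text())
results = sorted(Path("../../../fcc_results/cli-batch-20251030-211652").rglob("*.json"))

expected = {a.get("name") for a in config.get("attributes", [])}  # assumed config shape
prediction = json.loads(results[0].read_text())                   # assumed flat field/value map

print("expected but missing:  ", sorted(expected - set(prediction)))
print("present but unexpected:", sorted(set(prediction) - expected))
```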
This doesn't look like the (new JSON-Schema) accelerator config..
Is Stickler ready yet to map to the new config? E.g. PR awslabs/stickler#20?
I'm thinking I need that new version of Stickler, tested with our new config, to proceed, right?
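To make the ask concrete, the kind of mapping in question might look like the toy sketch below: deriving Stickler-style field entries from a JSON-Schema-style property list. Every name here is hypothetical (the `properties` input, the comparator names, and the output shape are made up); nothing is taken from the accelerator config or from awslabs/stickler#20.

```python
# Toy illustration only -- not the accelerator schema and not the Stickler config format.
schema_properties = {
    "agency": {"type": "string", "description": "Buying agency name"},
    "gross_total": {"type": "number", "description": "Gross invoice total"},
}

def to_stickler_fields(properties: dict) -> list[dict]:
    # Comparator names are invented placeholders, chosen per JSON type.
    comparator_by_type = {"string": "fuzzy", "number": "numeric_exact"}
    return [
        {"name": name, "comparator": comparator_by_type.get(spec.get("type"), "exact")}
        for name, spec in properties.items()
    ]

print(to_stickler_fields(schema_properties))
```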
Issue #, if available:
This is a demonstration of using the new CLI tooling in conjunction with the RealKIE FCC dataset to perform benchmarking with the Stickler eval lib.
Description of changes:
This includes a config for the FCC invoices and a Stickler evaluation configuration.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.