FCC Benchmark Demo [DRAFT] #88
Conversation
> config_library/pattern-2/fcc-invoices/
> ├── README.md             # This file
> ├── config.yaml           # Base IDP configuration
> ├── fcc_configured.yaml   # Deployed stack configuration
What's the difference between `fcc_configured.yaml` and `config.yaml`?
> ## Sample Data
>
> Sample documents are located in `samples/fcc-invoices/`:
> - 3 sample PDF invoices
I don't see the sample docs in this PR..
> ```
> --output-dir evaluation_output
> ```
>
> **Note**: The `sample_labels_3.csv` file contains ground truth for the 3 sample documents. For full-dataset evaluation, use `sr_refactor_labels_5_5_25.csv`.
Hmm.. So your ground truth format isn't the same as the inference results format?
How will this work in the accelerator for HITL review -> Save as Evaluation Baseline, which basically copies the verified inference results to be used as ground truth in the future? Let's discuss how to align the ground truth format with the inference results format.
Actually, strike that.. I see from your code that Stickler itself isn't using this CSV format, only your local script, which saves the GT in the correct inference_results format before running Stickler, so we're good.
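For reference, the conversion step described above could look roughly like the sketch below. The CSV column names (`document_id`, `field_name`, `value`), the per-document grouping, and the output JSON shape are all assumptions for illustration only; the local script in this PR is the actual source of truth.

```python
"""Hypothetical sketch of converting CSV ground truth into inference_results-shaped
JSON files so Stickler compares identical formats on both sides. Column names and
output layout are assumptions, not taken from this PR."""
import csv
import json
from collections import defaultdict
from pathlib import Path

docs = defaultdict(dict)
with open("sample_labels_3.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Assumed columns: document_id, field_name, value
        docs[row["document_id"]][row["field_name"]] = row["value"]

out_dir = Path("ground_truth_as_results")
out_dir.mkdir(exist_ok=True)
for doc_id, fields in docs.items():
    # Mirror the assumed shape of an inference result (flat field -> value map).
    (out_dir / f"{doc_id}.json").write_text(json.dumps(fields, indent=2))
```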
> - Direct integration with SticklerEvaluationService
> - Same accurate results
>
> **Expected output:**
My output looks pretty different..
```
$ python bulk_evaluate_fcc_invoices_simple.py --results-dir ../../../fcc_results/cli-batch-20251030-211652 --csv-path sample_labels_3.csv --config-path stickler_config.json --output-dir evaluation_output
================================================================================
BULK FCC INVOICE EVALUATION
================================================================================
📋 Loading Stickler config from stickler_config.json...
✓ Service initialized
📊 Loading ground truth from sample_labels_3.csv...
✓ Loaded 3 documents with ground truth
📁 Loading inference results from ../../../fcc_results/cli-batch-20251030-211652...
✓ Loaded 3 inference results
⚙️  Evaluating documents...
✓ Completed evaluation of 3 documents
================================================================================
AGGREGATED RESULTS
================================================================================
📊 Summary: 3 processed, 0 errors
📈 Overall Metrics:
  Precision: 0.0000
  Recall:    0.0000
  F1 Score:  0.0000
  Accuracy:  0.0741
  Confusion Matrix:
    TP:      0  |  FP:      0
    FN:     25  |  TN:      2
    FP1:      0  |  FP2:      0
📋 Field-Level Metrics (Top 10):
  Field                                     Precision     Recall         F1
  ---------------------------------------- ---------- ---------- ----------
  agency                                       0.0000     0.0000     0.0000
  advertiser                                   0.0000     0.0000     0.0000
  gross_total                                  0.0000     0.0000     0.0000
  net_amount_due                               0.0000     0.0000     0.0000
  line_item__description                       0.0000     0.0000     0.0000
  line_item__days                              0.0000     0.0000     0.0000
  line_item__rate                              0.0000     0.0000     0.0000
  line_item__start_date                        0.0000     0.0000     0.0000
  line_item__end_date                          0.0000     0.0000     0.0000
💾 Results saved to evaluation_output
================================================================================
~/projects/idp-pr88/config_library/pattern-2/fcc-invoices (sr/fcc_benchmark)$
```
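For what it's worth, a quick way to sanity-check an all-zero run like this is to compare the field names the evaluator is configured to score against the keys actually present in one inference result. This is a diagnostic sketch only; the JSON layouts assumed below (a top-level `attributes` list in `stickler_config.json`, a flat field/value map per result file) are guesses, not taken from this PR.

```python
import json
from pathlib import Path

# Diagnostic sketch -- both file layouts below are assumptions.
config = json.loads(Path("stickler_config.json").read_text())
results = sorted(Path("../../../fcc_results/cli-batch-20251030-211652").rglob("*.json"))

expected = {a.get("name") for a in config.get("attributes", [])}  # assumed config shape
prediction = json.loads(results[0].read_text())                   # assumed flat field/value map

print("expected but missing:  ", sorted(expected - set(prediction)))
print("present but unexpected:", sorted(set(prediction) - expected))
```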
This doesn't look like the (new JSON-Schema) accelerator config..
Is Stickler ready yet to map to the new config? E.g. PR awslabs/stickler#20?
I'm thinking I need that new version of Stickler, tested with our new config, to proceed, right?
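To make the ask concrete, the kind of mapping in question might look like the toy sketch below: deriving Stickler-style field entries from a JSON-Schema-style property list. Every name here is hypothetical (the `properties` input, the comparator names, and the output shape are made up); nothing is taken from the accelerator config or from awslabs/stickler#20.

```python
# Toy illustration only -- not the accelerator schema and not the Stickler config format.
schema_properties = {
    "agency": {"type": "string", "description": "Buying agency name"},
    "gross_total": {"type": "number", "description": "Gross invoice total"},
}

def to_stickler_fields(properties: dict) -> list[dict]:
    # Comparator names are invented placeholders, chosen per JSON type.
    comparator_by_type = {"string": "fuzzy", "number": "numeric_exact"}
    return [
        {"name": name, "comparator": comparator_by_type.get(spec.get("type"), "exact")}
        for name, spec in properties.items()
    ]

print(to_stickler_fields(schema_properties))
```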
Issue #, if available:
This is a demonstration of using the new CLI tooling in conjunction with the RealKIE FCC dataset to perform benchmarking with the Stickler eval lib.
Description of changes:
This includes a config for the FCC invoices and a Stickler evaluation configuration.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.