A Supervised Framework for Document Processing at Scale with Large Language Models in Credit-Risk Research
Authors:
Cosmin Cojocaru¹ and Sorin Ionescu²
¹ National University of Science and Technology Politehnica Bucharest, Romania
² National University of Science and Technology Politehnica Bucharest, Romania
This repository accompanies the paper:
Cojocaru, C., & Ionescu, S. (2025). A Supervised Framework for Document Processing at Scale with Large Language Models in Credit-Risk Research. Proceedings of ICMIE 2025.
It implements a supervised, schema-guided pipeline for document metadata extraction and evaluation using Large Language Models (LLMs).
Two notebooks are provided: one for the extraction framework (with a Gradio UI) and one for evaluation and benchmarking.
notebooks/
├── 01_supervised_document_framework.ipynb → Gradio interface and extraction pipeline
└── 02_evaluation_of_jsons.ipynb → Evaluation and benchmarking logic
data/
├── Extracted JSON outputs per model
├── Centralized metadata CSV
└── evaluation/
    ├── Evaluation metrics (CSV)
    └── Radar visualization files (PNG)
All dependencies are pre-installed in Colab.
Local execution is optional. To run the notebooks locally, install the dependencies first:
pip install -r requirements.txt
Original PDFs are not redistributed due to copyright restrictions. All derived JSON and CSV files necessary for reproducibility are provided in data/ and data/evaluation/. These include:
- Baseline and LLM-generated metadata JSONs
- Evaluation metrics (precision, recall, F1, Weighted Score)
- Visualization data (radar plots and summaries)
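The precision, recall, and F1 figures listed above can be illustrated with a minimal sketch. It assumes field-level exact matching between a baseline metadata record and an LLM-generated one after light normalization. The function name `field_level_scores` and the example field names are hypothetical, and the matching rule is an assumption rather than the logic used in `02_evaluation_of_jsons.ipynb`. The Weighted Score is not defined in this README, so it is left out.

```python
def field_level_scores(baseline: dict, predicted: dict) -> dict:
    """Score one predicted metadata record against its baseline record.

    Assumption: a field counts as correct when its normalized value
    (string form, trimmed, lowercased) equals the baseline value.
    """
    def norm(value):
        return str(value).strip().lower()

    # Ignore fields that are missing or empty on either side.
    base = {k: norm(v) for k, v in baseline.items() if v not in (None, "")}
    pred = {k: norm(v) for k, v in predicted.items() if v not in (None, "")}

    # True positives: predicted fields whose value matches the baseline.
    tp = sum(1 for k, v in pred.items() if base.get(k) == v)

    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(base) if base else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical records; the field names are illustrative only.
baseline = {
    "title": "Credit Risk Models",
    "year": 2020,
    "authors": "A. Smith",
    "journal": "J. Finance",
}
predicted = {
    "title": "credit risk models ",
    "year": 2021,
    "authors": "A. Smith",
}

scores = field_level_scores(baseline, predicted)
print(scores)
```

Here two of the three predicted fields match the baseline, and the baseline has four fields, so precision is 2/3 and recall is 1/2. A stricter or fuzzier matching rule (for example, token overlap on author lists) would change these numbers.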
Cojocaru, C., & Ionescu, S. (2025). A Supervised Framework for Document Processing at Scale with Large Language Models in Credit-Risk Research. ICMIE 2025.
This project is released under the MIT License.
GitHub: https://github.com/cojocarucosmin/supervised-llm-document-framework (release v1.0)