IOM Fine-Tuning Pipeline

This repository provides a minimal end-to-end pipeline to fine-tune a seq2seq Transformer (e.g. T5 or BART) on the FDA Investigations Operations Manual (IOM). It now includes ontology mapping via Owlready2 against your COPE ontology (COPE_pharma.owl).

🚀 Features

Text Preprocessing
Clean & segment raw IOM into coherent text units.
Triple & QA Generation with Ontology Mapping
Extract (subject, verb, object) triples via spaCy, map them to COPE classes (OWL), and auto-generate Q&A pairs.
Training Data Preparation
Format Q&A into source/target JSON-lines for Hugging Face fine-tuning.
Model Fine-Tuning
Use Hugging Face Transformers' Seq2SeqTrainer to specialize T5/BART on regulatory QA.
Inference Script
infer.py supports single-shot or interactive QA using your fine-tuned model.

🛠️ Setup & Installation

Clone the repo

```
git clone https://github.com/SuYirouCrystal/susie_iom_finetune.git
cd susie_iom_finetune
```

Install dependencies

```
pip install -r requirements.txt
# spaCy model
python3 -m spacy download en_core_web_sm
# (Optional) SciSpaCy
pip install scispacy
pip install \
  https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_sm-0.5.0.tar.gz
```

Add your COPE ontology

Place your OWL file at:
```
data/ontology/COPE_pharma.owl
```

📈 Pipeline Steps

Preprocess the IOM

```
python scripts/preprocess_iom.py \
  --input  data/raw/IOM.docx \
  --output data/processed/iom_segments.txt
```

Cleans headers/footers, removes page numbers, and splits into paragraphs.
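As a rough illustration of the segmentation step, the sketch below drops page-number lines and splits on blank lines. It is not the code in preprocess_iom.py: the `PAGE_NUMBER` pattern, the `segment_text` helper, and the `min_chars` threshold are all assumptions, and header/footer removal is left out because it depends on how the IOM export is laid out.

```python
import re

# Hypothetical pattern: a line holding only a page number, e.g. "12" or "Page 12".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d+\s*$", re.IGNORECASE)


def segment_text(raw: str, min_chars: int = 20) -> list[str]:
    """Drop page-number lines, then split on blank lines into paragraphs."""
    kept = [line for line in raw.splitlines() if not PAGE_NUMBER.match(line)]
    paragraphs = re.split(r"\n\s*\n", "\n".join(kept))
    # Collapse internal whitespace and discard fragments shorter than min_chars.
    cleaned = (" ".join(p.split()) for p in paragraphs)
    return [p for p in cleaned if len(p) >= min_chars]


# Made-up sample text, not taken from the IOM.
sample = (
    "First paragraph of sample text\n"
    "that wraps across two lines.\n"
    "\n"
    "Page 12\n"
    "\n"
    "Second paragraph of sample text."
)
segments = segment_text(sample)
```

Each returned string is one text unit; writing them one per line would give a file shaped like iom_segments.txt, assuming that is the format the later steps expect.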

Generate Triples & Q&A

```
python scripts/generate_qa_data.py \
  --segments       data/processed/iom_segments.txt \
  --output_qas     data/processed/qa_pairs.json \
  --output_triples data/processed/triples.json \
  --ontology       data/ontology/COPE_pharma.owl
```

Extracts SVO triples, maps subject/object to COPE classes, then builds question/answer pairs.
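To make the triple-to-Q&A idea concrete, here is a minimal sketch that assumes the triples have already been extracted. Everything in it is illustrative: the `Triple` type, the question template, and the `LABEL_TO_CLASS` dictionary are hypothetical, and the dictionary merely stands in for a lookup against COPE_pharma.owl (the class names shown are invented, not taken from the ontology).

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    verb: str
    obj: str


# Hypothetical stand-in for the ontology lookup: lowercase label -> class name.
# The actual pipeline maps terms to COPE classes loaded with Owlready2.
LABEL_TO_CLASS = {"investigator": "Investigator", "sample": "Sample"}


def map_to_class(term: str) -> Optional[str]:
    """Return the class whose label matches the term's last word, if any."""
    words = term.lower().split()
    return LABEL_TO_CLASS.get(words[-1]) if words else None


def triple_to_qa(t: Triple) -> dict:
    """Turn one triple into a question whose answer is the object."""
    return {
        "question": f"What does the {t.subject} {t.verb}?",
        "answer": t.obj,
        "subject_class": map_to_class(t.subject),
        "object_class": map_to_class(t.obj),
    }


qa = triple_to_qa(Triple("investigator", "collect", "an official sample"))
```

Matching on the last word is a crude head-noun heuristic chosen to keep the sketch short; terms with no match map to `None` rather than to a guessed class.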

Prepare Training Data

```
python scripts/prepare_train_data.py \
  --input  data/processed/qa_pairs.json \
  --output data/processed/train.jsonl
```

Converts the Q&A JSON into train.jsonl, with one source/target pair per line, for seq2seq training.
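The conversion can be pictured with the sketch below. The field names `question`, `answer`, `source`, and `target` are assumptions about the file layouts, and the `question: ` prefix is borrowed from the single-shot inference example rather than confirmed from prepare_train_data.py.

```python
import json


def to_jsonl_lines(qa_pairs: list[dict], prefix: str = "question: ") -> list[str]:
    """Serialize each Q&A pair as one JSON object per line."""
    lines = []
    for pair in qa_pairs:
        record = {
            "source": prefix + pair["question"].strip(),
            "target": pair["answer"].strip(),
        }
        # ensure_ascii=False keeps any non-ASCII characters readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


# Placeholder pair; the answer text is deliberately not a real IOM excerpt.
pairs = [{"question": "What is Form FDA 482? ", "answer": " placeholder answer"}]
lines = to_jsonl_lines(pairs)
```

Joining `lines` with newlines and writing the result to disk yields a JSON-lines file with one training example per line.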

Fine-Tune the Model
```
python scripts/finetune.py \
  --config config/finetune_config.json
```
Trains your chosen model (e.g. t5-base) on data/processed/train.jsonl, saving to models/iom_model/.

Inference / Interactive QA

Single-Shot:

```
python scripts/infer.py \
  --model_dir models/iom_model \
  --question "question: What is Form FDA 482?"
```

Interactive Loop:
```
python scripts/infer.py --model_dir models/iom_model
```
Type your questions at the "question:" prompt, and type "exit" to quit.
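The interactive behaviour described here can be sketched as a small loop with the model call injected as a function. This is an assumption about structure, not a copy of infer.py: `qa_loop` and its `answer` parameter are hypothetical, and whether the real script adds the `question: ` prefix itself is not stated in this README.

```python
from typing import Callable, Iterable, Iterator


def qa_loop(questions: Iterable[str], answer: Callable[[str], str]) -> Iterator[str]:
    """Yield an answer for each question until 'exit' is entered."""
    for raw in questions:
        text = raw.strip()
        if text.lower() == "exit":
            break
        if not text:
            continue  # skip empty input lines
        # Add the task prefix used in the single-shot example if it is missing.
        if not text.startswith("question:"):
            text = "question: " + text
        yield answer(text)


# str.upper is a stand-in for model generation so the sketch runs without weights.
replies = list(qa_loop(["What is Form FDA 482?", "", "exit", "ignored"], str.upper))
```

Passing the answer function in keeps the loop testable without loading a model; a real script could supply a function that tokenizes the text and calls `model.generate`.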

📦 Model Directory

The models/ folder holds your fine-tuned weights, tracked via Git LFS. After cloning the repo, fetch them:

```
# Make sure Git LFS is installed
git lfs install
# Pull LFS files
git lfs pull
```

Load the model in code:

```
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("models/iom_model")
model     = AutoModelForSeq2SeqLM.from_pretrained("models/iom_model")
```

🎛️ Configuration

Edit config/finetune_config.json to adjust:

```
{
  "model_name": "t5-base",
  "num_train_epochs": 3,
  "batch_size": 4,
  "learning_rate": 5e-5
}
```

Swap in "t5-large" or "facebook/bart-base" and tweak hyperparameters to match your GPU/CPU.
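One way a training script might read these settings is sketched below. The `parse_config` helper, the merge-over-defaults behaviour, and the rejection of unknown keys are assumptions for illustration; finetune.py may handle the file differently.

```python
import json

# Defaults mirror the example finetune_config.json in this README.
DEFAULTS = {
    "model_name": "t5-base",
    "num_train_epochs": 3,
    "batch_size": 4,
    "learning_rate": 5e-5,
}


def parse_config(text: str) -> dict:
    """Merge user settings over the defaults and reject unknown keys."""
    user = json.loads(text)
    unknown = set(user) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return {**DEFAULTS, **user}


# Override only what differs, e.g. a smaller batch for a memory-limited GPU.
config = parse_config('{"model_name": "facebook/bart-base", "batch_size": 2}')
```

Rejecting unknown keys is a design choice that surfaces typos such as "learning-rate" early instead of silently training with the default.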
