PDF Data Extraction Pipeline

This project is a Python pipeline that extracts structured data from PDF documents. It converts each PDF page into an image, runs OCR (Optical Character Recognition) on that image, and applies regex patterns to pull out targeted fields. It is particularly useful for pulling technical or structured data out of scanned documents and PDF reports.
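The first two stages (page rendering and OCR) might look roughly like the sketch below. It is not taken from main.py: the function names pdf_to_images, ocr_image, and ocr_pages are hypothetical, the 300 DPI default is an assumption, and it assumes the Poppler utilities (needed by pdf2image) and an OCR engine such as Tesseract (needed by pyocr) are installed on the system.

```python
from typing import Any, Callable, Iterable, List


def pdf_to_images(pdf_path: str, dpi: int = 300) -> List[Any]:
    """Render every page of a PDF to a PIL image (hypothetical helper)."""
    # Imported lazily so the module loads even when pdf2image is missing.
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, dpi=dpi)


def ocr_image(image: Any, lang: str = "eng") -> str:
    """Run the first OCR tool pyocr can find on a single image."""
    import pyocr
    import pyocr.builders

    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("No OCR tool found; is Tesseract installed?")
    return tools[0].image_to_string(
        image, lang=lang, builder=pyocr.builders.TextBuilder()
    )


def ocr_pages(
    images: Iterable[Any], ocr: Callable[[Any], str] = ocr_image
) -> List[str]:
    """Return one text string per page, in page order.

    The OCR function is injectable so the loop can be tested without
    an OCR engine installed.
    """
    return [ocr(image) for image in images]
```

With the dependencies in place, ocr_pages(pdf_to_images("report.pdf")) would return a list of page texts; "report.pdf" is a placeholder path.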

Features

Converts PDF pages into individual images.
Uses OCR to extract text from images.
Cleans and parses the text to extract specific fields using regular expressions.
Returns structured data for each page, making it easy to use in downstream applications.
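The cleaning and regex parsing step could be sketched as follows, using only the standard library. The field names and patterns (invoice_no, date, total) are invented for illustration, since the project's actual patterns are not listed here, and the helper names clean_text, parse_page, and parse_document are likewise assumptions.

```python
import re
from typing import Dict, List, Optional

# Hypothetical field patterns; replace with patterns for your documents.
FIELD_PATTERNS = {
    "invoice_no": re.compile(
        r"Invoice\s*No\.?\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"Date\s*[:\-]?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total": re.compile(r"Total\s*[:\-]?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def clean_text(raw: str) -> str:
    """Collapse the stray newlines and repeated spaces common in OCR output."""
    return re.sub(r"\s+", " ", raw).strip()


def parse_page(raw_text: str) -> Dict[str, Optional[str]]:
    """Extract each field from one page; missing fields map to None."""
    text = clean_text(raw_text)
    result: Dict[str, Optional[str]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


def parse_document(pages: List[str]) -> List[Dict[str, Optional[str]]]:
    """Return one dictionary of fields per page, in page order."""
    return [parse_page(page) for page in pages]
```

Returning None for a field that was not found, rather than dropping the key, keeps every page's dictionary the same shape, which tends to simplify downstream code.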

Requirements

Python 3.7+
opencv-python
pdf2image
pillow
pyocr

Installation

Clone this repository:

git clone https://github.com/dineshram0212/pdf-ocr-extractor
cd pdf-ocr-extractor
