This PDF OCR Regex Data Extractor automates the extraction of specific data from PDF documents. It runs OCR to convert scanned or image-based pages into machine-readable text, then applies regex patterns to identify and extract structured fields, which suits documents such as invoices.

dineshram0212/pdf-ocr-extractor

PDF Data Extraction Pipeline

This project is a Python pipeline for extracting specific structured data from PDF documents. The code converts each PDF page into an image, performs OCR (Optical Character Recognition) on each image, and applies regex patterns to the recognized text to extract targeted information. This is particularly useful for pulling technical or structured data out of scanned documents or PDF reports.

Features

  • Converts PDF pages into individual images.
  • Uses OCR to extract text from images.
  • Cleans and parses the text to extract specific fields using regular expressions.
  • Returns structured data for each page, making it easy to use in downstream applications.
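The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the field names, the regex patterns in `FIELD_PATTERNS`, and the function names `clean_text`, `extract_fields`, `ocr_pdf`, and `extract_from_pdf` are all hypothetical, and the sketch assumes Tesseract and Poppler are installed so that `pyocr` and `pdf2image` can do their work.

```python
import re

# Hypothetical patterns for an invoice-like page; the repository's real
# patterns are not shown in this README.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"Date\s*[:\-]?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total": re.compile(r"Total\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def clean_text(text):
    """Collapse the stray line breaks and repeated spaces OCR tends to produce."""
    return re.sub(r"\s+", " ", text).strip()


def extract_fields(text):
    """Return one dict per page: field name -> captured value, or None if absent."""
    cleaned = clean_text(text)
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(cleaned)
        result[name] = match.group(1) if match else None
    return result


def ocr_pdf(pdf_path, lang="eng", dpi=300):
    """Render each PDF page to an image and OCR it; returns one string per page."""
    # Imported lazily so the regex helpers above work without the OCR stack.
    from pdf2image import convert_from_path
    import pyocr
    import pyocr.builders

    tools = pyocr.get_available_tools()
    if not tools:
        raise RuntimeError("No OCR engine found; pyocr needs e.g. Tesseract installed")
    tool = tools[0]

    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images
    return [
        tool.image_to_string(page, lang=lang, builder=pyocr.builders.TextBuilder())
        for page in pages
    ]


def extract_from_pdf(pdf_path):
    """Full pipeline: PDF -> page images -> OCR text -> structured fields per page."""
    return [extract_fields(text) for text in ocr_pdf(pdf_path)]
```

Keeping the regex step separate from the OCR step means the patterns can be tuned against saved OCR text without re-rendering the PDF each time, for example by calling `extract_fields` directly on a string.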

Requirements

  • Python 3.7+
  • opencv-python
  • pdf2image
  • pillow
  • pyocr
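The list above names only Python packages. In addition, `pdf2image` relies on the Poppler utilities (such as `pdftoppm`) and `pyocr` on an external OCR engine such as Tesseract. A small check along the following lines can confirm the environment before running the pipeline. The function name `check_environment` is hypothetical, and the choice of `tesseract` and `pdftoppm` as the executables to look for is an assumption about a typical setup.

```python
import importlib.util
import shutil

# pip package name -> module name used at import time
PYTHON_MODULES = {
    "opencv-python": "cv2",
    "pdf2image": "pdf2image",
    "pillow": "PIL",
    "pyocr": "pyocr",
}

# Assumed external executables: Tesseract for pyocr, Poppler's pdftoppm for pdf2image.
SYSTEM_TOOLS = ["tesseract", "pdftoppm"]


def check_environment():
    """Return the names of required packages and tools that could not be found."""
    missing = []
    for package, module in PYTHON_MODULES.items():
        if importlib.util.find_spec(module) is None:
            missing.append(package)
    for tool in SYSTEM_TOOLS:
        if shutil.which(tool) is None:
            missing.append(tool)
    return missing


if __name__ == "__main__":
    missing = check_environment()
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("All dependencies found.")
```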

Installation

  1. Clone this repository:
    git clone https://github.com/dineshram0212/pdf-ocr-extractor
    cd pdf-ocr-extractor
