oidlabs-com/Lexoid

Lexoid is an efficient document parsing library that supports both LLM-based and non-LLM-based (static) PDF document parsing.

Documentation

Motivation:

  • Take advantage of the multimodal capabilities of modern LLMs
  • Make document parsing convenient for users
  • Encourage collaboration through a permissive license

Installation

Installing with pip

pip install lexoid

To use LLM-based parsing, define the following environment variables, or create a .env file that contains them:

OPENAI_API_KEY=""
GOOGLE_API_KEY=""
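Before running an LLM-based parse, it can help to confirm that the keys are visible to the Python process. The following is a minimal sketch using only the standard library; the helper name missing_api_keys is hypothetical and not part of Lexoid, and the tuple of key names should be adjusted to the providers you actually use.

```python
import os

# Key names taken from the definitions above; trim or extend this tuple
# to match the providers you intend to call.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_api_keys(env=None, required=REQUIRED_KEYS):
    """Return the names in `required` that are unset or empty in `env`.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Report which keys (if any) still need to be defined.
print("Missing keys:", missing_api_keys())
```

An empty list means both variables are set. Note that this checks only the process environment; it does not read a .env file on its own.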

Optionally, to use Playwright for retrieving web content (instead of the requests library):

playwright install --with-deps --only-shell chromium

Building .whl from source

make build

Creating a local installation

To install dependencies:

make install

or, to install with dev-dependencies:

make dev

To activate virtual environment:

source .venv/bin/activate

Usage

Example Notebook

Example Colab Notebook

Here's a quick example to parse documents using Lexoid:

from lexoid.api import ParserType, parse

# Parse a web page by URL; the "raw" entry of the result is read as the parsed output.
parsed_md = parse("https://www.justice.gov/eoir/immigration-law-advisor", parser_type="LLM_PARSE")["raw"]
# or parse a local PDF file
pdf_path = "path/to/immigration-law-advisor.pdf"
parsed_md = parse(pdf_path, parser_type="LLM_PARSE")["raw"]

print(parsed_md)

Parameters

  • path (str): The file path or URL.
  • parser_type (str, optional): The type of parser to use ("LLM_PARSE", "STATIC_PARSE", or "AUTO"). Defaults to "AUTO".
  • pages_per_split (int, optional): Number of pages per split for chunking. Defaults to 4.
  • max_threads (int, optional): Maximum number of threads for parallel processing. Defaults to 4.
  • **kwargs: Additional arguments for the parser.
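To make the interaction between pages_per_split and max_threads concrete, here is a rough sketch of page-range chunking with a bounded thread pool. It illustrates the general idea only and is not Lexoid's internal implementation; page_splits, process_in_parallel, and the worker callback are hypothetical names introduced for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def page_splits(total_pages, pages_per_split=4):
    """Return (start, end) page ranges, end exclusive, covering the document."""
    if pages_per_split < 1:
        raise ValueError("pages_per_split must be at least 1")
    return [
        (start, min(start + pages_per_split, total_pages))
        for start in range(0, total_pages, pages_per_split)
    ]


def process_in_parallel(total_pages, worker, pages_per_split=4, max_threads=4):
    """Apply `worker` to each page range using at most `max_threads` threads.

    Results come back in document order because Executor.map preserves
    the order of its inputs.
    """
    splits = page_splits(total_pages, pages_per_split)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(worker, splits))


# A 10-page document with the default split size yields three chunks.
print(page_splits(10))
print(process_in_parallel(10, lambda span: f"pages {span[0]}-{span[1] - 1}"))
```

Under these assumptions, a smaller pages_per_split produces more chunks that can run concurrently, while max_threads caps how many of them are processed at once.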

Supported API Providers

  • Google
  • OpenAI
  • Hugging Face
  • Together AI
  • OpenRouter
  • Fireworks

Benchmark

Results aggregated across 14 documents.

Note: Benchmarks are currently done in the zero-shot setting.

Rank | Model | SequenceMatcher Similarity | TFIDF Similarity | Time (s) | Cost ($)
1 | AUTO (with auto-selected model) | 0.899 (±0.131) | 0.960 (±0.066) | 21.17 | 0.00066
2 | AUTO | 0.895 (±0.112) | 0.973 (±0.046) | 9.29 | 0.00063
3 | gemini-2.5-flash | 0.886 (±0.164) | 0.986 (±0.027) | 52.55 | 0.01226
4 | mistral-ocr-latest | 0.882 (±0.106) | 0.932 (±0.091) | 5.75 | 0.00121
5 | gemini-2.5-pro | 0.876 (±0.195) | 0.976 (±0.049) | 22.65 | 0.02408
6 | gemini-2.0-flash | 0.875 (±0.148) | 0.977 (±0.037) | 11.96 | 0.00079
7 | claude-3-5-sonnet-20241022 | 0.858 (±0.184) | 0.930 (±0.098) | 17.32 | 0.01804
8 | gemini-1.5-flash | 0.842 (±0.214) | 0.969 (±0.037) | 15.58 | 0.00043
9 | gpt-5-mini | 0.819 (±0.201) | 0.917 (±0.104) | 52.84 | 0.00811
10 | gpt-5 | 0.807 (±0.215) | 0.919 (±0.088) | 98.12 | 0.05505
11 | claude-sonnet-4-20250514 | 0.801 (±0.188) | 0.905 (±0.136) | 22.02 | 0.02056
12 | claude-opus-4-20250514 | 0.789 (±0.220) | 0.886 (±0.148) | 29.55 | 0.09513
13 | accounts/fireworks/models/llama4-maverick-instruct-basic | 0.772 (±0.203) | 0.930 (±0.117) | 16.02 | 0.00147
14 | gemini-1.5-pro | 0.767 (±0.309) | 0.865 (±0.230) | 24.77 | 0.01139
15 | gpt-4.1-mini | 0.754 (±0.249) | 0.803 (±0.193) | 23.28 | 0.00347
16 | accounts/fireworks/models/llama4-scout-instruct-basic | 0.754 (±0.243) | 0.942 (±0.063) | 13.36 | 0.00087
17 | gpt-4o | 0.752 (±0.269) | 0.896 (±0.123) | 28.87 | 0.01469
18 | gpt-4o-mini | 0.728 (±0.241) | 0.850 (±0.128) | 18.96 | 0.00609
19 | claude-3-7-sonnet-20250219 | 0.646 (±0.397) | 0.758 (±0.297) | 57.96 | 0.01730
20 | gpt-4.1 | 0.637 (±0.301) | 0.787 (±0.185) | 35.37 | 0.01498
21 | google/gemma-3-27b-it | 0.604 (±0.342) | 0.788 (±0.297) | 23.16 | 0.00020
22 | microsoft/phi-4-multimodal-instruct | 0.589 (±0.273) | 0.820 (±0.197) | 14.00 | 0.00045
23 | qwen/qwen-2.5-vl-7b-instruct | 0.498 (±0.378) | 0.630 (±0.445) | 14.73 | 0.00056
24 | ds4sd/SmolDocling-256M-preview | 0.482 (±0.365) | 0.572 (±0.351) | 106.19 | 0.00000
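For readers unfamiliar with the SequenceMatcher Similarity column, the sketch below shows how such a score can be computed for a single document. It assumes the metric corresponds to the ratio() of Python's difflib.SequenceMatcher, which the column name suggests but this README does not confirm; the function name sequence_similarity and the sample strings are illustrative only.

```python
from difflib import SequenceMatcher


def sequence_similarity(expected, actual):
    """Return a similarity score in [0, 1] between two strings.

    Assumption: the benchmark column reflects difflib's ratio(), i.e.
    2 * matched_characters / total_characters across both strings.
    """
    return SequenceMatcher(None, expected, actual).ratio()


ground_truth = "# Title\nHello world"
parsed_output = "# Title\nHello word"

# Identical strings score 1.0; a small deviation lowers the score slightly.
print(sequence_similarity(ground_truth, ground_truth))
print(round(sequence_similarity(ground_truth, parsed_output), 3))
```

A character-level score like this penalizes formatting and ordering differences, whereas a TF-IDF based score mostly reflects vocabulary overlap, which may be why the two columns can rank models differently.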