GitHub - oidlabs-com/Lexoid: Multimodal document parser for high quality data understanding and extraction

Lexoid is an efficient document parsing library that supports both LLM-based and non-LLM-based (static) PDF document parsing.

Documentation

Motivation:

Leverage the multimodal capabilities of LLMs
Make document parsing convenient for users
Encourage collaboration through a permissive license

Installation

Installing with pip

pip install lexoid

To use LLM-based parsing, define the following environment variables or add them to a .env file:

OPENAI_API_KEY=""
GOOGLE_API_KEY=""
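
Before calling an LLM-based parser, it can help to confirm that the keys are visible to the process. The following is a minimal sketch using only the standard library. The helper name `missing_api_keys` is hypothetical and not part of Lexoid, and it assumes the variables are already exported in the environment (it does not read a .env file itself).

```python
import os

# Variable names taken from the installation notes above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_api_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required keys that are unset or empty.

    Hypothetical helper for illustration; not part of Lexoid.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All API keys are set.")
```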

Optionally, to use Playwright for retrieving web content (instead of the requests library):

playwright install --with-deps --only-shell chromium

Building `.whl` from source

make build

Creating a local installation

To install dependencies:

make install

or, to install with dev-dependencies:

make dev

To activate the virtual environment:

source .venv/bin/activate

Usage

Example Notebook

Example Colab Notebook

Here's a quick example of parsing documents with Lexoid:

from lexoid.api import ParserType, parse

# Parse a web page; the "raw" entry of the result holds the parsed Markdown.
parsed_md = parse("https://www.justice.gov/eoir/immigration-law-advisor", parser_type="LLM_PARSE")["raw"]
# or parse a local PDF
pdf_path = "path/to/immigration-law-advisor.pdf"
parsed_md = parse(pdf_path, parser_type="LLM_PARSE")["raw"]

print(parsed_md)
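
When several documents need parsing, the call above can be wrapped in a small loop. The sketch below is hedged: `parse_many` is a hypothetical helper, not part of Lexoid. It takes the parse function as an argument, so it relies only on what the example above shows: a callable that accepts a path and `parser_type` and returns a mapping with a "raw" entry.

```python
def parse_many(paths, parse_fn, parser_type="LLM_PARSE", **kwargs):
    """Parse each path and collect the raw text keyed by path.

    Hypothetical helper. `parse_fn` is expected to behave like
    `lexoid.api.parse` in the example above: it returns a mapping whose
    "raw" entry holds the parsed text.
    """
    results = {}
    for path in paths:
        try:
            results[path] = parse_fn(path, parser_type=parser_type, **kwargs)["raw"]
        except Exception as exc:  # keep going if one document fails
            results[path] = None
            print(f"Failed to parse {path}: {exc}")
    return results
```

With Lexoid installed, this would be called as `parse_many(["a.pdf", "b.pdf"], parse)` after `from lexoid.api import parse`. Failed documents map to `None` so that one bad input does not abort the batch.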

Parameters

path (str): The file path or URL.
parser_type (str, optional): The type of parser to use ("LLM_PARSE", "STATIC_PARSE", or "AUTO"). Defaults to "AUTO".
pages_per_split (int, optional): Number of pages per split for chunking. Defaults to 4.
max_threads (int, optional): Maximum number of threads for parallel processing. Defaults to 4.
**kwargs: Additional arguments for the parser.
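
To make `pages_per_split` and `max_threads` concrete, the sketch below shows one way page-based chunking with a thread pool can work. It illustrates the general idea only and is an assumption about the behavior, not Lexoid's actual implementation; `page_ranges`, `process_in_parallel`, and the `worker` callable are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def page_ranges(total_pages, pages_per_split=4):
    """Split page indices 0..total_pages-1 into half-open (start, end) ranges."""
    if pages_per_split < 1:
        raise ValueError("pages_per_split must be at least 1")
    return [
        (start, min(start + pages_per_split, total_pages))
        for start in range(0, total_pages, pages_per_split)
    ]


def process_in_parallel(total_pages, worker, pages_per_split=4, max_threads=4):
    """Run worker(start, end) on each range, keeping results in page order."""
    ranges = page_ranges(total_pages, pages_per_split)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Executor.map returns results in the order of the inputs.
        return list(pool.map(lambda r: worker(*r), ranges))
```

For a 10-page document with the defaults, `page_ranges(10)` gives `[(0, 4), (4, 8), (8, 10)]`, so three chunks are handed to a pool of up to four threads.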

Supported API Providers

Google
OpenAI
Hugging Face
Together AI
OpenRouter
Fireworks

Benchmark

Results aggregated across 14 documents.

Note: Benchmarks are currently done in the zero-shot setting.

Rank	Model	SequenceMatcher Similarity	TFIDF Similarity	Time (s)	Cost ($)
1	AUTO (with auto-selected model)	0.899 (±0.131)	0.960 (±0.066)	21.17	0.00066
2	AUTO	0.895 (±0.112)	0.973 (±0.046)	9.29	0.00063
3	gemini-2.5-flash	0.886 (±0.164)	0.986 (±0.027)	52.55	0.01226
4	mistral-ocr-latest	0.882 (±0.106)	0.932 (±0.091)	5.75	0.00121
5	gemini-2.5-pro	0.876 (±0.195)	0.976 (±0.049)	22.65	0.02408
6	gemini-2.0-flash	0.875 (±0.148)	0.977 (±0.037)	11.96	0.00079
7	claude-3-5-sonnet-20241022	0.858 (±0.184)	0.930 (±0.098)	17.32	0.01804
8	gemini-1.5-flash	0.842 (±0.214)	0.969 (±0.037)	15.58	0.00043
9	gpt-5-mini	0.819 (±0.201)	0.917 (±0.104)	52.84	0.00811
10	gpt-5	0.807 (±0.215)	0.919 (±0.088)	98.12	0.05505
11	claude-sonnet-4-20250514	0.801 (±0.188)	0.905 (±0.136)	22.02	0.02056
12	claude-opus-4-20250514	0.789 (±0.220)	0.886 (±0.148)	29.55	0.09513
13	accounts/fireworks/models/llama4-maverick-instruct-basic	0.772 (±0.203)	0.930 (±0.117)	16.02	0.00147
14	gemini-1.5-pro	0.767 (±0.309)	0.865 (±0.230)	24.77	0.01139
15	gpt-4.1-mini	0.754 (±0.249)	0.803 (±0.193)	23.28	0.00347
16	accounts/fireworks/models/llama4-scout-instruct-basic	0.754 (±0.243)	0.942 (±0.063)	13.36	0.00087
17	gpt-4o	0.752 (±0.269)	0.896 (±0.123)	28.87	0.01469
18	gpt-4o-mini	0.728 (±0.241)	0.850 (±0.128)	18.96	0.00609
19	claude-3-7-sonnet-20250219	0.646 (±0.397)	0.758 (±0.297)	57.96	0.01730
20	gpt-4.1	0.637 (±0.301)	0.787 (±0.185)	35.37	0.01498
21	google/gemma-3-27b-it	0.604 (±0.342)	0.788 (±0.297)	23.16	0.00020
22	microsoft/phi-4-multimodal-instruct	0.589 (±0.273)	0.820 (±0.197)	14.00	0.00045
23	qwen/qwen-2.5-vl-7b-instruct	0.498 (±0.378)	0.630 (±0.445)	14.73	0.00056
24	ds4sd/SmolDocling-256M-preview	0.482 (±0.365)	0.572 (±0.351)	106.19	0.00000
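
The "SequenceMatcher Similarity" column name suggests a ratio like the one computed by `difflib.SequenceMatcher` in Python's standard library. As a rough sketch of how such a score can be computed between parsed output and a ground-truth transcription (the whitespace normalization and the function name `sequence_similarity` are assumptions; the benchmark's exact preprocessing is not described here):

```python
from difflib import SequenceMatcher


def sequence_similarity(parsed, reference):
    """Return a similarity ratio in [0, 1] between two texts.

    Whitespace is collapsed first; this normalization is an assumption
    made for illustration, not necessarily what the benchmark does.
    """
    a = " ".join(parsed.split())
    b = " ".join(reference.split())
    return SequenceMatcher(None, a, b).ratio()
```

A score of 1.0 means the two texts are identical after normalization, and lower values indicate more differences between the parsed output and the reference.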
