Commit 316884b

Merge pull request #91 from CatchTheTornado/feat/56-easy-ocr
#62 #56 #87 feat: easyOCR added, tesseract - removed, marker - removed, license changed to MIT
2 parents f52d52e + ac6baf0 commit 316884b

19 files changed, with 130 additions and 797 deletions

.vscode/settings.json

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+{
+    "files.exclude": {
+        "**/__pycache__": true,
+        "**/*.egg-info": true
+    }
+}

LICENSE

Lines changed: 21 additions & 674 deletions
Large diffs are not rendered by default.

Makefile

Lines changed: 2 additions & 2 deletions
@@ -66,12 +66,12 @@ setup-local:
 .PHONY: install-linux
 install-linux:
 	@echo -e "\033[1;34m Installing Linux dependencies...\033[0m"; \
-	sudo apt update && sudo apt install -y libmagic1 tesseract-ocr poppler-utils pkg-config
+	sudo apt update && sudo apt install -y libmagic1 poppler-utils pkg-config
 
 .PHONY: install-macos
 install-macos:
 	@echo -e "\033[1;34m Installing macOS dependencies...\033[0m"; \
-	brew update && brew install libmagic tesseract poppler pkg-config ghostscript ffmpeg automake autoconf
+	brew update && brew install libmagic poppler pkg-config ghostscript ffmpeg automake autoconf
 
 .PHONY: install-requirements
 install-requirements:

README.md

Lines changed: 13 additions & 17 deletions
@@ -7,8 +7,8 @@ The API is built with FastAPI and uses Celery for asynchronous task processing.
 ![hero doc extract](ocr-hero.webp)
 
 ## Features:
-- **No Cloud/external dependencies** all you need: PyTorch based OCR (Marker) + Ollama are shipped and configured via `docker-compose` no data is sent outside your dev/server environment,
-- **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies including [marker](https://github.com/VikParuchuri/marker) and [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [surya-ocr](https://github.com/VikParuchuri/surya) or [tessereact](https://github.com/h/pytesseract)
+- **No Cloud/external dependencies** all you need: PyTorch based OCR (EasyOCR) + Ollama are shipped and configured via `docker-compose` no data is sent outside your dev/server environment,
+- **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies including [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [easyOCR](https://github.com/JaidedAI/EasyOCR)
 - **PDF/Office to JSON** conversion using Ollama supported models (eg. LLama 3.1)
 - **LLM Improving OCR results** LLama is pretty good with fixing spelling and text issues in the OCR text
 - **Removing PII** This tool can be used for removing Personally Identifiable Information out of document - see `examples`
@@ -39,8 +39,6 @@ Before running the example see [getting started](#getting-started)
 
 ![Converting Invoice to JSON](./screenshots/example-2.png)
 
-**Note:** As you may observe in the example above, `marker-pdf` sometimes mismatches the cols and rows which could have potentially great impact on data accuracy. To improve on it there is a feature request [#3](https://github.com/CatchTheTornado/text-extract-api/issues/3) for adding alternative support for [`tabled`](https://github.com/VikParuchuri/tabled) model - which is optimized for tables.
-
 ## Getting started
 
 You might want to run the app directly on your machine for development purposes OR to use for example Apple GPUs (which are not supported by Docker at the moment).
@@ -114,7 +112,7 @@ This command will install all the dependencies - including Redis (via Docker, so
 
 (MAC) - Dependencies
 ```
-brew update && brew install libmagic tesseract poppler pkg-config ghostscript ffmpeg automake autoconf
+brew update && brew install libmagic poppler pkg-config ghostscript ffmpeg automake autoconf
 ```
 
 (Mac) - You need to startup the celery worker
@@ -312,9 +310,11 @@ python client/cli.py llm_pull --model llama3.2-vision
 and only after to run this specific prompt query:
 
 ```bash
-python client/cli.py ocr_upload --file examples/example-mri.pdf --ocr_cache --prompt_file=examples/example-mri-remove-pii.txt
+python client/cli.py ocr_upload --file examples/example-mri.pdf --ocr_cache --prompt_file=examples/example-mri-remove-pii.txt --language en
 ```
 
+**Note:** The language argument is used for the OCR strategy to load the model weights for the selected language. You can specify multiple languages as a list: `en,de,pl` etc.
+
 The `ocr` command can store the results using the `storage_profiles`:
   - **storage_profile**: Used to save the result - the `default` profile (`./storage_profiles/default.yaml`) is used by default; if empty file is not saved
   - **storage_filename**: Outputting filename - relative path of the `root_path` set in the storage profile - by default a relative path to `/storage` folder; can use placeholders for dynamic formatting: `{file_name}`, `{file_extension}`, `{Y}`, `{mm}`, `{dd}` - for date formatting, `{HH}`, `{MM}`, `{SS}` - for time formatting
@@ -410,37 +410,39 @@ apiClient.uploadFile(formData).then(response => {
 - **Method**: POST
 - **Parameters**:
   - **file**: PDF, image or Office file to be processed.
-  - **strategy**: OCR strategy to use (`marker`, `llama_vision` or `tesseract`).
+  - **strategy**: OCR strategy to use (`llama_vision` or `easyocr`).
   - **ocr_cache**: Whether to cache the OCR result (true or false).
   - **prompt**: When provided, will be used for Ollama processing the OCR result
   - **model**: When provided along with the prompt - this model will be used for LLM processing
   - **storage_profile**: Used to save the result - the `default` profile (`./storage_profiles/default.yaml`) is used by default; if empty file is not saved
   - **storage_filename**: Outputting filename - relative path of the `root_path` set in the storage profile - by default a relative path to `/storage` folder; can use placeholders for dynamic formatting: `{file_name}`, `{file_extension}`, `{Y}`, `{mm}`, `{dd}` - for date formatting, `{HH}`, `{MM}`, `{SS}` - for time formatting
+  - **language**: One or many (`en` or `en,pl,de`) language codes for the OCR to load the language weights
 
 Example:
 
 ```bash
-curl -X POST -H "Content-Type: multipart/form-data" -F "file=@examples/example-mri.pdf" -F "strategy=marker" -F "ocr_cache=true" -F "prompt=" -F "model=" "http://localhost:8000/ocr/upload"
+curl -X POST -H "Content-Type: multipart/form-data" -F "file=@examples/example-mri.pdf" -F "strategy=easyocr" -F "ocr_cache=true" -F "prompt=" -F "model=" "http://localhost:8000/ocr/upload"
 ```
 
 ### OCR Endpoint via JSON request
 - **URL**: /ocr/request
 - **Method**: POST
 - **Parameters** (JSON body):
   - **file**: Base64 encoded PDF file content.
-  - **strategy**: OCR strategy to use (`marker`, `llama_vision` or `tesseract`).
+  - **strategy**: OCR strategy to use (`llama_vision` or `easyocr`).
   - **ocr_cache**: Whether to cache the OCR result (true or false).
   - **prompt**: When provided, will be used for Ollama processing the OCR result.
   - **model**: When provided along with the prompt - this model will be used for LLM processing.
   - **storage_profile**: Used to save the result - the `default` profile (`/storage_profiles/default.yaml`) is used by default; if empty file is not saved.
   - **storage_filename**: Outputting filename - relative path of the `root_path` set in the storage profile - by default a relative path to `/storage` folder; can use placeholders for dynamic formatting: `{file_name}`, `{file_extension}`, `{Y}`, `{mm}`, `{dd}` - for date formatting, `{HH}`, `{MM}`, `{SS}` - for time formatting.
+  - **language**: One or many (`en` or `en,pl,de`) language codes for the OCR to load the language weights
 
 Example:
 
 ```bash
 curl -X POST "http://localhost:8000/ocr/request" -H "Content-Type: application/json" -d '{
   "file": "<base64-encoded-file-content>",
-  "strategy": "marker",
+  "strategy": "easyocr",
   "ocr_cache": true,
   "prompt": "",
   "model": "llama3.1",
@@ -598,13 +600,7 @@ AWS_S3_BUCKET_NAME=your-bucket-name
 ```
 
 ## License
-This project is licensed under the GNU General Public License. See the [LICENSE](LICENSE) file for details.
-
-**Important note on [marker](https://github.com/VikParuchuri/marker) license***:
-
-The weights for the models are licensed `cc-by-nc-sa-4.0`, but Marker's author will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. You also must not be competitive with the [Datalab API](https://www.datalab.to/). If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options [here](https://www.datalab.to/).
-
-
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
 
 ## Contact
 In case of any questions please contact us at: info@catchthetornado.com
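
The `/ocr/request` parameters documented in the README hunk above can also be assembled from Python. The following is a minimal sketch, not code from this commit: the helper name `build_ocr_request_payload` is hypothetical, and it only builds the JSON body using the field names the README lists (`file`, `strategy`, `ocr_cache`, `prompt`, `model`, `language`). How the server validates or splits the `language` value is not shown in this diff.

```python
import base64
import json


def build_ocr_request_payload(file_bytes, strategy="easyocr", language="en",
                              ocr_cache=True, prompt="", model="llama3.1"):
    """Assemble the JSON body for POST /ocr/request (hypothetical helper)."""
    return {
        # The README states the file is sent as Base64-encoded content.
        "file": base64.b64encode(file_bytes).decode("utf-8"),
        "strategy": strategy,
        "ocr_cache": ocr_cache,
        "prompt": prompt,
        "model": model,
        # One or many codes, e.g. "en" or "en,pl,de", per the README.
        "language": language,
    }


if __name__ == "__main__":
    payload = build_ocr_request_payload(b"%PDF-1.4 demo", language="en,pl,de")
    # Print everything except the (potentially large) file field.
    print(json.dumps({k: v for k, v in payload.items() if k != "file"}, indent=2))
```

The resulting dictionary can be serialized with `json.dumps` and sent with any HTTP client; the sketch deliberately stops before the network call.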

client/cli.py

Lines changed: 12 additions & 6 deletions
@@ -6,17 +6,19 @@
 import math
 from ollama import pull
 
-def ocr_upload(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1', strategy='llama_vision', storage_profile='default', storage_filename=None):
+def ocr_upload(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1', strategy='llama_vision', storage_profile='default', storage_filename=None, language='en'):
     ocr_url = os.getenv('OCR_UPLOAD_URL', 'http://localhost:8000/ocr/upload')
     files = {'file': open(file_path, 'rb')}
     if not ocr_cache:
         print("OCR cache disabled.")
 
-    data = {'ocr_cache': ocr_cache, 'model': model, 'strategy': strategy, 'storage_profile': storage_profile}
+    data = {'ocr_cache': ocr_cache, 'model': model, 'strategy': strategy, 'storage_profile': storage_profile, 'language': language}
 
     if storage_filename:
         data['storage_filename'] = storage_filename
 
+    print(data) # @todo change to log debug in the future
+
     try:
         if prompt_file:
             prompt = open(prompt_file, 'r').read()
@@ -42,7 +44,7 @@ def ocr_upload(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1',
         print(f"Failed to upload file: {response.text}")
         return None
 
-def ocr_request(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1', strategy='llama_vision', storage_profile='default', storage_filename=None):
+def ocr_request(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1', strategy='llama_vision', storage_profile='default', storage_filename=None, language='en'):
     ocr_url = os.getenv('OCR_REQUEST_URL', 'http://localhost:8000/ocr/request')
     with open(file_path, 'rb') as f:
         file_content = base64.b64encode(f.read()).decode('utf-8')
@@ -52,7 +54,8 @@ def ocr_request(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1'
         'model': model,
         'strategy': strategy,
         'storage_profile': storage_profile,
-        'file': file_content
+        'file': file_content,
+        'language': language
     }
 
     if storage_filename:
@@ -175,6 +178,7 @@ def main():
     ocr_parser.add_argument('--print_progress', default=True, action='store_true', help='Print the progress of the OCR task')
     ocr_parser.add_argument('--storage_profile', type=str, default='default', help='Storage profile to use for the file')
     ocr_parser.add_argument('--storage_filename', type=str, default=None, help='Storage filename to use for the file. You may use some formatting - see the docs')
+    ocr_parser.add_argument('--language', type=str, default='en', help='Language to use for the OCR task')
     #ocr_parser.add_argument('--async_mode', action='store_true', help='Enable async mode for the OCR task')
 
     # Sub-command for uploading a file via file upload - @deprecated - it's a backward compatibility gimmick
@@ -189,6 +193,7 @@ def main():
     ocr_parser.add_argument('--print_progress', default=True, action='store_true', help='Print the progress of the OCR task')
     ocr_parser.add_argument('--storage_profile', type=str, default='default', help='Storage profile to use for the file')
     ocr_parser.add_argument('--storage_filename', type=str, default=None, help='Storage filename to use for the file. You may use some formatting - see the docs')
+    ocr_parser.add_argument('--language', type=str, default='en', help='Language to use for the OCR task')
     #ocr_parser.add_argument('--async_mode', action='store_true', help='Enable async mode for the OCR task')
 
 
@@ -204,6 +209,7 @@ def main():
     ocr_request_parser.add_argument('--print_progress', default=True, action='store_true', help='Print the progress of the OCR task')
     ocr_request_parser.add_argument('--storage_profile', type=str, default='default', help='Storage profile to use. You may use some formatting - see the docs')
     ocr_request_parser.add_argument('--storage_filename', type=str, default=None, help='Storage filename to use')
+    ocr_request_parser.add_argument('--language', type=str, default='en', help='Language to use for the OCR task')
 
     # Sub-command for getting the result
     result_parser = subparsers.add_parser('result', help='Get the OCR result by specified task id.')
@@ -239,7 +245,7 @@ def main():
 
     if args.command == 'ocr' or args.command == 'ocr_upload':
         print(args)
-        result = ocr_upload(args.file, False if args.disable_ocr_cache else args.ocr_cache, args.prompt, args.prompt_file, args.model, args.strategy, args.storage_profile, args.storage_filename)
+        result = ocr_upload(args.file, False if args.disable_ocr_cache else args.ocr_cache, args.prompt, args.prompt_file, args.model, args.strategy, args.storage_profile, args.storage_filename, args.language)
         if result is None:
             print("Error uploading file.")
             return
@@ -251,7 +257,7 @@ def main():
             if text_result:
                 print(text_result)
     elif args.command == 'ocr_request':
-        result = ocr_request(args.file, False if args.disable_ocr_cache else args.ocr_cache, args.prompt, args.prompt_file, args.model, args.strategy, args.storage_profile, args.storage_filename)
+        result = ocr_request(args.file, False if args.disable_ocr_cache else args.ocr_cache, args.prompt, args.prompt_file, args.model, args.strategy, args.storage_profile, args.storage_filename, args.language)
         if result is None:
             print("Error uploading file.")
             return
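
In the hunks above, `--language` is declared with `type=str` and forwarded unchanged, so a value such as `en,de,pl` stays a single string on the client side. Where that string is split into individual codes is not visible in this diff. As an illustration only, a client could do the split itself with an argparse `type` callable; `parse_languages` and `build_parser` below are hypothetical names, and the list-valued default is an assumption rather than the commit's behaviour.

```python
import argparse


def parse_languages(value):
    """Split a comma-separated string such as 'en,de,pl' into a list of codes."""
    codes = [code.strip() for code in value.split(",") if code.strip()]
    if not codes:
        raise argparse.ArgumentTypeError("at least one language code is required")
    return codes


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of a --language flag")
    # Flag name taken from the diff; converting to a list is an assumption.
    parser.add_argument("--language", type=parse_languages, default=["en"],
                        help="Comma-separated language codes for the OCR task")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--language", "en, de,pl"])
    print(args.language)
```

Because argparse only applies `type` to string defaults, the list default `["en"]` is passed through as-is when the flag is omitted.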

config/strategies.yaml

Lines changed: 2 additions & 4 deletions
@@ -1,7 +1,5 @@
 strategies:
   llama_vision:
     class: text_extract_api.extract.strategies.llama_vision.LlamaVisionStrategy
-  marker:
-    class: text_extract_api.extract.strategies.marker.MarkerStrategy
-  tesseract:
-    class: text_extract_api.extract.strategies.tesseract.TesseractStrategy
+  easyocr:
+    class: text_extract_api.extract.strategies.easyocr.EasyOCRStrategy
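
Each entry in `config/strategies.yaml` maps a strategy name to a dotted class path. The loader that consumes this file is not part of this diff, so the following is only a sketch of how such paths are commonly resolved with `importlib`. The function names `load_class` and `load_strategies` are hypothetical, the config is passed in as an already-parsed dictionary, and standard-library classes stand in for the real strategy classes.

```python
import importlib


def load_class(dotted_path):
    """Import 'package.module.ClassName' and return the class object."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


def load_strategies(config):
    """Map each strategy name to its class, given the parsed YAML as a dict."""
    return {name: load_class(entry["class"])
            for name, entry in config["strategies"].items()}


if __name__ == "__main__":
    # Standard-library classes stand in for the real strategy classes here.
    demo_config = {"strategies": {"ordered": {"class": "collections.OrderedDict"}}}
    print(load_strategies(demo_config))
```

With this pattern, registering a strategy such as `easyocr` needs only a new YAML entry, provided the named module is importable.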

dev.Dockerfile

Lines changed: 0 additions & 2 deletions
@@ -8,8 +8,6 @@ RUN apt-get clean && rm -rf /var/lib/apt/lists/* \
     && apt-get update --fix-missing \
     && apt-get install -y \
     libgl1-mesa-glx \
-    tesseract-ocr \
-    libtesseract-dev \
     poppler-utils \
     libmagic1 \
     libmagic-dev \

dev.gpu.Dockerfile

Lines changed: 0 additions & 2 deletions
@@ -43,8 +43,6 @@ RUN apt-get clean && rm -rf /var/lib/apt/lists/* \
     && apt-get update --fix-missing \
     && apt-get install -y \
     libgl1-mesa-glx \
-    tesseract-ocr \
-    libtesseract-dev \
     poppler-utils \
     libpoppler-cpp-dev \
     && rm -rf /var/lib/apt/lists/*

pyproject.toml

Lines changed: 1 addition & 3 deletions
@@ -12,9 +12,9 @@ readme = "README.md"
 requires-python = ">=3.8"
 dependencies = [
     "fastapi",
+    "easyocr",
     "celery",
     "redis",
-    "pytesseract",
     "opencv-python-headless",
     "pdf2image",
     "ollama",
@@ -27,8 +27,6 @@ dependencies = [
     "google-auth-httplib2",
     "google-auth-oauthlib",
     "transformers",
-    "surya-ocr==0.4.14",
-    "marker-pdf==0.2.6",
     "boto3",
     "Pillow",
     "python-magic==0.4.27",

run.sh

Lines changed: 0 additions & 3 deletions
@@ -52,9 +52,6 @@ echo "Starting Redis"
 echo "Your ENV settings loaded from .env.localhost file: "
 printenv
 
-echo "Downloading models"
-python -c 'from marker.models import load_all_models; load_all_models()'
-
 CELERY_BIN="$(pwd)/.venv/bin/celery"
 CELERY_PID=$(pgrep -f "$CELERY_BIN")
 REDIS_PORT=6379 # will move it to .envs in near future
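
The removed lines pre-downloaded Marker's models at start-up, and this diff adds no equivalent step for EasyOCR. If a comparable warm-up were wanted, one possible shape is sketched below. It relies on the assumption that constructing `easyocr.Reader` with a list of language codes fetches any missing weights; `warm_up_reader` and its `reader_factory` parameter are hypothetical names, not part of this commit.

```python
def warm_up_reader(language_codes, reader_factory=None):
    """Construct an OCR reader once so its weights are available before first use."""
    if reader_factory is None:
        # Imported lazily so this sketch can be loaded without EasyOCR installed.
        import easyocr
        reader_factory = easyocr.Reader
    # easyocr.Reader takes a list of language codes, e.g. ['en', 'pl'].
    return reader_factory(list(language_codes))


# Example (requires the easyocr package and network access on first run):
#     warm_up_reader(["en"])
```

Passing the factory as a parameter keeps the function testable with a stub, which is useful because the real constructor may download sizeable model files.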
