How to turn off OCR (useful if you only want metadata extraction)
Jan Schlautmann edited this page Sep 25, 2024 · 1 revision
Even if parser.from_file(x, service = 'meta') is used, Tika still extracts the content. For files that need OCR, this can take a long time.
Tika offers several ways to turn off OCR. Since tika-python talks to a Tika Server, the last of them, setting a request header, can be used:
parser.from_file(x, service = 'meta', headers = {"X-Tika-OCRskipOcr": 'true'})
This also works with service = 'all'. In that case Tika returns whatever content can be extracted without OCR.
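The call above can be wrapped in a small helper, sketched below under some assumptions. The names skip_ocr_headers and extract_without_ocr are hypothetical and not part of tika-python. The sketch assumes tika-python is installed and a Tika Server is reachable, and that the result of parser.from_file is a dict with 'metadata' and 'content' keys.

```python
def skip_ocr_headers(extra=None):
    """Build request headers that ask the Tika Server to skip OCR.

    `extra` is an optional dict of additional headers to merge in.
    (Hypothetical helper, not part of tika-python.)
    """
    headers = {"X-Tika-OCRskipOcr": "true"}
    if extra:
        headers.update(extra)
    return headers


def extract_without_ocr(path, service="meta"):
    """Parse `path` with OCR disabled and return (metadata, content).

    `service` may be 'meta' (metadata only) or 'all' (metadata plus any
    content that can be extracted without OCR).
    (Hypothetical helper, not part of tika-python.)
    """
    # Imported lazily so the header helper can be used without tika installed.
    from tika import parser

    parsed = parser.from_file(path, service=service, headers=skip_ocr_headers())
    # Assumption: the result is a dict; .get() avoids a KeyError if a key is absent.
    return parsed.get("metadata"), parsed.get("content")
```

For example, metadata, content = extract_without_ocr("scan.pdf", service="all") would return the metadata together with any text layer already present in the file, where "scan.pdf" is a placeholder path.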