Open
Labels
bug: Something isn't working
Description
Initial Checks
- I confirm that I'm on the latest version
Description
I've run into issues parsing some PDFs from the US House. For example:
Running the example code below gives this traceback:
Traceback (most recent call last):
  File "/home/travis/problem.py", line 5, in <module>
    pdf_parser.parse(f)
  File "/home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/openparse/doc_parser.py", line 111, in parse
    nodes = self.processing_pipeline.run(nodes)
  File "/home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/openparse/processing/ingest.py", line 42, in run
    nodes = transform_func.process(sorted(nodes))
  File "/home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/openparse/processing/basic_transforms.py", line 115, in process
    combined_image = self._combine_images_in_group(image_elements)
  File "/home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/openparse/processing/basic_transforms.py", line 47, in _combine_images_in_group
    image = Image.open(io.BytesIO(image_data))
  File "/home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/PIL/Image.py", line 3536, in open
    raise UnidentifiedImageError(msg)
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x77abf494f420>
Example Code
import openparse
f = "aderholt-challenger-center-disclosure-ltr-updated.pdf"
pdf_parser = openparse.DocumentParser()
pdf_parser.parse(f)
Python, open-parse & OS Version
python_version: 3.10.1
operating_system: Linux
os_version: 6.8.0-49-generic
open-parse version: 0.7.0
install path: /home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/openparse
python version: 3.10.1 (main, May 23 2024, 14:57:20) [GCC 9.4.0]
platform: Linux-6.8.0-49-generic-x86_64-with-glibc2.35
related packages: PyMuPDF-1.24.14 pydantic-2.10.1
Since the error is raised by PIL, the installed Pillow version is also relevant: pillow==11.0.0
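The traceback shows `Image.open` rejecting the raw bytes handed to it in `_combine_images_in_group`. One way to narrow this down, assuming those bytes can be captured (for example with a breakpoint or a temporary print at that call), is to check their leading magic bytes against a few well-known image signatures. The sketch below uses only the standard library; `sniff_image_format` and `_SIGNATURES` are hypothetical names for illustration and are not part of openparse or Pillow, and the signature list is deliberately incomplete.

```python
from typing import Optional

# Leading "magic" bytes for a few common image formats.
# This table is illustrative, not exhaustive.
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"BM", "bmp"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "jpeg2000"),  # JP2 container
    (b"\xff\x4f\xff\x51", "jpeg2000"),  # raw JPEG 2000 codestream
    (b"\x97JB2\r\n\x1a\n", "jbig2"),
]


def sniff_image_format(data: bytes) -> Optional[str]:
    """Return a format name if `data` starts with a known signature, else None."""
    for magic, name in _SIGNATURES:
        if data.startswith(magic):
            return name
    return None


if __name__ == "__main__":
    # Stand-in bytes; in practice this would be the `image_data`
    # value seen at the failing Image.open call.
    sample = b"\x97JB2\r\n\x1a\n" + b"\x00" * 8
    print(sniff_image_format(sample))
    print(sample[:16].hex())  # raw prefix, useful when nothing matches
```

A result of `None`, or a format that the installed Pillow build does not decode, would be consistent with the `UnidentifiedImageError` above, though it would not by itself confirm the cause.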