PIL.UnidentifiedImageError #89

@thoppe

Initial Checks

  • I confirm that I'm on the latest version

Description

I've run into issues parsing some PDFs from the US House. For example:

https://aderholt.house.gov/sites/evo-subsites/aderholt.house.gov/files/evo-media-document/aderholt-challenger-center-disclosure-ltr-updated.pdf

Running the example code below produces this traceback:

Traceback (most recent call last):
  File "/home/travis/problem.py", line 5, in <module>
    pdf_parser.parse(f)
  File "/home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/openparse/doc_parser.py", line 111, in parse
    nodes = self.processing_pipeline.run(nodes)
  File "/home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/openparse/processing/ingest.py", line 42, in run
    nodes = transform_func.process(sorted(nodes))
  File "/home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/openparse/processing/basic_transforms.py", line 115, in process
    combined_image = self._combine_images_in_group(image_elements)
  File "/home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/openparse/processing/basic_transforms.py", line 47, in _combine_images_in_group
    image = Image.open(io.BytesIO(image_data))
  File "/home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/PIL/Image.py", line 3536, in open
    raise UnidentifiedImageError(msg)
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x77abf494f420>
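
The last frames of the traceback show the failure happening at `Image.open(io.BytesIO(image_data))`. As a rough illustration of how that call could be guarded, here is a minimal sketch. `try_open_image` is a hypothetical helper name, not part of openparse, and this is not necessarily how the library should fix it:

```python
import io
from typing import Optional

from PIL import Image, UnidentifiedImageError


def try_open_image(image_data: bytes) -> Optional[Image.Image]:
    """Hypothetical helper: open image bytes with Pillow, or return None.

    Returns None when Pillow cannot identify the data as any image
    format it knows, instead of letting UnidentifiedImageError propagate.
    """
    try:
        return Image.open(io.BytesIO(image_data))
    except UnidentifiedImageError:
        return None
```

A caller could then skip any element for which the helper returns `None` rather than aborting the whole parse. Whether silently skipping such images is acceptable depends on the use case.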

Example Code

import openparse

f = "aderholt-challenger-center-disclosure-ltr-updated.pdf"
pdf_parser = openparse.DocumentParser()
pdf_parser.parse(f)

Python, open-parse & OS Version

               python_version: 3.10.1
             operating_system: Linux
                   os_version: 6.8.0-49-generic
           open-parse version: 0.7.0
                 install path: /home/travis/.pyenv/versions/3.10.1/lib/python3.10/site-packages/openparse
               python version: 3.10.1 (main, May 23 2024, 14:57:20) [GCC 9.4.0]
                     platform: Linux-6.8.0-49-generic-x86_64-with-glibc2.35
             related packages: PyMuPDF-1.24.14 pydantic-2.10.1

Since the error is raised by PIL, here is the installed Pillow version as well: pillow==11.0.0
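
To help tell whether the failure depends on the Pillow version or on the embedded bytes themselves, it may be worth checking whether the image data starts with a recognizable file signature. The sketch below uses only the standard library. `sniff_image_format` is a hypothetical name, and the signature list is deliberately short. One possible (unconfirmed) cause is that the PDF stores the image as a raw stream with no container header, in which case no version of Pillow would identify it:

```python
from typing import Optional

# Well-known leading "magic bytes" for a few common image containers.
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
    (b"BM", "bmp"),
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "jpeg2000"),
]


def sniff_image_format(data: bytes) -> Optional[str]:
    """Hypothetical helper: guess the container format from leading bytes.

    Returns a short format name, or None if no known signature matches.
    """
    for magic, name in _SIGNATURES:
        if data.startswith(magic):
            return name
    return None
```

If this returns `None` for the bytes that trigger the error, the data is probably not a standalone image file, and the fix likely belongs in how the image is extracted rather than in Pillow.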
