SPARKNLP-1261 Introducing Reader2Image Annotator #14658

danilojsl · 2025-08-31T19:17:37Z

Description

This PR introduces a new Spark NLP annotator, Reader2Image, which integrates image reading into existing NLP workflows. It parses image content from structured documents (such as HTML and Markdown) and outputs it as a structured Spark DataFrame that can be used alongside vision-language models (VLMs).

Supports parsing image content from:

HTML files
Markdown files
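
To illustrate the kind of parsing involved, here is a minimal sketch in plain Python that locates image references in HTML and Markdown text. It is not the Reader2Image implementation and uses no Spark NLP API. The names `ImgSrcCollector`, `html_image_sources`, and `markdown_image_sources` are hypothetical, and the Markdown pattern only covers the basic inline `![alt](url)` form.

```python
import re
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Basic inline Markdown image syntax: ![alt text](url "optional title")
MD_IMAGE = re.compile(r'!\[[^\]]*\]\(\s*([^)\s]+)')


def html_image_sources(html: str) -> list:
    """Return image sources (URLs or data URIs) in document order."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


def markdown_image_sources(markdown: str) -> list:
    """Return the target of each inline Markdown image."""
    return MD_IMAGE.findall(markdown)
```

A reader built along these lines would still need to fetch or decode each source into binary image data before it could populate an annotation.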

Returns parsed image data as Spark DataFrame annotations with fields such as:

File name
Image dimensions (height, width)
Number of channels
Mode
Binary image data
Metadata

Motivation and Context

Many document processing pipelines include HTML or Markdown sources containing embedded images. Until now, Spark NLP lacked a native, streamlined way to read, parse, and represent these images within annotation pipelines. Reader2Image closes this gap by providing:

Seamless reuse of Spark NLP pipelines with image support.
Ability to read directly from multiple file formats.
Consistent output format compatible with other visual annotators.

How Has This Been Tested?

Unit Tests
Google Colab Notebooks

Screenshots (if appropriate):

Types of changes

Bug fix (non-breaking change which fixes an issue)
Code improvements with no or little impact
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING page.
I have added tests to cover my changes.
All new and existing tests passed.

danilojsl self-assigned this Aug 31, 2025

danilojsl added the Feature request and new-feature labels and removed the Feature request label Aug 31, 2025

danilojsl added 5 commits September 3, 2025 17:27

[SPARKNLP-1261] Adding to Reader2Image annotator

b6744aa

[SPARKNLP-1261] Adding support to mix content in Reader2Image

d56bd00

[SPARKNLP-1261] Adding tests and demo notebook to Reader2Image

049eb00

[SPARKNLP-1261] Updating to right version

ea13f0a

[SPARKNLP-1261] Adding support to reading images for emails

30e573b

danilojsl force-pushed the feature/SPARKNLP-1261-Implement-Reader2Image-Annotator branch from cb58774 to 30e573b September 3, 2025 22:27
