danilojsl
Contributor

Description

This PR introduces a new Spark NLP annotator, Reader2Image, which integrates image reading into existing NLP workflows. It parses image content from structured documents (such as HTML and Markdown) and outputs it as a structured Spark DataFrame that can be used alongside VLM models.

Supports parsing image content from:

  • HTML files
  • Markdown files
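To make "parsing image content" from these two formats concrete, here is a minimal standard-library sketch of locating image references in HTML and Markdown text. It is an illustration only and is not the Reader2Image implementation; the names `ImgSrcCollector`, `html_image_sources`, `markdown_image_sources`, and `MD_IMAGE` are hypothetical, and the Markdown pattern only covers inline `![alt](target)` images.

```python
import re
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def html_image_sources(html_text):
    """Return the image sources (paths, URLs, or data URIs) found in HTML."""
    collector = ImgSrcCollector()
    collector.feed(html_text)
    collector.close()
    return collector.sources


# Inline Markdown images: ![alt](target "optional title")
MD_IMAGE = re.compile(r'!\[[^\]]*\]\(\s*([^)\s]+)(?:\s+"[^"]*")?\s*\)')


def markdown_image_sources(md_text):
    """Return the targets of inline Markdown images."""
    return MD_IMAGE.findall(md_text)
```

Each returned source still has to be resolved (file path, URL, or base64 data URI) and decoded before it can become an image annotation.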

Returns parsed image data as Spark DataFrame annotations with metadata such as:

  • File name
  • Image dimensions (height, width)
  • Number of channels
  • Mode
  • Binary image data
  • Metadata
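As a rough picture of what one such row carries, the sketch below builds a record with the fields listed above from raw PNG bytes, using only the standard library. The `ImageRecord` class, its field names, and the PIL-style mode strings are assumptions made for illustration; the annotator's actual schema and mode encoding may differ.

```python
import struct
import zlib
from dataclasses import dataclass, field

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# PNG color type -> (channel count, PIL-style mode name); the mode naming is an assumption.
COLOR_TYPES = {0: (1, "L"), 2: (3, "RGB"), 3: (1, "P"), 4: (2, "LA"), 6: (4, "RGBA")}


@dataclass
class ImageRecord:
    """Hypothetical row layout mirroring the fields listed above."""
    file_name: str
    height: int
    width: int
    n_channels: int
    mode: str
    data: bytes
    metadata: dict = field(default_factory=dict)


def png_record(file_name, data, metadata=None):
    """Read width, height, and color type from the PNG IHDR chunk."""
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    # IHDR payload starts at byte 16: width, height, bit depth, color type.
    width, height, _bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    n_channels, mode = COLOR_TYPES[color_type]
    return ImageRecord(file_name, height, width, n_channels, mode, data, dict(metadata or {}))


def minimal_png_header(width, height, color_type=2):
    """Build signature + IHDR only (no pixel data), enough for the header parser above."""
    ihdr = struct.pack(">IIBBBBB", width, height, 8, color_type, 0, 0, 0)
    crc = struct.pack(">I", zlib.crc32(b"IHDR" + ihdr))
    return PNG_SIGNATURE + struct.pack(">I", len(ihdr)) + b"IHDR" + ihdr + crc


record = png_record("page.html", minimal_png_header(4, 2), {"source": "html"})
print(record.width, record.height, record.n_channels, record.mode)
```

Keeping the raw bytes next to the decoded dimensions lets a downstream visual annotator work from the record alone, without going back to the source document.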

Motivation and Context

Many document processing pipelines include HTML or Markdown sources containing embedded images. Until now, Spark NLP lacked a native, streamlined way to read, parse, and represent these images within annotation pipelines. Reader2Image closes this gap by providing:

  • Seamless reuse of Spark NLP pipelines with image support.
  • Ability to read directly from multiple file formats.
  • Consistent output format compatible with other visual annotators.

How Has This Been Tested?

  • Unit Tests
  • Google Colab Notebooks

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@danilojsl self-assigned this on Aug 31, 2025
@danilojsl added the Feature request and new-feature (Introducing a new feature) labels and removed the Feature request label on Aug 31, 2025
@danilojsl force-pushed the feature/SPARKNLP-1261-Implement-Reader2Image-Annotator branch from cb58774 to 30e573b on September 3, 2025 22:27