SPARKNLP-1261 Introducing Reader2Image Annotator #14658
Draft
Description
This PR introduces a new Spark NLP annotator, Reader2Image, which integrates image reading into existing NLP workflows. It parses image content from structured documents (such as HTML and Markdown) and outputs it as a structured Spark DataFrame for use alongside VLM models.
Supports parsing image content from:
Returns parsed image data as Spark DataFrame annotations with metadata such as:
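As a rough illustration of what parsing image content from an HTML source involves, the sketch below collects image references with Python's standard-library HTML parser. It is not the Reader2Image implementation and does not use the Spark NLP API. The names `ImageRefCollector` and `find_images` and the `embedded` flag are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class ImageRefCollector(HTMLParser):
    """Hypothetical helper: records the src and alt of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are also routed here by default.
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        if not src:
            return  # skip <img> tags that carry no source
        self.images.append({
            "src": src,
            "alt": attr_map.get("alt") or "",
            # Assumption for this sketch: data URIs mark inline (base64) images.
            "embedded": src.startswith("data:"),
        })


def find_images(html):
    """Return one dict per <img> tag with a src, in document order."""
    collector = ImageRefCollector()
    collector.feed(html)
    collector.close()
    return collector.images
```

Each returned dict could then become one row of a DataFrame, with the image bytes loaded or decoded in a later step. How Reader2Image itself structures its output is defined by the PR's code, not by this sketch.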
Motivation and Context
Many document processing pipelines include HTML or Markdown sources containing embedded images. Until now, Spark NLP lacked a native, streamlined way to read, parse, and represent these images within annotation pipelines.
Reader2Image closes this gap by providing:
How Has This Been Tested?
Screenshots (if appropriate):
Types of changes
Checklist: