Conversation

pamelafox
Collaborator

@pamelafox pamelafox commented Jun 4, 2025

Purpose

As I've discussed in various issues and live streams, our current "GPT vision approach" has some drawbacks, specifically:

  • requires vector embeddings for images, which increases ingestion time and RAG answering time
  • creates images of the entire document, which is unnecessary if the document is mostly text.

The new multimodal approach:

  • extracts images (using Document Intelligence) and stores them separately in Blob storage
  • [optional] computes embeddings of extracted images
  • uses an LLM to describe the images inside the text chunk (a rough sketch of the ingestion side follows this list)
  • associates each text chunk with any nearby images
  • during the RAG flow, optionally performs a multivector search; even when it doesn't, if any images are associated with the resulting chunks, it sends them to the model.
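
For illustration, here is a minimal Python sketch of the ingestion side of this flow. The helper names (`upload_blob`, `describe_image`) and the exact data shapes are assumptions for the sketch, not the actual prepdocs API:

```python
# Rough sketch only: extract figures, store them in Blob Storage, describe them
# with a multimodal LLM, and attach nearby figures to each text chunk.
# upload_blob and describe_image are hypothetical callables, not the repo's API.
import base64
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    page: int
    images: list[dict] = field(default_factory=list)


def ingest_page(page_text: str, page_number: int, figures: list[bytes],
                upload_blob, describe_image) -> Chunk:
    """Build one chunk for a page and associate any figures found on it."""
    chunk = Chunk(text=page_text, page=page_number)
    for i, figure_bytes in enumerate(figures):
        # 1. Store the extracted figure separately in Blob Storage.
        blob_url = upload_blob(f"page{page_number}-figure{i}.png", figure_bytes)
        # 2. Ask the LLM for a short description of the figure.
        description = describe_image(base64.b64encode(figure_bytes).decode("ascii"))
        # 3. Embed the description in the text chunk so text search still finds it.
        chunk.text += f"\n<figure>{description}</figure>"
        # 4. Record the association so the RAG flow can send the image later.
        chunk.images.append({"url": blob_url, "description": description})
    return chunk
```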

Video demonstration: https://www.youtube.com/watch?v=3RujWrBmjsc

Does this introduce a breaking change?

When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.

[X] Yes - old approach will no longer be supported
[ ] No

Does this require changes to learn.microsoft.com docs?

This repository is referenced by this tutorial
which includes deployment, settings, and usage instructions. If text or screenshots need to change in the tutorial,
check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.

[ ] Yes
[X] No

Type of change

[ ] Bugfix
[X] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

Code quality checklist

See CONTRIBUTING.md for more details.

  • The current tests all pass (python -m pytest).
  • I added tests that prove my fix is effective or that my feature works
  • I ran python -m pytest --cov to verify 100% coverage of added lines
  • I ran python -m mypy to check for type errors
  • I either used the pre-commit hooks or ran ruff and black manually on my code.


github-actions bot commented Jun 4, 2025

Check Broken Paths

We have automatically detected the following broken relative paths in your files.
Review and fix the paths to resolve this issue.

Check the file paths and associated broken paths inside them.
For more details, check our Contributing Guide.

File: README.md

| # | Link | Line Number |
|---|------|-------------|
| 1 | /docs/multimodal.md | 64 |
| 2 | docs/multimodal.md | 95 |
| 3 | docs/multimodal.md | 258 |


Check Broken Paths

We have automatically detected the following broken relative paths in your files.
Review and fix the paths to resolve this issue.

Check the file paths and associated broken paths inside them.
For more details, check our Contributing Guide.

File: docs/deploy_features.md

| # | Link | Line Number |
|---|------|-------------|
| 1 | ./gpt4v.md | 138 |
| 2 | ./gpt4v.md | 145 |
| 3 | ./gpt4v.md | 262 |
| 4 | ./gpt4v.md | 350 |

File: docs/productionizing.md

| # | Link | Line Number |
|---|------|-------------|
| 1 | /docs/gpt4v.md | 109 |

@pamelafox pamelafox marked this pull request as ready for review August 22, 2025 15:55
@pamelafox pamelafox changed the title from "WIP: New approach to multimodal document ingestion" to "New approach to multimodal document ingestion" Aug 22, 2025
@pamelafox pamelafox requested a review from Copilot August 22, 2025 20:48
Copilot

This comment was marked as outdated.


Check Broken Paths

We have automatically detected the following broken relative paths in your files.
Review and fix the paths to resolve this issue.

Check the file paths and associated broken paths inside them.
For more details, check our Contributing Guide.

File: docs/textsplitter.md

| # | Link | Line Number |
|---|------|-------------|
| 1 | ../backend/prepdocslib/textsplitter.py | 3 |

@pamelafox pamelafox requested a review from Copilot August 25, 2025 20:45
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces a new multimodal document ingestion approach that aims to replace the current "GPT vision approach" with better performance characteristics. The new approach extracts images using Document Intelligence, stores them separately in Blob storage, uses LLMs to describe images within text chunks, and associates text chunks with nearby images for more efficient RAG flows.

Key Changes:

  • Implementation of new multimodal image extraction and description approach
  • Replacement of vector embeddings for images with direct blob storage
  • Integration of image-text association in the RAG pipeline
  • Updated test snapshots to reflect new data structures and API responses

Reviewed Changes

Copilot reviewed 155 out of 178 changed files in this pull request and generated no comments.

| File | Description |
|------|-------------|
| tests/snapshots/test_prepdocslib_textsplitter/test_pages_with_figures/pages_with_figures.json/split_pages_with_figures.json | New test snapshot showing figure captions and descriptions embedded within text chunks |
| Multiple test snapshot files (test_app/**/result.json) | Updated API response structures with new citation formats, image arrays, and search parameter configurations reflecting the multimodal approach |
| tests/snapshots/test_app/test_chat_stream_vision/vision_client0/result.jsonlines | New streaming response format for vision-enabled chat with multimodal content |
Comments suppressed due to low confidence (1)

tests/snapshots/test_prepdocslib_textsplitter/test_pages_with_figures/pages_with_figures.json/split_pages_with_figures.json:1

  • The word "Bitcoin" is misspelled as "Bitconin" in the figure caption description.

@pamelafox
Collaborator Author

Video showcasing the PR: https://www.youtube.com/watch?v=3RujWrBmjsc


The prompt for step 2 is currently tailored to the sample data since it starts with "You are an intelligent assistant helping analyze the Annual Financial Report of Contoso Ltd.". Modify the [chat_answer_question_vision.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_answer_question_vision.prompty) prompt to match your data.
1. **Query rewriting**: Unchanged.
2. **Search**: For this step, it also calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to the Azure AI Search to compare against the image embedding fields in the indexed documents. For each matching document, it downloads each associated image from Azure Blob Storage and converts it to a base 64 encoding.
Collaborator

Like ask - it's optional to use the vector embedding here, right?
RAG_SEARCH_IMAGE_EMBEDDINGS

Collaborator Author

Added a note below this part about optionality.
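
To illustrate the optionality being discussed, a minimal sketch of how the image vector query could be included only when a setting like `RAG_SEARCH_IMAGE_EMBEDDINGS` is on. The companion `RAG_SEARCH_TEXT_EMBEDDINGS` name, the index field names, and the `embed_*` callables are placeholders for the sketch, not the repo's actual names:

```python
# Illustrative only: assemble the vector queries for a multivector search,
# honoring text/image embedding toggles. Field names are placeholders.
import os


def build_vector_queries(question: str, embed_text, embed_image_query) -> list[dict]:
    queries = []
    if os.getenv("RAG_SEARCH_TEXT_EMBEDDINGS", "true").lower() == "true":
        # Standard text embedding of the user question, compared to the text field.
        queries.append({"vector": embed_text(question), "fields": "embedding"})
    if os.getenv("RAG_SEARCH_IMAGE_EMBEDDINGS", "true").lower() == "true":
        # Optional: an Azure AI Vision "vectorize text" embedding of the question,
        # compared against the image embedding field in the indexed documents.
        queries.append({"vector": embed_image_query(question), "fields": "images/embedding"})
    return queries
```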

The prompt for step 2 is currently tailored to the sample data since it starts with "You are an intelligent assistant helping analyze the Annual Financial Report of Contoso Ltd.". Modify the [chat_answer_question_vision.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_answer_question_vision.prompty) prompt to match your data.
1. **Query rewriting**: Unchanged.
2. **Search**: For this step, it also calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to the Azure AI Search to compare against the image embedding fields in the indexed documents. For each matching document, it downloads each associated image from Azure Blob Storage and converts it to a base 64 encoding.
3. **Answering**: When it combines the search results and user question, it includes the base 64 encoded images, and sends along both the text and images to the multimodal LLM. The model generates a response that includes citations to the images, and the UI renders the images when a citation is clicked.
Collaborator

the image source variables control this as well?

Collaborator Author

Yes, adding a line to the bottom of each: "The settings can be customized to disable calculating the image vector embeddings or to disable sending image inputs to the LLM, if desired."

This repository includes an optional feature that uses multimodal embedding models and multimodal chat completion models
to better handle documents that contain images, such as financial reports with charts and graphs.

With this feature enabled, the data ingestion process will extract images from your documents
Collaborator

bullets?

Collaborator Author

We also have a bulleted list a few paragraphs later, so I did a summary sentence here. I could use bullets in both places, or consolidate? It's somewhat an artifact of how we originally wrote the GPT-4v doc so maybe it's not flowing well.

You can customize the RAG flow approach with a few additional environment variables.

The following variables can be set to either true or false,
to control whether Azure AI Search will use text embeddings, image embeddings, or both:
Collaborator

for searching right? And we also need to update the other descriptions of what the approach does with these variables

Collaborator

should this all be default true for multimodal approach?

Collaborator Author

These are all default true currently. Let me completely redo this section to be clearer!


The following variable can be set to either true or false,
to control whether the chat completion model will use text inputs, image inputs, or both:
Collaborator

for answering* and update description of answer process saying this is optional?
should this all be default true?

Collaborator Author

Redid this whole section

1. Produce semantically coherent chunks that align with sentence boundaries.
2. Respect a maximum token count per chunk (hard limit of 500 tokens) plus a soft character length guideline (default 1,000 characters with a 20% overflow tolerance for merges / normalization). Size limit does not apply to figure blocks (chunks containing a `<figure>` may exceed the token limit; figures are never split).
3. Keep structural figure placeholders (`<figure>...</figure>`) atomic: never split internally and always attach them to preceding accumulated text if any exists.
4. Repair mid‑sentence page breaks when safe via merge or fragment shift heuristics while enforcing token + soft character budgets.
Collaborator

what is "fragment shift heuristics" in this sentence?

Collaborator Author

A fragment shift is when it moves a trailing sentence fragment from chunk N to chunk N+1, so that chunk N+1 doesn't start mid-sentence. But let me reword it.

Collaborator Author

Simplified that description, and adding this description below:

2. Trailing sentence fragment carry‑forward

If a full merge would violate limits, we do a more surgical repair: pull only the dangling sentence fragment from the end of the previous chunk and move it forward so it reunites with its continuation at the start of the next page.

Key differences from semantic overlap:

  • Carry‑forward MOVES text (no duplication except any recursive split overlap that may occur later). Semantic overlap DUPLICATES a small preview from the next chunk.
  • Carry‑forward only activates across a page boundary when a full merge is too large. Semantic overlap is routine and size‑capped.
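
A naive sketch of the carry-forward repair described above (the real splitter's sentence detection and size checks are more involved):

```python
# Sketch: move a trailing sentence fragment from chunk N to chunk N+1 so the
# next page doesn't start mid-sentence. Detection here is deliberately naive.
SENTENCE_ENDINGS = (".", "!", "?")


def carry_forward_fragment(prev_chunk: str, next_chunk: str) -> tuple[str, str]:
    prev_chunk = prev_chunk.rstrip()
    if prev_chunk.endswith(SENTENCE_ENDINGS):
        return prev_chunk, next_chunk  # Previous chunk already ends cleanly.
    # Find the last completed sentence; everything after it is the fragment.
    cut = max(prev_chunk.rfind(p) for p in SENTENCE_ENDINGS)
    if cut == -1:
        return prev_chunk, next_chunk  # No sentence boundary at all; leave as-is.
    fragment = prev_chunk[cut + 1:].strip()
    # MOVE (not copy) the fragment to the start of the next chunk.
    return prev_chunk[:cut + 1], fragment + " " + next_chunk.lstrip()
```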

* Recursive subdivision of oversized individual spans using a boundary preference order:
1. Sentence-ending punctuation near the midpoint (scan within the central third of the span).
2. If no sentence boundary is found, a word break (space / punctuation from a configured list) near the midpoint to avoid mid‑word cuts.
3. If neither boundary type is found, a symmetric 10% overlap midpoint split (duplicated region appears at the end of the first part and the start of the second) preserves continuity.
Collaborator

symmetric 10% overlap midpoint split - perhaps we just say "default to a simpler midpoint split"?

Collaborator Author

Reworded
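
A simplified sketch of the boundary preference order from the quoted excerpt, assuming plain character-based scanning rather than the splitter's actual token-aware logic:

```python
# Sketch: prefer a sentence ending within the central third of the span, then a
# word break, and only then fall back to a midpoint split with a small overlap.
def split_oversized(text: str, overlap_pct: float = 0.10) -> tuple[str, str]:
    mid = len(text) // 2
    third = len(text) // 3
    window = text[third:2 * third]
    # 1. Sentence-ending punctuation near the midpoint (central third of the span).
    for punct in (". ", "! ", "? "):
        pos = window.rfind(punct)
        if pos != -1:
            cut = third + pos + 1
            return text[:cut], text[cut:].lstrip()
    # 2. Otherwise a word break near the midpoint, to avoid mid-word cuts.
    pos = text.rfind(" ", third, 2 * third)
    if pos != -1:
        return text[:pos], text[pos + 1:]
    # 3. Fallback: midpoint split; the region around the midpoint is duplicated
    #    at the end of the first part and the start of the second.
    overlap = int(len(text) * overlap_pct)
    return text[:mid + overlap], text[mid - overlap:]
```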

3. If neither boundary type is found, a symmetric 10% overlap midpoint split (duplicated region appears at the end of the first part and the start of the second) preserves continuity.
* Figure handling is front‑loaded: figure blocks are extracted first and treated as atomic before any span splitting or recursion on plain text.
* Cross‑page merge of text when all safety checks pass (prior chunk ends mid‑sentence, next chunk starts lowercase, not a heading, no early figure) and combined size fits both token and soft char budgets; otherwise a fragment shift may move the trailing unfinished clause forward.
* A lightweight semantic overlap duplication pass (10% of max section length) that appends a trimmed prefix of the next chunk onto the end of the previous chunk (the next chunk itself is left unchanged). This is always attempted for adjacent non‑figure chunks on the same page and conditionally across a page boundary when the next chunk appears to be a direct lowercase continuation (and not a heading or figure). Figures are never overlapped/duplicated.
Collaborator

rather than calling this "overlap duplication" - perhaps "adding overlap to each chunk"?
We may want to omit the "always attempted" sentence and just state "figures are never overlapped / duplicated"

Collaborator Author

Sure, reworded
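
A simplified sketch of that overlap pass, assuming character-based budgets rather than the splitter's token-aware logic; figure chunks are skipped entirely:

```python
# Sketch: append a short prefix of the next chunk onto the end of the previous
# chunk (the next chunk itself is left unchanged); never overlap figure chunks.
def add_overlap(prev_chunk: str, next_chunk: str,
                max_section_length: int = 1000, overlap_pct: float = 0.10) -> str:
    if "<figure" in prev_chunk or "<figure" in next_chunk:
        return prev_chunk  # Figures are never overlapped/duplicated.
    budget = int(max_section_length * overlap_pct)
    prefix = next_chunk[:budget]
    # Trim back to the last word boundary so we don't duplicate half a word.
    if " " in prefix:
        prefix = prefix.rsplit(" ", 1)[0]
    return prev_chunk.rstrip() + " " + prefix
```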


Check Broken URLs

We have automatically detected the following broken URLs in your files. Review and fix the paths to resolve this issue.

Check the file paths and associated broken URLs inside them.
For more details, check our Contributing Guide.

File: README.md

| # | Link | Line Number |
|---|------|-------------|
| 1 | https://azure.microsoft.com/pricing/details/storage/blobs/ | 93 |
