Conversation

pamelafox
Collaborator

@pamelafox pamelafox commented Jun 4, 2025

Purpose

As I've discussed in various issues and live streams, our current "GPT vision approach" has some drawbacks, specifically:

  • requires vector embeddings for images, which increases ingestion time and RAG answering time
  • creates images of the entire document, which is unnecessary if the document is mostly text.

The new multimodal approach:

  • extracts images (using Document Intelligence) and stores them separately in Blob storage
  • [optional] computes embeddings of extracted images
  • uses an LLM to describe the images inside the text chunk (a rough sketch of the ingestion side follows this list)
  • associates each text chunk with any nearby images
  • during the RAG flow, optionally performs a multivector search; even when it doesn't, if any images are associated with the resulting chunks, it sends them to the model.
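
For illustration, here is a minimal Python sketch of the ingestion side of this flow. The helper names (`upload_blob`, `describe_image`) and the exact data shapes are assumptions for the sketch, not the actual prepdocs API:

```python
# Rough sketch only: extract figures, store them in Blob Storage, describe them
# with a multimodal LLM, and attach nearby figures to each text chunk.
# upload_blob and describe_image are hypothetical callables, not the repo's API.
import base64
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    page: int
    images: list[dict] = field(default_factory=list)


def ingest_page(page_text: str, page_number: int, figures: list[bytes],
                upload_blob, describe_image) -> Chunk:
    """Build one chunk for a page and associate any figures found on it."""
    chunk = Chunk(text=page_text, page=page_number)
    for i, figure_bytes in enumerate(figures):
        # 1. Store the extracted figure separately in Blob Storage.
        blob_url = upload_blob(f"page{page_number}-figure{i}.png", figure_bytes)
        # 2. Ask the LLM for a short description of the figure.
        description = describe_image(base64.b64encode(figure_bytes).decode("ascii"))
        # 3. Embed the description in the text chunk so text search still finds it.
        chunk.text += f"\n<figure>{description}</figure>"
        # 4. Record the association so the RAG flow can send the image later.
        chunk.images.append({"url": blob_url, "description": description})
    return chunk
```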

Video demonstration: https://www.youtube.com/watch?v=3RujWrBmjsc

Does this introduce a breaking change?

When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.

[X] Yes - old approach will no longer be supported
[ ] No

Does this require changes to learn.microsoft.com docs?

This repository is referenced by this tutorial
which includes deployment, settings, and usage instructions. If text or screenshots need to change in the tutorial,
check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.

[ ] Yes
[X] No

Type of change

[ ] Bugfix
[X] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

Code quality checklist

See CONTRIBUTING.md for more details.

  • The current tests all pass (python -m pytest).
  • I added tests that prove my fix is effective or that my feature works
  • I ran python -m pytest --cov to verify 100% coverage of added lines
  • I ran python -m mypy to check for type errors
  • I either used the pre-commit hooks or ran ruff and black manually on my code.


github-actions bot commented Jun 4, 2025

Check Broken Paths

We have automatically detected the following broken relative paths in your files.
Review and fix the paths to resolve this issue.

Check the file paths and associated broken paths inside them.
For more details, check our Contributing Guide.

File: README.md

| # | Link | Line Number |
|---|------|-------------|
| 1 | /docs/multimodal.md | 64 |
| 2 | docs/multimodal.md | 95 |
| 3 | docs/multimodal.md | 258 |


Check Broken Paths

We have automatically detected the following broken relative paths in your files.
Review and fix the paths to resolve this issue.

Check the file paths and associated broken paths inside them.
For more details, check our Contributing Guide.

File: docs/deploy_features.md

| # | Link | Line Number |
|---|------|-------------|
| 1 | ./gpt4v.md | 138 |
| 2 | ./gpt4v.md | 145 |
| 3 | ./gpt4v.md | 262 |
| 4 | ./gpt4v.md | 350 |

File: docs/productionizing.md

| # | Link | Line Number |
|---|------|-------------|
| 1 | /docs/gpt4v.md | 109 |

@pamelafox pamelafox marked this pull request as ready for review August 22, 2025 15:55
@pamelafox pamelafox changed the title from "WIP: New approach to multimodal document ingestion" to "New approach to multimodal document ingestion" Aug 22, 2025
@pamelafox pamelafox requested a review from Copilot August 22, 2025 20:48
Copilot

This comment was marked as outdated.


Check Broken Paths

We have automatically detected the following broken relative paths in your files.
Review and fix the paths to resolve this issue.

Check the file paths and associated broken paths inside them.
For more details, check our Contributing Guide.

File: docs/textsplitter.md

| # | Link | Line Number |
|---|------|-------------|
| 1 | ../backend/prepdocslib/textsplitter.py | 3 |

@pamelafox pamelafox requested a review from Copilot August 25, 2025 20:45
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces a new multimodal document ingestion approach that aims to replace the current "GPT vision approach" with better performance characteristics. The new approach extracts images using Document Intelligence, stores them separately in Blob storage, uses LLMs to describe images within text chunks, and associates text chunks with nearby images for more efficient RAG flows.

Key Changes:

  • Implementation of new multimodal image extraction and description approach
  • Replacement of vector embeddings for images with direct blob storage
  • Integration of image-text association in the RAG pipeline
  • Updated test snapshots to reflect new data structures and API responses

Reviewed Changes

Copilot reviewed 155 out of 178 changed files in this pull request and generated no comments.

| File | Description |
|------|-------------|
| tests/snapshots/test_prepdocslib_textsplitter/test_pages_with_figures/pages_with_figures.json/split_pages_with_figures.json | New test snapshot showing figure captions and descriptions embedded within text chunks |
| Multiple test snapshot files (test_app/**/result.json) | Updated API response structures with new citation formats, image arrays, and search parameter configurations reflecting the multimodal approach |
| tests/snapshots/test_app/test_chat_stream_vision/vision_client0/result.jsonlines | New streaming response format for vision-enabled chat with multimodal content |
Comments suppressed due to low confidence (1)

tests/snapshots/test_prepdocslib_textsplitter/test_pages_with_figures/pages_with_figures.json/split_pages_with_figures.json:1

  • The word "Bitcoin" is misspelled as "Bitconin" in the figure caption description.

@pamelafox
Collaborator Author

Video showcasing the PR: https://www.youtube.com/watch?v=3RujWrBmjsc


The prompt for step 2 is currently tailored to the sample data since it starts with "You are an intelligent assistant helping analyze the Annual Financial Report of Contoso Ltd.". Modify the [chat_answer_question_vision.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_answer_question_vision.prompty) prompt to match your data.
1. **Query rewriting**: Unchanged.
2. **Search**: For this step, it also calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to the Azure AI Search to compare against the image embedding fields in the indexed documents. For each matching document, it downloads each associated image from Azure Blob Storage and converts it to a base 64 encoding.
Collaborator

Like ask - it's optional to use the vector embedding here, right?
RAG_SEARCH_IMAGE_EMBEDDINGS

Collaborator Author

Added a note below this part about optionality.
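
To illustrate the optionality being discussed, a minimal sketch of how the image vector query could be included only when a setting like `RAG_SEARCH_IMAGE_EMBEDDINGS` is on. The companion `RAG_SEARCH_TEXT_EMBEDDINGS` name, the index field names, and the `embed_*` callables are placeholders for the sketch, not the repo's actual names:

```python
# Illustrative only: assemble the vector queries for a multivector search,
# honoring text/image embedding toggles. Field names are placeholders.
import os


def build_vector_queries(question: str, embed_text, embed_image_query) -> list[dict]:
    queries = []
    if os.getenv("RAG_SEARCH_TEXT_EMBEDDINGS", "true").lower() == "true":
        # Standard text embedding of the user question, compared to the text field.
        queries.append({"vector": embed_text(question), "fields": "embedding"})
    if os.getenv("RAG_SEARCH_IMAGE_EMBEDDINGS", "true").lower() == "true":
        # Optional: an Azure AI Vision "vectorize text" embedding of the question,
        # compared against the image embedding field in the indexed documents.
        queries.append({"vector": embed_image_query(question), "fields": "images/embedding"})
    return queries
```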

The prompt for step 2 is currently tailored to the sample data since it starts with "You are an intelligent assistant helping analyze the Annual Financial Report of Contoso Ltd.". Modify the [chat_answer_question_vision.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_answer_question_vision.prompty) prompt to match your data.
1. **Query rewriting**: Unchanged.
2. **Search**: For this step, it also calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to the Azure AI Search to compare against the image embedding fields in the indexed documents. For each matching document, it downloads each associated image from Azure Blob Storage and converts it to a base 64 encoding.
3. **Answering**: When it combines the search results and user question, it includes the base 64 encoded images, and sends along both the text and images to the multimodal LLM. The model generates a response that includes citations to the images, and the UI renders the images when a citation is clicked.
Collaborator

the image source variables control this as well?

Collaborator Author

Yes, adding a line to the bottom of each: "The settings can be customized to disable calculating the image vector embeddings or to disable sending image inputs to the LLM, if desired."

This repository includes an optional feature that uses multimodal embedding models and multimodal chat completion models
to better handle documents that contain images, such as financial reports with charts and graphs.

With this feature enabled, the data ingestion process will extract images from your documents
Collaborator

bullets?

Collaborator Author

We also have a bulleted list a few paragraphs later, so I did a summary sentence here. I could use bullets in both places, or consolidate? It's somewhat an artifact of how we originally wrote the GPT-4v doc so maybe it's not flowing well.

You can customize the RAG flow approach with a few additional environment variables.

The following variables can be set to either true or false,
to control whether Azure AI Search will use text embeddings, image embeddings, or both:
Collaborator

for searching right? And we also need to update the other descriptions of what the approach does with these variables

Collaborator

should this all be default true for multimodal approach?

Collaborator Author

These are all default true currently. Let me completely redo this section to be clearer!


The following variable can be set to either true or false,
to control whether the chat completion model will use text inputs, image inputs, or both:
Collaborator

for answering* and update description of answer process saying this is optional?
should this all be default true?

Collaborator Author

Redid this whole section

1. Produce semantically coherent chunks that align with sentence boundaries.
2. Respect a maximum token count per chunk (hard limit of 500 tokens) plus a soft character length guideline (default 1,000 characters with a 20% overflow tolerance for merges / normalization). Size limit does not apply to figure blocks (chunks containing a `<figure>` may exceed the token limit; figures are never split).
3. Keep structural figure placeholders (`<figure>...</figure>`) atomic: never split internally and always attach them to preceding accumulated text if any exists.
4. Repair mid‑sentence page breaks when safe via merge or fragment shift heuristics while enforcing token + soft character budgets.
Collaborator

what is "fragment shift heuristics" in this sentence?

Collaborator Author

A fragment shift is when it moves a trailing sentence fragment from chunk N to chunk N+1, so that chunk N+1 doesn't start mid-sentence. But let me reword it.

Collaborator Author

Simplified that description, and adding this description below:

2. Trailing sentence fragment carry‑forward

If a full merge would violate limits, we do a more surgical repair: pull only the dangling sentence fragment from the end of the previous chunk and move it forward so it reunites with its continuation at the start of the next page.

Key differences from semantic overlap:

  • Carry‑forward MOVES text (no duplication except any recursive split overlap that may occur later). Semantic overlap DUPLICATES a small preview from the next chunk.
  • Carry‑forward only activates across a page boundary when a full merge is too large. Semantic overlap is routine and size‑capped.
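
A naive sketch of the carry-forward repair described above (the real splitter's sentence detection and size checks are more involved):

```python
# Sketch: move a trailing sentence fragment from chunk N to chunk N+1 so the
# next page doesn't start mid-sentence. Detection here is deliberately naive.
SENTENCE_ENDINGS = (".", "!", "?")


def carry_forward_fragment(prev_chunk: str, next_chunk: str) -> tuple[str, str]:
    prev_chunk = prev_chunk.rstrip()
    if prev_chunk.endswith(SENTENCE_ENDINGS):
        return prev_chunk, next_chunk  # Previous chunk already ends cleanly.
    # Find the last completed sentence; everything after it is the fragment.
    cut = max(prev_chunk.rfind(p) for p in SENTENCE_ENDINGS)
    if cut == -1:
        return prev_chunk, next_chunk  # No sentence boundary at all; leave as-is.
    fragment = prev_chunk[cut + 1:].strip()
    # MOVE (not copy) the fragment to the start of the next chunk.
    return prev_chunk[:cut + 1], fragment + " " + next_chunk.lstrip()
```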

* Recursive subdivision of oversized individual spans using a boundary preference order:
1. Sentence-ending punctuation near the midpoint (scan within the central third of the span).
2. If no sentence boundary is found, a word break (space / punctuation from a configured list) near the midpoint to avoid mid‑word cuts.
3. If neither boundary type is found, a symmetric 10% overlap midpoint split (duplicated region appears at the end of the first part and the start of the second) preserves continuity.
Collaborator

symmetric 10% overlap midpoint split - perhaps we just say "default to a simpler midpoint split"?

Collaborator Author

Reworded
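
A simplified sketch of the boundary preference order from the quoted excerpt, assuming plain character-based scanning rather than the splitter's actual token-aware logic:

```python
# Sketch: prefer a sentence ending within the central third of the span, then a
# word break, and only then fall back to a midpoint split with a small overlap.
def split_oversized(text: str, overlap_pct: float = 0.10) -> tuple[str, str]:
    mid = len(text) // 2
    third = len(text) // 3
    window = text[third:2 * third]
    # 1. Sentence-ending punctuation near the midpoint (central third of the span).
    for punct in (". ", "! ", "? "):
        pos = window.rfind(punct)
        if pos != -1:
            cut = third + pos + 1
            return text[:cut], text[cut:].lstrip()
    # 2. Otherwise a word break near the midpoint, to avoid mid-word cuts.
    pos = text.rfind(" ", third, 2 * third)
    if pos != -1:
        return text[:pos], text[pos + 1:]
    # 3. Fallback: midpoint split; the region around the midpoint is duplicated
    #    at the end of the first part and the start of the second.
    overlap = int(len(text) * overlap_pct)
    return text[:mid + overlap], text[mid - overlap:]
```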

3. If neither boundary type is found, a symmetric 10% overlap midpoint split (duplicated region appears at the end of the first part and the start of the second) preserves continuity.
* Figure handling is front‑loaded: figure blocks are extracted first and treated as atomic before any span splitting or recursion on plain text.
* Cross‑page merge of text when all safety checks pass (prior chunk ends mid‑sentence, next chunk starts lowercase, not a heading, no early figure) and combined size fits both token and soft char budgets; otherwise a fragment shift may move the trailing unfinished clause forward.
* A lightweight semantic overlap duplication pass (10% of max section length) that appends a trimmed prefix of the next chunk onto the end of the previous chunk (the next chunk itself is left unchanged). This is always attempted for adjacent non‑figure chunks on the same page and conditionally across a page boundary when the next chunk appears to be a direct lowercase continuation (and not a heading or figure). Figures are never overlapped/duplicated.
Collaborator

rather than calling this "overlap duplication" - perhaps "adding overlap to each chunk"?
We may want to omit the "always attempted" sentence and just state "figures are never overlapped / duplicated"

Collaborator Author

Sure, reworded
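
A simplified sketch of that overlap pass, assuming character-based budgets rather than the splitter's token-aware logic; figure chunks are skipped entirely:

```python
# Sketch: append a short prefix of the next chunk onto the end of the previous
# chunk (the next chunk itself is left unchanged); never overlap figure chunks.
def add_overlap(prev_chunk: str, next_chunk: str,
                max_section_length: int = 1000, overlap_pct: float = 0.10) -> str:
    if "<figure" in prev_chunk or "<figure" in next_chunk:
        return prev_chunk  # Figures are never overlapped/duplicated.
    budget = int(max_section_length * overlap_pct)
    prefix = next_chunk[:budget]
    # Trim back to the last word boundary so we don't duplicate half a word.
    if " " in prefix:
        prefix = prefix.rsplit(" ", 1)[0]
    return prev_chunk.rstrip() + " " + prefix
```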


Check Broken URLs

We have automatically detected the following broken URLs in your files. Review and fix the paths to resolve this issue.

Check the file paths and associated broken URLs inside them.
For more details, check our Contributing Guide.

File: README.md

| # | Link | Line Number |
|---|------|-------------|
| 1 | https://azure.microsoft.com/pricing/details/storage/blobs/ | 93 |
