Cloud ingestion strategy with prepdocs as custom skillset for Azure AI Search Blob Indexer #2819
…xt for traceability
Matt/update prepskills
Check Country Locale in URLs
We have automatically detected added country locale to URLs in your files. Check the file paths and associated URLs inside them.
Pull Request Overview
This PR introduces a cloud-based data ingestion strategy that processes documents using Azure Functions as custom skills within an Azure AI Search indexer pipeline. The implementation refactors the existing local ingestion code into reusable modules that work both locally and in Azure Functions, following a three-stage pipeline: document extraction, figure processing, and text processing with chunking and embeddings.
Key changes include:
- Three new Azure Functions (`document_extractor`, `figure_processor`, `text_processor`) implementing custom skills
- Extracted shared setup logic into `servicesetup.py` for reuse across local and cloud ingestion
- Refactored figure processing into `figureprocessor.py` as a standalone module
- Created `textprocessor.py` to handle text/figure combination and chunking
- Added `cloudingestionstrategy.py` to configure the indexer/skillset pipeline
- Updated infrastructure to deploy function apps with proper authentication
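To make the three-stage pipeline concrete, the sketch below builds a skillset payload in the shape the Azure AI Search REST API uses for Custom Web API skills. It is an illustration only, not the code in `cloudingestionstrategy.py`: the function URLs, the skill names, the field paths such as `/document/pages`, and the helpers `web_api_skill` and `build_skillset` are all assumptions.

```python
WEB_API_SKILL = "#Microsoft.Skills.Custom.WebApiSkill"


def web_api_skill(name, uri, context, inputs, outputs):
    """Return one Custom Web API skill definition (REST payload shape)."""
    return {
        "@odata.type": WEB_API_SKILL,
        "name": name,
        "uri": uri,  # hypothetical function endpoint
        "httpMethod": "POST",
        "context": context,
        "inputs": [{"name": n, "source": s} for n, s in inputs.items()],
        "outputs": [{"name": n, "targetName": t} for n, t in outputs.items()],
    }


def build_skillset(base_url):
    """Chain three skills; each one reads what the previous one wrote."""
    return {
        "name": "cloud-ingestion-skillset",  # hypothetical name
        "skills": [
            web_api_skill(
                "document-extractor",
                f"{base_url}/api/extract",
                "/document",
                {"blob_url": "/document/metadata_storage_path"},
                {"pages": "pages", "figures": "figures"},
            ),
            web_api_skill(
                "figure-processor",
                f"{base_url}/api/process_figure",
                "/document/figures/*",  # runs once per extracted figure
                {"figure": "/document/figures/*"},
                {"enriched_figure": "enriched_figure"},
            ),
            web_api_skill(
                "text-processor",
                f"{base_url}/api/process_text",
                "/document",
                {
                    "pages": "/document/pages",
                    "figures": "/document/figures/*/enriched_figure",
                },
                {"chunks": "chunks"},
            ),
        ],
    }


skillset = build_skillset("https://example.invalid")
print([skill["name"] for skill in skillset["skills"]])
```

The order of the `skills` array mirrors the stages for readability; in a skillset, what actually links the stages is that each skill's `source` paths point at the `targetName` values produced earlier.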
Reviewed Changes
Copilot reviewed 59 out of 61 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tests/test_textprocessor.py | New tests for text/figure combination logic |
| tests/test_servicesetup.py | New tests for shared setup helpers (moved from test_prepdocs.py) |
| tests/test_function_apps.py | Comprehensive tests for all three Azure Functions |
| tests/test_pdfparser.py | Updated tests for refactored PDF parser without media description coupling |
| app/backend/prepdocslib/servicesetup.py | New shared module for service configuration |
| app/backend/prepdocslib/figureprocessor.py | New module for figure description and enrichment |
| app/backend/prepdocslib/textprocessor.py | New module for text processing pipeline |
| app/backend/prepdocslib/cloudingestionstrategy.py | New strategy implementing cloud ingestion with skillsets |
| app/functions/*/function_app.py | Three new Azure Functions implementing custom skills |
| infra/main.bicep | Infrastructure for function apps and search service identity |
| docs/data_ingestion.md | Updated documentation explaining cloud ingestion approach |
Check Broken URLs
We have automatically detected the following broken URLs in your files. Review and fix the paths to resolve this issue. Check the file paths and associated broken URLs inside them.
Pull Request Overview
Copilot reviewed 58 out of 60 changed files in this pull request and generated 5 comments.
tests/test_mediadescriber.py (outdated)

    "endPageNumber": 1,
    "unit": "pixel",
    - "pages": [{"pageNumber": 1}],
    + "pages": [{"pageNumber": 0}],
Copilot AI, Nov 12, 2025
The page number should be 1, not 0. Azure Document Intelligence uses 1-indexed page numbers in its API responses. This inconsistency could cause test failures or incorrect behavior when validating Content Understanding responses.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull Request Overview
Copilot reviewed 56 out of 58 changed files in this pull request and generated 4 comments.
Purpose
Fixes #2817
This pull request refactors the ingestion pipeline to support a new cloud ingestion strategy, and improves modularity by reorganizing setup logic.
The ingestion strategy uses Azure Functions as Custom Web API skills in a skillset connected to a Blob Indexer.
These are the three skills in order:
- document extractor
- figure processor
- text processor
I refactored parts of prepdocs so that the functions can reuse them, which lets the same code run locally and in the cloud.
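As a rough illustration of that reuse, the sketch below wraps a plain function in the request/response envelope that Azure AI Search Custom Web API skills exchange: a `values` array whose records carry `recordId` and `data`, with `errors` and `warnings` reported per record. The names `combine_text` and `run_skill` are hypothetical and do not come from this PR; the point is only that the inner function can be called directly during local ingestion and through the envelope in a function app.

```python
from typing import Any, Callable


def combine_text(data: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical shared logic; callable directly when ingesting locally."""
    pages = data.get("pages")
    if not isinstance(pages, list):
        raise ValueError("'pages' must be a list of strings")
    return {"text": "\n".join(pages)}


def run_skill(
    body: dict[str, Any],
    process: Callable[[dict[str, Any]], dict[str, Any]],
) -> dict[str, Any]:
    """Apply `process` to each record of a custom-skill request body."""
    results = []
    for record in body.get("values", []):
        result = {
            "recordId": record["recordId"],  # must be echoed back unchanged
            "data": {},
            "errors": [],
            "warnings": [],
        }
        try:
            result["data"] = process(record.get("data", {}))
        except Exception as exc:
            # Report the failure on this record instead of failing the batch.
            result["errors"].append({"message": str(exc)})
        results.append(result)
    return {"values": results}


request = {
    "values": [
        {"recordId": "r1", "data": {"pages": ["a", "b"]}},
        {"recordId": "r2", "data": {}},
    ]
}
response = run_skill(request, combine_text)
```

In an Azure Function, the HTTP handler would parse the request JSON, pass it to something like `run_skill`, and return the result as the response body; a local script could call `combine_text` on its own without the envelope.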
Does this introduce a breaking change?
When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.
Does this require changes to learn.microsoft.com docs?
This repository is referenced by this tutorial
which includes deployment, settings and usage instructions. If text or screenshots need to change in the tutorial,
check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.
Type of change
Code quality checklist
See CONTRIBUTING.md for more details.
- The current tests all pass (`python -m pytest`).
- I ran `python -m pytest --cov` to verify 100% coverage of added lines
- I ran `python -m mypy` to check for type errors
- I ran `ruff` and `black` manually on my code.