@pamelafox
Collaborator

Purpose

Fixes #2817

This pull request refactors the ingestion pipeline to support a new cloud ingestion strategy, and improves modularity by reorganizing setup logic.

The ingestion strategy uses Azure Functions as Custom Web API skills in a skillset connected to a Blob Indexer.

These are the three skills, in order:

  1. Document extractor: document_extractor
  2. Figure processor: figure_processor
  3. Text processor: text_processor
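
As a rough illustration of what each function has to do as a Custom Web API skill, here is a minimal sketch of the request/response envelope that Azure AI Search documents for custom skills (a `values` array of records, each with a `recordId` and `data`). The helper `run_custom_skill` and the stand-in `count_characters` are hypothetical names for illustration only, not code from this PR.

```python
import json
from typing import Any, Callable


def run_custom_skill(
    body: dict[str, Any],
    process: Callable[[dict[str, Any]], dict[str, Any]],
) -> dict[str, Any]:
    """Apply `process` to every record in a custom skill request body."""
    results = []
    for record in body.get("values", []):
        # Each output record must echo the recordId of its input record.
        result = {"recordId": record["recordId"], "data": {}, "errors": [], "warnings": []}
        try:
            result["data"] = process(record.get("data", {}))
        except Exception as exc:
            # Report the failure on this record only, so the rest of the batch survives.
            result["errors"].append({"message": str(exc)})
        results.append(result)
    return {"values": results}


def count_characters(data: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical stand-in for a real skill such as text_processor."""
    return {"length": len(data["text"])}


request = {
    "values": [
        {"recordId": "1", "data": {"text": "hello"}},
        {"recordId": "2", "data": {}},  # missing "text" triggers a per-record error
    ]
}
response = run_custom_skill(request, count_characters)
print(json.dumps(response, indent=2))
```

Keeping the envelope handling separate from the per-record logic is one way the same processing code could be called both from a local script and from a function app.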

I refactored parts of prepdocs so that the functions can reuse them, which lets us run the same code locally and in the cloud.

Does this introduce a breaking change?

When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.

[ ] Yes
[ ] No

Does this require changes to learn.microsoft.com docs?

This repository is referenced by this tutorial
which includes deployment, settings and usage instructions. If text or screenshots need to change in the tutorial,
check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.

[ ] Yes
[ ] No

Type of change

[ ] Bugfix
[ ] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

Code quality checklist

See CONTRIBUTING.md for more details.

  • The current tests all pass (python -m pytest).
  • I added tests that prove my fix is effective or that my feature works
  • I ran python -m pytest --cov to verify 100% coverage of added lines
  • I ran python -m mypy to check for type errors
  • I either used the pre-commit hooks or ran ruff and black manually on my code.

@pamelafox marked this pull request as draft on November 4, 2025 07:05
@pamelafox changed the title from "Prepskills" to "WIP: Cloud ingestion strategy with prepdocs as custom skillset for Azure AI Search Blob Indexer" on Nov 4, 2025
@pamelafox marked this pull request as ready for review on November 11, 2025 21:57
@pamelafox changed the title from "WIP: Cloud ingestion strategy with prepdocs as custom skillset for Azure AI Search Blob Indexer" to "Cloud ingestion strategy with prepdocs as custom skillset for Azure AI Search Blob Indexer" on Nov 11, 2025
@github-actions

Check Country Locale in URLs

We have automatically detected added country locale to URLs in your files.
Review and remove country-specific locale from URLs to resolve this issue.

Check the file paths and associated URLs inside them.
For more details, check our Contributing Guide.

File Full Path: docs/data_ingestion.md
Issues:

| # | Link | Line Number |
| --- | --- | --- |
| 1 | https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-shaper | 166 |
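
For reference, removing the locale from a flagged URL amounts to dropping the leading `/en-us/` style path segment. The sketch below is one way to do that with the standard library; `strip_locale` is a hypothetical helper, not part of the repository or of the checking action, and it assumes the locale is always the first path segment in `xx-yy` form.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches a leading locale path segment such as /en-us or /pt-br,
# only when it is followed by another path segment.
LOCALE_SEGMENT = re.compile(r"^/[a-z]{2}-[a-z]{2}(?=/)", re.IGNORECASE)


def strip_locale(url: str) -> str:
    """Return the URL with a leading country-locale path segment removed."""
    parts = urlsplit(url)
    path = LOCALE_SEGMENT.sub("", parts.path, count=1)
    return urlunsplit(parts._replace(path=path))


flagged = "https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-shaper"
print(strip_locale(flagged))
```

URLs without a locale segment pass through unchanged, so the helper can be applied to every link in a file.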

@pamelafox requested a review from Copilot on November 11, 2025 23:02
Copilot finished reviewing on behalf of pamelafox November 11, 2025 23:03
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces a cloud-based data ingestion strategy that processes documents using Azure Functions as custom skills within an Azure AI Search indexer pipeline. The implementation refactors the existing local ingestion code into reusable modules that work both locally and in Azure Functions, following a three-stage pipeline: document extraction, figure processing, and text processing with chunking and embeddings.

Key changes include:

  • Three new Azure Functions (document_extractor, figure_processor, text_processor) implementing custom skills
  • Extracted shared setup logic into servicesetup.py for reuse across local and cloud ingestion
  • Refactored figure processing into figureprocessor.py as a standalone module
  • Created textprocessor.py to handle text/figure combination and chunking
  • Added cloudingestionstrategy.py to configure the indexer/skillset pipeline
  • Updated infrastructure to deploy function apps with proper authentication
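
To make the indexer/skillset wiring concrete, the sketch below builds a skillset definition as a plain dictionary in the JSON shape used by the Azure AI Search REST API for Custom Web API skills. Everything specific in it is an assumption for illustration: the function URIs, the input and output names, and the enrichment paths are hypothetical and are not taken from cloudingestionstrategy.py.

```python
import json

WEB_API_SKILL = "#Microsoft.Skills.Custom.WebApiSkill"
BASE_URI = "https://example.invalid/api"  # placeholder, not a real function app


def web_api_skill(name, uri, inputs, outputs, context="/document"):
    """Build one Custom Web API skill entry for a skillset definition."""
    return {
        "@odata.type": WEB_API_SKILL,
        "name": name,
        "uri": uri,
        "httpMethod": "POST",
        "context": context,
        "inputs": [{"name": n, "source": s} for n, s in inputs.items()],
        "outputs": [{"name": n, "targetName": t} for n, t in outputs.items()],
    }


# Hypothetical chaining: each skill reads what the previous one wrote
# into the enrichment tree.
skillset = {
    "name": "cloud-ingestion-skillset",
    "skills": [
        web_api_skill(
            "document_extractor",
            f"{BASE_URI}/extract",
            inputs={"blobUrl": "/document/metadata_storage_path"},
            outputs={"pages": "pages", "figures": "figures"},
        ),
        web_api_skill(
            "figure_processor",
            f"{BASE_URI}/figures",
            inputs={"figure": "/document/figures/*"},
            outputs={"description": "description"},
            context="/document/figures/*",
        ),
        web_api_skill(
            "text_processor",
            f"{BASE_URI}/text",
            inputs={"pages": "/document/pages", "figures": "/document/figures"},
            outputs={"chunks": "chunks"},
        ),
    ],
}
print(json.dumps(skillset, indent=2))
```

The order of the skills list mirrors the three stages named in the overview; in a real skillset the execution order follows from the input and output dependencies between skills.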

Reviewed Changes

Copilot reviewed 59 out of 61 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| tests/test_textprocessor.py | New tests for text/figure combination logic |
| tests/test_servicesetup.py | New tests for shared setup helpers (moved from test_prepdocs.py) |
| tests/test_function_apps.py | Comprehensive tests for all three Azure Functions |
| tests/test_pdfparser.py | Updated tests for refactored PDF parser without media description coupling |
| app/backend/prepdocslib/servicesetup.py | New shared module for service configuration |
| app/backend/prepdocslib/figureprocessor.py | New module for figure description and enrichment |
| app/backend/prepdocslib/textprocessor.py | New module for text processing pipeline |
| app/backend/prepdocslib/cloudingestionstrategy.py | New strategy implementing cloud ingestion with skillsets |
| app/functions/*/function_app.py | Three new Azure Functions implementing custom skills |
| infra/main.bicep | Infrastructure for function apps and search service identity |
| docs/data_ingestion.md | Updated documentation explaining cloud ingestion approach |

@github-actions

Check Broken URLs

We have automatically detected the following broken URLs in your files. Review and fix the paths to resolve this issue.

Check the file paths and associated broken URLs inside them.
For more details, check our Contributing Guide.

File Full Path: README.md
Issues:

| # | Link | Line Number |
| --- | --- | --- |
| 1 | https://azure.microsoft.com/pricing/details/storage/blobs/ | 93 |

@pamelafox requested a review from Copilot on November 12, 2025 07:58
Copilot finished reviewing on behalf of pamelafox November 12, 2025 07:59
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 58 out of 60 changed files in this pull request and generated 5 comments.


  "endPageNumber": 1,
  "unit": "pixel",
- "pages": [{"pageNumber": 1}],
+ "pages": [{"pageNumber": 0}],

Copilot AI Nov 12, 2025

The page number should be 1, not 0. Azure Document Intelligence uses 1-indexed page numbers in its API responses. This inconsistency could cause test failures or incorrect behavior when validating Content Understanding responses.
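
To illustrate the indexing concern, here is a small sketch that treats `pageNumber` as 1-indexed, as the comment above states, and converts it to a 0-indexed list offset only at the point of lookup. The helper `page_offset` and the sample data are hypothetical and are not taken from the PR's tests.

```python
def page_offset(page_number: int) -> int:
    """Convert a 1-indexed page number into a 0-indexed list offset."""
    if page_number < 1:
        raise ValueError(f"page numbers start at 1, got {page_number}")
    return page_number - 1


# Hypothetical fixture shaped like the snippet under review.
response = {"pages": [{"pageNumber": 1}, {"pageNumber": 2}]}
page_texts = ["first page text", "second page text"]

looked_up = [page_texts[page_offset(p["pageNumber"])] for p in response["pages"]]
print(looked_up)
```

With this convention, a fixture containing `"pageNumber": 0` fails loudly instead of silently wrapping around to the last page through Python's negative indexing.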

pamelafox and others added 5 commits November 12, 2025 08:54
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@pamelafox requested a review from Copilot on November 12, 2025 17:24
Copilot finished reviewing on behalf of pamelafox November 12, 2025 17:25
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 56 out of 58 changed files in this pull request and generated 4 comments.


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@pamelafox closed this on Nov 12, 2025
@pamelafox reopened this on Nov 12, 2025
@pamelafox merged commit 3395382 into Azure-Samples:main on Nov 12, 2025
28 checks passed

Development

Successfully merging this pull request may close these issues.

Make custom skills for AI Search for prepdocs ingestion code

2 participants