Merged
Changes from all 33 commits:
9b199b8  Convert prepdocs to skills (pamelafox, Nov 3, 2025)
b971ca7  More Bicep to get funcs deployed with auth (pamelafox, Nov 3, 2025)
df0c17a  chore(functions): add missing prepdocslib dependencies to function re… (pamelafox, Nov 4, 2025)
e805ee3  build(functions): vendor dependencies into .python_packages for flex … (pamelafox, Nov 4, 2025)
253cb7e  chore(functions): copy backend requirements as requirements.backend.t… (pamelafox, Nov 4, 2025)
d66a620  chore(functions): overwrite function requirements with backend pins (… (pamelafox, Nov 4, 2025)
0d7e8a9  chore(functions): remove requirements backup; always overwrite with b… (pamelafox, Nov 4, 2025)
12d71d5  Get function apps deployed (pamelafox, Nov 4, 2025)
9ac595f  Updates to function auth (pamelafox, Nov 5, 2025)
d8dd729  latest changes to get auth working (pamelafox, Nov 5, 2025)
e906fb5  Fix tests (pamelafox, Nov 7, 2025)
f7638d4  always upload local files (mattgotteiner, Nov 8, 2025)
ba1a997  update to storageMetadata extraction (mattgotteiner, Nov 8, 2025)
57b53fd  Merge pull request #7 from mattgotteiner/matt/update-prepskills (pamelafox, Nov 9, 2025)
628609a  Got it working (pamelafox, Nov 10, 2025)
7bec324  Working more on the docs (pamelafox, Nov 10, 2025)
6dee74a  Merge in latest (pamelafox, Nov 10, 2025)
267ff51  Update (pamelafox, Nov 11, 2025)
8df151f  Push latest for review (pamelafox, Nov 11, 2025)
be98004  Consolidate docs (pamelafox, Nov 11, 2025)
b733d20  Clean up vectorization docs and refs (pamelafox, Nov 11, 2025)
1db5f14  More code cleanup (pamelafox, Nov 11, 2025)
6d4e490  Address Copilot feedback on tests (pamelafox, Nov 11, 2025)
9fcaa55  More code cleanups (pamelafox, Nov 12, 2025)
46bbaf7  Cleanup function test (pamelafox, Nov 12, 2025)
2d7b453  100% diff coverage (pamelafox, Nov 12, 2025)
c5116c8  Update app/functions/document_extractor/function_app.py (pamelafox, Nov 12, 2025)
cfa762c  Update app/backend/prepdocslib/page.py (pamelafox, Nov 12, 2025)
7c25851  Update app/functions/document_extractor/function_app.py (pamelafox, Nov 12, 2025)
e9f13f5  Address feedback and tweak docs (pamelafox, Nov 12, 2025)
db9dc7e  Merge branch 'prepskills' of https://github.com/pamelafox/azure-searc… (pamelafox, Nov 12, 2025)
b96f9c1  Apply suggestions from code review (pamelafox, Nov 12, 2025)
0211250  Adding diagram (pamelafox, Nov 12, 2025)
2 changes: 2 additions & 0 deletions .gitignore
@@ -148,6 +148,8 @@ npm-debug.log*
node_modules
static/

app/functions/*/prepdocslib/

data/**/*.md5

.DS_Store
59 changes: 58 additions & 1 deletion AGENTS.md
@@ -17,7 +17,30 @@ If necessary, edit this file to ensure it accurately reflects the current state
* app/backend/approaches/prompts/chat_query_rewrite.prompty: Prompt used to rewrite the query based on search history into a better search query
* app/backend/approaches/prompts/chat_query_rewrite_tools.json: Tools used by the query rewriting prompt
* app/backend/approaches/prompts/chat_answer_question.prompty: Prompt used by the Chat approach to actually answer the question based on sources
* app/backend/prepdocslib: Contains the document ingestion library used by both local and cloud ingestion
* app/backend/prepdocslib/blobmanager.py: Manages uploads to Azure Blob Storage
* app/backend/prepdocslib/cloudingestionstrategy.py: Builds the Azure AI Search indexer and skillset for the cloud ingestion pipeline
* app/backend/prepdocslib/csvparser.py: Parses CSV files
* app/backend/prepdocslib/embeddings.py: Generates embeddings for text and images using Azure OpenAI
* app/backend/prepdocslib/figureprocessor.py: Generates figure descriptions for both local ingestion and the cloud figure-processor skill
* app/backend/prepdocslib/fileprocessor.py: Orchestrates parsing and chunking of individual files
* app/backend/prepdocslib/filestrategy.py: Strategy for uploading and indexing files (local ingestion)
* app/backend/prepdocslib/htmlparser.py: Parses HTML files
* app/backend/prepdocslib/integratedvectorizerstrategy.py: Strategy using Azure AI Search integrated vectorization
* app/backend/prepdocslib/jsonparser.py: Parses JSON files
* app/backend/prepdocslib/listfilestrategy.py: Lists files from local filesystem or Azure Data Lake
* app/backend/prepdocslib/mediadescriber.py: Interfaces for describing images (Azure OpenAI GPT-4o, Content Understanding)
* app/backend/prepdocslib/page.py: Data classes for pages, images, and chunks
* app/backend/prepdocslib/parser.py: Base parser interface
* app/backend/prepdocslib/pdfparser.py: Parses PDFs using Azure Document Intelligence or local parser
* app/backend/prepdocslib/searchmanager.py: Manages Azure AI Search index creation and updates
* app/backend/prepdocslib/servicesetup.py: Shared service setup helpers for OpenAI, embeddings, blob storage, etc.
* app/backend/prepdocslib/strategy.py: Base strategy interface for document ingestion
* app/backend/prepdocslib/textparser.py: Parses plain text and markdown files
* app/backend/prepdocslib/textprocessor.py: Processes text chunks for cloud ingestion (merges figures, generates embeddings)
* app/backend/prepdocslib/textsplitter.py: Splits text into chunks using different strategies
* app/backend/app.py: The main entry point for the backend application.
* app/functions: Azure Functions used for cloud ingestion custom skills (document extraction, figure processing, text processing). Each function bundles a synchronized copy of `prepdocslib`; run `python scripts/copy_prepdocslib.py` to refresh the local copies if you modify the library (see the sketch after this file list).
* app/frontend: Contains the React frontend code, built with TypeScript and Vite.
* app/frontend/src/api: Contains the API client code for communicating with the backend.
* app/frontend/src/components: Contains the React components for the frontend.
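The actual `scripts/copy_prepdocslib.py` is not part of this diff. The following is only a rough sketch of what such a sync step could look like, assuming the function app directories are named `document_extractor`, `figure_processor`, and `text_processor` (the first appears in the commit list above; the other two are inferred from the `azd deploy` service names and may differ):

```python
# Hypothetical sketch of a prepdocslib sync step; the real scripts/copy_prepdocslib.py
# may differ. Paths and directory names are assumptions based on the layout above.
import shutil
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parent.parent  # assumes this file lives in scripts/
SOURCE = REPO_ROOT / "app" / "backend" / "prepdocslib"
FUNCTION_APPS = ["document_extractor", "figure_processor", "text_processor"]


def copy_prepdocslib() -> None:
    """Refresh each function app's bundled copy of prepdocslib from the backend source."""
    for app_name in FUNCTION_APPS:
        target = REPO_ROOT / "app" / "functions" / app_name / "prepdocslib"
        if target.exists():
            shutil.rmtree(target)  # drop the stale vendored copy before re-copying
        shutil.copytree(SOURCE, target)
        print(f"Copied {SOURCE} -> {target}")


if __name__ == "__main__":
    copy_prepdocslib()
```

Treating the vendored copies as regenerated artifacts is consistent with the new `.gitignore` entry `app/functions/*/prepdocslib/`, which keeps them out of version control.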
@@ -65,7 +88,7 @@ When adding a new developer setting, update:
* app/backend/approaches/retrievethenread.py : Retrieve from overrides parameter
* app/backend/app.py: Some settings may need to be sent down in the /config route.

## When adding tests for a new feature:
## When adding tests for a new feature

All tests are in the `tests` folder and use the pytest framework.
There are three styles of tests:
@@ -124,3 +147,37 @@ cd scripts && mypy . --config-file=../pyproject.toml

Note that we do not currently enforce type hints in the tests folder, as it would require adding a lot of `# type: ignore` comments to the existing tests.
We only enforce type hints in the main application code and scripts.

## Python code style

Do not prefix "private" methods or variables with a single underscore in Python code. We do not follow that convention in this codebase, since this is an application, not a library.
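
A hypothetical snippet (not taken from the codebase) illustrating the convention:

```python
# Hypothetical example: prefer a plain name for an internal helper.
def parse_page_range(value: str) -> tuple[int, int]:
    start, _, end = value.partition("-")
    return int(start), int(end)

# Avoid: naming it _parse_page_range; the underscore-prefix convention is not used here.
```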

## Deploying the application

To deploy the application, use the `azd` CLI tool. Make sure you have the latest version of the `azd` CLI installed. Then, run the following command from the root of the repository:

```shell
azd up
```

That command will BOTH provision the Azure resources AND deploy the application code.

If you only changed the Bicep templates and want to re-provision the Azure resources, run:

```shell
azd provision
```

If you only changed the application code and want to re-deploy the code, run:

```shell
azd deploy
```

If you are using cloud ingestion and only want to deploy individual functions, run the necessary deploy commands, for example:

```shell
azd deploy document-extractor
azd deploy figure-processor
azd deploy text-processor
```
2 changes: 1 addition & 1 deletion README.md
@@ -60,7 +60,7 @@ The repo includes sample data so it's ready to try end to end. In this sample ap
- Chat (multi-turn) and Q&A (single turn) interfaces
- Renders citations and thought process for each answer
- Includes settings directly in the UI to tweak the behavior and experiment with options
- Integrates Azure AI Search for indexing and retrieval of documents, with support for [many document formats](/docs/data_ingestion.md#supported-document-formats) as well as [integrated vectorization](/docs/data_ingestion.md#overview-of-integrated-vectorization)
- Integrates Azure AI Search for indexing and retrieval of documents, with support for [many document formats](/docs/data_ingestion.md#supported-document-formats) as well as [cloud data ingestion](/docs/data_ingestion.md#cloud-data-ingestion)
- Optional usage of [multimodal models](/docs/multimodal.md) to reason over image-heavy documents
- Optional addition of [speech input/output](/docs/deploy_features.md#enabling-speech-inputoutput) for accessibility
- Optional automation of [user login and data access](/docs/login_and_acl.md) via Microsoft Entra
29 changes: 17 additions & 12 deletions app/backend/app.py
@@ -467,6 +467,7 @@ async def setup_clients():
USE_CHAT_HISTORY_BROWSER = os.getenv("USE_CHAT_HISTORY_BROWSER", "").lower() == "true"
USE_CHAT_HISTORY_COSMOS = os.getenv("USE_CHAT_HISTORY_COSMOS", "").lower() == "true"
USE_AGENTIC_RETRIEVAL = os.getenv("USE_AGENTIC_RETRIEVAL", "").lower() == "true"
USE_VECTORS = os.getenv("USE_VECTORS", "").lower() != "false"

# WEBSITE_HOSTNAME is always set by App Service, RUNNING_IN_PRODUCTION is set in main.bicep
RUNNING_ON_AZURE = os.getenv("WEBSITE_HOSTNAME") is not None or os.getenv("RUNNING_IN_PRODUCTION") is not None
@@ -582,7 +583,7 @@ async def setup_clients():
current_app.config[CONFIG_USER_BLOB_MANAGER] = user_blob_manager

# Set up ingester
file_processors = setup_file_processors(
file_processors, figure_processor = setup_file_processors(
azure_credential=azure_credential,
document_intelligence_service=os.getenv("AZURE_DOCUMENTINTELLIGENCE_SERVICE"),
local_pdf_parser=os.getenv("USE_LOCAL_PDF_PARSER", "").lower() == "true",
@@ -594,18 +595,21 @@ async def setup_clients():
openai_model=OPENAI_CHATGPT_MODEL,
openai_deployment=AZURE_OPENAI_CHATGPT_DEPLOYMENT if OPENAI_HOST == OpenAIHost.AZURE else None,
)
search_info = await setup_search_info(
search_info = setup_search_info(
search_service=AZURE_SEARCH_SERVICE, index_name=AZURE_SEARCH_INDEX, azure_credential=azure_credential
)
text_embeddings_service = setup_embeddings_service(
open_ai_client=openai_client,
openai_host=OPENAI_HOST,
emb_model_name=OPENAI_EMB_MODEL,
emb_model_dimensions=OPENAI_EMB_DIMENSIONS,
azure_openai_deployment=AZURE_OPENAI_EMB_DEPLOYMENT,
azure_openai_endpoint=azure_openai_endpoint,
disable_vectors=os.getenv("USE_VECTORS", "").lower() == "false",
)

text_embeddings_service = None
if USE_VECTORS:
text_embeddings_service = setup_embeddings_service(
open_ai_client=openai_client,
openai_host=OPENAI_HOST,
emb_model_name=OPENAI_EMB_MODEL,
emb_model_dimensions=OPENAI_EMB_DIMENSIONS,
azure_openai_deployment=AZURE_OPENAI_EMB_DEPLOYMENT,
azure_openai_endpoint=azure_openai_endpoint,
)

image_embeddings_service = setup_image_embeddings_service(
azure_credential=azure_credential,
vision_endpoint=AZURE_VISION_ENDPOINT,
@@ -618,6 +622,7 @@ async def setup_clients():
image_embeddings=image_embeddings_service,
search_field_name_embedding=AZURE_SEARCH_FIELD_NAME_EMBEDDING,
blob_manager=user_blob_manager,
figure_processor=figure_processor,
)
current_app.config[CONFIG_INGESTER] = ingester

@@ -640,7 +645,7 @@ async def setup_clients():
OPENAI_CHATGPT_MODEL not in Approach.GPT_REASONING_MODELS
or Approach.GPT_REASONING_MODELS[OPENAI_CHATGPT_MODEL].streaming
)
current_app.config[CONFIG_VECTOR_SEARCH_ENABLED] = os.getenv("USE_VECTORS", "").lower() != "false"
current_app.config[CONFIG_VECTOR_SEARCH_ENABLED] = bool(USE_VECTORS)
current_app.config[CONFIG_USER_UPLOAD_ENABLED] = bool(USE_USER_UPLOAD)
current_app.config[CONFIG_LANGUAGE_PICKER_ENABLED] = ENABLE_LANGUAGE_PICKER
current_app.config[CONFIG_SPEECH_INPUT_ENABLED] = USE_SPEECH_INPUT_BROWSER