Merged
Changes from all 33 commits:
9b199b8  Convert prepdocs to skills (pamelafox, Nov 3, 2025)
b971ca7  More Bicep to get funcs deployed with auth (pamelafox, Nov 3, 2025)
df0c17a  chore(functions): add missing prepdocslib dependencies to function re… (pamelafox, Nov 4, 2025)
e805ee3  build(functions): vendor dependencies into .python_packages for flex … (pamelafox, Nov 4, 2025)
253cb7e  chore(functions): copy backend requirements as requirements.backend.t… (pamelafox, Nov 4, 2025)
d66a620  chore(functions): overwrite function requirements with backend pins (… (pamelafox, Nov 4, 2025)
0d7e8a9  chore(functions): remove requirements backup; always overwrite with b… (pamelafox, Nov 4, 2025)
12d71d5  Get function apps deployed (pamelafox, Nov 4, 2025)
9ac595f  Updates to function auth (pamelafox, Nov 5, 2025)
d8dd729  latest changes to get auth working (pamelafox, Nov 5, 2025)
e906fb5  Fix tests (pamelafox, Nov 7, 2025)
f7638d4  always upload local files (mattgotteiner, Nov 8, 2025)
ba1a997  update to storageMetadata extraction (mattgotteiner, Nov 8, 2025)
57b53fd  Merge pull request #7 from mattgotteiner/matt/update-prepskills (pamelafox, Nov 9, 2025)
628609a  Got it working (pamelafox, Nov 10, 2025)
7bec324  Working more on the docs (pamelafox, Nov 10, 2025)
6dee74a  Merge in latest (pamelafox, Nov 10, 2025)
267ff51  Update (pamelafox, Nov 11, 2025)
8df151f  Push latest for review (pamelafox, Nov 11, 2025)
be98004  Consolidate docs (pamelafox, Nov 11, 2025)
b733d20  Clean up vectorization docs and refs (pamelafox, Nov 11, 2025)
1db5f14  More code cleanup (pamelafox, Nov 11, 2025)
6d4e490  Address Copilot feedback on tests (pamelafox, Nov 11, 2025)
9fcaa55  More code cleanups (pamelafox, Nov 12, 2025)
46bbaf7  Cleanup function test (pamelafox, Nov 12, 2025)
2d7b453  100% diff coverage (pamelafox, Nov 12, 2025)
c5116c8  Update app/functions/document_extractor/function_app.py (pamelafox, Nov 12, 2025)
cfa762c  Update app/backend/prepdocslib/page.py (pamelafox, Nov 12, 2025)
7c25851  Update app/functions/document_extractor/function_app.py (pamelafox, Nov 12, 2025)
e9f13f5  Address feedback and tweak docs (pamelafox, Nov 12, 2025)
db9dc7e  Merge branch 'prepskills' of https://github.com/pamelafox/azure-searc… (pamelafox, Nov 12, 2025)
b96f9c1  Apply suggestions from code review (pamelafox, Nov 12, 2025)
0211250  Adding diagram (pamelafox, Nov 12, 2025)
2 changes: 2 additions & 0 deletions .gitignore
@@ -148,6 +148,8 @@ npm-debug.log*
node_modules
static/

app/functions/*/prepdocslib/

data/**/*.md5

.DS_Store
59 changes: 58 additions & 1 deletion AGENTS.md
@@ -17,7 +17,30 @@ If necessary, edit this file to ensure it accurately reflects the current state
* app/backend/approaches/prompts/chat_query_rewrite.prompty: Prompt used to rewrite the query based on search history into a better search query
* app/backend/approaches/prompts/chat_query_rewrite_tools.json: Tools used by the query rewriting prompt
* app/backend/approaches/prompts/chat_answer_question.prompty: Prompt used by the Chat approach to actually answer the question based on sources
* app/backend/prepdocslib: Contains the document ingestion library used by both local and cloud ingestion
* app/backend/prepdocslib/blobmanager.py: Manages uploads to Azure Blob Storage
* app/backend/prepdocslib/cloudingestionstrategy.py: Builds the Azure AI Search indexer and skillset for the cloud ingestion pipeline
* app/backend/prepdocslib/csvparser.py: Parses CSV files
* app/backend/prepdocslib/embeddings.py: Generates embeddings for text and images using Azure OpenAI
* app/backend/prepdocslib/figureprocessor.py: Generates figure descriptions for both local ingestion and the cloud figure-processor skill
* app/backend/prepdocslib/fileprocessor.py: Orchestrates parsing and chunking of individual files
* app/backend/prepdocslib/filestrategy.py: Strategy for uploading and indexing files (local ingestion)
* app/backend/prepdocslib/htmlparser.py: Parses HTML files
* app/backend/prepdocslib/integratedvectorizerstrategy.py: Strategy using Azure AI Search integrated vectorization
* app/backend/prepdocslib/jsonparser.py: Parses JSON files
* app/backend/prepdocslib/listfilestrategy.py: Lists files from local filesystem or Azure Data Lake
* app/backend/prepdocslib/mediadescriber.py: Interfaces for describing images (Azure OpenAI GPT-4o, Content Understanding)
* app/backend/prepdocslib/page.py: Data classes for pages, images, and chunks
* app/backend/prepdocslib/parser.py: Base parser interface
* app/backend/prepdocslib/pdfparser.py: Parses PDFs using Azure Document Intelligence or local parser
* app/backend/prepdocslib/searchmanager.py: Manages Azure AI Search index creation and updates
* app/backend/prepdocslib/servicesetup.py: Shared service setup helpers for OpenAI, embeddings, blob storage, etc.
* app/backend/prepdocslib/strategy.py: Base strategy interface for document ingestion
* app/backend/prepdocslib/textparser.py: Parses plain text and markdown files
* app/backend/prepdocslib/textprocessor.py: Processes text chunks for cloud ingestion (merges figures, generates embeddings)
* app/backend/prepdocslib/textsplitter.py: Splits text into chunks using different strategies
* app/backend/app.py: The main entry point for the backend application.
* app/functions: Azure Functions used for cloud ingestion custom skills (document extraction, figure processing, text processing). Each function bundles a synchronized copy of `prepdocslib`; run `python scripts/copy_prepdocslib.py` to refresh the local copies if you modify the library (see the sketch after this file list).
* app/frontend: Contains the React frontend code, built with TypeScript and Vite.
* app/frontend/src/api: Contains the API client code for communicating with the backend.
* app/frontend/src/components: Contains the React components for the frontend.
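The actual `scripts/copy_prepdocslib.py` is not part of this diff. The following is only a rough sketch of what such a sync step could look like, assuming the function app directories are named `document_extractor`, `figure_processor`, and `text_processor` (the first appears in the commit list above; the other two are inferred from the `azd deploy` service names and may differ):

```python
# Hypothetical sketch of a prepdocslib sync step; the real scripts/copy_prepdocslib.py
# may differ. Paths and directory names are assumptions based on the layout above.
import shutil
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parent.parent  # assumes this file lives in scripts/
SOURCE = REPO_ROOT / "app" / "backend" / "prepdocslib"
FUNCTION_APPS = ["document_extractor", "figure_processor", "text_processor"]


def copy_prepdocslib() -> None:
    """Refresh each function app's bundled copy of prepdocslib from the backend source."""
    for app_name in FUNCTION_APPS:
        target = REPO_ROOT / "app" / "functions" / app_name / "prepdocslib"
        if target.exists():
            shutil.rmtree(target)  # drop the stale vendored copy before re-copying
        shutil.copytree(SOURCE, target)
        print(f"Copied {SOURCE} -> {target}")


if __name__ == "__main__":
    copy_prepdocslib()
```

Treating the vendored copies as regenerated artifacts is consistent with the new `.gitignore` entry `app/functions/*/prepdocslib/`, which keeps them out of version control.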
@@ -65,7 +88,7 @@ When adding a new developer setting, update:
* app/backend/approaches/retrievethenread.py : Retrieve from overrides parameter
* app/backend/app.py: Some settings may need to be sent down in the /config route.

## When adding tests for a new feature:
## When adding tests for a new feature

All tests are in the `tests` folder and use the pytest framework.
There are three styles of tests:
@@ -124,3 +147,37 @@ cd scripts && mypy . --config-file=../pyproject.toml

Note that we do not currently enforce type hints in the tests folder, as it would require adding a lot of `# type: ignore` comments to the existing tests.
We only enforce type hints in the main application code and scripts.

## Python code style

Do not prefix "private" methods or variables with a single underscore in Python code. We do not follow that convention in this codebase, since this is an application, not a library.
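
A hypothetical snippet (not taken from the codebase) illustrating the convention:

```python
# Hypothetical example: prefer a plain name for an internal helper.
def parse_page_range(value: str) -> tuple[int, int]:
    start, _, end = value.partition("-")
    return int(start), int(end)

# Avoid: naming it _parse_page_range; the underscore-prefix convention is not used here.
```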

## Deploying the application

To deploy the application, use the `azd` CLI tool. Make sure you have the latest version of the `azd` CLI installed. Then, run the following command from the root of the repository:

```shell
azd up
```

That command will BOTH provision the Azure resources AND deploy the application code.

If you only changed the Bicep templates and want to re-provision the Azure resources, run:

```shell
azd provision
```

If you only changed the application code and want to re-deploy the code, run:

```shell
azd deploy
```

If you are using cloud ingestion and only want to deploy individual functions, run the necessary deploy commands, for example:

```shell
azd deploy document-extractor
azd deploy figure-processor
azd deploy text-processor
```
2 changes: 1 addition & 1 deletion README.md
@@ -60,7 +60,7 @@ The repo includes sample data so it's ready to try end to end. In this sample ap
- Chat (multi-turn) and Q&A (single turn) interfaces
- Renders citations and thought process for each answer
- Includes settings directly in the UI to tweak the behavior and experiment with options
- Integrates Azure AI Search for indexing and retrieval of documents, with support for [many document formats](/docs/data_ingestion.md#supported-document-formats) as well as [integrated vectorization](/docs/data_ingestion.md#overview-of-integrated-vectorization)
- Integrates Azure AI Search for indexing and retrieval of documents, with support for [many document formats](/docs/data_ingestion.md#supported-document-formats) as well as [cloud data ingestion](/docs/data_ingestion.md#cloud-data-ingestion)
- Optional usage of [multimodal models](/docs/multimodal.md) to reason over image-heavy documents
- Optional addition of [speech input/output](/docs/deploy_features.md#enabling-speech-inputoutput) for accessibility
- Optional automation of [user login and data access](/docs/login_and_acl.md) via Microsoft Entra
29 changes: 17 additions & 12 deletions app/backend/app.py
@@ -467,6 +467,7 @@ async def setup_clients():
USE_CHAT_HISTORY_BROWSER = os.getenv("USE_CHAT_HISTORY_BROWSER", "").lower() == "true"
USE_CHAT_HISTORY_COSMOS = os.getenv("USE_CHAT_HISTORY_COSMOS", "").lower() == "true"
USE_AGENTIC_RETRIEVAL = os.getenv("USE_AGENTIC_RETRIEVAL", "").lower() == "true"
USE_VECTORS = os.getenv("USE_VECTORS", "").lower() != "false"

# WEBSITE_HOSTNAME is always set by App Service, RUNNING_IN_PRODUCTION is set in main.bicep
RUNNING_ON_AZURE = os.getenv("WEBSITE_HOSTNAME") is not None or os.getenv("RUNNING_IN_PRODUCTION") is not None
@@ -582,7 +583,7 @@ async def setup_clients():
current_app.config[CONFIG_USER_BLOB_MANAGER] = user_blob_manager

# Set up ingester
file_processors = setup_file_processors(
file_processors, figure_processor = setup_file_processors(
azure_credential=azure_credential,
document_intelligence_service=os.getenv("AZURE_DOCUMENTINTELLIGENCE_SERVICE"),
local_pdf_parser=os.getenv("USE_LOCAL_PDF_PARSER", "").lower() == "true",
@@ -594,18 +595,21 @@ async def setup_clients():
openai_model=OPENAI_CHATGPT_MODEL,
openai_deployment=AZURE_OPENAI_CHATGPT_DEPLOYMENT if OPENAI_HOST == OpenAIHost.AZURE else None,
)
search_info = await setup_search_info(
search_info = setup_search_info(
search_service=AZURE_SEARCH_SERVICE, index_name=AZURE_SEARCH_INDEX, azure_credential=azure_credential
)
text_embeddings_service = setup_embeddings_service(
open_ai_client=openai_client,
openai_host=OPENAI_HOST,
emb_model_name=OPENAI_EMB_MODEL,
emb_model_dimensions=OPENAI_EMB_DIMENSIONS,
azure_openai_deployment=AZURE_OPENAI_EMB_DEPLOYMENT,
azure_openai_endpoint=azure_openai_endpoint,
disable_vectors=os.getenv("USE_VECTORS", "").lower() == "false",
)

text_embeddings_service = None
if USE_VECTORS:
text_embeddings_service = setup_embeddings_service(
open_ai_client=openai_client,
openai_host=OPENAI_HOST,
emb_model_name=OPENAI_EMB_MODEL,
emb_model_dimensions=OPENAI_EMB_DIMENSIONS,
azure_openai_deployment=AZURE_OPENAI_EMB_DEPLOYMENT,
azure_openai_endpoint=azure_openai_endpoint,
)

image_embeddings_service = setup_image_embeddings_service(
azure_credential=azure_credential,
vision_endpoint=AZURE_VISION_ENDPOINT,
@@ -618,6 +622,7 @@ async def setup_clients():
image_embeddings=image_embeddings_service,
search_field_name_embedding=AZURE_SEARCH_FIELD_NAME_EMBEDDING,
blob_manager=user_blob_manager,
figure_processor=figure_processor,
)
current_app.config[CONFIG_INGESTER] = ingester

@@ -640,7 +645,7 @@ async def setup_clients():
OPENAI_CHATGPT_MODEL not in Approach.GPT_REASONING_MODELS
or Approach.GPT_REASONING_MODELS[OPENAI_CHATGPT_MODEL].streaming
)
current_app.config[CONFIG_VECTOR_SEARCH_ENABLED] = os.getenv("USE_VECTORS", "").lower() != "false"
current_app.config[CONFIG_VECTOR_SEARCH_ENABLED] = bool(USE_VECTORS)
current_app.config[CONFIG_USER_UPLOAD_ENABLED] = bool(USE_USER_UPLOAD)
current_app.config[CONFIG_LANGUAGE_PICKER_ENABLED] = ENABLE_LANGUAGE_PICKER
current_app.config[CONFIG_SPEECH_INPUT_ENABLED] = USE_SPEECH_INPUT_BROWSER