The purpose of this repo is to set up a batch pipeline that processes PDFs into text files, leveraging Azure ML's native pipeline capabilities and Azure Form Recognizer (soon to be Azure AI Document Intelligence). This is a customized version of this repo, though this version does not split PDFs by page.
- Ensure that special characters in PDF file names, such as "+", don't cause issues during processing. The operations above do not handle this specifically.
- Large PDF files can sometimes cause out-of-memory issues during processing. Either change the compute configuration or filter out the larger files and process them independently.
- As of the current update (May 2024), azure-ai-formrecognizer is at version 3.1 and GA. Over time, however, it will give way to azure-ai-documentintelligence, which is currently at version 4.0 and in preview. This repo uses the former.
- In terms of RBAC, both the Azure ML workspace and the service principal have Contributor access to the storage account. Additionally, the workspace has Storage Blob Data Contributor access to the storage account.
- For Form Recognizer, you can enable auto-scale to avoid throttling issues.
- It is critical to understand which SDK version maps to which API version, as listed here.
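
The notes on file names and file sizes could be handled with a small pre-processing step. The following is a minimal sketch, not part of this repo: `safe_output_name`, `split_by_size`, and the `MAX_PDF_BYTES` threshold are hypothetical names and values chosen for illustration, and the right size limit depends on the compute target.

```python
import re
from pathlib import Path
from typing import Iterable, List, Tuple

# Hypothetical threshold (assumption); tune it to the memory of the compute target.
MAX_PDF_BYTES = 50 * 1024 * 1024


def safe_output_name(pdf_name: str) -> str:
    """Map a PDF file name to a .txt name that contains only [A-Za-z0-9._-]."""
    stem = Path(pdf_name).stem
    # Collapse every run of other characters (including "+") into one underscore.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", stem).strip("_")
    return f"{cleaned or 'unnamed'}.txt"


def split_by_size(
    paths: Iterable[Path], limit: int = MAX_PDF_BYTES
) -> Tuple[List[Path], List[Path]]:
    """Return (regular, oversized) lists of PDF paths based on file size in bytes."""
    regular: List[Path] = []
    oversized: List[Path] = []
    for path in paths:
        target = oversized if path.stat().st_size > limit else regular
        target.append(path)
    return regular, oversized
```

With this approach, the oversized list can be routed to a separate run on larger compute. Note that different source names can map to the same sanitized name (for example, names that differ only in a special character), so a real pipeline may need an extra uniqueness check.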