I have a storage account with an Azure Blob Storage container holding multiple PDF/Word/Excel files. I would like to use Azure Document Intelligence to semantically chunk these files.
Is there a way to load the files directly from the storage container into Azure Document Intelligence using LangChain? According to the LangChain docs, it seems that the file either has to be available locally or a public URL has to be handed over.
Attempt:
# Prerequisite: An Azure AI Document Intelligence resource in one of the 3 preview regions: East US, West US2, West Europe
import os
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
file_path = "storage-path-to-file"
endpoint = os.getenv("DOCUMENTINTELLIGENCE_ENDPOINT")
key = os.getenv("DOCUMENTINTELLIGENCE_API_KEY")
loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint,
    api_key=key,
    file_path=file_path,
    api_model="prebuilt-layout",
)
documents = loader.load()
# Returns:
# Message: Invalid request.
# Inner error: {
#     "code": "InvalidManagedIdentity",
#     "message": "The managed identity configuration is invalid: Managed identity is not enabled for the current resource."
# }
2 Answers
You can load the files directly from Azure Blob Storage by passing the loader an Azure Blob URL + SAS token.
Code:
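(The answer's original code block did not survive extraction; the following is a minimal sketch assuming the loader's `url_path` parameter, with placeholder blob URL and SAS token values.)

```python
import os

# Placeholder values -- substitute your own blob URL and portal-generated SAS token.
blob_url = "https://<account>.blob.core.windows.net/<container>/<file>.pdf"
sas_token = "<sas-token-from-portal>"
url_path = f"{blob_url}?{sas_token}"

def load_from_blob_url(url_path: str):
    # Import inside the helper so the sketch stays self-contained.
    from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

    # url_path (instead of file_path) makes Document Intelligence fetch the
    # document from the URL itself, so the file never has to exist on local disk.
    loader = AzureAIDocumentIntelligenceLoader(
        api_endpoint=os.getenv("DOCUMENTINTELLIGENCE_ENDPOINT"),
        api_key=os.getenv("DOCUMENTINTELLIGENCE_API_KEY"),
        url_path=url_path,
        api_model="prebuilt-layout",
        mode="markdown",  # one markdown document, convenient for semantic chunking
    )
    return loader.load()

# documents = load_from_blob_url(url_path)
```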
Output: [screenshot of the extracted document content; not preserved]
You can get the Azure Blob URL + SAS token from the Azure portal: Storage account -> Container -> your file -> Generate SAS token -> click "Generate SAS token and URL".
Reference: Azure AI Document Intelligence | 🦜️🔗 LangChain
It looks like you can connect directly per this notebook:
https://github.com/jbernec/rag-orchestrations/blob/main/azure-ai-document-intelligence/rag_document_extraction.ipynb
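A related option that avoids both SAS URLs and local files: download the blob into memory and hand the bytes to the loader. This is a sketch under the assumption that your langchain-community version supports the `bytes_source` parameter and that azure-storage-blob is installed; the connection string and blob names are placeholders.

```python
def load_blob_via_bytes(conn_str: str, container: str, blob: str,
                        endpoint: str, key: str):
    """Download a blob into memory and analyze it with Document Intelligence."""
    # Imports kept inside the helper so the sketch stays self-contained.
    from azure.storage.blob import BlobClient
    from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

    data = (
        BlobClient.from_connection_string(conn_str, container_name=container,
                                          blob_name=blob)
        .download_blob()
        .readall()
    )
    loader = AzureAIDocumentIntelligenceLoader(
        api_endpoint=endpoint,
        api_key=key,
        bytes_source=data,  # assumption: available in recent langchain-community
        api_model="prebuilt-layout",
    )
    return loader.load()
```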