I upload a PDF file to my Streamlit application like this:
import streamlit as st
uploaded_file = st.file_uploader("Upload pdf file", type="pdf")
result = analyze_general_document(uploaded_file, secrets)  # secrets: dict with AI_DOCS_BASE and AI_DOCS_KEY
I want to analyze this PDF using the Azure Document Intelligence Python package like this:
from io import BytesIO
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient
def set_client(secrets: dict):
    endpoint = secrets["AI_DOCS_BASE"]
    key = secrets["AI_DOCS_KEY"]
    document_analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))
    return document_analysis_client

def analyze_general_document(uploaded_file, secrets: dict):
    print(f"File type: {uploaded_file.type}")
    print(f"File size: {uploaded_file.size} bytes")
    client = set_client(secrets)
    # poller = client.begin_analyze_document_from_url("prebuilt-document", formUrl)
    poller = client.begin_analyze_document("prebuilt-document", document=uploaded_file)
I can successfully print the file type and file size as you can see in the terminal output:
File type: application/pdf
File size: 6928426 bytes
Opening the file with PyMuPDF also works fine.
However, the method begin_analyze_document throws the following exception:
Traceback (most recent call last):
File "C:UsersmyuserAppDataLocalminiconda3envsprojectaiLibsite-packagesstreamlitruntimescriptrunnerexec_code.py", line 88, in exec_func_with_error_handling
result = func()
^^^^^^
File "C:UsersmyuserAppDataLocalminiconda3envsprojectaiLibsite-packagesstreamlitruntimescriptrunnerscript_runner.py", line 579, in code_to_exec
exec(code, module.__dict__)
File "C:UsersmyuserDocumentsvisual-studio-codeprojectproject-ai-docswebappapp.py", line 79, in <module>
main()
File "C:UsersmyuserDocumentsvisual-studio-codeprojectproject-ai-docswebappapp.py", line 61, in main
zip_content = process_pdf(uploaded_file, secrets)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:UsersmyuserDocumentsvisual-studio-codeprojectproject-ai-docswebappapp_backend.py", line 40, in process_pdf
analyze_general_document(uploaded_file, secrets)
File "C:UsersmyuserDocumentsvisual-studio-codeprojectproject-ai-docswebappaz_document_intelligence.py", line 18, in analyze_general_document
poller = client.begin_analyze_document("prebuilt-document", document=uploaded_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:UsersmyuserAppDataLocalminiconda3envsprojectaiLibsite-packagesazurecoretracingdecorator.py", line 105, in wrapper_use_tracer
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:UsersmyuserAppDataLocalminiconda3envsprojectaiLibsite-packagesazureaiformrecognizer_document_analysis_client.py", line 129, in begin_analyze_document
return _client_op_path.begin_analyze_document( # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:UsersmyuserAppDataLocalminiconda3envsprojectaiLibsite-packagesazurecoretracingdecorator.py", line 105, in wrapper_use_tracer
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "C:UsersmyuserAppDataLocalminiconda3envsprojectaiLibsite-packagesazureaiformrecognizer_generatedv2023_07_31operations_document_models_operations.py", line 518, in begin_analyze_document
raw_result = self._analyze_document_initial( # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:UsersmyuserAppDataLocalminiconda3envsprojectaiLibsite-packagesazureaiformrecognizer_generatedv2023_07_31operations_document_models_operations.py", line 443, in _analyze_document_initial
raise HttpResponseError(response=response)
azure.core.exceptions.HttpResponseError: (InvalidRequest) Invalid request.
Code: InvalidRequest
Message: Invalid request.
Inner error: {
"code": "InvalidContent",
"message": "The file is corrupted or format is unsupported. Refer to documentation for the list of supported formats."
}
Why is the pdf considered invalid?
I also tried wrapping it in a BytesIO object like this but it didn’t work either:
def analyze_general_document(uploaded_file, secrets: dict):
    print(f"File type: {uploaded_file.type}")
    print(f"File size: {uploaded_file.size} bytes")
    # Read the file as bytes
    file_bytes = uploaded_file.read()
    client = set_client(secrets)
    # poller = client.begin_analyze_document_from_url("prebuilt-document", formUrl)
    poller = client.begin_analyze_document("prebuilt-document", document=BytesIO(file_bytes))
2 Answers
Venkatesan's answer definitely works, so I marked it as the correct answer. In the meantime I also found a way to make it work that suits my use case better, as I already use PyMuPDF in the project anyway. Basically, I read the PDF file with PyMuPDF, pass it to my Azure Document Intelligence function, and convert it to bytes using PyMuPDF's write() method. The code looks like this:

You can use the code below to analyze a PDF file uploaded through Streamlit with Azure Document Intelligence using Python.
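The snippet below is not the answerer's original code. It is a minimal sketch of one way this might be done, based on an assumption the question does not confirm: that the upload had already been read once (for example, when opening it with PyMuPDF), so a later read() returned no data. Streamlit's UploadedFile subclasses io.BytesIO, so getvalue() returns the whole buffer regardless of the stream position. The helper name read_all_bytes is hypothetical; the secrets keys are the ones from the question.

```python
from io import BytesIO


def read_all_bytes(file_like) -> bytes:
    """Return the full content of a file-like object, whatever its current position."""
    # getvalue() exists on io.BytesIO (and therefore on Streamlit's UploadedFile)
    # and ignores the current stream position.
    if hasattr(file_like, "getvalue"):
        return file_like.getvalue()
    # Fallback for other seekable streams: rewind, then read everything.
    file_like.seek(0)
    return file_like.read()


def analyze_general_document(uploaded_file, secrets: dict):
    """Send the uploaded PDF to the prebuilt-document model and return the result."""
    # Imported lazily so the helper above can be used without the Azure SDK installed.
    from azure.core.credentials import AzureKeyCredential
    from azure.ai.formrecognizer import DocumentAnalysisClient

    pdf_bytes = read_all_bytes(uploaded_file)
    if not pdf_bytes:
        raise ValueError("Upload is empty; nothing to send to the service")

    client = DocumentAnalysisClient(
        endpoint=secrets["AI_DOCS_BASE"],
        credential=AzureKeyCredential(secrets["AI_DOCS_KEY"]),
    )
    # begin_analyze_document accepts raw bytes as well as a byte stream.
    poller = client.begin_analyze_document("prebuilt-document", document=pdf_bytes)
    return poller.result()


# Demonstration of the suspected failure mode, using a plain BytesIO as a stand-in
# for the uploaded file (assumption: some earlier code consumed the stream).
stream = BytesIO(b"%PDF-1.7 dummy content")
stream.read()                        # a first consumer reads to the end
leftover = stream.read()             # a second plain read() now yields nothing
recovered = read_all_bytes(stream)   # getvalue() still returns everything
```

If leftover turns out to be empty in a comparable check inside the app, the service was most likely sent zero bytes, which would be consistent with the InvalidContent error. Calling uploaded_file.seek(0) before passing the stream on should have a similar effect.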
Reference:
azure.ai.formrecognizer.DocumentAnalysisClient class | Microsoft Learn
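For the PyMuPDF route described in the first answer, a minimal sketch might look like the following. It is not the answerer's original code. It assumes the PDF is opened from the uploaded bytes with fitz.open and serialized again before being sent to Azure; the function names pdf_document_to_bytes and analyze_with_pymupdf are hypothetical, and client is assumed to be a DocumentAnalysisClient such as the one returned by set_client in the question.

```python
def pdf_document_to_bytes(doc) -> bytes:
    """Serialize an open PyMuPDF document to bytes."""
    # tobytes() is the current PyMuPDF name for this operation;
    # older releases expose the same functionality as write().
    if hasattr(doc, "tobytes"):
        return doc.tobytes()
    return doc.write()


def analyze_with_pymupdf(uploaded_file, client):
    """Open the upload with PyMuPDF, then send the serialized PDF to Azure."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no dependency

    # getvalue() returns the whole upload regardless of the stream position.
    doc = fitz.open(stream=uploaded_file.getvalue(), filetype="pdf")
    try:
        pdf_bytes = pdf_document_to_bytes(doc)
    finally:
        doc.close()

    poller = client.begin_analyze_document("prebuilt-document", document=pdf_bytes)
    return poller.result()
```

Serializing through PyMuPDF always produces a fresh bytes object, so the result does not depend on where the original upload stream was left.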