I'm trying to create an Azure Search vector index in the Azure ML UI (Prompt flow) portal, but I get an error in the component "LLM – Crack and Chunk Data".
The error says:
User program failed with BaseRagServiceError: Rag system error
Part of the logs is:
input_data=/mnt/azureml/cr/j/60652b595f69/cap/data-capability/wd/INPUT_input_data
input_glob=**/*
allowed_extensions=.txt,.md,.html,.htm,.py,.pdf,.ppt,.pptx,.doc,.docx,.xls,.xlsx,.csv,.json
chunk_size=1024
chunk_overlap=0
output_chunks=/mnt/azureml/cr/j/606547e361134e058c4829792b595f69/cap/data-capability/wd/output_chunks
data_source_url=azureml://locations/XXXXX/workspaces/04XXXX0/data/vector-index-input-1734572551882/versions/1
document_path_replacement_regex=None
max_sample_files=-1
use_rcts=True
output_format=jsonl
custom_loader=None
doc_intel_connection_id=None
output_title_chunk=None
openai_api_version=None
openai_api_type=None
[2024-12-19 01:43:28] INFO azureml.rag.crack_and_chunk.crack_and_chunk - ActivityStarted, crack_and_chunk (activity.py:108)
[2024-12-19 01:43:28] INFO azureml.rag.crack_and_chunk - Processing file: What is prompt flow.pdf (crack_and_chunk.py:127)
/azureml-envs/rag-embeddings/lib/python3.9/site-packages/pypdf/_crypt_providers/_cryptography.py:32: CryptographyDeprecationWarning: ARC4 has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.ARC4 and will be removed from cryptography.hazmat.primitives.ciphers.algorithms in 48.0.0.
from cryptography.hazmat.primitives.ciphers.algorithms import AES, ARC4
[2024-12-19 01:43:31] INFO azureml.rag.azureml.rag.documents.chunking - No file_chunks to yield, continuing (chunking.py:237)
[2024-12-19 01:43:31] INFO azureml.rag.azureml.rag.documents.chunking - No file_chunks to yield, continuing (chunking.py:237)
[2024-12-19 01:43:31] INFO azureml.rag.crack_and_chunk - [DocumentChunksIterator::filter_extensions] Filtered 0 files out of 1 (crack_and_chunk.py:129)
[2024-12-19 01:43:31] INFO azureml.rag.crack_and_chunk - [DocumentChunksIterator::filter_extensions] Skipped extensions: {} (crack_and_chunk.py:130)
[2024-12-19 01:43:31] INFO azureml.rag.crack_and_chunk - [DocumentChunksIterator::filter_extensions] Kept extensions: {
".pdf": 1
} (crack_and_chunk.py:133)
[2024-12-19 01:43:31] INFO azureml.rag.azureml.rag.documents.cracking - [DocumentChunksIterator::crack_documents] Total time to load files: 0.30446887016296387
{
".txt": 0.0,
".md": 0.0,
".html": 0.0,
".htm": 0.0,
".py": 0.0,
".pdf": 1.0,
".ppt": 0.0,
".pptx": 0.0,
".doc": 0.0,
".docx": 0.0,
".xls": 0.0,
".xlsx": 0.0,
".csv": 0.0,
".json": 0.0
} (cracking.py:381)
[2024-12-19 01:43:31] INFO azureml.rag.azureml.rag.documents.cracking - [DocumentChunksIterator::crack_documents] Total time to load files: 0.30446887016296387
{
".txt": 0.0,
".md": 0.0,
".html": 0.0,
".htm": 0.0,
".py": 0.0,
".pdf": 1.0,
".ppt": 0.0,
".pptx": 0.0,
".doc": 0.0,
".docx": 0.0,
".xls": 0.0,
".xlsx": 0.0,
".csv": 0.0,
".json": 0.0
} (cracking.py:381)
[2024-12-19 01:43:31] INFO azureml.rag.azureml.rag.documents.chunking - [DocumentChunksIterator::split_documents] Total time to split 1 documents into 0 chunks: 0.9676399230957031 (chunking.py:247)
[2024-12-19 01:43:31] INFO azureml.rag.azureml.rag.documents.chunking - [DocumentChunksIterator::split_documents] Total time to split 1 documents into 0 chunks: 0.9676399230957031 (chunking.py:247)
[2024-12-19 01:43:31] INFO azureml.rag.crack_and_chunk - Processed 0 files (crack_and_chunk.py:208)
[2024-12-19 01:43:31] INFO azureml.rag.crack_and_chunk - No chunked documents found in /mnt/azureml/cr/j/606547e361134e058c4829792b595f69/cap/data-capability/wd/INPUT_input_data with glob **/* (crack_and_chunk.py:215)
[2024-12-19 01:43:31] ERROR azureml.rag.crack_and_chunk.crack_and_chunk - ServiceError: intepreted error = Rag system error, original error = No chunked documents found in /mnt/azureml/cr/j/606547e361134e058c4829792b595f69/cap/data-capability/wd/INPUT_input_data with glob **/*. (exceptions.py:124)
[2024-12-19 01:43:36] ERROR azureml.rag.crack_and_chunk.crack_and_chunk - crack_and_chunk failed with exception: Traceback (most recent call last):
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/tasks/crack_and_chunk.py", line 229, in main_wrapper
map_exceptions(main, activity_logger, args, logger, activity_logger)
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/utils/exceptions.py", line 126, in map_exceptions
raise e
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/utils/exceptions.py", line 118, in map_exceptions
return func(*func_args, **kwargs)
File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/tasks/crack_and_chunk.py", line 220, in main
raise ValueError(f"No chunked documents found in {args.input_data} with glob {args.input_glob}.")
ValueError: No chunked documents found in /mnt/azureml/cr/j/606547e361134e058c4829792b595f69/cap/data-capability/wd/INPUT_input_data with glob **/*.
(crack_and_chunk.py:231) ...................................
I tried with both Serverless and a Compute instance, and the result is the same.
It seems the chunking step is not producing anything.
My file is a one-page PDF without images, to keep things simple.
Does anyone have a suggestion? Thank you in advance!
2 Answers
The solution suggested by JayshankarGS works. I added another PDF with more pages, the job finished, and the vector index was created.
This kind of error occurs when the document has no content to chunk.
I got the same error myself.
I had two text files, New Text Document.txt and New Text Document (2).txt, both empty, and got this error.
You said you have a single-page PDF file; the likely reason is that its content is not being extracted properly.
So, try with 3-4 PDF files that have proper content, and make sure the files are not password protected.
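If it helps, here is a rough local pre-check along those lines, to run before uploading the data. It is only a sketch and not part of the Azure ML component: the function names and the folder layout are made up for illustration, and the PDF check assumes pypdf is installed locally (the traceback above shows the job environment uses pypdf, but I can't say this reproduces exactly what the component does). It flags text files that are empty and PDFs that are encrypted or yield no extractable text.

```python
from pathlib import Path

# Plain-text extensions taken from the allowed_extensions list in the log.
TEXT_EXTENSIONS = {".txt", ".md", ".html", ".htm", ".py", ".csv", ".json"}


def has_text_content(path: Path) -> bool:
    """Return True if a plain-text file has any non-whitespace characters."""
    return bool(path.read_text(encoding="utf-8", errors="ignore").strip())


def pdf_has_extractable_text(path: Path) -> bool:
    """Return True if pypdf can extract any text from the PDF.

    Assumption: pypdf is installed (pip install pypdf). Encrypted PDFs are
    conservatively reported as having no extractable text.
    """
    # Imported lazily so the plain-text checks work without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(str(path))
    if reader.is_encrypted:
        return False
    return any((page.extract_text() or "").strip() for page in reader.pages)


def find_unchunkable_files(folder: str) -> list[str]:
    """List file names under `folder` that would likely produce zero chunks."""
    problems = []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix in TEXT_EXTENSIONS and not has_text_content(path):
            problems.append(path.name)
        elif suffix == ".pdf" and not pdf_has_extractable_text(path):
            problems.append(path.name)
    return problems
```

Calling `find_unchunkable_files("my_input_folder")` (a hypothetical local copy of the input data) returns the names of the files worth replacing or fixing. A PDF that shows up here is often a scanned image or a protected file, which would match the "split 1 documents into 0 chunks" line in the log.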