I'm trying to create an Azure Search vector index in the Azure ML (Prompt flow) portal, but the "LLM – Crack and Chunk Data" component fails with an error.

The error says:
User program failed with BaseRagServiceError: Rag system error

Part of the log:

input_data=/mnt/azureml/cr/j/60652b595f69/cap/data-capability/wd/INPUT_input_data
input_glob=**/*
allowed_extensions=.txt,.md,.html,.htm,.py,.pdf,.ppt,.pptx,.doc,.docx,.xls,.xlsx,.csv,.json
chunk_size=1024
chunk_overlap=0
output_chunks=/mnt/azureml/cr/j/606547e361134e058c4829792b595f69/cap/data-capability/wd/output_chunks
data_source_url=azureml://locations/XXXXX/workspaces/04XXXX0/data/vector-index-input-1734572551882/versions/1
document_path_replacement_regex=None
max_sample_files=-1
use_rcts=True
output_format=jsonl
custom_loader=None
doc_intel_connection_id=None
output_title_chunk=None
openai_api_version=None
openai_api_type=None
[2024-12-19 01:43:28] INFO     azureml.rag.crack_and_chunk.crack_and_chunk - ActivityStarted, crack_and_chunk (activity.py:108)
[2024-12-19 01:43:28] INFO     azureml.rag.crack_and_chunk - Processing file: What is prompt flow.pdf (crack_and_chunk.py:127)
/azureml-envs/rag-embeddings/lib/python3.9/site-packages/pypdf/_crypt_providers/_cryptography.py:32: CryptographyDeprecationWarning: ARC4 has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.ARC4 and will be removed from cryptography.hazmat.primitives.ciphers.algorithms in 48.0.0.
  from cryptography.hazmat.primitives.ciphers.algorithms import AES, ARC4
[2024-12-19 01:43:31] INFO     azureml.rag.azureml.rag.documents.chunking - No file_chunks to yield, continuing (chunking.py:237)
[2024-12-19 01:43:31] INFO     azureml.rag.azureml.rag.documents.chunking - No file_chunks to yield, continuing (chunking.py:237)
[2024-12-19 01:43:31] INFO     azureml.rag.crack_and_chunk - [DocumentChunksIterator::filter_extensions] Filtered 0 files out of 1 (crack_and_chunk.py:129)
[2024-12-19 01:43:31] INFO     azureml.rag.crack_and_chunk - [DocumentChunksIterator::filter_extensions] Skipped extensions: {} (crack_and_chunk.py:130)
[2024-12-19 01:43:31] INFO     azureml.rag.crack_and_chunk - [DocumentChunksIterator::filter_extensions] Kept extensions: {
  ".pdf": 1
} (crack_and_chunk.py:133)
[2024-12-19 01:43:31] INFO     azureml.rag.azureml.rag.documents.cracking - [DocumentChunksIterator::crack_documents] Total time to load files: 0.30446887016296387
{
  ".txt": 0.0,
  ".md": 0.0,
  ".html": 0.0,
  ".htm": 0.0,
  ".py": 0.0,
  ".pdf": 1.0,
  ".ppt": 0.0,
  ".pptx": 0.0,
  ".doc": 0.0,
  ".docx": 0.0,
  ".xls": 0.0,
  ".xlsx": 0.0,
  ".csv": 0.0,
  ".json": 0.0
} (cracking.py:381)
[2024-12-19 01:43:31] INFO     azureml.rag.azureml.rag.documents.cracking - [DocumentChunksIterator::crack_documents] Total time to load files: 0.30446887016296387
{
  ".txt": 0.0,
  ".md": 0.0,
  ".html": 0.0,
  ".htm": 0.0,
  ".py": 0.0,
  ".pdf": 1.0,
  ".ppt": 0.0,
  ".pptx": 0.0,
  ".doc": 0.0,
  ".docx": 0.0,
  ".xls": 0.0,
  ".xlsx": 0.0,
  ".csv": 0.0,
  ".json": 0.0
} (cracking.py:381)
[2024-12-19 01:43:31] INFO     azureml.rag.azureml.rag.documents.chunking - [DocumentChunksIterator::split_documents] Total time to split 1 documents into 0 chunks: 0.9676399230957031 (chunking.py:247)
[2024-12-19 01:43:31] INFO     azureml.rag.azureml.rag.documents.chunking - [DocumentChunksIterator::split_documents] Total time to split 1 documents into 0 chunks: 0.9676399230957031 (chunking.py:247)
[2024-12-19 01:43:31] INFO     azureml.rag.crack_and_chunk - Processed 0 files (crack_and_chunk.py:208)
[2024-12-19 01:43:31] INFO     azureml.rag.crack_and_chunk - No chunked documents found in /mnt/azureml/cr/j/606547e361134e058c4829792b595f69/cap/data-capability/wd/INPUT_input_data with glob **/* (crack_and_chunk.py:215)
[2024-12-19 01:43:31] ERROR    azureml.rag.crack_and_chunk.crack_and_chunk - ServiceError: intepreted error = Rag system error, original error = No chunked documents found in /mnt/azureml/cr/j/606547e361134e058c4829792b595f69/cap/data-capability/wd/INPUT_input_data with glob **/*. (exceptions.py:124)
[2024-12-19 01:43:36] ERROR    azureml.rag.crack_and_chunk.crack_and_chunk - crack_and_chunk failed with exception: Traceback (most recent call last):
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/tasks/crack_and_chunk.py", line 229, in main_wrapper
    map_exceptions(main, activity_logger, args, logger, activity_logger)
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/utils/exceptions.py", line 126, in map_exceptions
    raise e
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/utils/exceptions.py", line 118, in map_exceptions
    return func(*func_args, **kwargs)
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/tasks/crack_and_chunk.py", line 220, in main
    raise ValueError(f"No chunked documents found in {args.input_data} with glob {args.input_glob}.")
ValueError: No chunked documents found in /mnt/azureml/cr/j/606547e361134e058c4829792b595f69/cap/data-capability/wd/INPUT_input_data with glob **/*.
 (crack_and_chunk.py:231) ...................................

I tried with both Serverless compute and a Compute instance, and the result is the same.
It seems the chunking step is not producing anything.
My file is a single-page PDF with no images, to keep things simple.

Does anyone have a suggestion? Thank you in advance!

2 Answers


  1. Chosen as BEST ANSWER

    The solution suggested by JayshankarGS worked. I added another PDF with more pages and the job finished, and the vector index was created.


  2. This error occurs when the documents contain no content to chunk.

    I got the same error myself.

    I had two text files, New Text Document.txt and New Text Document (2).txt, both empty, and the job failed with this error.

    You said you have a single-page PDF, so the likely reason is that its content is not being extracted properly.

    Try again with 3-4 PDF files that have proper content, and make sure the files are not password protected.
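
Before resubmitting the job, it may help to check locally which input files have no extractable text. The following is a minimal sketch, not the logic that azureml.rag itself uses: the function names `pdf_text` and `files_without_text` are hypothetical, the set of plain-text extensions is an assumption, and the PDF check assumes the `pypdf` package (which appears in the log above) is installed locally. PDFs are skipped if it is not.

```python
from pathlib import Path

# Assumption: these extensions can be read as plain text for this check.
TEXT_EXTENSIONS = {".txt", ".md", ".csv", ".json", ".html", ".htm", ".py"}


def pdf_text(path):
    """Return the extractable text of a PDF, or None if pypdf is missing."""
    try:
        from pypdf import PdfReader  # optional dependency
    except ImportError:
        return None
    reader = PdfReader(str(path))
    if reader.is_encrypted:
        # Password-protected file: treat it as having no extractable text.
        return ""
    return "".join((page.extract_text() or "") for page in reader.pages)


def files_without_text(input_dir, glob="**/*"):
    """List names of files under input_dir that yield no text at all."""
    empty = []
    for path in sorted(Path(input_dir).glob(glob)):
        if not path.is_file():
            continue
        ext = path.suffix.lower()
        if ext in TEXT_EXTENSIONS:
            text = path.read_text(encoding="utf-8", errors="ignore")
        elif ext == ".pdf":
            text = pdf_text(path)
            if text is None:
                continue  # cannot check PDFs without pypdf
        else:
            continue  # other formats are not covered by this sketch
        if not text.strip():
            empty.append(path.name)
    return empty
```

Calling `files_without_text("path/to/input_folder")` returns the names of files that would likely give the chunker nothing to work with. A scanned or image-only PDF typically shows up in this list, because it has pages but no text layer.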
