
I want to index a set of documents (actually only 5) from PDF files.

All I want to do is use the TextSplitSkill followed by the AzureOpenAIEmbeddingSkill. Here is my skillset template:

{
    "name": "{{ skillSetName }}",
    "description": "Skillset to chunk documents and generate embeddings",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
            "textSplitMode": "pages",
            "defaultLanguageCode": "de",
            "maximumPageLength": 1000,
            "pageOverlapLength": 100,
            "inputs": [
                {
                    "name": "text",
                    "source": "/document/content"
                }
            ],
            "outputs": [
                {
                    "name": "textItems",
                    "targetName": "pages"
                }
            ]
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
            "name": "OpenAIEmbeddings",
            "description": "Convert chunk content to embeddings",
            "resourceUri": "{{ embeddingEndpoint }}",
            "deploymentId": "{{ embeddingDeployment }}",
            "context": "/document/pages/*",
            "inputs": [
                {
                    "name": "text",
                    "source": "/document/pages/*"
                }
            ],
            "outputs": [
                {
                    "name": "embedding"
                }
            ]
        }
    ],
    "indexProjections": {
        "selectors": [
            {
                "targetIndexName": "{{ targetIndexName }}",
                "parentKeyFieldName": "parent_id",
                "sourceContext": "/document/pages/*",
                "mappings": [
                    {
                        "name": "content",
                        "source": "/document/pages/*"
                    },
                    {
                        "name": "content_vector",
                        "source": "/document/pages/*/embedding"
                    }
                ]
            }
        ],
        "parameters": {
            "projectionMode": "skipIndexingParentDocuments"
        }
    }
}

However, for my longest PDFs (30 pages) I get the following error:

Error Message: Could not execute skill because Web Api skill response is invalid.

Details: Concurrency conflict during operation

I read that the Azure OpenAI embedding API only supports a batch size of 16.

I played around with one document, splitting it into chunks of 300 characters and setting the maximumPagesToTake parameter to 16, but that ended in the same error.

I’m out of ideas, and since the AzureOpenAIEmbeddingSkill is not open source, I can’t see whether the calls to Azure OpenAI are wrapped in retries or split into mini-batches…

Update: here is a screenshot of the indexer error:
[screenshot: error message in indexer]

Can somebody help?

2 Answers


  1. Chosen as BEST ANSWER

    I ended up comparing my JSON files for the raw REST API calls to the ones that are created when you use the Azure portal.

    I finally fixed it by copying some variables from the automatically created one to mine.

    I think the key parameter that must be set is maximumPagesToTake=0. Setting it to zero makes the skill take all pages. I would have thought that would be the default behaviour if I didn't set it at all (but obviously my expectation was wrong).

    The documentation should be updated to reflect the meaning of this parameter, and also that it actually is a REQUIRED parameter.
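
    For reference, a minimal sketch of where that parameter goes in the SplitSkill definition; everything except the added maximumPagesToTake line is copied from the template in the question:

        {
            "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
            "textSplitMode": "pages",
            "defaultLanguageCode": "de",
            "maximumPageLength": 1000,
            "pageOverlapLength": 100,
            "maximumPagesToTake": 0,
            "inputs": [
                { "name": "text", "source": "/document/content" }
            ],
            "outputs": [
                { "name": "textItems", "targetName": "pages" }
            ]
        }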


  2. I tried your JSON template and it’s working for me. I also added the target name to the embedding skill's output:

    "outputs": [
            {
              "name": "embedding",
              "targetName": "embedding"
            }
          ]
        }
      ]
    
    

    Output:
    [screenshot]

    If you don't find Web API skills in the JSON:

    • If you want to use a Custom Web API skill in Azure AI Search, refer to this link.

    • You can check the required template in the Azure portal (a sketch of the general shape is shown below).
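
    As a rough orientation (not taken from this thread's skillset), here is a minimal sketch of what a Custom Web API skill definition looks like; the uri is a placeholder for your own endpoint, and the input/output names depend entirely on what your API expects:

        {
            "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
            "description": "Calls a custom endpoint for each chunk",
            "uri": "https://example.com/api/my-skill",
            "httpMethod": "POST",
            "timeout": "PT30S",
            "batchSize": 4,
            "context": "/document/pages/*",
            "inputs": [
                { "name": "text", "source": "/document/pages/*" }
            ],
            "outputs": [
                { "name": "vector", "targetName": "vector" }
            ]
        }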

    I have added indexes by following the documentation. The AnalyzeDocumentOptions.Pages property accepts a list of page numbers or page ranges as input:

        public System.Collections.Generic.IList<string> Pages { get; }

    Integrate the AnalyzeDocumentOptions.Pages property into the existing code to limit the analysis to only the first 5 pages (a fuller usage sketch follows below):

        var options = new AnalyzeDocumentOptions { Pages = { "1", "2", "3", "4", "5" } };
    
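    For context, a minimal sketch (not from this thread) of how these options are passed to an analysis call with the Azure.AI.FormRecognizer.DocumentAnalysis SDK; the endpoint, key, model ID, and document URL below are placeholders:

        using System;
        using Azure;
        using Azure.AI.FormRecognizer.DocumentAnalysis;

        // Placeholder endpoint and key for your Document Intelligence resource.
        var client = new DocumentAnalysisClient(
            new Uri("https://<your-resource>.cognitiveservices.azure.com/"),
            new AzureKeyCredential("<your-key>"));

        // "1-5" is the range form; it is equivalent to listing "1" through "5" individually.
        var options = new AnalyzeDocumentOptions { Pages = { "1-5" } };

        // Analyze a document from a URL with the page filter applied.
        AnalyzeDocumentOperation operation = await client.AnalyzeDocumentFromUriAsync(
            WaitUntil.Completed,
            "prebuilt-read",                            // placeholder model ID
            new Uri("https://example.com/sample.pdf"),  // placeholder document URL
            options);

        AnalyzeResult result = operation.Value;
        Console.WriteLine($"Analyzed {result.Pages.Count} pages.");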

    I tried a sample from the Azure Samples for Azure AI Search with .NET, found via GitHub and SO:

        var schedule = new IndexingSchedule(TimeSpan.FromDays(1))
        {
            StartTime = DateTimeOffset.Now
        };

        var parameters = new IndexingParameters()
        {
            BatchSize = 100,
            MaxFailedItems = 0,
            MaxFailedItemsPerBatch = 0
        };

        var indexer = new SearchIndexer("hotels-sql-idxr", dataSource.Name, searchIndex.Name)
        {
            Description = "Data indexer",
            Schedule = schedule,
            Parameters = parameters,
            FieldMappings =
            {
                new FieldMapping("_id") { TargetFieldName = "HotelId" },
                new FieldMapping("Amenities") { TargetFieldName = "Tags" }
            }
        };

        // The indexer must exist on the service before it can be run;
        // indexerClient is an existing SearchIndexerClient.
        await indexerClient.CreateOrUpdateIndexerAsync(indexer);

        Console.WriteLine("Running Azure SQL indexer...");
        try
        {
            await indexerClient.RunIndexerAsync(indexer.Name);
        }
        catch (RequestFailedException ex) when (ex.Status == 429)
        {
            Console.WriteLine("Failed to run indexer: {0}", ex.Message);
        }
        System.Threading.Thread.Sleep(5000);

        // Document analysis, limited to the first five pages.
        var options = new AnalyzeDocumentOptions { Pages = { "1-5" } };
    
    

