I have an indexer that reads through blob storage, then chunks and vectorizes the data into an index. This is working great. I also have a key field, let's call it `fileID`, that is stored in the metadata of the document and is also in the index. It is unique to the document; however, it is not unique after chunking, because a document is split into multiple documents that each carry the same `fileID`.

I want to have a second indexer that can add data from a SQL query into the index, joined on that `fileID`. However, since I can no longer use `fileID` as the key (the chunking process means it is not unique, and a key must be unique), how can I merge the data from the SQL query indexer into the index?

I'm guessing this is not possible right now, but if anyone has any suggestions, that would be amazing!
2 Answers
I ended up doing this with a custom web API skill using an Azure Function, which takes a record ID and returns additional fields from a SQL Server database.
Custom Skill Web API Documentation
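A rough sketch of such a function in Python, following the custom skill request/response contract; the `Documents` table and the `author`/`department` columns are hypothetical placeholders for your own join query:

```python
import json

import azure.functions as func
import pyodbc

CONN_STR = "Driver={ODBC Driver 18 for SQL Server};Server=<server>;Database=<db>;..."  # placeholder

def main(req: func.HttpRequest) -> func.HttpResponse:
    body = req.get_json()
    results = []
    with pyodbc.connect(CONN_STR) as conn:
        cursor = conn.cursor()
        for record in body["values"]:
            file_id = record["data"]["fileID"]
            # Hypothetical table and columns; substitute your own join query.
            row = cursor.execute(
                "SELECT author, department FROM Documents WHERE fileID = ?", file_id
            ).fetchone()
            results.append({
                # recordId must be echoed back unchanged per the skill contract.
                "recordId": record["recordId"],
                "data": ({"author": row.author, "department": row.department}
                         if row else {"author": None, "department": None}),
            })
    return func.HttpResponse(json.dumps({"values": results}), mimetype="application/json")
```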
Here is my skillset definition when all is said and done.
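In outline it looks like the following; the function URL, key, and the `author`/`department` output fields are placeholders for whatever your SQL query returns:

```json
{
  "name": "sql-join-skillset",
  "description": "Enriches each document with extra fields looked up in SQL Server",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "uri": "https://<function-app>.azurewebsites.net/api/lookup?code=<function-key>",
      "httpMethod": "POST",
      "timeout": "PT30S",
      "batchSize": 100,
      "context": "/document",
      "inputs": [
        { "name": "fileID", "source": "/document/fileID" }
      ],
      "outputs": [
        { "name": "author", "targetName": "author" },
        { "name": "department", "targetName": "department" }
      ]
    }
  ]
}
```

Note that the skill outputs still need `outputFieldMappings` on the indexer to land in the index fields.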
The key field is unique per source document, but it stays the same for all chunks created from that document. So it is not possible to get a unique ID for each chunk unless you create two separate indexes: one with the basic fields plus a custom web API skillset that does the chunking, and one for loading the chunked data with its own unique key field.
Here, the skillset takes the input, creates the chunked data for each document, and writes it to blob storage. Then, with that storage as a data source, it reads the key field as `uniqueid`. Next, the indexer is used for this index with the skillset. `tmp-index` below shows the fields.
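A minimal sketch of what `tmp-index` could look like; apart from the `uniqueid` key, the fields are illustrative:

```json
{
  "name": "tmp-index",
  "fields": [
    { "name": "uniqueid", "type": "Edm.String", "key": true, "filterable": true },
    { "name": "content", "type": "Edm.String", "searchable": true },
    { "name": "metadata_storage_name", "type": "Edm.String", "filterable": true }
  ]
}
```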
Indexer definition.
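Roughly like this, with placeholder names:

```json
{
  "name": "tmp-indexer",
  "dataSourceName": "<your-original-datasource>",
  "targetIndexName": "tmp-index",
  "skillsetName": "chunking-skillset",
  "parameters": {
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default"
    }
  }
}
```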
Here, the `dataSourceName` is configured as your original DataSource, and the indexer is given the skillset, which takes the input and writes the chunked data in JSON format to the storage account.

Skillset definition.
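A sketch of such a skillset; the function URL and the input sources shown are assumptions:

```json
{
  "name": "chunking-skillset",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "description": "Sends each document's text and id to the function, which writes chunk files to blob storage",
      "uri": "https://<function-app>.azurewebsites.net/api/chunk?code=<function-key>",
      "httpMethod": "POST",
      "timeout": "PT60S",
      "context": "/document",
      "inputs": [
        { "name": "text", "source": "/document/content" },
        { "name": "document_id", "source": "/document/uniqueid" }
      ],
      "outputs": [
        { "name": "status", "targetName": "chunk_status" }
      ]
    }
  ]
}
```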
Here, a POST request is made to the web API, passing the inputs.
You can configure the skillset according to your requirements. In the web API, you can use these inputs: from `text`, create chunks of size 1024, and from `document_id` plus a `chunk_id`, create a uniquely named file for each chunk. You can create a new container and write to it according to your requirements, but make sure you provide this container as the data source in the next step.
So, the script in your web API should create unique folders and files, then write the content.
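A rough sketch of that script as a Python Azure Function; the container name, the connection-string app setting, and the JSON payload shape are assumptions chosen to match the index below:

```python
import json
import os

import azure.functions as func
from azure.storage.blob import BlobServiceClient

CHUNK_SIZE = 1024
CONTAINER = "chunked-docs"  # placeholder: the container tmp-datasource-chunk points at

def main(req: func.HttpRequest) -> func.HttpResponse:
    body = req.get_json()
    container = BlobServiceClient.from_connection_string(
        os.environ["STORAGE_CONNECTION_STRING"]  # placeholder app setting
    ).get_container_client(CONTAINER)

    results = []
    for record in body["values"]:
        text = record["data"]["text"]
        document_id = record["data"]["document_id"]
        # One JSON file per chunk, under a folder per document, so every
        # blob path (and every uniqueid) is unique.
        for chunk_id, start in enumerate(range(0, len(text), CHUNK_SIZE)):
            payload = {
                "uniqueid": f"{document_id}_{chunk_id}",
                "document_id": document_id,
                "chunk_id": chunk_id,
                "text": text[start:start + CHUNK_SIZE],
            }
            container.upload_blob(
                f"{document_id}/{chunk_id}.json", json.dumps(payload), overwrite=True
            )
        # The custom skill contract expects one response value per input record.
        results.append({"recordId": record["recordId"], "data": {"status": "ok"}})

    return func.HttpResponse(json.dumps({"values": results}), mimetype="application/json")
```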
Next, create a new index and indexer with the definitions below.
Chunked-index definition.
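For example (the `author` and `department` fields stand in for whatever your join query returns):

```json
{
  "name": "chunked-index",
  "fields": [
    { "name": "uniqueid", "type": "Edm.String", "key": true },
    { "name": "text", "type": "Edm.String", "searchable": true },
    { "name": "document_id", "type": "Edm.String", "filterable": true },
    { "name": "chunk_id", "type": "Edm.Int32", "filterable": true },
    { "name": "author", "type": "Edm.String", "filterable": true },
    { "name": "department", "type": "Edm.String", "filterable": true }
  ]
}
```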
Here, you can also add your extra fields, which are the results of the join query.
Indexer definition.
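A minimal sketch, assuming the chunk files were written as JSON:

```json
{
  "name": "chunked-indexer",
  "dataSourceName": "tmp-datasource-chunk",
  "targetIndexName": "chunked-index",
  "parameters": {
    "configuration": { "parsingMode": "json" }
  }
}
```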
`tmp-datasource-chunk` is the DataSource created using blob storage, pointing to the container where you wrote the chunked data earlier. Now, you can also add a custom web API skill here, which does the join on the unique ID.