
As the title says: I used the "Import and vectorize data" wizard to create an index, and the index content is automatically chunked.

The index schema looks like this:

 "value": [
    {
      "@search.score": 
      "chunk_id": "",
      "chunk": "",
      "title": "",
      "image": ""
    },

Referring to the official documentation, I used "/document/normalized_images/*/data" to retrieve the base64 data of the normalized images, and then used a program to convert it into image files. However, my objective is to obtain the base64 data corresponding to each chunk. Therefore, I modified the skillset as follows, but it resulted in this error message:

"One or more index projection selectors are invalid. Details: There is no matching index field for input ‘image’ in index ‘name’."

"indexProjections": {
    "selectors": [
      {
        "targetIndexName": "name",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/pages/*",
        "mappings": [
          {
            "name": "chunk",
            "source": "/document/pages/*",
            "sourceContext": null,
            "inputs": []
          },
          {
            "name": "vector",
            "source": "/document/pages/*/vector",
            "sourceContext": null,
            "inputs": []
          },
          {
            "name": "title",
            "source": "/document/metadata_storage_name",
            "sourceContext": null,
            "inputs": []
          },
          {
            "name": "image",
            "sourceContext": "/document/pages/*",
            "inputs": [
              {
                "source": "/document/normalized_images/*/pages/data",
                "name": "imagedata"
              }
            ]
          }
        ]
      }
    ]
}

I want to get the base64 data corresponding to each chunk's text. How can I adapt this approach, or is there an alternative solution?

2 Answers


  1. I want to get the base64 data corresponding to each chunk's text.

    There's a mismatch between the index schema and the skillset configuration. The field named "image", which appears intended for image URLs, isn't suitable for storing base64 data.

    • If you want to store base64 data directly in the index, you need to add a field to the index schema to hold it. You can name it something like "imageData", as shown below:
    "fields": [
        { "name": "imageData", "type": "Edm.String", "filterable": false, "sortable": false, "facetable": false, "searchable": false }
    ]
    

    Once you've modified the index, update the skillset as follows:

    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
            "name": "#1",
            "inputs": [
                {
                    "name": "chunk",
                    "source": "/document/pages/*"
                }
            ],
            "outputs": [
                {
                    "name": "chunk"
                },
                {
                    "name": "imageData",
                    "targetName": "imageData"
                }
            ]
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.ExtractKeyPhrasesSkill",
            "name": "#2",
            "context": "/document",
            "inputs": [
                {
                    "name": "text",
                    "source": "/document/pages/*/text"
                }
            ],
            "outputs": [
                {
                    "name": "keyPhrases",
                    "targetName": "keyPhrases"
                }
            ]
        }
    ]
    
    • The ShaperSkill consolidates each chunk together with the base64 image data from "/document/normalized_images/*/data" into a single shaped output ("chunkShape"), which your index projections can then map into the "imageData" field.

    Update the indexer (note that "skillsetName" and "targetIndexName" are top-level indexer properties, not configuration settings):

    "parameters": {
        "configuration": {
            "dataToExtract": "contentAndMetadata",
            "imageAction": "generateNormalizedImages",
            "indexedFileNameExtensions": ".pdf,.docx,.pptx,.xlsx",
            "skillsetName": "your_updated_skillset_name",
            "targetIndexName": "your_index_name",
            "fieldMappings": [
                {
                    "sourceFieldName": "/document/pages/*/text",
                    "targetFieldName": "text"
                },
                {
                    "sourceFieldName": "/document/pages/*/title",
                    "targetFieldName": "title"
                }
            ]
        }
    }
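
    After rerunning the indexer, you can verify that the projected data landed in the new field with a simple search request; a minimal sketch, assuming the field names used above (the service name is a placeholder, and your query api-key goes in the request headers):

    POST https://<your-service>.search.windows.net/indexes/your_index_name/docs/search?api-version=2024-07-01
    {
        "search": "*",
        "select": "chunk_id,chunk,title,imageData"
    }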
    

  2. Your index projections are defined incorrectly. First, you are creating a nested input within the "image" mapping. Nested inputs should only be used when the "image" field is of type Edm.ComplexType and you want to shape an inline complex type to map into the index. Second, you are mapping "/document/normalized_images/*/pages/data"; you need to remove "pages" from that source path. That particular mapping in your index projections definition should then be just the following:

    {
        "name": "image",
        "source": "/document/normalized_images/*/data"
    }
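
    For contrast, a nested-input mapping like the one you wrote only makes sense when targeting a complex field; here is a minimal sketch of that shape, assuming a hypothetical Edm.ComplexType field named "imageObject" with a "data" sub-field:

    {
        "name": "imageObject",
        "sourceContext": "/document/normalized_images/*",
        "inputs": [
            {
                "name": "data",
                "source": "/document/normalized_images/*/data"
            }
        ]
    }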
    

    However, notice that the sourceContext for your index projections is "/document/pages/*", meaning each "page" becomes one document in the search index. Images, on the other hand, are tracked under the separate path "/document/normalized_images/*", so there is not necessarily a 1-1 mapping from pages to images. If you use the mapping above, each page's search document will therefore receive an array of strings containing the base64 data for all of the images from the parent document.
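
    If that array behavior is acceptable, the "image" field in your index has to be a collection rather than a single string; a minimal field sketch (the attribute settings here are assumptions):

    { "name": "image", "type": "Collection(Edm.String)", "searchable": false, "filterable": false, "facetable": false, "retrievable": true }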

    If you want a 1-1 mapping from images to search documents, you should instead structure your skillset like the example below. Note, though, that if the OCR text output per image is too large to be vectorized, you will see errors.

    {
      "description": "Skillset to chunk documents by image and generate embeddings",
      "skills": [
        {
          "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
          "context": "/document/normalized_images/*",
          "inputs": [
            {
              "name": "image",
              "source": "/document/normalized_images/*"
            }
          ],
          "outputs": [
            {
              "name": "text",
              "targetName": "text"
            }
          ]
        },
        {
          "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
          "context": "/document/normalized_images/*",
          "resourceUri": "<fill in>",
          "apiKey": "<fill in>",
          "deploymentId": "<fill in>",
          "inputs": [
            {
              "name": "text",
              "source": "/document/normalized_images/*/text"
            }
          ],
          "outputs": [
            {
              "name": "embedding",
              "targetName": "vector"
            }
          ]
        }
      ],
      "cognitiveServices": null,
      "indexProjections": {
        "selectors": [
          {
            "targetIndexName": "name",
            "parentKeyFieldName": "parent_id",
            "sourceContext": "/document/normalized_images/*",
            "mappings": [
              {
                "name": "chunk",
                "source": "/document/normalized_images/*/text"
              },
              {
                "name": "vector",
                "source": "/document/normalized_images/*/vector"
              },
              {
                "name": "title",
                "source": "/document/metadata_storage_name"
              },
              {
                "name": "image",
                "source": "/document/normalized_images/*/data"
              }
            ]
          }
        ],
        "parameters": {
          "projectionMode": "skipIndexingParentDocuments"
        }
      }
    }
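
    With this layout each search document corresponds to exactly one image, so the "image" field can stay a single string; a minimal field sketch (the attribute settings here are assumptions):

    { "name": "image", "type": "Edm.String", "searchable": false, "filterable": false, "sortable": false, "facetable": false, "retrievable": true }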
    