How to split document by page in Azure AI Search?

Alice
March 7, 2024
2 Answers

As title.

I have several PDFs stored in Azure Blob Storage. They are indexed by Azure AI Search with a skillset that uses SplitSkill.

However, even with textSplitMode set to pages, the documents are not split by page.

The skillset JSON code is as follows:

{
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "#2",
      "description": "Split skill to chunk documents",
      "context": "/document",
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "pageOverlapLength": 500,
      "maximumPagesToTake": 0,
      "inputs": [
        {
          "name": "text",
          "source": "/document/mergedText"
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ]
}

How can I split the documents according to their actual page numbers?

I want the search output to show each answer along with the corresponding page number.

Answers

The Text Split skill breaks documents into chunks, which other cognitive skills can then process further.

I added field mappings from the output of the Text Split skill to the index. Even though all pages are indexed under the same source document, a downstream skill that takes an input such as /document/mypages/* processes each page separately.

Below is the sample skillset I used to run the language detection skill on each page.
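To make the chunking behaviour concrete: in "pages" mode the skill cuts the text into chunks no longer than maximumPageLength characters, so a chunk is not the same thing as a PDF page. The sketch below is a simplified illustration only, not the skill's actual algorithm (the real skill also tries to break at sentence boundaries). The function name split_into_chunks is hypothetical.

```python
def split_into_chunks(text: str, max_len: int, overlap: int) -> list[str]:
    """Cut text into chunks of at most max_len characters.

    Each chunk after the first starts `overlap` characters before the
    previous chunk ended. This is a simplified stand-in for the
    maximumPageLength / pageOverlapLength settings, not the real skill.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    chunks = []
    start = 0
    step = max_len - overlap  # how far the window advances each time
    while start < len(text):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    # Two made-up "PDF pages" of 2250 characters, joined by a form feed.
    two_pages = "x" * 2250 + "\f" + "y" * 2250
    chunks = split_into_chunks(two_pages, max_len=2000, overlap=500)
    # Three chunks come out of two physical pages.
    print([len(c) for c in chunks])  # [2000, 2000, 1501]
```

With the settings from the question (2000 and 500), two physical pages produce three chunks, and the second chunk spans the page boundary. The position of a chunk in the output array therefore says nothing reliable about the PDF page number.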

{
  "@odata.context": "https://jgsai.search.windows.net/$metadata#skillsets/$entity",
  "@odata.etag": "\"0x8DC37696B99977C\"",
  "name": "skillset1709010100983",
  "description": "",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "#1",
      "description": null,
      "context": "/document",
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 1000,
      "pageOverlapLength": 0,
      "maximumPagesToTake": 0,
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        },
        {
          "name": "languageCode",
          "source": "/document/language"
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "mypages"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
      "name": "#2",
      "description": "",
      "context": "/document/mypages/*",
      "defaultCountryHint": "in",
      "modelVersion": "latest",
      "inputs": [
        {
          "name": "text",
          "source": "/document/mypages/*"
        }
      ],
      "outputs": [
        {
          "name": "languageCode",
          "targetName": "languageCode"
        },
        {
          "name": "languageName",
          "targetName": "languageName"
        },
        {
          "name": "score",
          "targetName": "score"
        }
      ]
    }
  ],
  "cognitiveServices": {
    "@odata.type": "#Microsoft.Azure.Search.DefaultCognitiveServices",
    "description": null
  },
  "knowledgeStore": null,
  "indexProjections": null,
  "encryptionKey": null
}

However, what you are asking for, getting each page into the index as its own entry, cannot be done with the Text Split skill alone. You can either use the output of the Text Split skill as shown above, or write the pages to a knowledge store and create a new, separate index from those pages.

If you only have PDFs, you can use the generateNormalizedImagePerPage image action together with OCR to extract your text. Each normalized image then has a pageNumber available that you can map to the index. If you want each page to be its own document in the index, you can do so with the new index projections feature, which is currently in preview.

Your indexer definition would look something like this:

{
    "name": "indexerName",
    "targetIndexName": "indexName",
    "skillsetName": "skillsetName",
    "dataSourceName": "dataSourceName",
    "parameters": {
        "configuration": {
            "imageAction": "generateNormalizedImagePerPage"
        }
    },
    "fieldMappings": [],
    "outputFieldMappings": []
}

And your skillset definition will look something like this:

{
    "skills": [
        {
            "description": "Extracts text (plain and structured) from image.",
            "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
            "context": "/document/normalized_images/*",
            "inputs": [
                {
                    "name": "image",
                    "source": "/document/normalized_images/*"
                }
            ],
            "outputs": [
                {
                    "name": "text",
                    "targetName": "text"
                }
            ]
        }
    ],
    "cognitiveServices": {
        "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
        "description": "CognitiveServices account",
        "key": "{{cognitiveServicesKey}}"
    },
    "indexProjections": {
        "selectors": [
            {
                "targetIndexName": "indexName",
                "parentKeyFieldName": "ParentKey",
                "sourceContext": "/document/normalized_images/*",
                "mappings": [
                    {
                        "name": "pageText",
                        "source": "/document/normalized_images/*/text"
                    },
                    {
                        "name": "pageNumber",
                        "source": "/document/normalized_images/*/pageNumber"
                    }
                ]
            }
        ],
        "parameters": {
            "projectionMode": "skipIndexingParentDocuments"
        }
    }
}
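Once each page is its own document, a query can return the page number alongside the matching text. The following is a minimal sketch, assuming the index contains the pageText, pageNumber and ParentKey fields from the skillset above and that they are marked retrievable. The helper names, the placeholder service name and key, and the choice of api-version 2023-11-01 are assumptions for illustration.

```python
import json
import urllib.request


def build_search_request(service: str, index: str, api_key: str,
                         query: str, top: int = 5) -> urllib.request.Request:
    """Build (but do not send) a Search Documents POST request."""
    url = (f"https://{service}.search.windows.net/indexes/{index}"
           "/docs/search?api-version=2023-11-01")
    body = {
        "search": query,
        # Field names taken from the index projection mappings above.
        "select": "pageText,pageNumber,ParentKey",
        "top": top,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


def format_hits(response_json: dict) -> list[str]:
    """Turn a search response into 'page N: text...' lines."""
    return [
        f"page {hit['pageNumber']}: {hit['pageText'][:80]}"
        for hit in response_json.get("value", [])
    ]


# Sending the request needs a real service name and key, for example:
#   req = build_search_request("<service>", "indexName", "<key>", "my question")
#   with urllib.request.urlopen(req) as resp:
#       print("\n".join(format_hits(json.load(resp))))
```

Each hit is one page, so pageNumber can be shown directly next to the answer text, and ParentKey identifies the source PDF.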
