As title.

I have several PDFs stored in Azure Blob Storage. I index them with Azure AI Search and use the SplitSkill.

However, even with textSplitMode set to pages, I still can't split the documents by page.

The skillset JSON code is as follows:

{
  "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
  "name": "#2",
  "description": "Split skill to chunk documents",
  "context": "/document",
  "defaultLanguageCode": "en",
  "textSplitMode": "pages",
  "maximumPageLength": 2000,
  "pageOverlapLength": 500,
  "maximumPagesToTake": 0,
  "inputs": [
    {
      "name": "text",
      "source": "/document/mergedText"
    }
  ],
  "outputs": [
    {
      "name": "textItems",
      "targetName": "pages"
    }
  ]
}

How can I split the documents according to their page numbers? I want the search output to show the answer along with the corresponding page number.

Answers


  1. The text split skill breaks documents into chunks, which are used for further processing by other cognitive skills.

    I added an output field mapping from the text split skill's output to a field in the index. Although all pages are indexed on the same source document, any other cognitive skill that takes an input such as /document/mypages/* processes each page separately.

    Below is the sample skillset I used to run the language detection skill on each page.

    {
      "@odata.context": "https://jgsai.search.windows.net/$metadata#skillsets/$entity",
      "@odata.etag": "\"0x8DC37696B99977C\"",
      "name": "skillset1709010100983",
      "description": "",
      "skills": [
        {
          "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
          "name": "#1",
          "description": null,
          "context": "/document",
          "defaultLanguageCode": "en",
          "textSplitMode": "pages",
          "maximumPageLength": 1000,
          "pageOverlapLength": 0,
          "maximumPagesToTake": 0,
          "inputs": [
            {
              "name": "text",
              "source": "/document/content"
            },
            {
              "name": "languageCode",
              "source": "/document/language"
            }
          ],
          "outputs": [
            {
              "name": "textItems",
              "targetName": "mypages"
            }
          ]
        },
        {
          "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
          "name": "#2",
          "description": "",
          "context": "/document/mypages/*",
          "defaultCountryHint": "in",
          "modelVersion": "latest",
          "inputs": [
            {
              "name": "text",
              "source": "/document/mypages/*"
            }
          ],
          "outputs": [
            {
              "name": "languageCode",
              "targetName": "languageCode"
            },
            {
              "name": "languageName",
              "targetName": "languageName"
            },
            {
              "name": "score",
              "targetName": "score"
            }
          ]
        }
      ],
      "cognitiveServices": {
        "@odata.type": "#Microsoft.Azure.Search.DefaultCognitiveServices",
        "description": null
      },
      "knowledgeStore": null,
      "indexProjections": null,
      "encryptionKey": null
    }
    

    However, what you are asking for, getting each page as its own entry in the index, cannot be done with the text split skill on its own. You can either use the output of the text split skill as it is, or project the pages into the knowledge store and create a separate index from those pages.
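
    It may help to see why "pages" mode does not line up with PDF pages: the chunks are bounded by maximumPageLength characters, not by the physical page breaks of the source file. The Python sketch below illustrates plain length-based chunking with overlap. It is only an approximation (the real skill also tries to break at sentence boundaries), and `split_into_chunks` is a hypothetical helper, not part of any Azure SDK.

```python
def split_into_chunks(text: str, max_len: int, overlap: int = 0) -> list[str]:
    """Cut text into chunks of at most max_len characters.

    Every chunk after the first starts `overlap` characters before the end
    of the previous one. Hypothetical helper, for illustration only.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    step = max_len - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break  # the chunk just added already reaches the end of the text
        start += step
    return chunks


# A 4500-character stand-in document with the settings from the question.
document = "x" * 4500
chunks = split_into_chunks(document, max_len=2000, overlap=500)
print([len(c) for c in chunks])
```

    With these settings the sketch yields three chunks of 2000, 2000 and 1500 characters, however many physical pages the PDF had, so a chunk index cannot be used as a page number.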

  2. If you only have PDFs, you can set imageAction to generateNormalizedImagePerPage in the indexer and run OCR to extract the text. Each normalized image then also carries a pageNumber that you can map to the index. If you want each page to be its own document in the index, you can use the index projections feature (in preview at the time of writing).

    Your indexer definition would look something like this:

    {
        "name": "indexerName",
        "targetIndexName": "indexName",
        "skillsetName": "skillsetName",
        "dataSourceName": "dataSourceName",
        "parameters": {
            "configuration": {
                "imageAction": "generateNormalizedImagePerPage"
            }
        },
        "fieldMappings": [],
        "outputFieldMappings": []
    }
    

    And your skillset definition will look something like this:

    {
        "skills": [
            {
                "description": "Extracts text (plain and structured) from image.",
                "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
                "context": "/document/normalized_images/*",
                "inputs": [
                    {
                        "name": "image",
                        "source": "/document/normalized_images/*"
                    }
                ],
                "outputs": [
                    {
                        "name": "text",
                        "targetName": "text"
                    }
                ]
            }
        ],
        "cognitiveServices": {
            "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
            "description": "CognitiveServices account",
            "key": "{{cognitiveServicesKey}}"
        },
        "indexProjections": {
            "selectors": [
                {
                    "targetIndexName": "indexName",
                    "parentKeyFieldName": "ParentKey",
                    "sourceContext": "/document/normalized_images/*",
                    "mappings": [
                        {
                            "name": "pageText",
                            "source": "/document/normalized_images/*/text"
                        },
                        {
                            "name": "pageNumber",
                            "source": "/document/normalized_images/*/pageNumber"
                        }
                    ]
                }
            ],
            "parameters": {
                "projectionMode": "skipIndexingParentDocuments"
            }
        }
    }
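
    To make the effect of the selector concrete, here is a rough Python simulation of what should end up in the index: one search document per entry under /document/normalized_images/*, each carrying the page text and its pageNumber. The `project_pages` and `format_hit` helpers, the generated `id` values and the sample data are all assumptions made for illustration. They do not reflect the service's actual implementation or key format.

```python
def project_pages(parent_key: str, normalized_images: list[dict]) -> list[dict]:
    """Build one index document per page, mimicking the selector's mappings."""
    docs = []
    for image in normalized_images:
        docs.append({
            # Hypothetical child key; the service generates its own keys.
            "id": f"{parent_key}_page_{image['pageNumber']}",
            "ParentKey": parent_key,
            "pageText": image["text"],
            "pageNumber": image["pageNumber"],
        })
    return docs


def format_hit(doc: dict) -> str:
    """Render a search hit together with the page it came from."""
    return f"{doc['pageText']} (page {doc['pageNumber']})"


# Made-up OCR output for a two-page PDF.
enriched = [
    {"pageNumber": 1, "text": "Introduction to the product."},
    {"pageNumber": 2, "text": "Warranty terms and conditions."},
]
page_docs = project_pages("doc-42", enriched)

# Naive keyword match standing in for a real search query.
hits = [d for d in page_docs if "warranty" in d["pageText"].lower()]
print(format_hit(hits[0]))
```

    Because every page is its own document, each hit already contains the pageNumber field, so the answer and its page number can be shown together.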
    