How to extract page numbers from Azure AI Search using skills

Question

I have a series of documents in an Azure Storage blob container, and I want to create a knowledge mining solution that returns the page and layout of the searched text using AI Search. I'm using the OCR skill:

    ocr_skill = OcrSkill(
        name="#2",
        context="/document/normalized_images/*",
        line_ending="Space",
        default_language_code="en",

        inputs=[
            InputFieldMappingEntry(name="image", source="/document/normalized_images/*")
        ],
        outputs=[
            OutputFieldMappingEntry(name="text", target_name="text"),
            OutputFieldMappingEntry(name="layoutText", target_name="layoutText")
        ]
    )

I'm using these indexer parameters:

    indexing_parameters = IndexingParameters(
        configuration={
            "indexStorageMetadataOnlyForOversizedDocuments": True,
            "failOnUnsupportedContentType": False,
            "indexedFileNameExtensions": ".pdf,.docx,.txt,.json",
            "parsingMode": "default",
            "dataToExtract": "contentAndMetadata",
            "imageAction": "generateNormalizedImagePerPage",
            "allowSkillsetToReadFileData": True  # Set this to True
        }
    )
    indexer = SearchIndexer(
        name=indexer_name,
        data_source_name=data_source_name,
        target_index_name=index_name,
        skillset_name=skillset_name,
        field_mappings=field_mappings,
        output_field_mappings=output_field_mappings,
        schedule=IndexingSchedule(interval="PT15M"),
        parameters=indexing_parameters
    )
    indexer_client.create_indexer(indexer)

The OCR skill reads each page separately because I defined the field as a collection:

    SearchableField(name="layoutText", type=SearchFieldDataType.String, searchable=True, collection=True),

This creates a separate line in the field for each page, but no page number is mapped to each line. How do I add the page number to each page's OCR output, so that it is available when the search locates the word or phrase?

Answer

@Karim Alameh Thanks for asking the question.

Configuring the indexer with imageAction = generateNormalizedImagePerPage generates an array of normalized images in which each page of the PDF is rendered to one output image. Each image also includes the original PDF page number. Passing these images to the OcrSkill and then to the AzureOpenAIEmbeddingSkill gave me the results I wanted. Keep in mind that OCR and image extraction incur additional cost.
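
To make the data flow concrete, here is a minimal pure-Python sketch (no Azure calls) of what happens to each page node. The dict shape is a simplified stand-in for the enrichment tree, and `attach_ocr_text`, `FAKE_OCR`, the sample text and the example URL are all hypothetical names made up for illustration. The one thing it relies on from the service is that each entry under /document/normalized_images/* carries a pageNumber, and that a skill using that path as its context writes its output onto the same node.

```python
from typing import Callable, Dict

def attach_ocr_text(document: Dict, ocr: Callable[[str], str]) -> Dict:
    """Mimic a skill whose context is /document/normalized_images/*.

    The 'text' output is written onto each image node, so it ends up
    as a sibling of that node's pageNumber.
    """
    for image in document["normalized_images"]:
        image["text"] = ocr(image["data"])
    return document

# Hypothetical stand-in for the OCR service: fake image payload -> text.
FAKE_OCR = {
    "<page-1-bytes>": "Invoice summary for March",
    "<page-2-bytes>": "Payment terms and conditions",
}

# Simplified, assumed shape of a document after image extraction.
document = {
    "metadata_storage_path": "https://example.invalid/docs/contract.pdf",
    "normalized_images": [
        {"data": "<page-1-bytes>", "pageNumber": 1},
        {"data": "<page-2-bytes>", "pageNumber": 2},
    ],
}

enriched = attach_ocr_text(document, lambda data: FAKE_OCR[data])

for image in enriched["normalized_images"]:
    print(image["pageNumber"], image["text"])
```

Because the text and the page number sit on the same node, anything that later reads /document/normalized_images/* can pick up both together.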

For reference, see this sample skillset:

{
  "@odata.context": "https://something-something.search.windows.net/$metadata#skillsets/$entity",
  "@odata.etag": "\"0x8DC9DDEAB0DAC43\"",
  "name": "something-something-skillset",
  "description": "Skillset to chunk documents and generate embeddings",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
      "name": "#1",
      "description": null,
      "context": "/document/normalized_images/*",
      "textExtractionAlgorithm": null,
      "lineEnding": "Space",
      "defaultLanguageCode": "en",
      "detectOrientation": true,
      "inputs": [
        {
          "name": "image",
          "source": "/document/normalized_images/*"
        }
      ],
      "outputs": [
        {
          "name": "text",
          "targetName": "text"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "name": "#4",
      "description": null,
      "context": "/document/normalized_images/*",
      "resourceUri": "https://dion-test-aoai.openai.azure.com",
      "apiKey": "",
      "deploymentId": "text-embedding-ada-002",
      "dimensions": 1536,
      "modelName": "text-embedding-ada-002",
      "inputs": [
        {
          "name": "text",
          "source": "/document/normalized_images/*/text"
        }
      ],
      "outputs": [
        {
          "name": "embedding",
          "targetName": "text_vector"
        }
      ],
      "authIdentity": null
    }
  ],
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "something-something",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/normalized_images/*",
        "mappings": [
          {
            "name": "text_vector",
            "source": "/document/normalized_images/*/text_vector",
            "sourceContext": null,
            "inputs": []
          },
          {
            "name": "chunk",
            "source": "/document/normalized_images/*/text",
            "sourceContext": null,
            "inputs": []
          },
          {
            "name": "pageNumber",
            "source": "/document/normalized_images/*/pageNumber",
            "sourceContext": null,
            "inputs": []
          },
          {
            "name": "metadata_storage_path",
            "source": "/document/metadata_storage_path",
            "sourceContext": null,
            "inputs": []
          },
          {
            "name": "title",
            "source": "/document/title",
            "sourceContext": null,
            "inputs": []
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  },
  "encryptionKey": null
}
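
The effect of the indexProjections section can be sketched in plain Python as follows. This is only a local simulation under stated assumptions, not the service's implementation: `project_pages` and `find_pages` are hypothetical helpers, the sample document is invented, and the text_vector mapping is left out for brevity. The field names (parent_id, chunk, pageNumber, metadata_storage_path, title) are the ones used in the mappings above, so the target index would need fields with those names.

```python
from typing import Dict, List, Tuple

def project_pages(enriched: Dict, parent_key: str) -> List[Dict]:
    """Emit one child document per page, roughly as a selector with
    sourceContext /document/normalized_images/* would."""
    children = []
    for image in enriched["normalized_images"]:
        children.append({
            "parent_id": parent_key,              # parentKeyFieldName
            "chunk": image["text"],               # OCR text of this page
            "pageNumber": image["pageNumber"],    # carried by the image node
            "metadata_storage_path": enriched["metadata_storage_path"],
            "title": enriched["title"],
        })
    return children

def find_pages(children: List[Dict], phrase: str) -> List[Tuple[str, int]]:
    """Naive substring search standing in for a real search query."""
    needle = phrase.lower()
    return [
        (doc["title"], doc["pageNumber"])
        for doc in children
        if needle in doc["chunk"].lower()
    ]

# Invented sample of an enriched document after the OCR skill has run.
enriched = {
    "title": "contract.pdf",
    "metadata_storage_path": "https://example.invalid/docs/contract.pdf",
    "normalized_images": [
        {"pageNumber": 1, "text": "Invoice summary for March"},
        {"pageNumber": 2, "text": "Payment terms and conditions"},
        {"pageNumber": 3, "text": "Summary of payment history"},
    ],
}

children = project_pages(enriched, parent_key="doc-1")
print(find_pages(children, "payment"))
```

With projectionMode set to skipIndexingParentDocuments, only these per-page documents are indexed, so every search hit already carries the page number it came from.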
