How to extract page numbers from Azure AI Search using skills

Karim Alameh 20 Reputation points
2024-07-08T07:39:57.21+00:00

I have a series of documents in an Azure Storage blob container, and I want to create a knowledge mining solution that returns the page and layout of the searched text using Azure AI Search. I'm using the OCR skill:

    from azure.search.documents.indexes.models import (
        InputFieldMappingEntry,
        OcrSkill,
        OutputFieldMappingEntry,
    )

    # OCR skill that runs once per normalized page image
    ocr_skill = OcrSkill(
        name="#2",
        context="/document/normalized_images/*",
        line_ending="Space",
        default_language_code="en",
        inputs=[
            InputFieldMappingEntry(name="image", source="/document/normalized_images/*")
        ],
        outputs=[
            OutputFieldMappingEntry(name="text", target_name="text"),
            OutputFieldMappingEntry(name="layoutText", target_name="layoutText")
        ]
    )

I'm using these indexer parameters:

    from azure.search.documents.indexes.models import (
        IndexingParameters,
        IndexingSchedule,
        SearchIndexer,
    )

    indexing_parameters = IndexingParameters(
        configuration={
            "indexStorageMetadataOnlyForOversizedDocuments": True,
            "failOnUnsupportedContentType": False,
            "indexedFileNameExtensions": ".pdf,.docx,.txt,.json",
            "parsingMode": "default",
            "dataToExtract": "contentAndMetadata",
            # Render each PDF page to one normalized image
            "imageAction": "generateNormalizedImagePerPage",
            "allowSkillsetToReadFileData": True
        }
    )
    indexer = SearchIndexer(
        name=indexer_name,
        data_source_name=data_source_name,
        target_index_name=index_name,
        skillset_name=skillset_name,
        field_mappings=field_mappings,
        output_field_mappings=output_field_mappings,
        schedule=IndexingSchedule(interval="PT15M"),
        parameters=indexing_parameters
    )
    indexer_client.create_indexer(indexer)

The OCR skill reads each page separately, and I defined the target field as a collection:

    SearchableField(name="layoutText", type=SearchFieldDataType.String, searchable=True, collection=True),

This creates a separate entry in the field for each page, but there is no page number mapped to each entry. How do I attach the page number to each page's OCR output, so that it is returned when a search locates the word or phrase?

Azure AI Search

1 answer

  1. SnehaAgrawal-MSFT 21,506 Reputation points
    2024-07-08T16:58:23.0966667+00:00

    @Karim Alameh Thanks for asking this question.

    I can configure the indexer with imageAction = generateNormalizedImagePerPage to generate an array of normalized images, where each page in the PDF is rendered to one output image. Each image also includes the original PDF page number. Passing these images to the OcrSkill and then to the AzureOpenAIEmbeddingSkill returns the results I want. Remember that there is additional cost associated with OCR and image extraction.
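
To make the per-page structure concrete, here is a minimal Python sketch of how the enrichment tree can be pictured after the OCR skill runs. It is a simplified, hand-written model rather than actual service output: the `pageNumber` and `text` keys mirror the paths used in this thread (`/document/normalized_images/*/pageNumber` and `/document/normalized_images/*/text`), while the sample strings and the `pages_containing` helper are hypothetical.

```python
# Simplified model of /document/normalized_images after the OCR skill has run.
# The sample values are made up for illustration; only the key names come from
# the enrichment paths discussed in this thread.
enriched_document = {
    "normalized_images": [
        {"pageNumber": 1, "text": "Annual report overview"},
        {"pageNumber": 2, "text": "Revenue grew in the second quarter"},
        {"pageNumber": 3, "text": "Quarter by quarter revenue breakdown"},
    ]
}


def pages_containing(document: dict, phrase: str) -> list[int]:
    """Return the page numbers whose OCR text contains the phrase (case-insensitive)."""
    needle = phrase.lower()
    return [
        image["pageNumber"]
        for image in document["normalized_images"]
        if needle in image["text"].lower()
    ]


print(pages_containing(enriched_document, "revenue"))  # [2, 3]
```

Because each image node carries its own `pageNumber`, any skill output written under the same `/document/normalized_images/*` context stays paired with the page it came from.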

    For reference, see this sample skillset:

    {
      "@odata.context": "https://something-something.search.windows.net/$metadata#skillsets/$entity",
      "@odata.etag": "\"0x8DC9DDEAB0DAC43\"",
      "name": "something-something-skillset",
      "description": "Skillset to chunk documents and generate embeddings",
      "skills": [
        {
          "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
          "name": "#1",
          "description": null,
          "context": "/document/normalized_images/*",
          "textExtractionAlgorithm": null,
          "lineEnding": "Space",
          "defaultLanguageCode": "en",
          "detectOrientation": true,
          "inputs": [
            {
              "name": "image",
              "source": "/document/normalized_images/*"
            }
          ],
          "outputs": [
            {
              "name": "text",
              "targetName": "text"
            }
          ]
        },
        {
          "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
          "name": "#4",
          "description": null,
          "context": "/document/normalized_images/*",
          "resourceUri": "https://dion-test-aoai.openai.azure.com",
          "apiKey": "<redacted>",
          "deploymentId": "text-embedding-ada-002",
          "dimensions": 1536,
          "modelName": "text-embedding-ada-002",
          "inputs": [
            {
              "name": "text",
              "source": "/document/normalized_images/*/text"
            }
          ],
          "outputs": [
            {
              "name": "embedding",
              "targetName": "text_vector"
            }
          ],
          "authIdentity": null
        }
      ],
      "indexProjections": {
        "selectors": [
          {
            "targetIndexName": "something-something",
            "parentKeyFieldName": "parent_id",
            "sourceContext": "/document/normalized_images/*",
            "mappings": [
              {
                "name": "text_vector",
                "source": "/document/normalized_images/*/text_vector",
                "sourceContext": null,
                "inputs": []
              },
              {
                "name": "chunk",
                "source": "/document/normalized_images/*/text",
                "sourceContext": null,
                "inputs": []
              },
              {
                "name": "pageNumber",
                "source": "/document/normalized_images/*/pageNumber",
                "sourceContext": null,
                "inputs": []
              },
              {
                "name": "metadata_storage_path",
                "source": "/document/metadata_storage_path",
                "sourceContext": null,
                "inputs": []
              },
              {
                "name": "title",
                "source": "/document/title",
                "sourceContext": null,
                "inputs": []
              }
            ]
          }
        ],
        "parameters": {
          "projectionMode": "skipIndexingParentDocuments"
        }
      },
      "encryptionKey": null
    }
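
Once the projected index holds one search document per page, a query can ask for the page number alongside the matched text. The sketch below only builds the request for the Search Documents REST call and does not send it. The field names `chunk`, `pageNumber`, and `metadata_storage_path` come from the sample skillset; the service name, index name, key placeholder, and API version are assumptions to replace with your own values, and `build_page_search_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_page_search_request(service, index, api_key, phrase, api_version="2024-07-01"):
    """Build (but do not send) a POST request that searches the page-level index."""
    # Assumed endpoint shape: service and index names are placeholders.
    url = (
        f"https://{service}.search.windows.net"
        f"/indexes/{index}/docs/search?api-version={api_version}"
    )
    body = {
        "search": phrase,
        # Return the page text, its page number, and the source blob path.
        "select": "chunk,pageNumber,metadata_storage_path",
        "top": 5,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


request = build_page_search_request(
    "something-something", "something-something", "<query-key>", "invoice total"
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request (for example with `urllib.request.urlopen(request)`) needs a real service and query key. If the index was populated through the projection above, each hit should carry the page text in `chunk` together with its `pageNumber`.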
    
    