Error in Azure Cognitive Search Service when storing document page associated to each chunk extracted from PDF in a custom WebApiSkill

Mikel Broström Zalba 20 Reputation points
2024-06-10T15:30:57+00:00

I have the following custom WebApiSkill:


import json
import logging
import re

import azure.functions as func

app = func.FunctionApp()

# split_text_into_chunks_with_overlap is defined elsewhere (not shown here)

@app.route(route="CustomSplitSkill", auth_level=func.AuthLevel.FUNCTION)
def CustomSplitAndPageSkill(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')
    try:
        req_body = req.get_json()
    except ValueError:
        return func.HttpResponse("Invalid input", status_code=400)
    try:
        # 'values' is the expected top-level key in the request body
        response_body = {"values": []}
        for value in req_body.get('values', []):
            recordId = value.get('recordId')
            text = value.get('data', {}).get('text', '')
            # Replace runs of apostrophes, commas, periods, and newlines,
            # and any sequence of digits, with a single space
            cleaned_text = re.sub(r"[',.\n]+|\d+", ' ', text)
            # Replace multiple spaces with a single space and trim leading/trailing spaces
            cleaned_text = re.sub(r'\s{2,}', ' ', cleaned_text).strip()
            # Remove sequences of ". " repeated three or more times
            cleaned_text = re.sub(r"(\. ){3,}", "", cleaned_text)
            chunks, page_numbers = split_text_into_chunks_with_overlap(cleaned_text, chunk_size=256, overlap_size=20)

            # Response object for a specific PDF
            response_record = {
                "recordId": recordId,
                "data": {
                    "textItems": chunks,  # chunks is a list of str
                    "numberItems": page_numbers  # page_numbers is a list of int
                }
            }
            response_body['values'].append(response_record)
        return func.HttpResponse(json.dumps(response_body), mimetype="application/json")
    except ValueError:
        return func.HttpResponse("Function app crashed", status_code=400)

The inputs and outputs of this skill in the skillset are defined like this:


inputs=[
    InputFieldMappingEntry(name="text", source="/document/content")
],
outputs=[
    OutputFieldMappingEntry(name="textItems", target_name="pages"),
    OutputFieldMappingEntry(name="numberItems", target_name="numbers")
],

And the SearchIndexerIndexProjectionSelector is configured in the following way:


index_projections = SearchIndexerIndexProjections(
    selectors=[
        SearchIndexerIndexProjectionSelector(
            target_index_name=index_name,
            parent_key_field_name="parent_id",
            source_context="/document/pages/*",
            mappings=[
                InputFieldMappingEntry(name="chunk", source="/document/pages/*"),
                InputFieldMappingEntry(name="vector", source="/document/pages/*/vector"),
                InputFieldMappingEntry(name="title", source="/document/metadata_storage_name"),
                InputFieldMappingEntry(name="page_number", source="/document/numbers/*"),
            ],
        ),
    ],
    parameters=SearchIndexerIndexProjectionsParameters(
        projection_mode=IndexProjectionMode.SKIP_INDEXING_PARENT_DOCUMENTS
    ),
)

My search fields look like this:


fields = [
    SearchField(
        name="parent_id",
        type=SearchFieldDataType.String,
        sortable=True,
        filterable=True,
        facetable=True
    ),
    SearchField(
        name="title",
        type=SearchFieldDataType.String
    ),
    SearchField(
        name="chunk_id",
        type=SearchFieldDataType.String,
        key=True,
        sortable=True,
        filterable=True,
        facetable=True,
        analyzer_name="keyword"
    ),
    SearchField(
        name="chunk",
        type=SearchFieldDataType.String,
        sortable=False,
        filterable=False,
        facetable=False
    ),
    SearchField(
        name="vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        vector_search_dimensions=1536,
        vector_search_profile_name="myHnswProfile"
    ),
    SearchField(
        name="page_number",
        type=SearchFieldDataType.Int32,
        sortable=True,
        filterable=True,
        facetable=True
    ),
]

I get the following error:

The data field 'page_number' in the document with key 'xyz' has an invalid value of type 'Edm.String' ('String maps to Edm.String'). The expected type was 'Edm.Int32'.

When I change the type of the page_number field to String, indexing succeeds, but every chunk gets the whole list as a single string under page_number:

"page_number": "[1,2,3,4,5,6,7,...]"

But I want each chunk to store only the single page number it belongs to.


2 answers

  1. brtrach-MSFT 15,786 Reputation points Microsoft Employee
    2024-06-11T01:40:22.4833333+00:00

    @Mikel Broström Zalba The error message you're seeing indicates that the page_number field is expected to be of type Edm.Int32, but a value of type Edm.String is being provided. This type mismatch is what causes indexing to fail.

    In your custom WebApiSkill, you’re returning page_numbers (which is an integer list) as part of the response. However, when this data is being indexed, it seems to be treated as a string ("page_number": "[1,2,3,4,5,6,7,...]"), which is causing the type mismatch.

    To resolve this issue, you might need to adjust how the page_number data is being handled in your skill or during the indexing process.

    1. Check the data type in your skill. Ensure that page_numbers really is a list of integers when your skill returns it. You can add logging to your skill to confirm this, for example: print(type(page_numbers), page_numbers)
    2. Adjust the indexing process. If the page_numbers data is correct in your skill, the issue might be with how this data is being indexed. You might need to adjust your indexer or the field mappings to correctly handle the page_number data as an integer list.
    3. Flatten the list. If you want to get a single value under each chunk, you might need to flatten the list of page numbers so that each chunk is associated with a single page number. This would involve adjusting your skill to return a list of records, each containing a single chunk and its associated page number.

    Remember to update your index schema to reflect these changes if necessary.
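
The "flatten the list" suggestion can be sketched as follows. One plausible reading of the error is that, with `source_context="/document/pages/*"`, the path `/document/numbers/*` does not resolve to one item per chunk, so the whole array ends up serialized as a string. Returning each chunk together with its page number as one object avoids that. This is only a sketch: `build_record` and the keys `pages`, `text`, and `page_number` are hypothetical names chosen for illustration, not part of the original skill.

```python
from __future__ import annotations

from typing import Any


def build_record(record_id: str, chunks: list[str], page_numbers: list[int]) -> dict[str, Any]:
    """Pair every chunk with its own page number in a single skill output.

    Hypothetical helper: the output key "pages" and the item keys "text" and
    "page_number" are assumptions, not names required by Azure AI Search.
    """
    # Each chunk needs exactly one page number.
    if len(chunks) != len(page_numbers):
        raise ValueError("chunks and page_numbers must have the same length")
    # Reject values that would not fit an Edm.Int32 field (bool is an int subclass).
    if not all(isinstance(n, int) and not isinstance(n, bool) for n in page_numbers):
        raise TypeError("page_numbers must contain only integers")

    pages = [{"text": c, "page_number": n} for c, n in zip(chunks, page_numbers)]
    return {"recordId": record_id, "data": {"pages": pages}}


if __name__ == "__main__":
    print(build_record("0", ["first chunk", "second chunk"], [1, 2]))
```

With an output shaped like this, the skill would expose a single output (for example `OutputFieldMappingEntry(name="pages", target_name="pages")`), and the projection mappings could then point at `/document/pages/*/text` and `/document/pages/*/page_number`, assuming the field names above.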

    1 person found this answer helpful.

  2. Mikel Broström Zalba 20 Reputation points
    2024-06-13T10:41:22.32+00:00

    This answer on Stack Overflow solved my issue:

    https://stackoverflow.com/questions/78602675/error-in-azure-cognitive-search-service-when-storing-document-page-associated-to/78615922?noredirect=1#comment138603092_78615922

    I still need to figure out how to map the vectors generated by AzureOpenAIEmbeddingSkill.
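
For the vector mapping, one possible arrangement is to run the embedding skill once per chunk by setting its context to `/document/pages/*`, so that each vector lands next to the chunk it was computed from. The fragments below are a hedged sketch written as plain Python dicts in the REST-style shape of a skillset definition. They assume the custom skill returns each entry of `pages` as an object with hypothetical `text` and `page_number` keys, and they leave out the connection settings the embedding skill also needs (such as the resource URI and deployment).

```python
import json

# Assumption: the custom skill output is /document/pages, a list of objects
# shaped like {"text": "...", "page_number": 3}.
CONTEXT = "/document/pages/*"

# Embedding skill evaluated once per chunk; its output is stored at /document/pages/*/vector.
embedding_skill = {
    "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
    "context": CONTEXT,
    "inputs": [{"name": "text", "source": f"{CONTEXT}/text"}],
    "outputs": [{"name": "embedding", "targetName": "vector"}],
}

# Projection mappings: per-chunk values live under the context path,
# while the title comes from the parent document.
projection_mappings = [
    {"name": "chunk", "source": f"{CONTEXT}/text"},
    {"name": "vector", "source": f"{CONTEXT}/vector"},
    {"name": "page_number", "source": f"{CONTEXT}/page_number"},
    {"name": "title", "source": "/document/metadata_storage_name"},
]


def per_chunk_fields(mappings: list, context: str) -> list:
    """Return the names of index fields whose source is resolved per chunk."""
    return [m["name"] for m in mappings if m["source"].startswith(context + "/")]


if __name__ == "__main__":
    print(json.dumps(embedding_skill, indent=2))
    print(per_chunk_fields(projection_mappings, CONTEXT))
```

The same paths can be used with `InputFieldMappingEntry` in the Python SDK. The point of the sketch is only that the chunk text, vector, and page number all sit under the same `/document/pages/*` item.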
