Text split cognitive skill

Important

Some parameters are in public preview under Supplemental Terms of Use. The preview REST API supports these parameters.

The Text Split skill breaks text into chunks of text. You can specify whether you want to break the text into sentences or into pages of a particular length. This skill is useful if there are maximum text length requirements in other skills downstream, such as embedding skills that pass data chunks to embedding models on Azure OpenAI and other model providers. For more information about this scenario, see Chunk documents for vector search.

Several parameters are version-specific. The skills parameter table notes the API version in which a parameter was introduced so that you know whether a version upgrade is required. To use version-specific features such as token chunking in 2024-09-01-preview, you can use the Azure portal, or target a REST API version, or check an Azure SDK change log to see if it supports the feature.

The Azure portal supports most preview features and can be used to create or update a skillset. For updates to the Text Split skill, edit the skillset JSON definition to add new preview parameters.

Note

This skill isn't bound to Azure AI services. It's non-billable and has no Azure AI services key requirement.

@odata.type

Microsoft.Skills.Text.SplitSkill

Skill Parameters

Parameters are case-sensitive.

Parameter name Version Description
textSplitMode All versions Either pages or sentences. Pages have a configurable maximum length, but the skill attempts to avoid truncating a sentence so the actual length might be smaller. Sentences are a string that terminates at sentence-ending punctuation, such as a period, question mark, or exclamation point, assuming the language has sentence-ending punctuation.
maximumPageLength All versions Only applies if textSplitMode is set to pages. For unit set to characters, this parameter refers to the maximum page length in characters as measured by String.Length. The minimum value is 300, the maximum is 50000, and the default value is 5000. The algorithm does its best to break the text on sentence boundaries, so the size of each chunk might be slightly less than maximumPageLength.

For unit set to azureOpenAITokens, the maximum page length is the token length limit of the model. For text embedding models, a general recommendation for page length is 512 tokens.
defaultLanguageCode All versions (optional) One of the following language codes: am, bs, cs, da, de, en, es, et, fr, he, hi, hr, hu, fi, id, is, it, ja, ko, lv, no, nl, pl, pt-PT, pt-BR, ru, sk, sl, sr, sv, tr, ur, zh-Hans. Default is English (en). A few things to consider:
  • Providing a language code is useful to avoid cutting a word in half for nonwhitespace languages such as Chinese, Japanese, and Korean.
  • If you don't know the language in advance (for example, if you're using the LanguageDetectionSkill to detect language), we recommend the en default.
pageOverlapLength 2024-07-01 Only applies if textSplitMode is set to pages. Each page starts with this number of characters or tokens from the end of the previous page. If this parameter is set to 0, there's no overlapping text on successive pages. This example includes the parameter.
maximumPagesToTake 2024-07-01 Only applies if textSplitMode is set to pages. Number of pages to return. The default is 0, which means to return all pages. You should set this value if only a subset of pages are needed. This example includes the parameter.
unit 2024-09-01-preview New. Only applies if textSplitMode is set to pages. Specifies whether to chunk by characters (default) or azureOpenAITokens. Setting the unit affects maximumPageLength and pageOverlapLength.
azureOpenAITokenizerParameters 2024-09-01-preview New. An object providing extra parameters for the azureOpenAITokens unit.

encoderModelName is a designated tokenizer used for converting text into tokens, essential for natural language processing (NLP) tasks. Different models use different tokenizers. Valid values include cl100k_base (default) used by GPT-35-Turbo and GPT-4. Other valid values are r50k_base, p50k_base, and p50k_edit. The skill implements the tiktoken library by way of SharpToken and Microsoft.ML.Tokenizers but doesn't support every encoder. For example, there's currently no support for o200k_base encoding used by GPT-4o.

allowedSpecialTokens defines a collection of special tokens that are permitted within the tokenization process. Special tokens are string that you want to treat uniquely, ensuring they aren't split during tokenization. For example ["[START"], "[END]"].

Skill Inputs

Parameter name Description
text The text to split into substring.
languageCode (Optional) Language code for the document. If you don't know the language of the text inputs (for example, if you're using LanguageDetectionSkill to detect the language), you can omit this parameter. If you set languageCode to a language isn't in the supported list for the defaultLanguageCode, a warning is emitted and the text isn't split.

Skill Outputs

Parameter name Description
textItems Output is an array of substrings that were extracted. textItems is the default name of the output.

targetName is optional, but if you have multiple Text Split skills, make sure to set targetName so that you don't overwrite the data from the first skill with the second one. If targetName is set, use it in output field mappings or in downstream skills that consume the skill output, such as an embedding skill.

Sample definition

{
    "name": "SplitSkill", 
    "@odata.type": "#Microsoft.Skills.Text.SplitSkill", 
    "description": "A skill that splits text into chunks", 
    "context": "/document", 
    "defaultLanguageCode": "en", 
    "textSplitMode": "pages", 
    "unit": "azureOpenAITokens", 
    "azureOpenAITokenizerParameters":{ 
        "encoderModelName":"cl100k_base", 
        "allowedSpecialTokens": [ 
            "[START]", 
            "[END]" 
        ] 
    },
    "maximumPageLength": 512,
    "inputs": [
        {
            "name": "text",
            "source": "/document/text"
        },
        {
            "name": "languageCode",
            "source": "/document/language"
        }
    ],
    "outputs": [
        {
            "name": "textItems",
            "targetName": "pages"
        }
    ]
}

Sample input

{
    "values": [
        {
            "recordId": "1",
            "data": {
                "text": "This is the loan application for Joe Romero, a Microsoft employee who was born in Chile and who then moved to Australia...",
                "languageCode": "en"
            }
        },
        {
            "recordId": "2",
            "data": {
                "text": "This is the second document, which will be broken into several pages...",
                "languageCode": "en"
            }
        }
    ]
}

Sample output

{
    "values": [
        {
            "recordId": "1",
            "data": {
                "textItems": [
                    "This is the loan...",
                    "In the next section, we continue..."
                ]
            }
        },
        {
            "recordId": "2",
            "data": {
                "textItems": [
                    "This is the second document...",
                    "In the next section of the second doc..."
                ]
            }
        }
    ]
}

Example for chunking and vectorization

This example is for integrated vectorization.

  • pageOverlapLength: Overlapping text is useful in data chunking scenarios because it preserves continuity between chunks generated from the same document.

  • maximumPagesToTake: Limits on page intake are useful in vectorization scenarios because it helps you stay under the maximum input limits of the embedding models providing the vectorization.

Sample definition

This definition adds pageOverlapLength of 100 characters and maximumPagesToTake of one.

Assuming the maximumPageLength is 5,000 characters (the default), then "maximumPagesToTake": 1 processes the first 5,000 characters of each source document.

This example sets textItems to myPages through targetName. Because targetName is set, myPages is the value you should use to select the output from the Text Split skill. Use /document/mypages/* in downstream skills, indexer output field mappings, knowledge store projections, and index projections.

{
    "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
    "textSplitMode" : "pages", 
    "maximumPageLength": 1000,
    "pageOverlapLength": 100,
    "maximumPagesToTake": 1,
    "defaultLanguageCode": "en",
    "inputs": [
        {
            "name": "text",
            "source": "/document/content"
        },
        {
            "name": "languageCode",
            "source": "/document/language"
        }
    ],
    "outputs": [
        {
            "name": "textItems",
            "targetName": "mypages"
        }
    ]
}

Sample input (same as previous example)

{
    "values": [
        {
            "recordId": "1",
            "data": {
                "text": "This is the loan application for Joe Romero, a Microsoft employee who was born in Chile and who then moved to Australia...",
                "languageCode": "en"
            }
        },
        {
            "recordId": "2",
            "data": {
                "text": "This is the second document, which will be broken into several sections...",
                "languageCode": "en"
            }
        }
    ]
}

Sample output (notice the overlap)

Within each "textItems" array, trailing text from the first item is copied into the beginning of the second item.

{
    "values": [
        {
            "recordId": "1",
            "data": {
                "textItems": [
                    "This is the loan...Here is the overlap part",
                    "Here is the overlap part...In the next section, we continue..."
                ]
            }
        },
        {
            "recordId": "2",
            "data": {
                "textItems": [
                    "This is the second document...Here is the overlap part...",
                    "Here is the overlap part...In the next section of the second doc..."
                ]
            }
        }
    ]
}

Error cases

If a language isn't supported, a warning is generated.

See also