How do I enable multi-language support in a single column in Azure Cognitive Search?

Mehboob Ahmad 61 Reputation points
2024-06-06T19:34:38.7033333+00:00
  1. How can I index customer information in English, Arabic, Russian, Chinese, Thai, and Japanese using a single analyzer in Azure Cognitive Search?
  2. Which analyzer is best for handling multiple languages, for example English, Arabic, Russian, Chinese, Thai, and Japanese?
  3. Previously, I was using "standard.lucene", but it does not work for Japanese and does not support wildcards.
    I have not tested it with Russian, Arabic, or Thai.

Please help me find a reliable solution.

[Screenshot: code]

[Screenshot: index JSON for the customer name field]

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.

1 answer

  1. SnehaAgrawal-MSFT 19,841 Reputation points
    2024-06-10T15:01:41.64+00:00

    @Mehboob Ahmad Thanks for asking this question!

    You should also consider language analyzers when content consists of non-Western language strings. While the default analyzer (Standard Lucene) is language-agnostic, the concept of using spaces and special characters (hyphens and slashes) to separate strings is more applicable to Western languages than non-Western ones.

    For example, in Chinese, Japanese, Korean (CJK), and other Asian languages, a space isn't necessarily a word delimiter.

    Consider the following Japanese string. Because it has no spaces, a language-agnostic analyzer would likely analyze the entire string as one token, when in fact the string is a phrase.

    これは私たちの銀河系の中ではもっとも重く明るいクラスの球状星団です。(This is the heaviest and brightest group of spherical stars in our galaxy.)

    For the example above, a successful query would have to include the full token, or a partial token using a suffix wildcard, resulting in an unnatural and limiting search experience.
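To make the point about spaces concrete, here is a minimal sketch that splits an English sentence and the Japanese sentence above on whitespace only. It is an illustration of space-based word splitting in plain Python, not the actual Standard Lucene analyzer, whose tokenization rules are more involved; the helper name `whitespace_tokens` is made up for this example.

```python
# Illustration only: plain whitespace splitting, NOT the real Standard Lucene analyzer.
english = "This is the heaviest and brightest group of spherical stars in our galaxy."
japanese = "これは私たちの銀河系の中ではもっとも重く明るいクラスの球状星団です。"


def whitespace_tokens(text):
    """Split on runs of whitespace, as a space-delimited tokenizer would."""
    return text.split()


english_tokens = whitespace_tokens(english)
japanese_tokens = whitespace_tokens(japanese)

print(len(english_tokens))   # 13 separate words
print(len(japanese_tokens))  # 1: the whole sentence is a single token

# The word for "bright" is inside the sentence, but it is not a token on its own,
# so an exact term match on it would find nothing.
print("明るい" in japanese)         # True
print("明るい" in japanese_tokens)  # False
```

With only one token for the whole sentence, a term query for 明るい has nothing to match, which is the limitation described above.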

    A better experience is to search for individual words: 明るい (Bright), 私たちの (Our), 銀河系 (Galaxy).

    Using one of the Japanese analyzers available in Azure AI Search is more likely to unlock this behavior, because those analyzers are better equipped to split the text into meaningful words in the target language.

    For more details, refer to: Add language analyzers to string fields in an Azure AI Search index
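Because an analyzer is set per field, one common way to cover several languages is to keep one searchable field per language, each with its own language analyzer. The sketch below only builds such an index definition as JSON and does not call the service. The index name `customers`, the base field name `customerName`, and the per-language field suffixes are assumptions for illustration, and the analyzer names should be checked against the supported list in the article above.

```python
import json

# Assumed mapping for this example: field suffix -> language analyzer name.
# Verify each name against the supported analyzer list before using it.
LANGUAGE_ANALYZERS = {
    "en": "en.microsoft",
    "ar": "ar.microsoft",
    "ru": "ru.microsoft",
    "zh": "zh-Hans.microsoft",
    "th": "th.microsoft",
    "ja": "ja.microsoft",
}


def build_index_definition(index_name, base_field):
    """Return an index definition with one searchable field per language."""
    # Key field that uniquely identifies each document.
    fields = [{"name": "id", "type": "Edm.String", "key": True}]
    for suffix, analyzer in LANGUAGE_ANALYZERS.items():
        fields.append(
            {
                "name": f"{base_field}_{suffix}",  # e.g. customerName_ja
                "type": "Edm.String",
                "searchable": True,
                "analyzer": analyzer,
            }
        )
    return {"name": index_name, "fields": fields}


# "customers" and "customerName" are hypothetical names.
index_definition = build_index_definition("customers", "customerName")
print(json.dumps(index_definition, indent=2, ensure_ascii=False))
```

At indexing time, the same customer name would be written to the field that matches its language (or to all of them if the language is unknown), and a query can then target the relevant fields.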

    Hope this helps. Please let us know if you have any further questions; we are happy to assist.
