Azure search not behaving as expected for dashes

Jia, Johnson (TBS) 6 Reputation points
2021-01-21T15:41:52.357+00:00

I'm having an issue with Azure Search on the following example data set: abc-123-456, abc-123-457, abc-123-458, etc. When I search for abc-123-456, I expect only one result, but instead I get every result containing abc-123-... Is there a setting or some other way to change this behavior?

Current search settings:

TheSearchIndex.TokenFilters.Add(new EdgeNGramTokenFilter("frontEdgeNGram")
{
    Side = EdgeNGramTokenFilterSide.Front,
    MinGram = 3,
    MaxGram = 20
});
TheSearchIndex.Analyzers.Add(new CustomAnalyzer("FrontEdgeNGram", LexicalTokenizerName.Whitespace)
{
    TokenFilters = { TokenFilterName.Lowercase, new TokenFilterName("frontEdgeNGram"), TokenFilterName.Classic, TokenFilterName.AsciiFolding }
});
SearchOptions UsersSearchOptions = new SearchOptions()
{
    QueryType = SearchQueryType.Simple,
    SearchMode = SearchMode.All,
};

Using Azure.Search.Documents version 11.1.1.

Edit: Searching for abc-123-456* (with the trailing asterisk) returns the one result I expect. How can I get this behavior by default?

Azure AI Search

3 answers

  1. Campbell, David (TBS) 46 Reputation points
    2021-01-29T15:49:26.033+00:00

    Sorry, this doesn't quite help, as it doesn't address the question. If it does, I'm confused.

    We are using FrontEdgeNGram (a custom analyzer with a front edge n-gram filter) because we want partial matches without always doing prefix searches, which are less performant.
    Are these values not added to the "inverted indexes", or are they added to the "unprocessed/internal index"?
    With that analyzer, the example is broken into:

    {
      "value": [
        {
          "token": "abc",
          "startOffset": 0,
          "endOffset": 11,
          "position": 0
        },
        {
          "token": "abc-",
          "startOffset": 0,
          "endOffset": 11,
          "position": 0
        },
        {
          "token": "abc-1",
          "startOffset": 0,
          "endOffset": 11,
          "position": 0
        },
        {
          "token": "abc-12",
          "startOffset": 0,
          "endOffset": 11,
          "position": 0
        },
        {
          "token": "abc-123",
          "startOffset": 0,
          "endOffset": 11,
          "position": 0
        },
        {
          "token": "abc-123-",
          "startOffset": 0,
          "endOffset": 11,
          "position": 0
        },
        {
          "token": "abc-123-4",
          "startOffset": 0,
          "endOffset": 11,
          "position": 0
        },
        {
          "token": "abc-123-45",
          "startOffset": 0,
          "endOffset": 11,
          "position": 0
        },
        {
          "token": "abc-123-456",
          "startOffset": 0,
          "endOffset": 11,
          "position": 0
        }
      ]
    }
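The token list above can be reproduced with a few lines of plain Python. This is only a rough model of the analyzer chain from the question (lowercase, then front edge n-grams with MinGram = 3 and MaxGram = 20), not the Azure SDK; the function name `edge_ngrams` is made up for illustration.

```python
def edge_ngrams(term, min_gram=3, max_gram=20):
    """Return the front edge n-grams of one whitespace-delimited term.

    Hypothetical helper: a simplified stand-in for the Lowercase and
    EdgeNGram token filters configured in the question.
    """
    term = term.lower()  # lowercase first, as in the configured filter order
    longest = min(max_gram, len(term))
    # One prefix per length from min_gram up to the term length (capped at max_gram).
    return [term[:n] for n in range(min_gram, longest + 1)]


tokens = edge_ngrams("abc-123-456")
print(tokens)
```

Every prefix of the value from 3 characters upward becomes its own indexed token, which is what makes partial matches possible without a wildcard.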
    

    So, as you can see, if I search for abc-123-456, it should be found without the * forcing a prefix search.
    If I search for abc-123, I expect it to be found, but it also finds too much (with a larger example data set).
    abc-123-45 finds too much, while abc-123-45* finds the correct items.

    Am I incorrect in my assumptions?

    Also, how is the typed search text analyzed? Is it passed "as is" to each analyzer defined in the application?
    We use more than one analyzer in the application, depending on the field content.

    1 person found this answer helpful.

  2. Grmacjon-MSFT 17,886 Reputation points
    2021-01-29T03:50:59.67+00:00

    Hi @Jia, Johnson (TBS) ,

    Apologies for the delay in response. I responded to your SO post. Posting here for visibility.

    This is expected behavior when using the EdgeNGramTokenFilter inside the custom analyzer configuration. The text “abc-123-456” is broken into smaller tokens such as “abc”, “abc-”, “abc-1”, “abc-12”, “abc-123”, ..., “abc-123-456”. Use the Analyze API to see the full list of tokens generated by a particular analyzer.

    For the query abc-123, if the default analyzer is used, the query terms will be abc and 123, and the query will match all documents that contain these terms. A prefix query, on the other hand, is not analyzed and looks for documents that contain the prefix as is (“abc-123”). A prefix search bypasses full-text search and looks for verbatim matches, which is why the correct result comes back. Full-text search runs over tokens in inverted indexes. Everything else (filters, fuzzy, regex, prefix/wildcard, etc.) runs over verbatim strings in a separate unprocessed/internal index.

    Another option is to set only the search analyzer on the field to keyword, so that the input query is not broken into tokens.

    Hope that helps.

    Thanks,
    Grace


  3. Grmacjon-MSFT 17,886 Reputation points
    2021-02-27T03:50:42.503+00:00

    Hi @Jia, Johnson (TBS) ,

    My deepest apologies for the delay in response. I reached out to the engineering team to get a better solution for your scenario. Here is their recommendation:

    All the tokens produced by the Analyze API are indexed. The observed behavior occurs because a prefix query (“abc-123-456*”) doesn’t undergo lexical analysis, whereas a regular query (“abc-123-456”) does. So in the first case there is only one query token (the entire prefix) to match, but in the second case there are multiple tokens to match. More details can be found here: Lucene query syntax - Azure Cognitive Search | Microsoft Learn
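The difference between the two query types can be modeled in plain Python. This is a simplified sketch, not the service's actual matching logic. It assumes the query is analyzed with the same edge n-gram analyzer as the field, and that a document matches when it contains any of the resulting query tokens. All names (`build_index`, `analyzed_query`, `prefix_query`) and the fourth sample value are hypothetical.

```python
def edge_ngrams(term, min_gram=3, max_gram=20):
    # Simplified stand-in for the Lowercase + front EdgeNGram filters.
    term = term.lower()
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]


def build_index(docs):
    # Toy inverted index: token -> set of documents containing that token.
    index = {}
    for doc in docs:
        for token in edge_ngrams(doc):
            index.setdefault(token, set()).add(doc)
    return index


def analyzed_query(index, text):
    # Assumption: the query text goes through the same analyzer, and any
    # one of the resulting tokens is enough for a document to match.
    hits = set()
    for token in edge_ngrams(text):
        hits |= index.get(token, set())
    return sorted(hits)


def prefix_query(index, prefix):
    # The prefix is not analyzed: it is compared as-is against indexed tokens.
    hits = set()
    for token, docs in index.items():
        if token.startswith(prefix):
            hits |= docs
    return sorted(hits)


DOCS = ["abc-123-456", "abc-123-457", "abc-123-458", "abd-999-000"]
INDEX = build_index(DOCS)

print(analyzed_query(INDEX, "abc-123-456"))  # matches every document sharing the token "abc"
print(prefix_query(INDEX, "abc-123-456"))    # matches only the document with that exact prefix
```

In this model the analyzed query over-matches because its shortest token, "abc", is present in three of the four documents, while the prefix query has a single long token to satisfy.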

    A possible solution is to use separate analyzers for indexing and search: the custom analyzer created by the customer as the indexing analyzer, and the keyword analyzer as the search analyzer. The keyword analyzer passes the query through as a single keyword token and eliminates the excess matches. See: Add custom analyzers to string fields - Azure Cognitive Search | Microsoft Learn
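A minimal plain-Python model of this recommendation: the index side keeps the edge n-gram tokens, while the search side treats the whole query as one unmodified token. It illustrates the idea only and does not use the Azure SDK. The helper names and the `keyword_search` function are hypothetical.

```python
def edge_ngrams(term, min_gram=3, max_gram=20):
    # Index-time analysis: simplified Lowercase + front EdgeNGram filters.
    term = term.lower()
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]


def build_index(docs):
    # Toy inverted index built with the index-time analyzer only.
    index = {}
    for doc in docs:
        for token in edge_ngrams(doc):
            index.setdefault(token, set()).add(doc)
    return index


def keyword_search(index, query):
    # Search-time analysis modeled on a keyword analyzer: the entire
    # query becomes a single token, with no splitting and no lowercasing.
    return sorted(index.get(query, set()))


INDEX = build_index(["abc-123-456", "abc-123-457", "abc-123-458"])

print(keyword_search(INDEX, "abc-123-456"))  # exact value: one document
print(keyword_search(INDEX, "abc-123"))      # partial value: all three documents
print(keyword_search(INDEX, "ABC-123-456"))  # no match in this model (see note below)
```

Partial matches still work because the prefixes were produced at index time. One thing this model suggests checking: a search analyzer that does not lowercase will not match lowercased index tokens, so a custom search analyzer with a keyword tokenizer plus a lowercase filter may be preferable if queries can arrive in mixed case.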

    [Image: 72520-kw.png]

    Let us know if you have any further questions.

    -Grace

