Azure search not behaving as expected for dashes

Question

I'm having an issue when using Azure Search with the following example data set: abc-123-456, abc-123-457, abc-123-458, etc. When searching for abc-123-456, I expected to get only one result, but instead I get all results containing abc-123-... Is there some setting or way to change this behavior?

Current search settings:

TheSearchIndex.TokenFilters.Add(new EdgeNGramTokenFilter("frontEdgeNGram")
{
    Side = EdgeNGramTokenFilterSide.Front,
    MinGram = 3,
    MaxGram = 20
});
TheSearchIndex.Analyzers.Add(new CustomAnalyzer("FrontEdgeNGram", LexicalTokenizerName.Whitespace)
{
    TokenFilters = { TokenFilterName.Lowercase, new TokenFilterName("frontEdgeNGram"), TokenFilterName.Classic, TokenFilterName.AsciiFolding }
});
SearchOptions UsersSearchOptions = new SearchOptions()
{
    QueryType = SearchQueryType.Simple,
    SearchMode = SearchMode.All,
};

Using Azure.Search.Documents version 11.1.1.

Edit: Searching for abc-123-456* (with the asterisk) gives me the one result as expected. How can I get this behavior by default?

Answer

Sorry, this doesn't quite help, as it doesn't address the question. At least, if it does, I am confused.

We are using FrontEdgeNGram (a custom analyzer with a front edge n-gram token filter) because we want partial matches without always doing prefix searches (which are less performant).
Are these values not added to the "inverted indexes", or are they added to the "unprocessed/internal index"?
With that analyzer, the example is broken into:

{
  "value": [
    {
      "token": "abc",
      "startOffset": 0,
      "endOffset": 11,
      "position": 0
    },
    {
      "token": "abc-",
      "startOffset": 0,
      "endOffset": 11,
      "position": 0
    },
    {
      "token": "abc-1",
      "startOffset": 0,
      "endOffset": 11,
      "position": 0
    },
    {
      "token": "abc-12",
      "startOffset": 0,
      "endOffset": 11,
      "position": 0
    },
    {
      "token": "abc-123",
      "startOffset": 0,
      "endOffset": 11,
      "position": 0
    },
    {
      "token": "abc-123-",
      "startOffset": 0,
      "endOffset": 11,
      "position": 0
    },
    {
      "token": "abc-123-4",
      "startOffset": 0,
      "endOffset": 11,
      "position": 0
    },
    {
      "token": "abc-123-45",
      "startOffset": 0,
      "endOffset": 11,
      "position": 0
    },
    {
      "token": "abc-123-456",
      "startOffset": 0,
      "endOffset": 11,
      "position": 0
    }
  ]
}

Therefore, as you can see, if I search for abc-123-456 then it should be found without the * forcing a prefix search.
If I search for abc-123, I expect it to be found, but the search also returns too much (with a larger example data set).
abc-123-45 finds too much; abc-123-45* finds the correct items.

Am I incorrect in my assumptions?

Also, how is the typed search text analyzed? Is it passed "as is" to each analyzer defined in the application?
We do use more than one analyzer in the application, depending on the field content.

Answer

Hi @Jia, Johnson (TBS) ,

Apologies for the delay in response. I responded to your SO post. Posting here for visibility.

This is expected behavior when using the EdgeNGramTokenFilter inside the custom analyzer configuration. The text “abc-123-456” is broken into smaller tokens like “abc”, “abc-1”, “abc-12”, “abc-123”, ..., “abc-123-456”. Check out the Analyze API for the full list of tokens generated by a particular analyzer.
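For reference, the token list can also be requested from code rather than through the REST endpoint. The following is a minimal sketch, assuming the Azure.Search.Documents 11.x SDK and C# 9 top-level statements; the service URL, API key, and index name are placeholders rather than values from this thread, and "FrontEdgeNGram" is the analyzer name from the question.

```csharp
using System;
using System.Collections.Generic;
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

// Placeholder endpoint and key (assumptions): replace with your own service values.
var indexClient = new SearchIndexClient(
    new Uri("https://your-service.search.windows.net"),
    new AzureKeyCredential("your-admin-api-key"));

// Ask the service how the custom analyzer from the question tokenizes the sample value.
var options = new AnalyzeTextOptions("abc-123-456", new LexicalAnalyzerName("FrontEdgeNGram"));
Response<IReadOnlyList<AnalyzedTokenInfo>> response =
    indexClient.AnalyzeText("your-index-name", options); // placeholder index name

// Print each token with its offsets and position.
foreach (AnalyzedTokenInfo token in response.Value)
{
    Console.WriteLine($"{token.Token} [{token.StartOffset}-{token.EndOffset}] position={token.Position}");
}
```

Running the same call with the analyzer that is actually applied at query time shows which terms the query text turns into.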

For a query - abc-123, if the default analyzer is being used, the query terms will be abc and 123 and will match all the documents that contain these terms. The prefix query on the other hand is not analyzed and looks for documents that contain the prefix as is “abc-123”. A prefix search bypasses full-text search and looks for verbatim matches, which is why the correct result is coming back. Full-text search is over tokens in inverted indexes. Everything else (filters, fuzzy, regex, prefix/wildcard, etc.) is over verbatim strings in a separate unprocessed/internal index.

Another option is to set only the search analyzer on the field to keyword, so that the input query is not broken into multiple tokens.

Hope that helps.

Thanks,
Grace

Answer

Hi @Jia, Johnson (TBS) ,

My deepest apologies for the delay in response. I reached out to the engineering team to get a better solution for your scenario. Here is their recommendation:

All the tokens produced by the Analyze API are indexed. The observed behavior occurs because a prefix query (“abc-123-456*”) doesn’t undergo lexical analysis, whereas the plain query (“abc-123-456”) does. So in the first case there is only one query token (the entire prefix) to match, but in the second case there are multiple tokens to match. More details can be found in "Lucene query syntax - Azure Cognitive Search | Microsoft Learn".
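The difference can be illustrated with a small stand-alone simulation that does not call the service. This is only a sketch under simplifying assumptions: the analyzer is modeled as lowercase plus front edge n-grams (3 to 20 characters), and query tokens produced from one term are treated as alternatives, so any one of them is enough to match. The helper name EdgeNGrams is made up for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified model of the custom analyzer: lowercase, then front edge n-grams.
static IEnumerable<string> EdgeNGrams(string text, int min = 3, int max = 20)
{
    string lower = text.ToLowerInvariant();
    for (int n = min; n <= Math.Min(max, lower.Length); n++)
    {
        yield return lower.Substring(0, n);
    }
}

string[] docs = { "abc-123-456", "abc-123-457", "abc-123-458" };

// Index time: each document is stored under all of its edge n-grams.
var index = docs.ToDictionary(d => d, d => new HashSet<string>(EdgeNGrams(d)));

// Analyzed query: the query text is expanded into n-grams too, and (by assumption)
// a document matches if it contains any of them, e.g. the shared token "abc".
List<string> analyzedQuery = EdgeNGrams("abc-123-456").ToList();
var broadMatches = docs.Where(d => analyzedQuery.Any(t => index[d].Contains(t)));

// Single-token query: only the complete term is looked up in the index.
var exactMatches = docs.Where(d => index[d].Contains("abc-123-456"));

Console.WriteLine(string.Join(", ", broadMatches)); // abc-123-456, abc-123-457, abc-123-458
Console.WriteLine(string.Join(", ", exactMatches)); // abc-123-456
```

The real service scores and combines terms in more detail than this, but the sketch shows why a query that is split into short n-grams matches every document sharing the "abc" prefix.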

A possible solution is to use separate analyzers for search and indexing: the indexing analyzer is the custom analyzer created by the customer, and the search analyzer is the keyword analyzer. The keyword analyzer passes the query through as a single keyword token and eliminates the excess matches. See "Add custom analyzers to string fields - Azure Cognitive Search | Microsoft Learn".
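In the .NET SDK, this recommendation might look like the field definition below. It is a sketch assuming Azure.Search.Documents 11.x; the field name "code" is hypothetical, and "FrontEdgeNGram" is the custom analyzer from the question.

```csharp
using Azure.Search.Documents.Indexes.Models;

// Hypothetical field name; use the field that currently has the custom analyzer.
var codeField = new SearchableField("code")
{
    // Index time: keep the edge n-gram analyzer so partial matches remain possible.
    IndexAnalyzerName = new LexicalAnalyzerName("FrontEdgeNGram"),

    // Query time: the keyword analyzer keeps the whole query text as one token.
    SearchAnalyzerName = LexicalAnalyzerName.Keyword,

    // AnalyzerName is left unset: it cannot be combined with the two properties above.
};
```

Two caveats apply. The keyword analyzer does not lowercase its input, so a query typed in upper case would not match the lowercased tokens in the index; a custom search analyzer built from the keyword tokenizer and the lowercase filter is one way around that. Changing the analyzers on a field that already exists may also require rebuilding the index.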

Let us know if you have any further questions.

-Grace
