AI Search tokenize phone numbers

Jay Lin 20 Reputation points
2024-06-25T04:29:33.57+00:00

Hi there,

I have another question, this time about building a custom phone number analyzer.

For instance, +61 2 8364 5809 should be found when the user searches for any of the following:

  1. 61 2 8364 5809
  2. 61283645809
  3. 8364 5809
  4. 83645809
  5. 8364
  6. 836
  7. 5809

It should not be found if the user searches for:

  1. 809

I have a PatternCaptureTokenFilter (PreserveOriginal = true) followed by a PatternReplaceTokenFilter to clean up "+", "(", ")" and spaces.

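// Capture each run of characters that is not "(", ")", "+" or whitespace, and keep the original token too.
var phoneFilter = new PatternCaptureTokenFilter("phone_filter", new string[] { "([^()\\+\\s]+)" });
phoneFilter.PreserveOriginal = true;
tokenFilterList.Add(phoneFilter);
// Strip any remaining non-word characters from every token.
var phoneCleanupFilter = new PatternReplaceTokenFilter("phone_cleanup_filter", "\\W+", string.Empty);
tokenFilterList.Add(phoneCleanupFilter);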
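
To make the effect of these two filters concrete, here is a minimal stand-alone sketch that imitates them with plain .NET regular expressions. It does not call the Azure SDK. It assumes the whole phone number reaches the filters as a single token (for example from a keyword-style tokenizer, which the post does not state), and the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PhoneFilterSketch
{
    // Rough imitation of PatternCaptureTokenFilter: optionally keep the
    // original token, then emit every match of the first capture group.
    public static List<string> PatternCapture(string token, string pattern, bool preserveOriginal)
    {
        var output = new List<string>();
        if (preserveOriginal)
        {
            output.Add(token);
        }
        foreach (Match match in Regex.Matches(token, pattern))
        {
            output.Add(match.Groups[1].Value);
        }
        return output;
    }

    // Rough imitation of PatternReplaceTokenFilter: rewrite each token in place.
    public static List<string> PatternReplace(IEnumerable<string> tokens, string pattern, string replacement)
    {
        var output = new List<string>();
        foreach (string token in tokens)
        {
            output.Add(Regex.Replace(token, pattern, replacement));
        }
        return output;
    }

    // Chains the two steps in the same order as the filter list in the post.
    public static List<string> Analyze(string singleToken)
    {
        List<string> captured = PatternCapture(singleToken, "([^()\\+\\s]+)", true);
        return PatternReplace(captured, "\\W+", string.Empty);
    }

    public static void Main()
    {
        // Assumption: the tokenizer passes the number through as one token.
        Console.WriteLine(string.Join(", ", Analyze("+61 2 8364 5809")));
        // prints: 61283645809, 61, 2, 8364, 5809
    }
}
```

Under that assumption there is no token equal to 83645809, which would be consistent with requirement #4 being the only one that fails.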

Custom-Phone

This analyzer fulfills all the requirements except #4. However, as soon as I add an EdgeNGramTokenFilter after phoneFilter and phoneCleanupFilter to capture the rightmost 8 to 10 digits, every token generated above that is shorter than 8 characters is removed.

// Emit n-grams of 8 to 10 characters, anchored at the end (right side) of each token.
var eightEdgeGramsFilter = new EdgeNGramTokenFilter("8_10_edgegrams");
eightEdgeGramsFilter.MinGram = 8;
eightEdgeGramsFilter.MaxGram = 10;
eightEdgeGramsFilter.Side = EdgeNGramTokenFilterSide.Back;
tokenFilterList.Add(eightEdgeGramsFilter);
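
The loss of the short tokens can be reproduced with a small stand-alone sketch of back-anchored edge n-grams. It is a plain C# imitation, not the Azure SDK, and it assumes the earlier filters produced the tokens 61283645809, 61, 2, 8364 and 5809 (which depends on the tokenizer, not stated here). The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class BackEdgeNGramSketch
{
    // For each token, emit its trailing substrings of length minGram..maxGram.
    // A token shorter than minGram yields no grams at all, so it disappears
    // from the output instead of being passed through.
    public static List<string> BackEdgeNGrams(IEnumerable<string> tokens, int minGram, int maxGram)
    {
        var output = new List<string>();
        foreach (string token in tokens)
        {
            for (int size = minGram; size <= maxGram && size <= token.Length; size++)
            {
                output.Add(token.Substring(token.Length - size));
            }
        }
        return output;
    }

    public static void Main()
    {
        // Assumed output of the capture and cleanup filters for "+61 2 8364 5809".
        var tokens = new[] { "61283645809", "61", "2", "8364", "5809" };
        Console.WriteLine(string.Join(", ", BackEdgeNGrams(tokens, 8, 10)));
        // prints: 83645809, 283645809, 1283645809
    }
}
```

In this sketch the gram 83645809 now exists, which would satisfy requirement #4, but 61, 2, 8364 and 5809 are gone because none of them reaches the minimum gram length of 8.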

EdgeGrams-8-10

Is there a way to preserve the original tokens (something like PreserveOriginal) in EdgeNGramTokenFilter? Or is there a better way to get the rightmost 8 to 10 digits?

Regards,

Jay

Azure AI Search