Comparing strings with a '*' wildcard character

This was a nice challenge, and while I have here a version that works, I have certainly not tested it extensively. After much pondering and barking up the wrong tree on a whiteboard I started playing with code and slowly it dawned on me that there are two fundamentals to this. Firstly I had to start thinking of the two strings as the "word" and the "mask" to be applied to the word and then the eureka moment was when I realised that the pattern we are matching can be matched from any point in the unmatched "right hand side" part of the word as we move along it.

My approach is basically to work along the two strings using the mask as the outer loop, recording where we are (the character pointer in each) until they don't match, then

a) Check that the reason for the mismatch is the presence of a '*' wildcard - otherwise they just don't match

b) Check the length of the pattern we now need to match; that is the sub-portion of the mask between the '*' character and either the next '*' character or the end of the mask.

c) Check for a match for that "submask" starting from every character in the right hand side (the as yet unprocessed part of the word).

d) If none found, return false, otherwise record that we have at least one match but keep searching (Eg. Strings like these are equal: "Microsoftft" == "Mic*ft")

e) Move the character pointer (the "wordOffset" variable in the snippet) to the end of the last match

 

Anyways, here is what I have for now - could certainly do with some tidying/refactoring but it's late and I'm hoping for some good critique, the flaws pointed out and a much better way of doing this to emerge:-)

 

  private static bool CustomCompare(string word, string mask)
 {
 int wordOffset=0;
 
 if (string.IsNullOrWhiteSpace(word))
 {
 return false;
 }
 
 for (int maskOffset = 0; maskOffset < mask.Length; maskOffset++)
 {
 if (word[wordOffset] == '*') //treating the wildcard '*' char as an illegal char in the word
 {
 return false;
 }
 
 if (mask[maskOffset] == word[wordOffset])
 {
 wordOffset++;
 }
 else
 {
 if (mask[maskOffset] == '*')
 {
 if (maskOffset == mask.Length - 1) //at end of mask, so rest of string will always match
 {
 return true;
 }
 else
 {
 maskOffset++;
 //work along the string until we match again
 while ((word[wordOffset] != mask[maskOffset]) && (wordOffset < word.Length - 1))
 {
 wordOffset++;
 }
 }
 
 //if '*' was at end of fully matched mask, (i.e. "str" == str*) we're done
 if (wordOffset == word.Length - 1)
 {
 return true;
 }
 else
 {
 //now wordoffset is at the next matching character - we need to get the length of the subpattern until the next wildcard or the end of mask
 int subMaskOffset = maskOffset; //mark the start of the submask
 while (maskOffset < mask.Length - 1 && mask[maskOffset + 1] != '*')
 {
 maskOffset++;
 }
 int subMaskLength = (maskOffset - subMaskOffset) + 1;
 
 //Now we know the length of the submask we got to find the last match for it in the word
 bool everMatched = false;
 
 int wordRhs = wordOffset;
 
 while (wordRhs < word.Length - subMaskLength + 1)
 {
 bool matchedletter = true; //reset each time we match the pattern
 
 for (int y = 0; y < subMaskLength; y++)
 {
 if (word[wordRhs + y] != mask[subMaskOffset + y])
 {
 matchedletter = false;
 }
 }
 
 if (matchedletter) //all letters in pattern matched
 {
 wordOffset = wordRhs + subMaskLength; //move along the length of the match
 wordRhs = wordOffset;
 everMatched = true;
 }
 else
 {
 wordRhs++; //else move along one and keep looking for the pattern
 }
 }
 
 if (!everMatched)
 {
 return false; //submask not present in rhs of word
 }
 }
 }
 else
 {
 return false;
 }
 }
 }
 return true; //if we get here - the mask matches the word
 }

  

If this puzzle were to be used in an interview situation and the candidate delivers a working prototype (on the whiteboard) perhaps like the above (but of course perhaps much more efficient and robust), what else should we then be thinking about?

I imagine (and again would welcome comment on this, just my thoughts as of now) that the next questions for this particular one would be along the lines of these:

1) What unit tests would you write for this method

  • and my answers would probably include:
    • Pass in empty string and various patterns to match, like these:

CustomCompare("\0", "M*ft Co*ion");

CustomCompare("*", "*");

CustomCompare("Microsoftft Corporation", "M*ft Co*ion");

CustomCompare("Microsoftft", "M*ft");

CustomCompare("Microsofftft", "M*ft");

 

2) Depending on time we could ask to add processing for a single char wildcard (like '?' in DOS).

3) Could potentially for this to be implemented using recursion (where the RHS processing part is refactored as a recursive method).

Comments

  • Anonymous
    March 08, 2013
    Awesome! This really helped me out.