Comparing strings with a '*' wildcard character
This was a nice challenge, and while I have here a version that works, I have certainly not tested it extensively. After much pondering and barking up the wrong tree on a whiteboard I started playing with code and slowly it dawned on me that there are two fundamentals to this. Firstly I had to start thinking of the two strings as the "word" and the "mask" to be applied to the word and then the eureka moment was when I realised that the pattern we are matching can be matched from any point in the unmatched "right hand side" part of the word as we move along it.
My approach is basically to work along the two strings using the mask as the outer loop, recording where we are (the character pointer in each) until they don't match, then
a) Check that the reason for the mismatch is the presence of a '*' wildcard - otherwise they just don't match
b) Check the length of the pattern we now need to match; that is the sub-portion of the mask between the '*' character and either the next '*' character or the end of the mask.
c) Check for a match for that "submask" starting from every character in the right hand side (the as yet unprocessed part of the word).
d) If none found, return false, otherwise record that we have at least one match but keep searching (Eg. Strings like these are equal: "Microsoftft" == "Mic*ft")
e) Move the character pointer (the "wordOffset" variable in the snippet) to the end of the last match
Anyways, here is what I have for now - could certainly do with some tidying/refactoring but it's late and I'm hoping for some good critique, the flaws pointed out and a much better way of doing this to emerge:-)
private static bool CustomCompare(string word, string mask)
{
int wordOffset=0;
if (string.IsNullOrWhiteSpace(word))
{
return false;
}
for (int maskOffset = 0; maskOffset < mask.Length; maskOffset++)
{
if (word[wordOffset] == '*') //treating the wildcard '*' char as an illegal char in the word
{
return false;
}
if (mask[maskOffset] == word[wordOffset])
{
wordOffset++;
}
else
{
if (mask[maskOffset] == '*')
{
if (maskOffset == mask.Length - 1) //at end of mask, so rest of string will always match
{
return true;
}
else
{
maskOffset++;
//work along the string until we match again
while ((word[wordOffset] != mask[maskOffset]) && (wordOffset < word.Length - 1))
{
wordOffset++;
}
}
//if '*' was at end of fully matched mask, (i.e. "str" == str*) we're done
if (wordOffset == word.Length - 1)
{
return true;
}
else
{
//now wordoffset is at the next matching character - we need to get the length of the subpattern until the next wildcard or the end of mask
int subMaskOffset = maskOffset; //mark the start of the submask
while (maskOffset < mask.Length - 1 && mask[maskOffset + 1] != '*')
{
maskOffset++;
}
int subMaskLength = (maskOffset - subMaskOffset) + 1;
//Now we know the length of the submask we got to find the last match for it in the word
bool everMatched = false;
int wordRhs = wordOffset;
while (wordRhs < word.Length - subMaskLength + 1)
{
bool matchedletter = true; //reset each time we match the pattern
for (int y = 0; y < subMaskLength; y++)
{
if (word[wordRhs + y] != mask[subMaskOffset + y])
{
matchedletter = false;
}
}
if (matchedletter) //all letters in pattern matched
{
wordOffset = wordRhs + subMaskLength; //move along the length of the match
wordRhs = wordOffset;
everMatched = true;
}
else
{
wordRhs++; //else move along one and keep looking for the pattern
}
}
if (!everMatched)
{
return false; //submask not present in rhs of word
}
}
}
else
{
return false;
}
}
}
return true; //if we get here - the mask matches the word
}
If this puzzle were to be used in an interview situation and the candidate delivers a working prototype (on the whiteboard) perhaps like the above (but of course perhaps much more efficient and robust), what else should we then be thinking about?
I imagine (and again would welcome comment on this, just my thoughts as of now) that the next questions for this particular one would be along the lines of these:
1) What unit tests would you write for this method
- and my answers would probably include:
- Pass in empty string and various patterns to match, like these:
CustomCompare("\0", "M*ft Co*ion");
CustomCompare("*", "*");
CustomCompare("Microsoftft Corporation", "M*ft Co*ion");
CustomCompare("Microsoftft", "M*ft");
CustomCompare("Microsofftft", "M*ft");
2) Depending on time we could ask to add processing for a single char wildcard (like '?' in DOS).
3) Could potentially for this to be implemented using recursion (where the RHS processing part is refactored as a recursive method).
Comments
- Anonymous
March 08, 2013
Awesome! This really helped me out.