T-SQL : Find Word Matched Count


Introduction

We have been asked to design an algorithm which counts “No. Of Words Matched” between 2 different strings

Problem Definition

User wants to count “No. Of Words Matched” between 2 different strings, suppose we have tables ‘Test1’ & ‘Test2’ which contains columns String1 & String2 respectively. Now user wants to calculate “No. Of Words Matched” between Table 'Test1' String1 column with every row of Table 'Test2' String2 column

Total No. Of Words Matched = Find Words in String1, find Words in String2, now find how many Words from String1 exists in String2 and return the total no. of words matched between 2 strings

Please find sample table structures which we would be working for this problem



      CREATE  TABLE    TEST1
      (  
      ID1     INT  IDENTITY(1,1)
      , STRING1       VARCHAR    (    MAX    )  
      )  
   
      CREATE  TABLE    TEST2
      (  
      ID2     INT  IDENTITY(1,1)
      , STRING2       VARCHAR    (    MAX    )  
      )  

Please find the script to populate these sample tables with dummy data



  

      INSERT  INTO     TEST1(STRING1)
      SELECT  'This is String1 Row1'   UNION ALL
      SELECT  'This is String1   Row2'     UNION ALL
      SELECT  'This is String1   Row3'     UNION ALL
      SELECT  'This is String1 John Doe'   UNION ALL
      SELECT  'This is String1 Will Smith'
   
      INSERT  INTO     TEST2(STRING2)
      SELECT  'This is string2    Row1'    UNION ALL
      SELECT  'This is string2 Row2'   UNION ALL
      SELECT  'This is String2 Row3'   UNION ALL
      SELECT  'This is String2 John doe    Band'   UNION ALL
      SELECT  'This is string2 Will smith One'

Here is a copy of the image of sample data for reference to this problem:

Solution

We have created a function udf_GetMatchWordCount which accept 3 parameters, String1, String2 & CaseSensitive and return the No. Words Matched between these two String parameters, matching of Words depends on CaseSensitive parameter which defines either to perform Word matching on Case Sensitive or Case Insensitive

The syntax for udf_GetMatchWordCount function is:

Syntax : udf_GetMatchWordCount(String1,  String2, CaseSensitive)

String1 : Is the string which words you want to be matched with String2 and can be of VARCHAR(MAX)
String2 : Is the string in which you want to match words from String1 and can be of VARCHAR(MAX)
CaseSensitive : It defines either you want String1 & String2 to be compared with Case Sensitive or not, 0 for Case Insensitive, 1 for Case Sensitive

Provide these input parameters and you will be given the output, you can also use this code inside inline query or stored procedure as well



  

  

  

      IF  OBJECT_ID (N    'dbo.udf_GetMatchWordCount'    , N    'FN'    )     IS  NOT NULL
                    DROP    FUNCTION     dbo.udf_GetMatchWordCount;  
      GO  
      --Create Function udf_GetMatchWordCount to Find out Matched Word Count  
      CREATE  FUNCTION     dbo.udf_GetMatchWordCount(@String1  VARCHAR(MAX), @String2  VARCHAR(MAX), @CaseSensitive    INT)
      RETURNS INT
      AS  
      BEGIN  
   
      DECLARE @MatchingCount INT
      DECLARE @WordTable1 TABLE(String VARCHAR(MAX))
      DECLARE @WordTable2 TABLE(String VARCHAR(MAX))
   
      SELECT  @String1 = CASE WHEN  @CaseSensitive = 0 THEN UPPER(@String1)    ELSE  @String1   END
      SELECT  @String2 = CASE WHEN  @CaseSensitive = 0 THEN UPPER(@String2)    ELSE  @String2   END
   
      SELECT  @String1 = REPLACE(REPLACE(REPLACE(@String1, ' ', ' X'), 'X ', ''), 'X', '')
      SELECT  @String2 = REPLACE(REPLACE(REPLACE(@String2, ' ', ' X'), 'X ', ''), 'X', '')
   
      ;    WITH   CTE (Word, String1) AS
      (  
      SELECT  CASE     WHEN CHARINDEX(' ', @String1) > 0   THEN  SUBSTRING(@String1, 1, CHARINDEX(' ', @String1)-1) ELSE  @String1   END  AS Word
                        ,       CASE   WHEN CHARINDEX(' ', @String1) > 0   THEN  SUBSTRING(@String1, CHARINDEX(' ', @String1)+1, LEN(@String1)) ELSE  @String1   END  AS String1  
      UNION ALL
      SELECT  CASE     WHEN CHARINDEX(' ', C.String1) > 0  THEN  SUBSTRING(C.String1, 1, CHARINDEX(' ', C.String1)-1)   ELSE  C.String1  END  AS Word
                        ,       CASE   WHEN CHARINDEX(' ', C.String1) > 0  THEN  SUBSTRING(C.String1, CHARINDEX(' ', C.String1)+1, LEN(C.String1))  ELSE  '' END  AS String1  
      FROM    CTE C 
      WHERE   LEN(C.String1) < LEN(@String1)  AND C.String1 <> ''
      )  
      INSERT  INTO     @WordTable1
      SELECT  Word    FROM    CTE
   
      ;    WITH   CTE2 (Word, String2) AS
      (  
      SELECT  CASE     WHEN CHARINDEX(' ', @String2) > 0   THEN  SUBSTRING(@String2, 1, CHARINDEX(' ', @String2)-1) ELSE  @String2   END  AS Word
                        ,       CASE   WHEN CHARINDEX(' ', @String2) > 0   THEN  SUBSTRING(@String2, CHARINDEX(' ', @String2)+1, LEN(@String2)) ELSE  @String2   END  AS String2  
      UNION ALL
      SELECT  CASE     WHEN CHARINDEX(' ', C.String2) > 0  THEN  SUBSTRING(C.String2, 1, CHARINDEX(' ', C.String2)-1)   ELSE  C.String2  END  AS Word
                        ,       CASE   WHEN CHARINDEX(' ', C.String2) > 0  THEN  SUBSTRING(C.String2, CHARINDEX(' ', C.String2)+1, LEN(C.String2))  ELSE  '' END  AS String2  
      FROM    CTE2 C 
      WHERE   LEN(C.String2) < LEN(@String2)  AND C.String2 <> ''
      )  
      INSERT  INTO     @WordTable2
      SELECT  Word    FROM    CTE2
   
      ;    WITH   CTE3    (Word, WordMatchCount) AS
      (  
      SELECT  T1.String AS Word, COUNT(*) AS  WordMatchCount
      FROM    @WordTable1 T1
      INNER JOIN  @WordTable2 T2  ON   T1.String = T2.String
      GROUP BY     T1.String
      )  
      SELECT  @MatchingCount = ISNULL(SUM(WordMatchCount), 0) FROM     CTE3
   
      RETURN  @MatchingCount
      END  

Here is our desired result

--Now your final query would be;
 SELECT  T1.ID1, T1.STRING1, T2.ID2, T2.STRING2
        , LEN(REPLACE(REPLACE(REPLACE(T1.String1, ' ', ' X'), 'X ', ''), 'X', '')) - LEN(REPLACE(T1.String1, ' ', '')) + 1 AS   Table1WordCount
        , LEN(REPLACE(REPLACE(REPLACE(T2.String2, ' ', ' X'), 'X ', ''), 'X', '')) - LEN(REPLACE(T2.String2, ' ', '')) + 1 AS   Table2WordCount
        , dbo.udf_GetMatchWordCount(T1.String1, T2.String2, 1) AS  MatchedWordsCount
FROM    TEST1 T1 
INNER JOIN  TEST2 T2     ON   1 = 1
ORDER BY     T1.ID1, T2.ID2

Conclusion

This article demonstrated solution for a common problem of Finding No. Of Matched words using set-based approach, and provided good example for Recursive CTE. This article also demonstrates how to replace multiple spaces with single space.


See Also