T-SQL: Sort data by multiple languages

Introduction

Most salaried developers and DBAs work with the same databases, the same applications, and the same specific language every day. By default, SQL Server Setup automatically detects the Windows system locale and selects the appropriate SQL Server collation. For most users this fits their needs.

SQL Server supports several collations. A collation encodes the rules governing the proper use of characters. Each SQL Server collation specifies three properties: (1) The sort order to use for Unicode data types, (2) The sort order to use for non-Unicode character data types, (3) The code page used to store non-Unicode character data. 

   A sort order defines the sequence in which characters are sorted, and the way characters are evaluated in comparison operations.

As long as our database stores data in a single collation, things are simple: we sort using that collation. We can specify the sorting collation with a simple hint at the query level, for example: "ORDER BY <columns list> COLLATE Indic_General_100_CI_AI", or we can rely on the default collation if it fits our data. Things become much more complex once our database needs to support multiple languages and cultures. In that case, we need to sort our data using multiple collations in order to get the correct sort order for each language. In this article we focus on sorting data by multiple languages.
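As a minimal sketch of the query-level hint (the table and column names here are hypothetical, chosen only for illustration):

```sql
-- Sort using the column's default collation:
SELECT ProductName FROM Products ORDER BY ProductName;

-- Override the collation at the query level for a specific language's rules:
SELECT ProductName
FROM Products
ORDER BY ProductName COLLATE Indic_General_100_CI_AI;
```

The COLLATE clause affects only this query's sort; it does not change the column's stored collation.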

   If you are dealing with multiple languages, it is highly recommended to always specify the sorting collation(s) for each value. We can use a column that stores the culture name, for example.

Our Case Study

We have a table column that stores data in multiple languages. We need to retrieve the data sorted by language: for example, Hindi data first, then Hebrew data, then English data, and so on. Moreover, within each language the data should be sorted according to that language's rules, specified by the right collation; otherwise the sorting will have no meaning!

  This article is based on a question in the MSDN forum.

Our first case study is a simple case where we have the collation information in the table, while the second case study, which is the actual question from the forum, is more complex, since we have no information regarding the data's collation.

  It is highly recommended to store the data's culture in the table.

First Case Study

In most cases it is better to store the culture rather than the specific SQL Server collation. The culture can be, and probably already is, used by the application that works with the database; a common example is an application that displays data according to the user's culture.
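One way to keep the two concerns separate is a small lookup that translates a stored culture name into the collation a query needs. This is only a sketch; the table name `CultureCollationMap` and its contents are assumptions, not part of the original article:

```sql
-- Hypothetical mapping table: culture name -> SQL Server collation name.
-- The application (or a dynamic query) can use this to pick the right COLLATE.
CREATE TABLE CultureCollationMap (
    CultureName   VARCHAR(10) PRIMARY KEY, -- e.g. 'he-IL'
    CollationName SYSNAME                  -- e.g. 'Hebrew_CI_AS'
);

INSERT INTO CultureCollationMap (CultureName, CollationName)
VALUES ('he-IL', 'Hebrew_CI_AS'),
       ('en-US', 'SQL_Latin1_General_CP1_CI_AS'),
       ('ar-EG', 'Arabic_CI_AS');
```

Note that the COLLATE clause itself requires a literal collation name, so this mapping is consumed by the application or by dynamic SQL, not referenced directly in an ORDER BY.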

/*************************************************** DDL */
CREATE TABLE MultipleLanguages (
    UnicodeData NVARCHAR(200),
    Collation VARCHAR(5) -- stores the culture name, e.g. 'he-IL'
)
GO
 
/*************************************************** DML */
INSERT INTO MultipleLanguages (UnicodeData,Collation)
VALUES
     (N'This is an example'        ,'en-US')
    , (N'यह एक उदाहरण है.'          ,'hi-IN')
    , (N'उदाहरण है.'                    ,'hi-IN')
    , (N'ಈ ಒಂದು ಉದಾಹರಣೆಯಾಗಿದೆ.' ,'kn-IN')
    , (N'דוגמה'                     ,'he-IL')
    , (N'एक उदाहरण है.'             ,'hi-IN')
    , (N'هذا مثال على ذلك.'     ,'ar-EG')
    , (N'To je příklad.'            ,'cs')
    , (N'זה רק דוגמה'               ,'he-IL')
GO
 
select * from MultipleLanguages
GO
 
/***************************************************  */

 

Solution 01, First Case

This solution is based on selecting each language's data separately and UNIONing all the information together. In order to capture the sorting collation of each subquery, we use the ROW_NUMBER function. Since each subquery includes only the data for one culture, the numbers start from 1 for each culture. We need one more sort (ORDER BY Collation, RN) at the end.

/*************************************************** Solution 01 case 01 */
select UnicodeData,Collation
from (
      
    select ML.UnicodeData, ML.Collation, ROW_NUMBER() OVER (order by ML.UnicodeData Collate Hebrew_CI_AS) RN
    from MultipleLanguages ML
    where ML.Collation = 'he-IL'
  
    UNION ALL
  
    select ML.UnicodeData, ML.Collation, ROW_NUMBER() OVER (order by ML.UnicodeData Collate SQL_Latin1_General_CP1_CI_AS) RN
    from MultipleLanguages ML
    where ML.Collation = 'en-US'
  
    UNION ALL
  
    select ML.UnicodeData, ML.Collation, ROW_NUMBER() OVER (order by ML.UnicodeData Collate Indic_General_100_CI_AS) RN
    from MultipleLanguages ML
    where ML.Collation = 'hi-IN' or ML.Collation = 'kn-IN'
  
    UNION ALL
  
    select ML.UnicodeData, ML.Collation, ROW_NUMBER() OVER (order by ML.UnicodeData Collate Arabic_CI_AS) RN
    from MultipleLanguages ML
    where ML.Collation = 'ar-EG'
  
    UNION ALL
  
    select ML.UnicodeData, ML.Collation, ROW_NUMBER() OVER (order by ML.UnicodeData Collate Czech_CI_AS) RN
    from MultipleLanguages ML
    where ML.Collation = 'cs'
  
) t
order by Collation, RN
GO

 

Solution 02, First Case

This solution is also based on selecting each language separately and UNIONing the results (same as the previous one). Here we sort the data in each subquery directly, but since ORDER BY is only allowed in a subquery together with TOP, we use TOP with a number large enough to select all the rows. In most cases this solution performs better than the previous one. Keep in mind, though, that without an outer ORDER BY, SQL Server does not guarantee that the subquery order survives the UNION ALL; treat this as an implementation-dependent optimization.

/*************************************************** Solution 02 case 01 */
select * from (
    select top 100000 ML.UnicodeData, ML.Collation
    from MultipleLanguages ML
    where ML.Collation = 'he-IL'
    Order by ML.UnicodeData collate Hebrew_CI_AS
) T
UNION ALL
select * from (
    select top 100000 ML.UnicodeData, ML.Collation
    from MultipleLanguages ML
    where ML.Collation = 'en-US'
    Order by ML.UnicodeData collate SQL_Latin1_General_CP1_CI_AS
) T
UNION ALL
select * from (
    select top 100000 ML.UnicodeData, ML.Collation
    from MultipleLanguages ML
    where ML.Collation = 'hi-IN' or ML.Collation = 'kn-IN' -- same SQL Server collate
    Order by ML.Collation, ML.UnicodeData collate Indic_General_100_CI_AS
) T
UNION ALL
select * from (
    select top 100000 ML.UnicodeData, ML.Collation
    from MultipleLanguages ML
    where ML.Collation = 'ar-EG'
    Order by ML.UnicodeData collate Arabic_CI_AS
) T
UNION ALL
select * from (
    select top 100000 ML.UnicodeData, ML.Collation
    from MultipleLanguages ML
    where ML.Collation = 'cs'
    Order by ML.UnicodeData collate Czech_CI_AS
) T
GO

 

Solution 03 [a], First Case

Unlike the previous solutions, in this script we deal with all the data together. For each culture we number the entire data set by the matching collation (one ROW_NUMBER column per collation). Then we sort the final result, choosing a different numbering column for each row according to its culture.

/*************************************************** Solution 03 [a] Case 01: Sorting all data by all collations */
select UnicodeData,Collation
from (
    select
        ML.UnicodeData,
        ML.Collation,
        ROW_NUMBER() OVER (order by unicodedata Collate Hebrew_CI_AS                  ) 'he-IL',
        ROW_NUMBER() OVER (order by unicodedata Collate SQL_Latin1_General_CP1_CI_AS  ) 'en-US',
        ROW_NUMBER() OVER (order by unicodedata Collate Indic_General_100_CI_AS       ) 'kn-IN & hi-IN',
        ROW_NUMBER() OVER (order by unicodedata Collate Arabic_CI_AS                  ) 'ar-EG',
        ROW_NUMBER() OVER (order by unicodedata Collate Czech_CI_AS                   ) 'cs'
    from MultipleLanguages ML
) T
order by
    [Collation],
    CASE
        when Collation = 'he-IL'        then [he-IL]
        when Collation = 'en-US'        then [en-US]
        when Collation = 'kn-IN'        then [kn-IN & hi-IN]
        when Collation = 'hi-IN'        then [kn-IN & hi-IN]
        when Collation = 'ar-EG'        then [ar-EG]
        when Collation = 'cs'           then [cs]
    END
GO

 

Solution 03 [b], First Case

This variation computes a single RN column, choosing the appropriate ROW_NUMBER (one per collation) with a CASE expression, so the final ORDER BY only needs Collation and RN.

/*************************************************** Solution 03 [b] Case 01: Sorting all data  in sections according to  all collations */
select UnicodeData,Collation
from (
    select
        ML.UnicodeData,
        ML.Collation,
        RN = 
            CASE
                when Collation = 'he-IL' then  ROW_NUMBER() OVER (order by  unicodedata Collate  Hebrew_CI_AS                  )
                when Collation = 'en-US' then  ROW_NUMBER() OVER (order by  unicodedata Collate  SQL_Latin1_General_CP1_CI_AS  )
                when Collation = 'kn-IN' then  ROW_NUMBER() OVER (order by  unicodedata Collate  Indic_General_100_CI_AS       )
                when Collation = 'hi-IN' then  ROW_NUMBER() OVER (order by  unicodedata Collate  Indic_General_100_CI_AS       )
                when Collation = 'ar-EG' then  ROW_NUMBER() OVER (order by  unicodedata Collate  Arabic_CI_AS                  )
                when Collation = 'cs'    then  ROW_NUMBER() OVER (order by  unicodedata Collate  Czech_CI_AS                   )
            END
    from MultipleLanguages ML
) T
order by
    [Collation],RN
GO

Comparing solutions

In this article we do not discuss optimizing the solutions; we focus on the options for solving the problem of sorting by multiple languages. It is important to remember that you must compare the solutions in your own database, using your specific DDL+DML! With that being said, let's look at a basic comparison of our solutions (based on our DDL+DML), and mention some important points to take into consideration when choosing the best solution for a specific case.

Execution plan: 

* Please click the image in order to open it in full scale

STATISTICS IO

Using the command SET STATISTICS IO ON, we can examine one of the most important components of execution cost: the number of page reads. In our case we got these reports:
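To reproduce these numbers yourself, wrap the solution you want to measure between the ON and OFF statements and read the report from the Messages tab:

```sql
SET STATISTICS IO ON;
GO

-- run the solution you want to measure here, for example:
select * from MultipleLanguages;
GO

SET STATISTICS IO OFF;
GO
```

The "logical reads" counter (pages read from the buffer cache) is usually the number to compare between solutions.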

Solution 01, Case 01: IO
Table 'MultipleLanguages'. Scan count 5, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
 
Solution 02, Case 01: IO
Table 'MultipleLanguages'. Scan count 5, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
 
Solution 03, Case 01: IO
Table 'MultipleLanguages'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

But what if we don't have any information regarding the data's culture?

Second Case Study

In this case we have plain data with no information about each value's culture. The only assumption we can make is that each value uses a specific culture from a known list of cultures. This is a much more complex case, since we will need to determine the value's culture on the fly.

/*************************************************** DDL */
IF OBJECT_ID(N'dbo.MultipleLanguages', N'U') IS NOT NULL
    DROP TABLE dbo.MultipleLanguages;
CREATE TABLE MultipleLanguages (
    UnicodeData NVARCHAR(200)
)
GO
 
/*************************************************** DML */
Truncate table MultipleLanguages
INSERT INTO MultipleLanguages (UnicodeData)
VALUES
      (N'This is an example.'       )
    , (N'यह एक उदाहरण है.'          )
    , (N'उदाहरण है.'                    )
    , (N'ಈ ಒಂದು ಉದಾಹರಣೆಯಾಗಿದೆ.' )
    , (N'דוגמה.'                    )
    , (N'एक उदाहरण है.'             )
    , (N'هذا مثال على ذلك.'     )
    , (N'To je příklad.'            )
    , (N'זה רק דוגמה.'              )
GO
 
select * from MultipleLanguages
GO
 
/**********************************************************************  */

Solutions for Second Case Study

Our solutions are based on the solutions from the first case study. The main differences are that (1) we need another step to recognize each value's culture using a utility table, and (2) we need a JOIN in order to get the culture name from that utility table. For these steps, we will create a new utility table that maps characters to cultures.

/********************************************************************** DDL */
create table UnicodeMapping (
    Character nchar(1),
    UnicodeNum int,
    CultureN NVARCHAR(100),
    CollateN NVARCHAR(100)
)
GO
 
-- fill the table with main Hebrew characters, using a number table
insert UnicodeMapping (Character, UnicodeNum, CultureN, CollateN)
select NCHAR(n), n, 'he-IL', 'Hebrew_CI_AS'
from NumbersTable -- we are using utilities numbers table
where
    n between 1488 and 1514 -- Hebrew
    or n between 64304 and 64330 -- Hebrew
GO
 
-- fill the table with main English characters, using a number table
insert UnicodeMapping (Character, UnicodeNum, CultureN, CollateN)
select NCHAR(n), n, 'en-US', 'SQL_Latin1_General_CP1_CI_AS'
from NumbersTable -- we are using utilities numbers table
where
    n between 97 and 122 -- En
    or n between 65 and 90 -- En
GO
 
-- Do the same with all the languages that you need, and all the UNICODE ranges for those languages
 
select * from UnicodeMapping
GO
 
/**********************************************************************  */

* We use a utility numbers table in order to fill our UnicodeMapping utility table.
* There is an option to store ranges of characters in the UnicodeMapping table instead of the actual values, but this means that each time we query the data we will need to compare values to ranges, and the queries will cost more.
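The scripts above assume a utility numbers table named NumbersTable with an integer column n, which the article does not create. If you do not already have one, here is a minimal sketch (the table name, column name, and the 65,535 upper bound are assumptions; the bound covers the BMP code points used above):

```sql
-- Minimal numbers table: integers 1..65535, built from stacked cross joins.
CREATE TABLE NumbersTable (n INT PRIMARY KEY);

WITH N1 AS (SELECT 1 AS c UNION ALL SELECT 1),        -- 2 rows
     N2 AS (SELECT 1 AS c FROM N1 a CROSS JOIN N1 b), -- 4 rows
     N3 AS (SELECT 1 AS c FROM N2 a CROSS JOIN N2 b), -- 16 rows
     N4 AS (SELECT 1 AS c FROM N3 a CROSS JOIN N3 b), -- 256 rows
     N5 AS (SELECT 1 AS c FROM N4 a CROSS JOIN N4 b)  -- 65536 rows
INSERT INTO NumbersTable (n)
SELECT TOP (65535) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM N5;
```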

Now we can use our utility table in order to get the information that we need :-)
In these examples we use only English and Hebrew, but you can add any other language you need.

/*************************************************** Solution 01 case 02 */
select UnicodeData,Collation
from (
     
    select ML.UnicodeData, UM.CultureN Collation, ROW_NUMBER() OVER (order by ML.UnicodeData Collate Hebrew_CI_AS) RN
    from MultipleLanguages ML
    left JOIN UnicodeMapping UM on UNICODE(LEFT(ML.UnicodeData,1)) = UM.UnicodeNum
    where UM.CultureN = 'he-IL'
 
    UNION ALL
     
    select ML.UnicodeData, UM.CultureN Collation, ROW_NUMBER() OVER (order by ML.UnicodeData Collate SQL_Latin1_General_CP1_CI_AS) RN
    from MultipleLanguages ML
    left JOIN UnicodeMapping UM on UNICODE(LEFT(ML.UnicodeData,1)) = UM.UnicodeNum
    where UM.CultureN = 'en-US'
 
) t
order by Collation, RN
GO
 
/*************************************************** Solution 02 case 02 */
select * from (
    select top 100000 ML.UnicodeData, UM.CultureN Collation
    from MultipleLanguages ML
    left JOIN UnicodeMapping UM on UNICODE(LEFT(ML.UnicodeData,1)) = UM.UnicodeNum
    where UM.CultureN = 'he-IL'
    Order by ML.UnicodeData collate Hebrew_CI_AS
) T
UNION ALL
select * from (
    select top 100000 ML.UnicodeData, UM.CultureN Collation
    from MultipleLanguages ML
    left JOIN UnicodeMapping UM on UNICODE(LEFT(ML.UnicodeData,1)) = UM.UnicodeNum
    where UM.CultureN = 'en-US'
    Order by ML.UnicodeData collate SQL_Latin1_General_CP1_CI_AS
) T
GO
 
/*************************************************** Solution 03 Case 02 */
select UnicodeData,Collation
from (
    select
        ML.UnicodeData,
        UM.CultureN Collation,
        ROW_NUMBER() OVER (order by unicodedata Collate Hebrew_CI_AS                  ) 'he-IL',
        ROW_NUMBER() OVER (order by unicodedata Collate SQL_Latin1_General_CP1_CI_AS  ) 'en-US'
    from MultipleLanguages ML
    left JOIN UnicodeMapping UM on UNICODE(LEFT(ML.UnicodeData,1)) = UM.UnicodeNum
) T
order by
    [Collation],
    CASE
        when Collation = 'he-IL'        then [he-IL]
        when Collation = 'en-US'        then [en-US]
    END
GO

Comments & Conclusions

   In this article we use simplified sample code, which represents only the database structure directly relevant to our issue. It is important to remember to use appropriate clustered/nonclustered indexes and partitions in your production database solution!

  Comparing solutions

        >> Comparing the execution plans of solutions 01 and 02, we can see that in solution 01 SQL Server needs another sort after UNIONing the data, while in solution 02 the data is already sorted by each subquery. This is the main reason that solution 02 looks better.

        >> In both solutions using UNION (solutions 01 and 02) we split the data into culture groups and sort each culture separately. If there are no appropriate indexes and/or partitions, we need to scan the entire table once per culture. These solutions look very poor when we compare the IO reads: in our case we had to read five times as many pages as in the third solution.

In our sample we use a very small set of data, and all of it fits in a single page (we can use DBCC IND in order to verify which pages store our data). Therefore, the solutions that used UNION had to read the same page five times (once for each culture we sort), while the third solution needed to read the page only once. At first glance this might lead us to a false conclusion: is the third solution actually much better in every case?

You have to remember that in production your table probably includes much more data (perhaps millions of rows), stored in hundreds of pages. Moreover, you may use partitions that split the data by culture; each culture's data might even be stored in a separate file and/or disk. In these cases the results might be different! Using the solutions with UNION, we would read only the pages that include data for the specific culture, and we would not need to read the same page more than once.

  A collation encodes the rules governing the proper use of characters for a specific language. Using one specific collation in order to sort multi-language data leads to a wrong order for all the other languages!

  It is highly recommended to keep utility tables like UnicodeMapping, NumbersTable, etc. in a separate database, which can be set to READ_ONLY mode. Firstly, it lets us use the same tables for all users without the need to create and store the same data in each database. Secondly, READ_ONLY mode improves concurrency (we reduce locks and waits). Thirdly, READ_ONLY mode gives us some protection against accidental changes.

  In order to determine a value's language, we used the UNICODE function and compared the result to our utility table UnicodeMapping. In our samples we simplified the problem of finding the value's language: we select the first character in the string (using the LEFT function), assuming that the first character determines the language. This is not accurate in every case! The string might start with digits, for example, which fit any language. In such cases we need to choose the first character that is not shared by different languages (unlike digits). A SQL CLR function is a good fit for this kind of string processing.
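For simple cases, a set-based alternative to a CLR function is possible: use PATINDEX to skip over characters shared by all languages before picking the classification character. The skip-list in the pattern below (digits, spaces, and basic punctuation) is an assumption; extend it for your data:

```sql
-- Pick the first character that is NOT a digit, space, or basic punctuation,
-- and use it (instead of blindly taking LEFT(value, 1)) for the culture lookup.
SELECT
    ML.UnicodeData,
    SUBSTRING(ML.UnicodeData,
              PATINDEX(N'%[^0-9 .,;:!?()"-]%', ML.UnicodeData),
              1) AS ClassifierChar
FROM MultipleLanguages ML
WHERE PATINDEX(N'%[^0-9 .,;:!?()"-]%', ML.UnicodeData) > 0;
```

The ClassifierChar result can then be joined to UnicodeMapping via UNICODE(ClassifierChar), exactly as the LEFT-based queries above do.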

  If you don't know in advance which languages you might have, you can use a dynamic query, or map all possible languages in your UnicodeMapping utility table. Any language that is not mapped can be treated as a specific default language of your choice.

Summary

A collation encodes the rules governing the proper use of characters for a specific language. Using one specific collation in order to sort multi-language data leads to a wrong order for all the other languages! In this article we showed how to sort multi-language data using multiple collations (cultures) in the same query. We used three simple solutions based on "SELF UNION" or "SELF JOIN" operations ('self' means that we JOIN or UNION a table with itself), or a direct approach using the ROW_NUMBER function.

  You can download all the code in this article as one zip file from here.

Resources and more information