Strange query result from SQL Server

Frediano Ziglio 111 Reputation points
2020-09-02T06:46:30.927+00:00

I'm trying to use UTF-8 encoding in MSSQL 2019 but I found a weird behavior. Launching a query like

select 'me‚‚‚abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8

on tempdb I'm getting this result:

me‚‚‚

Where are the other characters? Note that '‚' is not an ASCII comma but the Unicode code point U+201A, which corresponds to 0x82 in the Latin1 encoding (the encoding of the tempdb database).
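For reference, this mapping can be double-checked outside SQL Server; a quick Python sketch (cp1252 used as the usual stand-in for SQL Server's "Latin1" code page):

```python
# U+201A (SINGLE LOW-9 QUOTATION MARK) is byte 0x82 in Windows-1252,
# the code page behind SQL Server's Latin1 collations.
comma_like = "\u201a"                        # the '‚' character from the query
assert comma_like.encode("cp1252") == b"\x82"
assert b"\x82".decode("cp1252") == "\u201a"
```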


Accepted answer
  1. Frediano Ziglio 111 Reputation points
    2020-09-02T15:57:42.967+00:00

    Replying here as this is too long and won't fit in a comment.

    Thanks for the patience and effort. This does not entirely answer my question, but putting your tests together I think I can now explain my queries and the others.

    So, let's start with my query and why it gives that result:

    SELECT 'me‚‚‚abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8
    

    The server (and thus the parser) receives the query as UTF-16 and parses the string 'me‚‚‚abcdef'. As it's enclosed in single quotes, the parser "creates" a VARCHAR using the encoding of the current database, in my case (a freshly installed tempdb):

    SELECT DATABASEPROPERTYEX(db_name(), 'Collation')
    
    SQL_Latin1_General_CP1_CI_AS
    

    So the parser converts the string to VARCHAR(11) COLLATE SQL_Latin1_General_CP1_CI_AS (11 is the required length), so we basically have:

    CAST(N'me‚‚‚abcdef' AS VARCHAR(11)) COLLATE SQL_Latin1_General_CP1_CI_AS
    

    Now, continuing to execute the initial query:

    'me‚‚‚abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8
    

    the server tries to convert the value to another collation, keeping the same type (VARCHAR(11)), so doing:

    CAST(<STRING> AS VARCHAR(11)) COLLATE Latin1_General_100_CI_AS_SC_UTF8
    

    that is

    CAST(CAST(N'me‚‚‚abcdef' AS VARCHAR(11)) COLLATE SQL_Latin1_General_CP1_CI_AS AS VARCHAR(11)) COLLATE Latin1_General_100_CI_AS_SC_UTF8
    

    which returns our 'me‚‚‚' (each '‚' requires 3 bytes in UTF-8, so 2 + 3*3 == 11).
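    The arithmetic can be checked outside SQL Server; here is a Python sketch of the byte counts (cp1252 standing in for the database code page — a model of the behavior, not SQL Server's actual code path):

```python
# The cp1252 form of the literal is 11 bytes, so VARCHAR(11) is inferred,
# but re-encoding to UTF-8 needs 3 bytes for each '‚' (U+201A).
s = "me\u201a\u201a\u201aabcdef"
assert len(s.encode("cp1252")) == 11        # -> VARCHAR(11) inferred
utf8 = s.encode("utf-8")
assert len(utf8) == 17                      # full UTF-8 form needs 17 bytes
truncated = utf8[:11].decode("utf-8")       # cut at the 11-byte limit
assert truncated == "me\u201a\u201a\u201a"  # 2 + 3*3 == 11 -> 'me‚‚‚'
```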

    Now, let's test this on your last query:

    SELECT CONVERT (VARBINARY(MAX), '‚' COLLATE Hebrew_100_CS_AS_SC_UTF8)
    

    should be the same as

    SELECT CONVERT (VARBINARY(MAX), CAST(CAST(N'‚' AS VARCHAR(1)) COLLATE Hebrew_100_CS_AS_SC AS VARCHAR(1)) COLLATE Hebrew_100_CS_AS_SC_UTF8)
    

    How to verify that this works? Well, let's try with the string '‚aa':

    SELECT CONVERT (VARBINARY(MAX), '‚aa' COLLATE Hebrew_100_CS_AS_SC_UTF8)
    

    converted to

    SELECT CONVERT (VARBINARY(MAX), CAST(CAST(N'‚aa' AS VARCHAR(3)) COLLATE Hebrew_100_CS_AS_SC AS VARCHAR(3)) COLLATE Hebrew_100_CS_AS_SC_UTF8)
    

    Why '‚aa'? So there's room to convert the first character to UTF-8, as the second query does.

    As a last verification of my explanation, this query

    SELECT 'me‚王abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8
    

    returns 'me‚?abcd'. This confirms the double conversion and the truncation to VARCHAR(10) (10 bytes being the length required to express the source string as Latin1, with '王' replaced by '?').
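    A Python sketch of this last case, under the same assumptions ('王' is U+738B and has no cp1252 byte, so the narrowing step replaces it with '?'):

```python
s = "me\u201a\u738babcdef"
cp1252 = s.encode("cp1252", errors="replace")             # '王' -> b'?'
assert cp1252 == b"me\x82?abcdef" and len(cp1252) == 10   # -> VARCHAR(10)
utf8 = cp1252.decode("cp1252").encode("utf-8")            # 12 bytes in UTF-8
assert utf8[:10].decode("utf-8") == "me\u201a?abcd"       # 2+3+1+4 == 10
```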


6 additional answers

  1. David Browne - msft 3,846 Reputation points
    2020-09-02T14:47:56.153+00:00

    Lots of good info in the answers, but I don't think we really understand what's going on yet. Consider the expression

    'me‚‚‚abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8  
    

    The question is: should this result in a lossless conversion of the characters to UTF-8 encoding, or in a truncated string value?

    The docs have been updated, and this behavior is by design. Without the N Unicode prefix on a T-SQL character literal, the value is converted first to the database collation's code page, then to the code page of the target collation. This intermediate translation can lose data, either through Unicode characters in the string constant not being representable in the database code page, or through the length of the intermediate string causing truncation.

    See this Note:

    When a UTF8-enabled collation is specified using the COLLATE clause, conversion to the database default code page still happens before the conversion to the collation specified by the COLLATE clause. Conversion is not done directly to the specified Unicode-enabled collation. For more information, see Unicode string.

    So this should be changed to:

       N'me‚‚‚abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8  
    

    Character String Constants
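    The two-step conversion the note describes can be modeled with a small Python helper (illustrative only; `collate_literal` is a made-up name, and cp1252 is assumed as the database code page — this is not SQL Server's actual code path):

```python
def collate_literal(s: str, db_codepage: str = "cp1252") -> str:
    """Model of the documented behavior: a non-N literal is first narrowed
    to the database code page, then re-encoded for the target collation."""
    intermediate = s.encode(db_codepage, errors="replace")  # lossy first step
    byte_limit = len(intermediate)                          # inferred VARCHAR(n)
    utf8 = intermediate.decode(db_codepage).encode("utf-8")
    return utf8[:byte_limit].decode("utf-8", errors="ignore")

# Reproduces the truncated result from the original question:
assert collate_literal("me\u201a\u201a\u201aabcdef") == "me\u201a\u201a\u201a"
```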


  2. Ronen Ariely 15,191 Reputation points
    2020-09-02T13:13:22.943+00:00

    Hi,

    The original question is about why a specific query on tempdb produces a truncated result.

    This is NOT related to tempdb specifically, but to the collation of the database.

    In your specific case tempdb behaves like this, but someone else's tempdb might use a different collation, for example a UTF-8 collation, in which case the query would work fine.

    If you execute your query in the HebrewDB database from the example above, you will get the exact same issue; if you execute it in the UTF8DB database, you will not.

    I tried to explain (probably not well) that when you execute a query with an inline string literal, as in your case, the database collation affects the behavior.

    When you use a literal like 'me‚‚‚abcdef', the batch parser is also involved in the execution. In this case, using a collation like Hebrew_CI_AS fails to convert the value to a UTF-8 collation.

    As mentioned above, you should use N'me‚‚‚abcdef' instead of 'me‚‚‚abcdef' and it should work for you.


    Note: if you execute the following query under HebrewDB, you will notice that the result is CHAR ZERO (0x). This is the source of the issue.

    SELECT CONVERT (VARBINARY(MAX), '‚' COLLATE Hebrew_100_CS_AS_SC_UTF8) -- 0x CHAR ZERO
    

    By the way, I wrote about "char zero" a long time ago in one of my blog posts; you can search for it.

    It seems that when we use char zero, the string is truncated (in HebrewDB as well as in the UTF8DB database):

    select 'meab' + CHAR(0) + 'cdef' COLLATE Hebrew_100_CS_AS_SC_UTF8
    select N'1234' + CHAR(0) + N'56789'
    GO
    

    This is what happens here, as far as I can see. The character '‚' is converted into CHAR zero when we change the collation, and this leads to only the part of the string before that character being returned (a truncated string).
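    For comparison, the accepted answer's byte-budget reading of the same 0x result can be sketched in Python (cp1255 is assumed here as the code page behind the Hebrew collations):

```python
# cp1255 also maps '‚' (U+201A) to the single byte 0x82, so the inferred
# type is VARCHAR(1); the UTF-8 form needs 3 bytes, so nothing fits.
inter = "\u201a".encode("cp1255")
assert inter == b"\x82"
limit = len(inter)                                 # 1 byte available
utf8 = inter.decode("cp1255").encode("utf-8")      # b'\xe2\x80\x9a'
result = utf8[:limit].decode("utf-8", errors="ignore")
assert result == ""                                # empty string, i.e. 0x
```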

    Here is another small experiment you can examine:

    use HebrewDB
    SELECT CONVERT (VARBINARY(MAX), '‚' COLLATE Hebrew_100_CS_AS_SC_UTF8)
    GO -- 0x  CHAR ZERO
    
    use UTF8DB
    SELECT CONVERT (VARBINARY(MAX), '‚' COLLATE Hebrew_100_CS_AS_SC_UTF8)
    GO -- 0xE2809A : this is the hex value in UTF-8 for the character. You can confirm this here: https://www.utf8-chartable.de/unicode-utf8-table.pl
    SELECT CONVERT (VARCHAR(MAX),0xE2809A) -- CONVERT BACK works ok
    GO -- ‚   : OK
    

  3. MelissaMa-MSFT 24,196 Reputation points
    2020-09-03T06:25:22.237+00:00

    Hi @Frediano Ziglio

    Thank you for your sharing.

    One article I found mentions that we still need the N prefix on string literals, even though the destination type is varchar. The reason is that SQL Server interprets the value of the string first, and if the N is not there, part of the Unicode data is lost.

    Please refer to Impact of UTF-8 support in SQL Server 2019 for more details and the example below:

     SELECT 'me‚‚‚abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8  
     --me‚‚‚  
    
     SELECT N'me‚‚‚abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8  
     --me‚‚‚abcdef  
    

    Hope the above is a little helpful to you.

    If the response is helpful, please click "Accept Answer" and upvote it.

    Best regards
    Melissa


  4. MelissaMa-MSFT 24,196 Reputation points
    2020-09-02T08:11:03.373+00:00

    Hi @Frediano Ziglio ,

    UTF-8 is enabled in the existing data types CHAR and VARCHAR.

    String data is automatically encoded to UTF-8 when creating or changing an object’s collation to a collation with the “_UTF8” suffix, for example from LATIN1_GENERAL_100_CI_AS_SC to LATIN1_GENERAL_100_CI_AS_SC_UTF8.

    Like UTF-16, UTF-8 is only available for Windows collations that support Supplementary Characters, as introduced in SQL Server 2012.

    You could try the query below to get the complete output:

    select cast('me‚‚‚abcdef' as VARCHAR(50)) COLLATE Latin1_General_100_CI_AS_SC_UTF8  
    

    Output:
    me‚‚‚abcdef
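    A Python sketch of why the VARCHAR(50) workaround succeeds (cp1252 again standing in for the database code page): the byte budget is now 50, well above the 17 bytes the UTF-8 form needs.

```python
s = "me\u201a\u201a\u201aabcdef"
utf8 = s.encode("cp1252").decode("cp1252").encode("utf-8")  # lossless round trip
assert len(utf8) == 17 and len(utf8) <= 50                  # fits in VARCHAR(50)
assert utf8[:50].decode("utf-8") == s                       # nothing is cut off
```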

    Please refer more details from below link:
    Introducing UTF-8 support for SQL Server

    If the response is helpful, please click "Accept Answer" and upvote it.

    Best regards
    Melissa

