Strange query result from SQL Server

Frediano Ziglio 111 Reputation points
2020-09-02T06:46:30.927+00:00

I'm trying to use UTF-8 encoding in MSSQL 2019 but I found a weird behavior. Launching a query like

select 'me‚‚‚abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8

on tempdb I'm getting this result:

me‚‚‚

where are the other characters? Note that the '‚' is not an ASCII comma but a Unicode 0x201a point which correspond to 0x82 Latin1 encoding (the encoding of the tempdb database).

Transact-SQL
Transact-SQL
A Microsoft extension to the ANSI SQL language that includes procedural programming, local variables, and various support functions.
4,626 questions
0 comments No comments
{count} votes

Accepted answer
  1. Frediano Ziglio 111 Reputation points
    2020-09-02T15:57:42.967+00:00

    Replying here as too long and won't fit a reply.

    Thanks for the patient and effort. This does not reply entirely to my question but collecting your tests I think I can now explain my queries and the others.

    So, let's start with my query and why the result

    SELECT 'me‚‚‚abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8
    

    the server (and so the parser) receives the string as UTF-16 and parse the string 'me‚‚‚abcdef', as it's embedded in single quotes the parser "create" a VARCHAR using the encoding of the current database, in my case (tempdb just installed)

    SELECT DATABASEPROPERTYEX(db_name(), 'Collation')
    
    SQL_Latin1_General_CP1_CI_AS
    

    so the parser converts the string to a VARCHAR(11) COLLATE SQL_Latin1_General_CP1_CI_AS column (11 are the length required), so we have basically a

    CAST(N'me‚‚‚abcdef' AS VARCHAR(11)) COLLATE SQL_Latin1_General_CP1_CI_AS
    

    now, continuing executing the initial query:

    'me‚‚‚abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8
    

    the server tries to convert the query to another collation, using the same type (VARCHAR(11)), so doing a

    CAST(<STRING> AS VARCHAR(11)) COLLATE Latin1_General_100_CI_AS_SC_UTF8
    

    that is

    CAST(CAST(N'me‚‚‚abcdef' AS VARCHAR(11)) COLLATE SQL_Latin1_General_CP1_CI_AS AS VARCHAR(11)) COLLATE Latin1_General_100_CI_AS_SC_UTF8
    

    which returns our 'me‚‚‚' (as the commas require 3 bytes so 2+3*3 == 11)

    Now, let's test this on your last query:

    SELECT CONVERT (VARBINARY(MAX), '‚' COLLATE Hebrew_100_CS_AS_SC_UTF8)
    

    should be the same as

    SELECT CONVERT (VARBINARY(MAX), CAST(CAST(N'‚' AS VARCHAR(1)) COLLATE Hebrew_100_CS_AS_SC AS VARCHAR(1)) COLLATE Hebrew_100_CS_AS_SC_UTF8)
    

    how to verify that this work? Well, let's try with the string '‚aa'

    SELECT CONVERT (VARBINARY(MAX), '‚aa' COLLATE Hebrew_100_CS_AS_SC_UTF8)
    

    converted to

    SELECT CONVERT (VARBINARY(MAX), CAST(CAST(N'‚aa' AS VARCHAR(3)) COLLATE Hebrew_100_CS_AS_SC AS VARCHAR(3)) COLLATE Hebrew_100_CS_AS_SC_UTF8)
    

    why '‚aa' ? So there's space to convert the first character to UTF-8, as the second query is doing.

    Just as a last verification of my explanation this query

    SELECT 'me‚王abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8
    

    is returning 'me‚?abcd' this confirms the double conversion and the truncation to VARCHAR(10), 10 bytes to express the source string as Latin1

    0 comments No comments

6 additional answers

Sort by: Most helpful
  1. Ronen Ariely 15,191 Reputation points
    2020-09-02T08:54:40.487+00:00

    You could have a try with below to get the complete output:
    select cast('me‚‚‚abcdef' as VARCHAR(50)) COLLATE Latin1_General_100_CI_AS_SC_UTF8

    CONVERT to VARCHAR first is a good ideas 22080-image.png

    In general! Using UTF-8 on English characters can be extremely confusing since UTF-8 is backward compatible with ASCII which mean that it uses the same encoding for the first 128 characters. Therefore, I always recommend to test/demonstrate with other langues.

    Moreover, In order to examine the behavior of on-the-fly queries you should configure the database COLLATE since the result can be totally different.

    Let's create three databases for example:

    CREATE DATABASE HebrewDB COLLATE Hebrew_CI_AS  
    CREATE DATABASE LatinDB  COLLATE SQL_Latin1_General_CP1_CI_AS  
    CREATE DATABASE UTF8DB   COLLATE Hebrew_100_CS_AS_SC_UTF8  
    GO  
    

    Now let's add Hebrew character and Arabic character to the above query:

    -- English only: Do not really provide anything  
    select cast('me‚‚‚abcdef' as VARCHAR(50)) COLLATE Latin1_General_100_CI_AS_SC_UTF8  
    -- Adding Hebrew and Arabic  
    select cast(('me‚‚‚abcdef' + ',א' + ',ݰ') as VARCHAR(50)) COLLATE Latin1_General_100_CI_AS_SC_UTF8  
    GO  
    

    Execute the second select will return totally different result in these three databases!

    use HebrewDB  
    select cast(('me‚‚‚abcdef' + ',א' + ',ݰ') as VARCHAR(50)) COLLATE Latin1_General_100_CI_AS_SC_UTF8  
    GO  
    USE LatinDB  
    select cast(('me‚‚‚abcdef' + ',א' + ',ݰ') as VARCHAR(50)) COLLATE Latin1_General_100_CI_AS_SC_UTF8  
    GO  
    USE UTF8DB  
    select cast(('me‚‚‚abcdef' + ',א' + ',ݰ') as VARCHAR(50)) COLLATE Latin1_General_100_CI_AS_SC_UTF8  
    GO  
    

    22157-image.png

    In addition, VERY IMPORTANT! Notice the differences between these queries (try to execute it in the Hebrew Database for example)

    select cast(('me‚‚‚abcdef' + ',א' + ',ݰ') as VARCHAR(50)) COLLATE Latin1_General_100_CI_AS_SC_UTF8  
    select cast(('me‚‚‚abcdef' + ',א' + ',ݰ') COLLATE Latin1_General_100_CI_AS_SC_UTF8 as VARCHAR(50))  
    select cast((N'me‚‚‚abcdef' + N',א' + N',ݰ') COLLATE Latin1_General_100_CI_AS_SC_UTF8 as VARCHAR(50))  
    GO  
    

    For good/reliable result you should probably use the third option

    ----------

    Good day @Frediano Ziglio

    If you want to get first understanding about UTF-8 then you can check my short lecture on UTF-8 from the Microsoft Ignite 2019 (I had the honor to present this feature officially). It is only an introduction and you better search for my 90 minute lecture on the topic (there are versions in Hebrew and in English)

    With that being said the link that @MelissaMa-MSFT provided is awesome and explain almost all you need to know.

    Regarding your specific case, please publish a full example so we will be able to re-produce the scenario - starting from creating new database using the same COLLATE as your database, which as you can see above might be VERY important. Please explain what is the expected result and we will be able to explain the actual result :-)

    Note: UTF-8 mapped the code point "0201" to "ȁ" and not to something that looks like a comma. You can use the function NCHAR and Execute 'SELECT NCHAR(0x201)` to get the result of the mapping or you can use the following link to get all the mapping of UTF-8:

    https://www.utf8-chartable.de/unicode-utf8-table.pl


  2. Frediano Ziglio 111 Reputation points
    2020-09-02T09:55:56.297+00:00

    The original question is about why a specific query on tempdb produces a truncated result.

    All responses here are changing the query. This is not answering the original question, just giving workarounds.

    I installed a docker container (using a Fedora 31, I but I don't think it's important) with

    docker run -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=MyStrongPassword' -p 1433:1433 -d mcr.microsoft.com/mssql/server:2019-CU6-ubuntu-16.04
    

    (but same with 2019-latest)

    Connect, switch to tempdb database (like USE tempdb) and execute a

    select 'me‚‚‚abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8
    

    (sorry but my original post had multiple lines, I don't know why the formatting was removed).

    I hope the question is more clear. Thank you for the documentation links.

    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.