Strange query result from SQL Server

Frediano Ziglio 111 Reputation points
2020-09-02T06:46:30.927+00:00

I'm trying to use UTF-8 encoding in MSSQL 2019 but I found a weird behavior. Launching a query like

select 'me‚‚‚abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8

on tempdb I'm getting this result:

me‚‚‚

Where are the other characters? Note that '‚' is not an ASCII comma but the Unicode code point U+201A, which corresponds to 0x82 in the Latin1 encoding (the encoding of the tempdb database).
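For reference, this mapping can be double-checked outside SQL Server; a quick Python sketch (cp1252 used as the usual stand-in for SQL Server's "Latin1" code page):

```python
# U+201A (SINGLE LOW-9 QUOTATION MARK) is byte 0x82 in Windows-1252,
# the code page behind SQL Server's Latin1 collations.
comma_like = "\u201a"                        # the '‚' character from the query
assert comma_like.encode("cp1252") == b"\x82"
assert b"\x82".decode("cp1252") == "\u201a"
```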


Accepted answer
  1. Frediano Ziglio 111 Reputation points
    2020-09-02T15:57:42.967+00:00

    Replying here as this is too long and won't fit in a comment.

    Thanks for the patience and effort. This does not entirely answer my question, but putting your tests together I think I can now explain my queries and the others.

    So, let's start with my query and why it gives that result:

    SELECT 'me‚‚‚abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8
    

    The server (and thus the parser) receives the query as UTF-16 and parses the string 'me‚‚‚abcdef'. As it's enclosed in single quotes, the parser "creates" a VARCHAR using the encoding of the current database, in my case (a freshly installed tempdb):

    SELECT DATABASEPROPERTYEX(db_name(), 'Collation')
    
    SQL_Latin1_General_CP1_CI_AS
    

    So the parser converts the string to VARCHAR(11) COLLATE SQL_Latin1_General_CP1_CI_AS (11 is the required length), so we basically have:

    CAST(N'me‚‚‚abcdef' AS VARCHAR(11)) COLLATE SQL_Latin1_General_CP1_CI_AS
    

    Now, continuing to execute the initial query:

    'me‚‚‚abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8
    

    the server tries to convert the value to another collation, keeping the same type (VARCHAR(11)), so doing:

    CAST(<STRING> AS VARCHAR(11)) COLLATE Latin1_General_100_CI_AS_SC_UTF8
    

    that is

    CAST(CAST(N'me‚‚‚abcdef' AS VARCHAR(11)) COLLATE SQL_Latin1_General_CP1_CI_AS AS VARCHAR(11)) COLLATE Latin1_General_100_CI_AS_SC_UTF8
    

    which returns our 'me‚‚‚' (each '‚' requires 3 bytes in UTF-8, so 2 + 3*3 == 11).
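    The arithmetic can be checked outside SQL Server; here is a Python sketch of the byte counts (cp1252 standing in for the database code page — a model of the behavior, not SQL Server's actual code path):

```python
# The cp1252 form of the literal is 11 bytes, so VARCHAR(11) is inferred,
# but re-encoding to UTF-8 needs 3 bytes for each '‚' (U+201A).
s = "me\u201a\u201a\u201aabcdef"
assert len(s.encode("cp1252")) == 11        # -> VARCHAR(11) inferred
utf8 = s.encode("utf-8")
assert len(utf8) == 17                      # full UTF-8 form needs 17 bytes
truncated = utf8[:11].decode("utf-8")       # cut at the 11-byte limit
assert truncated == "me\u201a\u201a\u201a"  # 2 + 3*3 == 11 -> 'me‚‚‚'
```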

    Now, let's test this on your last query:

    SELECT CONVERT (VARBINARY(MAX), '‚' COLLATE Hebrew_100_CS_AS_SC_UTF8)
    

    should be the same as

    SELECT CONVERT (VARBINARY(MAX), CAST(CAST(N'‚' AS VARCHAR(1)) COLLATE Hebrew_100_CS_AS_SC AS VARCHAR(1)) COLLATE Hebrew_100_CS_AS_SC_UTF8)
    

    How to verify that this works? Well, let's try with the string '‚aa':

    SELECT CONVERT (VARBINARY(MAX), '‚aa' COLLATE Hebrew_100_CS_AS_SC_UTF8)
    

    converted to

    SELECT CONVERT (VARBINARY(MAX), CAST(CAST(N'‚aa' AS VARCHAR(3)) COLLATE Hebrew_100_CS_AS_SC AS VARCHAR(3)) COLLATE Hebrew_100_CS_AS_SC_UTF8)
    

    Why '‚aa'? So there's room to convert the first character to UTF-8, as the second query does.

    As a last verification of my explanation, this query

    SELECT 'me‚王abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8
    

    returns 'me‚?abcd'. This confirms the double conversion and the truncation to VARCHAR(10) (10 bytes being the length required to express the source string as Latin1, with '王' replaced by '?').
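    A Python sketch of this last case, under the same assumptions ('王' is U+738B and has no cp1252 byte, so the narrowing step replaces it with '?'):

```python
s = "me\u201a\u738babcdef"
cp1252 = s.encode("cp1252", errors="replace")             # '王' -> b'?'
assert cp1252 == b"me\x82?abcdef" and len(cp1252) == 10   # -> VARCHAR(10)
utf8 = cp1252.decode("cp1252").encode("utf-8")            # 12 bytes in UTF-8
assert utf8[:10].decode("utf-8") == "me\u201a?abcd"       # 2+3+1+4 == 10
```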


6 additional answers

  1. David Browne - msft 3,846 Reputation points
    2020-09-02T14:47:56.153+00:00

    Lots of good info in the answers, but I don't think we really understand what's going on yet. Consider the expression

    'me‚‚‚abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8  
    

    The question is: should this result in a lossless conversion of the characters to UTF-8 encoding, or in a truncated string value?

    The docs have been updated, and this behavior is by design. Without the N Unicode prefix on a T-SQL character literal, the value is converted first to the database collation's code page, then to the code page of the target collation. This intermediate translation can lose data, either through Unicode characters in the string constant not being representable in the database code page, or through the length of the intermediate string causing truncation.

    See this Note:

    When a UTF8-enabled collation is specified using the COLLATE clause, conversion to the database default code page still happens before the conversion to the collation specified by the COLLATE clause. Conversion is not done directly to the specified Unicode-enabled collation. For more information, see Unicode string.

    So this should be changed to:

       N'me‚‚‚abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8  
    

    Character String Constants
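    The two-step conversion the note describes can be modeled with a small Python helper (illustrative only; `collate_literal` is a made-up name, and cp1252 is assumed as the database code page — this is not SQL Server's actual code path):

```python
def collate_literal(s: str, db_codepage: str = "cp1252") -> str:
    """Model of the documented behavior: a non-N literal is first narrowed
    to the database code page, then re-encoded for the target collation."""
    intermediate = s.encode(db_codepage, errors="replace")  # lossy first step
    byte_limit = len(intermediate)                          # inferred VARCHAR(n)
    utf8 = intermediate.decode(db_codepage).encode("utf-8")
    return utf8[:byte_limit].decode("utf-8", errors="ignore")

# Reproduces the truncated result from the original question:
assert collate_literal("me\u201a\u201a\u201aabcdef") == "me\u201a\u201a\u201a"
```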


  2. Ronen Ariely 15,191 Reputation points
    2020-09-02T13:13:22.943+00:00

    Hi,

    The original question is about why a specific query on tempdb produces a truncated result.

    This is NOT related to tempdb specifically, but to the collation of the database.

    In your specific case tempdb behaves like this, but someone else's tempdb might use a different collation, for example a UTF-8 collation, in which case the query would work fine.

    If you execute your query in the HebrewDB database from the example above, you will get the exact same issue; if you execute it in the UTF8DB database, you will not.

    I tried to explain (probably not well) that when you execute a query with an inline string literal, as in your case, the database collation affects the behavior.

    When you use a literal like 'me‚‚‚abcdef', the batch parser is also involved in the execution. In this case, using a collation like Hebrew_CI_AS fails to convert the value to a UTF-8 collation.

    As mentioned above, you should use N'me‚‚‚abcdef' instead of 'me‚‚‚abcdef' and it should work for you.


    Note: if you execute the following query under HebrewDB, you will notice that the result is CHAR ZERO (0x). This is the source of the issue.

    SELECT CONVERT (VARBINARY(MAX), '‚' COLLATE Hebrew_100_CS_AS_SC_UTF8) -- 0x CHAR ZERO
    

    By the way, I wrote about "char zero" a long time ago in one of my blog posts; you can search for it.

    It seems that when we use char zero, the string is truncated (in HebrewDB as well as in the UTF8DB database):

    select 'meab' + CHAR(0) + 'cdef' COLLATE Hebrew_100_CS_AS_SC_UTF8
    select N'1234' + CHAR(0) + N'56789'
    GO
    

    This is what happens here, as far as I can see. The character '‚' is converted into CHAR zero when we change the collation, and this leads to only the part of the string before that character being returned (a truncated string).
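    For comparison, the accepted answer's byte-budget reading of the same 0x result can be sketched in Python (cp1255 is assumed here as the code page behind the Hebrew collations):

```python
# cp1255 also maps '‚' (U+201A) to the single byte 0x82, so the inferred
# type is VARCHAR(1); the UTF-8 form needs 3 bytes, so nothing fits.
inter = "\u201a".encode("cp1255")
assert inter == b"\x82"
limit = len(inter)                                 # 1 byte available
utf8 = inter.decode("cp1255").encode("utf-8")      # b'\xe2\x80\x9a'
result = utf8[:limit].decode("utf-8", errors="ignore")
assert result == ""                                # empty string, i.e. 0x
```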

    Here is another small experiment you can examine:

    use HebrewDB
    SELECT CONVERT (VARBINARY(MAX), '‚' COLLATE Hebrew_100_CS_AS_SC_UTF8)
    GO -- 0x  CHAR ZERO
    
    use UTF8DB
    SELECT CONVERT (VARBINARY(MAX), '‚' COLLATE Hebrew_100_CS_AS_SC_UTF8)
    GO -- 0xE2809A : this is the hex value in UTF-8 for the character. You can confirm this here: https://www.utf8-chartable.de/unicode-utf8-table.pl
    SELECT CONVERT (VARCHAR(MAX),0xE2809A) -- CONVERT BACK works ok
    GO -- ‚   : OK
    

  3. MelissaMa-MSFT 24,196 Reputation points
    2020-09-03T06:25:22.237+00:00

    Hi @Frediano Ziglio

    Thank you for your sharing.

    One article I found mentions that we still need the N prefix on string literals, even though the destination type is varchar. The reason is that SQL Server interprets the value of the string first, and if the N is not there, part of the Unicode data is lost.

    Please refer to Impact of UTF-8 support in SQL Server 2019 for more details and the example below:

     SELECT 'me‚‚‚abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8  
     --me‚‚‚  
    
     SELECT N'me‚‚‚abcdef' COLLATE Latin1_General_100_CI_AS_SC_UTF8  
     --me‚‚‚abcdef  
    

    Hope the above is a little helpful to you.

    If the response is helpful, please click "Accept Answer" and upvote it.

    Best regards
    Melissa


  4. MelissaMa-MSFT 24,196 Reputation points
    2020-09-02T08:11:03.373+00:00

    Hi @Frediano Ziglio ,

    UTF-8 is enabled in the existing data types CHAR and VARCHAR.

    String data is automatically encoded to UTF-8 when creating or changing an object’s collation to a collation with the “_UTF8” suffix, for example from LATIN1_GENERAL_100_CI_AS_SC to LATIN1_GENERAL_100_CI_AS_SC_UTF8.

    Like UTF-16, UTF-8 is only available for Windows collations that support Supplementary Characters, as introduced in SQL Server 2012.

    You could try the query below to get the complete output:

    select cast('me‚‚‚abcdef' as VARCHAR(50)) COLLATE Latin1_General_100_CI_AS_SC_UTF8  
    

    Output:
    me‚‚‚abcdef
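    A Python sketch of why the VARCHAR(50) workaround succeeds (cp1252 again standing in for the database code page): the byte budget is now 50, well above the 17 bytes the UTF-8 form needs.

```python
s = "me\u201a\u201a\u201aabcdef"
utf8 = s.encode("cp1252").decode("cp1252").encode("utf-8")  # lossless round trip
assert len(utf8) == 17 and len(utf8) <= 50                  # fits in VARCHAR(50)
assert utf8[:50].decode("utf-8") == s                       # nothing is cut off
```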

    Please refer more details from below link:
    Introducing UTF-8 support for SQL Server

    If the response is helpful, please click "Accept Answer" and upvote it.

    Best regards
    Melissa

