How to preserve whitespaces

For once I'm going to venture outside of the realm of typed XML and address an issue that might be of concern to users of untyped XML as well.

We've been asked a few times if we had a solution to preserve non-significant whitespaces in an XML instance. The short answer is "Yes".

First, let's sum up the problem with this simple example

declare @var XML
SET @var = '
<root>
<a>
<b>Data b</b>
</a>
</root>'
SELECT @var
go

The query returns
<root><a><b>Data b</b></a></root>

We gave the server a nicely formatted XML instance, with all the indentations, but when we get it back, the formatting is gone. In short, the server, got rid of all the whitespace-only text nodes.

You can use the xml:space attribute and set its value to "preserve" but that is not always practical. In particular, if you use typed XML, the attribute will be rejected unless your schema specifically contains a reference to xml:space.

Fortunately, SQL Server 2005 contains a mechanism to achieve whitespace preservation. If a whitespace-only text node contains at least one entitized character then it is preserved.
For example, if we do this

DECLARE @var XML
SET @var = '<root>&#x09;<a>&#x0A; &#x09;<b>Data b</b>&#x0A; </a>&#x0A;</root>'
SELECT @var
go

the query returns
<root>&#x09;<a>
&#x09;<b>Data b</b>
&#x09;</a>&#x0A;</root>

The parser preserves whitespace-only text nodes and for each of them it entitizes the last character.

Of course SQL Server 2005 provides a way to do it for you in the form of an optional parameter for the CONVERT function, like this

DECLARE @var XML
SET @var = CONVERT (XML,'<root>
<a>
<b>Data b</b>
</a>
</root>',1)
SELECT @var
go

The query returns
<root>
&#x09;<a>
&#x09;<b>Data b</b>
&#x09;</a>&#x0A;</root>

Things are very different for typed XML. Unless the whitespace-only text nodes are part of mixed content, they will not be preserved, no matter what option you use.
Here is an example

create xml schema collection SC as '<xsd:schema xmlns:xsd="https://www.w3.org/2001/XMLSchema">

<xsd:complexType name="T_mixed" mixed="true">
<xsd:sequence>
<xsd:element name="a" type="xsd:byte"/>
<xsd:element name="b" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>

 <xsd:complexType name="T" mixed="false">
<xsd:sequence>
<xsd:element name="a" type="xsd:byte"/>
<xsd:element name="b" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>

 <xsd:element name="E_mixed" type="T_mixed"/>

 <xsd:element name="E" type="T"/>

</xsd:schema>
'
go

DECLARE @var XML(SC)
SET @var = CONVERT(XML(SC),
'<E>
<a>1</a>
<b>Data b</b>
</E>', 1)
SELECT @var
go

The query returns
<E><a>1</a><b>Data b</b></E>

DECLARE @var XML(SC)
SET @var = CONVERT(XML(SC),
'<E_mixed>
<a>1</a>
<b>Data b</b>
</E_mixed>', 1)
SELECT @var
go

This time the query returns
<E_mixed>
&#x09;<a>1</a>
&#x09;<b>Data b</b>&#x0A;</E_mixed>

As you can see the non significant whitespaces are only preserved in the second case.

If we don't use the "1" option for the CONVERT option, whitespace only text nodes are discarded, even in the case of mixed content models as shown in the example below

DECLARE @var XML(SC)
SET @var = CONVERT(XML(SC),
'<E_mixed>
<a>1</a>
<b>Data b</b>
</E_mixed>')
SELECT @var
go

The query returns
<E_mixed><a>1</a><b>Data b</b></E_mixed>

Please be mindful that if in a same XML instance you have both mixed and unmixed content models, the behavior will vary.
Let's update our schema collection and run an example

alter xml schema collection SC add '<xsd:schema xmlns:xsd="https://www.w3.org/2001/XMLSchema">
<xsd:element name="root">
<xsd:complexType>
<xsd:choice minOccurs="0" maxOccurs="unbounded">
<xsd:element ref="E"/>
<xsd:element ref="E_mixed"/>
</xsd:choice>
</xsd:complexType>
</xsd:element>

</xsd:schema>'
go

Now let's run the following query

DECLARE @var XML(SC)
SET @var = CONVERT(XML(SC),
'<root>
<E>
<a>1</a>
<b>non mixed</b>
</E>
<E_mixed>
<a>2</a>
<b>mixed</b>
</E_mixed>
</root>', 1)
SELECT @var
go

The result is
 <root><E><a>1</a><b>non mixed</b></E><E_mixed>
&#x09;<a>2</a>
&#x09;<b>mixed<
/b>
&#x09;</E_mixed></root>

As you can see whitespace-only text nodes were preserved under E_mixed but not under E.

Finally, I'll leave you with one small piece of advice: always be mindful of leading and trailing whitespaces. If you use CONVERT with the option that preserves whitespaces they will be seen as text nodes, which can be a problem.
For example

DECLARE @var XML(SC)
SET @var = CONVERT(XML(SC),
' <root/>', 1)
SELECT @var
go

will fail with the following error
XML Validation: Text node is not allowed at this location, the type was defined with element only content or with simple content. Location: /

The leading spaces were preserved and became a text node, which is not allowed at the top level for typed XML when at least one global element definition is available, hence the error.

-
Disclaimer:
This posting is provided “AS IS” with no warranties, and confers no rights.
Use of included script samples are subject to the terms specified at https://www.microsoft.com/info/cpyright.htm.

Comments

  • Anonymous
    September 14, 2006
    This is a little off topic, but is it possible to allow text nodes where a schema of node types is defined?

    I'm getting the:

    "XML Validation: Text node is not allowed at this location"

    ...error you mention, but in my case I want to store a subset of XHTML and text nodes should be allowed.

    I used an XmlValidationReader to confirm that an XHTML fragment is valid given my schema, but apparently SQL Server 2005 behaves a bit unexpectedly and it won't accept the input when it has text nodes.