XML processing with XQuery (part 1)

A few weeks ago, one of my friends came to me with the following problem. He was using a software music player that was storing his music library and playlists in a XML file and he wanted to do some extractions and transformations.

Here’s how the XML looks like

<song id="1" artist="Sarah Mc Lachlan" title="Fallen" /> <song id="2" artist="Moby" title="Extreme Ways" /> <song id="3" artist="Jewel" title="Becoming" /> <playlist> <song id="2"/> <song id="3"/> </playlist>

XML is supported natively by the engine like a standard datatype, so you can use XML in any place where an atomic type is allowed. There are some restrictions but it's beyond the scope of this blog entry. Let's put it in a variable. Later we can use a table if we want. For quick testing, a variable will do fine.

declare @x xml set @x = '<song id="1" artist="Sarah Mc Lachlan" title="Fallen" /> <song id="2" artist="Moby" title="Extreme Ways" /> <song id="3" artist="Jewel" title="Becoming" /> <playlist> <song id="2"/> <song id="3"/> </playlist>'

Our goal is to resolve the song ids in the playlists and produce this output

<playlist> <song id="2" artist="Moby" title="Extreme Ways" /> <song id="3" artist="Jewel" title="Becoming" /> </playlist>'

We'll use the XQuery language to achieve that. If you like skipping all the steps and go straight to the solution, here it is:

 select @x.query('<playlist> {for $id in /playlist/song/@id 
   return <song artist="{/song[@id = $id]/@artist}" title="{/song[@id = $id]/@title}"/>} </playlist>')
 

But the journey is often more rewarding that the solution itself and this trip will take us through various areas of the product like:

  • XQuery/XPath
  • Binary XML
  • Relation+ algebra
  • UDX
  • and many other things

So let's get started.

In SQL Server, the support we have for XML is based on the XML data model (XML 1.0 standard). Our XML is actually a XML fragment since it doesn't have a prologue (the <?xml ...> header) but that's fine for our purpose. This model organizes the content in a tree where elements are containers, attributes belong to elements and text nodes are leaves.

Because SQL Server works on the XML data model content, it allows for easy navigation of the tree. For instance, you can refer to the third song from the top or the first song inside the playlist using a language called XPath. Also for this reason, the database engine will not guarantee strict character by character preservation. In fact, in many case if you retrieve the XML fragement, it will not be lexically identical to the XML you inserted. We drop the XML prologue and we can reorder the attributes for a given element. Insignificant whitespaces are also dropped. There is usually a bit of a confusion regarding what is an insignificant whitespace. It's worth taking some time here to clarify.

Insignificant whitespace and whitespace entitization

Insignificant whitespace is whitespace (by whitespace, we denote space, tab, line feed and carriage return) only text nodes. The key points are whitespace only and text nodes . Let's see with some examples.

 <a>
 <b/>
</a>

That's equivalent to

 <a><b/></a> 

But if you some non white space then the whole node becomes significant

 <a>
 Hello
   <b/>
</a>

There you cannot drop anything. You can't remove whitespace inside attribute values either because it's not a text node and if it is contained in an element (directly as a parent or somewhere as an ancestor) that has xml:space='true', then it has to be kept too. Sometimes we need to entitize some characters because otherwise they would be transformed into a space.

In attribute values, we entitize tabs (&#x09;), line feeds (&#x0A) and carriage returns (&#x0D). This section of the XML spec explains the rules behind this.

In text nodes, we only entitize carriage returns. We obviously have to entitize all the 'special' characters such as &, <, >, etc. which have a particular meaning in XML.

Some applications require a stricter preservation of whitespaces. Word processors that save their documents in XML form are interested in keeping all the whitespaces. SQL Server provides an option to indicate this. When you insert the XML, there is an implicit conversion from text to XML.

set @x = '<a/>' is equivalent to set @x = convert(xml, '<a/>')

The advantage of this syntax is that you can specify additional options to the convert operation. With option 1, you indicate strict whitespace preservation

 set @x = '<a>             </a>'
select @x

returns

<a/>

but,

 set @x = convert(xml, '<a>           </a>', 1)
select @x

returns

 <a>          &#x20;</a>

Note that we entitized the last whitespace character so that if the same string is reinserted into the server, the whitespace will be preserved even if the option 1 is not present. It is important in replication scenarios where this XML is shipped to a remote server. We want to keep the replicated version identical to the original one.

Next week we'll talk about how the server stores the XML and XPath.

Comments