XslCompiledTransform Performance: Beating MSXML 4.0
Update:
Transformation times for Saxon processors have been remeasured and updated based on the feedback received from Dimitre Novatchev and Michael Kay. I also slightly altered the text below to reflect the change in Saxon command-line arguments.
Interestingly enough, the first live.com hit for the "XslCompiledTransform Performance" query at the moment is this post by Jeff Prosise, where he says he was disappointed that XslCompiledTransform ran just 3 times faster than XslTransform on a "fairly simple style sheet". He is concerned that XslCompiledTransform is not fast enough compared to the good old MSXML 4.0. Well, as we will see very soon, XslCompiledTransform may easily outperform MSXML 4.0 by several times!
Here I compare the transformation speed of several widely used XSLT processors on a few arbitrarily chosen stylesheets. I deliberately do not consider many other important aspects, such as working set, start-up time, compilation time and scalability, focusing on pure transformation time only. I tried to make all processors compete on equal terms; however, I could have missed some important details, especially for Saxon, which I know very little about. So this post should in no way be considered a thorough comparison of XSLT processors; you are encouraged to run your own scenarios with different processors and pick the one that best fits your needs.
Let's first briefly introduce today's contestants:
- MSXML 3.0. The native XSLT processor implemented in MSXML 3.0 is still used by default in Internet Explorer 6.0 and 7.0. It compiles a stylesheet to a tree of "actions", each of which knows how to "execute" itself, so it works as a pretty simple XSLT interpreter.
- MSXML 4.0. The XSLT processor in MSXML 4.0 was completely reworked. It implements a number of optimization techniques and compiles a stylesheet to some sort of P-code, which results in significantly faster transformations. This processor is also more conformant and reliable than its MSXML 3.0 predecessor. Later versions of MSXML, 5.0 and 6.0, ship the same XSLT processor as MSXML 4.0, so there is little sense in considering them separately.
- XslTransform. The first managed XSLT processor, XslTransform, was a port of the MSXML 3.0 code. Unfortunately, in addition to the bugs and performance issues ported from MSXML 3.0, some new ones were introduced during the porting process. XslTransform was good enough for many applications; however, it was clear that radical improvements were impossible without a radical rework of the code, like the one that happened between MSXML 3.0 and 4.0.
- XslCompiledTransform. The .NET Framework 2.0 introduces a new managed XSLT processor, XslCompiledTransform, which is going to replace the obsolete XslTransform class. XslCompiledTransform operates as a true compiler, translating a stylesheet into a set of dynamic MSIL methods that use a highly optimized XSLT runtime library. While compiled stylesheets run amazingly fast, the set-up costs (XSLT-to-MSIL compilation time plus JIT-compilation time) are considerably higher than for other XSLT processors, which may hinder its adoption in some applications.
- Saxon 6.5.5. Saxon is an open-source Java implementation of XSLT 1.0, developed by Michael Kay, a great XSLT enthusiast and the editor of the XSLT 2.0 specification. Version 6.5.5 was the last release of the XSLT 1.0 processor. As you can infer from Michael Kay's "XSLT and XPath Optimization" article, Saxon uses basically the same approach as the XSLT processor in MSXML 3.0, together with some optimization techniques.
- Saxon 8.7.3. The latest version of Saxon, 8.7.3, implements the recent candidate recommendations for XSLT 2.0, XQuery 1.0 and XPath 2.0, and provides better integration with the .NET platform. Although it is an XSLT 2.0 processor, it is interesting to see how fast it executes XSLT 1.0 stylesheets.
To run tests with MSXML, I used the Msxsl.exe command-line utility. I had to tweak its code a little, because the -t option for measuring load and transformation times failed to work on CPUs faster than 2 GHz. The utility was developed around September 2000, and apparently some Microsoft developers did not realize how fast processors would become in 6 years! More precisely, this part of the Timer class constructor retrieves the frequency of the high-resolution performance counter and rejects any value above INT_MAX = 2,147,483,647:
if (!::QueryPerformanceFrequency((LARGE_INTEGER *)&_freq) || _freq > INT_MAX)
{
// Counter not available
_freq = 0;
}
Below are the command-line arguments I used with Msxsl and Saxon. The number after -u specifies the version of MSXML to use; -o nul redirects output to the NUL device, so that file input/output operations affect the measurements as little as possible. The undocumented -9 option forces Saxon to repeat the transform 9 times in a row, so that we obtain the transformation time for a "warm" process. Unfortunately, the Msxsl utility does not provide a similar option, so for now MSXML 3.0 and 4.0 are at a slight disadvantage. Both Saxon processors were run under the Java™ 2 Runtime Environment version 1.4.2.
C:\XsltPerf>msxsl.exe -t -o nul -u 3.0 Kasparov-Karpov.xml chess.xsl
C:\XsltPerf>msxsl.exe -t -o nul -u 4.0 Kasparov-Karpov.xml chess.xsl
C:\XsltPerf>java -jar saxon6.5.5\saxon.jar -t -o nul -9 Kasparov-Karpov.xml chess.xsl
C:\XsltPerf>java -jar saxon8.7.3\saxon8.jar -t -o nul -9 Kasparov-Karpov.xml chess.xsl
Finally, for XslTransform and XslCompiledTransform I used the XsltPerf utility presented in my previous post. The System.Data.SqlXml assembly was NGen'd, though I doubt this could considerably affect performance in the "warm" case. As a separate step, I verified that all processors produce correct output.
For the first test, let's try the Queens stylesheet I used in the previous post. To save you from reading it, I'll recall that this XSLTMark benchmark stylesheet, developed by Oren Ben-Kiki, finds all possible solutions to the problem of placing N queens on an N×N chessboard so that no queen attacks another. XSLTMark uses N = 6, and the issue I immediately encountered was that a single run of this scenario executed too fast to make the measurements reliable. So I tweaked its input file, which originally looked like <BoardSize>6</BoardSize>, to make the stylesheet solve the same problem 20 times:
<Root>
<BoardSize>6</BoardSize>
... 18 identical lines skipped here ...
<BoardSize>6</BoardSize>
</Root>
Below are the results for my Intel® Xeon® 3 GHz box. Since XslCompiledTransform performance is affected by JIT compilation on first use, as I described in my previous post, I give the execution time of the first Transform call for this processor in parentheses. For example, for this stylesheet the first Transform call takes about 53 ms, and subsequent ones take about 34 ms.
Queens | Time, ms
---|---
XslTransform |
MSXML 3.0 SP5 |
Saxon 8.7.3J |
Saxon 6.5.5 |
MSXML 4.0 SP2 |
XslCompiledTransform | 34 (53)
As you can see, MSXML 4.0 and XslCompiledTransform are much faster than the other processors on this test; moreover, the latter is about 4 times faster than the former. I would like to note that the Queens stylesheet is rather artificial: it implements a backtracking algorithm in a language mainly oriented toward XML transformations. While it cannot be considered a real-world scenario, XslCompiledTransform performs really well even in that area. In the past, performance issues might have forced you to implement similar helper functions in a general-purpose programming language, like C# or JScript, and call them via embedded scripts or extension objects; now there is a greater chance you can implement those functions in XSLT itself and still get good performance.
For the following tests we take a couple of Sarvega XSLT Benchmark stylesheets, which represent real-world XSLT transforms. The Chess-FO stylesheet, developed by Anton Dovgyallo from the Russian Academy of Sciences, reads the sequence of moves in a chess game and produces a set of chessboard diagrams, representing every intermediate position as a graphical image in the XSL-FO format:
Kasparov–Karpov, 1990 World Championship Game
Again, MSXML 4.0 and XslCompiledTransform are several times faster than the other processors. And while the first transformation with XslCompiledTransform takes 2 times longer than with MSXML 4.0 due to JIT compilation, subsequent ones are 4 times faster.
Chess-FO | Time, ms
---|---
MSXML 3.0 SP5 |
XslTransform |
Saxon 8.7.3J |
Saxon 6.5.5 |
MSXML 4.0 SP2 |
XslCompiledTransform |
The DocBook-XHTML stylesheet, developed by Norman Walsh, transforms documents written in the DocBook format to XHTML. The input document used in the Sarvega XSLT Benchmark is rather small (under 100 KB) and produces dozens of messages during its transformation. I had to redirect those messages to a file to minimize the influence of xsl:message instructions on transformation time.
DocBook-XHTML is a huge stylesheet with thousands of templates, global parameters and variables, and you can see how badly JIT compilation affects the first stylesheet run in the case of XslCompiledTransform: 1970 ms versus 60 ms for subsequent runs. It would be really nice to be able to pre-compile and "pre-JIT" stylesheets, so that you would not pay this price again and again on each application run, but currently the .NET Framework 2.0 does not provide a means for that.
DocBook-XHTML | Time, ms
---|---
MSXML 3.0 SP5 |
XslTransform |
Saxon 6.5.5 |
Saxon 8.7.3J |
MSXML 4.0 SP2 |
XslCompiledTransform | 60 (1970)
One can draw a couple of conclusions from the results above:
- XSLT compilers steal a march on XSLT interpreters. While MSXML 4.0 is not a true compiler, the P-code it generates is close to machine code, which allows it to far surpass pure XSLT interpreters.
- XslCompiledTransform is a true XSLT compiler and may transform several times faster than MSXML 4.0. However, since the .NET Framework 2.0 currently does not allow you to save compiled stylesheets, you have to pay the compilation price on each application run.
Now it does not seem a coincidence that the latest release of the Java platform, J2SE 5.0, replaced the interpreting Xalan processor with the compiling XSLTC processor as the default XSLT engine, or that Michael Kay, the creator of Saxon, is experimenting in the same direction. However, turning an interpreter into a compiler is a highly nontrivial task. As you remember, Microsoft had to discard the old interpreter code base and start from scratch twice, and those efforts produced the swift and reliable XslCompiledTransform and MSXML 4.0 XSLT processors.
Comments
- Anonymous
July 25, 2006
Is this an apples-to-apples comparison? Are the Java timings from a cold start or a warm start? How do the timings differ if they are run from a warm server VM?
- Anonymous
July 25, 2006
Very interesting post and results!
Question: Did you make any attempt to run the same tests with Saxon on .NET 8.7.3?
- Anonymous
July 26, 2006
A good article.
To do Java applications justice you need to allow for warm-up time. Try running Saxon with the -t flag for timings, and the undocumented -9 flag to repeat the transform 9 times. The figures I get for the Chess stylesheet under Saxon 8.7.3 are in milliseconds:
741, 280, 261, 260, 251, 250, 260, 240, 250
The first run excludes the compile time of 681 ms. I don't know of course if the machine is comparable with the one you used. There's clearly some way still to go to catch up with the .NET 2.0 compiler, but the gap isn't as big as you portray. And it's not surprising that there should be a gap, since I've spent the last five years working on increased functionality while MS have been working on increased performance!
The figures I get for Saxon 6.5.5 are:
610, 331, 330, 341, 340, 331, 320, 341, 320
I think the faster time you are seeing for 6.5.5 is entirely due to the fact that the product is half the size and therefore loads faster. At steady state, these results show 8.7.3 being 25% faster than 6.5.5, and that's typical of my other measurements.
For Saxon on .NET, I'm currently seeing
761, 471, 490, 481, 481, 481, 510, 481, 491
i.e. not much better than half the Java speed. Presumably that's not an intrinsic feature of the two environments, but a measure of the overhead of cross-compiling.
Michael Kay
- Anonymous
July 28, 2006
Great Article!
It would be very interesting to see the same test run with very large files.