Authoring Well-Formed HTML

Well-formed HTML, or XHTML, simply means HTML that conforms to the rules of XML. The same HTML tags are available, but the stricter XML syntax is required. An XSLT style sheet is itself an XML document, so any HTML within it must be well formed.

In addition to HTML within an XSLT style sheet, you should consider authoring well-formed HTML for its own sake. The industry is moving toward well-formed HTML as a way to make the Web more robust, while simplifying and accelerating the processing of well-formed documents and data. Well-formed HTML has great advantages for authoring tools and can benefit hand authoring by ensuring that the markup is unambiguous. The industry expectation is that a future HTML standard will be an XML application.

The price for these benefits is that a less forgiving syntax must be used.

Writing well-formed HTML is simple. Here are the basic rules to follow as you author or convert to well-formed HTML.

All tags must be closed

HTML allows certain end tags to be optional, the most common being <P>, <LI>, <TR>, and <TD>. XML requires all tags to be closed explicitly. The following table shows tags in basic HTML compared to well-formed HTML.

HTML                             Well-formed HTML
<P>This is an HTML paragraph.    <P>This is an HTML paragraph.</P>
<P>or two.                       <P>or two.</P>

Leaf nodes (elements that have no content) must also be closed, by placing a forward slash (/) just before the closing angle bracket of the tag. The most common examples are <BR>, <HR>, <INPUT>, and <IMG>. The following table shows leaf nodes in both basic and well-formed HTML.

HTML                          Well-formed HTML
<IMG src="sample.gif"         <IMG src="sample.gif"
   width="10" height="20">       width="10" height="20" />
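
One way to check the closing rules above mechanically is to hand a fragment to any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper name is_well_formed and the enclosing <BODY> element are assumptions made for illustration; they are not part of any Microsoft tool.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if markup parses as a single well-formed XML element."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Basic HTML: the <P> and <IMG> tags are never closed.
html_paragraph = "<BODY><P>This is an HTML paragraph.<P>or two.</BODY>"
html_image = '<BODY><IMG src="sample.gif" width="10" height="20"></BODY>'

# Well-formed HTML: every tag is closed explicitly.
xml_paragraph = "<BODY><P>This is an HTML paragraph.</P><P>or two.</P></BODY>"
xml_image = '<BODY><IMG src="sample.gif" width="10" height="20" /></BODY>'

results = [is_well_formed(m) for m in
           (html_paragraph, html_image, xml_paragraph, xml_image)]
print(results)  # [False, False, True, True]
```

A check like this only confirms XML well-formedness; it says nothing about whether the element names are valid HTML.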

No overlapping tags are allowed

XML does not allow start and end tags to overlap; it enforces a strict hierarchy within the document, so an element opened inside another element must also be closed inside it. The following table shows overlapping tags and a properly nested equivalent.

HTML                               Well-formed HTML
<B>Well <I>Hello</B> Dolly!</I>    <B>Well</B> <I><B>Hello</B> Dolly!</I>
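
To see the strict hierarchy that results from proper nesting, the fragments above can be parsed and the element tree printed. This is a sketch under stated assumptions: it uses Python's standard xml.etree.ElementTree module, wraps the fragments in a <P> element so that each has a single root, and introduces a hypothetical helper named outline.

```python
import xml.etree.ElementTree as ET

overlapping = "<P><B>Well <I>Hello</B> Dolly!</I></P>"
nested = "<P><B>Well</B> <I><B>Hello</B> Dolly!</I></P>"

# The overlapping version is rejected outright by the parser.
try:
    ET.fromstring(overlapping)
    overlap_error = None
except ET.ParseError as err:
    overlap_error = str(err)  # the parser's description of the failure

def outline(element, depth=0):
    """Return the element names as an indented list, one line per element."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

# The nested version yields a clean tree: P contains B and I, and I contains B.
tree_outline = outline(ET.fromstring(nested))
print(overlap_error)
print("\n".join(tree_outline))
```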

Case matters

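XML is case-sensitive, so each end tag must match the case of its start tag exactly. Choose a consistent case for all tags. The examples here use uppercase for HTML elements; note that the XHTML 1.0 specification itself requires lowercase element and attribute names. The following table shows how case matching should appear in well-formed HTML.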

HTML                          Well-formed HTML
<B><i>Hello Dolly!</I></b>    <B><I>Hello Dolly!</I></B>
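
The case rule follows from the fact that an XML parser treats <B> and <b> as two different element names. The sketch below illustrates this with Python's standard xml.etree.ElementTree module; the helper name parses is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

def parses(markup):
    """Return True if the markup is accepted by the XML parser."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

mixed_case = "<B><i>Hello Dolly!</I></b>"    # end tags do not match start tags
matched_case = "<B><I>Hello Dolly!</I></B>"  # every pair matches exactly

# <B> and <b> can even be nested, because they are distinct names.
root = ET.fromstring("<B><b>Hello Dolly!</b></B>")
names = [root.tag, root[0].tag]

print(parses(mixed_case), parses(matched_case), names)
# False True ['B', 'b']
```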

Quote your attributes

All attribute values must be enclosed in either single or double quotation marks. The following table shows how to quote attribute values appropriately.

HTML                       Well-formed HTML
<IMG src=sample.gif        <IMG src='sample.gif'
   width=10 height=20 >       width="10" height="20" />
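
Single and double quotation marks are interchangeable as far as the parser is concerned; only their absence is an error. A short sketch, assuming Python's standard xml.etree.ElementTree module as the parser:

```python
import xml.etree.ElementTree as ET

unquoted = "<IMG src=sample.gif width=10 height=20 />"
quoted = """<IMG src='sample.gif'
   width="10" height="20" />"""

# Unquoted attribute values are not well formed.
try:
    ET.fromstring(unquoted)
    unquoted_ok = True
except ET.ParseError:
    unquoted_ok = False

# Either quoting style produces the same plain attribute values.
attributes = ET.fromstring(quoted).attrib

print(unquoted_ok)  # False
print(attributes)   # {'src': 'sample.gif', 'width': '10', 'height': '20'}
```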

Use a single root

Shortcuts that eliminate the <HTML> element as the single top-level element are not allowed. The following table shows how to properly include the <HTML> element.

HTML                                      Well-formed HTML
<TITLE>Funky markup</TITLE>               <HTML>
<BODY>                                      <HEAD>
  <P>Amazing that this HTML works.</P>        <TITLE>Clean markup</TITLE>
</BODY>                                     </HEAD>
                                            <BODY>
                                              <P>Not nearly so amazing that
                                              this well-formed HTML works.</P>
                                            </BODY>
                                          </HTML>
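
The single-root rule can be checked the same way as the others. In the sketch below, which assumes Python's standard xml.etree.ElementTree module, the parser stops at the second top-level element of the first document, while the second document can be navigated from its <HTML> root.

```python
import xml.etree.ElementTree as ET

no_root = """<TITLE>Funky markup</TITLE>
<BODY>
  <P>Amazing that this HTML works.</P>
</BODY>"""

single_root = """<HTML>
  <HEAD>
    <TITLE>Clean markup</TITLE>
  </HEAD>
  <BODY>
    <P>Not nearly so amazing that
    this well-formed HTML works.</P>
  </BODY>
</HTML>"""

# Two top-level elements: the parser rejects everything after </TITLE>.
try:
    ET.fromstring(no_root)
    no_root_error = None
except ET.ParseError as err:
    no_root_error = str(err)

# One top-level element: the whole document hangs off <HTML>.
root = ET.fromstring(single_root)
title = root.findtext("HEAD/TITLE")

print(no_root_error)
print(root.tag, title)
```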

Fewer built-in entities

XML defines only the following minimal set of built-in character entities:

  • &lt; — (<)
  • &gt; — (>)
  • &amp; — (&)
  • &quot; — (")
  • &apos; — (')

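Other named entities that HTML defines, such as &nbsp;, are not built into XML. Numeric character references, such as &#160;, are supported and can be used in their place.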
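
When converting existing HTML, named entities outside this set can be rewritten as numeric character references. The following is a rough sketch of that conversion, assuming Python's standard html.entities table as the source of the HTML entity names. The function name to_numeric_refs and the sample markup are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities that XML itself defines; these are left untouched.
XML_BUILTINS = {"lt", "gt", "amp", "quot", "apos"}

def to_numeric_refs(markup):
    """Replace HTML-only named entities with numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_BUILTINS or name not in name2codepoint:
            return match.group(0)  # built-in or unknown: leave as written
        return "&#{};".format(name2codepoint[name])
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, markup)

source = "<P>Caf&eacute;&nbsp;&amp; bar &copy;</P>"
converted = to_numeric_refs(source)

# The converted markup now parses with a plain XML parser.
text = ET.fromstring(converted).text

print(converted)  # <P>Caf&#233;&#160;&amp; bar &#169;</P>
```

This simple pattern also rewrites entity-like text inside CDATA sections and comments, so it is best applied to markup that does not yet contain them.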

Escape script blocks

Script blocks in HTML can contain characters that an XML parser interprets as markup, such as < and &. These must be escaped in well-formed HTML, either by using character entities or by enclosing the script block in a CDATA section.

In addition, Microsoft JScript® (compatible with the ECMA 262 language specification) comments terminate at the end of the line, so preserving the white space within script blocks that contain comments is important. Under the default xml:space behavior, white space is normalized by compressing adjacent white space characters into a single space. This destroys the new line that terminates the JScript comment. Any JScript following the comment is then treated as part of the comment and ignored, often resulting in script errors. Enclosing the script in a CDATA section also ensures that the white space is preserved.

The following table shows an HTML script block that contains both a character that cannot be parsed (<) and JScript comments. The well-formed script block uses a CDATA section to encapsulate the script.

HTML                              Well-formed HTML
<SCRIPT>                          <SCRIPT><![CDATA[
  // checks a number against 7      // checks a number against 7
  function lessThanSeven(n)         function lessThanSeven(n)
  {                                 {
    return n < 7;                     return n < 7;
  }                                 }
</SCRIPT>                         ]]></SCRIPT>

Not all scripts will fail if they are not escaped in this way; however, Microsoft recommends that you do it as a matter of habit. This ensures not only that the script works if it contains such characters or comments now, but also that it continues to work if they are added in the future.
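
Both escaping options described above can be compared with a small sketch. It assumes Python's standard xml.etree.ElementTree parser and the escape function from xml.sax.saxutils, so it illustrates general XML parsing rather than the behavior of any particular Microsoft parser. The helper name parsed_text is hypothetical, and the script uses a hyphen-free function name because hyphens are not valid in JScript identifiers.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

script = """
  // checks a number against 7
  function lessThanSeven(n)
  {
    return n < 7;
  }
"""

raw = "<SCRIPT>" + script + "</SCRIPT>"
with_cdata = "<SCRIPT><![CDATA[" + script + "]]></SCRIPT>"
with_entities = "<SCRIPT>" + escape(script) + "</SCRIPT>"  # < becomes &lt;

def parsed_text(markup):
    """Return the element's text, or None if the markup is not well formed."""
    try:
        return ET.fromstring(markup).text
    except ET.ParseError:
        return None

print(parsed_text(raw) is None)              # True: the bare < is rejected
print(parsed_text(with_cdata) == script)     # True: text and new lines intact
print(parsed_text(with_entities) == script)  # True: &lt; decodes back to <
```

The CDATA form keeps the script readable in the source, but it cannot contain the sequence ]]> because that sequence ends the section; the entity form has no such restriction.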

 Last updated on Saturday, April 10, 2004

© 1992-2003 Microsoft Corporation. All rights reserved.