Leveraging content in other formats

WordprocessingML has a really cool feature that lets you embed content in an alternate format inside a WordprocessingML file and hand it off to a consumer, as long as you know that the consumer supports that alternate format. We had a lot of customers asking for this type of functionality, so we added alternate content anchors to the WordprocessingML format to help with document assembly scenarios.

Scenario

I'm building a document generation tool that lets users fill out a form on a web site and automatically generates a rich word processing document based on the values they entered (think of a tool like a contract generator). Some of the rich content comes back from the web form formatted as XHTML. Rather than translating the XHTML into WordprocessingML, I can just include the XHTML in the file, as long as I know that the user is going to open the file in an application that knows how to consume XHTML. If I don't know this, then of course I'll need to transform it into WordprocessingML.

This was a scenario we saw a lot of folks hitting with the earlier version of WordprocessingML from Office 2003. People had content repositories with HTML, and they didn't want to have to build an HTML to WordprocessingML translator just to get those chunks into the Word document.
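As a rough illustration of the scenario above, here is a minimal Python sketch of a producer that drops an XHTML fragment into a package and points at it with an altChunk anchor. The function name, the part name chunk1.xhtml, the relationship ID altChunk1, and the choice of application/xhtml+xml as the content type are all assumptions made for the example; whether a given consumer accepts that content type is application-defined, so treat this as a sketch rather than a recipe.

```python
import io
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
XML_DECL = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'


def build_docx_with_xhtml(xhtml: str) -> bytes:
    """Return the bytes of a minimal package whose body contains one altChunk.

    Hypothetical helper: part names and IDs are illustrative only.
    """
    content_types = (
        f'{XML_DECL}<Types xmlns="{CT_NS}">'
        '<Default Extension="rels" ContentType='
        '"application/vnd.openxmlformats-package.relationships+xml"/>'
        # Assumption: the consumer understands this content type.
        '<Default Extension="xhtml" ContentType="application/xhtml+xml"/>'
        '<Override PartName="/word/document.xml" ContentType='
        '"application/vnd.openxmlformats-officedocument.'
        'wordprocessingml.document.main+xml"/>'
        "</Types>"
    )
    package_rels = (
        f'{XML_DECL}<Relationships xmlns="{REL_NS}">'
        f'<Relationship Id="rId1" Type="{R_NS}/officeDocument" '
        'Target="word/document.xml"/>'
        "</Relationships>"
    )
    # The relationship identifies which part holds the alternate content.
    document_rels = (
        f'{XML_DECL}<Relationships xmlns="{REL_NS}">'
        f'<Relationship Id="altChunk1" Type="{R_NS}/aFChunk" '
        'Target="chunk1.xhtml"/>'
        "</Relationships>"
    )
    # The altChunk element marks where the content belongs in the document.
    document = (
        f'{XML_DECL}<w:document xmlns:w="{W_NS}" xmlns:r="{R_NS}"><w:body>'
        "<w:p><w:r><w:t>Contract terms:</w:t></w:r></w:p>"
        '<w:altChunk r:id="altChunk1"/>'
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", content_types)
        package.writestr("_rels/.rels", package_rels)
        package.writestr("word/document.xml", document)
        package.writestr("word/_rels/document.xml.rels", document_rels)
        package.writestr("word/chunk1.xhtml", xhtml)
    return buffer.getvalue()
```

Writing the returned bytes to a .docx file gives you something to try in a consumer you already know handles XHTML chunks. If you don't know that, the XHTML still has to be transformed into WordprocessingML before it goes into the file.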

Types of content allowed

This is completely up to the consuming application. The Ecma spec just defines where you put the alternate content, and how you identify it. There are no limitations on what kind of content you can place, and there are no rules on what type of content you must support. If you look at the definition in the spec, it says the type of content allowed is:

Any content, support for which is application-defined.

[Note: Some examples of formats which might be supported include:

  • Text = application/txt
  • RTF = application/rtf
  • HTML = application/html
  • XML = application/xml

end note]

So you can see that there are a few examples of the types of content you might want to support, but there are no limits or requirements.

Creating alternate content

To make it clear that supporting the various types of content that may occur isn't an additional burden, we said in the spec that a conformant producer is not allowed to create alternate content chunks. This way, when you write a consuming application, you know that you aren't required to support these alternate chunks. It's only something you can optionally decide to support. A producer should only create alternate chunks if it knows what the consumer understands. There is no guarantee that anyone else will support XHTML within the files, for example.
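On the consuming side, the optional nature of this support could look something like the following Python sketch. It finds each altChunk in the main document, resolves its relationship to a part, looks up that part's content type, and flags whether this particular consumer supports it, so that unsupported chunks can be skipped while the rest of the file is still processed. The function names, the SUPPORTED set, and the hard-coded location word/document.xml are assumptions for the example; a real consumer would locate the main document through the package relationships.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"

# Hypothetical: the formats this particular consumer chooses to import.
SUPPORTED = {"application/xhtml+xml", "text/html"}


def content_type_of(types_root, part_name):
    """Look up a part's content type; Override entries win over Default."""
    for override in types_root.findall(f"{{{CT_NS}}}Override"):
        if override.get("PartName") == part_name:
            return override.get("ContentType")
    extension = part_name.rsplit(".", 1)[-1].lower()
    for default in types_root.findall(f"{{{CT_NS}}}Default"):
        if default.get("Extension", "").lower() == extension:
            return default.get("ContentType")
    return None


def find_alt_chunks(docx_bytes):
    """Return (part name, content type, supported?) for each altChunk."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        types_root = ET.fromstring(package.read("[Content_Types].xml"))
        # Assumption: the main document lives at the conventional location.
        document = ET.fromstring(package.read("word/document.xml"))
        rels = ET.fromstring(package.read("word/_rels/document.xml.rels"))

    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.findall(f"{{{REL_NS}}}Relationship")
    }
    results = []
    for chunk in document.iter(f"{{{W_NS}}}altChunk"):
        target = targets.get(chunk.get(f"{{{R_NS}}}id"))
        if target is None:
            continue  # dangling reference: nothing to import
        # Relationship targets are relative to the source part's folder.
        part_name = posixpath.normpath(posixpath.join("/word", target))
        content_type = content_type_of(types_root, part_name)
        results.append((part_name, content_type, content_type in SUPPORTED))
    return results
```

A consumer that gets back an entry flagged as unsupported can simply leave that chunk out and carry on with the rest of the document, ideally telling the user that some content was not imported.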

Guidelines in the spec

The IBM folks are clearly spending some resources scouring the Open XML spec looking for ways in which they can try to block the ISO approval (we've already discussed the huge financial bet they've made on ODF being the only standard). It looks like Rob Weir of IBM found a rather poorly worded description of alternate chunks in Part 1 of the spec. I think we did a poor job of explaining that there is no requirement to consume alternate chunks. The spec was trying to call out that any content type can be used if you want to, but it's not a conformant document because consumers don't have to support that content. We also wanted to be clear that if a consumer does understand the alternate content, they should translate it into WordprocessingML to match the rest of the file, so that on save, it's all the same format.

This was a good catch by Rob. I agree that it could be a bit clearer in what is required of both consumers and producers. I really wish IBM had spent more energy trying to improve the spec earlier on (they are members of Ecma and could have joined the Open XML TC). This is something we easily could have cleared up the wording on. As part of the ISO fast track process though, we have a chance to gather comments from the various national bodies and make any fixes required before finalizing. This is definitely something we can look into clearing up.

-Brian

Comments

  • Anonymous
    January 17, 2007
    "This is something we easily could have cleared up the wording on." While reviewing/editing/approving at a fairly forced 18.3 pages/day[0] in order to make the Dec 2006 Ecma vote deadline?[1] Are you sure you had time? [0] http://www.robweir.com/blog/2006/12/notable-achievement.html [1] http://www.sutor.com/newsite/blog-open/?p=1281

  • Anonymous
    January 17, 2007
    "consumers don't have to support [alternate chunk] content." Hmmm.....looks to me like the standard says they do. I think Rob's interpretation of that paragraph ("A WordprocessingML consumer shall treat the contents of such legacy text files as if they were formatted using equivalent WordprocessingML [...]") is spot on. Yes, the "support for which is application-defined" does somewhat imply that different applications may support some formats better or worse than others, but the "shall" in the paragraph in question does read to me like the consumer has to do something "valid" with it. The only other part that supports your point is the "assuming that the content type of Demo.html is supported by the application" sentence in the example below. However, if the MOOXML spec is like most others, examples are informative, and if they conflict with normative text it is the example that is considered to be in error. What's the process going to be for MOOXML defect reports? Will they be publicly listed/discussed somewhere? Is there a rough guess anywhere as to when the first TC might be issued?

  • Anonymous
    January 18, 2007
    Thanks Brian for clearing that up. Weir's post was a bit confusing when I first read it. As I understand it, the "alternate content" part of WordprocessingML allows arbitrary content, including binary blobs. I understand the design reasons behind this, but I wonder whether there is anything in OOXML that will stop vendors from sneaking proprietary "extensions" into these binary blobs to achieve vendor lock-in. I mean, if I were devilAdvocateVendor1, I could write a program that only my devilApplication can read, which swaps every occurrence of "shall not" with "shall" and vice versa in the "10 commandments", and store the "10 commandments" file in such a way that an unsuspecting person using an alternative application will read exactly the opposite of what the "10 commandments" say.

  • Anonymous
    January 18, 2007
    Sinleeh, you are correct that people could put their own proprietary binary information into the file, but there is nothing in the spec that says others need to understand it. In addition to that, notice that a truly conforming producer isn't allowed to create these things. Only folks who aren't trying to create a conforming, interoperable document, and who know more about the application that will be consuming it, would use them. ODF and Open XML are both fully extensible specifications, which means that they can be improved over time, but it also means 3rd parties can add their own proprietary markup. It's kind of hard to put a restriction on this, and it's actually undesirable. An application may decide that it wants to add some stuff to Open XML that isn't worth submitting to the Ecma TC because it's too specialized. Open Office today has a ton of proprietary extensions that they've added to ODF. The way they store spreadsheet formulas, view settings, print settings, and even some layout settings (as I discussed last week) are proprietary extensions. It's not always a bad thing, as you may have things that you don't think need to be included in the standard (especially if they don't affect the interoperability of the document). -Brian

  • Anonymous
    January 19, 2007
    "there is nothing in the spec that says others need to understand it." Apart from the words "A WordprocessingML consumer shall treat the contents of such legacy text files as if they were formatted using equivalent WordprocessingML [...]" you mean. "a truly conforming producer isn't allowed to create these things." No, but that doesn't mean that a WordprocessingML consumer won't be presented with them. In which case, the standard requires that it "shall treat the contents of such legacy text files as if they were formatted using equivalent WordprocessingML [...]" "I think we did a poor job of explaining that there is no requirement to consume alternate chunks." I think that's an understatement - I think you did add a requirement (possibly unintentionally) to consume alternate chunks. This is a standard now. It's been accepted by Ecma. I realise that MS's position on standards is to treat them as guidelines instead of rules, but not everyone else does it that way. While MS's implementors may be fine with a verbal "oh, it's not really meant that way, don't worry about it", that's not good enough for other people. You (or Ecma) need to fix the spec. Or at least take the first step and issue a Defect Report. Given that MOOXML was ratified at 20x the speed of other specs, I'd expect a roughly equivalent rise in the defect density compared with other specs. This is not meant to disparage the people doing the work; I'm sure if they were given an equivalent amount of time to do the work that they'd have got on other specs, they'd have caught more problems. Given that they were skimming it though... Of course, given also that MOOXML is more than 6x as long as most other specs, I'd again expect at least an equivalent rise in the total number of defects in this spec compared with most others. (This may be conservative; a lot of defects in other specs are internal inconsistencies between different sections. As the number of pages goes up, the possible combinations of sections will rise more along the lines of the length squared.) Taken together, my guess is that this spec, due to its nature alone (length and speed of writing), will be more than 100x as buggy as most other specs. Don't you think it would be wise to let this one mature for at least a couple more years (e.g. at least until the first Technical Corrigendum, and possibly until a beta of the next version of Word after that has implemented the TC) before it goes further down the standards path (e.g. to ISO)?

  • Anonymous
    January 19, 2007
    It seems IBM has also made another donation to Groklaw or something, as it is now also trying to aid in the IBM effort to try to move the discussion for ISO certification.

  • Anonymous
    January 19, 2007
    hAl, Thanks for pointing that out. IBM has taken a very odd approach here. It's basically saying "hey we don't like this and want to block it, help us find ways to make that happen." That's a very competitive, antagonistic approach. Adam, Check out Part 4 of the spec, which is a much more detailed reference. You'll see in section 2.17.3.1 that there is a very detailed description of altChunks. In there, it clearly states that: "If an application cannot process external content of the content type specified by the targeted part, then it should ignore the specified alternate content but continue to process the file. If possible, it should also provide some indication that unknown content was not imported." -Brian

  • Anonymous
    January 22, 2007
    @Brian Thomas The whole Groklaw site is mostly dedicated to supporting IBM in its articles. It is hardly surprising people might consider it an IBM front. The suggestions of Groklaw being a very one-sided IBM supporter are not originating from me but are to be found in other places on the internet as well. The articles about the office formats regularly show direct citations from IBM bloggers and are always negative on OOXML and never negative on ODF, even though that format, and the way it proceeded to ISO standardization without being really complete, should warrant criticisms similar to those placed on OOXML. That is of course ok for a blog, but Groklaw is clearly making themselves a target for criticism as a very biased blog if they mostly write very one-sided articles. (oh and btw, you should maybe tell Marbux that insulting me on blogs is not the way to discuss issues)

  • Anonymous
    January 22, 2007
    Groklaw has absolutely no connection with IBM. http://floatingpoint.wordpress.com/2006/10/22/groklaws-non-connection-to-ibm/ I'm surprised PJ is siding with IBM on this issue. She rarely does that.
