The eXcel files: Do not use Office Object Model when processing Office documents with BizTalk 2004
Every once in a while, someone posts on microsoft.public.biztalk a question about processing Office files in BizTalk. Sometimes it is Word documents, sometimes it is Excel workbooks. This post explains a typical trap you should not fall in and why.
What are the options?
In general, you have at least the following options:
- Do not process the file itself but process a reference to the file (i.e. URL to the file for instance),
- Arrange for the file to be saved in XML before submission to BizTalk 2004,
- Arrange for the file to be saved in a comma separated format (or any other delimited format) and use the flat file disassembler to produce XML messages in the receive pipeline,
- Write a custom receive pipeline component to crack open the Office file and build an XML message yourself.
These options are sorted by complexity: the reference processing is indeed the simplest way and does not offer much flexibility. Writing a custom pipeline component is the most complex solution with high flexibility.
Option 1 is very good if your intent is to perform some kind of workflow on the document (i.e. document workflow process). You will save large amounts of bandwidth on your network and improve performances on the BizTalk server by not sending the actual document to BizTalk for processing. Instead, send a small XML document that contains all the informations you need and add a reference (an URL for instance) to the actual file. The file can be stored in Windows Sharepoint Services, on a file share ... Of course this option may not always be possible. Often, business processes require data from within the document to execute properly.
Option 2 suggests that you put the conversion burden on whoever submits the file to your BizTalk server. The conversion happens before submitting the document to BizTalk so the document is already XML when you receive it. Microsoft Office 2003 allows to save documents in XML so it might be easy to write a quick VBA Excel macro or an Office AddIn (COM or C# / VB.NET) that converts the file to XML and then submits the result to BizTalk. Users could even be presented with a new button on the toolbar "Submit ..." that triggers the macro and the submission to the back end system for processing.
Option 3 takes advantage of the fact that in practice, Comma Separated files have been the data exchange format between businesses for a very long time. If your Excel documents happen to be stored as comma separated files (or if you can arrange for them to be saved in that format easily), then you can use the BizTalk flat file disassembler to create XML messages. In fact, files do not have to be comma separated. The flat file disassembler is highly customizable and can parse most of delimited files as well as most positional files. For more informations, you can refer to this page .
Option 4 requires you to write a BizTalk 2004 custom component (a receive pipeline disassembler component to be precise) that will take an Office file and will convert it to an XML document. I have already outlined how a custom disassembler can be implemented . Office documents can get fairly large so the disassembler should be able to operate without loading the whole document in memory.
However, this approach requires a way to understand the binary format of the specific Office document you are parsing. For instance, there are numerous documents on the internet which describe the Excel file format and one might even find libraries allowing to manipulate Office files programmatically. Another way is to use the appropriate Office application's Object Model to load the document and manipulate it.
There are caveats in using Office applications to load the document on the server: Office applications are client applications and were never tested in a server environment. This means that Office applications may or may not behave appropriately under stress and/or performances may or may not degrade when the number of documents to load grows. Also, Word or Excel are quite heavy in memory. This web page on Microsoft web site explains with more details why it is no a good idea to run an Office application on the server side.
In general, no one should be using client components on a server. For instance, XMLHTTP is a client component and should not be used on the server side (because it relies on WinInet, a client side component).