The flat file will never surrender: pre-processing before using the flat file disassembler?

I sometimes receive e-mails from BizTalk developers asking me to solve their specific problem. Recently, I received an e-mail from Nic asking about some flat file disassembling issues. This is great feedback, keep it coming guys!

If you have a specific question, I would prefer that you post it to the newsgroups or comment on my blog: this ensures that questions (and answers) are seen by many people, which increases your chances of getting an accurate answer. Like everybody, I make mistakes, and more eyes on a problem is always a good thing. Also, I am sure you understand that I do my best to answer your questions, but I am only human and days have only 24 hours. So please accept my sincere apologies if I cannot answer your question or if it takes some time.

So let's see what Nic is up to. Here is Nic's e-mail (I formatted it and replaced some off-topic sentences with "[...]"):

[...] So, the context is: I have one large document, which has a single header, and a number of lines (about 2500). Each line is an invoice line item, with an invoice number on it, amongst other things. There can be 1 or more lines per invoice. I can parse the file fine, so far.
The goal is to break it down into N separate documents, 1 document for each invoice (which may have 1 or more lines).
So:

  1. If I put a custom disassembler in, can I push out the originally formatted flat file info, and then put a flat file disassembler later in the pipeline to turn it into XML? The schema is way too complex (on a line-by-line basis) to do in C#, but I can do the disassemble / getnext thing quite easily - as long as I can then re-disassemble the output.
  2. Am I even going about it right? I have a format similar to this:

00HEADER
01BODYINV001
01BODYINV001
01BODYINV002
01BODYINV002
01BODYINV002
01BODYINV003

(fixed width) - eg a discriminator / tag at the start (00/01), tho I dont care about the header, then X number of (in this case) invoice lines, except they are for N invoices, all concatenated together. (eg, 2500 lines, with the line contents of 1200 invoices).
Ideally, I'd like to split them up into individual documents, eg:
<invoice id="INV0001">
<invoiceline linenum="1" otherdatahere/>
<invoiceline linenum="2" otherdatahere/>
</invoice>
(and the others in separate documents, same format)
Is a custom disassembler the right way to go about this?
Cheers for any help.
Nic [Name protected].

I am not sure I understand Nic's question number 1, so I'll take a stab at it; if I am wrong, please feel free to correct me. As I understand it, you would like to pre-process the flat file into a form that is easier to parse with the flat file disassembler. You can indeed do this. Create a custom pipeline component for the decode stage and perform your pre-processing there. Then put the out-of-the-box flat file disassembler, with the appropriate schema(s), in the disassemble stage to create the XML documents from the pre-processed flat file. For instance, you could have your pre-processor component break the file into N separate messages and have the flat file disassembler disassemble them one by one.
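To make the "break the file into N pieces" idea concrete, here is a minimal sketch of the grouping logic only. It is written in Python purely for illustration (a real pipeline component would be .NET code), and it rests on assumptions that are not confirmed by Nic's e-mail: that the record tag is the first two characters, that the invoice number sits in columns 6 to 11 as in his sample, and that all lines of an invoice are contiguous. The names `split_by_invoice`, `TAG_WIDTH` and `INVOICE_ID` are hypothetical.

```python
from itertools import groupby

TAG_WIDTH = 2               # assumption: "00" = header, "01" = invoice line
INVOICE_ID = slice(6, 12)   # assumption: column range holding the invoice number


def split_by_invoice(lines):
    """Yield (invoice_id, [lines]) for each run of consecutive invoice lines.

    The header record is skipped, since it is not needed in the output.
    """
    body = (ln for ln in lines if ln[:TAG_WIDTH] == "01")
    for invoice_id, group in groupby(body, key=lambda ln: ln[INVOICE_ID]):
        yield invoice_id, list(group)


sample = [
    "00HEADER",
    "01BODYINV001",
    "01BODYINV001",
    "01BODYINV002",
    "01BODYINV002",
    "01BODYINV002",
    "01BODYINV003",
]
chunks = list(split_by_invoice(sample))
```

With the sample above, `chunks` holds three groups, one per invoice number. Because `groupby` only merges adjacent lines, an input in which the lines of one invoice are interleaved with another would need to be sorted or indexed first.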

Another alternative is to create a disassembling component that extends the out-of-the-box flat file disassembler. This is a great feature that is often overlooked. Just inherit from FFDasmComp as explained here. You could, for instance, feed the out-of-the-box flat file disassembler only the input required to produce one message. This would reduce the number of components in the pipeline.

The out-of-the-box flat file disassembler does a great job of parsing strongly structured files, for instance when an invoice line always takes 3 lines of input in a given format. In those cases, the flat file disassembler is easy to use, fast and probably the best solution.

If records spread over a variable number of lines (for example, an invoice line is sometimes 2 lines of input and sometimes 4, and you cannot easily "predict" when it will be 4), then the flat file disassembler is more challenging to use. While the flat file disassembler sometimes works fine in those cases (provided that you can make adequate assumptions), a quick custom decode pipeline component is often sufficient.

So in your case, assuming that I have understood this right, I would suggest first running the file through a custom decode pipeline component that prepares the file so it can be easily disassembled by the out-of-the-box disassembler. Perhaps this pre-processing removes the line breaks between invoice lines of the same invoice (in your example, between the two 01BODYINV001 lines) so that the flat file disassembler only has to worry about splitting the data within a line.

It is very hard to say which approach is best. I am not sure how large your file is or how many files you wish to process per second. Implementing a fast, reliable pipeline component can be challenging if you want to achieve high throughput.
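The line-joining idea can be sketched as follows. Again this is only an illustration in Python, not BizTalk code, and it assumes the same hypothetical layout as Nic's sample: a two-character "01" tag on invoice lines and the invoice number in columns 6 to 11. The function name `join_invoice_lines` is made up for the example.

```python
from itertools import groupby


def join_invoice_lines(lines, id_cols=slice(6, 12)):
    """Yield one physical line per invoice.

    Consecutive "01" lines that share an invoice number are concatenated,
    so a downstream parser sees exactly one record per invoice.
    The header line is dropped.
    """
    body = (ln.rstrip("\r\n") for ln in lines if ln.startswith("01"))
    for _, group in groupby(body, key=lambda ln: ln[id_cols]):
        yield "".join(group)


sample = [
    "00HEADER\n",
    "01BODYINV001\n",
    "01BODYINV001\n",
    "01BODYINV002\n",
    "01BODYINV003\n",
]
joined = list(join_invoice_lines(sample))
```

Each element of `joined` is then a single record made of repeating fixed-width segments, which is the kind of shape a positional schema with a repeating child record can usually describe.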

Hope this helps.

Comments

  • Anonymous
    July 08, 2004
    Nic: Happy to have helped. As I am sure you are aware, to ensure maximum performance, I suggest you avoid reading the whole stream in the decoder and instead read only as much as is needed to fulfill the read request from the messaging engine.

    I have explained this in more detail at http://weblogs.asp.net/gzunino/archive/2004/07/07/175596.aspx
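As a rough illustration of "read only as much as is needed", the sketch below wraps a source stream and transforms it lazily, one line at a time, whenever the consumer asks for more data. It is Python rather than .NET and is not the BizTalk streaming API; the class name `LazyPreprocessedStream` and the per-line `transform` callback are hypothetical.

```python
import io


class LazyPreprocessedStream:
    """Wrap a text stream and transform it line by line, on demand."""

    def __init__(self, source, transform):
        self._source = source
        self._transform = transform   # hypothetical per-line pre-processing step
        self._buffer = ""

    def read(self, size):
        # Pull lines from the source only until the request can be satisfied.
        while len(self._buffer) < size:
            line = self._source.readline()
            if not line:              # end of input reached
                break
            self._buffer += self._transform(line)
        chunk, self._buffer = self._buffer[:size], self._buffer[size:]
        return chunk


source = io.StringIO("00HEADER\n01BODYINV001\n01BODYINV002\n")
stream = LazyPreprocessedStream(source, str.lower)
first = stream.read(4)
consumed = source.tell()   # how far the underlying stream has been read
```

After the first small read, only the first line of the source has been consumed; the rest of the file is left untouched until the consumer asks for it. Memory use therefore stays proportional to the read size rather than to the file size.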