How to efficiently generate OpenXML Workbook from Excel files containing very large amount of data

Few days back, one of my colleague asked me for my help for a requirement to convert an Excel 97-2003 Workbook to an OpenXML based workbook. Although it may sound pretty easy, the catch was that the file to be converted had around millions of rows (you read it right, Millions!). If we go by the standard technique of looping through the cells of the source file, and generate the corresponding XML for the OpenXML file, it would take, if not ages, yet a substantial amount of time to complete.

So, the question was how to efficiently generate the OpenXML from an Excel workbook that contains huge amount of data?

In order to solve this, we decided to use the XSL Transform technique on our source XML’s data to generate the required content for OpenXML. In this post, I am going to show how we achieved this.

Here is the overall flow of what we are going to implement here:

Flow

Before we go into the details of this, let us briefly discuss what is XSL transform. I’ll not go into the details of XSLT, however the overall idea is that using XSLT (eXtensible Stylesheet Language Transformations) file we can transform an XML data to some other format, for e.g. XHTML; and in our case to XML based on a different schema.

For more details and explanation of XSLT, please see following links:

 

Extracting Data From Excel Workbook

Our first task is to extract the data contained in the Excel workbook into an XML file. We will achieve this by using XML mapping in Excel. Here is the snapshot of the workbook that I used for our example:

 ExcelBook

Here is the XSD file that I used to map the data in the above workbook; note that the contents of this file will depend on the structure of the source Excel workbook:

 <?xml version="1.0" encoding="utf-8"?>
<xs:schema elementFormDefault="qualified" attributeFormDefault="unqualified" xmlns:xs="https://www.w3.org/2001/XMLSchema">
  <xs:element name ="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" name="data">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="object" type="xs:string"/>
              <xs:element name="price" type="xs:integer"/>
              <xs:element name="count" type="xs:short"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

After mapping this schema, we can extract the data from Excel by saving the current file as XML Data (*.xml) format. Here the XML that was generated from my example workbook using the above schema; let’s call it data.xml:

 <root>
    <data>
        <object>Table</object>
        <price>150</price>
        <count>1</count>
    </data>
    ...
</root>

After generating our data XML, next task is to prepare XSLT for transforming this data XML to our required format.

 

 

Preparing XSLT

In my XSLT, I have two templates that I apply on the generated Data XML, one at the root node, and the other for each data element and its children.

For each data node, we need to generate a row element as per Spreadsheet ML, and for each child element of data node we need to generate col elements. Each row element that we need to generate, must have an attribute named r which will contain the row number. Similarly for each col node we need to generate an attribute named r that needs to have value of the cell name which this column corresponds to.

For e.g. for first row in our data.xml our transformed row element should have an attribute r=”1” , and its first col element should have an attribute r=”A1” , second col element should have r=”B1” and so on.

Now that actual data value that is contained in a cell, can be maintained in following two ways according to Spreadsheet ML specifications:

  1. Keep each value in a separate file called sharedStrings.xml in the OpenXML package
  2. Keep them as inline strings as part of the sheet.xml

We will be using the second way to generate our XML. The second way also requires that each col element specifies the type attribute (t) with value as inlineStr. This tells Excel that it contains an inline string and Excel doesn’t need to search for the value from sharedStrings.xml. The actual content of the inline string is contained in the child element named is under the col node.

Here is the XSLT that I used for transforming our data XML to the required XML:

 <?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="https://www.w3.org/1999/XSL/Transform"
    xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="root">
    <xsl:element name="sheetData">
      <xsl:apply-templates select="data"></xsl:apply-templates>
    </xsl:element>
  </xsl:template>

  <xsl:template match="data">
    <xsl:variable name="rowID">
      <xsl:number value="position()" format="1"/>
    </xsl:variable>
    <xsl:element name="row">
      <xsl:attribute name="r">
        <xsl:value-of select="$rowID"/>
      </xsl:attribute>

      <xsl:for-each select="*">
        <xsl:element name="c">
          <xsl:variable name="colID">
            <xsl:number value="position()" format="A"/>
          </xsl:variable>
          <xsl:attribute name="r">
            <xsl:value-of  select="concat(string($colID),string($rowID))"/>
          </xsl:attribute>
          <xsl:attribute name="t">
            <xsl:text>inlineStr</xsl:text>
          </xsl:attribute>
          <xsl:element name="is">
            <xsl:element name="t">
              <xsl:value-of select="."/>
            </xsl:element>
          </xsl:element>
        </xsl:element>
      </xsl:for-each>
      
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>

 

Generating transformed XML using the XSLT

Now that we have our XSLT ready, we need to transform our data XML into our desired format using this XLST. For this example, I used a C# based Windows Forms application to perform this task.

Here is the code snippet I used:

 XPathDocument xDoc = new XPathDocument("data.xml");
XslCompiledTransform xXslt = new XslCompiledTransform();
xXslt.Load("test.xslt");

XmlTextWriter xOutput = new XmlTextWriter("output.xml", null);

xXslt.Transform(xDoc, null, xOutput);

xOutput.Close();

 

Running this snippet, results in an XML in following format:

 <?xml version="1.0" encoding="utf-8"?>
<sheetData>
  <row r="1">
    <c r="A1" t="inlineStr">
      <is>
        <t>Table</t>
      </is>
    </c>
    <c r="B1" t="inlineStr">
      <is>
        <t>150</t>
      </is>
    </c>
    <c r="C1" t="inlineStr">
      <is>
        <t>1</t>
      </is>
    </c>
  </row>
  ...
  ...
</sheetData>

 

Generating XLSX from our transformed XML

This involves following steps:

  1. Generate a blank XLSX file in memory. I have used OpenXML 2.0 SDK for this task, and also referred to this post:

         How To generate a .xlsx using Open XML SDK without loading any .xml

  2. Insert the sheetData element from our transformed XML into the sheet.xml

Here is the code snippet I used:

 // Read the generated XML
StreamReader sr = File.OpenText("output.xml");
string strSheetData = sr.ReadToEnd();
 
using (SpreadsheetDocument doc = SpreadsheetDocument.Create("output.xlsx", DocumentFormat.OpenXml.SpreadsheetDocumentType.Workbook))
{
    WorkbookPart workbook = doc.AddWorkbookPart();
    WorksheetPart sheet = workbook.AddNewPart<WorksheetPart>();
    string sheetId = workbook.GetIdOfPart(sheet);
 
    // Create a blank XLSX file
    string XML = @"<?xml version=""1.0"" encoding=""UTF-8"" standalone=""yes""?><workbook xmlns=""https://schemas.openxmlformats.org/spreadsheetml/2006/main"" xmlns:r=""https://schemas.openxmlformats.org/officeDocument/2006/relationships""><sheets><sheet name=""{1}"" sheetId=""1"" r:id=""{0}"" /></sheets></workbook>";
    XML = string.Format(XML, sheetId, "Sheet1");
    this.AddPartXml(workbook, XML);
 
    // Insert our sheetData element to the sheet1.xml
    XML = @"<?xml version=""1.0"" encoding=""UTF-8"" standalone=""yes""?><worksheet xmlns=""https://schemas.openxmlformats.org/spreadsheetml/2006/main"" >{0}</worksheet>";
    XML = string.Format(XML, strSheetData);
    this.AddPartXml(sheet, XML);
 
    doc.Close();
}
  protected void AddPartXml(OpenXmlPart part, string xml)
{
    using (Stream stream = part.GetStream())
    {
        byte[] buffer = (new UTF8Encoding()).GetBytes(xml);
        stream.Write(buffer, 0, buffer.Length);
    }
}

 

After running this code snippet, a file named output.xlsx is generated with our transformed XML inserted into its sheet.xml. Here is how it looked on my machine, when opened in Excel 2007:

Generated XLSX

 

As an example, I have just inserted my data in the sheet.xml. In a more realistic scenario, you might want to implement extra tasks like:

  • Pull data from a database as XML (instead of extracting it from an XLS file) and transform that to the required XML using XSLT.
  • Add header row to the sheet.xml in addition to the data
  • Define styles for the columns. You might have to modify your XSLT to generate columns with associated styles.
  • Format columns for different types of data.

You can also play around with XSLT to generate a full sheet.xml for your requirement, rather than generating just the sheetData element of sheet.xml.

With these I am ending this post. Please feel free to post your queries as comments, I will definitely try to answer them.

Comments

  • Anonymous
    December 13, 2010
    Thanks for your post. It's useful. I want to add header row to the sheet.xml and the add the style for excel file. Can you guide me to do this? Thanks.

  • Anonymous
    June 09, 2011
    Hello..      I just wanted to know how to create multiple sheets inside one workbook? I'm getting 1 XML from the database that contains the data for all the sheets that has to come in the workbook. Please help me out in this case. Thanks in advance.

  • Anonymous
    June 09, 2011
    Hi Veeresh, You can achieve this by, a) Add a new <sheet> tag in the workbook XML,   b) add a new worksheet part and feed the data corresponding to the this new sheet (extracting it from the one XML that you get from the database). Hope this helps! Regards, Manvir Singh (http://www.manvirsingh.net/)

  • Anonymous
    June 11, 2011
    Hello Manvir Singh,     I had tried doing as u sujested to my previous post, it downloading an xlsx file for me. but when i tried to open the file it is saying file is corrupted. here is the code i'm trying to use please let me know if any changes has to be done for the following. private void button1_Click(object sender, EventArgs e)            {                ArrayList DataNode = new ArrayList();                XmlDocument xmlobj = new XmlDocument();                ArrayList FinalXML = new ArrayList();                XslCompiledTransform xXslt = new XslCompiledTransform();                xmlobj.Load(@"D:ExcelImportInput.xml");                xXslt.Load(@"D:ExcelImportdemoxsl.xslt");                XmlNodeList DN ;                DN = xmlobj.DocumentElement.GetElementsByTagName("Data");                for (int i = 0; i < DN.Count; i++)                {                    DataNode.Add("<ShaleDataExport><Data Flag = '" + i + "' >" + DN.Item(i).InnerXml + "</Data></ShaleDataExport>");                    }                string ShaleDataExportXML;                int k = 0 ;                while (k < DN.Count)                {                    ShaleDataExportXML = DataNode[k].ToString();                    XmlDocument xml =  new XmlDocument();                    xml.LoadXml(ShaleDataExportXML);                    StringWriter sw = new StringWriter();                      xXslt.Transform(xml, null, sw);                    FinalXML.Add(sw);                    sw.Close();                    k++;                }                using (SpreadsheetDocument doc = SpreadsheetDocument.Create(@"D:ExcelImportOutPutOutPut.xlsx", DocumentFormat.OpenXml.SpreadsheetDocumentType.Workbook))                {                    WorkbookPart workbook = doc.AddWorkbookPart();                     string XML;                     string WorbookXML;                     WorbookXML = @"<?xml version=""1.0"" encoding=""UTF-8"" standalone=""yes""?><workbook xmlns=""schemas.openxmlformats.org/.../main"" xmlns:r=""schemas.openxmlformats.org/.../relationships""><sheets>";                     for (int j = 0; j < DN.Count; j++)                     {                         WorksheetPart[] sheet = new WorksheetPart[DN.Count];                         sheet[j] = workbook.AddNewPart<WorksheetPart>();                         string sheetId = workbook.GetIdOfPart(sheet[j]);                         XML = @"<?xml version=""1.0"" encoding=""UTF-8"" standalone=""yes""?><worksheet xmlns=""schemas.openxmlformats.org/.../main"" >";                         XML += FinalXML[j].ToString() + "</worksheet>";                         string SheetXML = XML.ToString();                         XmlDocument SXML = new XmlDocument();                         SXML.LoadXml(SheetXML);                         byte[] byteArray = Encoding.ASCII.GetBytes(SXML.OuterXml);                         MemoryStream stream = new MemoryStream(byteArray);                         StreamReader reader = new StreamReader(stream);                         string text = reader.ReadToEnd();                          WorbookXML += "<sheet name="+ AddPartXml(sheet[j], text) + " sheetId=" + j.ToString() + "  r:id=" + sheetId.ToString() + " />";                     }                     WorbookXML += "</sheets></workbook>";                     AddPartXml(workbook, WorbookXML);                     doc.Close();                }            }            public string  AddPartXml(OpenXmlPart part, string xml)            {                Uri uri = part.Uri;                String[] sheetNames = uri.OriginalString.Split('/');                string sheetName = sheetNames[sheetNames.Length - 1].Split('.')[0];                using (Stream stream = part.GetStream())                {                    byte[] buffer = (new UTF8Encoding()).GetBytes(xml);                    stream.Write(buffer, 0, buffer.Length);                }                return sheetName;            } Thanks in advance Veeresh

  • Anonymous
    July 13, 2011
    Hi Veeresh Nannivala, Create file in excel with you data and save. Rename file [name file].zip and extract archive. Create file in you programm, rename file and extract archive. Compare file data. You find errors.

  • Anonymous
    July 09, 2012
    Hi Quemer, I also faced the same problem what the Veeresh have. I have tried to fix but not got proper solution. I extracted the file from the zip file as you told to Veeresh to do. I got files in xml/worksheet/sheet.xml. It looks same ,what done for single sheet. Can you please check this issue and give me some solution for this issue. Thanks in advance, Vineet Mangal

  • Anonymous
    July 11, 2012
    Hi Quemer, I got my fault and corrected the code. The problem was with the escaping the string when mentioning the sheet name in worksheetpart.

  • Anonymous
    July 24, 2012
    How did you generate the xslt file? Was it autogenerated?

  • Anonymous
    July 24, 2012
    Hi TM, The xslt file was written manually and is not autogenerated. Yes, that means you will need to have some basic understanding of xslt to use this technique. HTH.