How to identify heading (chapter) numbers in Word without relying on the Table of Contents, using the Open XML SDK

Oscar Vagle 21 Reputation points
2022-07-07T13:47:31.857+00:00

I am trying to read the headings from a Word document and label each one with its heading number using the Open XML SDK.
The problem is that I can't find a good way to retrieve the heading numbers.
Example document
The picture below shows a simple Word document with two headings:
![218539-simpleword-doc-one-header.png][1]
When I inspect the XML for the "Small header" heading, I can't find the heading number 1.1.1.1 anywhere.
Below is a screenshot of the XML for that heading.
![220416-xml-header-small-header.png][2]
The XML does not include the heading number "1.1.1.1"; only the heading text "Small header" is there.
I have searched the main XML file ("document.xml") for "1.1.1.1", but the only place it appears is in the Table of Contents XML:
![220417-xml-table-of-content-small-header.png][3]
A possible (but hacky) solution
It is possible to use the Table of Contents to retrieve heading numbers, because the hyperlink element's w:anchor="_Toc108099241" in the last picture has the same value as <w:bookmarkStart w:name="_Toc108099241"../> in the picture of the heading XML.
I can first fetch the "w:name" of the bookmarkStart element from the heading XML and then search for it in the Table of Contents XML.
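
The lookup described above could be sketched roughly as follows. This is only an illustration using System.Xml.Linq on a tiny stand-in for document.xml; the helper name FindTocNumber is hypothetical, and it assumes the heading number is the first w:t run inside the matching TOC hyperlink, which should be checked against the real document.xml.

```csharp
#nullable enable
using System;
using System.Linq;
using System.Xml.Linq;

public static class TocLookup
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: returns the first text run of the TOC hyperlink whose
    // w:anchor matches a "_Toc..." bookmark in the heading paragraph, else null.
    public static string? FindTocNumber(XElement body, XElement heading)
    {
        // Collect the names of all "_Toc..." bookmarks inside the heading paragraph.
        var names = heading.Descendants(W + "bookmarkStart")
            .Select(b => (string?)b.Attribute(W + "name"))
            .Where(n => n != null && n.StartsWith("_Toc", StringComparison.Ordinal))
            .ToHashSet();

        // Find the TOC hyperlink that points at one of those bookmarks.
        XElement? link = body.Descendants(W + "hyperlink")
            .FirstOrDefault(h => names.Contains((string?)h.Attribute(W + "anchor")));

        // Assumption: the number is the first w:t run of the TOC entry.
        return link?.Descendants(W + "t").FirstOrDefault()?.Value;
    }

    public static void Main()
    {
        // Minimal stand-in for document.xml: one TOC entry and one heading.
        XElement body = XElement.Parse(
@"<w:body xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>
  <w:p>
    <w:hyperlink w:anchor='_Toc108099241'>
      <w:r><w:t>1.1.1.1</w:t></w:r>
      <w:r><w:tab/></w:r>
      <w:r><w:t>Small header</w:t></w:r>
    </w:hyperlink>
  </w:p>
  <w:p>
    <w:pPr><w:pStyle w:val='Heading4'/></w:pPr>
    <w:bookmarkStart w:id='0' w:name='_Toc108099241'/>
    <w:r><w:t>Small header</w:t></w:r>
    <w:bookmarkEnd w:id='0'/>
  </w:p>
</w:body>");

        // The heading is the last paragraph of the sample body.
        XElement heading = body.Elements(W + "p").Last();
        Console.WriteLine(FindTocNumber(body, heading)); // prints 1.1.1.1
    }
}
```

This only works when the document contains an up-to-date Table of Contents, which is part of why the approach feels fragile.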
However, this seems like a hacky solution, so I am wondering whether there is a better way to solve this problem.
(I have attached the XML for the example Word document.)
[218578-document.xml][4]

C#
Office Open Specifications

Accepted answer
  1. Mike Bowen 1,516 Reputation points Microsoft Employee
    2022-07-15T21:44:44.823+00:00

    Hi @Oscar Vagle,

    The style IDs such as "Heading1", "Heading2", etc. are set by Word, so as long as you aren't doing anything that could change them, you can use them to determine the heading level. This won't tell you what the heading number should be (e.g. sections 1.1.1.1 and 2.1.1.1 would both be <h4>). You could keep track of the level in the heading tree and add your own numbering if you want. The code below reads a document, creates the HTML for the headings, and turns everything else into a paragraph. If you want to handle lists and other styles, you could extend it to cover those cases as well. Please let me know if this answers your question.

       using System;
       using System.IO;
       using System.Linq;
       using System.Text.Encodings.Web;
       using DocumentFormat.OpenXml.Packaging;
       using DocumentFormat.OpenXml.Wordprocessing;
       using Microsoft.AspNetCore.Html;

       // Open read-only (second argument false): the document is only inspected.
       using WordprocessingDocument doc = WordprocessingDocument.Open(filePath, false);

       HtmlContentBuilder htmlContentBuilder = new HtmlContentBuilder();

       // Visit the top-level paragraphs of the document body.
       var paragraphs = doc.MainDocumentPart?.Document?.Body?.Elements<Paragraph>()
                        ?? Enumerable.Empty<Paragraph>();

       foreach (Paragraph p in paragraphs)
       {
           // The paragraph's style id decides the tag; anything else becomes <p>.
           string? style = p.ParagraphProperties?.ParagraphStyleId?.Val?.Value;

           string tag = style switch
           {
               "Heading1" => "h1",
               "Heading2" => "h2",
               "Heading3" => "h3",
               "Heading4" => "h4",
               "Heading5" => "h5",
               "Heading6" => "h6",
               _ => "p",
           };

           // AppendHtml writes the markup as-is; Append HTML-encodes the paragraph text.
           htmlContentBuilder.AppendHtml($"<{tag}>");
           htmlContentBuilder.Append(p.InnerText);
           htmlContentBuilder.AppendHtmlLine($"</{tag}>");
       }

       using (StringWriter writer = new StringWriter())
       {
           htmlContentBuilder.WriteTo(writer, HtmlEncoder.Default);

           // The html string for the document
           string htmlString = writer.ToString();
           Console.WriteLine(htmlString);
       }
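
    The idea of keeping track of the level and adding your own numbering could be sketched like this. It is a minimal illustration, not what Word itself does: the names HeadingNumbering, LevelFromStyleId and Number are hypothetical, the style IDs are assumed to follow the "HeadingN" pattern, and the document's numbering definitions (start values, restarts, number formats) are ignored.

    ```csharp
    #nullable enable
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class HeadingNumbering
    {
        // Hypothetical helper: "Heading4" -> 4; returns 0 for any other style id.
        public static int LevelFromStyleId(string? styleId)
        {
            const string prefix = "Heading";
            if (styleId != null
                && styleId.StartsWith(prefix, StringComparison.Ordinal)
                && int.TryParse(styleId.Substring(prefix.Length), out int level)
                && level >= 1 && level <= 9)
            {
                return level;
            }
            return 0;
        }

        // Turns heading levels (in document order) into dotted numbers by keeping
        // one counter per level. A skipped level shows up as 0 in the result.
        public static List<string> Number(IEnumerable<int> levels)
        {
            var counters = new int[9];
            var result = new List<string>();

            foreach (int level in levels)
            {
                if (level < 1 || level > counters.Length)
                    throw new ArgumentOutOfRangeException(nameof(levels));

                counters[level - 1]++;                       // advance this level
                for (int i = level; i < counters.Length; i++)
                    counters[i] = 0;                         // reset deeper levels

                result.Add(string.Join(".", counters.Take(level)));
            }
            return result;
        }

        public static void Main()
        {
            var styles = new[] { "Heading1", "Heading2", "Heading3", "Heading4",
                                 "Normal", "Heading1", "Heading2", "Heading3", "Heading4" };

            // Non-heading paragraphs (level 0) are dropped before numbering.
            var levels = styles.Select(LevelFromStyleId).Where(l => l > 0);

            // prints 1, 1.1, 1.1.1, 1.1.1.1, 2, 2.1, 2.1.1, 2.1.1.1
            Console.WriteLine(string.Join(", ", Number(levels)));
        }
    }
    ```

    In the conversion loop above, the same counters could be updated per paragraph and the resulting number prepended to the heading text.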
    

    Mike Bowen
    Microsoft Open Specifications Support
