Retrieving Word Content Based on Styles

In previous posts, like Importing a Table from Word to Excel, I showed you how to retrieve content within specific content controls. In these posts, content controls were used to add semantic structure to a document, where this structure aided in retrieving and inserting content. What about other types of content? Well, one common request is being able to retrieve content based on styles, where content can be paragraphs, runs, or even tables. In other words, styles, too, can be used to add semantic structure and meaning to a document. In today's post, I am going to show you two things:

  1. How to retrieve content based on styles
  2. How to extend the Open XML SDK with Extension Methods

Something new I am going to try is to also create a video for my blog posts. Let me know if these videos are helpful to you as well.

Solution

To find Word content based on styles we need to take the following actions:

  1. Open up a Word document via the Open XML SDK
  2. Get access to the main document part
  3. Find the style id that corresponds to the style name. Paragraphs, runs, and tables reference a style by its id, not by its name
  4. Look for all paragraphs, runs, or tables within the main document part
  5. Filter down the list of paragraphs, runs, or tables based on whether those objects/elements reference the style id found in step 3
  6. Return the final list of paragraphs, runs, or tables

For the sake of this post, let's say I am starting with the following Word document, which contains Paragraph, Run and Table styles:

In this document, I am using the following styles:

  • Paragraph Style – Heading 1
  • Run Style – Intense Emphasis
  • Table Style – Light List Accent 1

If you want to jump straight into the code, feel free to download this solution here.

The Code

For this solution I thought it would be really cool to take advantage of Extension Methods for C#. Extension methods allow me to "add" methods to existing types without creating a new derived type, recompiling, or otherwise modifying the original type. In my case, I am going to add three extension methods to the MainDocumentPart class (remember, this class represents the main document.xml part within my Word document):

  1. ParagraphsByStyleName – This method will retrieve a list of paragraphs contained within the main document part that have a specific style name
  2. RunsByStyleName – This method will retrieve a list of runs contained within the main document part that have a specific style name
  3. TablesByStyleName – This method will retrieve a list of tables contained within the main document part that have a specific style name

These three methods are very similar, with one important difference: each uses a different strongly typed class to query for information. These extension methods will live within a class I called WordStyleExtensions. Feel free to reuse or even extend this class for your own purposes.

Since styles are referenced via ids on paragraphs, runs, and tables, we need a way to look up the style id from a style name. The following code accomplishes this task for any style:

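    // Needs: System.Linq, DocumentFormat.OpenXml.Packaging,
    //        DocumentFormat.OpenXml.Wordprocessing
    private static string GetStyleIdFromStyleName(MainDocumentPart mainPart, string styleName)
    {
        StyleDefinitionsPart stylePart = mainPart.StyleDefinitionsPart;

        // No styles part means there is nothing to look up.
        if (stylePart == null || stylePart.Styles == null)
        {
            return styleName;
        }

        // The parent of each style name element is the style that carries the id.
        string styleId = stylePart.Styles.Descendants<StyleName>()
            .Where(s => s.Val.Value.Equals(styleName))
            .Select(n => ((Style)n.Parent).StyleId)
            .FirstOrDefault();

        return styleId ?? styleName;
    }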

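This code looks up the style id for a given style name. If no style with that name is found, the name itself is returned, so callers can also pass a style id directly (the Main method later in this post does exactly that). Note that the comparison is case-sensitive and uses the name as stored in the styles part, which can differ from what Word shows in its UI; the built-in Heading 1 style, for example, is stored as "heading 1".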
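To make the name-versus-id distinction concrete, here is a minimal sketch that applies the same lookup rule to a small, hand-written styles fragment. It uses LINQ to XML rather than the Open XML SDK so that it runs without a document on disk. The class name StyleLookupSketch, the GetStyleId helper, and the sample XML are all assumptions made for illustration and are not part of the downloadable solution.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class StyleLookupSketch
{
    // WordprocessingML main namespace.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written, heavily trimmed stand-in for a styles part (assumption).
    public const string SampleStylesXml = @"
<w:styles xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
  <w:style w:type=""paragraph"" w:styleId=""Heading1"">
    <w:name w:val=""heading 1""/>
  </w:style>
  <w:style w:type=""character"" w:styleId=""IntenseEmphasis"">
    <w:name w:val=""Intense Emphasis""/>
  </w:style>
</w:styles>";

    // Same rule as GetStyleIdFromStyleName: exact, case-sensitive name match;
    // if nothing matches, the input is returned unchanged.
    public static string GetStyleId(XElement styles, string styleName)
    {
        string styleId = styles.Elements(W + "style")
            .Where(s => (string)s.Element(W + "name")?.Attribute(W + "val") == styleName)
            .Select(s => (string)s.Attribute(W + "styleId"))
            .FirstOrDefault();

        return styleId ?? styleName;
    }

    public static void Main()
    {
        XElement styles = XElement.Parse(SampleStylesXml);

        Console.WriteLine(GetStyleId(styles, "Intense Emphasis")); // IntenseEmphasis
        Console.WriteLine(GetStyleId(styles, "heading 1"));        // Heading1
        Console.WriteLine(GetStyleId(styles, "Heading 1"));        // Heading 1 (no match, case differs)
        Console.WriteLine(GetStyleId(styles, "Heading1"));         // Heading1 (fallback, input returned)
    }
}
```

In this sketch, passing the UI-style name "Heading 1" does not match the stored name "heading 1", so the input comes back unchanged and would then match no paragraph in a document. Passing the id "Heading1" works only because of the fallback.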

Let's dive into the code for retrieving paragraphs based on a style name. As described in the solution section above, this task is broken down into two steps. The first step is to retrieve all paragraphs in the main document and hand each one to a filter, which can be accomplished with the following code:

    public static IEnumerable<Paragraph> ParagraphsByStyleName(this MainDocumentPart mainPart, string styleName)
    {
        string styleId = GetStyleIdFromStyleName(mainPart, styleName);

        IEnumerable<Paragraph> paraList = mainPart.Document.Descendants<Paragraph>()
            .Where(p => IsParagraphInStyle(p, styleId));

        return paraList;
    }

The next step is the filter itself, which checks whether a paragraph references the given style id. This task can be accomplished with the following code:

    private static bool IsParagraphInStyle(Paragraph p, string styleId)
    {
        ParagraphProperties pPr = p.GetFirstChild<ParagraphProperties>();

        if (pPr != null)
        {
            ParagraphStyleId paraStyle = pPr.ParagraphStyleId;

            if (paraStyle != null)
            {
                return paraStyle.Val.Value.Equals(styleId);
            }
        }

        return false;
    }

Pretty simple! The cool thing is that these methods can be easily modified to work with runs and tables. Here are the methods to retrieve content based on run and table styles:

    public static IEnumerable<Run> RunsByStyleName(this MainDocumentPart mainPart, string styleName)
    {
        string styleId = GetStyleIdFromStyleName(mainPart, styleName);

        IEnumerable<Run> runList = mainPart.Document.Descendants<Run>()
            .Where(r => IsRunInStyle(r, styleId));

        return runList;
    }

    private static bool IsRunInStyle(Run r, string styleId)
    {
        RunProperties rPr = r.GetFirstChild<RunProperties>();

        if (rPr != null)
        {
            RunStyle runStyle = rPr.RunStyle;

            if (runStyle != null)
            {
                return runStyle.Val.Value.Equals(styleId);
            }
        }

        return false;
    }

    public static IEnumerable<Table> TablesByStyleName(this MainDocumentPart mainPart, string styleName)
    {
        string styleId = GetStyleIdFromStyleName(mainPart, styleName);

        IEnumerable<Table> tableList = mainPart.Document.Descendants<Table>()
            .Where(t => IsTableInStyle(t, styleId));

        return tableList;
    }

    private static bool IsTableInStyle(Table tbl, string styleId)
    {
        TableProperties tblPr = tbl.GetFirstChild<TableProperties>();

        if (tblPr != null)
        {
            TableStyle tblStyle = tblPr.TableStyle;

            if (tblStyle != null)
            {
                return tblStyle.Val.Value.Equals(styleId);
            }
        }

        return false;
    }

Now that our extension methods have been created, all we have left to do is call them:

    static void Main(string[] args)
    {
        // These values are style ids rather than display names. The lookup
        // returns the supplied value when no style has a matching name.
        string paraStyle = "Heading1";
        string runStyle = "IntenseEmphasis";
        string tableStyle = "LightList-Accent1";

        // Open read-only (false), since the document is only queried.
        using (WordprocessingDocument myDoc = WordprocessingDocument.Open("input.docx", false))
        {
            MainDocumentPart mainPart = myDoc.MainDocumentPart;

            Console.WriteLine("Number of paragraphs with " + paraStyle + " styles: "
                + mainPart.ParagraphsByStyleName(paraStyle).Count());
            Console.WriteLine("Number of runs with " + runStyle + " styles: "
                + mainPart.RunsByStyleName(runStyle).Count());
            Console.WriteLine("Number of tables with " + tableStyle + " styles: "
                + mainPart.TablesByStyleName(tableStyle).Count());
        }

        Console.ReadKey();
    }
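The Main method above only counts matches. As a rough sketch of going one step further, the following prints the text of every paragraph that uses the Heading1 style. It assumes that the WordStyleExtensions class from this post is in the same namespace, that the project references the Open XML SDK, and that input.docx sits next to the executable. The class name PrintHeadingText is hypothetical.

```csharp
using System;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

class PrintHeadingText
{
    static void Main()
    {
        // Read-only open: nothing in the document is modified here.
        using (WordprocessingDocument doc = WordprocessingDocument.Open("input.docx", false))
        {
            MainDocumentPart mainPart = doc.MainDocumentPart;

            // ParagraphsByStyleName is the extension method defined earlier in
            // this post (assumed to be in scope via WordStyleExtensions).
            foreach (Paragraph p in mainPart.ParagraphsByStyleName("Heading1"))
            {
                // InnerText concatenates the text of the paragraph's descendants.
                Console.WriteLine(p.InnerText);
            }
        }
    }
}
```

Because the extension methods return IEnumerable sequences, the same loop shape should work for RunsByStyleName and TablesByStyleName by swapping the element type.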

End Result

Putting everything together and running this code, we end up with an easy way to retrieve content based on styles. For simplicity's sake, I decided to just show the number of paragraphs, runs, and tables with a specific style.

Here is the output:

Zeyad Rajabi

Comments

  • Anonymous
    May 06, 2009
    Do you envisage (highly unlikely) that MS will ship something that will convert Open XML documents to PDF, ODF given the support to save documents with SP2 in that format?

  • Anonymous
    May 08, 2009
    Hi Roberto, We've heard this request from other customers as well. While it's still too early to discuss O14 plans, there are several 3rd party tools that can accomplish this scenario.