Clarification on picture placement and location in OLE file format

Parth Gupta 180 Reputation points
2024-06-02T12:01:04.7433333+00:00

Hi,

I am parsing a '.doc' file in python (OLE).

I was looking at [MS-CFB] and [MS-DOC], and I am unable to find documentation for the following:

  1. Where (at what offset/sector) can I find the (multiple) images that are in my file?
  2. In the 'WordDocument' stream of my OLE file, I can find some of the text in the document. However, when text is followed by an image, I cannot find any reference to that image in the 'WordDocument' stream.

For example, look at the screenshot of a .doc file opened in a Hex editor. When I open the document in Microsoft Word, there are images right after the 'Ans1', 'Ans2' and 'Ans3' texts. However, by viewing the 'WordDocument' stream, the image reference is not found.

Kindly provide a clarification or documentation on how to locate the images inside an OLE file.

Thanks

User's image

Office Open Specifications

Accepted answer
  1. Mike Bowen 1,516 Reputation points Microsoft Employee
    2024-06-04T23:15:45.5866667+00:00

    Hi @Parth Gupta ,

    Structures used to find all the pictures in a .doc file, regardless of position, are found in both MS-DOC and MS-ODRAW. To find them, use this algorithm:

    Read the FIB from offset zero in the WordDocument Stream.

    1. Find the FibRgFcLcb97.
    2. Find FibRgFcLcb97.fcDggInfo -> the offset in the Table Stream of the MS-DOC 2.9.171 OfficeArtContent
    3. Find FibRgFcLcb97.lcbDggInfo -> the size (in bytes) of the OfficeArtContent
    4. Find the OfficeArtContent.
    5. Find the OfficeArtContent.DrawingGroupData, which is a [MS-ODRAW] section 2.2.12 OfficeArtDggContainer
    6. Find OfficeArtDggContainer.blipStore, which is a MS-ODRAW 2.2.20 OfficeArtBStoreContainer.
    7. Find OfficeArtBStoreContainer.rgfb, which is an array of MS-ODRAW 2.2.22 OfficeArtBStoreContainerFileBlock.
    8. Each OfficeArtBStoreContainerFileBlock is either an OfficeArtFBSE or an OfficeArtBlip, depending on the record type of the contained record.
    9. If it is a MS-ODRAW 2.2.23 OfficeArtBlip, its record type identifies the image format, as listed below:

    Value   Meaning
    0xF01A  OfficeArtBlipEMF, as defined in section 2.2.24.
    0xF01B  OfficeArtBlipWMF, as defined in section 2.2.25.
    0xF01C  OfficeArtBlipPICT, as defined in section 2.2.26.
    0xF01D  OfficeArtBlipJPEG, as defined in section 2.2.27.
    0xF01E  OfficeArtBlipPNG, as defined in section 2.2.28.
    0xF01F  OfficeArtBlipDIB, as defined in section 2.2.29.
    0xF029  OfficeArtBlipTIFF, as defined in section 2.2.30.
    0xF02A  OfficeArtBlipJPEG, as defined in section 2.2.27.
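The steps above can be sketched in Python roughly as follows. This is a minimal illustration rather than a complete parser: it works on stream contents that have already been read into `bytes` (for example with an OLE library of your choice), and the function names `locate_office_art_content` and `iter_blips` are hypothetical. The position of fcDggInfo as the 51st fc/lcb pair of FibRgFcLcb97, the fWhichTblStm bit used to choose between "0Table" and "1Table", and the 36-byte fixed part of OfficeArtFBSE are my reading of MS-DOC and MS-ODRAW and should be verified against the specifications.

```python
import struct

FIB_BASE_SIZE = 32          # FibBase is a fixed 32-byte structure
DGG_INFO_PAIR_INDEX = 50    # assumption: fcDggInfo/lcbDggInfo is the 51st fc/lcb pair

# Record types from the table above
BLIP_TYPES = {
    0xF01A: "EMF", 0xF01B: "WMF", 0xF01C: "PICT", 0xF01D: "JPEG",
    0xF01E: "PNG", 0xF01F: "DIB", 0xF029: "TIFF", 0xF02A: "JPEG",
}
REC_FBSE = 0xF007           # OfficeArtFBSE record type
FBSE_FIXED_SIZE = 36        # assumption: fixed fields before nameData


def locate_office_art_content(word_document):
    """Return (table stream name, fcDggInfo, lcbDggInfo) read from the FIB."""
    (flags,) = struct.unpack_from("<H", word_document, 0x0A)
    table_stream = "1Table" if flags & 0x0200 else "0Table"  # fWhichTblStm
    pos = FIB_BASE_SIZE
    (csw,) = struct.unpack_from("<H", word_document, pos)
    pos += 2 + 2 * csw                      # skip csw and fibRgW
    (cslw,) = struct.unpack_from("<H", word_document, pos)
    pos += 2 + 4 * cslw                     # skip cslw and fibRgLw
    (cb_rg_fc_lcb,) = struct.unpack_from("<H", word_document, pos)
    pos += 2                                # pos is now the start of FibRgFcLcb97
    if cb_rg_fc_lcb <= DGG_INFO_PAIR_INDEX:
        raise ValueError("FIB too short to contain fcDggInfo")
    fc, lcb = struct.unpack_from("<II", word_document,
                                 pos + 8 * DGG_INFO_PAIR_INDEX)
    return table_stream, fc, lcb


def iter_blips(data, start=0, end=None):
    """Yield (kind, payload_offset, payload_length) for each blip record.

    `data` should begin at the OfficeArtDggContainer. The payload is the
    record body (UID and metadata followed by image data), not a bare
    image file.
    """
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        ver_inst, rec_type, rec_len = struct.unpack_from("<HHI", data, pos)
        body = pos + 8
        body_end = min(body + rec_len, end)
        if rec_type in BLIP_TYPES:
            yield BLIP_TYPES[rec_type], body, rec_len
        elif ver_inst & 0x000F == 0x000F:
            # recVer 0xF marks a container: descend into its children
            yield from iter_blips(data, body, body_end)
        elif rec_type == REC_FBSE and body + FBSE_FIXED_SIZE <= body_end:
            cb_name = data[body + 33]       # cbName field of the FBSE
            yield from iter_blips(data, body + FBSE_FIXED_SIZE + cb_name,
                                  body_end)
        pos = body + rec_len
```

A typical use would be to call `locate_office_art_content` on the WordDocument stream, slice `lcb` bytes at `fc` from the named Table Stream, and pass that slice to `iter_blips`. An OfficeArtFBSE that carries no embedded blip yields nothing here; locating its image data through its other fields is not covered by this sketch.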

    If you need to find the position of pictures, remember that the fundamental unit of a Word binary file is a character. This includes visual characters such as letters, numbers, and punctuation. It also includes formatting characters such as paragraph marks, end of cell marks, line breaks, or section breaks. Finally, it includes anchor characters such as footnote reference characters, picture anchors, and comment anchors. See MS-DOC 1.3.1 Characters.

    A picture anchor is a character that specifies the location of a picture within a document. To find where to place pictures, examine two properties of each character: sprmCFSpec, which specifies whether the current text has a meaning that differs from, or displays differently than, the underlying character to which it is applied; and sprmCPicLocation, which specifies the position of the picture in the Data Stream. See MS-DOC 2.6.1 Character Properties.
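As an illustration of reading these two properties, here is a minimal Python sketch that scans a grpprl (the byte array of property modifiers attached to a run of characters). The function name `picture_info` is hypothetical. The opcode values 0x0855 for sprmCFSpec and 0x6A03 for sprmCPicLocation, and the operand size implied by the top three bits (spra) of each opcode, are my reading of MS-DOC and should be checked against the specification. Finding the grpprl that applies to a particular character is not shown.

```python
import struct

SPRM_C_F_SPEC = 0x0855        # assumption: opcode of sprmCFSpec
SPRM_C_PIC_LOCATION = 0x6A03  # assumption: opcode of sprmCPicLocation

# Operand size in bytes for each spra value; spra 6 is variable length
OPERAND_SIZE = {0: 1, 1: 1, 2: 2, 3: 4, 4: 2, 5: 2, 7: 3}


def picture_info(grpprl):
    """Return (is_special, data_stream_offset) found in a character grpprl.

    data_stream_offset is None when no sprmCPicLocation is present.
    """
    is_special = False
    pic_offset = None
    pos = 0
    while pos + 2 <= len(grpprl):
        (sprm,) = struct.unpack_from("<H", grpprl, pos)
        pos += 2
        spra = sprm >> 13
        if spra == 6:
            if pos >= len(grpprl):
                break
            # First operand byte holds the length of the rest. A few table
            # and paragraph sprms deviate from this rule; not handled here.
            size = 1 + grpprl[pos]
        else:
            size = OPERAND_SIZE[spra]
        operand = grpprl[pos:pos + size]
        if len(operand) < size:
            break                           # truncated input
        pos += size
        if sprm == SPRM_C_F_SPEC:
            # Simplification: only the plain "on" value (1) is treated as set
            is_special = operand[0] == 1
        elif sprm == SPRM_C_PIC_LOCATION:
            pic_offset = int.from_bytes(operand, "little")
    return is_special, pic_offset
```

When both values are present for a character, the returned offset would be the place to start reading the picture structure in the Data stream.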

    The location and size of each character in the file can be computed using the algorithm in MS-DOC 2.4.1 (Retrieving Text).
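The core of that algorithm is the piece table, which maps a character position (CP) to a byte offset in the WordDocument stream. Below is a minimal Python sketch, assuming the Clx has already been read from the Table Stream (at FibRgFcLcb97.fcClx, with length lcbClx, as I understand the spec). The function names `parse_piece_table` and `cp_to_offset` are hypothetical, and the layout assumed for Prc, Pcdt, Pcd, and FcCompressed is my reading of MS-DOC.

```python
import struct


def parse_piece_table(clx):
    """Return a list of (cp_start, cp_end, fc, compressed) from a Clx."""
    pos = 0
    # Skip any Prc entries (clxt == 0x01) that precede the Pcdt
    while pos < len(clx) and clx[pos] == 0x01:
        (cb_grpprl,) = struct.unpack_from("<h", clx, pos + 1)
        pos += 3 + cb_grpprl
    if pos >= len(clx) or clx[pos] != 0x02:
        raise ValueError("Pcdt not found in Clx")
    (lcb,) = struct.unpack_from("<I", clx, pos + 1)
    plc = pos + 5                      # start of the PlcPcd
    n = (lcb - 4) // 12                # n+1 CPs of 4 bytes, n Pcds of 8 bytes
    cps = struct.unpack_from("<%dI" % (n + 1), clx, plc)
    pieces = []
    for i in range(n):
        pcd = plc + 4 * (n + 1) + 8 * i
        (fc_raw,) = struct.unpack_from("<I", clx, pcd + 2)
        fc = fc_raw & 0x3FFFFFFF                 # low 30 bits
        compressed = bool(fc_raw & 0x40000000)   # fCompressed bit
        pieces.append((cps[i], cps[i + 1], fc, compressed))
    return pieces


def cp_to_offset(pieces, cp):
    """Return (WordDocument stream offset, bytes per character) for a CP."""
    for cp_start, cp_end, fc, compressed in pieces:
        if cp_start <= cp < cp_end:
            if compressed:
                # 8-bit characters stored at fc / 2
                return fc // 2 + (cp - cp_start), 1
            # 16-bit Unicode characters stored at fc
            return fc + 2 * (cp - cp_start), 2
    raise ValueError("CP %d is not covered by the piece table" % cp)
```

Reading the character at the returned offset and checking whether it is a picture anchor is then a matter of combining this lookup with the character properties for the same CP.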

    I hope this answers your question.

    Best regards,
    Michael Bowen
    Microsoft Office Open Specifications
