TN002: Persistent Object Data Format

This note describes the MFC routines that support persistent C++ objects and the format of the object data when it is stored in a file. This only applies to classes with the DECLARE_SERIAL and IMPLEMENT_SERIAL macros.

The Problem

The MFC implementation for persistent data relies on a compact binary format for saving the data for many objects in a single contiguous part of a file. This binary format provides the structure for how the data is stored, but it is the object's Serialize member function that provides the actual data saved by the object.

The MFC solves the structuring problem by using the class CArchive. A CArchive object provides a context for persistence that lasts from the time the archive is created until the CArchive::Close member function is called, either explicitly by the programmer or implicitly by the destructor when the scope containing the CArchive is exited.

This note describes the implementation of the CArchive members ReadObject and WriteObject. ReadObject and WriteObject are not called directly but instead are used by class-specific type-safe insertion and extraction operators generated automatically by the DECLARE_SERIAL and IMPLEMENT_SERIAL macros.

class CMyObject : public CObject
{
    DECLARE_SERIAL(CMyObject)
};

IMPLEMENT_SERIAL(CMyObject, CObject, 1)

// example usage (ar is a CArchive& supplied by the caller)
CMyObject* pObj;
ar << pObj;        // calls ar.WriteObject(pObj)
ar >> pObj;        // calls ar.ReadObject(RUNTIME_CLASS(CMyObject))

This note describes code located in the MFC source file ARCOBJ.CPP. The main CArchive implementation can be found in ARCCORE.CPP.

Saving Objects to the Store (CArchive::WriteObject)

The member function CArchive::WriteObject writes header data used to reconstruct the object. This data consists of two parts: the type of the object and the state of the object. This member function is also responsible for maintaining the identity of the object being written out, so that only a single copy is saved, regardless of the number of pointers to that object (including circular pointers).

Saving (inserting) and restoring (extracting) objects relies on several “manifest constants.” These are values that are stored in binary and provide important information to the archive (note the "w" prefix indicates 16-bit quantities):

Tag            Description
wNullTag       Used for NULL object pointers (0).
wNewClassTag   Indicates class description that follows is new to this archive context (-1).
wOldClassTag   Indicates class of the object being read has been seen in this context (0x8000).

When storing objects, the archive maintains a CMapPtrToPtr (the m_pStoreMap), which maps each stored object to a 32-bit persistent identifier (PID). A PID is assigned to every unique object and every unique class name that is saved in the context of the archive. These PIDs are handed out sequentially starting at 1. It is important to note that these PIDs have no significance outside the scope of the archive and, in particular, are not to be confused with record numbers or other identity items.

Starting with MFC version 4.0 the CArchive class has been extended to support very large archives. In previous versions, a PID was a 16-bit quantity, limiting the archive to 0x7FFE (32766) objects. PIDs are now 32-bit, but they are written out as 16-bit unless they are larger than 0x7FFE. Large PIDs are written as 0x7FFF followed by the 32-bit PID. This technique maintains file backward compatibility.
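
The size rule above can be illustrated with a small standalone sketch. It is not the MFC source: the names WritePid, ReadPid, kMaxSmallPid and kBigPidMarker are hypothetical, the low-byte-first order is an assumption, and the sketch ignores the class-tag bit described earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names for the values quoted in the text.
constexpr std::uint32_t kMaxSmallPid = 0x7FFE;   // largest PID stored as 16 bits
constexpr std::uint16_t kBigPidMarker = 0x7FFF;  // announces a 32-bit PID

using Bytes = std::vector<std::uint8_t>;

// Append a 16-bit value, low byte first (the byte order is an assumption).
void PutWord(Bytes& out, std::uint16_t w) {
    out.push_back(static_cast<std::uint8_t>(w & 0xFF));
    out.push_back(static_cast<std::uint8_t>(w >> 8));
}

// Read a 16-bit value and advance the read position.
std::uint16_t GetWord(const Bytes& in, std::size_t& pos) {
    assert(pos + 2 <= in.size());  // caller must not read past the end
    std::uint16_t lo = in[pos++];
    std::uint16_t hi = in[pos++];
    return static_cast<std::uint16_t>(lo | (hi << 8));
}

// Small PIDs take one word; large PIDs take the marker plus two words.
void WritePid(Bytes& out, std::uint32_t pid) {
    if (pid <= kMaxSmallPid) {
        PutWord(out, static_cast<std::uint16_t>(pid));
    } else {
        PutWord(out, kBigPidMarker);
        PutWord(out, static_cast<std::uint16_t>(pid & 0xFFFF));
        PutWord(out, static_cast<std::uint16_t>(pid >> 16));
    }
}

// Inverse of WritePid.
std::uint32_t ReadPid(const Bytes& in, std::size_t& pos) {
    std::uint16_t w = GetWord(in, pos);
    if (w != kBigPidMarker) {
        return w;
    }
    std::uint32_t lo = GetWord(in, pos);
    std::uint32_t hi = GetWord(in, pos);
    return lo | (hi << 16);
}
```

In this sketch a PID of 0x7FFE occupies two bytes and a PID of 0x7FFF occupies six, so a reader that predates the 32-bit extension sees no difference in any archive that contains only small PIDs.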

When a request is made to save an object to an archive (usually through the global insertion operator), a check is made for a NULL CObject pointer; if the pointer is NULL, the wNullTag is inserted into the archive stream.

If we have a real object pointer that is capable of being serialized (the class is a DECLARE_SERIAL class), we then check the m_pStoreMap to see if the object has been saved already. If it has, we insert the 32-bit PID associated with that object.

If the object has not been saved before, there are two possibilities we must take into account: either both the object and the exact type (that is, class) of the object are new to this archive context, or the object is of an exact type already seen. To determine if the type has been seen, we query the m_pStoreMap for a CRuntimeClass object that matches the CRuntimeClass object associated with the object we are saving. If we have seen this class before, WriteObject inserts a tag that is the bitwise OR of wOldClassTag and the PID (index) previously assigned to that class. If the CRuntimeClass is new to this archive context, then WriteObject assigns a new PID to that class and inserts it into the archive, preceded by the wNewClassTag value.

The descriptor for this class is then inserted into the archive using the CRuntimeClass member function Store. CRuntimeClass::Store inserts the schema number of the class (see below) and the ASCII text name of the class. Note that the use of the ASCII text name does not guarantee uniqueness of the archive across applications, thus it is advisable to tag your data files to prevent corruption. Following the insertion of the class information, the archive places the object into the m_pStoreMap and then calls the Serialize member function to insert class-specific data into the archive. Placing the object into the m_pStoreMap before calling Serialize prevents multiple copies of the object from being saved to the store.
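
The decision sequence described above can be modeled without MFC. The following is a minimal sketch under stated assumptions: SketchArchive, ClassInfo and Node are hypothetical types, the output is a list of readable text tokens rather than the binary format, and a single map stands in for m_pStoreMap.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for CRuntimeClass.
struct ClassInfo {
    std::string name;
    unsigned schema;
};

// Hypothetical serializable object that points at another object.
struct Node {
    const ClassInfo* cls;
    int value;
    Node* next;
};

struct SketchArchive {
    // Maps both objects and class descriptors to PIDs.
    std::unordered_map<const void*, std::uint32_t> storeMap;
    std::uint32_t nextPid = 1;     // PID 0 is reserved for NULL
    std::vector<std::string> out;  // text tokens instead of binary output

    void WriteObject(const Node* obj) {
        if (obj == nullptr) {
            out.push_back("NULL");
            return;
        }
        auto seen = storeMap.find(obj);
        if (seen != storeMap.end()) {
            // Already saved: emit only its PID.
            out.push_back("REF " + std::to_string(seen->second));
            return;
        }
        assert(obj->cls != nullptr);
        auto cls = storeMap.find(obj->cls);
        if (cls != storeMap.end()) {
            out.push_back("OLD_CLASS|" + std::to_string(cls->second));
        } else {
            out.push_back("NEW_CLASS");
            out.push_back(obj->cls->name + "/" + std::to_string(obj->cls->schema));
            storeMap[obj->cls] = nextPid++;
        }
        // Register the object before its data is written so that a
        // circular pointer back to it becomes a reference, not a copy.
        storeMap[obj] = nextPid++;
        Serialize(obj);
    }

    // Plays the role of the object's Serialize member function.
    void Serialize(const Node* obj) {
        out.push_back("value " + std::to_string(obj->value));
        WriteObject(obj->next);
    }
};
```

Writing two nodes that point at each other produces, in this sketch, one class descriptor, two object bodies, and a final reference token in place of a second copy of the first node.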

When returning to the initial caller (usually the root of the network of objects), it is important to Close the archive. If other CFile operations are going to be done, the CArchive member function Flush MUST be called. Failure to do so will result in a corrupt archive.

Note   This implementation imposes a hard limit of 0x3FFFFFFE indices per archive context. This number represents the maximum number of unique objects and classes that can be saved in a single archive, but note that a single disk file can have an unlimited number of archive contexts.

Loading Objects from the Store (CArchive::ReadObject)

Loading (extracting) objects uses the CArchive::ReadObject member function and is the converse of WriteObject. As with WriteObject, ReadObject is not called directly by user code; user code should call the type-safe extraction operator, which calls ReadObject with the expected CRuntimeClass. This ensures the type integrity of the extraction.

Since the WriteObject implementation assigned increasing PIDs, starting with 1 (0 is predefined as the NULL object), the ReadObject implementation can use an array to maintain the state of the archive context. When a PID is read from the store, if the PID is greater than the current upper bound of the m_pLoadArray, then ReadObject knows that a new object (or class description) follows.
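
Because PIDs are assigned in the order items are encountered, a growable array indexed by PID is enough to resolve references while loading. The sketch below models that idea only. SketchLoader, Token, Slot and Node are hypothetical, the input is a token list rather than the binary stream, and new classes are flagged by an explicit token kind.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Node {
    std::string className;
    int value;
    Node* next;
};

// Hypothetical token stream standing in for the binary archive.
struct Token {
    enum Kind { Null, NewClass, OldClass, Ref };
    Kind kind;
    std::uint32_t pid;      // class PID (OldClass) or object PID (Ref)
    std::string className;  // class descriptor (NewClass only)
    int value;              // object data that follows a class token
};

// One load-array entry: either a class name or an object pointer.
struct Slot {
    std::string className;
    Node* object;
};

struct SketchLoader {
    std::vector<Token> in;
    std::size_t pos = 0;
    std::vector<Slot> loadArray;               // index == PID
    std::vector<std::unique_ptr<Node>> owned;  // keeps loaded nodes alive

    explicit SketchLoader(std::vector<Token> tokens) : in(std::move(tokens)) {
        loadArray.push_back(Slot{std::string(), nullptr});  // PID 0 is NULL
    }

    Node* ReadObject() {
        const Token& t = in.at(pos++);
        if (t.kind == Token::Null) {
            return nullptr;
        }
        if (t.kind == Token::Ref) {
            assert(t.pid < loadArray.size());  // must refer to a loaded item
            return loadArray[t.pid].object;
        }
        std::string name;
        if (t.kind == Token::NewClass) {
            name = t.className;
            loadArray.push_back(Slot{name, nullptr});  // class takes the next PID
        } else {
            assert(t.pid < loadArray.size());
            name = loadArray[t.pid].className;
        }
        owned.push_back(std::unique_ptr<Node>(new Node{name, t.value, nullptr}));
        Node* obj = owned.back().get();
        // Record the object before reading its contents so that
        // back references to it can be resolved.
        loadArray.push_back(Slot{std::string(), obj});
        obj->next = ReadObject();
        return obj;
    }
};
```

The array index plays the role of the PID here: slot 0 is the NULL object, and every class descriptor or object read afterward takes the next free slot, mirroring the order in which the writer handed out PIDs.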

Schema Numbers

The schema number, which is assigned to the class when the class' IMPLEMENT_SERIAL is encountered, is the "version" of the class implementation. The schema refers to the implementation of the class, not to the number of times a given object has been made persistent (usually referred to as the object version).

If you intend to maintain several different implementations of the same class over time, incrementing the schema as you revise your object's Serialize member function implementation will enable you to write code that can load objects stored using older versions of the implementation. 

The CArchive::ReadObject member function will throw a CArchiveException when it encounters a schema number in the persistent store that differs from the schema number of the class description in memory. It is not easy to recover from this exception.

You can use VERSIONABLE_SCHEMA OR'd with your schema version to keep this exception from being thrown. By using VERSIONABLE_SCHEMA, your code can take the appropriate action in its Serialize function by checking the return value from CArchive::GetObjectSchema.
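
The effect of the flag can be modeled in a few lines. This is a simplified sketch, not the MFC implementation: ResolveSchema, LoadPoint and Point3 are hypothetical, and kVersionableSchema assumes the flag occupies the top bit of a 32-bit schema value.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Assumed flag value: the top bit of the 32-bit schema number.
constexpr std::uint32_t kVersionableSchema = 0x80000000u;

// Returns the schema found in the store if the class may load it.
// Throws when the schema differs and the class is not versionable.
std::uint32_t ResolveSchema(std::uint32_t classSchema, std::uint32_t storedSchema) {
    assert((storedSchema & kVersionableSchema) == 0);  // the flag is never stored
    std::uint32_t current = classSchema & ~kVersionableSchema;
    bool versionable = (classSchema & kVersionableSchema) != 0;
    if (storedSchema != current && !versionable) {
        throw std::runtime_error("bad schema");
    }
    return storedSchema;
}

// Hypothetical class whose schema 1 stored x and y, and schema 2 added z.
struct Point3 {
    int x;
    int y;
    int z;
};

// Plays the role of a Serialize function that branches on the stored schema.
Point3 LoadPoint(const std::vector<int>& fields, std::uint32_t storedSchema) {
    std::uint32_t schema = ResolveSchema(kVersionableSchema | 2, storedSchema);
    Point3 p{fields.at(0), fields.at(1), 0};
    if (schema >= 2) {
        p.z = fields.at(2);  // field introduced in schema 2
    }
    return p;
}
```

In this model a class declared with schema 2 alone rejects data written with schema 1, while the same class declared with the flag set receives the old schema number and can supply a default for the missing field.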

Calling Serialize Directly

There are many cases where the overhead of the general object archive scheme of WriteObject and ReadObject is not necessary or desired. This is the common case of serializing the data into a CDocument. In this case the Serialize member function of the CDocument is called directly, not with the extract or insert operators. The contents of the document may in turn use the more general object archive scheme.

Calling Serialize directly has the following advantages and disadvantages:

  • No extra bytes are added to the archive before or after the object is serialized. This not only makes the saved data smaller, but also allows you to implement Serialize routines that can handle any file format.

  • The MFC is tuned so the WriteObject and ReadObject implementations and related collections will not be linked into your application unless you need the more general object archive scheme for some other purpose.

  • Your code does not need to recover from old schema numbers. This makes your document serialization code responsible for encoding schema numbers, file format version numbers or whatever magic numbers desired at the start of your data files.

  • Any object that is serialized with a direct call to Serialize must not use CArchive::GetObjectSchema or must handle a return value of (UINT)-1 indicating that the version was unknown.

Because Serialize is called directly on your document, it is not usually possible for the sub-objects of the document to archive references to their parent document. These objects must be given a pointer to their container document explicitly, or you must use the CArchive::MapObject function to map the CDocument pointer to a PID before these back pointers are archived.

As noted above, you should encode the version and class information yourself when calling Serialize directly, allowing you to change the format later while still maintaining backward compatibility with older files. The CArchive::SerializeClass function can be called explicitly before directly serializing an object or before calling a base class.
