Public Folder Replication Troubleshooting – Part 2: Troubleshooting the Replication of Existing Data
Original post on US blog: January 20, 2006
This is a second blog post about troubleshooting public folder replication issues. In first post we covered Troubleshooting the Replication of New Changes. This blog post covers Troubleshooting the Replication of Existing Data and the last post in this series will cover Troubleshooting Replica Deletion and Common Problems. To get the full picture, please read all referenced material!
Troubleshooting the Replication of Existing Data
When new changes replicate, but old unchanged stuff doesn't, you have a backfill problem. The most typical situation where hierarchy backfill occurs is when a new public store is created. The most typical content backfill scenario is when a public store has been added to the replica list of a folder.
When you have a backfill problem like this, it may have already occurred to you that there’s an easy workaround – just make a change to all the items. By doing this you circumvent the broken backfill process and replicate everything as new changes. Despite the fact that I wrote both of the tools that are typically used for this (PFDAVAdmin and ModifyItems), it’s usually best to troubleshoot the backfill process and fix the root cause. If you just change everything to make it replicate, you may end up with the same backfill problem in the future when the replicas get out of sync again. That said, let’s move on to discussing backfill. To understand the backfill process, it's first necessary to understand how changes are tracked.
Every folder and message in the store is assigned a Change Number (CN) when it is created and every time it is modified. When replication occurs, the CNs on each object are used to determine whether that object needs to be replicated. A group of CNs is called a CNSet. The CNSet for a particular folder on a particular server is called status information. This status information is included on every replication message. Every message type 0x2 contains the hierarchy status for the sending server. Likewise, every message type 0x4 contains the status information for that particular folder for the sending server. All other replication message types contain status information for their respective folders as well.
When a new public store mounts for the first time, it sends a status request (type 0x20) for the hierarchy to all the existing public stores. Similarly, when a new store is added to the replica list of a folder, that store will send a 0x20 to all other replicas of that folder. Like every replication message, a status request contains a CNSet of all CNs for the folder in question (or the hierarchy) that the originating store has, and asks the other stores to respond if they have CNs that the originator doesn't. Note that prior to Exchange 2003 Sp2, every replica was not asked to respond to the status request, so some stores would ignore the status request even if they had changes that the originating store did not. A 2003 Sp2 server will ask for responses from all replicas, and will respond even when the originating server did not specifically ask it to, as long as it has changes that the originating server does not. This can greatly improve the decisions made during the backfill process. The unique thing about a status request is that it doesn't contain any data to replicate - it just has a list of change numbers. The other stores respond with status messages (0x10), which list their own CNSets for that same folder (or the hierarchy). When the originating server receives the 0x10 messages, it compares the CNSet contained within to its own CNSet. If the 0x10 contains changes that the store doesn't already have, the backfill process begins.
The first step in the backfill process is to add entries to the backfill array for the folder in question. These entries have a CNSet that describes the missing changes, and a timeout value describing when the store will request the missing data. The backfill timeout will vary depending on the situation. In the case of a new public store being brought online or a new replica of a folder being added, the initial timeout is 15 minutes.
Backfill entries may be added to the backfill array during the course of normal operation as well. Consider a situation where a particular public store has broadcast two changes in two separate 0x2 messages. Let's say the administrator deletes the first 0x2 message out of the queue, but the second one makes it through. When the other servers receive this 0x2, they will find that the CNSet in the status information contains CNs that they never got. As a result, they will create backfill entries for that data. Backfill entries for missing data that was discovered during the normal course of replication will start with a timeout of 6 hours if the data is available in the same Routing Group (RG), or 12 hours if it is only available in a different RG. Each time a backfill request is issued, the next timeout will be 12 and then 24 hours for intra-RG requests, or 24 and 48 hours for inter-RG requests.
Every five minutes the store will check to see if any backfill entries have reached their timeout. If they have, a backfill request (type 0x8) is issued for the missing CNs, and the timeout is set to the next interval. A backfill request is not a broadcast; it is directed at a single server - one of the servers that previously indicated it had the missing CNs in the status information it sent to the requesting server. When that server receives the incoming 0x8, it immediately processes the request and responds with one or more backfill responses (0x80000002 for hierarchy or 0x80000004 for content), which contain the actual data for the requested change numbers. Like backfill requests, backfill responses are not broadcasts - they are sent only to the requesting server.
If the requesting server successfully processes the incoming backfill response, the CNs it contained are cleared from the backfill array on that store. Actually, any incoming message that contains CNs that are outstanding in the backfill array will cause those CNs to be cleared from the array.
Troubleshooting
As you can see, there are a lot more questions to answer when troubleshooting the backfill process.
1. Does the store know it's missing data?
First you should determine if the server even realizes that other stores have changes that it needs to request. Unfortunately, there is no supported tool or utility that will let you view the backfill array directly to see if it has anything in it. However, there are other more indirect ways of figuring this out.
One way is to wait. If the server knows it's missing data, it will be requesting it at least once every 24 or 48 hours. This means you can simply turn up logging and wait to see if a 0x8 message ever goes out. If you never see a 0x8 for the folder in question, but you are seeing 0x8's for other folders, you may have hit the outstanding backfill limit, which we'll discuss shortly.
Another option is to make sure the server receives the latest status information. Remember, the server only sends a status request that one time after you add the new replica. After that, the only status information it receives will be through the normal course of replication. So if its initial attempt to get status was lost because the 0x20 or the 0x10 in response was lost or deleted, it may sit there indefinitely and not even realize that it's missing anything. There are several ways to make sure the server has received status information for a folder.
- Go to a server that has all the data and make a change to the folder by adding, deleting, or modifying a message. In the case of the hierarchy, create, delete, or change the properties of a folder. The resulting 0x4 or 0x2 will contain status information for that folder or the hierarchy, respectively. When the server that's missing the data successfully processes the incoming replication message, you know that it has added any appropriate entries to the backfill array.
- Use the Synchronize Content option in Exchange 2003 ESM. This is a well-hidden but very useful option. To find it, go under the Public Folders tree and go to the folder in question. Highlight the folder in the left-hand pane. In the right-hand pane click the Status tab. Right-click on the server that has all the data and choose Synchronize Content. This does two things - it causes the server to issue a status request 0x20 for the folder, and it causes it to immediately timeout any backfill entries. Notice that I said you should Synchronize Content from the server that already has the data. You may wonder why you would do that, when it's the other server that has the backfill entries that need to be timed out. Remember that at this point we're just trying to ensure that the server missing the data KNOWS it has something to backfill. To that end, we can use Synchronize Content from the server that has the data to send a 0x20 to the server that doesn't. In this case we're not really interested in seeing a status 0x10 response to the 0x20. We just want the store missing content to receive a replication message for the folder from a store with content, so it can add the appropriate entries to the backfill array. The 0x20 from the server with the data serves this purpose. Note that in Exchange 2003 Sp2, Synchronize Content is now available for the hierarchy by right-clicking on the Public Folders node itself.
- Use the Replication Flags registry value (KB813629). If you put this value in place, along with the Enable Replication Messages At Startup value from KB321082, it causes the store to send a status request 0x20 for every folder on startup. Again, you would want to use this on the server that has the content - the point of this step is to get the server that has content to send its status information to the server that's missing content.
- Use 2003 ESM to send a backfill response. In 2003 Sp1, you could use the Send Hierarchy option to send a hierarchy backfill response and the Send Contents option to send a folder content backfill response. In 2003 Sp2, both of these options became Resend Changes. This sends a backfill response for the range of data you specify, but you probably shouldn't specify the whole range of data since that might satisfy all outstanding backfill entries and end up working around the original problem. Instead, specify a range of only a day or two. This causes a 0x80000002 or 0x80000004 to go to the target server, which again serves the purpose of giving it status information for the store that has the data.
Once you've used one of these options to force status information, and you've verified that the store missing the data received the incoming message by watching the application log, then you know it knows it's missing the data.
2. Does the store request the missing data?
After you've made sure the store know it needs to backfill some data, does it ever issue a backfill request? Recall that after it has tried to backfill the data a couple of times, You may have to wait 24 or 48 hours for the next backfill request, since that will be the longest timeout interval for intrasite and intersite backfills, respectively. There is one way to speed this up, and that is to use Synchronize Content again, but this time from the server that's missing the data. This will immediately timeout the backfill entries for that folder. However, you may still find that the store does not issue a backfill request for the folder you're focusing on. If this is the case, watch the app log for the next 24-48 hours. If the store is sending backfill requests for other folders, but not for the folder you're focusing on, it may have hit the outstanding backfill limit.
When you experience a situation where you've added replicas of a lot of folders to a new store, and replication seems fine at first but then grinds to a halt over the next day or two, you have probably hit the outstanding backfill limit. The outstanding backfill limit is a mechanism intended to throttle replication. By default, the store will only allow 50 outstanding backfill requests at a time. Once it has 50 outstanding, it will re-request those 50 over and over until they are satisfied. Once any one outstanding entry has been satisfied, that opens up a slot in the OBL for a new set of data to be requested. This means that if 50 requests are having problems being satisfied for whatever reason, replication can not proceed.
If you are seeing this behavior, you should watch the application log to see what the store is requesting. You'll be seeing periodic 0x8 messages for the current 50 outstanding backfill requests, and you'll find that no backfill response is received, which is why they're still outstanding. At that point you should change your focus to troubleshooting one of the folders the store is currently trying to backfill, since resolving the problem will allow it to move on to other folders.
There is one other option, and that is to increase the Oustanding Backfill Limit (OBL). You can do this by creating a registry value called Replication Oustanding Backfill Limit under the registry key for that store. The maximum value is 5000 decimal. However, once you do this the replication floodgates will open and it will be hard to determine which 50 folders caused it to choke. You'll need to postpone troubleshooting until things settle down again. Typically I recommend leaving the limit at 50 and fixing the problem, instead of working around it by increasing the limit.
If the OBL doesn’t appear to be a problem, and you still aren’t seeing outgoing 0x8 messages for the folder in question, see the “Common Problems” in last post of this series.
3. Does the other store respond to the request?
Once you have a backfill request to focus on, you need to determine if the backfill target ever got the request. Check the application log on that server for the incoming 0x8. You can also search the application log for the message ID mentioned in the outgoing event from the sending side. If you can find no sign of it in the application log, use message tracking to see how far it got. If it received the 0x8, it should respond almost immediately with one or more 0x80000002 or 0x80000004 messages (you will often see many backfill responses to a single backfill request, since the changes are not all sent in a single message). Of course, the time it takes to generate the backfill response messages will vary based on the data in the folder and the replication message size limit. For instance, if you set the maximum replication message size to 1 GB, the responding server could try to pack the entire hierarchy into a single backfill response, which might take an hour or more just to pack up!
4. Does the requesting server get the response?
Now it's time to check the application log on the requesting server to see if it received the backfill response. If not, track the message and see how far it got. If it received the backfill response and logged it in the application log, then that backfill request should have been satisfied and it should be able to move on.
As mentioned earlier, if you find that message tracking shows that one of these messages was delivered to the store, yet the application log does not show the incoming replication message, have a look at “Common Problems” in last post of this series.
In next blog post: Troubleshooting Replica Deletion and Common Problems.