Search Index fails for all users in a particular database in Exchange 2007

In Exchange 2007 there are lot of improvements in the way search works. Search feature helps end users to search in the mailbox quickly and efficiently. To know more about Exchange Search, read the following article.

https://technet.microsoft.com/en-us/library/bb232132.aspx

https://msexchangeteam.com/archive/2006/10/11/429163.aspx

In this blog I would like to discuss Search failure for all the users in a particular database in Exchange 2007. There are multiple reasons for failure, but I would concentrate on the failure due to stuck batch.

 

Symptoms:

You notice that all users in a particular database are not able to do search from outlook in online mode properly. Users get partial result or no result while searching. You may also observe that search result does not return items newer than a particular date. You might be able to search older items.

Any database on this server can go into this failure state at any time.

Test-ExchangeSearch “user” command against any user in that database will show the following result.

[PS] >test-exchangesearch user |fl
ResultFound : False
SearchTime : -1

You might notice this issue after a power outage or unexpected reboot of the server or some unknown cause.

 

Solution:

This is fixed in Exchange 2010.

On Exchange 2007, follow one of the methods below:

1) Disable the Catalog and rebuild the full text index Catalog.

Open Exchange Management Shell

Move to the following directory where the ResetSearchindex.ps1 script is located.

<drive> :\Program Files\Microsoft\Exchange Server\Scripts

Run the following command.

ResetSearchIndex.ps1 -force database name

This command will remove the index directories that are associated with the specified mailbox database.

 

2) Restart the Search Service or reboot the server.

How can I confirm that my issue is due to a stuck batch?

There are couple of methods to prove this:

1) Run Perfmon with the following counters for 10 to 15 min. Add all instances for all the counters mentioned below. Instances represent each database. You could compare the counters between the problem database and good database.

MSExchange Search Indices

                Age of Last Notification Indexed

                Age of Last Notification Processed

Average Latency of RPCs during Crawling

                Average Latency of RPCs to get Notifications

                Average Latency of RPCs used to Obtain Content

                Number of Create Notifications

                Number of Content Conversion Done

Number of Documents Successfully Indexed

                Number of Documents that failed During Indexing

                Number of Failed Retries

                Number of Indexed Attachments

Number of HTML Message Bodies

                Number of RTF Message Bodies

                Number of Outstanding Batches

Number of Outstanding Documents

                Number of Plain text message Bodies

                Number of Update Notifications

               

MSFTESQL-Exchange:Catalogs

                Batches Completed Success

                Batches Completed w/errors

                Batches Completed w/warnings

                Batches done

                Batches in Progress

                Batches Received

                Trans in Progress

Let us take a look at each counter.

Following counters show a value of 1 for all the problem databases throughout the trace. This is the stuck batch which is causing the problem here. Value of this counter varies for a good database.

MSFTESQL-Exchange:Catalogs\Batches in Progress

MSFTESQL-Exchange:Catalogs\Trans in Progress

MSExchange Search Indices\Number of outstanding Batches

MSExchange Search Indices\Number of outstanding documents shows a small value (3 or 5 or something else) constant for the problem databases throughout the trace, but it shows high value and it varies for good databases, also it touches zero at times. This means that these 3 or 5 documents are part of the batch which is stuck on the problem databases and it is not accepting any more notifications.

Following counters are going to show a constant value throughout the trace for the problem databases, but for the other databases it keep increasing. This clearly indicates that the indexing stopped at some stage (before this trace was started) for the problem databases.

MSExchange Search Indices\Number of documents successfully indexed
MSExchange Search Indices\Number of documents that failed during Indexing
MSExchange Search Indices\Number of failed retries

Following activities will not be happening on the problem databases, so it will show a constant value throughout the trace, but for the other database, we can see that this counter keeps increasing.
MSExchange Search Indices\Number of Create Notifications

MSExchange Search Indices\Number of content conversions done

MSExchange Search Indices\Number of Delete Notifications

MSExchange Search Indices\Number of Documents Successfully Indexed

MSExchange Search Indices\Number of Documents That failed During Indexing

MSExchange Search Indices\Number of Failed Retries

MSExchange Search Indices\Number of indexed attachments
MSExchange Search Indices\Number of RTF message bodies
MSExchange Search Indices\Number of HTML message bodies
MSExchange Search Indices\Number of plain text message bodies

MSExchange Search Indices\Number of Update Notifications

MSFTESQL-Exchange:Catalogs\Batches Completed Success

MSFTESQL-Exchange:Catalogs\Batches Completed w/errors

MSFTESQL-Exchange:Catalogs\Batches Completed w/warnings

MSFTESQL-Exchange:Catalogs\Batches done

MSFTESQL-Exchange:Catalogs\Batches Received

This also means that the thread for that particular database within the Notification engine is unable to poll for the notifications because of the stuck batch.

Following counters are going to show a value of zero for the problem databases. Good database shows some latency.

MSExchange Search Indices\Average Latency of RPCs During Crawling

MSExchange Search Indices\Average Latency of RPCs to get Notifications

MSExchange Search Indices\Average Latency of RPCs used to Obtain Content

This indicates that the problem is not during processing. These problem databases are not processing anything.

Age of the last notification processed and Age of the last notification Indexed are going to show a constant value for the problem databases. For example, processed time shows 1 sec and indexed time shows 11 sec constant. This is the age (in seconds) of the last notification we have successfully processed and indexed for the given MDB.

There will be one thread that polls for notifications for the servers and MDBs we are handling in this content indexing instance. That thread/task will use MAPI.NET to grab notifications using the current high watermark event identifier. It will then call into the notification queue to add the events for processing. To minimize startup costs, each notification queue needs a maintenance procedure to go in and take the in-memory information about the high watermark for what events have been read and about what the highest event identifier that has actually been committed to Content Index.

2) Obtain 64bit version of Dip.exe tool from Microsoft.

You need to find the catalog directory for each database.

The search catalog has its own directory in the same place where database files are located.

For example, the catalog directory has the following name:

CatalogData-9e6c1672-7fe2-48b8-8845-2335fa2ed019-7b201267-3ed8-4d69-9d48-6fa753588cee

The first GUID (in red) is the database GUID and the second one (in blue) is the server GUID.

On the Exchange 2007 server, copy dip.exe to the Program Files\Microsoft\Exchange Server\Bin directory
Create a sub-directory under the Bin directory called Log
Register the msftepxy.dll file using the following: regsvr32 "c:\Program Files\Microsoft\Exchange Server\bin\msftepxy.dll"

For the example above, type in Dip.exe /p ExSearch- 9e6c1672-7fe2-48b8-8845-2335fa2ed019 -7b201267-3ed8-4d69-9d48-6fa753588cee >databasename-p.txt

Databasename-p.txt is the output file. Take this output for few problem databases and good databases.

Also take Dip.exe /c output using the following command.

Dip.exe /c >Dip-c.txt

Dip-c.txt is the output file.

Following is a sample output of Dip.exe /p from a problem database.

Init. CocreateInstance DataSink

admin cocreated

got datasink for project exsearch-f5cf7c68-b7c6-41ef-9b4b-67be1e474fef-cc64dd2d-2428-4f12-bba2-79d6d34c4d27

Heartbeat- state 1 Available 0x00000000

Batches Rcv 91810 InProg 1 Complete 86588

Trans Done 362880 Err 1579

Trans Hld 0 Dly 0 RdQ 0

        FD 1 STFD 0 HDFD 0

        Pip 0 Pnd 0 Cmp 1

        Wait 0 Red 0 Rmv 0

Batches InProg and FD will show 1 always. This is the stuck batch.

Following is a sample output of Dip.exe /p from a good database

Init. CocreateInstance DataSink

admin cocreated

got datasink for project exsearch-15a1b31c-2e23-4d48-aef2-487d630617ae-cc64dd2d-2428-4f12-bba2-79d6d34c4d27

Heartbeat- state 1 Available 0x00000000

Batches Rcv 602879 InProg 3 Complete 574927

Trans Done 2348724 Err 8365

Trans Hld 0 Dly 0 RdQ 0

        FD 2 STFD 0 HDFD 0

        Pip 0 Pnd 6 Cmp 0

        Wait 0 Red 0 Rmv 0

Batches InProg shows more than 1 most of the time if this is a busy server. The number in FD shows how many batches are being processed in Multi-thread filter daemon, which is equivalent to a snapshot. So if the engine is constantly indexing docs, the number will always be positive, meaning there are batches inside. You may see FD showing as zero if there are no batches inside.

Dip.exe /c output will show the following symptom:

Free FDId 0 MSFTESQL_CBExchanger85fj CBID 00000000026F92F0

          Cat 21 NoLongerValid docs 1 keys 209

          batch 3419385 docid 4417252 total 62760 secs updated 62760 secs

Highlighted ‘secs’ means the age of the chunkbuffer processed. This value is way larger than the normal one, so it indicates the symptoms of stuck batches.

As I mentioned earlier, this can be due to a power outage or an unexpected reboot of the server. You can see 6008 and 1076 events in the system log of the Exchange 2007 server during this unexpected reboot. You might notice that the search is successful for all items added to the mailbox before this event.

Power outage or unexpected reboot are not the only cause of this issue. We have seen that after following the steps mentioned in the solution section, search fails for a particular database again after few weeks or months.

This has been fixed in Exchange 2010. We are trying to back port this fix to Exchange 2007.

Thanks & Regards

Roopesh Pattan