PowerShell to Rebalance Crawl Store DBs in SP2013

In SharePoint 2013, simply adding a new Crawl Store DB doesn't cause the Search Service Application (SSA) to rebalance links among the stores. Administrators also cannot manually trigger a rebalance until the standard deviation of link counts across the existing Crawl Stores exceeds the threshold defined by the SSA property CrawlStoreImbalanceThreshold.

Once this threshold is eventually exceeded, the Search Admin UI displays a control that allows the administrator to initiate the rebalancing process. Specifically, the CrawlStoresAreUnbalanced() method checks whether the standard deviation of link counts among all crawl stores is higher than the value defined by the SSA property CrawlStoreImbalanceThreshold. That said, you may have to lower the threshold much further than expected before CrawlStoresAreUnbalanced() evaluates to True. Another SSA property, CrawlPartitionSplitThreshold, determines the threshold at which the links for a single host can be split across multiple Crawl Store DBs during the rebalancing process.

The following walkthrough illustrates these cmdlets and methods, which are largely derived from the CrawlStorePartitionManager class ( https://msdn.microsoft.com/en-us/library/microsoft.office.server.search.administration.crawlstorepartitionmanager )

Prior to the rebalancing process, we can see that all links currently exist in a single Crawl Store DB:

Crawl Store DB Name    ContentSourceID    HostID    linkCount
V5_SSA_CrawlStore      1                  4         20,558
V5_SSA_CrawlStore      4                  1         157,671
V5_SSA_CrawlStore      6                  2         14,813
V5_SSA_CrawlStore      6                  3         10,818

 

$SSA = Get-SPEnterpriseSearchServiceApplication

New-SPEnterpriseSearchCrawlDatabase -SearchApplication $SSA -DatabaseName V5_SSA_CrawlStore2

New-SPEnterpriseSearchCrawlDatabase -SearchApplication $SSA -DatabaseName V5_SSA_CrawlStore3 

$foo = New-Object Microsoft.Office.Server.Search.Administration.CrawlStorePartitionManager($SSA)

$foo.CrawlStoresAreUnbalanced()

False 

$ssa.GetProperty("CrawlStoreImbalanceThreshold")

10000000 # 10 million (this is the default value)

$ssa.SetProperty("CrawlStoreImbalanceThreshold",10000)

# Verify in the registry of each Crawl Component that this setting reflects the new value

# ex: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\15.0\Search\Applications\1d330903-aad9-47e2-9373-f30e945c933c-crawl-0\CatalogNames

$foo.CrawlStoresAreUnbalanced()

True # After lowering the threshold, it's no longer "balanced"

$ssa.GetProperty("CrawlPartitionSplitThreshold")

10000000 # 10 million (this is the default value)

$ssa.SetProperty("CrawlPartitionSplitThreshold",50000)

# This allows the links for any host with more than 50,000 items to be split across Crawl Stores when rebalancing

$foo.BeginCrawlStoreRebalancing()

Guid

----

f9923696-76f1-482d-96cd-c10aedd92fa2

$foo.TimeToCompletion("f9923696-76f1-482d-96cd-c10aedd92fa2")

# Repeat as needed using GUID from above...

$foo.Completed("f9923696-76f1-482d-96cd-c10aedd92fa2")

True
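Rather than re-running TimeToCompletion() and Completed() by hand, the two status methods can be wrapped in a simple polling loop. This is just a sketch against a live farm: it assumes $foo is the CrawlStorePartitionManager created above, it reuses the Guid returned by BeginCrawlStoreRebalancing(), and the 60-second interval is an arbitrary choice.

```powershell
# Guid returned by BeginCrawlStoreRebalancing() in the steps above
$id = "f9923696-76f1-482d-96cd-c10aedd92fa2"

# Poll until the rebalancing job reports completion
while (-not $foo.Completed($id)) {
    Write-Host ("Estimated time to completion: " + $foo.TimeToCompletion($id))
    Start-Sleep -Seconds 60   # arbitrary polling interval
}
Write-Host "Crawl store rebalancing completed"
```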

 

After the rebalance, use SQL queries such as the following to confirm:

SELECT ContentSourceID, HostID, COUNT(*) AS linkCount FROM [V5_SSA_CrawlStore].[dbo].[MSSCrawlURL] WITH (NOLOCK) GROUP BY ContentSourceID, HostID ORDER BY ContentSourceID, HostID

SELECT ContentSourceID, HostID, COUNT(*) AS linkCount FROM [V5_SSA_CrawlStore2].[dbo].[MSSCrawlURL] WITH (NOLOCK) GROUP BY ContentSourceID, HostID ORDER BY ContentSourceID, HostID

SELECT ContentSourceID, HostID, COUNT(*) AS linkCount FROM [V5_SSA_CrawlStore3].[dbo].[MSSCrawlURL] WITH (NOLOCK) GROUP BY ContentSourceID, HostID ORDER BY ContentSourceID, HostID

Crawl Store DB Name    ContentSourceID    HostID    linkCount
V5_SSA_CrawlStore      1                  4         20,558
V5_SSA_CrawlStore      6                  2         14,813
V5_SSA_CrawlStore      6                  3         10,818
V5_SSA_CrawlStore2     4                  1         77,836
V5_SSA_CrawlStore3     4                  1         79,835

 

This confirms the rebalanced Crawl Store DBs and also illustrates the splitting of a single HostID across the crawl stores (in this case, HostID 1 was split across CrawlStore2 and CrawlStore3).

 

Update: I've recently had several people reach out to me after reading this TechNet article, which states:

“In SharePoint Server 2010, host distribution rules are used to associate a host with a specific crawl database. Because of changes in the search system architecture, SharePoint Server 2013 does not use host distribution rules. Instead, Search service application administrators can determine whether the crawl database should be rebalanced by monitoring the Databases view in the crawl log”

In response, I've just published Why Host Distribution Rules Don't Apply to SharePoint 2013.

 

Update: For reference, use the following PowerShell to determine the document counts being used by CrawlStoresAreUnbalanced() to calculate the standard deviation among all crawl stores:

$crawlLog = New-Object Microsoft.Office.Server.Search.Administration.CrawlLog $SSA

$dbHashtable = $crawlLog.GetCrawlDatabaseInfo()

$dbHashtable.Keys
    Guid
    ----
    5bf0290a-ad4c-4462-a7b2-6892be9431c1
    9e95a69e-7129-4d96-aaff-d577c4663cb3

$dbHashtable["5bf0290a-ad4c-4462-a7b2-6892be9431c1"]
    DocumentCount : 5094767
    Partitions : {msdn.microsoft.com, technet.microsoft....
    ID : 5bf0290a-ad4c-4462-a7b2-6892be9431c1
    Name : V5_SSA_CrawlStoreToo

$dbHashtable["9e95a69e-7129-4d96-aaff-d577c4663cb3"]
    DocumentCount : 188343
    Partitions : {{853da760-f456-4375-a77b-8e41bc218770}...
    ID : 9e95a69e-7129-4d96-aaff-d577c4663cb3
    Name : V5_SSA_CrawlStore
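As a rough illustration, these DocumentCount values can be fed into a population standard-deviation calculation and compared against the threshold. This is a sketch of the check described above, not the product's actual implementation, and it assumes $SSA is the Search Service Application object from earlier:

```powershell
$crawlLog = New-Object Microsoft.Office.Server.Search.Administration.CrawlLog $SSA

# Per-crawl-store document counts, as shown in the output above
$counts = @($crawlLog.GetCrawlDatabaseInfo().Values | ForEach-Object { $_.DocumentCount })

# Population standard deviation of the document counts
$mean   = ($counts | Measure-Object -Average).Average
$sumSq  = ($counts | ForEach-Object { [Math]::Pow($_ - $mean, 2) } | Measure-Object -Sum).Sum
$stdDev = [Math]::Sqrt($sumSq / $counts.Count)

$threshold = $SSA.GetProperty("CrawlStoreImbalanceThreshold")
"StdDev: $stdDev  Threshold: $threshold  Unbalanced: $($stdDev -gt $threshold)"
```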

Comments


  • Anonymous
    September 02, 2015
    Hi Craig - the check for CrawlStoresAreUnbalanced() is purely a check of standard deviation among the crawl stores. If you have one large content source (e.g. tied to a single domain) and one or more small content sources, then the behavior you're seeing is completely expected... Keep in mind that we try to keep all items tied to the same host together (which is why CrawlPartitionSplitThreshold defaults to 10 MILLION items - if there are fewer than this number of items tied to the host, then the rebalancing won't break up those links across crawl stores). So in your case, the 871K items (assuming they are tied to the same host) would not be split. So think of the rebalancing at the host (e.g. domain name) level like the host distribution rules of SP2010. We try to keep each bucket of items together, but if it is sufficiently large, we can break it off into another crawl store (this also prevents the max size of a crawl store from limiting the number of items from any given host... because we can now break it off into a new crawl store). In other words, the goal of rebalancing is not to make each crawl store closely (or even similarly) balanced in terms of the number of links in each, but rather to balance the buckets as best as possible. That said, it is quite possible for the buckets to be balanced as best we can while the actual standard deviation remains higher than your threshold (so CrawlStoresAreUnbalanced() would continue to report True). I hope this helps (apologies for the delayed response)

  • Anonymous
    October 05, 2015
    Hi, thanks for the good post. It helped me to split my Crawl Databases and understand what to expect. Those posts are unfortunately a rare thing :)

  • Anonymous
    February 15, 2016
    The rebalancing shows completed, but the SSA Administrative status is still "Paused for: Refactoring". Any idea?

  • Anonymous
    February 17, 2016
    It sounds like the SSA was paused with $SSA.PauseForIndexRepartitioning(), which you would use when adding a new index partition, such as that described here: technet.microsoft.com/.../jj862355.aspx
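    A minimal sketch of how that paused state could be checked and then cleared, assuming the pause really was for index repartitioning and that the repartitioning work has actually finished:

    ```powershell
    $SSA = Get-SPEnterpriseSearchServiceApplication

    # IsPaused() returns a non-zero value while the SSA is paused
    $SSA.IsPaused()

    # Counterpart to PauseForIndexRepartitioning(); only run this once the
    # index repartitioning work is complete
    $SSA.ResumeAfterIndexRepartitioning()
    ```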

    • Anonymous
      September 01, 2016
      Hi, I have about 18 million items in my single crawl database. I am planning to create another crawl database and run a full crawl. How do I decide the value for the CrawlStoreImbalanceThreshold and CrawlPartitionSplitThreshold settings? I have 19 content sources. DocumentCount Partitions ID Name ------------- ---------- -- ---- 18075335 {{6f2db109-745t-4e35-a0a4-161dd39ffd42}, {7... ef3f1a0f-34rdf-42c1-b336-9897aed27fc3 ContentSourceID HostID linkCount1 5 14 1 18825 1 89616 1 15778577 1 9382408 1 13322779 1 251773710 1 20940511 1 128805312 1 154483215 2 983217 2 4218 1 344457220 1 29304723 3 1923 4 192524 1 2527 1 114292628 1 3762984 Great post. Thanks!!
      • Anonymous
        February 28, 2017
        I generally wouldn't modify the CrawlStoreImbalanceThreshold and CrawlPartitionSplitThreshold values - they will typically be the most performant as-is. That post was really trying to illustrate how they are managed, not advocating any change to these values.