SharePoint 2013 + Distributed Cache (AppFabric) Troubleshooting

Article
03/19/2014

Two messages about caching that you may have seen if you’ve administered SharePoint 2013 in any way are “This Distributed Cache host may cause cache reliability problems” and/or “cacheHostInfo is null” from PowerShell. This article is about how to fix those errors, and caching reliability problems in general, for SharePoint 2013.

Update: see a simplified version of this article here if you're not sure how AppFabric works with SharePoint.

Cache reliability warnings are fairly common to see in SharePoint 2013 installations of any complexity. It’s to do with how SharePoint interacts with the distributed cache cluster that’s used for all sorts of caching needs in 2013 from caching user tokens (with a fall-back option if it fails), to security trimming search results (also with fall-back on failure), to the social news-feed (with no fall-back – social just doesn’t work without a healthy cache cluster), all powered by AppFabric. For the most part a cache failure just means less than optimal performance but not always.

Therefore if you see this message in SharePoint you should pay attention to it. Here’s an example health error:

This message can come up for several reasons but in short, one or more servers that SharePoint thinks should be hosting the cache cluster aren’t, for one reason or another. This guide will hopefully show how to fix this rather broad issue, but the fix depends on what the problem is, so to start you need to pick the scenario that describes your own…

Scenario 1 – SharePoint and AppFabric Don’t Agree Which Servers are in the Cluster

As already mentioned, SharePoint uses AppFabric for caching under the hood, which is an entirely standalone product in its own right. This means that AppFabric has its own ideas about what machines should make up the cluster, in parallel to SharePoint. Normally the two lists of servers coincide perfectly, so nobody notices AppFabric is even a thing until there’s a problem. Any mismatch in server-info between the two products can cause some pretty ugly problems, though, and is often the root cause of the infamous “cacheHostInfo is null” error. The two server-lists need to be identical (and healthy) so let’s check both…

Query AppFabric for Caching Servers/Statuses

To get the list of servers AppFabric thinks there should be, run “Get-CacheHost” (run “Use-CacheCluster” first if necessary). This command gives us not just the servers but also each server’s serviceability status as far as AppFabric’s concerned.
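As a minimal sketch, the AppFabric-side check looks something like this. It assumes you run it in the SharePoint 2013 Management Shell on a server that is (or should be) part of the cache cluster; the exact output columns may vary with your AppFabric version.

```powershell
# Point this PowerShell session at the farm's cache cluster
# (only needed if Get-CacheHost complains that no cluster is set).
Use-CacheCluster

# List every host AppFabric believes is in the cluster, with its service status.
Get-CacheHost
```

Make a note of the host names exactly as they are listed here; they are what gets compared against SharePoint's list next.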

Query SharePoint for Caching Servers/Statuses

To do the same for SharePoint, run:

Get-SPServiceInstance | ? {($_.service.tostring()) -eq "SPDistributedCacheService Name=AppFabricCachingService"} | select Server, Status

This will give you the same kind of data but from SharePoint’s POV instead. Make sure all statuses say “Online” but, more importantly, that SP & AF list the same server names. As mentioned before, if you’re seeing “cacheHostInfo is null” somewhere then it’s quite likely there’s a mismatch here.
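To make the comparison less of an eyeball exercise, here is a minimal sketch of a helper that diffs the two lists. Compare-CacheHostList and Get-ShortHostName are hypothetical names of my own, not SharePoint or AppFabric cmdlets. Reducing each name to its short, lower-cased form is also an assumption on my part (AppFabric may list a host by its fully qualified name while SharePoint shows the short name), so tighten it up if you need an exact match.

```powershell
# Hypothetical helper: reduce "WFE1.contoso.local" or "WFE1" to "wfe1".
function Get-ShortHostName {
    param([string[]]$Names)
    foreach ($name in $Names) {
        if ($name) { $name.Split('.')[0].ToLowerInvariant() }
    }
}

# Hypothetical helper: report servers that only one of the two products knows about.
function Compare-CacheHostList {
    param(
        [string[]]$AppFabricHosts,    # host names as listed by Get-CacheHost
        [string[]]$SharePointServers  # server names from the Get-SPServiceInstance query
    )

    $af = @(Get-ShortHostName -Names $AppFabricHosts)
    $sp = @(Get-ShortHostName -Names $SharePointServers)

    New-Object PSObject -Property @{
        OnlyInAppFabric  = @($af | Where-Object { $sp -notcontains $_ })
        OnlyInSharePoint = @($sp | Where-Object { $af -notcontains $_ })
    }
}
```

Feed it the host names from Get-CacheHost and the server names from the SharePoint query (for example by piping that query to `% { $_.Server.Name }`). Both lists in the result coming back empty is what you are after; anything in OnlyInAppFabric is a candidate ghost server, and anything in OnlyInSharePoint is a service-instance without a matching cache host.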

Oh No! AppFabric and SharePoint Server Lists Don’t Match!

Maybe AF thinks there are more servers caching than SharePoint does; maybe the server names don’t coincide. Here’s an example of a server-name mismatch:

Even if the names matched by the way, this particular example would also fail because the service-instance is disabled but for now let’s just focus on the name mismatch, which will indeed cause all sorts of cache reliability problems too.

It’s probably going to be AppFabric that’s got a server that SharePoint doesn’t think is caching anything, possibly because said server isn’t in the farm anymore, or at least its name isn’t. At the time of writing, renaming a server with Rename-SPServer won’t update the name in AppFabric too, which causes this type of mismatch (a small “feature” if you will).

In any case, AppFabric and SharePoint need a coinciding list and SharePoint needs the service-instance to be “online” (not “disabled”).

How to Remove Zombie AppFabric Service Instances from SharePoint Topology

If, as is more common, you also need to remove AppFabric service instances from SharePoint, say because the service-instance is disabled, you can do it with these commands:

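# Name of the distributed cache service as SharePoint reports it
$instanceName = "SPDistributedCacheService Name=AppFabricCachingService"
# Pick out the service-instance hosted on this machine only
$serviceInstance = Get-SPServiceInstance | ? {($_.service.tostring()) -eq $instanceName -and ($_.server.name) -eq $env:computername}

# Try to un-provision the service, then remove the instance from the configuration database
$serviceInstance.Unprovision()
$serviceInstance.Delete()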

This PowerShell snippet tries to un-provision the service on the server (which might fail), then removes the service-instance from the SharePoint configuration database. If you look at the query, we pick out the service-instance that matches this machine’s name, so there’s no danger of it removing the wrong instance as long as it’s run in a PowerShell console on the right machine.

You can do this from any machine for any other machine if you change the last where clause that passes in “this computer name”. For my example above I’ll change the computer-name to “sfb-sp15-wfe1” as that’s the server that has the bad service-endpoint.
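As a sketch of that remote variant, the same query can take an explicit server name instead of “this computer name”. The $serverName and $found variables, the count check and the try/catch are my own additions for illustration; the query itself and the Unprovision/Delete calls are the ones from the snippet above.

```powershell
# Server whose distributed cache service-instance should be removed (from my example).
$serverName   = "sfb-sp15-wfe1"
$instanceName = "SPDistributedCacheService Name=AppFabricCachingService"

# Same query as before, but matching an explicit server name.
$found = @(Get-SPServiceInstance | ? {
    ($_.service.tostring()) -eq $instanceName -and ($_.server.name) -eq $serverName
})

# Assumption: bail out unless exactly one instance matched, so nothing unexpected gets deleted.
if ($found.Count -ne 1) {
    throw "Expected exactly one matching service-instance on $serverName, found $($found.Count)."
}

# Un-provisioning might fail (for example if the server is unreachable); carry on to the delete regardless.
try { $found[0].Unprovision() } catch { Write-Warning "Unprovision failed: $_" }
$found[0].Delete()
```

Re-run the SharePoint query afterwards to confirm the instance is gone from the list.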

How to Remove Ghost Servers from AppFabric

We need to remove any server that just doesn’t exist in the farm in any way. However, if there’s a server in AppFabric that is in the farm but just shouldn’t be hosting AF, do not use this method; try running “Remove-SPDistributedCacheServiceInstance” on the farm server in question first.

If, on the other hand, manually ripping the host out of the AppFabric cluster is the last resort, this is how. From a machine that is still working in the cluster (if possible), run Unregister-CacheHost, passing in the name of the server to remove, the SharePoint provider and the “connection-string”, like so:

Unregister-CacheHost -HostName [machine] -ProviderType SPDistributedCacheClusterProvider -ConnectionString \\[machine]

Replace [machine] with the name of the machine you want to evict, written the way Get-CacheHost lists it (the fully qualified name in my case). In my example it would be:

Unregister-CacheHost -HostName sfb-sp15-wfe1.sfb-testnet.local -ProviderType SPDistributedCacheClusterProvider -ConnectionString \\sfb-sp15-wfe1.sfb-testnet.local

Once all phantom hosts have been eliminated from AppFabric, all being well we should have a healthy-if-slimmed-down cluster we can re-add other nodes to in the normal way with Add-SPDistributedCacheServiceInstance – which adds to AppFabric and SharePoint both, as the good SPLord intended. Before doing so, verify one more time that both SharePoint and AppFabric have the same server-list and that AppFabric says the server is “up” and SharePoint says the service-instance is “online”.

One More Time: Verify Service End-Points and AppFabric cluster Agree on Servers

All servers need to be in the AppFabric cluster and host an AppFabric service-instance in the farm, and be online:

Having cleaned out the rogue entries, I’ve gone back and added the other servers too with Add-SPDistributedCacheServiceInstance which sorts out both the SP and AF configuration at once.

Until you achieve this exact parity do not continue. The AppFabric hosts don’t necessarily need to be “up” at this time but the names have to coincide and SharePoint needs to have the service-instances online.

At this point your caching woes may even be over! In Central Administration get SharePoint to recheck any health-warnings about distributed cache.

Scenario 2 – No Server Mismatch but One or More AppFabric Service Instances are Disabled

At this stage we’ve verified the server lists between SP and AF match-up. Run this PowerShell command to find out if we have zombie endpoints in SharePoint:

Get-SPServiceInstance | ? {($_.service.tostring()) -eq "SPDistributedCacheService Name=AppFabricCachingService"} | select Server, Status

If any status says “disabled” then you have a problem. You need to:

Remove the service-instance (see above).
Try re-adding it with Add-SPDistributedCacheServiceInstance
Verify the new service-instance is “online”.

If for some reason Add-SPDistributedCacheServiceInstance doesn’t give you a healthy endpoint, try running Remove-SPDistributedCacheServiceInstance then Add-SPDistributedCacheServiceInstance on the server in question. If you still can’t get a healthy endpoint after all that you’ll probably need to contact Premier support.
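The remove/add cycle described above can be sketched as follows. All three commands come from this article; the only assumption is that you run them in the SharePoint 2013 Management Shell on the server with the unhealthy endpoint.

```powershell
# Run on the server in question: take it out of the cache cluster, then put it back.
Remove-SPDistributedCacheServiceInstance
Add-SPDistributedCacheServiceInstance

# Re-check the service-instance statuses from SharePoint's point of view.
Get-SPServiceInstance | ? {($_.service.tostring()) -eq "SPDistributedCacheService Name=AppFabricCachingService"} | select Server, Status
```

The server should come back with a status of “Online”; if it still says “Disabled”, removing the service-instance as shown in Scenario 1 before re-adding is the next thing to try.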

Scenario 3 – AppFabric & SharePoint Agree on Cache Servers but Some Servers are Down

In this scenario both products are on the same page about who should be caching but one or more nodes just aren’t for some reason or other.

Problem: Servers use Dynamic or Shared Memory

AppFabric is particularly sensitive to dynamic/shared memory. It can work on it but Microsoft doesn’t support it and if you wanted our help with an AppFabric cluster we wouldn’t do much unless each server had a fixed amount of memory, always.

Now that the disclaimer’s done: I’ve had it working just fine on test VMs with dynamic memory using around 16 GB. That said, I tend to find that if memory usage expands suddenly and the host OS can’t provide the guest OS with memory quickly enough, AppFabric will just give up and you’ll have to re-provision it all over again. The moral of the story here is, don’t be cheap on memory and expect AppFabric to work. Really, don’t, especially for anything that’s not your dev-box.

Problem: AppFabric Server Configuration State is Corrupt

First of all let’s see if the failing node even knows about the cluster. I’ve had a couple of occasions where the configuration has just died for various reasons and has just had to be reset. Run a check by getting the local cluster status with “Get-CacheHost” (use “Use-CacheCluster” if necessary).

This would suggest the cluster configuration on this failing node has died for reasons we don’t know, nor particularly care about, assuming it’s not a regular occurrence. Cache clusters are trivial to set up, so let’s just jump to the solution: remove and re-add the node with Remove-SPDistributedCacheServiceInstance and Add-SPDistributedCacheServiceInstance.

If you see the “cacheHostInfo is null” message during any of those, remove the service instances from SharePoint and the host from the AppFabric cluster as shown above, then repeat the remove/add commands.

Problem: AppFabric Service not Started

You’ll get reliability problems if the service isn’t started.

This is bad. This however is good:

If the service won’t start for some reason then I’d try removing & re-adding the server with Remove-SPDistributedCacheServiceInstance and Add-SPDistributedCacheServiceInstance.
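A quick way to check the service state from PowerShell is sketched below. It assumes the Windows service is registered under the name AppFabricCachingService (the same name SharePoint uses in the service-instance query earlier); if Get-Service can’t find it, look it up by its display name in the Services console instead.

```powershell
# Run on each cache host: shows whether the AppFabric caching Windows service is running.
Get-Service -Name AppFabricCachingService | select Name, DisplayName, Status
```

A status of “Running” is the good case; “Stopped” on a server that should be caching is the bad one.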

Problem: Firewall Interference

Firewalls are a consideration for AppFabric. You should be able to see lots of chatter on port 22234, which is the internal cluster-chatter port. You should also see some activity on 22233, which is how SharePoint talks to the cluster; just make sure you don’t see TCP retransmission packets being sent consistently.

Each cluster node needs these ports open to every other node, and network tracing skills come in pretty handy here if you’re not sure whether the ports are open.
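If you don’t have a network trace to hand, a basic reachability test is a reasonable first step. The sketch below uses the .NET TcpClient class so it doesn’t depend on newer networking cmdlets; Test-TcpPort is a hypothetical helper name of my own, and the 3 second timeout is an arbitrary choice. It only tells you whether a TCP connection can be opened, not whether packets are being retransmitted.

```powershell
# Hypothetical helper: returns $true if a TCP connection to the port can be opened in time.
function Test-TcpPort {
    param(
        [string]$ComputerName,
        [int]$Port,
        [int]$TimeoutMs = 3000
    )

    $client = New-Object System.Net.Sockets.TcpClient
    try {
        # Start the connection attempt and wait up to the timeout for it to finish.
        $async = $client.BeginConnect($ComputerName, $Port, $null, $null)
        if (-not $async.AsyncWaitHandle.WaitOne($TimeoutMs, $false)) { return $false }

        # EndConnect throws if the connection was refused or otherwise failed.
        $client.EndConnect($async)
        return $client.Connected
    } catch {
        return $false
    } finally {
        $client.Close()
    }
}
```

From each cache host you would call it once per other host and port, for example `Test-TcpPort -ComputerName sfb-sp15-wfe1 -Port 22233` and again with 22234. A result of `$false` between two nodes that should be talking points to a firewall (or to the service not listening at all).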

More information about ports needed @ https://msdn.microsoft.com/en-us/library/ee790914(v=azure.10).aspx

Edit: my good colleague Filip Bosmans has written a nice script which checks the general health of the cache-cluster all round, so can make some of these checks more automatic. If you're having AppFabric issues, try his script out here.

Wrapping Up

Getting this right isn’t as easy as you might think. For the most part caching and AppFabric just take care of themselves, but there’s clearly a need to get your hands dirty now & then. Many people don’t even realize that SharePoint just drives AppFabric, or how that setup works; fixing these issues is mainly about understanding how to troubleshoot these two products as one.

Let me know in the comments if there are any scenarios I haven’t covered – this is something I’d like to add to over time if needed. Thanks to my colleagues Filip Bosmans, Vlad Mihat, and others for helping out with this.

Cheers,

// Sam Betts

Comments

Anonymous
March 18, 2014
Nice summary! Thanks Sam!
Anonymous
March 18, 2014
Awesome work Sam!
Anonymous
March 20, 2014
The comment has been removed
Anonymous
March 20, 2014
Hmm that looks like a timeout of some kind - try increasing the client-side cache settings to increase this value on the WFE in question with Set-SPDistributedCacheClientSettings (technet.microsoft.com/.../jj219593.aspx) - the ReceiveTimeout is the one you're after if memory serves.
Anonymous
April 14, 2014
That is a very helpful article. I am seeing something in my farm that shows the status of the server is 'Provisioning' and the services on server shows it is stuck on starting. Which approach would work best to resolve this?
Anonymous
April 14, 2014
Try a Remove/Add-SPDistributedCacheServiceInstance on the affected server and see if that makes a difference. That solves a multitude of sins, most of the time.
Anonymous
May 29, 2014
This is the best article on Distributed Cache and AppFabric I have seen. You fixed my farm for me. Thanks!
Anonymous
June 02, 2014
Hey Samuel, great article! You combined all I've been reading in other places today. But unfortunately nothing has helped to resolve my issue yet or I might be doing something wrong. I tried to add Distributed Cache back to one of the servers, but it got stuck. Its status is "Starting" and I get the same errors in the screenshot you have for "Problem: AppFabric Server Configuration State is Corrupt". I tried to restart "AppFabric Caching Service", but it didn't help. Any suggestions on how to fix this? Thank you!
Anonymous
June 29, 2014
Try removing all instances of the AF cluster on all nodes to try and clean-out the config store, then adding the nodes again. Make sure SharePoint doesn't think there's any SPServiceInstances with that role. Failing that, the config store in the SharePoint Config DB is probably corrupt (which would be very strange) so you may need a new Config DB (farm rebuild, basically).
Anonymous
July 07, 2014
Hi Samuel, i do have a question on Scenario 1 –SharePoint and AppFabric Don’t Agree Which Servers are in the Cluster.
When i run the Query from Sharepoint POV i get:
Server Name : DEV
status online
When i run from App Fabric POV:
HostName Service Status
DEV.domain.local :22233 Running
Both shows online and running. One is showing the Short computer name and the other showing the FQDN. Is this a mismatch? Thanks
Anonymous
July 08, 2014
Not necessarily; is SharePoint reporting a problem in CA? Can you see SharePoint ULS giving cache errors?
Anonymous
July 08, 2014
The comment has been removed
Anonymous
August 14, 2014
The comment has been removed
Anonymous
September 22, 2014
Excellent, a life saver, thanks
Anonymous
January 07, 2015
Hi Sam, Thanks for putting this together. It takes a lot of worry off my mind as I've read this is a finicky service but your post shows there's resolutions to the issues when they come up. I'm looking at having a dedicated Distributed Cache Service that runs Windows 2012 R2 Standard and SP 2013 Foundation. My farm is Enterprise license though, can I get away with running foundation on the DCS boxes? I'm also wondering about what to do when I need to patch those servers though. In the event that I need to reboot after applying a patch should I stop the service before rebooting or before applying the patch? If I use the -Graceful parameter it should preserve the cached contents. I read in a forum that you also need to run the Remove-SPDistributedCacheServiceInstance command when shutting down. I don't see why that would be needed though. Can you provide some insight into the patching process of the dedicated cache host and the cache hosts that run on farm elements (wfe or app boxes)? I don't think I'm the only one with concerns about this but it wouldn't be the first time if I were ;) Thanks!
Anonymous
February 13, 2015
Hi Rich, Just in case I've misunderstood, we don't support mixing SharePoint editions in the same farm; a farm should be all enterprise/standard/foundation but not a mixture. For when an AppFabric machine is going to be rebooted, this article should be followed to avoid objects in the cache being lost on reboot - technet.microsoft.com/.../jj219613.aspx
Anonymous
March 03, 2015
The comment has been removed
Anonymous
March 03, 2015
Hi Samuel Thanks for contributing this blog. I resolved my query with the help of this.
Anonymous
March 03, 2015
Hi And one more thing, at Scenario 3 which are servers may get down. I basically didn't understand that part.
Anonymous
March 08, 2015
Hey Eric,

1. Yep, I'd still run it - AppFabric caches a bunch of stuff; not just social. Some things require AppFabric to work properly - login tokens for example (if you don't want users to continually have to log-in, seemingly at random).
2. Correct; graceful shutdowns won't be so necessary if you're not running social but see answer #1 ^.

Indul - What's not clear? Basically, all AF servers need to be able to communicate to each other (services started & network comms not blocked).

Anonymous
March 17, 2015
Great article on what Distributed Cache is; however, we are not really sure why this it is on. We have a single SharePoint 2013 server that was implemented by a vendor. From what we have read this function is not on by default and is turned on. I am guessing this was not configured properly or the issue described her occurs over a period of time. Should we apply the fix described here or should we consider shutting off distributed cache? Thanks for any advice!
Anonymous
March 17, 2015
Distributed cache should be on, and is on by default for SPServers. If it's not working, I'd highly recommend activating/fixing it :) Edit: it's mainly used for log-in token + view-state caching, some other internal stuff, and the social capabilities.
Anonymous
March 25, 2015
Thank you Samuel! We will reread the article and look to make the necessary changes..
Anonymous
March 27, 2015
Thank you, helpful blog.
Anonymous
April 26, 2015
I am getting this error in logs Unexpected Exception in SPDistributedCachePointerWrapper::InitializeDataCacheFactory for usage 'DistributedLogonTokenCache' - Exception 'Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0009>:SubStatus<ES0001>:Cache referred to does not exist. and I have checked the instances using Get-CacheHost and Get-SPServiceInstance. Both matches. I am still getting the error in logs and Newsfeeds - "We're still collecting the latest news. You may see more if you try again a little later"
Anonymous
April 28, 2015
Hi Ashish, Have you tried granting access to the UPA account? Look for "Grant-CacheAllowedClientAccount" in my other post - blogs.msdn.com/.../troubleshooting-appfabric-reliability-issues-for-sharepoint.aspx // Sam
Anonymous
May 19, 2015
Thanks great article. It worked like charm.
Anonymous
June 15, 2015
Good job, pal.
Anonymous
June 30, 2015
I have a question, I started the appfabric over two servers , but one of the servers is able to see the status of the nodes using "get-cachehost" , and the other node is not able to see, with error " cannot read the connection String, please add them manually" the instances online over both servers
Anonymous
July 12, 2015
Have you tried Use-CacheCluster 1st?
Anonymous
August 13, 2015
Good one - thanks for putting this together Samuel!
Anonymous
August 13, 2015
Hi Samuel - I was able to sort out most of my issues with Distributed Cache Host with your guide except one. Hoping you can shed some light. Health Analyzer failing on a non-existent server. "Unregister-CacheHost" fails with "No such host is known". Any suggestion how can I remove this. Tried doing multiple "ReAnalyze". Host names match through Get-ServiceInstance and Get-CacheHost commands. That non-existent server does not show up in these lists.Not sure where Health Analyzer picking that name from. That server used to exist at one point but decommissioned long time ago. :-( Appreciate any feedback!
Anonymous
August 13, 2015
Hey BlueSky. If you export the cache configuration to an XML file, do the host names match up? Another possibility is that the rule result is out of date - delete the error and see if it comes back.
Anonymous
August 17, 2015
Looks like I just needed to wait long enough for the HA to pick this up after my cleanup. Came back Monday and not seeing the problem one anymore. Thanks for your feedback Samuel!
Anonymous
August 31, 2015
Hi Samuel, I am trying to Add-SPDistributedCacheServiceInstance but the console says do not load the aseembly Microsft.ApplicationServer.Caching.Configuration version=1.0.0.0, I have installed AppFabric 1.1 and the version of this assembly is 1.0.4632.0 How i can solucionate this? Thanks!!!!
Anonymous
September 08, 2015
I have my AppFabric Service with status UP, the name of service match with my SPService, but this is disabled I do not Remove the service-instance because i get an error the assembly of my dll is version 1.0.0.0 (windows fabric) and y my server 2012 i have installed the version 1.0.4632.0. Can i resolve this error? I need help!! I make all types of things but I do not change the version of assembly
Anonymous
September 08, 2015
Hey CleopatraKent, Maybe this script might help in figuring out why - blogs.technet.com/.../how-to-check-for-issues-with-distributed-cache-and-the-script.aspx // Sam
Anonymous
September 29, 2015
Hi Samuel, Now I have AppFabric Service UP and my sharepoint caching ONLINE but my Distributed Cache in sharepoint do not work well . I try to Stop and Start the service "Distributed Cache" in Sharepoint but it tell me "cacheHostInfo is null" Can i resolve this error? I need help!! Please!! 1000 Thanks Samuel
Anonymous
October 01, 2015
Hey Sam, What a fantastic, clear and concise article! A somewhat rare event. It made getting my second WFE back to playing nice a walk in the park. Especially when I just read the TechNet article that states: The Distributed Cache service can end up in a non-functioning or unrecoverable state if you do not follow the procedures that are listed in this article. ***** In extreme scenarios, you might have to rebuild the server farm. ***** technet.microsoft.com/.../jj219613.aspx Thanks mate! Ben
Anonymous
October 04, 2015
Hey Ben, thanks for the comments - glad it helped!
Anonymous
November 26, 2015
Any ideas on how to remove a "ghost" cache host that doesn't appear in in the cache cluster (get-cachehost)), nor is it part of the Farm. Only indication the Farm even "knows" about this server is from the health warnings that I see generated daily. The Unregister-CacheHost tells me that that "No such host is known".
Anonymous
December 09, 2015
The comment has been removed
Anonymous
January 07, 2016
Hi, i have one dedicated server in my farm for Distribution cache service, we are not using mysites or newsfeed so far. We need to shut down all SharePoint servers (Due to some internal outages) and bring it back, so in this scenario do we need to stop Distribution Cache service gracefully before doing server shutdown? Thanks, John
Anonymous
January 13, 2016
Hi John, If there's just one server then a graceful shutdown by definition won't do anything as there's nothing to offload cache objects to. No worries though - the worst that'll happen is logon tokens will be lost & users will have to reauthenticate. Sam
Anonymous
January 13, 2016
Simon, If SP reports the cluster is down, if I'm not mistaken that could mean a network issue amongst other things (not necessarily a problem, but maybe saturation for example). I'd set your max connections to 1 for all the cache-containers & see if that helps - maybe it's being overloaded, especially if you're using claims logins which will generate a lot of connections. Unfortunately though that's just one potential cause amongst many; I assume the timer-job works normally & this is a one-off now & then? // Sam
Anonymous
January 13, 2016
Gerald S, When was the health warning generated? It's possible it's not been refreshed in a while maybe? You can always export the cache config & see if it appears anywhere in the XML file. // Sam
Anonymous
March 25, 2016
Thanks for the post really helped get my dev farm back in action
Anonymous
December 02, 2016
Hi. Question: is the graceful shutdown still necessary with the latest sp 2013 updates. I did a test by creating some feeds and documents and social stuff then restarted both servers using windows start/shutdown button. When the servers came back I ran the repopulate command and all the stuff showed up ! I'm running april 2016 CU i guess. I'll have to do some resting to confirm. I'm using Appfabric CU 7.
Anonymous
December 02, 2016
Hi, great post for me. A question for you: I have a Sharepoint 2013 Server farm in Live where one node is Central Administration and 2 node are WFE. If I query the cachecluster from WFE2 with get-cachehost I see all the 3 cache UP but if I query the cachecluster from CA or from WFE1 the WFE2 cache is UNKNOW. The firewall are down. AppFabric ( CU7 ) was installed as prerequisite of Sharepoint. Any idea? Thanks
