How to check for issues with Distributed Cache and the script
We will go through the logs you get via the script:
Recommendations post:
AppFabric.txt
=======
==================
AppFabric Version
==================
This will tell you which CU for AppFabric is installed.
For instance:
1.0.4639.0
1.0.4644.0
1.0.4652.2
1.0.4653.2
1.0.4655.2
1.0.4656.2
1.0.4657.2
From CU 3 onwards, you need to add a section to the config file so that Garbage Collection works correctly. You can check this under "AppFabric Configuration File" in AppFabric.txt.
Look for:
<appSettings>
<add key="backgroundGC" value="true"/>
</appSettings>
Add Garbage Collection Setting
At the time of writing, the recommendation is to go to at least CU 6.
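If you want to automate the check for the Garbage Collection setting, something like the following could work. It is only a sketch: it assumes you have copied the configuration file section out of AppFabric.txt as well-formed XML, and the `<configuration>` wrapper and the function name `has_background_gc` are my own assumptions for the example.

```python
import xml.etree.ElementTree as ET


def has_background_gc(config_xml: str) -> bool:
    """Return True if an <appSettings> block sets backgroundGC to "true"."""
    root = ET.fromstring(config_xml)
    for app_settings in root.iter("appSettings"):
        for add in app_settings.findall("add"):
            if add.get("key") == "backgroundGC":
                return (add.get("value") or "").strip().lower() == "true"
    return False


# Hypothetical sample: the wrapper element is assumed, the inner part is from the post.
sample_config = """<configuration>
  <appSettings>
    <add key="backgroundGC" value="true"/>
  </appSettings>
</configuration>"""

print(has_background_gc(sample_config))
```

If this returns False on a server that is on CU 3 or later, the setting still needs to be added.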
==================
AppFabric Service
==================
name : AppFabricCachingService
startname : CONTOSO\spfarm
startmode : Auto
state : Running
Compare the account name with the account shown in the hosts section of Export.txt.
Sometimes someone has changed the account manually in the service itself (not supported). Such a change is not picked up by SharePoint or AppFabric.
==================
Registry
==================
AdminConfigured : 1
ServiceConfigured : 1
ConnectionString : Data Source=spalias;Initial Catalog=SharePoint_Config;Integrated Security=True;Enlist=False
Provider : SPDistributedCacheClusterProvider
PSPath : Microsoft.PowerShell.Core\Registry::HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\AppFabric\V1.0\Configuration
PSParentPath : Microsoft.PowerShell.Core\Registry::HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\AppFabric\V1.0
PSChildName : Configuration
PSDrive : HKLM
PSProvider : Microsoft.PowerShell.Core\Registry
If the connection string is empty, the system has no way of determining the Cache Cluster, which in our case (SharePoint) is stored in the Configuration Database.
In this example I used a SQL alias during configuration. If you use one as well, check that it resolves to the correct SQL Server.
Export.txt
=======
</cache>
<cache consistency="StrongConsistency" name="DistributedLogonTokenCache_49ef67e4-a8c6-44e4-b276-d51c9eb01b93"
minSecondaries="0">
<policy>
<eviction type="Lru" />
<expiration defaultTTL="10" isExpirable="true" />
</policy>
</cache>
This rarely goes wrong, but every cache name needs to end with the FarmID, like 49ef67e4-a8c6-44e4-b276-d51c9eb01b93.
If the FarmID is missing (except for the "default" cache, which "can" be listed first), somebody added a custom cache, which is not supported.
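A quick way to spot custom caches is to list every cache name that does not end with the FarmID. The snippet below is a rough sketch that assumes the `<cache>` elements from Export.txt are wrapped in a single root element (I used a made-up `<caches>` wrapper) and that you know your FarmID. The name `MyCustomCache` is invented for the example.

```python
import xml.etree.ElementTree as ET


def unexpected_cache_names(export_xml: str, farm_id: str) -> list:
    """Return cache names that neither are "default" nor end with _<FarmID>."""
    suffix = "_" + farm_id.lower()
    unexpected = []
    for cache in ET.fromstring(export_xml).iter("cache"):
        name = cache.get("name", "")
        if name == "default":
            continue  # the default cache has no FarmID suffix
        if not name.lower().endswith(suffix):
            unexpected.append(name)
    return unexpected


farm_id = "49ef67e4-a8c6-44e4-b276-d51c9eb01b93"
sample_caches = """<caches>
  <cache consistency="StrongConsistency" name="default" minSecondaries="0" />
  <cache consistency="StrongConsistency"
         name="DistributedLogonTokenCache_49ef67e4-a8c6-44e4-b276-d51c9eb01b93"
         minSecondaries="0" />
  <cache consistency="StrongConsistency" name="MyCustomCache" minSecondaries="0" />
</caches>"""

print(unexpected_cache_names(sample_caches, farm_id))
```

Anything returned here deserves a closer look, an empty list is what you want to see.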
<hosts>
<host replicationPort="22236" arbitrationPort="22235" clusterPort="22234"
hostId="203503434" size="614" leadHost="true" account="CONTOSO\spfarm"
cacheHostName="AppFabricCachingService" name="SPCA.contoso.com"
cachePort="22233" />
<host replicationPort="22236" arbitrationPort="22235" clusterPort="22234"
hostId="390812545" size="614" leadHost="true" account="CONTOSO\spfarm"
cacheHostName="AppFabricCachingService" name="SPWFE.contoso.com"
cachePort="22233" />
</hosts>
This is important: size, ports and account need to be the same on all Cache Hosts. When you add or remove a host, the size and account go back to the default, so always check them afterwards.
Verify that all the host names are FQDNs. I have only seen this go wrong once, a long time ago, but just to be sure ...
The size needs to be sufficient for running the service. If it is too low, it can lead to eviction of cached items and even crashes of the service.
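The checks on the hosts section can be scripted as well. This is a minimal sketch, assuming you paste the `<hosts>` block from Export.txt into a string. The function name `host_findings` is hypothetical, and the FQDN test is a simple "contains a dot" check, not a real DNS validation.

```python
import xml.etree.ElementTree as ET

# Attributes that should be identical on every Cache Host.
CHECKED = ("size", "account", "cachePort", "clusterPort",
           "arbitrationPort", "replicationPort")


def host_findings(hosts_xml: str) -> list:
    """Return human-readable findings, an empty list means nothing suspicious."""
    hosts = list(ET.fromstring(hosts_xml).iter("host"))
    findings = []
    for attr in CHECKED:
        values = {h.get(attr, "") for h in hosts}
        if len(values) > 1:
            findings.append(f"{attr} differs between hosts: {sorted(values)}")
    for h in hosts:
        name = h.get("name", "")
        if "." not in name:  # crude FQDN check
            findings.append(f"host name is not an FQDN: {name}")
    return findings


sample_hosts = """<hosts>
  <host replicationPort="22236" arbitrationPort="22235" clusterPort="22234"
        hostId="203503434" size="614" leadHost="true" account="CONTOSO\\spfarm"
        cacheHostName="AppFabricCachingService" name="SPCA.contoso.com"
        cachePort="22233" />
  <host replicationPort="22236" arbitrationPort="22235" clusterPort="22234"
        hostId="390812545" size="614" leadHost="true" account="CONTOSO\\spfarm"
        cacheHostName="AppFabricCachingService" name="SPWFE.contoso.com"
        cachePort="22233" />
</hosts>"""

print(host_findings(sample_hosts))
```

The same `account` value is also the one to compare against the `startname` of the AppFabric service in AppFabric.txt.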
GetInfo.txt
========
==================
Get-CacheClusterHealth
==================
HostName = SPCA.contoso.com
-------------------------
NamedCache = DistributedViewStateCache_49ef67e4-a8c6-44e4-b276-d51c9eb01b93
Healthy = 4.55
UnderReconfiguration = 0.00
NotPrimary = 0.00
InadequateSecondaries = 0.00
Throttled = 0.00
Unallocated named cache fractions
---------------------------------
NamedCache = default
Unallocated fraction = 0.07
NamedCache = DistributedActivityFeedCache_49ef67e4-a8c6-44e4-b276-d51c9eb01b93
Unallocated fraction = 0.14
NamedCache = DistributedDefaultCache_49ef67e4-a8c6-44e4-b276-d51c9eb01b93
Unallocated fraction = 0.07
If you see anything listed under "Unallocated", the caches are still building up, for instance after adding a Cache Host. This is nothing to worry about, just be sure the list is empty after about 15 minutes.
The "split" between Cache Hosts should always reflect how many hosts there are: with 3 hosts the content is divided in 3, so every host will show a value of around 3, and with 2 hosts each will show around 4.5.
"Throttled" is self-explanatory.
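To avoid reading through a long Get-CacheClusterHealth listing by eye, you could flag only the lines that matter. The following is a sketch under the assumption that the output keeps the `Key = Value` layout shown above. The function name `health_warnings` is made up for this example.

```python
def health_warnings(text: str) -> list:
    """Flag named caches that are throttled or still have an unallocated fraction."""
    warnings = []
    cache = None
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # headings and separator lines
        key, value = key.strip(), value.strip()
        if key == "NamedCache":
            cache = value
        elif key in ("Throttled", "Unallocated fraction") and float(value) > 0:
            warnings.append(f"{cache}: {key} = {value}")
    return warnings


sample_health = """HostName = SPCA.contoso.com
-------------------------
NamedCache = DistributedViewStateCache_49ef67e4-a8c6-44e4-b276-d51c9eb01b93
Healthy = 4.55
UnderReconfiguration = 0.00
NotPrimary = 0.00
InadequateSecondaries = 0.00
Throttled = 0.00
Unallocated named cache fractions
---------------------------------
NamedCache = default
Unallocated fraction = 0.07
NamedCache = DistributedActivityFeedCache_49ef67e4-a8c6-44e4-b276-d51c9eb01b93
Unallocated fraction = 0.14
NamedCache = DistributedDefaultCache_49ef67e4-a8c6-44e4-b276-d51c9eb01b93
Unallocated fraction = 0.07
"""

for warning in health_warnings(sample_health):
    print(warning)
```

For the sample above this reports the three caches that still have an unallocated fraction and nothing for "Throttled".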
==================
Get-CacheHost
==================
HostName : SPCA.contoso.com
PortNo : 22233
ServiceName : AppFabricCachingService
Status : Up
VersionInfo : 3[3,3][1,3]
HostName : SPWFE.contoso.com
PortNo : 22233
ServiceName : AppFabricCachingService
Status : Up
VersionInfo : 3[3,3][1,3]
This is the AppFabric side.
==================
Get-SPServiceInstance
==================
TypeName : Distributed Cache
Status : Online
Id : 8dde3fe4-b745-445c-ae1d-7f8a5497fe64
Server : SPServer Name=SPCA
TypeName : Distributed Cache
Status : Online
Id : 04bf5462-3089-44ff-9760-90b7945b3980
Server : SPServer Name=SPWFE
This is the SharePoint side.
A mismatch between the two is usually the culprit when, for instance, the AppFabric service crashes: AppFabric and SharePoint then have different information about which SharePoint servers are supposed to run as a Cache Host.
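Comparing the two lists can be done with a few lines of script. This is only a sketch: it assumes the `HostName : ...` and `Server : SPServer Name=...` layout from GetInfo.txt, compares short host names case-insensitively, and the function name `compare_hosts` is hypothetical.

```python
import re


def compare_hosts(cachehost_text: str, spservice_text: str):
    """Return (only_in_appfabric, only_in_sharepoint) as sets of short host names."""
    appfabric = {
        m.group(1).split(".")[0].upper()
        for m in re.finditer(r"^\s*HostName\s*:\s*(\S+)", cachehost_text, re.M)
    }
    sharepoint = {
        m.group(1).split(".")[0].upper()
        for m in re.finditer(r"^\s*Server\s*:\s*SPServer Name=(\S+)", spservice_text, re.M)
    }
    return appfabric - sharepoint, sharepoint - appfabric


sample_cachehost = """HostName : SPCA.contoso.com
PortNo : 22233
ServiceName : AppFabricCachingService
Status : Up
HostName : SPWFE.contoso.com
PortNo : 22233
ServiceName : AppFabricCachingService
Status : Up
"""

sample_spservice = """TypeName : Distributed Cache
Status : Online
Server : SPServer Name=SPCA
TypeName : Distributed Cache
Status : Online
Server : SPServer Name=SPWFE
"""

only_appfabric, only_sharepoint = compare_hosts(sample_cachehost, sample_spservice)
print("Only known to AppFabric:", sorted(only_appfabric))
print("Only known to SharePoint:", sorted(only_sharepoint))
```

Both lists should be empty. Any server that shows up in only one of them is a candidate for the mismatch described above.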
You can use Sam's blog to fix this:
==================
Cache Settings
==================
DistributedLogonTokenCache
ChannelInitializationTimeout : 60000
ConnectionBufferSize : 131072
MaxBufferPoolSize : 268435456
MaxBufferSize : 8388608
MaxOutputDelay : 2
ReceiveTimeout : 60000
ChannelOpenTimeOut : 20
RequestTimeout : 20
MaxConnectionsToServer : 4
Check whether these settings have been increased or decreased correctly. Keep in mind that the values are recommendations and you can change them even further, but this is usually not necessary.
The recommendation from the article below is 3000/3000/1 for ChannelOpenTimeOut/RequestTimeout/MaxConnectionsToServer.
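A small script can show which settings deviate from the recommendation. This is a sketch with one assumption worth stating loudly: I read 3000/3000/1 as applying to the last three settings in the listing (ChannelOpenTimeOut, RequestTimeout, MaxConnectionsToServer), so verify that against the article before relying on it. The function names are made up.

```python
# Assumed mapping of the 3000/3000/1 recommendation to setting names.
RECOMMENDED = {
    "ChannelOpenTimeOut": 3000,
    "RequestTimeout": 3000,
    "MaxConnectionsToServer": 1,
}


def parse_settings(text: str) -> dict:
    """Parse "Name : 123" lines into a dict, ignoring non-numeric lines."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            settings[key.strip()] = int(value.strip())
    return settings


def deviations(settings: dict, recommended: dict = RECOMMENDED) -> dict:
    """Return {name: (current, recommended)} for every setting that differs."""
    return {
        name: (settings.get(name), wanted)
        for name, wanted in recommended.items()
        if settings.get(name) != wanted
    }


sample_settings = """DistributedLogonTokenCache
ChannelInitializationTimeout : 60000
ConnectionBufferSize : 131072
MaxBufferPoolSize : 268435456
MaxBufferSize : 8388608
MaxOutputDelay : 2
ReceiveTimeout : 60000
ChannelOpenTimeOut : 20
RequestTimeout : 20
MaxConnectionsToServer : 4
"""

print(deviations(parse_settings(sample_settings)))
```

For the sample output in this post all three settings are reported, because they are still at 20/20/4.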
==================
NetShell
==================
MIB-II TCP Connection Entry
Local Address Local Port Remote Address Remote Port State
-----------------------------------------------------------------------------
192.168.0.51 1077 192.168.0.50 22234 Established
192.168.0.51 1098 192.168.0.50 22234 Established
192.168.0.51 3047 192.168.0.50 22233 Established
0.0.0.0 22233 0.0.0.0 0 Listen
0.0.0.0 22234 0.0.0.0 0 Listen
192.168.0.51 22234 192.168.0.50 21969 Established
192.168.0.51 22234 192.168.0.50 21996 Established
0.0.0.0 22236 0.0.0.0 0 Listen
As you can imagine, check whether ports 22233, 22234 and 22236 are established or listening (22235 should not show up).
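If the connection table is long, you could summarise the states per port with a short script. This is a sketch that assumes the five-column layout shown above (local address, local port, remote address, remote port, state). The function name `port_states` is hypothetical.

```python
def port_states(netsh_text: str, ports=(22233, 22234, 22236)) -> dict:
    """Map each port of interest to the set of connection states seen for it."""
    seen = {port: set() for port in ports}
    for line in netsh_text.splitlines():
        parts = line.split()
        # Data rows have exactly five columns with numeric port columns.
        if len(parts) != 5 or not (parts[1].isdigit() and parts[3].isdigit()):
            continue
        state = parts[4]
        for port in (int(parts[1]), int(parts[3])):
            if port in seen:
                seen[port].add(state)
    return seen


sample_netsh = """Local Address Local Port Remote Address Remote Port State
-----------------------------------------------------------------------------
192.168.0.51 1077 192.168.0.50 22234 Established
192.168.0.51 1098 192.168.0.50 22234 Established
192.168.0.51 3047 192.168.0.50 22233 Established
0.0.0.0 22233 0.0.0.0 0 Listen
0.0.0.0 22234 0.0.0.0 0 Listen
192.168.0.51 22234 192.168.0.50 21969 Established
192.168.0.51 22234 192.168.0.50 21996 Established
0.0.0.0 22236 0.0.0.0 0 Listen
"""

states = port_states(sample_netsh)
missing = [port for port, found in states.items() if not found]
print(states)
print("Ports without any connection or listener:", missing)
```

Passing `ports=(22235,)` to the same function lets you confirm that 22235 does not show up.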
MIB-II TCP Statistics
------------------------------------------------------
Timeout Algorithm: Van Jacobson's Algorithm
Minimum Timeout: 10
Maximum Timeout: 4294967295
Maximum Connections: Dynamic
Active Opens: 1840
Passive Opens: 291
Attempts Failed: 0
Established Resets: 648
Currently Established: 27
In Segments: 2042415
Out Segments: 1990285
Retransmitted Segments: 14201
In Errors: 0
Out Resets: 530
A very large number of connections might cause issues.
I added the netstat part so you can check which servers (by name) are communicating.
==================
Memory and Manufacturer
==================
12287.55078125 MB Total Physical Memory
8733.18359375 MB Total Free Memory
Microsoft Corporation System Manufacturer
VMWare BIOS Manufacturer
I did not find a reliable way to always tell whether the server is virtualized. If in doubt, ask your server administrator to check whether the SharePoint servers run on a hypervisor and, more specifically, whether Virtual Memory is configured, which is not supported.
Microsoft Corporation indicates Hyper-V.
VMWare BIOS indicates ...
I still see customers using virtual memory, most likely because non-SharePoint admins manage these environments.
Check whether your free memory is getting low.
==================
Local running Services
==================
TypeName
--------
Microsoft SharePoint Foundation Workflow Timer Service
Microsoft SharePoint Foundation Web Application
Central Administration
Microsoft SharePoint Foundation Incoming E-Mail
Distributed Cache
It is not recommended to run Distributed Cache on SharePoint servers that have:
SQL Server 2008 or SQL Server 2012
Search service
Excel Services in SharePoint
Project Server services
Something that is not mentioned in the article is that the User Profile Sync Service "can" also compete with DC.
I did not add a check for SQL. Simply open Start and check whether Management Studio is listed. If it is and you don't have access to SQL, you can check whether the /DATA location in the file system contains databases.
Check the "Capacity planning for the Distributed Cache service" section to see whether you are using the recommended memory sizes for your farm size.
==================
Accounts + Local Groups
==================
WSS_ADMIN_WPG
WSS_WPG
Alias name WSS_WPG
Comment Members of this group have read access to system resources used by Microsoft SharePoint Foundation.
Members
-------------------------------------------------------------------------------
CONTOSO\spfarm
CONTOSO\sppool
NT AUTHORITY\LOCAL SERVICE
NT AUTHORITY\SYSTEM
The command completed successfully.
Alias name WSS_ADMIN_WPG
Comment Members of this group have write access to system resources used by Microsoft SharePoint Foundation.
Members
-------------------------------------------------------------------------------
Administrators
CONTOSO\spfarm
CONTOSO\spinstaller
CONTOSO\test1
The command completed successfully.
Alias name Administrators
Comment Administrators have complete and unrestricted access to the computer/domain
Members
-------------------------------------------------------------------------------
Administrator
CONTOSO\Domain Admins
CONTOSO\spfarm
CONTOSO\spinstaller
The command completed successfully.
I did not add code to check inside nested groups, so if you see any, open them manually to check the members.
If you have issues with, for instance, Social features, check whether the User Profile account has access to Distributed Cache through any of these groups.
==================
Memory used by AppFabric Process
==================
Handles NPM(K) PM(K) WS(K) VM(M) CPU(s) Id ProcessName
------- ------ ----- ----- ----- ------ -- -----------
870 682 1421696 636972 1766 5,694.95 1364 DistributedCacheService
==================
FireWall Rules
==================
Name : FPS-ICMP4-ERQ-In
DisplayName : File and Printer Sharing (Echo Request - ICMPv4-In)
Description : Echo Request messages are sent as ping requests to other nodes.
DisplayGroup : File and Printer Sharing
Group : @FirewallAPI.dll,-28502
Enabled : True
Profile : Any
Platform : {}
Direction : Inbound
Action : Allow
EdgeTraversalPolicy : Block
LooseSourceMapping : False
LocalOnlyMapping : False
Owner :
PrimaryStatus : OK
Status : The rule was parsed successfully from the store. (65536)
EnforcementStatus : NotApplicable
PolicyStoreSource : PersistentStore
PolicyStoreSourceType : Local
I had some issues pulling only the specific firewall rules, so the output might contain more rules than needed.
Check that the rules are enabled, because they are mandatory. Keep in mind that they can be disabled when other firewalls manage the ports. If the rules are missing or disabled, a simple ping from every host in the cluster to the rest of the cluster will tell you whether ICMP is allowed.
For Remote Management, check these specific ports via Telnet/PSPing or ask your Network team.
I might add Test-Connection to ping all Cache Hosts from all Cache Hosts.
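Until then, a plain TCP connect test can stand in for Telnet/PSPing. This is a sketch and not part of the script: it checks TCP reachability only (it is not an ICMP ping), and the default port list is my assumption based on the cache ports shown in the hosts section. The function names are hypothetical.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_cluster(hosts, ports=(22233, 22234, 22236)) -> dict:
    """Try every host/port combination and return {(host, port): reachable}."""
    return {(host, port): port_open(host, port) for host in hosts for port in ports}


# Example call, to be run from each Cache Host in turn:
# results = check_cluster(["SPCA.contoso.com", "SPWFE.contoso.com"])
# for (host, port), ok in sorted(results.items()):
#     print(host, port, "reachable" if ok else "NOT reachable")
```

Run it from every Cache Host so that each direction is covered, a port that is reachable one way can still be blocked the other way.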
Some of these things you are not supposed to change manually unless MS Support tells you to do so. If in doubt, open a support case.
More to follow if needed.
Comments
- Anonymous
September 23, 2015
AppFabric & distributed cache issues in SharePoint is something that comes up with reasonable regularity
- Anonymous
May 01, 2016
What does the "Healthy" number mean in GetInfo.txt?