SharePoint 2016: More Cache hosts are running in this deployment than are registered with SharePoint

Problem

Discovered that a SharePoint 2016 farm was generating the following series of health rule violations listed in the farm's Review problems and solutions list:

  • Distributed cache service is not enabled in this deployment
  • Server role configuration isn't correct
  • Distributed cache service is unexpectedly configured on server(s)
  • Distributed cache service is not configured on server(s)
  • More cache hosts are running in this deployment than are registered with SharePoint

The farm presented a three-server topology: one DB, one APP (MinRole: Application with Search) and one WFE (MinRole: Front-end with Distributed Cache). Began troubleshooting.

Analysis

01) Checked Services in Farm: found Distributed Cache service provisioned on both APP (Compliant: No(Fix)) and WFE (Compliant: Yes). Clicked on Fix - after a moment, Compliant returned to No(Fix).

02) Checked Services on Server: found the Distributed Cache service on both APP and WFE with a Status of Stopped.

03) Checked the Windows Server Services administrative tool on APP and WFE and found both presenting Startup Type Disabled and Status [blank] for the AppFabric Caching Service. Also found that the identity of each was the farm's application service account, call it spSvc.
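The same check can be scripted rather than read from the Services tool on each server. The following is a minimal sketch, assuming the placeholder host names APP and WFE used in this post and assuming that WinRM remoting is enabled between the servers; it reads the startup type, state and logon account of the AppFabric Caching Service from the standard Win32_Service class:

```powershell
# Placeholder host names taken from this post; substitute the real server names.
$servers = 'APP', 'WFE'

# Win32_Service exposes the startup type (StartMode), the current state (State)
# and the logon account (StartName) of each Windows service.
Get-CimInstance -ClassName Win32_Service -ComputerName $servers `
        -Filter "Name='AppFabricCachingService'" |
    Select-Object PSComputerName, Name, StartMode, State, StartName |
    Format-Table -AutoSize
```

For the farm described here, both rows would be expected to show StartMode Disabled and the spSvc account under StartName.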

04) Observation: Distributed Cache should only be running on the farm's single WFE. 

05) Checked status of service on each server by executing the following in an elevated SharePoint Management Shell on the APP server:

$instanceName = "SPDistributedCacheService Name=AppFabricCachingService"
Get-SPServiceInstance | ? {($_.Service.ToString()) -eq $instanceName} | ft Server,Id,Status -auto

the outcome of which was:

Server              ID      Status
SPServer Name=APP   [ID1]   Disabled
SPServer Name=WFE   [ID2]   Disabled

06) Checked cache cluster status by executing the following in an elevated SharePoint Management Shell (SMS) on the WFE server:

Use-CacheCluster
Get-CacheHost

The outcome of which was:

HostName : CachePort   Service Name              Service Status   Version Info
[WFE]:22233            AppFabricCachingService   DOWN             3 [3,3][1,3]

07) Checked for timer jobs involving the Distributed Cache service instances by executing the following in an elevated SMS on the WFE server:

Get-SPTimerJob Job-Service-Instance-[ID1]
Get-SPTimerJob Job-Service-Instance-[ID2]

Both returned [blank]. This verified that there were no jobs scheduled to remove these services.
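The timer job check can also be run in one pass, without typing each ID by hand. This is a rough sketch, assuming the job naming pattern Job-Service-Instance-[ID] shown above and an elevated SharePoint Management Shell; the variable names and the report columns are illustrative only:

```powershell
$instanceName = "SPDistributedCacheService Name=AppFabricCachingService"

# Collect every Distributed Cache service instance registered in the farm.
$instances = Get-SPServiceInstance |
    Where-Object { $_.Service.ToString() -eq $instanceName }

# Read the timer job list once, then look for a job named after each instance ID.
# The name pattern is an assumption based on the job names used in this post.
$jobs = Get-SPTimerJob

foreach ($instance in $instances) {
    $jobName = "job-service-instance-$($instance.Id)"
    $match   = $jobs | Where-Object { $_.Name -eq $jobName }

    [PSCustomObject]@{
        Server   = $instance.Server
        Id       = $instance.Id
        JobFound = [bool]$match
    }
}
```

A JobFound value of False for every row corresponds to the [blank] results described above.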

08) Removed the AppFabric service instance from the APP server by executing the following in an elevated SMS on the WFE server (it doesn't matter which server this is run from), where ID1 is the ID returned for APP in step 05:

(Get-SPServiceInstance -Identity "ID1").Delete()
$instanceName = "SPDistributedCacheService Name=AppFabricCachingService"
Get-SPServiceInstance | ? {($_.Service.ToString()) -eq $instanceName} | ft Server,Id,Status -auto

This returned a single instance of the AppFabric service, the one on the WFE server.

09) Removed the AppFabric service instance from the WFE server by executing the following in an elevated SMS on the WFE server (it doesn't matter which server this is run from), where ID2 is the ID returned for WFE in step 05:

(Get-SPServiceInstance -Identity "ID2").Delete()
$instanceName = "SPDistributedCacheService Name=AppFabricCachingService"
Get-SPServiceInstance | ? {($_.Service.ToString()) -eq $instanceName} | ft Server,Id,Status -auto

tbd

I hope it helps.