SharePoint 2016 Troubleshooting: The Execute method of job definition Microsoft.Office.Server.UserProfiles.FeedCacheRepopulationJob threw an exception

Problem

The following server error event occurs regularly on the SharePoint 2016 development farm server that hosts Central Administration:

Log Name: Application

Source: Microsoft-SharePoint Products-SharePoint Foundation

Date: [date/time]

Event ID: 6398

Task Category: Timer

Level: Critical

Keywords:

User: DOMAIN\farmserviceacct

Computer: hostA.subdomain.domain.domainextension

Description:

The Execute method of job definition Microsoft.Office.Server.UserProfiles.FeedCacheRepopulationJob

(ID d650b0df-d9be-4c17-8eda-d4b92d25a7ed) threw an exception. More information is included below.

Unexpected exception in FeedCacheService.IsRepopulationNeeded: Connection to the server terminated,

check if the cache host(s) is running .. (Correlation=4be6ba9e-cd51-f042-18fd-f855b63b06cb)

This event had been occurring regularly and frequently on the 2016 development farm.
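To gauge how often the event occurs, the Application log can be queried from PowerShell. The following is a minimal sketch, assuming it is run in an elevated Windows PowerShell session on the affected server. The 7-day window and the variable names are illustrative choices, not values taken from this investigation.

```powershell
# Sketch: list recent Event ID 6398 entries from the Application log.
# The 7-day look-back window is an arbitrary assumption; adjust as needed.
$since = (Get-Date).AddDays(-7)

$events = Get-WinEvent -FilterHashtable @{
    LogName   = 'Application'
    Id        = 6398
    StartTime = $since
} -ErrorAction SilentlyContinue   # yields nothing (instead of an error) when no events match

# Keep only the FeedCacheRepopulationJob failures and show when they occurred.
$events |
    Where-Object { $_.Message -like '*FeedCacheRepopulationJob*' } |
    Select-Object TimeCreated, LevelDisplayName, MachineName |
    Sort-Object TimeCreated
```

Event ID 6398 is logged for any timer job whose Execute method throws, so the filter on the message text narrows the result to the job discussed here.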

Troubleshooting

ULS logs

The ULS logs showed a set of entries that repeated regularly. Most of these entries indicated that the cache cluster was down or offline and needed to be restarted:

Unexpected error occurred in method 'GetObject' , usage 'FeedCache' - Exception 'Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus: There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.) ---> System.ServiceModel.EndpointNotFoundException: No DNS entries exist for host hostC.subdomain.domain.domainextension. ---> System.Net.Sockets.SocketException: No such host is known at System.Net.Dns.GetAddrInfo(String name)...

and

ChannelInvoke::IsRepopulationNeeded::1 -- FaultException occurred: [FaultCode=Sender][FaultSubCode= UnspecifiedSubCode] System.ServiceModel.FaultException`1[Microsoft.Office.Server.UserProfiles.FeedCacheFault]: Unexpected exception in FeedCacheService.IsRepopulationNeeded: Cache cluster is down, restart the cache cluster and Retry. (Fault Detail is equal to Microsoft.Office.Server.UserProfiles.FeedCacheFault).

and

Unexpected exception in FeedCacheService.IsRepopulationNeeded: Cache cluster is down, restart the cache cluster and Retry.: at Microsoft.Office.Server.DistributedCaching.SPDistributedCache.GetObject(String key, String regionName) at Microsoft.Office.Server.FeedCache.FeedCacheDataCacheWrapper.GetObject(String key, String regionName) at Microsoft.Office.Server.FeedCache.FeedCacheImplementation.IsRepopulationNeeded(Guid callerID) ...

and

Exception occured in ExecuteOnChannel: System.ServiceModel.FaultException`1[Microsoft.Office.Server.UserProfiles.FeedCacheFault]: Unexpected exception in FeedCacheService.IsRepopulationNeeded: Cache cluster is down, restart the cache cluster and Retry. (Fault Detail is equal to Microsoft.Office.Server.UserProfiles.FeedCacheFault).

The initial diagnosis was that these messages did in fact stem from a cache cluster problem, since we had seen event 6398 caused by cache cluster issues before. To check whether this was the case again, we remoted into the server hosting the distributed cache (a WFE) and executed these two commands:

Use-CacheCluster

Get-CacheHost

The status returned was that the cache cluster was online and healthy. This indicated that the initial diagnosis was likely wrong and we needed to troubleshoot further. Since these ULS messages were occurring on the Central Administration server, and the messages themselves referred to the distributed cache running on the WFE server, it seemed that perhaps the feed cache service running on the CA server was unable to communicate with the distributed cache service running on the WFE server, and this problem was then surfaced from the CA server's perspective as the distributed cache being down. Going back to the ULS logs and reviewing them again, we noted some additional entries that we had not originally thought significant:

Unexpected error occurred in method 'Put' , usage 'FeedCache' - Exception 'Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.) ---> System.ServiceModel.EndpointNotFoundException: No DNS entries exist for host hostC.subdomain.domain.domainextension. ---> System.Net.Sockets.SocketException: No such host is known ...

hostC was a farm server functioning in the WFE role and also running the distributed cache service. The statements "No DNS entries exist for host..." and "No such host is known" seemed to indicate a communication problem. Opening a command shell on the hostA server, we tried pinging the host name stated in the message:

hostC.subdomain.domain.domainextension

The ping failed. We then tried pinging just hostC, and this succeeded. Pinging "hostC.domain.domainextension" also succeeded. We then remoted into another farm server, hostB, and again tried pinging hostC by its FQDN. It failed again. After several such experiments from and to different farm servers, we determined that two farm servers, hostB and hostC, could not be pinged from hostA by their FQDNs, but could be pinged by hostname alone or by a name that left out the subdomain.
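The pinging experiments above can be condensed into a small name-resolution check. This is a sketch only: it uses the .NET System.Net.Dns class (the class that appears in the ULS stack trace) rather than ping, so it tests resolution and not ICMP reachability. The host names are this article's placeholders, and the variable and property names are illustrative.

```powershell
# Sketch: test whether each form of a server name resolves from the local machine.
# The names below are the article's placeholders; substitute real host names.
$names = @(
    'hostC',
    'hostC.domain.domainextension',
    'hostC.subdomain.domain.domainextension'
)

foreach ($name in $names) {
    try {
        # GetHostAddresses throws if the name cannot be resolved.
        $addresses = [System.Net.Dns]::GetHostAddresses($name)
        [pscustomobject]@{
            Name     = $name
            Resolves = $true
            Detail   = ($addresses | ForEach-Object { $_.IPAddressToString }) -join ', '
        }
    }
    catch {
        # Record the failure text instead of stopping the loop.
        [pscustomobject]@{
            Name     = $name
            Resolves = $false
            Detail   = $_.Exception.Message
        }
    }
}
```

Running the same loop on each farm server gives a quick from-and-to picture of which name forms resolve. Note that a short hostname may resolve through DNS suffix search or other local mechanisms even when the full FQDN has no DNS record, which is consistent with the results described above.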

We then checked Central Administration's Servers in Farm page for these servers and found them all listed with status "No action required," indicating that the farm recognized them and was able to communicate with them successfully - or at least the timer service was. As a quick check to see what each farm server understood its FQDN to be, we remoted into each server and executed the following PowerShell command:

(Get-WmiObject win32_computersystem).DNSHostName+"."+(Get-WmiObject win32_computersystem).Domain

In each case, the value returned was the FQDN that included the subdomain. At this point, it seemed that the problem lay not with the distributed cache, SharePoint, or Windows Server but with the network, specifically DNS. We discussed this with a systems administrator, who verified that these hosts had DNS entries lacking the subdomain but none that included it. We then submitted a ticket to have two new DNS entries added:

  • hostB.subdomain.domain.domainextension
  • hostC.subdomain.domain.domainextension

After a few hours, we remoted into the CA server and tried pinging these hosts by their FQDNs, and this time the pings succeeded. We checked the server event logs the next day and found no more 6398 error events.

Solution

  • When error event 6398 appears in the server event log, check the ULS log for the corresponding entries. If the associated ULS message states "No DNS entries exist...", verify that the DNS entry for the named host is in fact missing, and then add the appropriate DNS entry for the server.
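The ULS check in the step above can be run from the SharePoint Management Shell. This is a minimal sketch, assuming it runs on the farm server that logged the event, under an account with rights to read the ULS logs. The 2-hour window is an arbitrary assumption.

```powershell
# Sketch: search recent ULS entries on this server for the DNS failure text.
# Assumes the SharePoint snap-in is available (that is, run on a farm server).
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# The 2-hour look-back window is an arbitrary assumption; widen it if needed.
$start = (Get-Date).AddHours(-2)

Get-SPLogEvent -StartTime $start |
    Where-Object { $_.Message -like '*No DNS entries exist for host*' } |
    Select-Object Timestamp, Area, Category, Level, Correlation, Message |
    Format-List
```

Any host name that appears in the matching messages is the name to verify in DNS.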
