Information: Skype for Business 2015 Persistent Chat Service Auto Restart Logic.

We are writing this article to let you know the logic behind Persistent chat auto restart in the event of connectivity failure between the PCHAT peers.

Typically, the Skype for Business 2015 Persistent Chat server must interconnect with three entities in order to provide persistent chat service to the Persistent chat client (Skype for Business clients).

  1. Skype for Business 2015 Front End Servers
  2. Backend SQL Server
  3. Other Persistent Chat servers within the pool

There will be a situation we have observed when communication failures occur between these entities. However, only when communication fails between two Persistent Chat servers (otherwise called as 'Peers') within a pool - it will start a service recycle. Service will be stopped and restarted solitary.

 Let’s take a scenario where In a four server Persistent Chat pool; example this is how the servers are defined:

 

  • PCHAT01.domain.com = Peer 1
  • PCHAT02.domain.com = Peer 2
  • PCHAT03.domain.com = Peer 3
  • PCHAT04.domain.com = Peer 4

 In a Group Chat pool, there would be two Peers defined per server as we have Channel Service and Lookup Service separately. So there would be 8 Peers in the above example if it’s a Group Chat pool.

 Now if there is some communication failure that’s when the ActiveClusterCheck comes into picture.

Following is one such scenario with the traces we have captured when Peer communication fails:

 

  • PCHAT01 Server is trying to establish the communication with PCHAT02 via the port 8011, unfortunately failed to communicate due to “N” number of reasons. The method used to establish the communication with peer Persisting chat server will be TCP URI scheme.

               

TL_ERROR(TF_COMPONENT) [4]10C4.0F74::03/31/2016-12:58:04.429.004c9911 (ChatCommon,PeerTransport.OnSendMessageFailed:peertransport.cs(25))

(0000000002EBB8C2)Couldn't send message of type to peer <<2> <net.tcp://PCHAT02.domain.com:8011/MGC/PeerService>>. Exception: Microsoft.Rtc.Internal.Chat.Server.ServerCommon.PeerTransport.PeerCommunicationException: Could not communicate with peer. Unable to send message type Microsoft.Rtc.Internal.Chat.Server.ServerCommon.Messaging.GroupChatServerNotify to ID: 2 ---> System.ServiceModel.FaultException`1[Microsoft.Rtc.Internal.Chat.Server.ServerCommon.PeerTransport.InvalidSessionFault]: The creator of this fault did not specify a Reason.

Server stack trace: at System.ServiceModel.Channels.ServiceChannel.HandleReply(ProxyOperationRuntime operation, ProxyRpc& rpc) at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, Object[] outs, TimeSpan timeout) at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation) at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)Exception rethrown at [0]: at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg) at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type) at Microsoft.Rtc.Internal.Chat.Server.ServerCommon.PeerTransport.IPublisher.Send(Guid subscriptionId, IServerTransportMessage message) at Microsoft.Rtc.Internal.Chat.Server.ServerCommon.PeerTransport.PeerWrapper.WrappedSend(IServerTransportMessage message) at Microsoft.Rtc.Internal.Chat.Server.ServerCommon.PeerTransport.PeerWrapper.ExecuteWithRetry(Action action) at Microsoft.Rtc.Internal.Chat.Server.ServerCommon.PeerTransport.WCFService.SendToPeer(IServerTransportMessage message, Int32 peerId) --- End of inner exception stack trace --- at Microsoft.Rtc.Internal.Chat.Server.ServerCommon.PeerTransport.WCFService.SendToPeer(IServerTransportMessage message, Int32 peerId) at Microsoft.Rtc.Internal.Chat.Server.ServerCommon.PeerTransport.PeerTransport.Send(IServerTransportMessage message)

TL_INFO(TF_COMPONENT) [4]10C4.0F74::03/31/2016-12:58:04.429.004c9912 (ChatCommon,PeerServerManager.HandleSendMessageToPeerFailed:peerservermanager.cs(26))

(0000000000684B2B) Send message to peer failed.

 

  • Now let’s talk about the exception we have observed in the above traces:

         Exception message:

Exception: Microsoft.Rtc.Internal.Chat.Server.ServerCommon.PeerTransport.PeerCommunicationException: Could not communicate with peer. Unable to send message type Microsoft.Rtc.Internal.Chat.Server.ServerCommon.Messaging.GroupChatServerNotify to ID: 2 

  • This Exception error occurs when the peer is unreachable, or the peer have been stopped or may be experiencing network problems. Which mean most probably, we have a underlying network problem on this server that is causing the issue.
  • The ActiveClusterCheck function that is being performed every 2 minutes in all Persistent Chat servers will figures out that one of the peers is not responding and consider it is not active and initiate an Auto Service Restart:
  • Verifying the Active cluster in backend database.

TL_VERBOSE(TF_COMPONENT) [8]10C4.1E98::03/31/2016-12:58:31.763.004cf0c6 (ChatCommon,PeerServerManager.OnActiveClusterCheckTimerElapsed:peerservermanager.cs(29))

(0000000000684B2B)Checking the active cluster in the database

  • Triggering the Stored Procedure to identify the active peer in the cluster.

TL_NOISE(TF_COMPONENT) [8]10C4.1E98::03/31/2016-12:58:31.763.004cf0c7 (ChatCommon,DbCommand.executeUntilSuccessOrTimeoutT:dbcommand.cs(43))

(0000000001A1E2AB)Beginning execution of CmdID: 0ddf5da9-21b9-4ca9-be53-dea4c0fcebc1 with text: procGetActivePeerCluster #ARGS NOT PREPARED

TL_NOISE(TF_COMPONENT) [8]10C4.1E98::03/31/2016-12:58:31.763.004cf0c8 (ChatCommon,DbCommand.executeUntilSuccessOrTimeoutT:dbcommand.cs(45))

(0000000001A1E2AB)CmdID: 0ddf5da9-21b9-4ca9-be53-dea4c0fcebc1 is command: procGetActivePeerCluster @serverPingInterval=90000

TL_NOISE(TF_COMPONENT) [0]10C4.1E98::03/31/2016-12:58:31.763.004cf0c9 (ChatCommon,DbCommand.executeUntilSuccessOrTimeoutT:dbcommand.cs(46))

(0000000001A1E2AB)CmdID: 0ddf5da9-21b9-4ca9-be53-dea4c0fcebc1 Took <0> ms and <1> attempts to process sql: procGetActivePeerCluster @serverPingInterval=90000

TL_NOISE(TF_COMPONENT) [0]10C4.1E98::03/31/2016-12:58:31.763.004cf0ca (ChatCommon,ServerCluster.constructor:servercluster.cs(26))

Constructing ServerCluster with the following data:

TL_NOISE(TF_COMPONENT) [0]10C4.1E98::03/31/2016-12:58:31.763.004cf0cb (ChatCommon,ServerCluster.constructor:servercluster.cs(28))

Server List: 1, 2, 3, 4

TL_NOISE(TF_COMPONENT) [0]10C4.1E98::03/31/2016-12:58:31.763.004cf0cc (ChatCommon,ServerCluster.constructor:servercluster.cs(30))

Raw cluster data: (1, 2), (1, 3), (1, 4), (2, 1), (2, 3), (3, 1), (3, 2), (3, 4), (4, 1), (4, 3)

TL_INFO(TF_COMPONENT) [0]10C4.1E98::03/31/2016-12:58:31.763.004cf0cd (ChatCommon,ServerCluster.BuildGraphData:servercluster.cs(57))

(000000000393D889)The largest active clique of servers contains server IDs: 1, 2, 3

TL_INFO(TF_COMPONENT) [0]10C4.1E98::03/31/2016-12:58:31.763.004cf0ce (ChatCommon,PeerServerManager.OnActiveClusterUpdated:peerservermanager.cs(34))

(0000000000684B2B)List of active servers in the pool are: ServerIdentity: 4, ActiveServersList: { 1, 2, 3 }

TL_INFO(TF_COMPONENT) [0]10C4.1E98::03/31/2016-12:58:31.763.004cf0cf (ChatServer,ChannelServer.TakeActionOnMissingFromMaxClique:channelserver.cs(17))

(0000000002138C21)Restarting the server with a 0 millesecond delay.

TL_INFO(TF_COMPONENT) [0]10C4.1E98::03/31/2016-12:58:31.763.004cf0d0 (ChatCommon,ServerBase.Restart:serverbase.cs(22))

(0000000002138C21)Restarting the server in an attempt to bring it back online.

TL_INFO(TF_COMPONENT) [0]10C4.1E98::03/31/2016-12:58:31.763.004cf0d1 (ChatCommon,MgcServiceBase.RestartServer:mgcservicebase.cs(26))

(000000000350AA3F)Restarting the server due to configuration change.

TL_WARN(TF_COMPONENT) [8]10C4.22E4::03/31/2016-12:58:31.778.004cf0f1 (ChatCommon,ServerBase.Stop:serverbase.cs(21))

(0000000002138C21) server is Stopping

TL_INFO(TF_COMPONENT) [8]10C4.22E4::03/31/2016-12:58:31.778.004cf109 (ChatServer,ChannelServer.OnStop:channelserver.cs(57))

(0000000002138C21)ChannelServer disconnecting ... Transport

Would be appreciated, if someone have more in-depth to add into this data.