Managed Availability RCA
Exchange 2013 introduces a new concept, “Managed Availability”, which runs on all Exchange servers to monitor server health. This process analyzes hundreds of health metrics, and when something goes wrong it invokes an action to correct the problem.
In some cases, when Managed Availability performs an action to recover from an error, you may need to know which health metric (“Probe”) Managed Availability used to decide that an Exchange component or server needed a fix.
Why do you need to know that? Because you may want to stop a behavior such as a system reboot when an error occurs. In some situations it is better to find the root cause and fix the problem yourself instead of letting Managed Availability fix it automatically.
Managed Availability
First, let’s review the main components of Managed Availability:
- Probe engine: The Probe Engine takes measurements on the server.
- Monitoring probe engine: The Monitoring Probe Engine stores the business logic about what constitutes a healthy state. It functions like a pattern recognition engine, looking for patterns and measurements that differ from a healthy state, and then evaluating whether a component or feature is unhealthy.
- Responder engine: When the Responder Engine is alerted about an unhealthy component, its first action is to try to recover that component. Managed Availability enables multi-stage recovery actions. The first attempt may be to restart the application pool, the second attempt may be to restart the corresponding service, and the third attempt may be to restart the server. The final attempt may be to put the server offline, so that it no longer accepts traffic. If all of these actions fail, an alert is sent to the help desk.
All of the above are controlled by the Exchange Health Manager Service (MSExchangeHMHost.exe) and the Exchange Health Manager Worker process (MSExchangeHMWorker.exe).
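As a quick sanity check before troubleshooting, you may want to confirm that both Health Manager processes are actually running on the server. The following is a minimal sketch that uses only the process names mentioned above; the choice of columns is just for illustration.

```powershell
# List the Health Manager host and worker processes, if they are running.
# -ErrorAction SilentlyContinue suppresses the error when a process is not found.
Get-Process -Name MSExchangeHMHost, MSExchangeHMWorker -ErrorAction SilentlyContinue |
    Format-Table Name, Id, StartTime -AutoSize
```

If one of the two processes is missing from the output, Managed Availability is probably not monitoring the server at that moment.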
The relationship between these components is as follows:
Probe fails repeatedly --> Monitor status changes --> Responder takes action
So, to find the root cause and learn why a Responder invoked a specific action, we work in the reverse direction:
Responder takes action --> Which Monitor? --> Find the failing Probe
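Before walking back through the event logs, it can help to see which Monitors are currently not healthy on the server. The sketch below assumes it is run in the Exchange Management Shell; EX01 is a placeholder server name, not something from this environment.

```powershell
# EX01 is a hypothetical server name; replace it with your own server.
# Show only the monitors whose current alert value is not "Healthy".
Get-ServerHealth -Identity EX01 |
    Where-Object { $_.AlertValue -ne "Healthy" } |
    Format-Table Name, HealthSetName, TargetResource, AlertValue -AutoSize
```

This only shows the current state. For an action that has already happened, such as a reboot, the event logs remain the place to look.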
For example, use the following command to get all Windows events for the Responder that forced your server to reboot:
(Get-WinEvent -LogName Microsoft-Exchange-ManagedAvailability/* | % {[XML]$_.toXml()}).event.userData.eventXml | ?{$_.ActionID -like "*ForceReboot*"} | ft id,RequestorName,Endtime,result -AutoSize
The RequestorName column in the output names the Responder that requested the action. In this case, the bug check was initiated by the ActiveDirectoryConnectivityConfigDCServerReboot Responder.
Now let’s get more details about this Responder:
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ?{$_.Name -like "ActiveDirectoryConnectivityConfigDCServerReboot"} | ft ServiceName,Name,AlertMask
The AlertMask shows which Monitor triggers the ActiveDirectoryConnectivityConfigDCServerReboot Responder, and the Monitor's name in turn leads to the Probe it evaluates.
A repeatedly failing Probe causes a Monitor state change, and a recovery action is invoked. The details of the failing Probe provide information about the exact failure.
Now we need to dig into the Windows events to find a failed Probe result and check the error message associated with it:
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult | % {[XML]$_.toXml()}).event.userData.eventXml | ?{($_.ResultType -eq 4) -and ($_.ResultName -like "*ActiveDirectoryConnectivityConfigDCProbe*")}
The result may show an error like this:
<Error>Received a referral to contoso.com when requesting DC=contoso,DC=com from dc1.child.contoso.com. You have specified the wrong server for this operation. Filter = (&(objectClass=\2a)(!(msExchCU=*))).</Error>
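The ProbeResult log can be large, so it may be easier to keep the failed results in a variable and look only at the most recent ones. The following is a sketch of the same query with a narrower output. ResultType, ResultName and Error come from the query and output above; ExecutionStartTime is assumed to be present in the event data, and the variable name and the limit of five results are arbitrary.

```powershell
# Collect failed results (ResultType 4) for the probe identified earlier.
$failures = (Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult |
        ForEach-Object { [xml]$_.ToXml() }).Event.UserData.EventXml |
    Where-Object {
        ($_.ResultType -eq 4) -and
        ($_.ResultName -like "*ActiveDirectoryConnectivityConfigDCProbe*")
    }

# Get-WinEvent returns the newest events first, so these are the latest failures.
# ExecutionStartTime is an assumed field name; drop it if it is empty in your output.
$failures | Select-Object -First 5 |
    Format-List ResultName, ExecutionStartTime, Error
```

Keeping the results in a variable also avoids re-reading the whole log each time you want to inspect another field.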
Now what? Now you know why this Probe failed and caused the corresponding Responder to restart your server. If fixing the issue is going to take a long time, you can disable this Responder until the issue is fixed.
The Responder is temporarily disabled by adding a GlobalMonitoringOverride:
Add-GlobalMonitoringOverride -Identity Exchange\ActiveDirectoryConnectivityConfigDCServerReboot -ItemType Responder -PropertyName Enabled -PropertyValue 0 -Duration 10.00:00:00
Important: The main focus should be to analyze and resolve the main issue that is causing the Probe failure. If you decide to disable the responder, be aware that you are preventing Exchange from taking automated recovery actions for any monitors that call this responder. The Responder should be disabled only if the Responder’s actions are causing serious outages and fixing the main issue is going to take a significant amount of time.
Later, you can re-enable this Responder by removing the GlobalMonitoringOverride:
Remove-GlobalMonitoringOverride -Identity Exchange\ActiveDirectoryConnectivityConfigDCServerReboot -ItemType Responder -PropertyName Enabled
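To confirm which overrides are in effect, either after adding one or after removing it, you can list the global overrides. This is a minimal sketch, assuming it is run in the Exchange Management Shell; the exact properties shown depend on your Exchange version.

```powershell
# List all global monitoring overrides currently defined in the organization.
# After the removal above, the Enabled override for this responder should no longer appear.
Get-GlobalMonitoringOverride | Format-List
```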
Mohamed Dawy from PFE Egypt team.