Exchange 2016: Monitoring

Configure health probes on Load Balancers

Up to Exchange 2010, monitoring Exchange depended on SCOM. The SCOM management pack contained the health manifests and correlation engines, which collected, analyzed, and reported data through SCOM.

The Exchange CAS servers were load balanced behind a VIP, and the load balancers checked the CAS nodes only by pinging them or by opening a telnet-style connection to ports 443 and 80 at regular intervals.

Behind the scenes, the application itself may be unavailable, for example because the Exchange services are not running, yet the load balancers can still reach the server on the required port.

In that case connections keep going to a CAS server whose Exchange services are stopped and unavailable. This approach provides neither true high availability nor reliable monitoring.

To address this, Microsoft introduced a new component in Exchange 2013 called Managed Availability. It is a built-in, self-healing component that runs on every Exchange server, monitoring the services and fixing issues with them automatically. It polls and analyzes hundreds of health metrics every second.

Managed Availability exposes health probes, and the load balancers that publish the Exchange services should be configured to monitor them.

From the load balancer, we monitor one health probe for each published protocol, so the servers are checked at the protocol level rather than only at the port level.

As in the example below, if MBX1 has a problem with the OWA service and Managed Availability marks that service down through an offline responder, a load balancer configured with per-protocol probes can tell that only OWA on MBX1 is affected. It takes only OWA on MBX1 out of rotation and keeps the remaining services available and functional.

https://exchangequery.files.wordpress.com/2016/05/probe1.png
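The per-protocol behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular load balancer is implemented: the `build_pools` function, the server names, and the probe results are all hypothetical, and a `False` value simply stands for "the health probe for that protocol did not return 200 OK".

```python
# Hypothetical probe results: True means the server answered the
# protocol's health probe with 200 OK, False means it did not.
def build_pools(probe_results):
    """Return, per protocol, the sorted list of servers that stay in rotation."""
    pools = {}
    for server, protocols in probe_results.items():
        for protocol, healthy in protocols.items():
            # Make sure the protocol has a pool even if no server is healthy.
            pools.setdefault(protocol, [])
            if healthy:
                pools[protocol].append(server)
    return {protocol: sorted(servers) for protocol, servers in pools.items()}


results = {
    "MBX1": {"owa": False, "ews": True, "mapi": True},
    "MBX2": {"owa": True, "ews": True, "mapi": True},
}
pools = build_pools(results)
print(pools["owa"])  # ['MBX2']
print(pools["ews"])  # ['MBX1', 'MBX2']
```

MBX1 drops out of the OWA pool only, while it keeps serving EWS and MAPI traffic. With a simple port check, MBX1 would have stayed in every pool.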

We can run the command below to check the component state:

Get-ServerComponentState -Identity servername

https://exchangequery.files.wordpress.com/2016/05/proxy.png

We can also set the required components to inactive during a maintenance window.

We will only touch briefly on the components involved in Managed Availability, since other experts and MVPs have already written very good blogs on the subject and we do not want to repeat them here.

Managed Availability has two groups:

Health Sets

This is an internal view maintained by Managed Availability using probes, monitors, and responders. It has the built-in capability to recover services on its own when an issue occurs.

Below are the main components involved in Managed Availability:

  • Probe - Checks the services and their status very frequently.
  • Monitor - Evaluates the probe results.
  • Responder - Takes the necessary action.

Responder

The following responder types are available:

  • Restart Responder - Terminates and restarts a service
  • Reset AppPool Responder - Stops and restarts an application pool in Internet Information Services (IIS)
  • Failover Responder - Initiates a database or server failover
  • Bugcheck Responder - Initiates a bug check of the server, thereby causing a server reboot
  • Offline Responder - Takes a protocol on a server out of service (rejects client requests)
  • Online Responder - Places a protocol on a server back into production (accepts client requests)
  • Escalate Responder - Escalates the issue to an administrator via event logging

All of the above actions for health sets are automated, and we do not need to perform any steps ourselves.
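To make the probe, monitor, and responder flow more concrete, here is a toy model in Python. It is a deliberate simplification and not Exchange's actual implementation: the `Monitor` class, the failure threshold, and the order of the responder chain are all assumptions made for illustration. Only the responder names come from the list above.

```python
from dataclasses import dataclass, field

# Recovery actions in the order this toy monitor tries them. The names
# mirror responder types listed above; the ordering is an assumption.
RESPONDER_CHAIN = ["Reset AppPool", "Restart", "Offline", "Escalate"]


@dataclass
class Monitor:
    """Toy monitor: fires a responder after `threshold` consecutive probe failures."""
    threshold: int = 3
    failures: int = 0
    next_responder: int = 0
    actions: list = field(default_factory=list)

    def record(self, probe_ok: bool) -> None:
        if probe_ok:
            # A passing probe resets the monitor to healthy.
            self.failures = 0
            self.next_responder = 0
            return
        self.failures += 1
        if self.failures >= self.threshold and self.next_responder < len(RESPONDER_CHAIN):
            # Fire the next responder in the chain, then wait for more probe results.
            self.actions.append(RESPONDER_CHAIN[self.next_responder])
            self.next_responder += 1
            self.failures = 0


monitor = Monitor(threshold=2)
for probe_ok in [True, False, False, False, False, True]:
    monitor.record(probe_ok)
print(monitor.actions)  # ['Reset AppPool', 'Restart']
```

In this run the first two failures trigger an application pool reset, the next two trigger a service restart, and the final passing probe returns the monitor to healthy before anything is taken offline or escalated.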

Health Groups

Health groups are exposed to System Center Operations Manager 2007 R2 and System Center Operations Manager 2012 and reported on a dashboard. SCOM needs these health groups to give a detailed dashboard report of the Exchange status.

Any issues that can't be recovered automatically are escalated to the Exchange 2016 Management Pack as an alert.

The responder relevant to the Exchange 2016 Management Pack is the Escalate Responder.

When the Escalate Responder is triggered, it generates an event that the Exchange 2016 Management Pack recognizes. The management pack feeds the appropriate information into an alert, giving administrators what they need to address the problem.

Below are the health indicators added in the Exchange 2013 management pack:

https://exchangequery.files.wordpress.com/2016/05/21.png

  • Customer Touch Points: This shows the end-user experience status. If this indicator is healthy, end users have no issues connecting to Exchange and using its components.
  • Service Components: This shows the state of the particular service associated with the component. For example, the service component indicator for MAPI shows whether the overall MAPI service is healthy.
  • Server Resources: This shows the state of the physical resources that affect the functionality of a server.
  • Key Dependencies: This shows the state of the external resources that Exchange requires to function, such as network connectivity, DNS, Active Directory, and storage.

**Very Important Note:** There is no separate management pack for Exchange 2016. Exchange 2013 and 2016 use the same management pack as of now, and Microsoft recommends using the Exchange 2013 management pack for Exchange 2016.

How to respond when Managed Availability cannot resolve a problem on its own

The Exchange team has centralized Exchange monitoring inside Exchange itself.

We can no longer configure monitoring thresholds in SCOM (other than turning a SCOM monitor on or off).

How admins can troubleshoot when an issue occurs

For example, if OWA is unhealthy, this is reported to SCOM through an event logged on the Mailbox server.

Check the OWA health state by running the command below on the affected Mailbox server:

Get-ServerHealth Server1.contoso.com | ?{$_.HealthSetName -eq "OWA.Proxy"}

Also, check that the OWA healthcheck.htm page is available and returns a 200 OK response by accessing the URL below: https://server/owa/healthcheck.htm
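The same check can be scripted. The sketch below is a minimal example using only the Python standard library. The function names are hypothetical, `Server1.contoso.com` is the sample server name used earlier, and `owa` is assumed to be the default OWA virtual directory name. A server presenting a certificate that the client does not trust is reported as down here, which may or may not be what you want.

```python
import urllib.error
import urllib.request


def healthcheck_url(server: str, vdir: str) -> str:
    """Build the healthcheck.htm URL for a virtual directory such as 'owa'."""
    return f"https://{server}/{vdir.strip('/')}/healthcheck.htm"


def is_healthy(status_code: int) -> bool:
    """Only 200 OK counts as healthy."""
    return status_code == 200


def probe(server: str, vdir: str, timeout: float = 5.0) -> bool:
    """Request the health page and report whether it answered 200 OK."""
    try:
        with urllib.request.urlopen(healthcheck_url(server, vdir), timeout=timeout) as resp:
            return is_healthy(resp.status)
    except (urllib.error.URLError, OSError):
        # Connection failures, TLS errors, timeouts, and HTTP error codes all count as down.
        return False


print(healthcheck_url("Server1.contoso.com", "owa"))
```

Calling `probe("Server1.contoso.com", "owa")` from a machine that can reach the server returns `True` only when the page answers 200 OK.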

Then we can start troubleshooting the affected component and try to bring it back up.

Managed Availability also generates some logs in the location below:

https://exchangequery.files.wordpress.com/2016/05/am.png

These logs are not required, so we can disable them by performing the steps below:

Go to your Exchange servers.

Open <ExchangeInstallPath>\bin\MSExchangeHMWorker.exe.config in Notepad running as administrator.

Find the line <add key="IsTraceLoggingEnabled" value="true" />, change the value to false, and save the file. Reboot the server. You can then clear the logs in the monitoring path and they will not be regenerated.

**Reason not required:** If you take your time to look at the bottom of this config file, it says "Used for Exchange Online only". Microsoft has confirmed that this value was set to true in error.
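If the same change has to be made on several servers, the edit itself can be scripted. The following Python sketch only shows the text substitution on a sample string. The `disable_trace_logging` function is hypothetical, and the `<appSettings>` wrapper in the sample is an assumption about the surrounding file, not a copy of the real config. Back up the config file before changing it.

```python
import re


def disable_trace_logging(config_text: str) -> str:
    """Set the IsTraceLoggingEnabled key to "false", leaving everything else untouched."""
    pattern = re.compile(
        r'(<add\s+key="IsTraceLoggingEnabled"\s+value=")[^"]*(")',
        re.IGNORECASE,
    )
    return pattern.sub(r"\g<1>false\g<2>", config_text)


# Sample input only; the real file contains many more settings.
sample = '<appSettings>\n  <add key="IsTraceLoggingEnabled" value="true" />\n</appSettings>'
print(disable_trace_logging(sample))
```

Running the function on text that already has the value set to false leaves it unchanged, so it is safe to apply more than once.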

Note: Managed Availability never writes any logs for the health probes. Their values are held in temporary memory only, so we don't need to worry about the health probes.