Key Health Indicators in Lync Server 2013

 

Topic Last Modified: 2014-02-10

This article is a companion to the Key Health Indicators: The Foundation for Maintaining Healthy Lync Servers poster, which you can download from the Download Center.

Poster describing troubleshooting using KHI data

You can use this poster to learn about Key Health Indicators (KHIs), performance counters with thresholds aimed at revealing user experience issues. Gathering KHI data is usually the first step to implementing the Call Quality Methodology (CQM), which is focused on ensuring a quality audio experience for Lync users.

If you have questions about how to use CQM, send them to cqmfeedback@microsoft.com.

The poster explains the following areas:

  • What are Key Health Indicators?

  • To Collect KHI Data

  • Remediation Flow for all Server Roles

  • Glossary

  • Front-end Servers

  • Backend SQL Servers

  • Mediation Servers

  • Edge Servers

What are Key Health Indicators?

Key Health Indicators are performance counters with thresholds aimed at revealing user experience issues. Gathering KHI data is usually the first step to implementing the Call Quality Methodology (CQM), which is focused on ensuring a quality audio experience for Lync users.

KHIs complement the standard Lync monitoring solutions (for example, System Center Operations Manager, Synthetic Transactions, and Monitoring Server); they do not replace them.

Collect the KHI performance counters and populate the KHI spreadsheet accompanying the Networking Guide to produce a scorecard that will help you determine the server health of a Lync deployment. Once populated, the spreadsheet guides you in repairing the environment and gives additional insight to other stakeholders. Evaluate KHIs on a monthly basis and incorporate them into the deployment’s ongoing operational processes.

Download the Lync Server Networking Guide to see the full list of KHIs and to get the related spreadsheets.

To Collect KHI Data

  1. Run the KHI script included with the Lync Server Networking Guide on each Lync Server. This will create a Data Collector inside of Performance Monitor and name it KHI. By default, data will be polled every 15 seconds.

  2. Before the start of your company's business day, go to each Lync Server and start the KHI Data Collector.

  3. At the end of that day, stop the KHI Data Collector and copy the data to a central location.

  4. After using Performance Monitor to fill in the KHI spreadsheet included with the Lync Server Networking Guide download, compare the results to the recommended targets.
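As an illustration of the comparison in step 4, the following Python sketch summarizes a day of counter samples before they are checked against the targets. It assumes the collected log has been converted to CSV (for example, with the Windows relog utility), that the first column is the sample time, and that every other column is one counter. The server name FE01, the sample values, and the choice to report both the average and the peak are assumptions for illustration; the KHI spreadsheet remains the reference for how each counter is evaluated.

```python
import csv
import io
from statistics import mean

# Hypothetical three-sample excerpt of a counter log converted to CSV.
# "FE01" is an illustrative server name, not taken from the poster.
SAMPLE = r'''"Time","\\FE01\Processor(_Total)\% Processor Time","\\FE01\Network Interface(*)\Output Queue Length"
"02/10/2014 08:00:00","35.0","0"
"02/10/2014 08:00:15","45.0","1"
"02/10/2014 08:00:30"," ","2"
'''


def summarize_counters(csv_text):
    """Return {counter_name: (average, maximum)} for a counter log in CSV form."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    counters = header[1:]  # column 0 holds the sample time
    samples = {name: [] for name in counters}
    for row in reader:
        for name, cell in zip(counters, row[1:]):
            cell = cell.strip()
            if not cell:
                continue  # blank cell: no sample was recorded for this interval
            try:
                samples[name].append(float(cell))
            except ValueError:
                continue  # ignore cells that are not numeric
    return {name: (mean(values), max(values))
            for name, values in samples.items() if values}


if __name__ == "__main__":
    for name, (average, peak) in summarize_counters(SAMPLE).items():
        print(f"{name}: average={average:.1f} peak={peak:.1f}")
```

With the default 15-second polling interval, an eight-hour business day produces about 1,920 samples per counter, so a summary of this kind keeps the comparison manageable.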

Remediation Flow for all Server Roles

For each server in your Lync implementation, begin by verifying that the server’s component health and system performance are at or above the desired level. Only after that should you look at the indicators relating to the server’s role in the overall Lync implementation.

Begin by collecting KHI performance data for all servers. For each of the server roles (discussed later in this document), determine whether the basic system components meet the recommended targets. If they do not, remediate system performance, then re-collect KHI data and confirm system health before looking at the metrics specific to the server’s role in the Lync implementation. Component health for all roles is defined as:

  • CPU Utilization < 80%

  • Avg. Disk Write < 10 ms

  • Avg. Disk Read < 10 ms

  • Available memory > 20% System Total MB

  • Network Queue Length < 2

  • Discarded Packets (in / out) = 0
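The flow described in this section (component health first, role-specific indicators second) can be sketched in Python as follows. This is a minimal illustration, not part of the Networking Guide: the dictionary keys are invented labels rather than real performance counter paths, the SQL targets are taken from the Backend SQL Servers section, and the measured values are made up. Whether to compare the average or the peak of each counter is left to the KHI spreadsheet.

```python
import operator

# Component-health targets from the list above. The keys are illustrative
# labels chosen for this sketch, not actual performance counter paths.
COMPONENT_TARGETS = {
    "cpu_utilization_pct": (operator.lt, 80),
    "avg_disk_write_ms": (operator.lt, 10),
    "avg_disk_read_ms": (operator.lt, 10),
    "available_memory_pct": (operator.gt, 20),
    "network_queue_length": (operator.lt, 2),
    "discarded_packets_in": (operator.eq, 0),
    "discarded_packets_out": (operator.eq, 0),
}

# Role-specific targets for a back-end SQL server (labels are illustrative).
SQL_TARGETS = {
    "page_life_expectancy_sec": (operator.gt, 300),
    "batch_requests_per_sec": (operator.lt, 2500),
}


def failed_targets(measured, targets):
    """Return the sorted names of targets that are missing or not met."""
    return sorted(
        name for name, (compare, limit) in targets.items()
        if name not in measured or not compare(measured[name], limit)
    )


def next_step(measured, role_targets):
    """Apply the flow: component health first, role-specific KHIs second."""
    component_failures = failed_targets(measured, COMPONENT_TARGETS)
    if component_failures:
        return "remediate component health", component_failures
    role_failures = failed_targets(measured, role_targets)
    if role_failures:
        return "remediate role-specific KHIs", role_failures
    return "healthy", []


if __name__ == "__main__":
    # Made-up measurements: CPU is over target and so is a SQL metric.
    example = {
        "cpu_utilization_pct": 92, "avg_disk_write_ms": 4,
        "avg_disk_read_ms": 3, "available_memory_pct": 45,
        "network_queue_length": 0, "discarded_packets_in": 0,
        "discarded_packets_out": 0, "page_life_expectancy_sec": 120,
        "batch_requests_per_sec": 400,
    }
    print(next_step(example, SQL_TARGETS))
```

In the example, the role-specific failure is not reported until the CPU issue is resolved and the data is collected again, which mirrors the order of the remediation flow.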

Glossary

The following terms and acronyms are used in this poster:

AS MCU = Application Sharing Multi-point Control Unit

AV MCU = Audio/Video MCU

IM MCU = Instant Messaging MCU

UCWA = Unified Communications Web API

AV Edge = Traversal of audio/video via edge

AV Auth = Audio/Video Authentication

SIP Stack = Contains Lync’s core SIP implementation

Data Proxy = Used for edge conferencing

LySS = Lync Storage Service

Front-end Servers

The following recommended KHI targets are specific to front-end servers in addition to basic component health:

Target metrics by functional area:

AS/AV/IM MCU

  • MCU Health State < 2

Web Components

  • Distribution List expansion AD timeouts < 0

  • ABWQ failures = 0

  • LIS failures = 0

  • Authentication Errors < 1/sec

  • ASP.NET v4 Requests Rejected = 0

SIP Stack

  • Avg. Incoming Message Processing < 1 sec

  • Incoming Responses Dropped < 1/sec

  • Incoming Requests Dropped < 1/sec

  • Queue Latency < 100 ms

  • Sproc Latency < 100 ms

  • Throttled Requests = 0

  • Authentication Errors < 1/sec

  • Incoming Messages Timed Out < 2

  • Avg. Incoming Message Hold < 1 sec

  • Flow Controlled Connections < 2

  • Avg. Out Queue Delay < 2 sec

LySS

  • % of space used by Storage Service DB < 80

  • # of replica replication failures = 0

  • # of data loss events = 0

SQL

  • Page life expectancy > 300 sec

  • Batch requests / sec < 2500

Backend SQL Servers

The following recommended KHI targets are specific to SQL servers in addition to basic component health:

Target metrics by functional area:

SQL

  • Page life expectancy > 300 sec

  • Batch requests / sec < 2500

Mediation Servers

The following recommended KHI targets are specific to mediation servers in addition to basic component health:

Target metrics by functional area:

Mediation Server Service

  • Load Call Failure Index = 0

  • Failed Calls due to Proxy < 10

  • Failed Calls due to Gateway < 10

  • Calls (in or out) rejected = 0

  • Media Candidates missing = 0

  • Media Connectivity Check Failures = 0

Edge Servers

The following recommended KHI targets are specific to edge servers in addition to basic component health:

Target metrics by functional area:

AV Auth

  • Bad Requests < 20/sec

AV Edge

  • Auth. Failures < 20/sec

  • Allocation Failures < 20/sec

  • Packets Dropped < 300/sec

Data Proxy

  • Throttled Server connections < 3

  • System is Throttling < 1

SIP Stack

  • Connections over limit dropped < 1

  • Sends timed out < 10

  • Flow Controlled Connections < 100

  • Incoming requests dropped < 1/sec

  • Avg. Message Processing < 3 sec
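The targets in the role sections all follow the same textual pattern: a metric name, a comparison (<, > or =), a limit, and an optional unit. As a hedged sketch only, the Python below shows one way to turn such a line into a rule and test a measured value against it. The functions parse_target and meets_target are hypothetical helpers written for this illustration; they are not part of the KHI script or spreadsheet, and the measured values in the example are made up.

```python
import operator
import re

# metric name, comparison operator, numeric limit, optional unit text
TARGET_PATTERN = re.compile(
    r"^(?P<metric>.+?)\s*(?P<op>[<>=])\s*(?P<limit>\d+(?:\.\d+)?)\s*(?P<unit>.*)$"
)

COMPARISONS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}


def parse_target(line):
    """Split a target such as 'Bad Requests < 20/sec' into its parts."""
    text = line.strip().lstrip("•").strip()  # tolerate a leading list bullet
    match = TARGET_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a recognizable target: {line!r}")
    return (match["metric"], match["op"],
            float(match["limit"]), match["unit"].strip())


def meets_target(line, measured_value):
    """Return True if measured_value satisfies the target written in line."""
    _, op, limit, _ = parse_target(line)
    return COMPARISONS[op](measured_value, limit)


if __name__ == "__main__":
    # Edge server targets from the section above, with made-up measurements.
    checks = [
        ("Bad Requests < 20/sec", 3),
        ("Packets Dropped <300/sec", 450),
        ("System is Throttling <1", 0),
    ]
    for target, value in checks:
        status = "OK" if meets_target(target, value) else "INVESTIGATE"
        print(f"{target!r} measured {value}: {status}")
```

Keeping the targets as text close to the poster's wording makes it easier to compare a script's rules with the published list when the targets are reviewed.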