Exchange Performance Basics

Since we occasionally troubleshoot performance problems, it would be a good idea to write up some basics. The following information will help rule out server latencies and help determine whether a less than optimal messaging user experience comes from a server-side issue.

Due to the breadth of the topic, this is not comprehensive coverage of every performance problem and counter that you may encounter. For now, we just want to focus on the main processes and a quick list of the things to look for.

The first part of performance analysis is to understand what the issues are. It is also important to know whether they originate on the server, the network, or the client.

It is ideal to begin by understanding the following:

·         What are the Outlook connection points that are involved? 

o   Are the clients making a direct connection to a mailbox server as in Exchange 2007?

o   Are they connecting directly to a CAS server as in Exchange 2010/2013? 

·         Understand if the issue occurs only in Outlook.  Does it occur when in Online Mode or Cached Mode?  Does the issue occur when using OWA?

·         Understand the scope of the problem.  Does the performance issue occur for only one client or does it occur for multiple clients?

o   Are the users impacted located in the same location?

§  Are they on the same subnet?

o   Are the users impacted located on the same Exchange server or Database?

o   Have the users been granted delegated or Full Access to another mailbox?

§  Where is this additional primary mailbox located?

o   Does the problem happen at a specific time or is it constant?

·         Is mail queueing on the server? 

·         Are transaction logs failing to replicate, with the copy queue length or replay queue length backing up?

·         Is the round trip time high when you ping the server from a client?

Once we have discerned the above information we can begin to understand at what time and from where we need to collect performance data.  This could be from the client, from the Mailbox Server, from the Client Access Server or from a Hub Transport Server.  In this session, we will primarily be covering server-side performance.

 

Application and System Logs

The first place to look for errors is the Application Log, and then the System Log, on the servers. Server performance problems sometimes surface in Event Viewer as recurring warnings or errors, such as low virtual memory or disk issues.

 


Performance Categories 

Performance problems can fall into two categories: 

1.      Increased server load

2.      Resource bottleneck


Resource Bottlenecks 

So what kind of resource bottlenecks can occur?  Bottlenecks may occur on any of the following resources, so check each one in turn:

  • Disk
  • Processor
  • Memory
  • RPC
  • Network
  • LDAP

 

What is Performance Data

There are multiple tools available that can be used to collect performance data.  The most basic tool is Performance Monitor.  Performance Monitor has existed since Windows NT and has pretty much remained the same.  While we will not be going into detail on how to use Performance Monitor, it is a good idea to get familiar with the basics, as this is a segue into understanding how other tools collect data.

Windows Performance Monitor uses performance counters, event trace data, and configuration information, which can be combined into Data Collector Sets.

Performance counters are measurements of system state or activity.  Performance Monitor requests the current value of performance counters at specified time intervals.
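
Once counter samples have been collected at a fixed interval, the first pass is usually an average and a maximum per counter. The following is a minimal sketch of that step in Python, assuming the log has been exported to CSV (for example with relog) and that the first column holds the sample timestamp while every other column is one counter path. The function name, the server name EX01, and the sample values are made up for illustration.

```python
import csv
import io
from statistics import mean


def summarize_counters(csv_text):
    """Return {counter_path: (average, maximum)} for a Perfmon-style CSV export."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    counters = header[1:]  # assumption: the first column is the sample timestamp
    samples = {name: [] for name in counters}
    for row in reader:
        for name, raw in zip(counters, row[1:]):
            raw = raw.strip()
            if raw:  # skip blank samples, which can appear at the start of a log
                samples[name].append(float(raw))
    return {
        name: (mean(values), max(values))
        for name, values in samples.items()
        if values
    }


# Hypothetical three-sample log; the server name and values are made up.
SAMPLE_LOG = r'''"(PDH-CSV 4.0) (UTC)(0)","\\EX01\Processor(_Total)\% Processor Time","\\EX01\System\Processor Queue Length"
"01/01/2015 09:00:00.000"," ","2"
"01/01/2015 09:00:15.000","40.0","4"
"01/01/2015 09:00:30.000","60.0","6"
'''

for counter, (avg, peak) in summarize_counters(SAMPLE_LOG).items():
    print(f"{counter}: avg={avg:.1f} max={peak:.1f}")
```

Looking at the average and the maximum together helps separate a sustained problem from a short spike.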

If users are able to connect to the Exchange server but encounter high latencies, performance analysis will help tell you where the issue may be.

Additional information on how to use Perfmon and how to create data collector sets can be found at the following TechNet URL: https://technet.microsoft.com/en-us/library/cc749249.aspx

Additional tools that we use, such as ExPerfWiz, use Perfmon in the background to create the collector sets and collect the performance data.  ExPerfWiz is a PowerShell-based script that helps automate the collection of performance data on Exchange 2007, 2010 and 2013 servers.


Performance Counters

The next step is to look at the performance counters and see if there are any latencies.  Remember the following basic process to analyze an Exchange performance concern (once it’s determined that the latency is server-side and not client-side or network related):

1.       Check the RPC Requests and RPC Latency counters

2.       Determine what resource is causing the bottleneck using the performance counters in the following sections.

We encourage you to initially focus on these specific counters to effectively identify potential performance issues.

Before we proceed, it is important to note that the thresholds for the performance counters may differ slightly from one version of Exchange to another.  More detail on the threshold values can be found at the following locations:

Exchange 2007:  https://technet.microsoft.com/en-us/library/bb201720(EXCHG.80).aspx

Exchange 2010:  https://technet.microsoft.com/en-us/library/dd335215(v=exchg.141).aspx

Exchange 2013:  https://technet.microsoft.com/en-us/library/dn904093(v=exchg.150).aspx


Understanding RPC Performance

RPC Latency is made up of two parts:

·         Server side RPC processing

·         Round-trip-time Network Latency

Network latency is probably the easiest to examine on the surface, since we really just need to use ping.exe to find out what our round-trip time (RTT) is to the target server. Note that ping measures this with ICMP echo requests, not TCP.  You can also use Outlook’s Connection Status dialog to see the average response and processing times.

To open it, hold Ctrl, right-click the Outlook icon in the notification area, and select “Connection Status”.

In the Connection Status dialog box, find the columns called Avg Resp and Avg Proc. The difference between these two values represents the network latency for each connection.

Use the following to maintain a good client experience in Cached mode.

  • Max Avg Proc Time (Exchange RPC Latency) = 25ms
  • Max Network RTT Time (Network Ping Time) = 300ms
  • Max Avg Resp Time (Exchange RPC Latency + Network Latency) = 325ms
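
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample values are hypothetical, and the limits are the three cached-mode values listed above.

```python
# Cached-mode limits from the list above, in milliseconds.
MAX_AVG_PROC_MS = 25      # Exchange RPC latency
MAX_NETWORK_RTT_MS = 300  # network ping time
MAX_AVG_RESP_MS = 325     # RPC latency plus network latency


def check_connection(avg_resp_ms, avg_proc_ms):
    """Split Avg Resp into server and network time and list any exceeded limits."""
    network_ms = avg_resp_ms - avg_proc_ms  # Avg Resp minus Avg Proc
    problems = []
    if avg_proc_ms > MAX_AVG_PROC_MS:
        problems.append("server RPC processing")
    if network_ms > MAX_NETWORK_RTT_MS:
        problems.append("network latency")
    if avg_resp_ms > MAX_AVG_RESP_MS:
        problems.append("overall response time")
    return network_ms, problems


# Made-up values read from the Connection Status dialog for one connection.
network_ms, problems = check_connection(avg_resp_ms=120, avg_proc_ms=15)
print(network_ms, problems)
```

A connection that only fails the "server RPC processing" check points at the server, while one that only fails the "network latency" check points at the path between the client and the server.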

From the server side of things, the following list shows acceptable thresholds for the basic RPC Client Access counters.

 

·         MSExchange RpcClientAccess\RPC Averaged Latency: Should be below 250ms

·         MSExchange RpcClientAccess\RPC Requests: Should not be above 40

·         MSExchangeIS\Client: RPCs Failed/sec: Should be 0 at all times
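
As a rough illustration, the three thresholds above can be applied to a window of collected samples as follows. This sketch assumes the latency threshold is judged on the average of the window, while the other two are judged on every sample. The function name and sample values are hypothetical.

```python
from statistics import mean

# Thresholds from the list above.
MAX_AVG_LATENCY_MS = 250
MAX_RPC_REQUESTS = 40


def check_rpc(latency_ms, rpc_requests, rpcs_failed_per_sec):
    """Return a list of threshold violations for one collection window."""
    findings = []
    # Assumption: latency is evaluated as an average over the window.
    if mean(latency_ms) >= MAX_AVG_LATENCY_MS:
        findings.append("RPC Averaged Latency is not below 250 ms on average")
    if max(rpc_requests) > MAX_RPC_REQUESTS:
        findings.append("RPC Requests exceeded 40")
    if any(rate > 0 for rate in rpcs_failed_per_sec):
        findings.append("RPCs Failed/sec was above 0")
    return findings


# Made-up samples: latency is fine, but requests back up and some RPCs fail.
print(check_rpc([12, 30, 18], [5, 44, 9], [0, 0, 2]))
```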

 

Understanding Processor Performance

Processor usage guidance is the same for all Exchange server roles: the load should stay at about 60 percent during peak working hours. This level leaves room for periods of extreme load. If processor usage is consistently greater than 75 percent, the processor is considered a bottleneck.  For example, CPU spikes in the w3wp and store worker processes can prevent ActiveSync devices from syncing and can cause random connectivity issues for OWA and MAPI clients.

 

If the servers are virtualized, it is very important to make sure that the customer's hosts are not oversubscribed on CPUs.

 

The following list shows acceptable thresholds and information about the basic processor and process counters.

 

·         Processor(_Total)\% Processor Time: Should be less than 75% on average

Shows the percentage of time that the processor is executing application or operating system processes, that is, the time during which the processor is not idle.

 

·         Processor(_Total)\% User Time: Should remain below 75%

Shows the percentage of processor time spent in user mode. User mode is a restricted processing mode designed for applications, environment subsystems, and integral subsystems.

 

·         Processor(_Total)\% Privileged Time: Should remain below 75%

Shows the percentage of processor time spent in privileged mode. Privileged mode is a processing mode designed for operating system components and hardware-manipulating drivers. It allows direct access to hardware and all memory.

 

·         System\Processor Queue Length: Should not be greater than 5 per processor

Shows the number of threads that are delayed in the Processor Ready Queue and are waiting to be scheduled for execution. It counts ready threads only, not threads that are currently running. Windows keeps a single ready queue for all processors, so divide the value by the number of processors before comparing it with the threshold. Processor Queue Length can be used to identify whether processor contention or high CPU utilization is caused by processor capacity that is insufficient for the assigned workload. The value listed is the last observed value at the time the measurement was taken.

 

·         If total processor time is high, use the Process(*)\% Processor Time counter to determine which process is causing the high CPU usage.
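
The processor checks above can be sketched as follows. This is a hedged illustration: the function names and sample values are hypothetical, the 75% and 5-per-processor thresholds come from the list above, and the sketch assumes the queue length is averaged over the window before being divided by the processor count.

```python
from statistics import mean

MAX_AVG_CPU_PERCENT = 75
MAX_QUEUE_PER_PROCESSOR = 5


def check_processor(cpu_percent, queue_length, processor_count):
    """Return a list of processor threshold violations for one window."""
    findings = []
    if mean(cpu_percent) >= MAX_AVG_CPU_PERCENT:
        findings.append("% Processor Time averages 75% or more")
    # Processor Queue Length is a system-wide value, so normalize it.
    if mean(queue_length) / processor_count > MAX_QUEUE_PER_PROCESSOR:
        findings.append("Processor Queue Length exceeds 5 per processor")
    return findings


def busiest_process(process_cpu):
    """Pick the process instance with the highest per-process CPU value."""
    # _Total and Idle are aggregate/placeholder instances, not real processes.
    real = {name: value for name, value in process_cpu.items()
            if name not in ("_Total", "Idle")}
    return max(real, key=real.get)


# Made-up samples from a 4-processor server.
print(check_processor([80, 90, 85], [30, 50, 40], processor_count=4))
print(busiest_process({"_Total": 340.0, "Idle": 60.0, "w3wp": 210.0, "store": 70.0}))
```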


Understanding Memory Performance

Some server roles utilize memory more than other roles.  For example, a Mailbox server will utilize more memory than a Client Access server.  With the introduction of the 64-bit architecture, memory bottlenecks have become less common but may still occur.

If the servers are virtualized, it is very important to make sure that the customer's hosts are not oversubscribed on memory.

 

The following list shows acceptable thresholds and information about the basic memory counters.

·         Memory\Available MBytes: Should remain above 100MB for Exchange 2007 and 2010.  Should remain above 5% of total RAM for Exchange 2013

Shows the amount of physical memory, in megabytes (MB), immediately available for allocation to a process or for system use.

 

·         Memory\% Committed Bytes In Use: If this value is high (more than 80% to 90%), you may begin to see commit failures.  This is a clear indication that the system is under memory pressure.

Shows the ratio of Memory\Committed Bytes to the Memory\Commit Limit. Committed memory is the physical memory in use for which space has been reserved in the paging file should it need to be written to disk.

 

·         Memory\Pool Paged Bytes: Monitor for sustained increases in pool paged bytes, which indicate a possible memory leak.
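
Two of the memory checks above involve a little arithmetic: the available-memory threshold depends on the Exchange version, and a possible leak shows up as a sustained upward trend rather than a single high value. The following is a minimal sketch with hypothetical function names. Using a least-squares slope to measure the trend is an assumption made here for illustration, not a method prescribed by the guidance above.

```python
def min_available_mb(exchange_version, total_ram_mb):
    """Return the minimum acceptable available memory, in MB."""
    if exchange_version in ("2007", "2010"):
        return 100.0
    if exchange_version == "2013":
        return 0.05 * total_ram_mb  # 5% of total RAM
    raise ValueError(f"no threshold listed for Exchange {exchange_version}")


def slope_per_sample(values):
    """Least-squares slope of equally spaced samples (needs at least two)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Made-up example: a 32 GB Exchange 2013 server and four pool samples in MB.
print(min_available_mb("2013", 32768))
print(slope_per_sample([100, 110, 120, 130]))
```

A slope that stays positive across a long collection window is worth a closer look, while a slope near zero suggests normal fluctuation.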

 

 

Understanding Disk Performance

A key component of understanding how to measure the performance of an Exchange disk subsystem – or any disk subsystem – is to understand the difference between the LogicalDisk and PhysicalDisk performance objects. This matters because these objects may, or may not, measure the same things.

  • LogicalDisk – A logical disk is the unit of a disk subsystem with which Windows and users utilize a disk. When you open “Computer” (or “My Computer” for Windows 2003 and older versions of Windows) the hard disk drives shown there are logical disks.
  • PhysicalDisk – A physical disk is the unit of a disk subsystem which the hardware presents to Windows.
  • A logical disk may consist of multiple physical disks (think of RAID)
  • A physical disk may host multiple logical disks (think Windows partitions)

If you put all of these together, this means that in the case where a physical disk contains only a single Windows volume, LogicalDisk and PhysicalDisk measure the same thing.

Somewhat confusingly, disk aggregators (this includes RAID controllers, Storage Area Networks, Network Attached Storage, iSCSI, etc.) may present many physical disks as a single logical device to Windows. However, each of these devices (known as a logical unit number or LUN) may again actually represent multiple logical or physical disks. Thankfully, from a Windows performance perspective, those distinctions can be ignored, at least until a specific LUN is identified as having a performance issue. In that case, in order to acquire more specific data, you will have to use performance tools from the aggregator’s provider as the disk aggregators can sometimes provide unpredictable results.

The conclusion is that in the most common cases, the performance of a LogicalDisk object is what you are most interested in.

Note: Many SAN and NAS products provide a feature called “LUN stacking” or “disk stacking”, which allows multiple LUNs to exist on a single physical disk. This just complicates your life. Avoid it. :) Always remember that you have to be able to identify what you are measuring and the boundaries on that measurement. If you have multiple applications accessing a single physical disk, then your performance will always be non-deterministic and difficult to predict.

The following list shows acceptable thresholds and information about the most basic disk counters to review first.

 

·         LogicalDisk(*)\Avg. Disk sec/Read: Should be below 20ms on average for both database and transaction log disks.

 

·         LogicalDisk(*)\Avg. Disk sec/Write: Should be below 50ms on average for database disks and below 10ms on average for transaction log disks.
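
One detail that is easy to miss: the Avg. Disk sec/Read and Avg. Disk sec/Write counters report seconds, while the thresholds above are given in milliseconds (20ms is a counter value of 0.020). The sketch below converts before comparing. The function name, the role labels, and the sample values are hypothetical.

```python
from statistics import mean

# Thresholds in milliseconds from the list above, keyed by (disk role, operation).
THRESHOLDS_MS = {
    ("database", "read"): 20,
    ("database", "write"): 50,
    ("log", "read"): 20,
    ("log", "write"): 10,
}


def disk_latency_ok(role, operation, samples_sec):
    """Compare the average latency of a window of samples with its threshold."""
    average_ms = mean(samples_sec) * 1000  # the counters report seconds
    return average_ms < THRESHOLDS_MS[(role, operation)]


# Made-up samples: the same 13 ms write latency passes on a database disk
# but fails on a transaction log disk.
print(disk_latency_ok("database", "write", [0.012, 0.014]))
print(disk_latency_ok("log", "write", [0.012, 0.014]))
```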


Understanding LDAP Performance

A key component of Exchange is its LDAP queries to domain controllers and global catalog servers.  Almost every aspect of Exchange and Outlook utilizes LDAP queries to domain controllers.  LDAP query response time should be in an acceptable range.  If it is not, the result may be a poor client experience: for example, Outlook may appear to hang, or may continuously disconnect and reconnect.

The following list shows acceptable thresholds and information about the LDAP-related counters to review first.

 

·         MSExchange ADAccess Domain Controllers(*)\LDAP Read Time: Should be below 50ms on average where spikes should never be above 100ms.

Shows the time in milliseconds (ms) to send an LDAP read request to the specified domain controller and receive a response.

 

·         MSExchange ADAccess Domain Controllers(*)\LDAP Search Time: Should be below 50ms on average where spikes should never be above 100ms.

Shows the time (in ms) to send an LDAP search request and receive a response.

 

·         MSExchange ADAccess Process(*)\LDAP Read Time: Should be below 50ms on average where spikes should never be above 100ms.

Shows the time in milliseconds (ms) to send an LDAP read request to the specified domain controller and receive a response.

 

·         MSExchange ADAccess Process(*)\LDAP Search Time: Should be below 50ms on average where spikes should never be above 100ms.

Shows the time (in ms) to send an LDAP search request and receive a response.
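
All four LDAP counters above share the same two-part rule: an average below 50ms and no spike above 100ms. A minimal sketch of that rule in Python follows, with a hypothetical function name and made-up samples.

```python
from statistics import mean

MAX_AVERAGE_MS = 50
MAX_SPIKE_MS = 100


def check_ldap(samples_ms):
    """Return threshold violations for one LDAP Read Time or Search Time series."""
    findings = []
    if mean(samples_ms) >= MAX_AVERAGE_MS:
        findings.append("average is not below 50 ms")
    if max(samples_ms) > MAX_SPIKE_MS:
        findings.append("spike above 100 ms")
    return findings


# Made-up samples: the average is acceptable, but one query spiked.
print(check_ldap([20, 30, 120, 10]))
```

Checking the average and the maximum separately matters here, because a single slow domain controller response can hide inside an otherwise healthy average.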