All About Load Test Results and the Results DB (Part 4 – What Are the Numbers Really Telling Me?)

This question comes up all the time. Unfortunately, there is no single correct answer, but there are several things worth pointing out that will help lead you to an acceptable one. The most important is to be sure you have defined the proper goals, objectives, and success criteria (see this post). I also want to point out that I am not covering how to use results to find issues or troubleshoot test runs; this post focuses solely on the mathematics of results and the expectations of the people receiving them. Let’s look at a question and some of the responses I have seen before. Some of the answers look very reasonable at first, but they come with caveats:

INITIAL QUESTION: In my performance report there are three columns, say Average Response Time, 95% Response Time, and 99% Response Time. Which one should I use to report to the business?

RESPONSES: Here are some responses I have seen, along with some thoughts to consider when reading them:

  • I don’t think average response time is a good idea to report, since half of your users will see something worse. [The average response time does not mean that half the values are better or worse than the average. Averages can be skewed by a single value; e.g., the average of (1, 1, 2, 2, 1, 20) = 4.5, and in this example every value except one is below the average. The MEDIAN, or 50th percentile, represents the exact middle value of a set of data. See the first sketch after this list for a worked comparison.]
  • We often report the 80th percentile since that represents a vast majority of users without worrying about the extreme outliers. [The 80th percentile only excludes the “greater than” outliers. It does not necessarily exclude all “extreme” outliers, since you may have outliers on the low end of the scale, and it does not mean that there are extreme outliers at all. It just means that the 80th percentile is a value which 80% of the data is less than or equal to.]
  • It’s helpful to chart the results so that you can see the distribution. [I like this answer; see the charting sketch after this list.]
  • In terms of what to report back, the business should be telling you what they want to see based on their needs and those of their customers. [I like this answer]
  • Sharing thoughts on percentile values when reporting (to see how the values are calculated, read the bottom of this post):
    1. If the standard deviation is < 5% for the individual result set of transactions/pages, we could take the average response time.
    2. If the standard deviation is between 5% and 10% for the individual result set of transactions/pages, we could take the 75th percentile response time.
    3. If the standard deviation is between 10% and 20% for the individual result set of transactions/pages, we could take the 90th percentile response time.
    4. If the standard deviation is > 20% for the individual result set of transactions/pages, we could take the 95th percentile response time. [Keep in mind that if the variation is low (small std dev), you generally want to use the average because it does not exclude any data. When you switch to percentiles, you are excluding some chosen percentage of the data. If you want a measure of central tendency when there is higher variation, I would generally steer toward the median or some sort of truncated mean (i.e., remove the bottom and top x percent rather than excluding only the top x percent). If you want a measure against a criterion (such as an SLA), the chosen percentiles might make more sense because they say that x percent of calls are less than or equal to y. HOWEVER, when doing comparison reporting, you have to be careful to always choose the exact same percentile value AND to exclude outliers using the same formula. See the second sketch after this list.]
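
To make the average-versus-median point concrete, here is a minimal Python sketch (standard library only) that runs both statistics over the six-value example from the first bullet. Only the data set comes from the post; everything else is illustrative.

```python
import statistics

# The example data set from the first bullet above.
response_times = [1, 1, 2, 2, 1, 20]

# The single outlier (20) drags the mean well above most observations.
print(statistics.mean(response_times))    # 4.5

# The median (50th percentile) is the middle of the sorted data and
# is unaffected by how extreme the outlier is.
print(statistics.median(response_times))  # 1.5
```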
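The bracketed note above warns that percentile comparisons are only valid when every report uses the same formula. To illustrate, here is one common percentile definition (nearest-rank) plus a symmetric truncated mean, written as plain Python. These are illustrative implementations, not the formulas Visual Studio or any particular load tool uses; tools that interpolate between ranks will produce slightly different numbers.

```python
import math
import statistics

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the data is less than or equal to it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

def truncated_mean(values, trim_pct):
    """Mean after dropping the bottom AND the top trim_pct percent of
    values, rather than excluding only the high end as a percentile does."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_pct / 100)
    return statistics.mean(ordered[k:len(ordered) - k])

times = [1, 1, 2, 2, 1, 20]
print(percentile(times, 80))      # 2   -> 80% of calls were <= 2
print(truncated_mean(times, 20))  # 1.5 (mean of [1, 1, 2, 2])

# Why the formula matters for comparisons: this nearest-rank 50th
# percentile is 1, while statistics.median interpolates and returns 1.5.
print(percentile(times, 50), statistics.median(times))  # 1 1.5
```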
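For the charting suggestion, a simple histogram is often enough to see the shape of the distribution and spot outliers that an average would hide. Here is a minimal sketch assuming matplotlib is installed; the sample data, bin count, and labels are illustrative choices, not prescriptions.

```python
import matplotlib.pyplot as plt

# Made-up sample data: one response time (in seconds) per request.
response_times = [0.8, 1.1, 1.0, 1.3, 0.9, 1.2, 1.1, 5.4, 0.9, 1.0]

plt.hist(response_times, bins=20)
plt.xlabel("Response time (s)")
plt.ylabel("Number of requests")
plt.title("Response time distribution")
plt.show()
```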

I would love to hear some other thoughts on this subject as well.

Comments

  • Anonymous
    June 12, 2014
    Thanks for writing this post, it's nice to see this type of thing discussed. Most people just use an average and get on with it, not realising that it doesn't really help them. I think the best approach I've seen is to create graphs like these ones from jHiccup (www.azulsystems.com/.../jHiccup): www.azulsystems.com/.../3gb-hotspot-hiccup.gif www.azulsystems.com/.../3gb-zing-hiccup.gif They include percentiles overall (bottom graph) and per-interval (i.e. over time, in the top graph). Once you understand them, they give a really nice overview of latency, and most importantly they don't hide outliers the way an average can. I think your suggestion of using different values depending on the standard deviation (1, 2, 3, or 4) could get a bit confusing.

  • Anonymous
    June 12, 2014
    Thanks, Matt. I like the graphs you pointed out. I am working on a tool to help create all kinds of different reports from Visual Studio, but I keep running into two issues (hopefully I will have some things to share with everyone very soon):

  1. There are so many different ways to report things, and so many different things to report, that it can become overwhelming to decide what to use and what not to use.
  2. I am trying to build this as a side project since it is not part of my normal job, so time is extremely limited.

  As for the std dev variants, I agree that they can be confusing, but I included them since the suggestion came from one of the other test teams in the company. They use it that way and it seems to work for them. The biggest point to take away from any of this is to ensure that YOU report on whatever you NEED and whatever you UNDERSTAND. And even more importantly, make sure that the things you report on are RELEVANT to the desired need (which is why I wrote Part #3).
  • Anonymous
    June 13, 2014
    I've just read Part #3, I really like the example you show of how to talk through the performance goals with a customer.

  • Anonymous
    June 17, 2014
    It's been a while since I have run load tests, as it was a couple of jobs ago. But I remember one thing from talking with the business side: they cared about two things.

  1. Are our users having a great experience?
  2. Is the site generating the end result (most likely revenue)?

  You then bucketize the reports into those two categories. Response time data (login, search results, add to cart, submit order, etc.) mainly falls into the first bucket, whereas transaction-volume data (max concurrent, max per hour, average per hour, etc.) falls into the second. This is where I think Application Insights, run in conjunction with the tests, is extremely valuable. Your Part #3 post is spot on, as it illustrates the importance of getting the business to agree upon this stuff before you even create your tests.
  • Anonymous
    June 17, 2014
    The comment has been removed
  • Anonymous
    June 24, 2014
    Thanks. Glad you are enjoying this. Unfortunately, your question is one that plagues many of us all the time. I will start with the de facto answer of "it depends!" (sorry, I hate that answer, but it is so true). I am going to be publishing an article or two very soon on extrapolation and how it has bitten me and a couple of teammates in the past. I will attempt to list a few things you can consider when extrapolating, but there are many more than I am even aware of, so the best answer is to fully understand the application, the architecture, and the behaviors of similar architectures that you may be able to reference, and above all else, ADD DISCLAIMERS to any results you publish.