Measuring the performance of services is tricky. There is an almost irresistible urge to measure average performance, but averages are pretty much guaranteed to produce misleading results. The most accurate way I know of to characterize service performance is to measure percentiles, not averages. So do not use averages or standard deviations; do use percentiles. The details are below.
1 Averages for service performance are typically wrong
So let's say you have your shiny new service and you want to know how it's performing. You likely set up some kind of test bench, fire off a bunch of requests to the service, and record the latency and throughput of those requests over some period of time. But how should one summarize the performance data?
In most cases what I see people do next is calculate average latency and average throughput. If they are particularly fancy they might even throw in standard deviations for both.
Unfortunately in most cases both the average and the standard deviation don't accurately represent the performance of the system.
The reason for this is pretty straightforward: the average and standard deviation describe a data set on the assumption that it follows a normal distribution, the famous bell-shaped curve. But service performance is almost never normal. In fact the latency distribution tends to be fairly flat for most requests and then fall off a cliff, which is not what you would call a normal distribution. There are lots of reasons for this.
For example, most services have some kind of caching. Caching is generally a good technique, but cache misses are expensive. So while most requests will be served quickly out of a cache, some number of requests will miss and be substantially more expensive to handle. This behavior isn't really a curve; it's more like a step function.
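A toy simulation makes the point concrete. The hit rate and latencies below are made-up illustrative numbers, not measurements from any real system:

```python
import random

random.seed(42)

# Hypothetical numbers: a 95% cache hit rate, 1 ms on a hit,
# 100 ms on a miss (a round trip to the backing store).
HIT_MS, MISS_MS, HIT_RATE = 1.0, 100.0, 0.95

samples = [HIT_MS if random.random() < HIT_RATE else MISS_MS
           for _ in range(10_000)]

mean = sum(samples) / len(samples)
median = sorted(samples)[len(samples) // 2]

# The mean lands around 6 ms, a latency almost no request actually saw:
# every request cost either ~1 ms or ~100 ms.
print(f"mean={mean:.1f} ms, median={median:.1f} ms")
```

The average falls in the gap of the step function, describing an experience no user ever had, while the median correctly reports what the typical request saw.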
Another reason is queuing behavior. Most services can be thought of as a series of queues. As long as the load is within a queue's capacity everything is fine, but once that capacity is exceeded, timeouts, failures, etc. start to happen, and the system can't recover until incoming requests fall enough to let it catch up. In effect the system shuts down until the queues are cleared.
Now there are ways to torture normal distributions into something closer to what one sees in service behavior; we can start talking about kurtosis, skew, etc. But these are all attempts to force the data to be summarized by some model, and if that model isn't appropriate to the data set then the information the model gives is just plain wrong. To prove that a chosen distribution actually fits, one still has to collect the same kind of data that percentiles (discussed below) require. And if one is going to do all the work of collecting percentile data anyway, why not just use the percentile data?
But the punch line is this - characterizing a service's performance across many requests using averages will almost certainly produce misleading data. So please, just say no to averages.
2 Even small numbers of customers having a bad experience costs real money
O.k., o.k., so averages are wrong. But they are really easy to calculate, and as long as they're 'close enough' aren't we happy? This is actually something that has been studied, and the answer is no. To help frame this discussion, consider the following:
Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO (2007) - 100 ms delays caused a 1% drop in sales at Amazon.
Performance Related Changes and their User Impact (2009) - 500 ms delays caused a 1.2% loss of revenue/user. This grew to 2.8% at 1 second and 4.3% at 2 seconds.
Why Web Performance Matters: Is Your Site Driving Customers Away? (2010) - between 2 and 4 seconds of load time 8% of users abandon the site, between 2 and 6 seconds it's 25%, and by 10 seconds it's 38%.
With this data in mind, let's go back and look at those averages. If the average is, say, a 50 ms delay, shouldn't everyone be happy? Not if, say, 20% of users are seeing latencies of 100 ms or more. That wouldn't show up in the average, and the standard deviation is largely meaningless anyway, since it's describing probabilities for the wrong curve.
In that case, for the 20% of users seeing 100 ms plus latencies, the Amazon number says 20% * 1% = 0.2% of sales just walked out the door. That isn't a healthy way to run a business. In fact, major service companies measure the experience of their users out to the 99.9th percentile (as will be explored below) because bad experiences for even small numbers of users have significant financial consequences.
Put another way, it's cheaper to build systems with predictable performance out to three 9s than to lose the sales caused by bad performance at the tail of the performance curve.
3 Use percentiles to represent performance
The way to accurately represent system performance is with percentiles. The idea behind percentiles is pretty straightforward: what percentage of users had a particular experience?
Imagine, for example, that we ran a test 10 times and the latencies we got back were 1, 2, 1, 4, 50, 30, 1, 3, 2 & 1 ms. The first thing we would do is order the latencies from smallest to largest - 1, 1, 1, 1, 2, 2, 3, 4, 30, 50.
The median, or 50th percentile, is the latency at or below which 50% of the requests completed. In this case we count off half the results (10/2 = 5), and the 5th value in sorted order is the median: 2 ms. This means that 50% of the requests had a latency of 2 ms or better.
The 90th percentile is the latency at or below which 90% of the requests completed. In this case that's the 9th result (0.9 * 10 = 9), which is 30 ms.
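The counting method just described (sometimes called the nearest-rank method) can be sketched in a few lines, using the sample latencies from the example above:

```python
import math

# Latencies from the example above, in milliseconds.
latencies = [1, 2, 1, 4, 50, 30, 1, 3, 2, 1]

def percentile(samples, pct):
    """Nearest-rank percentile: the value at or below which
    pct percent of the samples fall."""
    ordered = sorted(samples)
    # Rank is 1-based: count off pct percent of the sorted results.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

print(percentile(latencies, 50))  # 2 (ms) - the median
print(percentile(latencies, 90))  # 30 (ms)
```

Note that libraries often interpolate between neighboring values instead of taking the nearest rank, so their results can differ slightly on small data sets; for real measurements with thousands of samples the difference is negligible.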
Typically the results are shown as a graph from the 1st percentile through the 99th in increments of 1%, followed by 99.9% or higher if appropriate. In most cases the results have to be plotted on a logarithmic scale to be easily viewable.
Throughput is measured in a similar way. The big difference is that for throughput one measures the number of requests completed over some window of time; 1 second is a pretty typical window. I usually just count how many requests completed during each window.
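The windowing can be sketched as follows, using hypothetical completion timestamps; bucketing each completion by its window index gives the per-second request counts, which can then be summarized with the same percentile approach:

```python
from collections import Counter

# Hypothetical completion timestamps, in seconds since the test began.
completions = [0.1, 0.4, 0.9, 1.2, 1.3, 1.8, 2.5, 2.6, 2.7, 2.9]

WINDOW = 1.0  # one-second windows, as in the text

# Count how many requests completed in each window.
per_window = Counter(int(t // WINDOW) for t in completions)
n_windows = int(max(completions) // WINDOW) + 1
throughput = [per_window[w] for w in range(n_windows)]

print(throughput)  # requests completed per second: [3, 3, 4]
```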