Response time metric for SLA
Posted April 9, 2012on:
Service Level Agreements (SLAs) usually specify a response time criteria that must be met. Although SLAs can have a wide range of metrics like throughput, up time, availability etc., we will focus on response times in this article.
We often hear phrases like the following :
- “The response time was 5 seconds”
- “This product’s performance is much worse than slowpoke’s. It takes longer to respond.”
- “Our whizbang product can perform 100 transactions/sec with a response time of 10 seconds or less”
Do you see anything wrong in these statements? Although they sound fine for general conversation, anyone interested in performance should really be asking what exactly do they mean.
Let’s take the first statement above and make the assumption that it refers to a particular page in a web application. When someone says that the response time is 5 seconds, does it mean that when this user typed in the URL of this page, the browser took 5 seconds to respond? Or does it mean that in an automated test repeatedly accessing this page, the average response time was 5 seconds? Or perhaps, the median response time was 5 seconds?
You get the idea. For some reason, people tend to talk loosely about response times. Without going into details of how to measure the response time (that’s a separate topic), this article will focus on what is a meaningful response time metric.
For purposes of this discussion, let us assume we are measuring the response time of a transaction (which can be anything – web, database, cache etc.) What is the most meaningful measure for the response time of a transaction?
Mean Response Time
This is the most common measure of response time, but alas, usually is the most flawed as well. The mean or average response time simply adds up all the individual response times taken from multiple measurements and divides it by the number of samples to get an average. This may be fine if the measurements are fairly evenly distributed over a narrow range as in Figure 1.
- Figure 1: Steady Response Times
- Figure 2: Varying Response Times
But if the measurements vary quite a bit over a large range like in Figure 2, the average response time is not meaningful. Both figures have the same scale and show response times on the y axis for samples taken over a period of time (x axis).
Median Response Time
If the average is not a good representation of a distribution, perhaps the median is? After all, the median marks the 50th percentile of a distribution. The median is useful when the response times do have a normal distribution but have a few outliers. In this case, the median helps to weed out the outliers.The key here is few outliers. It is important to realize that if 50% of the transactions are within the specified time, that means the remaining 50% have a higher response time. Surely, a response time specification that leaves out half the population cannot be a good measure.
90th or 95th percentile Response Time
In standard benchmarks, it is common to see 90th percentile response times used. The benchmark may specify that the 90th percentile response time of a transaction should be within x seconds. This means that only 10% of the transactions have a response time higher than x seconds and can therefore be a meaningful measure. For web applications, the requirements are usually even higher – after all, if 10% of your users are dissatisfied with the site performance, that could be a significant number of users. Therefore, it is common to see 95th percentile used for SLAs in web applications.
A word of caution – web page response times can vary dramatically if measured at the last mile (i.e. real users computers that are connected via cable or DSL to the internet). Figure 3 shows the distribution of response times for such a measurement.
- Figure 3: Response Time Histogram
It uses the same data as in Figure 2. The mean response time for this data set is 12.9 secs and the median is even lower at 12.3 secs. Clearly neither of these measures covers any significant range of the actual response times. The 90th percentile is 17.3 and the 95th is 18.6. These are much better measures for the response time of this distribution and will work better as the SLA.
To summarize, it is important to look at the distribution of response times before attempting to define an SLA. Like many other metrics, a one size fits all approach does not work. Response time measurements on the server side tend to vary a lot less than on the client. A 90th or 95th percentile response time requirement is a good choice to ensure that the vast majority of clients are covered.