Performance Blog

Response time metric for SLA

Posted on: April 9, 2012

Service Level Agreements (SLAs) usually specify response time criteria that must be met. Although SLAs can include a wide range of metrics such as throughput, uptime, and availability, we will focus on response times in this article.

We often hear phrases like the following:

  • “The response time was 5 seconds”
  • “This product’s performance is much worse than slowpoke’s. It takes longer to respond.”
  • “Our whizbang product can perform 100 transactions/sec with a response time of 10 seconds or less”

Do you see anything wrong with these statements? Although they sound fine in general conversation, anyone interested in performance should ask what exactly they mean.

Let’s take the first statement above and make the assumption that it refers to a particular page in a web application. When someone says that the response time is 5 seconds, does it mean that when this user typed in the URL of this page, the browser took 5 seconds to respond? Or does it mean that in an automated test repeatedly accessing this page, the average response time was 5 seconds? Or perhaps, the median response time was 5 seconds?

You get the idea. For some reason, people tend to talk loosely about response times. Without going into the details of how to measure response time (that is a separate topic), this article focuses on what makes a meaningful response time metric.

For the purposes of this discussion, let us assume we are measuring the response time of a transaction (which can be anything: web, database, cache, etc.). What is the most meaningful measure of a transaction's response time?

Mean Response Time

This is the most common measure of response time, but alas, it is usually the most flawed as well. The mean or average response time simply adds up the individual response times from multiple measurements and divides the sum by the number of samples. This may be fine if the measurements are fairly evenly distributed over a narrow range, as in Figure 1.

Figure 1: Steady Response Times
Figure 2: Varying Response Times

But if the measurements vary quite a bit over a large range like in Figure 2, the average response time is not meaningful. Both figures have the same scale and show response times on the y axis for samples taken over a period of time (x axis).
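To make this concrete, here is a small Python sketch with made-up numbers (not the actual data behind the figures): two series with the same mean but very different spreads. The mean alone cannot tell them apart.

```python
# Hypothetical measurements (seconds) -- both series average 5.0s.
steady = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0]       # narrow range, like Figure 1
varying = [0.5, 1.0, 2.5, 6.0, 9.0, 11.0]     # wide range, like Figure 2

for name, samples in (("steady", steady), ("varying", varying)):
    mean = sum(samples) / len(samples)
    print(f"{name}: mean={mean:.1f}s  min={min(samples)}s  max={max(samples)}s")
```

Both series report a mean of 5.0 seconds, yet users of the second system would see responses anywhere from half a second to 11 seconds.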

Median Response Time

If the average is not a good representation of a distribution, perhaps the median is? After all, the median marks the 50th percentile of a distribution. The median is useful when the response times roughly follow a normal distribution but include a few outliers; in that case, the median helps to weed out the outliers. The key here is few outliers. It is important to realize that if 50% of the transactions are within the specified time, the remaining 50% have a higher response time. Surely, a response time specification that leaves out half the population cannot be a good measure.
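A quick sketch of that point, again with hypothetical numbers: the median shrugs off a couple of extreme outliers, but it also says nothing about the slower half of the transactions.

```python
import statistics

# Hypothetical samples (seconds): mostly ~2s with two extreme outliers.
samples = [1.8, 1.9, 2.0, 2.0, 2.1, 2.2, 2.3, 2.4, 30.0, 45.0]

print(statistics.mean(samples))     # ~9.2s -- dragged up by the two outliers
print(statistics.median(samples))   # 2.15s -- unaffected by the outliers...

# ...but the slowest half of the transactions is invisible to the median:
print(sorted(samples)[len(samples) // 2:])   # [2.2, 2.3, 2.4, 30.0, 45.0]
```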

90th or 95th percentile Response Time

In standard benchmarks, it is common to see 90th percentile response times used. The benchmark may specify that the 90th percentile response time of a transaction should be within x seconds, meaning that no more than 10% of the transactions take longer than x seconds. That makes it a meaningful measure. For web applications, the requirements are usually even higher: after all, if 10% of your users are dissatisfied with the site's performance, that could be a significant number of users. Therefore, it is common to see the 95th percentile used for SLAs in web applications.
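For reference, here is one simple way to compute such a percentile in Python, using the nearest-rank method (other definitions exist, and libraries such as NumPy offer interpolating variants); the response times below are hypothetical.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the value at or below which `pct` percent of samples fall."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))   # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical response times (seconds) collected during a load test.
response_times = [3.1, 4.2, 4.8, 5.0, 5.5, 6.1, 7.0, 8.3, 12.4, 21.7]

print(percentile(response_times, 90))   # 12.4 -> 90% of requests finished within 12.4s
print(percentile(response_times, 95))   # 21.7 -> the stricter 95th percentile target
```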

A word of caution: web page response times can vary dramatically if measured at the last mile (i.e. on real users' computers connected to the internet via cable or DSL). Figure 3 shows the distribution of response times for such a measurement.

Figure 3: Response Time Histogram

It uses the same data as Figure 2. The mean response time for this data set is 12.9 seconds and the median is even lower at 12.3 seconds. Clearly, neither of these measures covers any significant range of the actual response times. The 90th percentile is 17.3 seconds and the 95th is 18.6 seconds. These are much better measures for this distribution and will work better as the basis for an SLA.

To summarize, it is important to look at the distribution of response times before attempting to define an SLA. As with many other metrics, a one-size-fits-all approach does not work. Response time measurements on the server side tend to vary a lot less than on the client. A 90th or 95th percentile response time requirement is a good choice to ensure that the vast majority of clients are covered.


14 Responses to "Response time metric for SLA"

Hi, I like your article. Is there a reference/citation on “Therefore, it is common to see 95th percentile used for SLAs in web applications.”?

Not really. The larger web companies like to use 95% but it is very hard to achieve a reasonable time with this metric for smaller companies who do not have a global footprint for serving. I would recommend at least a 90% criterion.

Hi, nice info, thanks. I am new to performance testing and hope you can give some advice. If my SLA says "The system shall complete 90% of all user inquiry transactions within five (5) seconds", does the 90th percentile apply to a single user, or can it cover many concurrent users?

It is measured across all users, not on a per-user basis.

Hello,
I am new to performance testing and am getting confused between response time and average page time. Which is the correct metric to report?

It doesn’t matter what you’re measuring – the point I was trying to make is that using average as a metric is not useful. A percentile gives a more accurate measurement of response time.

Hi, where can I get information about how to measure response time? Is response time directly related to application sizing? I mean, can I measure response time without measuring application size?

Response time is typically measured on the client side, i.e. in the piece of code that is making the request. Take a timestamp before the request and one after the response is received; the difference is the response time. I am not sure what you mean by 'application sizing'. Please elaborate.
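As a rough illustration of that approach in Python (the URL is just a placeholder):

```python
import time
import urllib.request

url = "http://example.com/somepage"   # placeholder URL

start = time.perf_counter()
with urllib.request.urlopen(url) as response:
    response.read()                   # make sure the full response body has arrived
elapsed = time.perf_counter() - start

print(f"response time: {elapsed:.3f}s")
```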

Thanks for your prompt response. By application sizing I mean the size of the application, for example whether we measure it in function points, lines of code, or some other method. Thanks.

I’m still not clear what you mean. Why do we care about lines of code for performance? If you’re talking about understanding which functions are being called, the best way to understand that is to run a profiler on the app under load.

Nice article. When we do performance testing there are two aspects: one is measuring the response time from the request to the database, and the second is measuring the time taken for the web browser to render the page. How do we differentiate and measure these two?

This is always hard. Without commercial tools, the only way to do this is to measure the time taken for a db request from within your app and log it. You will then need to post-process the log to aggregate the db response times.
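A rough sketch of that idea, assuming a DB-API style cursor (the names here are illustrative, not from any particular framework):

```python
import logging
import time

log = logging.getLogger("db_timing")

def timed_query(cursor, sql, params=()):
    """Run a query through whatever cursor the app already uses and log its elapsed time."""
    start = time.perf_counter()
    cursor.execute(sql, params)
    rows = cursor.fetchall()
    log.info("db_time_ms=%.1f sql=%s", (time.perf_counter() - start) * 1000, sql)
    return rows

# Afterwards, pull the db_time_ms values out of the log and feed them into
# a percentile calculation like the one shown earlier in the article.
```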

For end-to-end response time measurement, does this include browser time (Browser + Network + Server + Database) or just (Network + Server + Database)?

That really depends on how you are measuring the response time (which is really not the subject of this article). Unless you are using real browsers, it is usually the latter.
