Performance Blog

Posts Tagged ‘Performance’

I just posted an article on how to do performance load testing in production on the Box Tech blog – check it out.

The focus of that article is applying load to ensure that the system is stable and performs reasonably. This should be distinguished from scalability testing, where the goal is to measure how well the application scales, both for performance analysis and for capacity planning.

To test scalability, it is important to increase the load methodically so that performance can be measured and plotted. For example, we can start with 100 emulated users, then increment in steps of 100 until we reach 1000 users. This gives us 10 data points, which is sufficient for a good estimate of scalability. To compute a scalability metric, take a look at Gunther’s USL (Universal Scalability Law); a rough sketch of fitting it is shown below.
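As an illustration, here is a minimal sketch of fitting the USL to such data points (Python with NumPy and SciPy; the user counts and throughput numbers are made-up placeholders, so substitute your own measurements):

# Fit Gunther's Universal Scalability Law to measured throughput data.
# C(N) = lambda*N / (1 + sigma*(N-1) + kappa*N*(N-1))
# The load levels and throughput numbers below are placeholders.
import numpy as np
from scipy.optimize import curve_fit

def usl(n, lam, sigma, kappa):
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

users = np.array([100, 200, 300, 400, 500, 600, 700, 800, 900, 1000])
throughput = np.array([980, 1890, 2700, 3400, 3950, 4350, 4600, 4700, 4720, 4680])

(lam, sigma, kappa), _ = curve_fit(usl, users, throughput,
                                   p0=[10, 0.01, 0.0001], bounds=(0, np.inf))
print(f"lambda={lam:.2f}, sigma (contention)={sigma:.4f}, kappa (coherency)={kappa:.6f}")

# Peak concurrency predicted by the USL (meaningful only when kappa > 0)
n_peak = np.sqrt((1 - sigma) / kappa)
print(f"predicted peak at ~{n_peak:.0f} users")

The fitted contention and coherency terms tell you whether the application is limited by serialization or by cross-talk, which is useful input for capacity planning.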

Going by the many posts in various LinkedIn groups and blogs, there seems to be some confusion about how to measure and analyze a web application’s performance. This article tries to clarify the different aspects of web performance and how to go about measuring it, explaining key terms and concepts along the way.

Web Application Architecture

The diagram below shows a high-level view of typical architectures of web applications.

The simplest applications have the web and app tiers combined while more complex ones may have multiple application tiers (called “middleware”) as well as multiple datastores.

The Front end refers to the web tier that generates the HTML response for the browser.

The Back end refers to the server components that are responsible for the business logic.

Note that in architectures where a single web/app server tier is responsible for both the front and back ends, it is still useful to think of them as logically separate for the purposes of performance analysis.

Front End Performance

When measuring front end performance, we are primarily concerned with understanding the response time that the user (sitting in front of a browser) experiences. This is typically measured as the time taken to load a web page. Performance of the front end depends on the following:

  • Time taken to generate the base page
  • Browser parse time
  • Time to download all of the components on the page (CSS, JS, images, etc.)
  • Browser render time of the page

For most applications, the response time is dominated by the third bullet above, i.e. the time spent by the browser retrieving all of the components on a page. As pages have become increasingly complex, their sizes have mushroomed as well – it is not uncommon to see pages of 0.5 MB or more. Depending on where the user is located, it can take a significant amount of time for the browser to fetch components across the internet.
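To get a rough feel for how much of the time goes into component downloads, here is a small sketch (Python with the third-party requests package; it fetches assets serially, so it ignores browser parallelism, caching, JavaScript-triggered requests and render time, and the URL is a placeholder):

# Rough timing of base page vs. component downloads. Not a browser:
# assets are fetched serially, and caching, parallel connections and
# render time are ignored. The URL is a placeholder.
import time
from html.parser import HTMLParser
from urllib.parse import urljoin
import requests

class AssetCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.assets = []
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.assets.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            self.assets.append(attrs["href"])

base_url = "https://www.example.com/"   # placeholder
t0 = time.time()
page = requests.get(base_url)
base_time = time.time() - t0

collector = AssetCollector()
collector.feed(page.text)

t1 = time.time()
for asset in collector.assets:
    requests.get(urljoin(base_url, asset))
component_time = time.time() - t1

print(f"base page: {base_time:.2f}s, {len(collector.assets)} components: {component_time:.2f}s")

Even this crude measurement usually shows the component downloads dwarfing the base page time, which is why front end optimization pays off.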

Front end Performance Tools

Front-end performance is typically visualized as waterfall charts produced by tools such as the Firebug Net Panel. During development, Firebug is an invaluable tool for understanding and fixing client-side issues. However, to get a true measure of the end user experience on production systems, performance needs to be measured from points on the internet where your customers typically are. Many tools are available to do this, and they vary in price and functionality. Do your research to find a tool that fits your needs.

Back End Performance

The primary goal of measuring back end performance is to understand the maximum throughput that it can sustain. Traditionally, enterprises perform “load testing” of their applications to ensure they can scale. I prefer to call this “scalability testing”. Test clients drive load via bare-bones HTTP clients and measure the throughput of the application, i.e. the number of requests per second it can handle. To find the maximum, the number of client drivers needs to be increased until throughput stops increasing or, worse, starts to drop off.
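To make “bare-bones HTTP client” concrete, here is a minimal sketch of a single closed-loop client (Python with the third-party requests package; the URL and duration are placeholders) that issues requests back-to-back and reports requests per second:

# Minimal closed-loop HTTP driver: one client issuing requests back-to-back
# (no think time) and reporting throughput. URL and duration are placeholders.
import time
import requests

url = "http://test-server.example.com/api/endpoint"   # placeholder
duration = 60                                          # seconds
count = 0
errors = 0

end = time.time() + duration
while time.time() < end:
    try:
        r = requests.get(url, timeout=10)
        if r.status_code == 200:
            count += 1
        else:
            errors += 1
    except requests.RequestException:
        errors += 1

print(f"throughput: {count / duration:.1f} requests/sec, errors: {errors}")

To find the knee, you would run increasing numbers of such clients in parallel and watch where the aggregate throughput flattens or drops.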

For complex multi-tier architectures, it is beneficial to break up the back end analysis by testing the scalability of individual tiers. For example, database scalability can be measured by running a workload just on the database. This can greatly help identify problems and also provides developers and QA engineers with tests they can repeat during subsequent product releases.

Many applications are thrown into production before any scalability testing is done. Things may seem fine until the day the application gets hit with increased traffic (good for business!). If the application crashes and burns because it cannot handle the load, you may not get a second chance.

Back End Performance Tools

Numerous load testing tools exist with varying functionality and price. There are also a number of open source tools available. Depending on resources you have and your budget, you can also outsource your entire scalability testing.

Summary

Front end performance is primarily concerned with measuring end user response times while back end performance is concerned with measuring throughput and scalability.

 

Yet another year has gone by marked by yet another Velocity conference last week. This year the crowds were even bigger – if this conference continues to grow in this manner, it will soon have to be held in Moscone!

I gave myself the weekend to sleep on this instead of rushing to publish ASAP, so that I could gather my notes and reflect on the conference.

The high order bits

Mobile

For me, the best day was the workshop day on Tuesday, specifically the mobile workshops in the afternoon.  I did not attend Maximiliano’s session last year so I am very glad I did this year. I learned a ton and hope to put it to use as I increase my focus on the mobile web. It was clear from this as well as the earlier session by Ariya that the Velocity audience has not yet started to grapple with optimizing the mobile experience.  Lots of very useful, meaty information in both these sessions, so check them out.

Statistics

It was refreshing to see the emphasis on statistics with the two sessions by John Rauser of Amazon. John is obviously a very seasoned speaker and his workshop was very well received. It would be great to see a session next year that takes this a step further into a practical workshop on how to apply statistics when analyzing performance data, including a discussion of confidence intervals.

I would be remiss if I did not also mention the Ignite session on Little’s Law. It was a great way to present this topic to those who have never heard of it, so do check it out.

Dynamic Optimization

It seems the list of companies and products entering this market is growing day by day. These products optimize your site using a variety of technologies. Last year, Strangeloop led the pack but this year there were many more. I was particularly impressed by Cotendo. The company seems to have made a rapid rise in a very short time, with advanced functionality that only very large sites have. Ditto for CloudFlare – I liked the CEO’s Ignite talk as well. If you are in the market for these types of products, I definitely recommend checking them out.

The low order bits

Sponsored keynotes

The myriad sponsored talks. It is one thing to have a sponsored session track (in fact, many sessions in this track were well worth attending), but another to have them be part of the keynote sessions. Considering that keynotes took up half the day on both days, I found that a big chunk of them were worthless.

The language

This conference also gets a low score for language. Since when did it become okay to use foul language at conferences, especially in keynotes that were being streamed live? It seemed to start with Artur Bergman, and many speakers after that seemed to think it was okay to drop the f-word every few minutes.

The number of women

If you looked around the room, there were very few women – I would estimate the female audience at well under 10%. I counted exactly 3 women speakers. At the Velocity summit earlier this year, the claim was that they wanted to increase the participation of women and minorities; I can’t help but wonder what steps were taken to do that. With the new standards for foul language, good luck pulling more women in.

Those who have worked with me know how much I stress the importance of validation: validate your tools, workloads and measurements. Recently, an incident brought home yet again the importance of this tenet, and I hope narrating this story will prove useful to you as well.

For the purposes of keeping things confidential, I will use fictional names. We were tasked with running some performance measurements on a new version of a product called X. The product team had made considerable investment into speeding things up and now wanted to see what the fruits of their labor had produced. The initial measurements seemed good but since performance is always relative, they wanted to see comparisons against another version Y. So similar tests were spun up on Y and sure enough X was faster than Y. The matter would have rested there, if it weren’t for the fact that the news quickly spread and we were soon asked for more details.

Firebug to the rescue

At this point, we took a pause and wondered: are we absolutely sure X is faster than Y? I decided to do some manual validation. That night, connecting via my local ISP from home, I used Firefox to do the same operations that were being performed in the automated performance tests. I launched Firebug and started looking at waterfalls in the Net panel.

As you can probably guess, what I saw was surprising. The first request returned a page that caused a ton of object retrievals. The onload time reported by Firebug was only a few seconds, yet there was no page complete time!

The page seemed complete and I could interact with it. But the fact that Firebug could not determine page complete was a little disconcerting. I repeated the exercise using HttpWatch just to be certain, and it reported exactly the same thing.

Time to dig deeper. Looking at the individual object requests, one in particular used the Comet model and never completed. Waiting a little longer, I saw other requests being sent periodically by the browser. Neither of these request types, however, had any visual impact on the page. Since requests were continuing to be made, Firebug concluded that the page was not complete.

Page Complete or Not?

This begged the question: how did the automated tests run and how did they determine when the page was done? There was a timeout set for each request, but if the request was terminating because of the timeout, we surely would have noticed since the response times reported would have been the timeout value. In fact, the response time being reported was less than half the timeout value.

So we started digging into the waterfalls of some of the automated test results. Lo and behold – a significant component of the response time came from the HTTP push (also known as HTTP streaming) request. There were also several sporadic requests being made well after the page was complete. This resulted in arbitrary response times being reported for Y.

It turned out that the automated tool was actually quite sophisticated. It doesn’t just use a standard timeout for the entire request. Instead it monitors the network and if no activity is detected for 3 seconds, it considers the request complete. So it captured some of the streaming and other post-PageComplete requests and returned when the pause between them was more than 3 seconds. That is why we thought we were seeing “valid” response times which looked reasonable and had us fooled!
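To make the tool’s behavior concrete, here is a toy sketch of a “network quiet” page-complete heuristic (the 3-second quiet period comes from the story above; the request timeline is entirely invented for illustration):

# Toy model of a "network quiet" page-complete heuristic: the page is
# considered done once no request activity is seen for QUIET seconds.
# Request end times (seconds after navigation start) are invented; the
# later ones represent streaming/polling traffic.
QUIET = 3.0
request_end_times = [0.4, 0.9, 1.2, 2.0, 2.4,      # normal page assets
                     4.1, 6.3, 8.0, 14.2]           # streaming / polling

page_complete = None
prev = 0.0
for t in sorted(request_end_times):
    if t - prev > QUIET:
        page_complete = prev    # quiet gap found; last activity before it
        break
    prev = t
else:
    page_complete = prev        # traffic simply stopped with no long gap

print(f"reported page-complete time: {page_complete:.1f}s")

In this made-up timeline the streaming requests keep the gaps under 3 seconds for a while, so the heuristic reports 8.0 seconds – a plausible-looking number that is actually inflated by post-page-complete traffic, which is exactly how we got fooled.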

Of course, this leads to the big discussion of when exactly we should consider an HTTP request complete. I don’t want to get into that now, as my primary purpose in this article is to point out the importance of validation in performance testing. If we had taken the time to validate the results of the initial test runs, this problem would have been detected long ago and lots of cycles could have been saved (not to mention the embarrassment of admitting to others that the initial results reported were wrong!)

One of the first things we performance engineers do with a new server application is to conduct a quick throughput experiment. The goal is to find the maximum throughput that the server can deliver. In many cases, it is important that the server be capable of delivering this throughput with a certain response time bound. Thus, we always qualify the throughput with an average and 90th percentile response time (i.e. we want 90% of the requests to execute within the stated time). Any decent workload should therefore measure both the throughput and response time.
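For reference, here is how the average and a nearest-rank 90th percentile can be computed from raw response-time samples (a sketch; the sample values are invented):

# Average and nearest-rank 90th-percentile response time from raw samples
# (the values below are invented for illustration).
import math

response_times_ms = [12, 15, 14, 18, 22, 13, 95, 17, 16, 250, 14, 19]

samples = sorted(response_times_ms)
avg = sum(samples) / len(samples)
p90 = samples[math.ceil(0.9 * len(samples)) - 1]   # nearest-rank percentile

print(f"avg = {avg:.1f} ms, 90th percentile = {p90} ms")

Note how a few slow outliers barely move the 90th percentile but can drag the average around, which is why both numbers are worth reporting.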

Let us assume we have such a workload. How best to estimate the maximum throughput within the required response time bounds? The easiest way to conduct such an experiment is to run a bunch of clients (emulated users, virtual users or vusers) to drive load against the target server without any think time. Here is what the flow for a vuser looks like:

Create Request ==> Send Request ==> Receive Response ==> Log statistics

This sequence of operations is executed repeatedly (without any pauses in between, i.e. no think times) for a sufficient length of time to get statistically valid results. So, to find the maximum throughput, run a series of tests, each time increasing the number of clients. Simple, isn’t it?
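Here is a rough sketch of that series in code (Python threads, with a placeholder URL, step sizes and duration; a real harness such as Faban handles this far more scalably, but the structure is the same):

# Step-load runner: for each vuser count, run closed-loop clients (no think
# time) for a fixed interval and report throughput. URL, steps and duration
# are placeholders; a real test would also discard a warm-up period.
import threading
import time
import requests

URL = "http://test-server.example.com/"   # placeholder
DURATION = 60                              # seconds per load level

def vuser(stop_time, counter, lock):
    n = 0
    while time.time() < stop_time:
        try:
            requests.get(URL, timeout=10)   # Create/Send Request, Receive Response
            n += 1                          # Log statistics
        except requests.RequestException:
            pass
    with lock:
        counter[0] += n

for vusers in (1, 10, 50, 100, 200, 300):
    counter, lock = [0], threading.Lock()
    stop = time.time() + DURATION
    threads = [threading.Thread(target=vuser, args=(stop, counter, lock))
               for _ in range(vusers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"{vusers:5d} vusers: {counter[0] / DURATION:8.1f} requests/sec")

Python threads will not scale to thousands of vusers, but for the modest client counts discussed below the structure is representative.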

A little while ago, I realized that if one doesn’t have the proper training, this isn’t that simple. I came across such an experiment with the following results:

VUsers   Throughput (requests/sec)
 5000    88318
10000    88407
20000    88309
25000    88429
30000    88392
35000    88440

What is wrong with these results?

Firstly, the throughput is roughly the same at all loads. This probably means that the system saturated even below the base load level of 5,000 Vusers. Recall that the workload does not have a think time. When you have this many users repeatedly submitting requests, the server is certain to be overloaded.

I should mention that the server in this case is a single system with 12 cores and hyper-threading enabled, i.e. 24 hardware threads. A multi-threaded server application typically uses one or more threads to receive requests from the network, then hands each request to a worker thread for processing. Considering the context switching, waiting, locking, etc., one can assume that at most about 4x the number of hardware threads can run effectively, or in this case about 96 server threads. Since each Vuser submits a request and waits for a response, it probably takes about 2-2.5x as many Vusers as server threads to saturate the system. Using this rule of thumb, one would need to run at most 200-250 Vusers.
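The arithmetic behind that rule of thumb fits in a few lines (a sketch; the multipliers are just the rough rules of thumb from the paragraph above, not hard limits):

# Rough Vuser sizing from the rules of thumb above: ~4 runnable server
# threads per hardware thread, and 2-2.5 Vusers per server thread for a
# closed loop with no think time.
cores = 12
logical_cpus = cores * 2                      # hyper-threading enabled
server_threads = 4 * logical_cpus             # ~96
vusers_low = 2 * server_threads
vusers_high = int(2.5 * server_threads)
print(f"~{server_threads} server threads -> run roughly {vusers_low}-{vusers_high} Vusers")
# ~96 server threads -> run roughly 192-240 Vusers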

Throughput with Correct Vusers

After I explained the above, the tests were re-run with the following results:

VUsers   Throughput (requests/sec)
    1     1833
   10    18226
   50    74684
  100    86069
  200    88455
  300    88375

Notice that the maximum throughput is still nearly the same as in the previous set, but it has been achieved with a much lower number of Vusers (aka clients). So does it really matter? Doesn’t it look better to say that the server could handle 35000 connections rather than 300? No, it doesn’t. The reason becomes obvious if we take a look at the response times.

The Impact of Response Times

The graph below shows how the 90% response time varied for both sets of experiments:

90% Response Time (Minimal Vusers)

The response times for the first experiment, with a very large number of Vusers, ranged in the hundreds of milliseconds. When the number of Vusers was pared down to just reach saturation, the server responded a hundred times faster! Intuitively too, this makes sense. If the server is inundated with requests, they are just going to queue up. The longer a request waits for processing, the larger its response time.

Summary

When doing throughput performance experiments, it is important to take into consideration the type of server application, the hardware characteristics, etc., and run appropriate load levels. Otherwise, although you may be able to find the maximum throughput, you will have no idea what the response time is.

Recently, an engineer came to me puzzled that the response times of a performance benchmark she was running were increasing. She had already looked at the usual trouble spots – resource utilization on the app and database systems, database statistics, application server tunables, the network stack, etc. I asked her about the CPU metrics on the load driver systems (the machines that drive the load). Usually, when I ask this question, the answer is “I don’t know. Let me find out and get back to you”. But this engineer had looked at that as well: “It isn’t a problem. There is plenty of CPU left – I have 30% idle”.

Aha – I had spotted the problem. When we run benchmarks, we tend to squeeze every bit of performance we can out of the systems. This means running the servers as close to 100% utilization as possible. This mantra is sometimes carried over to the load driver systems as well. Unfortunately, that can result in severe performance degradation. Here’s why.

The load driver systems emulate users and generate requests to the system under test. They receive the responses, and measure and record response times. A typical driver emulates hundreds to thousands of users, and each emulated user competes for system resources. Now suppose an emulated user has issued a read to receive the response from the server. It is very likely that this thread will be context-switched out by the operating system, as there are so many other users it needs to serve. Depending on the number of CPUs on the system and the load, the original emulated user thread may get to execute only after a considerable delay and consequently record a much larger response time. My rule of thumb is never to run the load generator systems more than 50% busy if the application is latency sensitive. In this particular case, the system was already 70% utilized.
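A simple safeguard is to sample CPU utilization on the driver machines while the test runs and flag anything over that 50% threshold. Here is a sketch (it assumes the third-party psutil package is installed; stop it with Ctrl-C):

# Sample load-driver CPU utilization during a test and warn when it crosses
# the 50% rule of thumb. Assumes the psutil package is installed.
import time
import psutil

THRESHOLD = 50.0   # percent busy
INTERVAL = 5       # seconds between samples

while True:
    busy = psutil.cpu_percent(interval=INTERVAL)   # % non-idle over the interval
    tag = "  <-- driver too busy, response times are suspect" if busy > THRESHOLD else ""
    print(f"{time.strftime('%H:%M:%S')} driver CPU {busy:5.1f}%{tag}")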

Sure enough – when a new load driver system was added and the performance tests re-run, all the response time criteria passed and the engineer could continue scaling the benchmark.

Moral of the story – don’t forget to monitor your load driver systems and don’t be complacent if their utilization starts climbing above 50%.

Many web applications are now moving to the cloud, where configurations are difficult to understand; see, for example, Amazon Web Services’ (AWS) definition of an EC2 Compute Unit. How does one determine how many instances of what type are required to run an application? Typical capacity planning exercises start with measurements. For example, one might test the targeted app by deploying it on, say, an EC2 m1.small instance (1 EC2 Compute Unit) and see how many users it can support. Based on performance metrics gathered during the test, one can estimate how many instances will be required (assuming, of course, that the application scales horizontally).
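The resulting back-of-the-envelope calculation is simple once you have a per-instance measurement; here is a sketch (the numbers are placeholders, and it assumes the application scales roughly horizontally, with some headroom built in):

# Back-of-the-envelope instance count from a single-instance measurement.
# Numbers are placeholders; assumes the app scales roughly horizontally.
import math

users_per_small_instance = 20      # measured, e.g. where throughput flattened
target_concurrent_users = 5000     # expected production load
headroom = 0.7                     # run instances at ~70% of measured capacity

instances = math.ceil(target_concurrent_users / (users_per_small_instance * headroom))
print(f"need about {instances} m1.small instances")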

The Tests

To test this simplistic model, I fired up an EC2 m1.small instance running Ubuntu and started the Apache web server. I used another instance as the load driver to repeatedly fetch a single hello-world PHP page from the web server, and scaled up the number of users from 1 to 50 in increments of 10. The test was written and driven by Faban, a versatile open-source performance testing tool.

The Measurements

On Unix systems, tools like vmstat and mpstat can be used to measure CPU utilization (amongst other things). Faban automatically runs these tools on any test, for the same duration as the test run, allowing one to monitor resource utilization during the test. (Note that on Ubuntu, mpstat is not installed by default but is available as part of the sysstat package.)

The Results

Here is the throughput graph as the number of virtual users was scaled.

PHP Throughput

The throughput peaks at 20 users and then flattens out (actually falls a little bit). Looking at this graph, one would expect that the CPU saturated around 20 users (assuming no other bottleneck, which is a reasonable assumption for this extremely simple application).

Here is a snippet of the vmstat output captured during the run at 50 users:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0      0 661096  18596 961772    0    0     6   104 1108  826  3  2 87  0
 1  0      0 658116  18616 964028    0    0     0   237 2788 1966 26  9  9  0
 3  0      0 655856  18632 966296    0    0     0   244 2699 1959 24 12  6  0
 2  0      0 653376  18652 968700    0    0     0   236 2943 2069 24 11  9  0
 1  0      0 651020  18668 970972    0    0     0   240 2842 1963 25 10  6  0
 2  0      0 648680  18688 973224    0    0     0   241 2763 1954 24 11  9  0

The user time (the us column) averages 24.6% and the system time (the sy column) averages 10.6%, for a total of 35.2%. But the idle time (the id column) is only around 8% – no wonder the throughput stopped increasing. So what is the discrepancy here? If user and system time account for only 35%, where is the remaining time going?

To understand that, take a look at the mpstat output snippet below:

12:57:43 AM  CPU  %user  %nice   %sys  %iowait  %irq  %soft  %steal  %idle   intr/s
12:57:53 AM  all  25.73   0.00   8.71     0.00  0.00   0.30   55.96   9.31  2763.26
12:58:03 AM  all  24.21   0.00  11.31     0.00  0.00   0.40   58.93   5.16  2661.41
12:58:13 AM  all  24.09   0.00  10.07     0.00  0.00   0.89   55.58   9.38  2904.84
12:58:23 AM  all  25.15   0.00   9.34     0.00  0.00   0.89   59.05   5.57  2824.95
12:58:33 AM  all  23.78   0.00   9.99     0.00  0.00   1.20   56.14   8.89  2760.54
12:58:43 AM  all  22.26   0.00  11.58     0.00  0.00   0.40   54.79  10.98  2835.83

We can see that the %user, %sys and %idle column values match those shown by vmstat. But there is an additional utilization column, %steal, which ranges from 55% to 59%. If we add this value to user, sys and idle, we get 100%. So that’s where the missing time has gone – to %steal.

Who is stealing my CPU?

But what exactly is %steal? It is the time during which your application had something ready to run but the CPU was busy servicing some other instance. Clearly, this is not the only application running on this CPU. The m1.small instance is defined as providing “1 EC2 Compute Unit”, not 1 CPU.

In this case, the one instance was worth about 35% of the single CPU on this system (an Intel Xeon E5430 @ 2.66GHz).

When looking at CPU utilization on EC2 (or any virtualized environment based on Xen), keep this in mind. Always consider %steal in addition to the user and system time.
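If you want to keep an eye on %steal without running mpstat, the raw counters are also available in /proc/stat on Linux. Here is a small sketch (it assumes a reasonably recent kernel that reports the steal field) that samples the counters twice and prints the percentages:

# Compute CPU %user, %sys, %idle and %steal from /proc/stat (Linux only).
# Fields on the aggregate "cpu" line: user nice system idle iowait irq
# softirq steal guest guest_nice.
import time

def cpu_times():
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]
    return [int(x) for x in fields]

t1 = cpu_times()
time.sleep(5)
t2 = cpu_times()
delta = [b - a for a, b in zip(t1, t2)]
total = sum(delta)

user, nice, system, idle, iowait, irq, softirq, steal = delta[:8]
print(f"%user={100*user/total:.1f} %sys={100*system/total:.1f} "
      f"%idle={100*idle/total:.1f} %steal={100*steal/total:.1f}")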

