Archive for the ‘Performance’ Category
I just posted an article on the Box Tech blog on how to perform load testing in production – check it out.
The focus of that article is applying load to ensure that the system is stable and performs reasonably. This should be distinguished from Scalability Testing where the goal is to measure the scalability of the application for performance analysis as well as capacity planning.
To test scalability, it is important to increase the load methodically so that performance can be measured and plotted. For example, we can start by running 100 emulated users, then increment in steps of 100 until we reach 1000 users. This gives us 10 data points, which is sufficient for a good estimate of scalability. To compute a scalability metric, take a look at Gunther’s USL (Universal Scalability Law).
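As a rough illustration, here is a sketch in Python of fitting the USL to a set of throughput measurements via a crude grid search. The data points are made up, and the search ranges for the contention and coherency terms are assumptions; a real analysis would use a proper regression tool.

```python
import math

def usl(n, lam, sigma, kappa):
    """Universal Scalability Law: predicted throughput at concurrency n."""
    return (lam * n) / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# Hypothetical measurements: (users, throughput) from runs at 100..1000 users.
data = [(100, 4200), (200, 7300), (300, 9500), (400, 11000), (500, 11900),
        (600, 12300), (700, 12350), (800, 12200), (900, 11900), (1000, 11500)]

lam = data[0][1] / data[0][0]  # rough per-user throughput at low load

# Crude grid search for the contention (sigma) and coherency (kappa) terms.
best = None
for s in (i / 1000 for i in range(0, 200)):
    for k in (j / 1e6 for j in range(0, 200)):
        err = sum((x - usl(n, lam, s, k)) ** 2 for n, x in data)
        if best is None or err < best[0]:
            best = (err, s, k)

_, sigma, kappa = best
# The USL peak occurs at sqrt((1 - sigma) / kappa).
n_peak = math.sqrt((1 - sigma) / kappa) if kappa > 0 else float("inf")
print(f"sigma={sigma:.3f} kappa={kappa:.6f} peak concurrency ~ {n_peak:.0f}")
```

With data that flattens and then declines (as above), the fitted peak lands near the observed knee, which is exactly the number you want for capacity planning.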
I’ve been working for a while now to revamp the entire stats processing and graphing design in Faban. For those who haven’t heard of Faban, it is a performance tool that helps you create and run workloads. Faban currently uses a technology called Fenxi to process and graph stats. Fenxi has given us lots of problems over time – it is poorly designed, lacks flexibility, and doesn’t even seem to be maintained anymore. So I decided to get rid of it entirely.
I am really excited by the changes. I think this is one of the major enhancements to Faban since Akara (the original Faban architect) and I left Sun/Oracle.
So without further ado, here are the upcoming changes:
- New, cooler-looking dynamic graphs
I’m using jqPlot, which produces nice-looking graphs by default. The ability to zoom in on a section of a graph is very handy, as is seeing the data point values as you mouse over the graph. This feature alone, I think, will make the Faban UI feel more modern.
- Support for graphing stats from Linux
- New Viewer to graph all xan files
Going by the many posts in various LinkedIn groups and blogs, there seems to be some confusion about how to measure and analyze a web application’s performance. This article tries to clarify the different aspects of web performance and how to go about measuring it, explaining key terms and concepts along the way.
Web Application Architecture
The diagram below shows a high-level view of typical architectures of web applications.
The simplest applications have the web and app tiers combined while more complex ones may have multiple application tiers (called “middleware”) as well as multiple datastores.
The front end refers to the web tier that generates the HTML response for the browser.
The back end refers to the server components that are responsible for the business logic.
Note that in architectures where a single web/app server tier is responsible for both the front and back ends, it is still useful to think of them as logically separate for the purposes of performance analysis.
Front End Performance
When measuring front end performance, we are primarily concerned with understanding the response time that the user (sitting in front of a browser) experiences. This is typically measured as the time taken to load a web page. Performance of the front end depends on the following:
- Time taken to generate the base page
- Browser parse time
- Time to download all of the components on the page (css,js,images,etc.)
- Browser render time of the page
For most applications, the response time is dominated by the third bullet above, i.e. the time spent by the browser retrieving all of the components on a page. As pages have become increasingly complex, their sizes have mushroomed as well – it is not uncommon to see pages of 0.5 MB or more. Depending on where the user is located, it can take a significant amount of time for the browser to fetch components across the internet.
Front end Performance Tools
Front-end performance is typically viewed as waterfall charts produced by tools such as the Firebug Net Panel. During development, Firebug is an invaluable tool for understanding and fixing client-side issues. However, to get a true measure of the end-user experience on production systems, performance needs to be measured from points on the internet where your customers typically are. Many tools are available to do this, and they vary in price and functionality. Do your research to find a tool that fits your needs.
Back End Performance
The primary goal of measuring back end performance is to understand the maximum throughput that it can sustain. Traditionally, enterprises perform “load testing” of their applications to ensure they can scale. I prefer to call this “scalability testing”. Test clients drive load via bare-bones HTTP clients and measure the throughput of the application, i.e. the number of requests per second it can handle. To increase the throughput, the number of client drivers needs to be increased until the point where throughput stops increasing or, worse, starts to drop off.
For complex multi-tier architectures, it is beneficial to break-up the back end analysis by testing the scalability of individual tiers. For example, database scalability can be measured by running a workload just on the database. This can greatly help identify problems and also provides developers and QA engineers with tests they can repeat during subsequent product releases.
Many applications are thrown into production before any scalability testing is done. Things may seem fine until the day the application gets hit with increased traffic (good for business!). If the application crashes and burns because it cannot handle the load, you may not get a second chance.
Back End Performance Tools
Numerous load testing tools exist with varying functionality and price. There are also a number of open source tools available. Depending on resources you have and your budget, you can also outsource your entire scalability testing.
Front end performance is primarily concerned with measuring end user response times while back end performance is concerned with measuring throughput and scalability.
Service Level Agreements (SLAs) usually specify a response time criteria that must be met. Although SLAs can have a wide range of metrics like throughput, up time, availability etc., we will focus on response times in this article.
We often hear phrases like the following:
- “The response time was 5 seconds”
- “This product’s performance is much worse than slowpoke’s. It takes longer to respond.”
- “Our whizbang product can perform 100 transactions/sec with a response time of 10 seconds or less”
Do you see anything wrong with these statements? Although they sound fine for general conversation, anyone interested in performance should really be asking what exactly they mean.
Let’s take the first statement above and make the assumption that it refers to a particular page in a web application. When someone says that the response time is 5 seconds, does it mean that when this user typed in the URL of this page, the browser took 5 seconds to respond? Or does it mean that in an automated test repeatedly accessing this page, the average response time was 5 seconds? Or perhaps, the median response time was 5 seconds?
You get the idea. For some reason, people tend to talk loosely about response times. Without going into details of how to measure the response time (that’s a separate topic), this article will focus on what is a meaningful response time metric.
For purposes of this discussion, let us assume we are measuring the response time of a transaction (which can be anything – web, database, cache, etc.). What is the most meaningful measure of the response time of a transaction?
Mean Response Time
This is the most common measure of response time but, alas, usually the most flawed as well. The mean or average response time simply adds up the individual response times from multiple measurements and divides by the number of samples. This may be fine if the measurements are fairly evenly distributed over a narrow range, as in Figure 1.
- Figure 1: Steady Response Times
- Figure 2: Varying Response Times
But if the measurements vary quite a bit over a large range like in Figure 2, the average response time is not meaningful. Both figures have the same scale and show response times on the y axis for samples taken over a period of time (x axis).
Median Response Time
If the average is not a good representation of a distribution, perhaps the median is? After all, the median marks the 50th percentile of a distribution. The median is useful when the response times have a normal distribution but include a few outliers; in that case, it helps weed out the outliers. The key here is few outliers. It is important to realize that if 50% of the transactions are within the specified time, the remaining 50% have a higher response time. Surely, a response time specification that leaves out half the population cannot be a good measure.
90th or 95th percentile Response Time
In standard benchmarks, it is common to see 90th percentile response times used. The benchmark may specify that the 90th percentile response time of a transaction should be within x seconds. This means that only 10% of the transactions have a response time higher than x seconds and can therefore be a meaningful measure. For web applications, the requirements are usually even higher – after all, if 10% of your users are dissatisfied with the site performance, that could be a significant number of users. Therefore, it is common to see 95th percentile used for SLAs in web applications.
A word of caution – web page response times can vary dramatically if measured at the last mile (i.e. real users’ computers that are connected via cable or DSL to the internet). Figure 3 shows the distribution of response times for such a measurement.
- Figure 3: Response Time Histogram
It uses the same data as in Figure 2. The mean response time for this data set is 12.9 secs, and the median is even lower at 12.3 secs. Clearly, neither of these measures covers any significant range of the actual response times. The 90th percentile is 17.3 secs and the 95th is 18.6 secs. These are much better measures for the response time of this distribution and will work better as the SLA.
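Given raw response time samples, these statistics are straightforward to compute. The sketch below uses synthetic data (roughly mimicking a long-tailed last-mile distribution, not the actual Figure 2 data) and a nearest-rank percentile, which is one of several common percentile definitions:

```python
import math
import random
import statistics

random.seed(42)

# Synthetic last-mile response times (secs): mostly ~12s with a slow tail.
times = [random.gauss(12.0, 1.5) for _ in range(950)] + \
        [random.uniform(16.0, 25.0) for _ in range(50)]

def percentile(samples, pct):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

print(f"mean   = {statistics.mean(times):.1f}s")
print(f"median = {statistics.median(times):.1f}s")
print(f"p90    = {percentile(times, 90):.1f}s")
print(f"p95    = {percentile(times, 95):.1f}s")
```

On a long-tailed distribution like this, the mean and median sit well below the p90/p95 values, which is exactly why the higher percentiles make better SLA metrics.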
To summarize, it is important to look at the distribution of response times before attempting to define an SLA. Like many other metrics, a one size fits all approach does not work. Response time measurements on the server side tend to vary a lot less than on the client. A 90th or 95th percentile response time requirement is a good choice to ensure that the vast majority of clients are covered.
We often hear the terms Load Testing or Performance Testing, but no one talks much about Scalability Testing. Before I go further, let me define these terms so you know what I am talking about:
- Load Testing refers to the kind of testing usually done by QA organizations to ensure that the application can handle a certain load level. Criteria are set to ensure that releases of a product meet certain conditions like the number of users they can support while delivering a certain response time.
- Performance Testing on the other hand, refers to testing done to analyze and improve the performance of an application. The focus here is on optimization of resource consumption by analyzing data collected during testing. Performance Testing to a certain extent should be done by developers but more elaborate, large scale testing may be conducted by a separate performance team. In some organizations, the performance team is a part of the QA function.
- Scalability Testing refers to performance testing that is focused on understanding how an application scales as it is deployed on larger and/or more systems, or as more load is applied to it. The goal is to understand at what point the application stops scaling and identify the reasons for this. As such, scalability testing can be viewed as a kind of performance testing.
In this article, we will consider how scalability testing should be done to ensure that the results are meaningful.
The first requirement for any performance testing is a well-designed workload. See my Workload Design paper for details on how to properly design a workload. Many developers and QA engineers typically craft a workload quickly by focusing on a couple of different operations (e.g. if testing a web application, a recording tool is used to create one or two scenarios). I will point out the pitfalls of this method in another post. So take care while creating your workload. Extra time invested in this step will more than pay off in the long run. Remember, your test results are only as good as the tests you create!
Designing Scalability Tests
Scalability tests should be planned and executed in a systematic manner to ensure that all relevant information is collected. The parameter by which load is increased obviously depends on the type of app – for web apps, this would typically be the number of simultaneous users making requests of the site. Think about what other parameters might change for your application. If the application accesses a database, will the size of the database change in some relation to the number of users accessing it? If it uses a caching tier, might it be reasonable to expect that the size of this cache will expand? Consider the data accessed by your workload – how is this likely to change? Both the data generator and load generator drivers need to be implemented in a way that supports workload and data scaling.
Collecting Performance Data
When running the tests, ensure you can collect sufficient performance metrics to understand what exactly is happening in the application infrastructure. One set of metrics comes from the system infrastructure – CPU, memory, swap, network and disk I/O data. Another comes from the software infrastructure – web, application, caching (memcached) and database servers all provide access to performance data. Don’t forget to collect data on the load driver systems as well. I have seen many a situation in which the driver ran out of memory or swap and it took a while to figure this out because no one was looking at the driver stats! All performance metrics should be collected for the same duration as the test run.
Running Scalability Tests
With planning done, it is time to run the performance tests. You want to start at a comfortable scale factor – say 100 users – and increment by the same factor every time (e.g. 100 users at a time). Some tools let you run a single test while varying the load – although this may be acceptable for load testing, I would discourage such short-cuts for scalability testing. The goal is not just to get to the maximum load but to understand how the system behaves at every step. Without the detailed performance data, it is difficult to do scalability analysis. Do scaling runs to a point a little beyond where the system stops scaling (i.e. throughput stays flat or, worse, starts to fall) or you run out of system resources.
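The stepping pattern is simple to sketch in Python. Here `run_load` is a stub with a synthetic throughput curve, standing in for whatever your load tool actually reports at each user count:

```python
def run_load(users):
    """Stand-in for a real load-generator run; returns requests/sec.
    Replace with a call to your load tool of choice."""
    # Synthetic curve that scales linearly, saturates at 600 users,
    # then degrades slightly beyond that.
    return min(users * 40, 24000) - max(0, users - 600) * 5

results = []
prev = 0.0
for users in range(100, 1001, 100):
    throughput = run_load(users)
    results.append((users, throughput))
    print(f"{users:5d} users -> {throughput:8.1f} req/s")
    # Stop once throughput flattens or falls (here: within 2% of the
    # previous step, or lower).
    if prev and throughput <= prev * 1.02:
        break
    prev = throughput
```

In a real campaign you would keep the full per-step stats (not just throughput) and run a step or two beyond the knee rather than stopping at it, so the degradation region is captured too.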
When load testing an application, the first set of tests should focus on measuring the maximum throughput. This is especially true of multi-user, interactive applications like web applications. The maximum throughput is best measured by running a few emulated users with zero think time. This means that each emulated user sends a request, receives a response and immediately loops back to send the next request. Although this is artificial, it is the best way to quickly determine the maximum performance of the server infrastructure.
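As a toy illustration of a zero-think-time driver, here is a self-contained Python sketch. It spins up a trivial HTTP server in-process (so the resulting number means nothing for a real system) and hammers it with a few emulated users that loop back immediately after each response:

```python
import threading
import time
import urllib.request
from http.server import HTTPServer, BaseHTTPRequestHandler

class Hello(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # silence per-request logging
        pass

# Serve on an ephemeral local port in a background thread.
server = HTTPServer(("127.0.0.1", 0), Hello)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

DURATION = 2.0   # seconds to drive load
USERS = 4        # emulated users, zero think time
counts = [0] * USERS

def user(i):
    end = time.time() + DURATION
    while time.time() < end:
        urllib.request.urlopen(url).read()  # request, response, loop immediately
        counts[i] += 1

threads = [threading.Thread(target=user, args=(i,)) for i in range(USERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
server.shutdown()

print(f"throughput ~ {sum(counts) / DURATION:.0f} req/s")
```

A real driver would use a proper load tool (Faban, for instance) against the actual server infrastructure; the structure – N tight request loops with no pause – is the point here.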
Once you have that throughput (say X), we can use Little’s Law to estimate the number of real simultaneous users that the application can support. In simple terms, Little’s Law states that :
N = X / λ
where N is the number of concurrent users, λ is the average arrival rate and X is the throughput. Note that the arrival rate is the inverse of the inter-arrival time i.e. the time between requests.
To understand this better, let’s take a concrete example from some tests I ran on a basic PHP script deployed in an Apache server. The maximum throughput obtained was 2011.763 requests/sec with an average response time of 6.737 ms and an average think time of 0.003 secs when running 20 users. The arrival rate is the inverse of the inter-arrival time, which is the sum of the response time and think time. In this case, X is 2011.763 and λ is 1/(0.006737 + 0.003). Therefore,
N = X / λ = 2011.763 * 0.009737 = 19.5885
This is pretty close to the actual number of emulated users which is 20.
Estimating Concurrent Users
This is all well and good, but how does this help us estimate the number of real concurrent users (with non-zero think time) that the system can support? Using the same example as above, let us assume that if this were a real application, the average inter-arrival time would be 5 seconds. Using Little’s Law, we can now compute N as:
N = X / λ = 2011.763 * 5 = 10058 users.
In other words, this application running on this same infrastructure can support more than 10,000 concurrent users with an inter-arrival time of 5 seconds.
What does this say about think times? If we assume that the application (and infrastructure) will continue to perform in the same manner as the number of connected users increases (i.e. it maintains the average response time of 0.006737 seconds), then the average think time is 4.993 seconds. If the response time degrades as load goes up (which is usually the case after a certain point), then the number of users supported will correspondingly decrease.
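Both calculations above are easy to sanity-check in a few lines of Python, using the numbers from my test:

```python
# Numbers from the Apache/PHP test described above.
X = 2011.763   # max throughput, requests/sec (measured with near-zero think time)
R = 0.006737   # average response time, secs
Z = 0.003      # average think time, secs

# Little's Law: N = X / lambda, with lambda = 1 / (R + Z),
# which is the same as N = X * (R + Z).
n_emulated = X * (R + Z)
print(f"emulated users ~ {n_emulated:.1f}")

# For real users with a 5-second inter-arrival time:
n_real = X * 5.0
print(f"supported concurrent users ~ {n_real:.0f}")
```

The first figure comes out very close to the 20 emulated users actually run, and the second is the roughly 10,000-user capacity estimate derived above.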
A well-designed application can scale linearly to support tens or hundreds of thousands of users. In the case of large websites like Facebook, eBay and Flickr, the applications scale to handle millions of users. But obviously, these companies have invested tremendously to ensure that their applications and software infrastructure can scale.
Little’s Law can be used to estimate the maximum number of concurrent users that your application can support. As such, it is a handy tool to get a quick, rough idea. For example, if Little’s Law indicates that the application can only support 10,000 users but your target is really 20,000 users, you know you have work to do to improve basic performance.
Yet another year has gone by marked by yet another Velocity conference last week. This year the crowds were even bigger – if this conference continues to grow in this manner, it will soon have to be held in Moscone!
I gave myself the weekend to sleep on this instead of rushing to publish ASAP, so that I could gather my notes and reflect on the conference.
The high order bits
For me, the best day was the workshop day on Tuesday, specifically the mobile workshops in the afternoon. I did not attend Maximiliano’s session last year so I am very glad I did this year. I learned a ton and hope to put it to use as I increase my focus on the mobile web. It was clear from this as well as the earlier session by Ariya that the Velocity audience has not yet started to grapple with optimizing the mobile experience. Lots of very useful, meaty information in both these sessions, so check them out.
It was refreshing to see the emphasis on statistics with the two sessions by John Rauser of Amazon. John is obviously a very seasoned speaker and his workshop was very well received. It would be great to see a workshop next year that takes this a step further into a practical workshop on how to apply statistics in analyzing performance data, including a discussion of confidence intervals.
I would be remiss if I did not also mention the Ignite session on Little’s Law. It was a great way to present this topic to those who have never heard of it, so do check it out.
It seems the list of companies and products entering this market is growing day by day. These products optimize your site using a variety of technologies. Last year, Strangeloop led the pack, but this year there were many more. I was particularly impressed by Cotendo. The company seems to have made a rapid rise in a very short time, with advanced functionality that previously only very large sites had. Ditto for CloudFlare – I liked the CEO’s Ignite talk as well. If you are in the market for this type of product, I definitely recommend checking them out.
The low order bits
The myriad sponsored talks. It is one thing to have a sponsored session track (in fact, many sessions in this track were well worth attending), but another to have them be part of the keynote sessions. Considering that keynotes took up half the day on both days, I found a big chunk of them worthless.
This conference also gets a low score for language. Since when did it become okay to use foul language at conferences, especially in keynotes that were being streamed live? It seemed to start with Artur Bergman, and many speakers after that seemed to think it was okay to drop the f-word every few minutes.
The number of women
If you looked around the room, there were very few women – I would estimate the female audience at well under 10%. I counted exactly 3 women speakers. At the Velocity summit earlier this year, the claim was that they wanted to increase the participation of women and minorities; I can’t help wondering what steps were taken to do that. With the new standards for foul language, good luck pulling more women in.