I recently checked in a feature that allows fairly extensive comparisons of different runs in Faban. Although the ‘Compare’ button has been part of the Results list view for a while, it has long been broken. It finally works!
When to use Compare
The purpose of this feature is to compare runs performed at the same load level (a.k.a. Scale in Faban) and on the same benchmark rig. Perhaps you are tuning certain configs and/or code and are doing runs to analyze the performance differences between these changes. The Compare feature lets you look at multiple runs at the same time on multiple dimensions: throughput, average and 90th percentile response times, average CPU utilization, etc. This gives a single-page view that can quickly point out where one run differs from another.
How to use Compare
This is easy. On the results view in the dashboard, simply select the runs you want to compare using the check box at the left of each row. Then click the Compare button at the top of the screen.
The screenshot below shows this operation:
The first part of the comparison report looks like the image below. The report combines tables with graphs to make the data easy to interpret. For example, Run Information is a summary table that describes the runs, whereas Throughput is a graph that shows how the overall throughput varied over the length of the test for all runs.
How can I get this code?
The code is currently in the main branch of the Faban code on GitHub. Fork it and try it out. Once I get some feedback, I will fix any issues and cut a new binary release.
I just posted an article on how to perform load testing in production on the Box Tech blog – check it out.
The focus of that article is applying load to ensure that the system is stable and performs reasonably. This should be distinguished from Scalability Testing where the goal is to measure the scalability of the application for performance analysis as well as capacity planning.
To test scalability, it is important to increase the load methodically so that performance can be measured and plotted. For example, we can start by running 100 emulated users, then increment in steps of 100 until we reach 1000 users. This gives us 10 data points, which is sufficient for a good estimate of scalability. To compute a scalability metric, take a look at Gunther’s USL (Universal Scalability Law).
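To give a feel for what a USL calculation looks like, here is the formula as a small Python sketch. The parameter values below are purely illustrative assumptions – in practice you would fit alpha and beta to the data points measured in your scaling runs:

```python
def usl_throughput(n, lam, alpha, beta):
    """Universal Scalability Law: predicted throughput at load n.

    lam   - throughput at n = 1 (linear coefficient)
    alpha - contention (serialization) coefficient
    beta  - coherency (crosstalk) coefficient
    """
    return (lam * n) / (1 + alpha * (n - 1) + beta * n * (n - 1))

# Illustrative parameters only; fit alpha and beta to measured data.
for users in range(100, 1100, 100):
    print(users, round(usl_throughput(users, 1.0, 0.02, 0.0001), 1))
```

With a non-zero beta the predicted throughput eventually peaks and then declines, which is exactly the retrograde behavior a scaling test is trying to locate.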
I’ve been working for a while now to revamp the entire stats processing and graphing design in Faban. For those who haven’t heard of Faban, it is a performance tool that helps you create and run workloads. Faban currently uses a technology called Fenxi to process and graph stats. Fenxi has given us lots of problems over time – it is poorly designed, lacks flexibility and doesn’t even seem to be maintained anymore. So I decided to get rid of it entirely.
I am really excited by the changes. I think this is one of the major enhancements to Faban since Akara (the original Faban architect) and I left Sun/Oracle.
So without further ado, here are the upcoming changes:
- New cooler looking dynamic graphs
I’m using jqPlot, which produces good-looking graphs by default. The ability to zoom in on a section of a graph is very handy, as is seeing the data point values as you mouse over the graph. This feature alone, I think, will make the Faban UI feel more modern.
- Provide support for graphing stats from Linux.
- New Viewer to graph all xan files
Going by the many posts in various LinkedIn groups and blogs, there seems to be some confusion about how to measure and analyze a web application’s performance. This article tries to clarify the different aspects of web performance and how to go about measuring it, explaining key terms and concepts along the way.
Web Application Architecture
The diagram below shows a high-level view of typical architectures of web applications.
The simplest applications have the web and app tiers combined while more complex ones may have multiple application tiers (called “middleware”) as well as multiple datastores.
The Front end refers to the web tier that generates the HTML response for the browser.
The Back end refers to the server components that are responsible for the business logic.
Note that in architectures where a single web/app server tier is responsible for both the front and back ends, it is still useful to think of them as logically separate for the purposes of performance analysis.
Front End Performance
When measuring front end performance, we are primarily concerned with understanding the response time that the user (sitting in front of a browser) experiences. This is typically measured as the time taken to load a web page. Performance of the front end depends on the following:
- Time taken to generate the base page
- Browser parse time
- Time to download all of the components on the page (CSS, JS, images, etc.)
- Browser render time of the page
For most applications, the response time is dominated by the third bullet above, i.e., the time spent by the browser retrieving all of the components on a page. As pages have become increasingly complex, their sizes have mushroomed as well – it is not uncommon to see pages of 0.5 MB or more. Depending on where the user is located, it can take a significant amount of time for the browser to fetch components across the internet.
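A back-of-envelope estimate shows why component downloads dominate. All of the numbers below are illustrative assumptions, not measurements from any real site:

```python
# Rough lower bound on component download time for a 0.5 MB page
# (all numbers are illustrative assumptions).
page_bytes = 0.5 * 1024 * 1024        # 0.5 MB of total page content
bandwidth_bps = 2 * 1024 * 1024 / 8   # assume a 2 Mbit/s last-mile link, in bytes/s
components = 40                       # assumed number of objects on the page
rtt = 0.08                            # assume an 80 ms round-trip time
parallel_connections = 6              # typical per-host browser connection limit

transfer_time = page_bytes / bandwidth_bps
# Each object costs at least one round trip; connections fetch in parallel.
request_overhead = (components / parallel_connections) * rtt
print(round(transfer_time + request_overhead, 2), "seconds (rough lower bound)")
```

Even this optimistic model, which ignores DNS lookups, TCP connection setup and server processing time, lands at a couple of seconds for a far-away user.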
Front end Performance Tools
Front-end performance is typically viewed as waterfall charts produced by tools such as the Firebug Net Panel. During development, Firebug is an invaluable tool for understanding and fixing client-side issues. However, to get a true measure of end-user experience on production systems, performance needs to be measured from points on the internet where your customers typically are. Many tools are available to do this, and they vary in price and functionality. Do your research to find a tool that fits your needs.
Back End Performance
The primary goal of measuring back end performance is to understand the maximum throughput that it can sustain. Traditionally, enterprises perform “load testing” of their applications to ensure they can scale. I prefer to call this “scalability testing”. Test clients drive load via bare-bones HTTP clients and measure the throughput of the application, i.e. the number of requests per second it can handle. To increase the throughput, the number of client drivers needs to be increased until throughput stops increasing or, worse, starts to drop off.
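When the test drivers are closed-loop (each emulated user waits for a response, thinks, then sends the next request), Little's Law gives a quick sanity check relating user count, response time and throughput. A minimal sketch, with illustrative numbers:

```python
def expected_throughput(users, response_time, think_time):
    """Little's Law for a closed system: N = X * (R + Z),
    so X = N / (R + Z). Times in seconds, throughput in requests/sec."""
    return users / (response_time + think_time)

# e.g. 1000 emulated users, 0.5 s response time, 9.5 s think time
print(expected_throughput(1000, 0.5, 9.5))  # -> 100.0 req/s
```

If a test reports a throughput far from what this formula predicts for its user count and think time, something in the measurement or the driver setup is usually wrong.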
For complex multi-tier architectures, it is beneficial to break-up the back end analysis by testing the scalability of individual tiers. For example, database scalability can be measured by running a workload just on the database. This can greatly help identify problems and also provides developers and QA engineers with tests they can repeat during subsequent product releases.
Many applications are thrown into production before any scalability testing is done. Things may seem fine until the day the application gets hit with increased traffic (good for business!). If the application crashes and burns because it cannot handle the load, you may not get a second chance.
Back End Performance Tools
Numerous load testing tools exist with varying functionality and price. There are also a number of open source tools available. Depending on resources you have and your budget, you can also outsource your entire scalability testing.
Front end performance is primarily concerned with measuring end user response times while back end performance is concerned with measuring throughput and scalability.
Service Level Agreements (SLAs) usually specify response time criteria that must be met. Although SLAs can include a wide range of metrics like throughput, uptime, availability, etc., we will focus on response times in this article.
We often hear phrases like the following:
- “The response time was 5 seconds”
- “This product’s performance is much worse than slowpoke’s. It takes longer to respond.”
- “Our whizbang product can perform 100 transactions/sec with a response time of 10 seconds or less”
Do you see anything wrong with these statements? Although they sound fine in general conversation, anyone interested in performance should ask what exactly they mean.
Let’s take the first statement above and make the assumption that it refers to a particular page in a web application. When someone says that the response time is 5 seconds, does it mean that when this user typed in the URL of this page, the browser took 5 seconds to respond? Or does it mean that in an automated test repeatedly accessing this page, the average response time was 5 seconds? Or perhaps, the median response time was 5 seconds?
You get the idea. For some reason, people tend to talk loosely about response times. Without going into details of how to measure the response time (that’s a separate topic), this article will focus on what is a meaningful response time metric.
For purposes of this discussion, let us assume we are measuring the response time of a transaction (which can be anything – web, database, cache etc.) What is the most meaningful measure for the response time of a transaction?
Mean Response Time
This is the most common measure of response time but, alas, usually the most flawed as well. The mean or average response time simply adds up the individual response times from multiple measurements and divides the sum by the number of samples. This may be fine if the measurements are fairly evenly distributed over a narrow range, as in Figure 1.
- Figure 1: Steady Response Times
- Figure 2: Varying Response Times
But if the measurements vary quite a bit over a large range like in Figure 2, the average response time is not meaningful. Both figures have the same scale and show response times on the y axis for samples taken over a period of time (x axis).
Median Response Time
If the average is not a good representation of a distribution, perhaps the median is? After all, the median marks the 50th percentile of a distribution. The median is useful when the response times are roughly normally distributed but include a few outliers; in that case, the median helps weed out the outliers. The key here is few outliers. It is important to realize that if 50% of the transactions are within the specified time, the remaining 50% have a higher response time. Surely, a response time specification that leaves out half the population cannot be a good measure.
90th or 95th percentile Response Time
In standard benchmarks, it is common to see 90th percentile response times used. The benchmark may specify that the 90th percentile response time of a transaction should be within x seconds. This means that only 10% of the transactions have a response time higher than x seconds and can therefore be a meaningful measure. For web applications, the requirements are usually even higher – after all, if 10% of your users are dissatisfied with the site performance, that could be a significant number of users. Therefore, it is common to see 95th percentile used for SLAs in web applications.
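Computing these statistics from raw samples is straightforward. The sketch below uses Python's standard library and the nearest-rank percentile method; note that tools differ slightly in how they interpolate percentiles, so numbers may not match your load tool exactly. The sample data is made up purely for illustration:

```python
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample value such that
    at least pct% of samples are at or below it."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * pct // 100))  # ceiling division
    return ordered[rank - 1]

# Made-up response times (seconds): mostly fast, with two outliers.
times = [1.2, 1.3, 1.1, 1.4, 5.0, 1.2, 1.3, 9.8, 1.2, 1.5]
print("mean  :", statistics.mean(times))    # pulled up by the outliers
print("median:", statistics.median(times))  # ignores the outliers entirely
print("p90   :", percentile(times, 90))
print("p95   :", percentile(times, 95))
```

For this data the mean (2.5 s) is inflated by the outliers while the median (1.3 s) hides them completely; the 90th and 95th percentiles are the values that actually tell you what your slowest users experience.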
A word of caution – web page response times can vary dramatically if measured at the last mile (i.e. real users’ computers connected via cable or DSL to the internet). Figure 3 shows the distribution of response times for such a measurement.
- Figure 3: Response Time Histogram
It uses the same data as in Figure 2. The mean response time for this data set is 12.9 secs and the median is even lower at 12.3 secs. Clearly neither of these measures covers any significant range of the actual response times. The 90th percentile is 17.3 and the 95th is 18.6. These are much better measures for the response time of this distribution and will work better as the SLA.
To summarize, it is important to look at the distribution of response times before attempting to define an SLA. Like many other metrics, a one size fits all approach does not work. Response time measurements on the server side tend to vary a lot less than on the client. A 90th or 95th percentile response time requirement is a good choice to ensure that the vast majority of clients are covered.
We often hear the terms Load Testing or Performance Testing, but no one talks much about Scalability Testing. Before I go further, let me define these terms so you know what I am talking about :
- Load Testing refers to the kind of testing usually done by QA organizations to ensure that the application can handle a certain load level. Criteria are set to ensure that releases of a product meet certain conditions like the number of users they can support while delivering a certain response time.
- Performance Testing on the other hand, refers to testing done to analyze and improve the performance of an application. The focus here is on optimization of resource consumption by analyzing data collected during testing. Performance Testing to a certain extent should be done by developers but more elaborate, large scale testing may be conducted by a separate performance team. In some organizations, the performance team is a part of the QA function.
- Scalability Testing refers to performance testing that is focused on understanding how an application scales as it is deployed on larger systems and/or more systems, or as more load is applied to it. The goal is to understand at what point the application stops scaling and to identify the reasons for this. As such, scalability testing can be viewed as a kind of performance testing.
In this article, we will consider how scalability testing should be done to ensure that the results are meaningful.
The first requirement for any performance testing is a well-designed workload. See my Workload Design paper for details on how to properly design a workload. Developers and QA engineers often craft a workload quickly by focusing on a couple of different operations (e.g. if testing a web application, a recording tool is used to create one or two scenarios). I will point out the pitfalls of this method in another post. So take care while creating your workload. Extra time invested in this step will more than pay off in the long run. Remember, your test results are only as good as the tests you create!
Designing Scalability Tests
Scalability tests should be planned and executed in a systematic manner to ensure that all relevant information is collected. The parameter by which load is increased obviously depends on the type of app – for web apps, this would typically be the number of simultaneous users making requests to the site. Think about what other parameters might change for your application. If the application accesses a database, will the size of the database change in some relation to the number of users accessing it? If it uses a caching tier, might it be reasonable to expect the size of this cache to expand? Consider the data accessed by your workload – how is it likely to change? Both the data generator and load generator drivers need to be implemented in a way that supports workload and data scaling.
Collecting Performance Data
When running the tests, ensure you can collect sufficient performance metrics to understand what exactly is happening on the application infrastructure. One set of metrics comes from the system infrastructure – CPU, memory, swap, network and disk I/O data. Another comes from the software infrastructure – web, application, caching (memcached) and database servers all provide access to performance data. Don’t forget to collect data on the load driver systems as well. I have seen many a situation in which the driver ran out of memory or swap, and it took a while to figure this out because no one was looking at the driver stats! All performance metrics should be collected for the same duration as the test run.
Running Scalability Tests
With planning done, it is time to run the performance tests. Start at a comfortable scale factor – say 100 users – and increment by the same amount every time (e.g. 100 users at a time). Some tools let you run a single test while varying the load – although this may be acceptable for load testing, I would discourage such shortcuts for scalability testing. The goal is not just to get to the maximum load but to understand how the system behaves at every step. Without the detailed performance data, it is difficult to do scalability analysis. Do scaling runs to a point a little beyond where the system stops scaling (i.e. throughput stays flat or, worse, starts to fall) or you run out of system resources.
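The stepping procedure above can be sketched as a small driver loop. Here `run_one_test` is a placeholder you would wire to your actual load tool; the throughput model at the bottom is a made-up stand-in purely so the sketch runs end to end:

```python
def scaling_runs(run_one_test, start=100, step=100, max_users=2000):
    """Increase load in fixed steps and stop one step beyond the point
    where throughput stops improving. run_one_test(users) must return
    the measured throughput for that load level."""
    results = []
    best = 0.0
    for users in range(start, max_users + 1, step):
        tput = run_one_test(users)
        results.append((users, tput))
        if tput <= best:      # flat or falling: we are past the knee
            break
        best = tput
    return results

# Placeholder model for illustration only: scales linearly to 800
# users, then flattens. Replace with a call to your load driver.
model = lambda n: float(min(n, 800))
for users, tput in scaling_runs(model):
    print(users, tput)
```

Keeping each step as a separate, full-length run (rather than ramping load within one run) is what lets you attach a complete set of system and software metrics to every data point.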
Enterprise applications are typically tested for load, performance and scalability using a driver that emulates a real client by sending requests to the application similar to what a real user would. For web applications, the client is usually a browser and the driver is a simple HTTP client. The emulated HTTP clients can be extremely lightweight, allowing one to run several hundred or even a thousand (depending on the think time) driver agents on a single box. Of course, tools vary widely – some can be quite heavyweight, so it’s important to do a thorough evaluation first. But I digress.
As web pages got larger and incorporated more components, performance testing tools started including a recording tool that could capture the HTTP requests as a user navigates the application using a regular browser. The driver agent would then “play back” the captured requests. Many tools also allow modification of the requests to allow for unique logins, cookies, etc. to correctly emulate multiple users. Such a “record-and-playback” methodology is part of most enterprise load testing tools.
But at what cost?
If you’re thinking that this sounds too easy – what’s the catch? – you’re right. There is a price to pay for this ease of use in both CPU and memory resources on the test driver systems. A real browser can consume tens to hundreds of megabytes of memory and significant CPU resources as well. And this is just for driving a single user! Realistically, how many browsers can you run on a typical machine, especially considering that driver boxes are typically older, slower hardware?
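A quick memory-only estimate makes the gap concrete. Every number below is an illustrative assumption; plug in figures for your own browsers and hardware:

```python
# Rough capacity estimate for real-browser load drivers
# (all numbers are illustrative assumptions).
driver_ram_mb = 8 * 1024   # assume an 8 GB driver box
os_overhead_mb = 1024      # reserve ~1 GB for the OS and tooling
per_browser_mb = 150       # assume ~150 MB per browser instance

max_browsers = (driver_ram_mb - os_overhead_mb) // per_browser_mb
print(max_browsers, "browsers per box, vs. hundreds of bare-bones HTTP clients")
```

And this ignores CPU, which in practice often becomes the bottleneck before memory does.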
So what can we do to mitigate this problem?