Performance Blog

I recently checked in a feature that allows fairly extensive comparisons of different runs in Faban. Although the ‘Compare’ button has been part of the Results list view for a while, it has been broken for a long time. It finally works!

When to use Compare

The purpose of this feature is to compare runs performed at the same load level (aka Scale in Faban) and on the same benchmark rig. Perhaps you are tuning certain configs and/or code and are doing runs to analyze the performance differences between those changes. The Compare feature lets you look at multiple runs at the same time along multiple dimensions: throughput, average and 90th-percentile response times, average CPU utilization, etc. This gives a single-page view that can quickly point out where one run differs from another.

How to use Compare

This is easy. On the results view in the dashboard, simply select the runs you want to compare using the check box at the left of each row. Then click the Compare button at the top of the screen.

The screenshot below shows this operation:

[Screenshot: selecting runs in the results view and clicking Compare]

Comparison Results

The first part of the comparison report looks like the image below. The report combines tables and graphs to present the data. For example, Run Information is a summary table that describes the runs, whereas Throughput is a graph that shows how the overall throughput varied over the length of the test for all runs.

[Screenshot: the first part of the comparison report, showing the Run Information summary table and the Throughput graph]

How can I get this code?

The code is currently in the main branch of the Faban code on GitHub. Fork it and try it out. Once I get some feedback, I will fix any issues and cut a new binary release.

I just posted an article on how to do performance load testing in production on the Box Tech blog – check it out.

The focus of that article is applying load to ensure that the system is stable and performs reasonably. This should be distinguished from Scalability Testing, where the goal is to measure the scalability of the application for performance analysis as well as capacity planning.

To test scalability, it is important to increase the load methodically so that performance can be measured and plotted. For example, we can start by running 100 emulated users, then increase the load in increments of 100 until we reach 1,000 users. This gives us 10 data points, which is sufficient for a good estimate of scalability. To compute a scalability metric, take a look at Gunther’s USL (Universal Scalability Law).
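For reference, the USL models relative capacity C(N) at load N in terms of a contention coefficient σ (serialization) and a coherency coefficient κ (crosstalk); fitting the measured throughput points to this curve gives the scalability estimate:

    C(N) = \frac{N}{1 + \sigma (N - 1) + \kappa N (N - 1)}

The σ term accounts for queueing on shared resources and caps how close to linear the scaling can get, while the κ term grows quadratically and is what eventually makes throughput peak and then fall.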

I’ve been working for a while now to revamp the entire stats processing and graphing design in Faban. For those who haven’t heard of Faban, it is a performance tool that helps you create and run workloads. Faban currently uses a technology called Fenxi to process and graph stats. Fenxi has given us lots of problems over time – it is poorly designed, lacks flexibility and doesn’t even seem to be maintained anymore. So I decided to get rid of it entirely.

I am really excited by the changes. I think this is one of the major enhancements to Faban since Akara (the original Faban architect) and I left Sun/Oracle.

So without further ado, here are the upcoming changes:

  • New, cooler-looking dynamic graphs

I’m using jqPlot, which produces nice-looking graphs by default. The ability to zoom in on a section of the graph is really nice, in addition to actually seeing the data point values as you mouse over the graph. This one feature alone, I think, will make the Faban UI feel more modern.

  • Support for graphing stats from Linux
   The Fenxi model does not support multiple OSs well, so I’ve gotten rid of it completely. Instead, support for Linux tools (I currently have vmstat and iostat in) is added using the post-processing functionality already baked into Faban. The post-processing script produces a ‘xan’ file (named after Xanadu, the original internal name for Fenxi). The nice thing about the xan format is that it is highly human-readable – take a look at any of your current detail.xan files produced by Faban. They are very easy to read, so I’m sticking with this format.
  • New Viewer to graph all xan files
Of course, the above two enhancements are not possible without a way to actually interpret the xan files and convert them to the jqPlot JSON format. So a new Viewer has been implemented that renders the xan file nicely – both tables and graphs.
I’m attaching a screenshot of a sample Linux vmstat output to whet your appetite for the new features.
Stay tuned. I hope to check everything in within the next couple of weeks.
If you are a Faban user, please join the Faban User Google group.

Going by the many posts in various LinkedIn groups and blogs, there seems to be some confusion about how to measure and analyze a web application’s performance. This article tries to clarify the different aspects of web performance and how to go about measuring it, explaining key terms and concepts along the way.

Web Application Architecture

The diagram below shows a high-level view of typical architectures of web applications.

The simplest applications have the web and app tiers combined while more complex ones may have multiple application tiers (called “middleware”) as well as multiple datastores.

The Front end refers to the web tier that generates the HTML response for the browser.

The Back end refers to the server components that are responsible for the business logic.

Note that in architectures where a single web/app server tier is responsible for both the front and back ends, it is still useful to think of them as logically separate for the purposes of performance analysis.

Front End Performance

When measuring front end performance, we are primarily concerned with understanding the response time that the user (sitting in front of a browser) experiences. This is typically measured as the time taken to load a web page. Performance of the front end depends on the following:

  • Time taken to generate the base page
  • Browser parse time
  • Time to download all of the components on the page (CSS, JS, images, etc.)
  • Browser render time of the page

For most applications, the response time is dominated by the third bullet above, i.e. the time spent by the browser in retrieving all of the components on a page. As pages have become increasingly complex, their sizes have mushroomed as well – it is not uncommon to see pages of 0.5 MB or more. Depending on where the user is located, it can take a significant amount of time for the browser to fetch components across the internet.
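Put as a rough formula (a simplification, since browsers download many components in parallel, so the download term is smaller than a straight sum):

    T_{page} \approx T_{base} + T_{parse} + T_{downloads} + T_{render}

and for most pages the T_{downloads} term dominates the total.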

Front end Performance Tools

Front-end performance is typically viewed as waterfall charts produced by tools such as the Firebug Net Panel. During development, Firebug is an invaluable tool for understanding and fixing client-side issues. However, to get a true measure of the end-user experience on production systems, performance needs to be measured from points on the internet where your customers typically are. Many tools are available to do this and they vary in price and functionality. Do your research to find a tool that fits your needs.

Back End Performance

The primary goal of measuring back end performance is to understand the maximum throughput that it can sustain. Traditionally, enterprises perform “load testing” of their applications to ensure they can scale. I prefer to call this “scalability testing”. Test clients drive load via bare-bones HTTP clients and measure the throughput of the application, i.e. the number of requests per second it can handle. To find the maximum throughput, the number of client drivers needs to be increased until throughput stops increasing or, worse, starts to drop off.
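As a minimal sketch of such a bare-bones driver (assuming Java 11+’s java.net.http.HttpClient; the target URL, user count and run duration are made-up placeholders), each emulated user is just a thread issuing requests in a loop, and throughput is the number of completed requests divided by the elapsed time:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    public class ThroughputDriver {
        public static void main(String[] args) throws Exception {
            int users = 100;                                   // emulated users (placeholder)
            Duration runTime = Duration.ofMinutes(5);          // measurement interval (placeholder)
            URI target = URI.create("http://test.example.com/home");  // hypothetical URL

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(target).GET().build();
            AtomicLong completed = new AtomicLong();

            ExecutorService pool = Executors.newFixedThreadPool(users);
            long deadline = System.nanoTime() + runTime.toNanos();
            for (int i = 0; i < users; i++) {
                pool.submit(() -> {
                    while (System.nanoTime() < deadline) {
                        try {
                            client.send(request, HttpResponse.BodyHandlers.discarding());
                            completed.incrementAndGet();       // count successful round trips
                        } catch (Exception e) {
                            // a real driver would record errors; ignored here for brevity
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(runTime.toSeconds() + 60, TimeUnit.SECONDS);

            System.out.printf("Throughput: %.1f requests/sec%n",
                    completed.get() / (double) runTime.toSeconds());
        }
    }

A real load generator would add think time, a ramp-up period and response time recording, but this is the core of the measurement.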

For complex multi-tier architectures, it is beneficial to break up the back end analysis by testing the scalability of individual tiers. For example, database scalability can be measured by running a workload just on the database. This can greatly help identify problems and also provides developers and QA engineers with tests they can repeat during subsequent product releases.

Many applications are thrown into production before any scalability testing is done. Things may seem fine until the day the application gets hit with increased traffic (good for business!). If the application crashes and burns because it cannot handle the load, you may not get a second chance.

Back End Performance Tools

Numerous load testing tools exist with varying functionality and price. There are also a number of open source tools available. Depending on the resources you have and your budget, you can also outsource your entire scalability testing.

Summary

Front end performance is primarily concerned with measuring end user response times while back end performance is concerned with measuring throughput and scalability.

 

Service Level Agreements (SLAs) usually specify response time criteria that must be met. Although SLAs can cover a wide range of metrics like throughput, uptime, availability, etc., we will focus on response times in this article.

We often hear phrases like the following:

  • “The response time was 5 seconds”
  • “This product’s performance is much worse than slowpoke’s. It takes longer to respond.”
  • “Our whizbang product can perform 100 transactions/sec with a response time of 10 seconds or less”

Do you see anything wrong with these statements? Although they sound fine for general conversation, anyone interested in performance should really be asking what exactly they mean.

Let’s take the first statement above and make the assumption that it refers to a particular page in a web application. When someone says that the response time is 5 seconds, does it mean that when this user typed in the URL of this page, the browser took 5 seconds to respond? Or does it mean that in an automated test repeatedly accessing this page, the average response time was 5 seconds? Or perhaps, the median response time was 5 seconds?

You get the idea. For some reason, people tend to talk loosely about response times. Without going into the details of how to measure response time (that’s a separate topic), this article will focus on what makes a meaningful response time metric.

For the purposes of this discussion, let us assume we are measuring the response time of a transaction (which can be anything – web, database, cache, etc.). What is the most meaningful measure of the response time of a transaction?

Mean Response Time

This is the most common measure of response time but, alas, it is usually the most flawed as well. The mean or average response time simply adds up all the individual response times taken from multiple measurements and divides the sum by the number of samples. This may be fine if the measurements are fairly evenly distributed over a narrow range, as in Figure 1.

Figure 1: Steady Response Times
Figure 2: Varying Response Times

But if the measurements vary quite a bit over a large range, as in Figure 2, the average response time is not meaningful. Both figures have the same scale and show response times on the y-axis for samples taken over a period of time (x-axis).

Median Response Time

If the average is not a good representation of a distribution, perhaps the median is? After all, the median marks the 50th percentile of a distribution. The median is useful when the response times do have a normal distribution but include a few outliers; in that case, the median helps to weed them out. The key here is few outliers. It is important to realize that if 50% of the transactions are within the specified time, the remaining 50% have a higher response time. Surely, a response time specification that leaves out half the population cannot be a good measure.

90th or 95th percentile Response Time

In standard benchmarks, it is common to see 90th percentile response times used. The benchmark may specify that the 90th percentile response time of a transaction should be within x seconds. This means that only 10% of the transactions have a response time higher than x seconds, so it can be a meaningful measure. For web applications, the requirements are usually even higher – after all, if 10% of your users are dissatisfied with the site performance, that could be a significant number of users. Therefore, it is common to see the 95th percentile used for SLAs in web applications.
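To make the difference between these metrics concrete, here is a small sketch (with made-up sample values) that computes the mean, median and 90th/95th percentiles; the percentile is taken as the value at rank ⌈p·n⌉ in the sorted samples, one common convention among several:

    import java.util.Arrays;

    public class ResponseTimeStats {

        // Value at or below which a fraction p of the sorted samples fall (rank = ceil(p * n)).
        static double percentile(double[] sorted, double p) {
            int rank = (int) Math.ceil(p * sorted.length);
            return sorted[Math.max(rank - 1, 0)];
        }

        public static void main(String[] args) {
            // Hypothetical response times in seconds, including two slow outliers
            double[] rt = {4.2, 4.7, 4.8, 4.9, 5.0, 5.0, 5.1, 5.2, 12.7, 15.3};
            Arrays.sort(rt);                                   // percentiles need sorted data

            double mean = Arrays.stream(rt).average().orElse(0);
            System.out.printf("mean=%.1fs median=%.1fs p90=%.1fs p95=%.1fs%n",
                    mean, percentile(rt, 0.50), percentile(rt, 0.90), percentile(rt, 0.95));
        }
    }

For this made-up sample, the two slow requests leave the median at 5.0 s and pull the mean only to about 6.7 s, while the 90th and 95th percentiles land on 12.7 s and 15.3 s – the tail an SLA should care about.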

A word of caution – web page response times can vary dramatically if measured at the last mile (i.e. real users’ computers connected via cable or DSL to the internet). Figure 3 shows the distribution of response times for such a measurement.

Figure 3: Response Time Histogram

It uses the same data as Figure 2. The mean response time for this data set is 12.9 secs and the median is even lower at 12.3 secs. Clearly, neither of these measures covers any significant range of the actual response times. The 90th percentile is 17.3 secs and the 95th is 18.6 secs. These are much better measures for the response time of this distribution and will work better as the SLA.

To summarize, it is important to look at the distribution of response times before attempting to define an SLA. Like many other metrics, a one-size-fits-all approach does not work. Response time measurements on the server side tend to vary a lot less than on the client. A 90th or 95th percentile response time requirement is a good choice to ensure that the vast majority of clients are covered.

We often hear the terms Load Testing or Performance Testing, but no one talks much about Scalability Testing. Before I go further, let me define these terms so you know what I am talking about:

  • Load Testing refers to the kind of testing usually done by QA organizations to ensure that the application can handle a certain load level. Criteria are set to ensure that releases of a product meet certain conditions, like the number of users they can support while delivering a certain response time.
  • Performance Testing, on the other hand, refers to testing done to analyze and improve the performance of an application. The focus here is on optimization of resource consumption by analyzing data collected during testing. Performance Testing to a certain extent should be done by developers, but more elaborate, large-scale testing may be conducted by a separate performance team. In some organizations, the performance team is part of the QA function.
  • Scalability Testing refers to performance testing that is focused on understanding how an application scales as it is deployed on larger and/or more systems, or as more load is applied to it. The goal is to understand at what point the application stops scaling and to identify the reasons for this. As such, scalability testing can be viewed as a kind of performance testing.

In this article, we will consider how scalability testing should be done to ensure that the results are meaningful.

Workload Definition

The first requirement for any performance testing is a well-designed workload. See my Workload Design paper for details on how to properly design a workload. Many developers and QA engineers typically craft a workload quickly by focusing on a couple of different operations (e.g. if testing a web application, a recording tool is used to create one or two scenarios). I will point out the pitfalls of this method in another post. So take care while creating your workload. Extra time invested in this step will more than pay off in the long run. Remember, your test results are only as good as the tests you create!

Designing Scalability Tests

Scalability tests should be planned and executed in a systematic manner to ensure that all relevant information is collected. The parameter by which load is increased obviously depends on the type of app – for web apps, this would typically be the number of simultaneous users making requests of the site. Think about what other parameters might change for your application. If the application accesses a database, will the size of the database change in some relation to the number of users accessing it? If it uses a caching tier, might it be reasonable to expect that the size of this cache will expand? Consider the data accessed by your workload – how is it likely to change? Both the data generator and the load generator drivers need to be implemented in a way that supports workload and data scaling.

Collecting Performance Data

When running the tests, ensure you can collect sufficient performance metrics to understand what exactly is happening on the application infrastructure. One set of metrics comes from the system infrastructure – CPU, memory, swap, network and disk I/O data. Another comes from the software infrastructure – web, application, caching (memcached) and database servers all provide access to performance data. Don’t forget to collect data on the load driver systems as well. I have seen many a situation in which the driver ran out of memory or swap and it took a while to figure this out because no one was looking at the driver stats! All performance metrics should be collected for the same duration as the test run.

Running Scalability Tests

With planning done, it is time to run the performance tests. You want to start at a comfortable scale factor – say 100 users – and increment by the same amount every time (e.g. 100 users at a time). Some tools let you run a single test while varying the load – although this may be acceptable for load testing, I would discourage such shortcuts for scalability testing. The goal is not just to get to the maximum load but to understand how the system behaves at every step. Without the detailed performance data, it is difficult to do scalability analysis. Do scaling runs to a point a little beyond where the system stops scaling (i.e. throughput stays flat or, worse, starts to fall) or you run out of system resources.

Enterprise applications are typically tested for load, performance and scalability using a driver that emulates a real client by sending requests to the application similar to those a real user would send. For web applications, the client is usually a browser and the driver is a simple HTTP client. The emulated HTTP clients can be extremely lightweight, allowing one to run several hundred or even a thousand driver agents (depending on the think time) on a single box. Of course, tools vary widely – some can be quite heavyweight, so it’s important to do a thorough evaluation first. But I digress.
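The reason think time matters is Little’s Law for a closed system: if each emulated user alternates between a response time R and a think time Z, sustaining a throughput X requires roughly

    N = X \cdot (R + Z)

emulated users. For example, 100 requests/sec with a 1-second response time and a 9-second think time needs about 1,000 users – easy to multiplex onto one box with lightweight HTTP clients, but a lot of real browsers.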

As web pages got larger and incorporated more components, performance testing tools started including a recording tool that could capture the HTTP requests as a user navigates the application using a regular browser. The driver agent would then “play back” the captured requests. Many tools also allow modification of the requests to allow for unique logins, cookies, etc. in order to correctly emulate multiple users. Such a “record-and-playback” methodology is part of most enterprise load testing tools.

Today’s web applications are complex and sophisticated, with tons of JavaScript that tracks and handles all sorts of mouse movements, multiple XHR requests on the same page, persistent connections using the Comet model, etc. If JavaScript generates dynamic requests, composing the URLs on the fly, the recorded scripts will fail. Of course, if the performance testing tool provides a rich interface allowing the tester full flexibility to modify the load driver, it is still possible to create a driver that can drive these rich Web 2.0 applications.

Browser Agents

Increasingly, many in the performance community are abandoning the old-style HTTP client drivers in favor of browser agents, i.e. an agent/driver that runs an actual full-featured browser. The obvious advantage of going this route is the dramatic simplification of test scripts – you can give it a single URL and the browser will automatically fetch all of the components on the page. If the page contains JavaScript that in turn generates more requests – no problem. The browser will handle it all.

But at what cost?

If you’re thinking that this sounds too easy and wondering what the catch is… you’re right. There is a price to pay for this ease of use in both CPU and memory resources on the test driver systems. A real browser can consume tens to hundreds of megabytes of memory and significant CPU resources as well. And this is just for driving a single user! Realistically, how many browsers can you run on a typical machine, especially considering that driver boxes are typically older, slower hardware?

So what can we do to mitigate this problem?

Emulated Browsers with Real Javascript Engine

A compromise solution is to use a thin browser that does not have all of the functionality of a real browser but does include a real JavaScript engine. An example is HtmlUnit, a Java library that is lighter-weight than a real browser like IE or Firefox. The caveat here is that your performance testing tool must provide the capability to make calls to arbitrary third-party libraries. Many tools have very simplistic scripting capabilities that may not allow using HtmlUnit.
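As a rough sketch of driving one emulated user with HtmlUnit (the URL is a placeholder, the exact options and wait call depend on your HtmlUnit version, and newer releases have moved the package from com.gargoylesoftware.htmlunit to org.htmlunit):

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class HtmlUnitUserSketch {
        public static void main(String[] args) throws Exception {
            try (WebClient client = new WebClient()) {
                client.getOptions().setJavaScriptEnabled(true);          // run the page's scripts
                client.getOptions().setCssEnabled(false);                // skip CSS work to stay light
                client.getOptions().setThrowExceptionOnScriptError(false);

                long start = System.nanoTime();
                // Hypothetical URL; HtmlUnit fetches the page and executes its JavaScript,
                // including any XHR requests the scripts issue.
                HtmlPage page = client.getPage("http://test.example.com/home");
                client.waitForBackgroundJavaScript(5_000);               // allow async requests to finish
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;

                System.out.println(page.getTitleText() + " loaded in " + elapsedMs + " ms");
            }
        }
    }

Even this thin browser executes the page’s JavaScript and issues the XHR requests the scripts generate, which is the point of the compromise; it just skips the rendering work a full browser would do.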

Summary

Many people seem to think that just because they have JavaScript or XHR requests, they need to use a real browser for scalability testing. This is untrue – in all but the most complex cases, you can still use an emulated client (the exception being requests that are generated from JavaScript based on complex logic that is not easy to reproduce). Keep in mind that the purpose of load/scalability testing is to measure throughput. To do so, you want the lightest possible client so you can run the maximum number of emulated users with the minimum amount of hardware. Using a real browser should be the last option to consider.
