Posts Tagged ‘web performance’
Going by the many posts in various LinkedIn groups and blogs, there seems to be some confusion about how to measure and analyze a web application’s performance. This article tries to clarify the different aspects of web performance and how to go about measuring it, explaining key terms and concepts along the way.
Web Application Architecture
The diagram below shows a high-level view of typical architectures of web applications.
The simplest applications have the web and app tiers combined while more complex ones may have multiple application tiers (called “middleware”) as well as multiple datastores.
The Front end refers to the web tier that generates the html response for the browser.
The Back end refers to the server components that are responsible for the business logic.
Note that in architectures where a single web/app server tier is responsible for both the front and back ends, it is still useful to think of them as logically separate for the purposes of performance analysis.
Front End Performance
When measuring front end performance, we are primarily concerned with understanding the response time that the user (sitting in front of a browser) experiences. This is typically measured as the time taken to load a web page. Performance of the front end depends on the following:
- Time taken to generate the base page
- Browser parse time
- Time to download all of the components on the page (css,js,images,etc.)
- Browser render time of the page
For most applications, the response time is dominated by the 3rd bullet above i.e. time spent by the browser in retrieving all of the components on a page. As pages have become increasingly complex, their sizes have mushroomed as well – it is not uncommon to see pages of 0.5 MB or more. Depending on where the user is located, it can take a significant amount of time for the browser to fetch components across the internet.
Front end Performance Tools
Front-end performance is typically viewed as waterfall charts produced by tools such as the Firebug Net Panel. During development, firebug is an invaluable tool to understand and fix client-side issues. However, to get a true measure of end user experience on production systems, performance needs to be measured from points on the internet where your customers typically are. Many tools are available to do this and they vary in price and functionality. Do your research to find a tool that fits your needs.
Back End Performance
The primary goal of measuring back end performance is to understand the maximum throughput that it can sustain.Traditionally, enterprises perform “load testing” of their applications to ensure they can scale. I prefer to call this “scalability testing“. Test clients drive load via bare-bones HTTP clients and measure the throughput of the application i.e. the number of requests per second they can handle. To increase the throughput, the number of client drivers need to be increased until the point where throughput stops to increase or worse stops to drop-off.
For complex multi-tier architectures, it is beneficial to break-up the back end analysis by testing the scalability of individual tiers. For example, database scalability can be measured by running a workload just on the database. This can greatly help identify problems and also provides developers and QA engineers with tests they can repeat during subsequent product releases.
Many applications are thrown into production before any scalability testing is done. Things may seem fine until the day the application gets hit with increased traffic (good for business!). If the application crashes and burns because it cannot handle the load, you may not get a second chance.
Back End Performance Tools
Numerous load testing tools exist with varying functionality and price. There are also a number of open source tools available. Depending on resources you have and your budget, you can also outsource your entire scalability testing.
Front end performance is primarily concerned with measuring end user response times while back end performance is concerned with measuring throughput and scalability.
I thought I’d continue the theme of my last post “A lesson in Validation” with a lesson in analysis. This one is mostly focused on networking – one of my primary new focus areas but anyone can benefit from the process and lessons learned.
Recently, I was analyzing the results of some tests run against Yahoo’s Malaysia Front Page. The response times were incredibly high and on digging down, it soon became apparent why. The time taken to retrieve the static objects was more than 100 times what it should have been.
Like most websites, Yahoo! uses geographically distributed CDNs to serve static content. Malaysia gets served out of Singapore which is pretty darned close so there should be no reason for extra long hops. A large response time to retrieve a static object can mean one of two things: the object is not being cached or it is getting routed to the wrong location.
Sure enough, nslookup showed that the ip address being used to access all the static objects was located in the USA. It was therefore no surprise that it was taking so long. Satisfied that I had found a routing problem, I contacted the DNS team and they said … you guessed it. “It’s not our problem”. They ran some of their own tests and stated that Malaysia was getting routed correctly to Singapore, therefore the problem must be with my tests.
Dig(1) to the rescue
Since I was using a third-party tool, I contacted the vendor to see if they could help. The support engineer promptly ran dig and found that all the requests were being resolved (but didn’t care that they weren’t being resolved correctly!) Here is the output of dig:
<<>> DiG 9.3.4 <<>> l1.yimg.com @GPNKULDB01 A +notrace +recurse ; (1 server found) ;; global options: printcmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1442 ;; flags: qr rd ra; QUERY: 1, ANSWER: 8, AUTHORITY: 2, ADDITIONAL: 2 ;; QUESTION SECTION: ;l1.yimg.com. IN A ;; ANSWER SECTION: l1.yimg.com. 3068 IN CNAME geoycs-l.gy1.b.yahoodns.net. geoycs-l.gy1.b.yahoodns.net. 3577 IN CNAME fo-anyycs-l.ay1.b.yahoodns.net. fo-anyycs-l.ay1.b.yahoodns.net. 210 IN A 184.108.40.206 fo-anyycs-l.ay1.b.yahoodns.net. 210 IN A 220.127.116.11 fo-anyycs-l.ay1.b.yahoodns.net. 210 IN A 18.104.22.168 fo-anyycs-l.ay1.b.yahoodns.net. 210 IN A 22.214.171.124 fo-anyycs-l.ay1.b.yahoodns.net. 210 IN A 126.96.36.199 fo-anyycs-l.ay1.b.yahoodns.net. 210 IN A 188.8.131.52 ;; AUTHORITY SECTION: ay1.b.yahoodns.net. 159379 IN NS yf1.yahoo.com. ay1.b.yahoodns.net. 159379 IN NS yf2.yahoo.com. ;; ADDITIONAL SECTION: yf2.yahoo.com. 1053 IN A 184.108.40.206 yf1.yahoo.com. 1053 IN A 220.127.116.11 ;; Query time: 0 msec ;; SERVER: 172.16.37.138#53(172.16.37.138) ;; WHEN: Wed Jan 12 11:15:42 2011 ;; MSG SIZE rcvd: 270
We now had the ip address of the DNS Resolver – 172.16.37.138. Where was this located? nslookup showed:
** server can't find 18.104.22.168.in-addr.arpa.: NXDOMAIN
No luck – this was a private IP address. Back I went to Tech Support: “Can you please let me know what the public IP address is where this private IP is NATed to?”
And here’s what I got: “I confirmed with our NOC team that the public IP Address of the DNS server located at “Kuala Lumpur, Malaysia – VNSL” site is a.b.c.d. Hope this is helpful for you.”
(I replaced the actual ip address above for security). I promptly did another nslookup which showed the host name with KaulaLumpur in it. So it seemed that there was no problem on the testing side after all. But not so fast … hostnames can be bogus!
Geo-Locating a Server
We had to find out where exactly this ip address was located in the world. So the ip address was plugged into http://whatismyipaddress.com/ip-lookup and it came back with:
The test server was actually located in Canada, while appearing to be from KaulaLampur! No wonder our DNS servers were routing the requests to the US.
What seemed to be a problem with our routing, turned out to be a problem of the tool’s routing !
Moral of the story: Don’t trust any tool and validate, validate, validate!
Those who have worked with me know how much I stress the importance of validation: validate your tools, workloads and measurements. Recently, an incident brought home yet again the importance of this tenet and I hope my narrating this story will prove useful to you as well.
For the purposes of keeping things confidential, I will use fictional names. We were tasked with running some performance measurements on a new version of a product called X. The product team had made considerable investment into speeding things up and now wanted to see what the fruits of their labor had produced. The initial measurements seemed good but since performance is always relative, they wanted to see comparisons against another version Y. So similar tests were spun up on Y and sure enough X was faster than Y. The matter would have rested there, if it weren’t for the fact that the news quickly spread and we were soon asked for more details.
Firebug to the rescue
At this point, we took a pause and wondered: Are we absolutely sure X is faster than Y? I decided to do some manual validation. That night, connecting via my local ISP from home, I used firefox to do the same operations that were being performed in the automated performance tests. I launched firebug and started looking at waterfalls in the Net panel.
As you can probably guess, what I saw was surprising. The first request returned a page that caused a ton of object retrievals. The onload time reported by firebug was only a few seconds, yet there was no page complete time!
The page seemed complete and I could interact with it. But the fact that firebug could not determine Page Complete was a little disconcerting. I repeated the exercise using HttpWatch just to be certain and it reported exactly the same thing.
Time to dig down deeper. Taking a look at the individual object requests, one in particular was using the Comet model and it never completed. On waiting a little longer, there were other requests being sent by the browser periodically. Neither of these request types however had any visual impact on the page. Since requests were continuing to be made, firebug obviously thought that the page was not complete.
Page Complete or Not?
This begged the question: how did the automated tests run and how did they determine when the page was done? There was a timeout set for each request, but if the request was terminating because of the timeout, we surely would have noticed since the response times reported would have been the timeout value. In fact, the response time being reported was less than half the timeout value.
So we started digging into the waterfalls of some of the automated test results. Lo and behold – a significant component of the response time was the HTTP Push (also known as HTTP Streaming) one. There were also several of the sporadic requests that were being made well after the page was complete. This resulted in arbitrary response times for Y being reported.
It turned out that the automated tool was actually quite sophisticated. It doesn’t just use a standard timeout for the entire request. Instead it monitors the network and if no activity is detected for 3 seconds, it considers the request complete. So it captured some of the streaming and other post-PageComplete requests and returned when the pause between them was more than 3 seconds. That is why we thought we were seeing “valid” response times which looked reasonable and had us fooled!
Of course, this leads to the big discussion of when exactly do we consider a http request as complete? I don’t want to get into that now as my primary purpose of this article is to point out the importance of validation in performance testing. If we had taken the time to validate the results of the initial test runs, this problem would have been detected a long time ago and lots of cycles could have been saved (not to mention the embarrassment of admitting to others that the initial results reported were wrong !)