Tracking a response time problem
Posted on: April 8, 2010
Recently, an engineer came to me puzzled that response times in a performance benchmark she was running kept increasing. She had already looked at the usual trouble spots: resource utilization on the application and database systems, database statistics, application server tunables, the network stack, and so on. I asked her about the CPU metrics on the load driver systems (the machines that drive the load). Usually, when I ask this question, the answer is “I don’t know. Let me find out and get back to you.” But this engineer had looked at that as well: “It isn’t a problem. There is plenty of CPU left – I have 30% idle.”
Aha – I had spotted the problem. When we run benchmarks, we tend to squeeze every bit of performance we can out of the systems, which means running the servers as close to 100% utilization as possible. That mantra sometimes gets carried over to the load driver systems as well. Unfortunately, there it can result in severe performance degradation. Here’s why.
The load driver systems emulate users and generate requests to the system under test. They receive the responses and measure and record response times. A typical driver emulates hundreds to thousands of users, and each emulated user competes for the driver’s resources. Now suppose an emulated user has issued a read to receive the response from the server. It is very likely that this thread will be context-switched out by the operating system, since there are so many other users it needs to serve. Depending on the number of CPUs on the system and the load, the original emulated user’s thread may get to run only after a considerable delay, and consequently record a much larger response time. My rule of thumb is to never run the load generator systems more than 50% busy if the application is latency sensitive. In this particular case, the system was already 70% utilized.
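To make the effect concrete, here is a toy simulation. The model is my own, not from the benchmark in question: it assumes the extra scheduling delay on the driver grows roughly like u/(1−u), a standard queueing heuristic, on top of a fixed true server response time.

```python
# Toy model: what the load driver *records* is the server's true response
# time plus the time the measuring thread waits to be scheduled again.
# Assumption (mine): scheduling delay ~ quantum * u / (1 - u), where u is
# the driver's CPU utilization. Illustrative only, not a real scheduler model.

def measured_response_ms(true_response_ms, driver_utilization,
                         base_quantum_ms=10.0):
    """Estimated recorded response time at a given driver CPU utilization."""
    u = driver_utilization
    scheduling_delay = base_quantum_ms * u / (1.0 - u)
    return true_response_ms + scheduling_delay

true_ms = 100.0
for u in (0.30, 0.50, 0.70, 0.90):
    print(f"driver {u:.0%} busy -> recorded {measured_response_ms(true_ms, u):.1f} ms")
```

With these made-up numbers, the recorded time is 110 ms at 50% busy but 190 ms at 90% busy, even though the server never changed – exactly the kind of creep the engineer was chasing.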
Sure enough – when a new load driver system was added and the performance tests were re-run, all the response time criteria passed and the engineer could continue scaling the benchmark.
Moral of the story – don’t forget to monitor your load driver systems and don’t be complacent if their utilization starts climbing above 50%.
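If you want to automate that monitoring, here is a minimal sketch. It assumes Linux-style counters from the `cpu` line of /proc/stat (user, nice, system, idle, iowait, …); the 50% threshold is the rule of thumb above, and the sample values are made up for illustration.

```python
# Minimal driver-side CPU check from two /proc/stat "cpu" samples.
# Assumes Linux /proc/stat counter order: user, nice, system, idle, iowait, ...

def busy_fraction(sample_before, sample_after):
    """Fraction of time the CPU was not idle between two counter samples.
    Idle time here counts both the 'idle' and 'iowait' fields."""
    idle = (sample_after[3] + sample_after[4]) - (sample_before[3] + sample_before[4])
    total = sum(sample_after) - sum(sample_before)
    return 1.0 - idle / total

# Made-up counter values (jiffies) a few seconds apart:
before = (1000, 0, 500, 8000, 200, 0, 0)
after  = (1700, 0, 800, 8400, 300, 0, 0)

u = busy_fraction(before, after)
print(f"driver CPU {u:.0%} busy")
if u > 0.50:
    print("warning: load driver over 50% busy - recorded response times may be inflated")
```

In practice you would take the samples by reading /proc/stat twice (or just watch `mpstat`/`vmstat`), but the arithmetic is the same.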