Measuring CPU Utilization on EC2
Posted on: March 20, 2010
Many web applications are now moving to the cloud, where configurations are difficult to understand; see, for example, the Amazon Web Services (AWS) definition of an EC2 Compute Unit. How does one determine how many instances of which type are required to run an application? Typical capacity planning exercises start with measurements. For example, one might deploy the target app on an EC2 m1.small instance (1 EC2 Compute Unit) and see how many users it can support. Based on the performance metrics gathered during the test, one can estimate how many instances will be required (assuming, of course, that the application scales horizontally).
The Tests
To test this simplistic model, I fired up an EC2 m1.small instance running Ubuntu and started the Apache web server. I used another instance as the load driver to repeatedly fetch a single helloworld PHP page from the web server and scaled up the number of users from 1 to 50 in increments of 10. The test was written and driven by Faban, a versatile open-source performance testing tool.
The Measurements
On Unix systems, tools like vmstat and mpstat can be used to measure CPU utilization (amongst other things). Faban automatically runs these tools on any test for the same duration as the test run, which lets one monitor resource utilization during the test. (Note that on Ubuntu, mpstat is not installed by default but is available as part of the sysstat package.)
The Results
Here is the throughput graph as the number of virtual users was scaled.
The throughput peaks at 20 users and then flattens out (actually, it falls a little). Looking at this graph, one would expect that the CPU saturated at around 20 users (assuming no other bottleneck, which is a reasonable assumption for this extremely simple application).
Here is a snippet of the vmstat output captured during the run at 50 users:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0      0 661096  18596 961772    0    0     6   104 1108  826  3  2 87  0
 1  0      0 658116  18616 964028    0    0     0   237 2788 1966 26  9  9  0
 3  0      0 655856  18632 966296    0    0     0   244 2699 1959 24 12  6  0
 2  0      0 653376  18652 968700    0    0     0   236 2943 2069 24 11  9  0
 1  0      0 651020  18668 970972    0    0     0   240 2842 1963 25 10  6  0
 2  0      0 648680  18688 973224    0    0     0   241 2763 1954 24 11  9  0
The user time (the us column) averages 24.6% and the system time (the sy column) averages 10.6%, for a total of 35.2%. (These averages leave out the first row, which vmstat reports as an average since boot.) Yet the idle time (the id column) is only around 8%, so it is no wonder the throughput stopped increasing. But there is a discrepancy here: if user and system time add up to only 35%, where is the remaining time going?
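The arithmetic can be checked with a short script. This is only a sketch: the rows are copied from the vmstat snippet above with the since-boot row dropped, and the names `VMSTAT_ROWS`, `COLUMNS` and `column_averages` are made up for illustration.

```python
# Rows from the 50-user vmstat run; the first (since-boot) row is left out.
VMSTAT_ROWS = """\
1 0 0 658116 18616 964028 0 0 0 237 2788 1966 26 9 9 0
3 0 0 655856 18632 966296 0 0 0 244 2699 1959 24 12 6 0
2 0 0 653376 18652 968700 0 0 0 236 2943 2069 24 11 9 0
1 0 0 651020 18668 970972 0 0 0 240 2842 1963 25 10 6 0
2 0 0 648680 18688 973224 0 0 0 241 2763 1954 24 11 9 0
"""

# Column names in the order this version of vmstat prints them.
COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa".split()


def column_averages(text, columns=COLUMNS):
    """Return the mean of every vmstat column over the given rows."""
    rows = [[int(v) for v in line.split()]
            for line in text.splitlines() if line.strip()]
    return {name: sum(row[i] for row in rows) / len(rows)
            for i, name in enumerate(columns)}


avg = column_averages(VMSTAT_ROWS)
# Everything vmstat reports about CPU time in this output.
accounted = avg["us"] + avg["sy"] + avg["id"] + avg["wa"]

print(f"us={avg['us']:.1f} sy={avg['sy']:.1f} id={avg['id']:.1f}")
print(f"accounted={accounted:.1f} missing={100 - accounted:.1f}")
```

The four CPU columns account for only about 43% of the time, which leaves roughly 57% unexplained by this vmstat output.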
To understand that, take a look at the mpstat output snippet below:
12:57:43 AM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
12:57:53 AM  all   25.73    0.00    8.71    0.00    0.00    0.30   55.96    9.31   2763.26
12:58:03 AM  all   24.21    0.00   11.31    0.00    0.00    0.40   58.93    5.16   2661.41
12:58:13 AM  all   24.09    0.00   10.07    0.00    0.00    0.89   55.58    9.38   2904.84
12:58:23 AM  all   25.15    0.00    9.34    0.00    0.00    0.89   59.05    5.57   2824.95
12:58:33 AM  all   23.78    0.00    9.99    0.00    0.00    1.20   56.14    8.89   2760.54
12:58:43 AM  all   22.26    0.00   11.58    0.00    0.00    0.40   54.79   10.98   2835.83
We can see that the %user, %sys and %idle column values match those shown by vmstat. But there is an additional utilization column, %steal, which ranges from about 55% to 59%. Adding this value to the user, sys and idle values (along with the small %soft value) gives 100%. So that’s where the missing time has gone: to %steal.
Who is stealing my CPU?
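To confirm that the columns really do add up once %steal is included, the mpstat rows can be summed in a few lines. This is a sketch over the six samples shown above; `MPSTAT_ROWS`, `FIELDS` and `parse_mpstat` are illustrative names, and the parser assumes the exact column layout of this mpstat version (time, AM/PM, CPU, then the percentage columns).

```python
# The six mpstat samples from the 50-user run (header line omitted).
MPSTAT_ROWS = """\
12:57:53 AM all 25.73 0.00 8.71 0.00 0.00 0.30 55.96 9.31 2763.26
12:58:03 AM all 24.21 0.00 11.31 0.00 0.00 0.40 58.93 5.16 2661.41
12:58:13 AM all 24.09 0.00 10.07 0.00 0.00 0.89 55.58 9.38 2904.84
12:58:23 AM all 25.15 0.00 9.34 0.00 0.00 0.89 59.05 5.57 2824.95
12:58:33 AM all 23.78 0.00 9.99 0.00 0.00 1.20 56.14 8.89 2760.54
12:58:43 AM all 22.26 0.00 11.58 0.00 0.00 0.40 54.79 10.98 2835.83
"""

# Percentage columns, in the order they follow the time, AM/PM and CPU fields.
FIELDS = ["user", "nice", "sys", "iowait", "irq", "soft", "steal", "idle"]


def parse_mpstat(text):
    """Turn each mpstat row into a dict of percentage columns."""
    samples = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3 + len(FIELDS):
            continue  # skip blank or truncated lines
        values = [float(v) for v in parts[3:3 + len(FIELDS)]]
        samples.append(dict(zip(FIELDS, values)))
    return samples


samples = parse_mpstat(MPSTAT_ROWS)
totals = [sum(s.values()) for s in samples]
avg_steal = sum(s["steal"] for s in samples) / len(samples)

for s, total in zip(samples, totals):
    print(f"steal={s['steal']:.2f} total={total:.2f}")
print(f"average steal={avg_steal:.2f}")
```

Every row sums to 100% within rounding, and the average %steal comes out just under 57%, which matches the share of time missing from the vmstat columns.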
But what exactly is %steal? It is the time when your application had something ready to run but the CPU was busy servicing some other instance. Clearly, this is not the only application running on this CPU. The m1.small instance is defined as providing “1 EC2 Compute Unit”, not 1 CPU.
In this case, the one instance was worth about 35% of the single CPU on this system (an Intel Xeon E5430 @ 2.66GHz).
When looking at CPU utilization on EC2 (or any virtualized environment based on Xen), keep this in mind. Always consider %steal in addition to the user and system time.
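If mpstat is not available, steal time can also be derived from the aggregate cpu line in /proc/stat on Linux, whose first eight counters are user, nice, system, idle, iowait, irq, softirq and steal. The sketch below computes percentages from two snapshots of that line. The helper names are hypothetical, and the two sample lines are made-up numbers chosen to resemble the run above, not measurements.

```python
# First eight counters on the "cpu" line of /proc/stat, in order.
CPU_FIELDS = ["user", "nice", "system", "idle",
              "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of ticks."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    ticks = [int(v) for v in parts[1:1 + len(CPU_FIELDS)]]
    return dict(zip(CPU_FIELDS, ticks))


def cpu_percentages(before, after):
    """Percentage of elapsed ticks spent in each state between two snapshots."""
    deltas = {k: after[k] - before[k] for k in CPU_FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {k: 0.0 for k in CPU_FIELDS}
    return {k: 100.0 * d / total for k, d in deltas.items()}


# Made-up snapshots for illustration only.
before = parse_cpu_line("cpu 1000 0 400 5000 0 0 10 2000 0 0")
after = parse_cpu_line("cpu 1250 0 500 5090 0 0 20 2550 0 0")
pct = cpu_percentages(before, after)

busy = pct["user"] + pct["system"] + pct["softirq"]
print(f"busy={busy:.1f}% steal={pct['steal']:.1f}% idle={pct['idle']:.1f}%")
```

On a live instance, the two snapshots would come from reading the first line of /proc/stat a few seconds apart. A low busy percentage together with a high steal percentage means the instance is already using all the CPU it is being given.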

June 21, 2011 at 9:13 am
This is very helpful information.
Thank you.