Memcached Performance on Sun’s Nehalem System
Posted April 18, 2009on:
is the de-facto distributed caching server used to scale many web2.0
sites today. With the requirement to support a very large number of
users as sites grow, memcached aids scalability by effectively cutting
down on MySQL traffic and improving response times.
is a very light-weight server but is known not to scale beyond 4-6
threads. Some scalability improvements have gone into the 1.3 release
(still in beta). With the new Intel Nehalem based systems improved
hyper-threading providing twice as much performance as current systems,
we were curious to see how memcached would perform on these systems. So we ran some tests, the results of which are shown below :
memcached 1.3.2 does scale slightly better than 1.2.5 after 4 threads. However,
both versions reach their peak at 8 threads with 1.3.2 giving about 14%
better throughput at 352,190 operations/sec.
The improvements made to per-thread stats certainly have helped as we no longer see stats_lock at the top of the profile. That honor now goes to cache_lock.
With the increased performance of new systems making 350K ops/sec
possible, breaking up of this (and other) lock(s) in memcached is
necessary to improve scalability.
A single instance of memcached was run on a SunFire X2270
(2 socket Nehalem) with 48GB of memory and an Oplin 10G card. Several
external client systems were used to drive load against the server
using an internally developed Memcached benchmark. More on the
The clients connected to the server using a single 10 Gigabit Ethernet
link. At the maximum throughput of 350K, the network was about 52%
utilized and the server was 62% utilized. So there is plenty of
head-room on this system to handle a much higher load if memcached
could scale better. Of course, it is possible to run multiple instances
of memcached to get better performance and better utilize the system
resources and we plan to do that next. It is important to note that
utilizing these high performance systems effectively for memcached will
require the use of 10 GBE interfaces.
The Memcached benchmark we ran is based on Apache Olio – a web2.0 workload. I recently showcased
results from Olio on Nehalem systems as well. Since Olio is a complex
multi-tier workload, we extracted the memcached part to more easily
test it in a stand-alone environment. This gave rise to our Memcached benchmark.
benchmark initially populates the server cache with objects of different
sizes to simulate the types of data that real sites typically store in
- small objects (4-100 bytes) to represent locks and query results
- medium objects (1-2 KBytes) to represent thumbnails, database rows, resultsets
- large objects (5-20 KBytes) to represent whole or partially generated pages
The benchmark then runs a mixture of operations (90% gets, 10% sets)
and measures the throughput and response times when the system reaches
steady-state. The workload is implemented using Faban,
an open-source benchmark development framework. It not only speeds
benchmark development, but the Faban harness is a great way to queue,
monitor and archive runs for analysis.
Stay tuned for further results.