Performance Blog


Following on the heels of our memcached performance tests on the
SunFire X2270 (Sun’s Nehalem-based server) running OpenSolaris, we ran
the same tests on the same server, this time on RHEL5. As mentioned in
the post presenting the first memcached results, a 10GbE Intel Oplin
card was used to achieve the high throughput rates these servers are
capable of. It turned out that using this card on Linux took a bit of
work, including rebuilding both the driver and the kernel.

  • With the default ixgbe driver from the Red Hat
    distribution (version 1.3.30-k2 on kernel 2.6.18), the interface
    simply hung during the benchmark test.
  • This led us to download the driver from the Intel site (1.3.56.11-2-NAPI) and recompile it. This version does work, and we got a maximum throughput of 232K operations/sec on the same Linux kernel (2.6.18). However, this kernel version does not support multiple rings.
  • Kernel 2.6.29 supports multiple rings but does not include the latest ixgbe driver, 1.3.56-2-NAPI. So we downloaded, built and installed this kernel and driver. This combination worked well, giving a maximum throughput of 280K ops/sec with some tuning.

Results Comparison

The system running OpenSolaris and memcached 1.3.2 gave us a maximum throughput of 350K ops/sec, as previously reported. The same system running RHEL5 (with kernel 2.6.29) and the same version of memcached reached 280K ops/sec. OpenSolaris outperforms Linux by 25%!
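The 25% figure is the OpenSolaris peak expressed relative to the Linux peak. Measured the other way round, the Linux peak is 20% below the OpenSolaris peak:

```latex
\frac{350\text{K ops/s}}{280\text{K ops/s}} = 1.25
\qquad\text{and}\qquad
\frac{280\text{K ops/s}}{350\text{K ops/s}} = 0.80
```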

Linux Tuning

The following Linux tunables were changed to try to get the best performance:

net.ipv4.tcp_timestamps = 0
net.core.wmem_default = 67108864
net.core.wmem_max = 67108864
net.core.optmem_max = 67108864
net.ipv4.tcp_dsack = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_window_scaling = 0
net.core.netdev_max_backlog = 300000
net.ipv4.tcp_max_syn_backlog = 200000

Here are the ixgbe-specific driver settings that were used (2 transmit, 2 receive rings):
RSS=2,2 InterruptThrottleRate=1600,1600

OpenSolaris Tuning

The following settings in /etc/system were used to set the number of MSI-X interrupts:
set ddi_msix_alloc_limit=4
set pcplusmp:apic_intr_policy=1

For the ixgbe interface, 4 transmit and 4 receive rings gave the best performance:

tx_queue_number=4, rx_queue_number=4

Finally, we bound the Crossbow threads to specific CPUs:

dladm set-linkprop -p cpus=12,13,14,15 ixgbe0

The first cut of a Java EE implementation of Olio is now checked into the repository. The file docs/java_setup.html gives instructions on how to build and set up this implementation. The implementation uses JSP, servlets, JPA for persistence, and Yahoo and jMaki widgets for AJAX. The web application is located in webapp/java/trunk, and the load driver, database and file loaders etc. are in workload/java/trunk.

Check it out.

As promised, here are more results from running memcached on Sun’s X2270 (Nehalem-based server). In my previous post, I mentioned that we got 350K ops/sec running a single instance of memcached, at which point throughput was limited by memcached’s scalability issues. So we ran two instances of memcached on the same server, each using 15GB of memory, and tested both versions 1.2.5 and 1.3.2. Here are the results:
[Chart: 2 instances performance]

The maximum throughput was 470K ops/sec using 4 threads in memcached 1.3.2; 1.2.5 was only slightly lower. At this throughput, the network capacity of the single 10GbE card was reached, since the benchmark does a lot of small packet transfers. See my earlier post for a description of the server configuration and the benchmark. At the maximum throughput, the CPU was still only 62% utilized (73% in the case of 1.2.5). Note that with a single instance we were using the same amount of CPU but reaching a lower throughput, which once again points to memcached’s scalability issues.
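With two memcached instances on one server, the clients have to decide which instance owns each key. The post does not say how the benchmark does this, so the following is only a sketch of the classic client-side approach, assuming a stable hash (CRC32 here) taken modulo the number of instances. The host name and the second port are hypothetical (11211 is memcached’s default port).

```python
import zlib

# Hypothetical endpoints: two memcached instances on the same host,
# distinguished only by port.
INSTANCES = ["x2270:11211", "x2270:11212"]


def pick_instance(key, instances=INSTANCES):
    """Map a key to one instance: stable CRC32 hash modulo the instance count.

    zlib.crc32 is used instead of Python's built-in hash(), which is
    randomized per process for strings and so would not be stable across clients.
    """
    return instances[zlib.crc32(key.encode("utf-8")) % len(instances)]


if __name__ == "__main__":
    from collections import Counter

    # Prints how many of the sample keys landed on each instance
    print(Counter(pick_instance(f"user:{i}") for i in range(10000)))
```

Plain modulo hashing remaps most keys whenever the number of instances changes; clients that add or remove servers often use consistent hashing instead.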

These are really exciting results. Stay tuned: there is more to come.

Memcached is the de facto distributed caching server used to scale many
web2.0 sites today. As sites grow and must support ever larger numbers
of users, memcached aids scalability by cutting down on MySQL traffic
and improving response times.
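The usual way an application cuts database traffic with memcached is the cache-aside pattern: read from the cache first, and only on a miss query the database and store the result. Below is a minimal Python sketch of that pattern; a plain dict stands in for a memcached client and the hypothetical `CountingDB` stands in for MySQL, so the effect on query count can be observed directly.

```python
class CountingDB:
    """Stand-in for MySQL that counts how many queries actually reach it."""

    def __init__(self, rows):
        self.rows = rows
        self.queries = 0

    def fetch(self, key):
        self.queries += 1
        return self.rows.get(key)


def get_with_cache(key, cache, db):
    """Cache-aside read: try the cache, fall back to the DB, then populate the cache."""
    value = cache.get(key)
    if value is None:
        value = db.fetch(key)
        # Only found rows are cached here; repeated misses still reach the DB.
        if value is not None:
            cache[key] = value
    return value


if __name__ == "__main__":
    db = CountingDB({"user:1": "alice"})
    cache = {}  # stand-in for a memcached client
    for _ in range(100):
        get_with_cache("user:1", cache, db)
    print("DB queries for 100 reads:", db.queries)
```

After the first read populates the cache, the remaining reads never touch the database. A real deployment would also set an expiry time and invalidate or update the cached entry when the row changes.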

Memcached is a very lightweight server but is known not to scale beyond
4-6 threads. Some scalability improvements have gone into the 1.3
release (still in beta). With the improved hyper-threading of the new
Intel Nehalem-based systems providing twice as much performance as
current systems, we were curious to see how memcached would perform on
them. So we ran some tests, the results of which are shown below:

[Chart: Memcached throughput with threads]

memcached 1.3.2 does scale slightly better than 1.2.5 beyond 4 threads.
However, both versions reach their peak at 8 threads, with 1.3.2 giving
about 14% better throughput at 352,190 operations/sec.

The improvements made to per-thread stats have certainly helped, as we no longer see stats_lock at the top of the profile. That honor now goes to cache_lock.
With new systems making 350K ops/sec possible, breaking up this lock
(and others) in memcached is necessary to improve scalability.
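One common way to break up a single global lock like cache_lock is lock striping: split the table into N shards, each guarded by its own lock, so that threads working on different keys rarely contend. The Python sketch below only illustrates the structure of the technique. It is not memcached’s code or a proposed patch, and because of Python’s global interpreter lock it will not show the speedup a C implementation would. The class name and stripe count are hypothetical.

```python
import threading
import zlib


class StripedCache:
    """Hash table split into shards, each protected by its own lock."""

    def __init__(self, stripes=16):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._shards = [{} for _ in range(stripes)]

    def _stripe(self, key):
        # The same key always maps to the same shard and lock.
        return zlib.crc32(key.encode("utf-8")) % len(self._locks)

    def get(self, key):
        i = self._stripe(key)
        with self._locks[i]:  # only this shard is locked
            return self._shards[i].get(key)

    def set(self, key, value):
        i = self._stripe(key)
        with self._locks[i]:
            self._shards[i][key] = value


if __name__ == "__main__":
    cache = StripedCache(stripes=8)

    def writer(start):
        for i in range(start, start + 500):
            cache.set(f"k{i}", i)

    threads = [threading.Thread(target=writer, args=(n * 500,)) for n in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(cache.get("k1999"))
```

Operations that span the whole cache (for example LRU maintenance or expansion of the hash table) are what make this harder in a real server, since they may need several or all of the locks.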

Test Details

A single instance of memcached was run on a SunFire X2270 (two-socket
Nehalem) with 48GB of memory and an Oplin 10G card. Several external
client systems were used to drive load against the server using an
internally developed Memcached benchmark (more on the benchmark later).
The clients connected to the server over a single 10 Gigabit Ethernet
link. At the maximum throughput of 350K ops/sec, the network was about
52% utilized and the server was 62% utilized, so there is plenty of
headroom on this system to handle a much higher load if memcached could
scale better. Of course, it is possible to run multiple instances of
memcached to get better performance and make better use of system
resources, and we plan to do that next. It is important to note that
using these high-performance systems effectively for memcached will
require 10GbE interfaces.

Benchmark Details

The Memcached benchmark we ran is based on Apache Olio, a web2.0 workload. I recently showcased
results from Olio on Nehalem systems as well. Since Olio is a complex
multi-tier workload, we extracted the memcached part to test it more
easily in a stand-alone environment. This gave rise to our Memcached benchmark.

The benchmark initially populates the server cache with objects of
different sizes to simulate the types of data that real sites typically
store in memcached:

  • small objects (4-100 bytes) to represent locks and query results
  • medium objects (1-2 KB) to represent thumbnails, database rows and result sets
  • large objects (5-20 KB) to represent whole or partially generated pages
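The populate phase can be sketched as drawing each object from one of the three size classes. The post gives the size ranges but not how often each class occurs, so the Python sketch below assumes equal weights and uniformly distributed sizes within each range, and takes 1 KB as 1024 bytes. The function names and the key format are hypothetical; this is not the benchmark’s actual code.

```python
import random

# Size ranges in bytes for the three object classes listed above
# (assumption: 1 KB = 1024 bytes).
SIZE_CLASSES = {
    "small": (4, 100),
    "medium": (1 * 1024, 2 * 1024),
    "large": (5 * 1024, 20 * 1024),
}


def random_object(rng, weights=(1, 1, 1)):
    """Pick a class (equal weights by default, an assumption) and a uniform size in its range."""
    name = rng.choices(list(SIZE_CLASSES), weights=weights)[0]
    low, high = SIZE_CLASSES[name]
    return name, rng.randint(low, high)  # randint is inclusive at both ends


def populate(n, seed=42):
    """Yield (key, value) pairs of the chosen sizes for pre-loading a cache."""
    rng = random.Random(seed)
    for i in range(n):
        name, size = random_object(rng)
        yield f"{name}:{i}", b"x" * size


if __name__ == "__main__":
    cache = dict(populate(1000))
    print(len(cache), "objects,", sum(len(v) for v in cache.values()), "bytes")
```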

The benchmark then runs a mixture of operations (90% gets, 10% sets)
and measures the throughput and response times when the system reaches
steady-state. The workload is implemented using Faban,
an open-source benchmark development framework. It not only speeds
benchmark development, but the Faban harness is a great way to queue,
monitor and archive runs for analysis.
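The 90/10 operation mix can be sketched as a loop that, for each operation, draws a random number and issues a get with probability 0.9 and a set otherwise, then reports throughput as operations divided by elapsed time. This is a single-threaded illustration against a dict, assuming uniformly chosen keys. Unlike the Faban-based benchmark, it has no warm-up period, steady-state detection or response-time measurement. `run_mix` is a hypothetical name.

```python
import random
import time


def run_mix(cache, keys, n_ops, get_ratio=0.9, seed=1):
    """Issue n_ops operations (~90% gets, ~10% sets); return (gets, sets, ops_per_sec)."""
    rng = random.Random(seed)
    gets = sets = 0
    start = time.perf_counter()
    for _ in range(n_ops):
        key = rng.choice(keys)  # assumption: keys chosen uniformly
        if rng.random() < get_ratio:
            cache.get(key)
            gets += 1
        else:
            cache[key] = b"v"
            sets += 1
    elapsed = time.perf_counter() - start
    return gets, sets, (n_ops / elapsed if elapsed > 0 else float("inf"))


if __name__ == "__main__":
    cache = {f"key{i}": b"v" for i in range(100)}
    gets, sets, rate = run_mix(cache, list(cache), 100_000)
    print(f"gets={gets} sets={sets} throughput={rate:,.0f} ops/sec")
```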

Stay tuned for further results.

I introduced Olio a little while ago as a toolkit to help web developers
and deployers as well as performance/operations engineers. Olio
includes a web2.0 application as well as the software required to drive
load against it. Today, we are showcasing the first major deployment of
Olio on Sun’s newest Intel Nehalem-based systems: the SunFire X2270 and
the SunFire X4270. We tested 10,000 concurrent users (with a database of
1 million users) using over 1TB of storage in the unstructured object
store.

The diagram below shows the configuration we tested.

The Olio/PHP web application was deployed on two X2270 systems. Since
these systems are wickedly fast, we also chose to run memcached on them,
which eliminates the need for a separate memcached tier. The structured
data in Olio resides in MySQL. For this deployment, the database used
MySQL Replication with one master node and two slave nodes, all X4270
systems. The databases were created on ZFS on the internal drives of
these systems. The unstructured data resides on a regular filesystem
created on the Sun Storage 7210 NAS appliance (AmberRoad).
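With one master and two slaves, writes must go to the master, while reads can be spread across the slaves. The post does not describe how Olio routes its queries, so the Python sketch below is only an assumption: a hypothetical router that sends statements beginning with SELECT or SHOW round-robin to the slaves and everything else to the master. Host names are hypothetical.

```python
import itertools


class ReplicatedRouter:
    """Send writes to the master; spread reads round-robin over the slaves."""

    READ_VERBS = ("SELECT", "SHOW")

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)

    def route(self, sql):
        words = sql.split(None, 1)
        verb = words[0].upper() if words else ""
        if verb in self.READ_VERBS:
            return next(self._slaves)
        return self.master  # writes, DDL and anything unrecognized


if __name__ == "__main__":
    router = ReplicatedRouter("db-master", ["db-slave1", "db-slave2"])
    for stmt in ["INSERT INTO users VALUES (1)", "SELECT * FROM users", "SELECT 1"]:
        print(router.route(stmt), "<-", stmt)
```

Because MySQL replication is asynchronous, a read sent to a slave right after a write may not see that write yet; reads that need the latest data are usually sent to the master.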

I think this is a great solution for web2.0 applications. The new
Nehalem servers are extremely powerful, allowing you to run a lot of
users on each server, which results in a smaller footprint and easier
deployment and maintenance. Of course, this requires a little more
effort in tuning the software stack to ensure it can scale and use the
CPU effectively.

The entire configuration, tuning information and performance results
are documented in detail in a Sun BluePrints article titled "A Web2.0
Deployment on OpenSolaris and Sun Systems". So check it out and let me
know if you have any questions or comments.

Although Rails is a great development environment for web applications, deploying a Rails application can be challenging for a newbie because of the myriad dependencies on various gems, native libraries, etc.

image_science is one such Ruby library; it provides an easy way to generate thumbnails. It is therefore quite popular in web2.0-type applications (there isn’t a site today that doesn’t let you upload photographs of yourself, your pets, gadgets, whatever). It is a very simple implementation, available as a Ruby gem, and so it is easy to install. However, the real work is done by a native library called FreeImage, and installing that on OpenSolaris is a little bit of work. Although I use OpenSolaris here, the instructions apply to Solaris 10 as well if you are using Ruby from Web Stack.

FreeImage

I found instructions from Joyent for building FreeImage on OpenSolaris, but they turned out to be erroneous. To install FreeImage, do the following:

  • Download the source from the repository using the command:
        cvs -z3 -d:pserver:anonymous@freeimage.cvs.sourceforge.net:/cvsroot/freeimage co -P FreeImage

  • Edit FreeImage/Makefile.solaris:
    • Change INSTALLDIR to /opt/local
    • Change all the lines for the install target as follows (as in any Makefile, each command line must begin with a tab):
          install:
                  install -m 644 -u root -g root -f $(INSTALLDIR)/include Source/FreeImage.h
                  install -m 644 -u root -g root -f $(INSTALLDIR)/lib $(STATICLIB)
                  install -m 755 -u root -g root -f $(INSTALLDIR)/lib $(SHAREDLIB)
                  ln -sf $(INSTALLDIR)/lib/$(SHAREDLIB) $(INSTALLDIR)/lib/$(LIBNAME)

  • Make all the install directories:
        mkdir -p /opt/local/lib /opt/local/include

  • Ensure you have gcc in your PATH (it’s in /usr/sfw/bin).
  • Now we are ready to build and install the library:
        gmake -f Makefile.solaris
        gmake -f Makefile.solaris install

If everything went smoothly, you should see the following files in /opt/local:
# ls /opt/local/include
FreeImage.h
# ls -l /opt/local/lib
total 13538
-rwxr-xr-x   1 root     root     2978480 Mar 17 13:35 libfreeimage-3.12.0.so
-rw-r--r--   1 root     root     3929756 Mar 17 13:35 libfreeimage.a
lrwxrwxrwx   1 root     root          22 Mar 17 13:43 libfreeimage.so.3 -> libfreeimage-3.12.0.so
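If you script the install (for example, across several machines), the same check can be automated. The Python sketch below, with the hypothetical helper `missing_files`, looks for the files shown in the listing above under a given prefix. Because `Path.exists()` follows symlinks, a dangling libfreeimage.so.3 link is reported as missing. The file names are tied to FreeImage 3.12.0, as in the listing.

```python
from pathlib import Path

# File names as they appear in the listing above (FreeImage 3.12.0).
EXPECTED = [
    "include/FreeImage.h",
    "lib/libfreeimage.a",
    "lib/libfreeimage-3.12.0.so",
    "lib/libfreeimage.so.3",  # symlink to the shared library
]


def missing_files(prefix="/opt/local"):
    """Return the expected FreeImage files that are absent under prefix.

    Path.exists() follows symlinks, so a dangling link counts as missing.
    """
    root = Path(prefix)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_files()
    print("FreeImage install looks complete" if not missing else f"missing: {missing}")
```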

ImageScience

Now that we have FreeImage installed, installing ImageScience itself is really easy. But first, make sure you have the latest version of rubygems (1.3.1); the default rubygems in OpenSolaris 2008.11 is 0.9.4.
# gem --version
0.9.4
# gem install rubygems-update
Bulk updating Gem source index for: http://gems.rubyforge.org
Successfully installed rubygems-update-1.3.1
# update_rubygems

This will print a lot of messages, but when it’s complete you should have rubygems 1.3.1.
# gem --version
1.3.1
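If you automate this check, compare version strings numerically, field by field, rather than as plain strings: as strings, "1.10.0" sorts before "1.9.9". A minimal Python sketch, assuming purely numeric dotted versions such as those printed by `gem --version` above (the helper names are hypothetical):

```python
def parse_version(text):
    """Turn '1.3.1' into (1, 3, 1) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.strip().split("."))


def needs_update(installed, required="1.3.1"):
    """True if the installed rubygems version is older than the required one."""
    return parse_version(installed) < parse_version(required)


if __name__ == "__main__":
    for installed in ["0.9.4", "1.3.1"]:
        print(installed, "needs update" if needs_update(installed) else "is current")
```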

We can now install the image_science gem. This will automatically install all dependent gems, so the messages you see depend on what you already have installed. On an OpenSolaris 2008.11 system you should see:
bash-3.2# gem install image_science
Successfully installed rubyforge-1.0.3
Successfully installed rake-0.8.4
Successfully installed hoe-1.11.0
Successfully installed ZenTest-4.0.0
Successfully installed RubyInline-3.8.1
Successfully installed image_science-1.1.3
6 gems installed

This will be followed by more messages indicating that documentation for the above gems was also installed.
You are now ready to use image_science. Have fun!

We have just released the first binary version of Apache Olio for both the PHP and Rails implementations. Both implementations have been tested quite thoroughly now, and we think they are robust enough for serious use, especially as workloads for performance testing.

I introduced Olio in a previous post. It is a toolkit that includes a sample web2.0 application, implemented in both PHP and Rails, along with a load generator to drive load against the application.

Please visit the Olio site and download the kits. If you find it interesting, I invite you to come join the project.

