The Good, the Blog & the Ugly - Tim Cook's Weblog


http://blogs.sun.com/timc/date/20090427 Monday April 27, 2009

pstime - a mash-up of ps(1) and ptime(1)

I have done some testing in the past where I needed to know the amount of CPU consumed by a process more accurately than I can get from the standard set of operating system utilities.

Recently I hit the same issue - I wanted to collect CPU consumption of mysqld.

To capture process CPU utilization over an interval on Solaris, about the best I can get is the output from a plain "prstat" command, which might look like:

mashie ) prstat -c -p `pgrep mysqld` 5 2
Please wait...
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
  7141 mysql     278M  208M cpu0    39    0   0:38:13  40% mysqld/45
Total: 1 processes, 45 lwps, load averages: 0.63, 0.33, 0.18
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
  7141 mysql     278M  208M cpu1    32    0   0:38:18  41% mysqld/45
Total: 1 processes, 45 lwps, load averages: 0.68, 0.34, 0.18

I am after data from the second sample only (I am still not sure exactly how prstat gets data for the first sample, which comes out almost instantaneously), so you can guess I will need some sed/perl that is a little more complicated than I would prefer.
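To illustrate the kind of post-processing involved, here is a minimal sketch in Python rather than sed/perl. It assumes the `prstat -c` layout shown above (a header line starting with PID, process rows, then a "Total:" line per sample); the function name `last_prstat_sample` and the `SAMPLE` variable are hypothetical and used only for this illustration.

```python
SAMPLE = """\
Please wait...
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  7141 mysql     278M  208M cpu0    39    0   0:38:13  40% mysqld/45
Total: 1 processes, 45 lwps, load averages: 0.63, 0.33, 0.18
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  7141 mysql     278M  208M cpu1    32    0   0:38:18  41% mysqld/45
Total: 1 processes, 45 lwps, load averages: 0.68, 0.34, 0.18
"""


def last_prstat_sample(text):
    """Return the process rows from the final sample in `prstat -c` output."""
    samples = []      # one list of rows per sample
    current = None    # rows of the sample being read, or None between samples
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "PID":         # a header line starts a new sample
            current = []
            samples.append(current)
        elif fields[0] == "Total:":    # the summary line ends the sample
            current = None
        elif current is not None and fields[0].isdigit():
            current.append({
                "pid": int(fields[0]),
                "cpu_pct": float(fields[8].rstrip("%")),  # CPU column
                "process": fields[9],                     # PROCESS/NLWP column
            })
    return samples[-1] if samples else []


rows = last_prstat_sample(SAMPLE)
```

Only the rows of the final sample are returned, so the near-instantaneous first sample is discarded.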

pstime reads PROCFS (i.e., the virtualized file system mounted on /proc) and captures CPU utilization figures for processes. It will report the %USR and %SYS either for a specific list of processes, or every process running on the system (i.e., running at both sample points). The start sample time is recorded in high resolution at the time a process' data is captured, and then again after N seconds, where N is the first parameter supplied to pstime.

The default output of pstime is expressed as either a percentage of whole system CPU, or CPU seconds, with four significant digits. Solaris itself records the original figures in nanosecond resolution, although we do not expect today's hardware to be that accurate.
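The arithmetic behind such a report can be sketched as follows. This is not pstime's source; it is a minimal illustration that assumes each sample is a tuple of (timestamp, user CPU time, system CPU time), all in nanoseconds, and that "percentage of whole system CPU" means dividing by the CPU time available across all CPUs. The function name `cpu_percent` and the tuple layout are hypothetical.

```python
def cpu_percent(start, end, ncpus):
    """Percent of whole-system CPU a process used between two samples.

    Each sample is (timestamp_ns, user_ns, sys_ns). This layout is an
    assumption made for the sketch, not the /proc data format.
    """
    elapsed_ns = end[0] - start[0]
    if elapsed_ns <= 0:
        raise ValueError("end sample must be later than start sample")
    capacity_ns = elapsed_ns * ncpus   # CPU-nanoseconds available system-wide
    usr_pct = 100.0 * (end[1] - start[1]) / capacity_ns
    sys_pct = 100.0 * (end[2] - start[2]) / capacity_ns
    return usr_pct, sys_pct
```

For example, a process that accumulates 4 seconds of user time and 1 second of system time over a 10-second interval on a 2-CPU system has used 20% and 5% of the whole system respectively.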

Here is an example:

mashie ) pstime 10 `pgrep sysbench\|mysqld`
  UID    PID  %USR  %SYS COMMAND
mysql   7141 44.17 3.391 /u/dist/mysql60-debug/bin/mysqld --defaults-file=/et
mysql  19870 2.517 2.490 sysbench --test=oltp --oltp-read-only=on --max-time=
mysql  19869 0.000 0.000 /bin/sh -p ./run-sysbench

Downloads

http://blogs.sun.com/timc/date/20090421 Tuesday April 21, 2009

Expanding Google's InnoDB Synchronization Improvements to Solaris

There is much excitement today at the launch of MySQL 5.4, so I will relate my story about a project I contributed to this new version.

When we started looking at performance improvements for MySQL, we were interested in "low hanging fruit", or fixes and changes that could reap measurable benefits for users in the short term.

An obvious candidate at that time was the now well-known Google SMP patch. I had seen Mark Callaghan present on this at the MySQL User Conference in 2008, and was interested to investigate.

I was pretty new to InnoDB at that time, and soon discovered that it was possibly experiencing poor scalability around its mutexes and read-write locks. InnoDB had a private implementation of adaptive mutexes and read-write locks, and this was probably not the best implementation on all or even most platforms MySQL is available on.

Now InnoDB's "private" mutexes and rw-locks were a good way to get spin-locks on all platforms, which may be a win in many cases, but as the Google team had demonstrated, it could be improved on. Indeed, I knew that adaptive spin-locks are available on Solaris, and they offer an extra advantage - if the holder of a lock is found to be off CPU, we don't bother spinning, but instead put the thread wanting the lock straight to sleep.

So, I decided to undertake a couple of performance studies of InnoDB's locking, being:

  1. Apply the Google SMP patch to MySQL 5.1 and test
  2. Modify InnoDB in 5.1 to use POSIX mutexes and RW-locks and test

The second step turned out to be quite complicated. I could not even change all of InnoDB's RW-locks to POSIX ones, as the InnoDB synchronization objects offer functionality not available via POSIX. It also meant we would be diverging more significantly from the InnoDB in 5.1, so this option - although looking promising - was shelved.

This left the Google SMP patch. It also looked promising. It was a less dramatic change, and offered scaling benefits in all the testing I did.

There was one last snag though - the mutex and RW-lock improvements in the Google SMP patch would only be applied if you were building on x86/x64 with GCC 4.1 or later, as they relied on GCC's atomic built-ins.

You can consider that we have a two-dimensional matrix of platforms that MySQL supports, with a compiler on one axis and an operating system on the other. To make a feature portable across this matrix, you need to find a portable API, write code that is portable, or write code that uses a choice of different portable APIs depending on what is available.

Now we definitely wanted to get a similar benefit for InnoDB on SPARC, and not necessarily just with GCC. In any case, GCC did not offer all of the built-in atomics for SPARC at the time. Happily, there are atomic functions available in Solaris that fit the job fine. MySQL 5.4 uses the functions if you build on Solaris without a version of GCC that supports built-in atomics.

Just so you understand though, here is (a simplified version of) what happens when you build MySQL 5.4 on your chosen platform with your chosen compiler:

  • IF (compiler has GCC built-in atomics)
    use GCC built-in atomics
  • ELSE IF (OS has atomic functions)
    use atomic functions
  • ELSE
    use traditional InnoDB synchronization objects, based on pthread_mutex*.

Summary

As Neel points out in his blog, it was an exercise we learnt something from, even if we did develop functionality that will not be used. The important thing is we know we have improved the performance of MySQL, by extending the Google SMP improvements to all Solaris users, regardless of chosen compiler.

http://blogs.sun.com/timc/date/20090409 Thursday April 09, 2009

Testing the New Pool-of-Threads Scheduler in MySQL 6.0, Part 2

In my last blog, I introduced my investigation of the "Pool-of-Threads" scheduler in MySQL 6.0. Read on to see where I went next.

I now want to take a different approach to comparing the two schedulers. It is one thing to compare how the schedulers work "flat out" - with a transaction request rate that is limited only by the maximum throughput of the system under test. I would like to instead look at how the two schedulers compare when I drive mysqld at a consistent transaction rate, then vary only the number of connections over which the transaction requests are arriving. I will aim to come up with a transaction rate that sees CPU utilization somewhere in the 40-60% range.

This is more like how real businesses use MySQL every day, as opposed to the type of benchmarking that computer companies usually engage in. This will also allow me to look at how the schedulers run at much higher connection counts - which is where the pool-of-threads scheduler is supposed to shine.

Now, I will let you all know that I first conducted my experiments with mysqld and the load generator (sysbench) on the same system. I was again not sure this would be the best methodology, primarily because I would end up having one operating system instance scheduling in some cases a very large number of sysbench threads along with the mysqld threads.

It turned out the results from this mode threw up some issues (like not being able to get my desired throughput with 2048 connections in pool-of-threads mode), so I repeated my experiments - the second set of results have the load generation coming from two remote systems, each with a dedicated 1 Gbit ethernet link to the DB server.

The CPU utilization I have captured was just the %USR plus %SYS for the mysqld process. This makes the two sets of metrics comparable.

Here are my results. First for experiments where sysbench ran on the same host as mysqld:

Then for experiments where sysbench ran on two remote hosts, each with a dedicated Gigabit Ethernet link to the database server:

As you can see, the pool-of-threads model does incur an overhead, both in terms of CPU consumption and response time, at low connection counts. As hoped though, the advantage swings in pool-of-threads' favour at higher connection counts. This is particularly noticeable in the case where our clients are remote. It is arguable that an architecture involving many hundreds or thousands of client connections is more likely to have those clients located remote from the DB server.

Now, the first issue I have is that while pool-of-threads starts to win on response time, the response time is still increasing in a similar fashion to thread-per-connection's response time (note - the scale is logarithmic). This is not what I expected, so we have a scalability problem in there somewhere.

The second issue is where I have to confess - I only got one "lucky" run where my target transaction rate was achieved for pool-of-threads at 2048 connections. For many other runs, the target rate could not be achieved, as these raw numbers show:

connections      tps  mysqld %usr  mysqld %sys  mysqld %cpu  avg-resp  95%-resp
       2048   962.22        25.23        14.93        40.16   1943.78   2368.78
       2048  1197.00        30.59        11.20        41.79    317.98    435.19
       2048   836.50        21.98        11.09        33.07   2259.36   2287.03
       2048   963.00        26.49        12.07        38.56   1333.67   1128.93
       2048   992.25        25.81        15.08        40.89   1851.17   2280.50
       2048   915.71        24.16        15.05        39.21   2220.45   2342.06
       2048   919.54        24.25        15.05        39.30   2210.95   2331.45
       2048   917.09        24.15        15.05        39.20   2217.86   2321.40
       2048   875.09        23.20        13.29        36.49   2188.69   2344.91
       2048  1180.62        31.35        14.57        45.92   1439.96   1772.86
       2048  1185.80        30.74        14.24        44.98   1185.71   1814.24
       2048  1146.90        30.34        15.23        45.57   1602.85   1842.14
       2048  1141.47        30.20        15.22        45.42   1612.34   1873.95
       2048  1158.74        30.47        12.99        43.46    999.76   1870.35
       2048  1177.59        30.67        14.97        45.64   1403.22   1838.84

This indicates we have some sort of bottleneck right at or around the 2048 thread point. This is not what we want with pool-of-threads, so I will continue my investigation.

http://blogs.sun.com/timc/date/20090408 Wednesday April 08, 2009

Testing the New Pool-of-Threads Scheduler in MySQL 6.0

I have recently been investigating a new feature of MySQL 6.0 - the "Pool-of-Threads" scheduler. This feature is a fairly significant change to the way MySQL completes tasks given to it by database clients.

To begin with, be advised that the MySQL database is implemented as a single multi-threaded process. The conventional threading model is that there are a number of "internal" threads doing administrative work (including accepting connections from clients wanting to connect to the database), then one thread for each database connection. That thread is responsible for all communication with that database client connection, and performs the bulk of database operations on behalf of the client.

This architecture exists in other RDBMS implementations. Another common implementation is a collection of processes all cooperating via a region of shared memory, usually with semaphores or other synchronization objects located in that shared memory.

The creation and management of threads can be said to be cheap, in a relative sense - it is usually significantly cheaper to create or destroy a thread rather than a process. However these overheads do not come for free. Also, the operations involved in scheduling a thread as opposed to a process are not significantly different. A single operating system instance scheduling several thousand threads on and off the CPUs is not much less work than one scheduling several thousand processes doing the same work.

Pool-of-Threads

The theory behind the Pool-of-Threads scheduler is to provide an operating mode which supports a large number of clients that will be maintaining their connections to the database, but will not be sending a constant stream of requests to the database. To support this, the database will maintain a (relatively) small pool of worker threads that take a single request from a client, complete the request, return the results, then return to the pool and wait for another request, which can come from any client. The database's internal threads still exist and operate in the same manner.
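The general pattern described here, a small fixed pool of workers serving requests that may come from any of a much larger set of clients, can be sketched in a few lines. This is only an illustration of the idea and not how mysqld implements it; the function `serve_with_pool`, the request tuples and the upper-casing "query" are all hypothetical stand-ins.

```python
import queue
import threading


def serve_with_pool(requests, pool_size):
    """Process (client_id, payload) requests with a fixed pool of workers."""
    work = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:             # sentinel: no more requests
                break
            client_id, payload = item
            reply = payload.upper()      # stand-in for executing a request
            with results_lock:
                results.append(
                    (client_id, reply, threading.current_thread().name))

    threads = [threading.Thread(target=worker, name=f"worker-{i}")
               for i in range(pool_size)]
    for t in threads:
        t.start()
    for request in requests:             # requests may come from any client
        work.put(request)
    for _ in threads:                    # one sentinel per worker
        work.put(None)
    for t in threads:
        t.join()
    return results
```

With 1000 clients and `pool_size=4`, the operating system only ever has four worker threads to schedule, where a thread-per-connection model would have created 1000.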

In theory, this should mean less work for the operating system to schedule threads that want CPU. On the other hand, it should mean some more overhead for the database, as each worker thread needs to restore the context of a database connection prior to working on each client request.

A smaller pool of threads should also consume less memory, as each thread requires a minimum amount of memory for a thread stack, before we add what is needed to store things like a connection context, or working space to process a request.

You can read more about the different threading models in the MySQL 6.0 Reference Manual.

Testing the Theory

Mark Callaghan of Google has recently had a look at whether this theory holds true. He has published his results under "No new global mutexes! (and how to make the thread/connection pool work)". Mark has identified (via this bug he logged) that the overhead for using Pool-of-Threads seems quite large - up to 63 percent.

So, my first task is to see if I get the same results. I will note here that I am using Solaris, whereas Mark was no doubt using a Linux distro. We probably have different hardware as well (although both are Intel x86).

Here is what I found when running sysbench read-only (with the sysbench clients on the same host). The "conventional" scheduler inside MySQL is known as the "Thread-per-Connection" scheduler, by the way.

This is in contrast to Mark's results - I am only seeing a loss in throughput of up to 30%.

What about the bigger picture?

These results do show there is a definite reduction in maximum throughput if you use the pool-of-threads scheduler.

I believe it is worth looking at the bigger picture however. To do this, I am going to add in two more test cases:

  • sysbench read-only, with the sysbench client and MySQL database on separate hosts, via a 1 Gb network
  • sysbench read-write, via a 1 Gb network

What I want to see is what sort of impact the pool-of-threads scheduler has for a workload that I expect is still the more common one - where our database server is on a dedicated host, accessed via a network.

As you can see, the impact on throughput is far less significant when the client and server are separated by a network. This is because we have introduced network latency as a component of each transaction and increased the amount of work the server and client need to do - they now need to perform ethernet driver, IP and TCP tasks.

This reduces the relative overhead - in CPU consumed and latency - introduced by pool-of-threads.

This is a reminder that if you are conducting performance tests on a system prior to implementing or modifying your architecture, you would do well to choose a test architecture and workload that is as close as possible to that you are intending to deploy. The same is true if you are trying to extrapolate performance testing someone else has done to your own architecture.

The Converse is Also True

On the other hand, if you are a developer or performance engineer conducting testing in order to test a specific feature or code change, a micro-benchmark or simplified test is more likely to be what you need. Indeed, Mark's use of the "blackhole" storage engine is a good idea to eliminate that processing from each transaction.

In this scenario, if you fail to make the portion of the software you have modified a significant part of the work being done, you run the risk of seeing performance results that are not significantly different, which may lead you to assume your change has negligible impact.

In my next posting, I will compare the two schedulers using a different perspective.

http://blogs.sun.com/timc/date/20090406 Monday April 06, 2009

New Feature for Sysbench - Generate Transactions at a Steady Rate

Perhaps I am becoming a regular patcher of sysbench...

I have developed a new feature for sysbench - the ability to generate transactions at a steady rate determined by the user.

This mode is enabled using the following two new options:
--tx-rate
Rate at which sysbench should attempt to send transactions to the database, in transactions per second. This is independent of num_threads. The default is 0, which means to send as many as possible (i.e., do not pause between the end of one transaction and the start of another). It is also independent of other options like --oltp-user-delay-min and --oltp-user-delay-max, which add think time between individual statements generated by sysbench.
--tx-jitter
Magnitude of the variation in time to start transactions at, in microseconds. The default is zero, which asks each thread to vary its transaction period by up to 10 percent (i.e. 10^6 / tx-rate * num-threads / 10). A standard pseudo-random number generator is used to decide each transaction start time.
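The arithmetic in these two option descriptions can be made concrete with a short sketch. It is not taken from the patch: the function names are hypothetical, and applying the jitter as a uniform offset on either side of the nominal start time is an assumption made for illustration.

```python
import random


def thread_period_us(tx_rate, num_threads):
    """Microseconds between transaction starts for one thread.

    If num_threads threads share tx_rate transactions per second, each
    thread starts one transaction every 10^6 / tx_rate * num_threads us.
    """
    return 1_000_000 / tx_rate * num_threads


def default_jitter_us(tx_rate, num_threads):
    """Default jitter: 10 percent of the per-thread period."""
    return thread_period_us(tx_rate, num_threads) / 10


def start_times_us(tx_rate, num_threads, count, jitter_us=0, seed=None):
    """Planned start times for one thread's first `count` transactions."""
    rng = random.Random(seed)
    period = thread_period_us(tx_rate, num_threads)
    if jitter_us == 0:                   # zero selects the default jitter
        jitter_us = default_jitter_us(tx_rate, num_threads)
    # Assumption: jitter is a uniform offset either side of the nominal time.
    return [i * period + rng.uniform(-jitter_us, jitter_us)
            for i in range(count)]
```

For example, at 1000 transactions per second spread over 8 threads, each thread starts a transaction every 8000 microseconds, and the default jitter is 800 microseconds.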

My need for these options is simple - I want to generate a steady load for my MySQL database. It is one thing to measure the maximum achievable throughput as you change your database configuration, hardware, or num-threads. I am also interested in how the system (or just mysqld's) utilization changes, at the same transaction rate, when I change other variables.

An upcoming post will demonstrate a use of sysbench in this mode.

For the moment my new feature can be added to sysbench 0.4.12 (and probably many earlier versions) via this patch. These changes are tested on Solaris, but I did choose only APIs that are documented as also available on Linux. I have also posted my patch on sourceforge as a sysbench feature enhancement request.

http://blogs.sun.com/timc/date/20090114 Wednesday January 14, 2009

You Learn Something Every Day

Just learned how to save about a bazillion keystrokes over the remainder of my file-editing & programming life.

This is because I just learned that C-M-l (or Control-Meta-l, where "Meta" is the "Diamond" key on a Sun keyboard) is the (X)Emacs key sequence for "switch-to-other-buffer".

I have been doing this via Control-x, "b", Enter, or in other words, switch-buffer, then pressing Enter to accept the default, which has the same definition as "other-buffer". And I do it all the time.

D'oh...

By the way, I have been using (X)Emacs for approximately 20 years. I was lucky enough to find it when I first started on Unix, because I felt Vi was not powerful enough. Of course, any mention of Emacs and Vi in the same breath is likely to start a war, so I apologize to those who are not interested...

http://blogs.sun.com/timc/date/20081217 Wednesday December 17, 2008

MySQL 5.1 Memory Allocator Bake-Off

After getting sysbench running properly with a scalable memory allocator (see last post), I can now return to what I was originally testing - what memory allocator is best for the 5.1 server (mysqld).

This stems out of studies I have made of some patches that have been released by Google. You can read about the work Google has been doing here.

I decided I wanted to test a number of configurations based on the MySQL community source, 5.1.28-rc, namely:

  • The baseline - no Google SMP patch, default memory allocator (5.1.28-rc)
  • With Google SMP patch, mem0pool enabled, no custom malloc (pool)
  • With Google SMP patch, mem0pool enabled, linked with mtmalloc (pool-mtmalloc)
  • With Google SMP patch, mem0pool disabled, linked with tcmalloc (TCMalloc)
  • With Google SMP patch, mem0pool disabled, linked with umem (umem)
  • With Google SMP patch, mem0pool disabled, linked with mtmalloc (mtmalloc)

Here are some definitions, by the way:

mem0pool InnoDB's internal "memory pools" feature, found in mem0pool.c (NOTE: Even if this is enabled, other parts of the server will not use this memory allocator - they will use whatever allocator is linked with mysqld)
tcmalloc The "libtcmalloc_minimal.so.0.0.0" that is built from google-perftools-0.99.2
Hoard The Hoard memory allocator, version 3.7.1
umem The libumem library (included with Solaris)
mtmalloc The mtmalloc library (included with Solaris)

My test setup was a 16-CPU Intel system, running Solaris Nevada build 100. I chose to use only an x86 platform, as I was not able to build tcmalloc on SPARC. I also chose to run with the database in TMPFS, and with an InnoDB buffer size smaller than the database size. This was to ensure that we would be CPU-bound if possible, rather than slowed by I/O.

If I built any package (no need for mtmalloc or umem), I used GCC 4.3.1, except for Hoard, which seemed to prefer the Sun Studio 11 C compiler (over Sun Studio 12 or GCC).

My test was a sysbench OLTP read-write run, of 10 minutes. Each series of runs at different thread counts is preceded by a database re-build and 20 minute warmup. Here are my throughput results for 1-32 SysBench threads, in transactions per second:

These results show that while the Google SMP changes are a benefit, the disabling of InnoDB's mem0pool does not seem to provide any further benefit for my configuration. My results also show that TCMalloc is not a good allocator for this workload on this platform, and Hoard is particularly bad, with significant negative scaling above 16 threads.

The remaining configurations are pretty similar, with mtmalloc and umem a little ahead at higher thread counts.

Before I get a ton of comments and e-mails, I would like to point out that I did some verification of my TCMalloc builds, as the results I got surprised me. I verified that it was using the supplied assembler for atomic routines, and I built it with optimization (-O3) and without.

I also discovered that TCMalloc was emitting this diagnostic when mysqld was starting up:

src/tcmalloc.cc:151] uname failed assuming no TLS support (errno=0)

I rectified this with a change in tcmalloc.cc, and called this configuration "TCMalloc -O3, TLS". It is shown against the other two configurations below.

I often like to have a look at what the CPU cost of different configurations is. This helps to demonstrate headroom, and whether different throughput results may be due to less efficient code or something else. The chart below lists what I found - note that this is system-wide CPU (user & system) utilization, and I was running my SysBench client on the same system.

Lastly, I did do one other comparison, which was to measure how much each memory allocator affected the virtual size of mysqld. I did not expect much difference, as the most significant consumer - the InnoDB buffer pool - should dominate with large long-lived allocations. This was indeed the case, and memory consumption grew little after the initial start-up of mysqld. The only allocator that then caused any noticeable change was mtmalloc, which for some reason made the heap grow by 35MB following a 5 minute run (it was originally 1430 MB).

References

http://blogs.sun.com/timc/date/20081212 Friday December 12, 2008

Scalability and Stability for SysBench on Solaris

My mind is playing "Suffering Succotash..."

I have been working on MySQL performance for a while now, and the team I am in have discovered that SysBench could do with a couple of tweaks for Solaris.

Sidebar - sysbench is a simple "OLTP" benchmark which can test multiple databases, including MySQL. Find out all about it here, but go to the download page to get the latest version.

To simulate multiple users sending requests to a database, sysbench uses multiple threads. This leads to two issues we have identified with SysBench on Solaris, namely:

  • The implementation of random() is explicitly identified as unsafe in multi-threaded applications on Solaris. My team has found this is a real issue, with occasional core-dumps happening to our multi-threaded SysBench runs.
  • SysBench does quite a bit of memory allocation, and could do with a more scalable memory allocator.

Neither of these issues is necessarily relevant only to Solaris, by the way.

Luckily there are simple solutions. We can fix the random() issue by using lrand48() - in effect a drop-in replacement. Then we can fix the memory allocator by simply choosing to link with a better allocator on Solaris.

To help with a decision on memory allocator, I ran a few simple tests to check the performance of the two best-known scalable allocators available in Solaris. Here are the results ("libc" is the default memory allocator):

Throughput

To see the differences more clearly, let's do a relative comparison, using "umem" (A.K.A. libumem) as the reference:

Relative Throughput

So - around 20% less throughput at 16 or 32 threads. Very little difference at 1 thread, too (where the default memory allocator should be the one with the lowest synchronization overhead).

Where you see another big difference is CPU cost per transaction:

CPU Cost

I will just point out two other reasons why I would recommend libumem:

I have logged these two issues as sysbench bugs:

However, if you can't wait for the fixes to be released, try these:

http://blogs.sun.com/timc/date/20081013 Monday October 13, 2008

The Seduction of Single-Threaded Performance

The following is a dramatization. It is used to illustrate some concepts regarding performance testing and architecting of computer systems. Artistic license may have been taken with events, people and time-lines. The performance data I have listed is real and current however.

I got contacted recently by the Systems Architect of latestrage.com. He has been a happy Sun customer for many years, but was a little displeased when he took delivery of a beta test system of one of our latest UltraSPARC servers.

"Not very fast", he said.

"Is that right, how is it not fast?", I inquired eagerly.

"Well, it's a lot slower than one of the LowMarginBrand x86 servers we just bought", he trumpeted indignantly.

"How were you measuring their speed?", I asked, getting wary.

"Ahh, simple - we were compressing a big file. We were careful to not let it be limited by I/O bandwidth or memory capacity, though..."

What then ensues is a discussion about what was being used to test "performance", whether it matches latestrage.com's typical production workload and further details about architecture and objectives.

Data compression utilities are a classic example of a seemingly mature area in computing. Lots of utilities, lots of different algorithms, a few options in some utilities, reasonable portability between operating systems, but one significant shortcoming - there is no commonly available utility that is multi-threaded.

Let me pretend I am still in this situation of using compression to evaluate system performance, and I am wanting to compare the new Sun SPARC Enterprise T5440 with a couple of current x86 servers. Here is my own first observation about such a test, using a single-threaded compression utility:

Single-Threaded Throughput

Now if you browse down to older blog entries, you will see I have written my own multi-threaded compression utility. It consists of a thread to read data, as many threads to compress or decompress data as demand requires, and one thread to write data. Let me see whether I can fully exploit the performance of the T5440 with Tamp...
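The reader/compressors/writer structure described above can be sketched as a small pipeline. This is not Tamp's code: it uses zlib as a stand-in for Tamp's algorithm, a fixed number of compression threads rather than starting them on demand, and a made-up output framing (a 4-byte length before each compressed block). The names `compress_stream` and `decompress_stream` are hypothetical.

```python
import io
import queue
import threading
import zlib


def compress_stream(src, dst, workers=4, block_size=256 * 1024):
    """Compress src to dst using one reader, N compressors and one writer."""
    todo = queue.Queue(maxsize=workers * 2)   # blocks waiting for compression
    done = queue.Queue()                      # compressed blocks, any order

    def reader():
        seq = 0
        while True:
            block = src.read(block_size)
            if not block:
                break
            todo.put((seq, block))
            seq += 1
        for _ in range(workers):              # tell every compressor to stop
            todo.put(None)

    def compressor():
        while True:
            item = todo.get()
            if item is None:
                done.put(None)                # tell the writer we are finished
                break
            seq, block = item
            done.put((seq, zlib.compress(block)))

    def writer():
        pending = {}                          # blocks that arrived early
        next_seq = 0
        finished = 0
        while finished < workers:
            item = done.get()
            if item is None:
                finished += 1
                continue
            pending[item[0]] = item[1]
            while next_seq in pending:        # emit blocks in input order
                data = pending.pop(next_seq)
                dst.write(len(data).to_bytes(4, "big"))
                dst.write(data)
                next_seq += 1

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    threads += [threading.Thread(target=compressor) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def decompress_stream(src):
    """Reverse compress_stream (single-threaded, for checking the result)."""
    out = bytearray()
    while True:
        header = src.read(4)
        if not header:
            break
        out += zlib.decompress(src.read(int.from_bytes(header, "big")))
    return bytes(out)
```

Because blocks can finish compressing out of order, the writer holds early arrivals until the next expected block turns up. Note also that a single reader thread feeds all of the compressors in this design.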

Well, this turned out to be not quite the end of the story. I designed my tests with my input file located on a TMPFS (in-memory) filesystem, and with the output being discarded. This left the system focusing on the computation of compression, without being obscured by I/O. This is the same objective that latestrage.com had.

What I found on the T5440 was that Tamp would not use more than 12-14 threads for compression - it was limited by the speed at which a single thread could read data from TMPFS.

So, I chose to use another dimension by which we can scale up work on a server - add more sources of workload. This is represented by multiple "Units of Work" in my chart below.

After completing my experiments I discovered that, as expected, the T5440 may disappoint if we restrict ourselves to a workload that can not fully utilize the available processing capacity. If we add more work however, we will find it handily surpasses the equivalent 4-socket quad-core x86 systems.

Multi-Threaded Throughput

Observing Single-Thread Performance on a T5440

A little side-story, and another illustration of how inadequate a single-threaded workload is at determining the capability of the T5440. Take a look at the following output from vmstat, and answer this question:

Is this system "maxed out"?

(Note: the "us", "sy" and "id" columns list how much CPU time is spent in User, System and Idle modes, respectively)

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr d0 d1 d2 d3   in   sy   cs us sy id 
 0 0 0 1131540 12203120 1  8  0  0  0  0  0  0  0  0  0 3359 1552 419  0  0 100 
 0 0 0 1131540 12203120 0  0  0  0  0  0  0  0  0  0  0 3364 1558 431  0  0 100 
 0 0 0 1131540 12203120 0  0  0  0  0  0  0  0  0  0  0 3366 1478 420  0  0 99 
 0 0 0 1131540 12203120 0  0  0  0  0  0  0  0  0  0  0 3354 1500 441  0  0 100 
 0 0 0 1131540 12203120 0  0  0  0  0  0  0  0  0  0  0 3366 1549 460  0  0 99 

Well, the answer is yes. It is running a single-threaded process, which is using 100% of one CPU. For the sake of my argument we will say the application is the critical application on the system. It has reached its highest throughput and is therefore "maxed out". You see, when one CPU represents less than 0.5% of the entire CPU capacity of a system, then a single saturated CPU will be rounded down to 0%. In the case of the T5440, one CPU is 1/256th or 0.39%.
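The rounding effect is easy to check. The sketch below assumes the displayed figure is the system-wide percentage rounded to the nearest whole number; the exact rounding vmstat applies is not stated here, and the function name is hypothetical.

```python
def displayed_busy_percent(busy_cpus, total_cpus):
    """System-wide busy percentage, rounded to a whole number.

    Assumption: the tool rounds to the nearest integer percent.
    """
    return round(100 * busy_cpus / total_cpus)


one_cpu_share = 100 / 256    # one saturated CPU on a 256-CPU system
```

On a 256-CPU system, one saturated CPU is 0.390625% of capacity and is displayed as 0, and it takes two saturated CPUs (0.78%) before the display shows 1.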

Here is a tip for watching a system that might be doing nothing, but then again might be doing something as fast as it can:

$ mpstat 3 | grep -v ' 100$'

This is what you might see:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    2   0   48   204    4    2    0    0    0    0   127    1   1   0  99
 32    0   0    0     2    0    3    0    0    0    0     0    0   8   0  92
 48    0   0    0     6    0    0    5    0    0    0     0  100   0   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    1   0   49   205    5    3    0    0    0    0   117    0   1   0  99
 32    0   0    0     4    0    5    0    0    1    0     0    0  14   0  86
 48    0   0    0     6    0    0    5    0    0    0     0  100   0   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0   48   204    4    2    0    0    0    0   103    0   1   0  99
 32    0   0    0     3    0    4    0    0    0    0     3    0  14   0  86
 48    0   0    0     6    0    0    5    0    0    0     0  100   0   0   0

mpstat uses "usr", "sys", and "idl" to represent CPU consumption. For more on "wt" you can read my older blog.

For more on utilization, see the CPU/Processor page on solarisinternals.com

To read more about the Sun SPARC Enterprise T5440 which is announced today, go to Allan Packer's blog listing all the T5440 blogs.

Tamp - a Multi-Threaded Compression Utility

Some more details on this:

  • It uses a freely-available Lempel-Ziv-derived algorithm, optimised for compression speed
  • It was compiled using the same compiler and optimization settings for SPARC and x86.
  • It uses a compression block size of 256KB, so files smaller than this will not gain much benefit
  • I was compressing four 1GB database files. They were being reduced in size by a little over 60%.
  • Browse my blog for more details and a download

http://blogs.sun.com/timc/date/20080926 Friday September 26, 2008

Tamp - a Lightweight Multi-Threaded Compression Utility

UPDATE: Tamp has been ported to Linux, and is now at version 2.5

Packages for Solaris (x86 and SPARC), and a source tarball are available below.

Back Then

Many years ago (more than I care to remember), I saw an opportunity to improve the performance of a database backup. This was before the time of Oracle on-line backup, so the best choice at that time was to:

  1. shut down the database
  2. export to disk
  3. start up the database
  4. back up the export to tape

The obvious thing to improve here is the time between steps 1 and 3. We had a multi-CPU system running this database, so it occurred to me that perhaps compressing the export may speed things up.

I say "may" because it is important to remember that if the compression utility has lower throughput than the output of the database export (i.e. raw output; excluding any I/O operations to save that data) we may just end up with a different bottleneck, and not run any faster; perhaps even slower.
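
To make the "different bottleneck" point concrete, here is a rough Python model of a streaming export where the slowest stage sets the pace. Every number in it is hypothetical (I did not record the real rates), and the function name and parameters are mine; treat it as a sketch of the reasoning, not a measurement.

```python
def elapsed_seconds(size_mb, export_mbps, disk_mbps, compress_mbps=None, ratio=1.0):
    """Elapsed time for a streaming pipeline limited by its slowest stage.

    export_mbps and compress_mbps are rates of *uncompressed* data in MB/s.
    disk_mbps is the rate at which bytes can be written to disk.
    ratio is compressed size / original size; it is only used when a
    compressor is present.
    """
    if compress_mbps is None:
        limit = min(export_mbps, disk_mbps)
    else:
        # The disk only sees compressed bytes, so it keeps up with
        # disk_mbps / ratio of uncompressed data.
        limit = min(export_mbps, compress_mbps, disk_mbps / ratio)
    return size_mb / limit


# Hypothetical figures: 1000 MB export, exp produces 50 MB/s, disk writes 30 MB/s.
plain = elapsed_seconds(1000, 50, 30)
slow = elapsed_seconds(1000, 50, 30, compress_mbps=20, ratio=0.4)
fast = elapsed_seconds(1000, 50, 30, compress_mbps=100, ratio=0.4)
print(f"no compression:  {plain:.1f} s")
print(f"slow compressor: {slow:.1f} s")
print(f"fast compressor: {fast:.1f} s")
```

With these made-up rates the slow compressor makes the export take longer than no compression at all, while the fast one moves the bottleneck back to the export itself.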

As it happens, this era also pre-dated gzip and other newer compression utilities. So, using the venerable old "compress", it actually was slower. It did save some disk space, because Oracle export files are eminently compressible.

So, I went off looking for a better compression utility. I was now more interested in something that was fast. It needed to not be the bottleneck in the whole process.

What I found did the trick - it reduced the export time by 20-30%, and saved some disk space as well. The reason it saved time was that it could compress at least as fast as Oracle's "exp" utility could produce data, and it eliminated some of the I/O - the real bottleneck.

More Recently

I came across a similar situation more recently - I was again doing "cold" database restores and wanted to speed them up. It was a little more challenging this time, as the restore was already parallel at the file level, and there were more files than CPUs involved (72). In the end, I could not speed up my 8-odd minute restore of ~180GB, unless I already had the source files in memory (via the filesystem cache). That would only work in some cases, and is unlikely to work in the "real world", where you would not normally want this much spare memory to be available to the filesystem.

Anyway, it took my restore down to about 3 minutes in cases where all my compressed backup files were in memory - this was because it had now eliminated all read I/O from the set of arrays holding my backup. This meant I had eliminated all competing I/O's from the set of arrays where I was re-writing the database files.

Multi-Threaded Lightweight Compression

I could not even remember the name of the utility I used years ago, but I knew already that I would need something better. The computers of 2008 have multiple cores, and often multiple hardware threads per core. All of the current included-in-the-distro compression utilities (well, almost all utilities) for Unix are still single-threaded - a very effective way to limit throughput on a multi-CPU system.

Now, there are some multi-threaded compression utilities available, if not widely available:

  • PBZIP2 is a parallel implementation of BZIP2. You can find out more here
  • PIGZ is a parallel implementation of GZIP, although it turns out it is not possible to decompress a GZIP stream with more than one thread. PIGZ is available here.

Here is a chart showing some utilities I have tested on a 64-way Sun T5220. The place to be on this chart is toward the bottom right-hand corner.

Here is a table with some of the numbers from that chart:

Utility          Reduction (%)   Elapsed (s)
tamp                     66.18          0.31
pigz --fast              71.18          1.04
pbzip2 --fast            77.17          4.17
gzip --fast              71.10         16.13
gzip                     75.73         40.29
compress                 61.61         18.21

To answer your question - yes, tamp really is 50-plus-times faster than "gzip --fast".
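
If you want to check that claim against the table, the arithmetic is just a ratio of elapsed times. A small Python sketch, using only the numbers from the table above (the variable names are mine):

```python
# Elapsed times in seconds, copied from the table above.
ELAPSED = {
    "tamp": 0.31,
    "pigz --fast": 1.04,
    "pbzip2 --fast": 4.17,
    "gzip --fast": 16.13,
    "gzip": 40.29,
    "compress": 18.21,
}

# How many times longer each utility took than tamp.
slowdown_vs_tamp = {name: secs / ELAPSED["tamp"] for name, secs in ELAPSED.items()}

for name, factor in sorted(slowdown_vs_tamp.items(), key=lambda item: item[1]):
    print(f"{name:15s} {factor:7.1f}x")
```

The "gzip --fast" row works out to roughly 52 times the elapsed time of tamp.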

Tamp

The utility I have developed is called tamp. As the name suggests, it does not aim to provide the best compression (although it is better than compress, and sometimes beats "gzip --fast").

It is however a proper parallel implementation of an already fast compression algorithm.
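
To illustrate what "parallel implementation" means here, the following is a minimal Python sketch of block-parallel compression. It is not tamp's code or file format: it swaps in zlib at its fastest level in place of the algorithm tamp uses, the 256KB block size is borrowed from my earlier note on tamp, and the length-prefixed framing is invented purely for this example.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 256 * 1024  # 256KB blocks, each compressed independently


def _compress_block(block):
    # Level 1 is zlib's fastest setting; a stand-in for a speed-oriented algorithm.
    return zlib.compress(block, 1)


def compress_parallel(data, workers=4):
    """Split data into blocks, compress them on a thread pool, and frame the result.

    Framing (hypothetical): each block is a 4-byte big-endian length
    followed by that many bytes of compressed data.
    """
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so blocks stay in sequence.
        compressed = list(pool.map(_compress_block, blocks))
    return b"".join(struct.pack(">I", len(c)) + c for c in compressed)


def decompress_parallel(stream, workers=4):
    """Reverse compress_parallel(): unframe the blocks and decompress them in parallel."""
    chunks = []
    offset = 0
    while offset < len(stream):
        (length,) = struct.unpack_from(">I", stream, offset)
        offset += 4
        chunks.append(stream[offset:offset + length])
        offset += length
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, chunks))


sample = b"Oracle export files are eminently compressible. " * 50000
packed = compress_parallel(sample)
print(len(sample), "->", len(packed), "bytes")
```

Because every block is independent, both directions can use as many threads as there are blocks. The cost is that input smaller than one block gets no parallelism at all.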

If you wish to use it, feel free to download it. I will be blogging in the near future on a different performance test I conducted using tamp.

Compression Algorithm

Tamp makes use of the compression algorithm from QuickLZ version 1.40. I have tested a couple of other algorithms, and the code in tamp.c can be easily modified to use a different algorithm. You can get QuickLZ from here (you will need to download source yourself if you want to build tamp).

Resources

http://blogs.sun.com/timc/date/20080906 Saturday September 06, 2008

Installing Solaris from a USB Disk

I regularly do a full install of a Solaris Development release onto my laptop. Why full? Well, that is another story for another day, but it is not because the Solaris Upgrade software, including Live Upgrade, is lacking.

I decided I no longer see the sense in burning a DVD to do this, and I know that Solaris can boot from a USB device.

I used James C. Liu's blog as an inspiration, but the following is what I have found worked well to boot an install image located on a USB disk. You may also be interested in the Solaris Ready USB FAQ.

NOTE: This procedure only has a chance of working if you have a version of Solaris 10 or later that uses GRUB and has a USB driver that works with your drive.

  1. Set up an 8GB "Solaris2" partition on the USB drive using fdisk. Make it the active partition.
  2. Set up a UFS slice using all but the first cylinder of that 8GB as slice 0 using format. Run newfs. Mount.

    The first cylinder ends up being dedicated to a "boot" slice. I do not know what it is used for; perhaps it avoids overwriting the PC-style partition table & boot program.

  3. Mount the DVD ISO using lofiadm/mount (hint: google lofiadm solaris iso)
  4. Use cpio to copy the contents of the DVD ISO into the UFS partition on the USB drive, e.g.:

    # cd <rootdir of DVD ISO>
    # find . | cpio -pdum <rootdir of USB filesystem>
    

  5. Run installgrub to install the stage1 & stage2 files from the DVD ISO onto the USB drive. If the filesystem on your USB drive has mounted as /dev/dsk/c2t0d0s0 for example, then use:

    # cd <rootdir of DVD ISO>
    # /sbin/installgrub boot/grub/stage1 boot/grub/stage2 /dev/rdsk/c2t0d0s0
    

  6. Boot off the USB disk. It uses the same GRUB install that would be on a DVD.
  7. Now, I cannot remember whether the next step was to:

    • Wait for the install to fail (unable to find distribution), or:

    • Exit/quit out of installation

    ...but you need to get to a shell.

  8. Manually mount the USB partition at /cdrom

    NOTE: your controller numbers are probably not as you expect at this point, so double-check what you are mounting.

  9. Re-start the install
    I used "suninstall". I think you can use "solaris-install" instead.

The install seemed to run fine from there; however, it went through a sysconfig stage after the reboot.

Then I ended up with one teeny problem - my X server would not start.

I discovered some issues with fonts, and then decided to check the install log. I discovered a number of packages had reported status like:


Installation of <SUNWxwfnt> partially failed.
19997 blocks
pkgadd: ERROR: class action script did not complete successfully

Installation of <SUNWxwcft> partially failed.

Installation of <SUNW5xmft> partially failed.

Installation of <SUNW5ttf> partially failed.

Installation of <SUNWolrte> partially failed.

Installation of <SUNWhttf> partially failed.

I have since pkgrm/pkgadd-ed these packages (using -R while running the laptop on an older release with the new boot environment mounted), and all is now well.

http://blogs.sun.com/timc/date/20080904 Thursday September 04, 2008

Building GCC 4.x on Solaris

I needed to build GCC 4.3.1 for my x86 system running a recent development build of Solaris. I thought I would share what I discovered, and then improved on.

I started with Paul Beach's Blog on the same topic, but I knew it had a couple of shortcomings, namely:

  • No mention of a couple of pre-requisites that are mentioned in the GCC document Prerequisites for GCC
  • A mysterious "cannot compute suffix of object files" error in the build phase
  • No resolution of how to generate binaries that have a useful RPATH (see Shared Library Search Paths for a discussion on the importance of RPATH).

I found some help on this via this forum post, but here is my own cheat sheet.

  1. Download & install GNU Multiple Precision Library (GMP) version 4.1 (or later) from sunfreeware.com. This will end up located in /usr/local.
  2. Download, build & install MPFR Library version 2.3.0 (or later) from mpfr.org. This will also end up in /usr/local.
  3. Download & unpack the GCC 4.x base source (the one of the form gcc-4.x.x.tar.gz) from gcc.gnu.org
  4. Download my example config_make script, edit as desired (you probably want to change OBJDIR and PREFIX, and you may want to add other configure options).
  5. Run the config_make script
  6. "gmake install" as root (although I instead create the directory matching PREFIX, make it writable by the account doing the build, then "gmake install" using that account).

You should now have GCC binaries that look for the shared libraries they need in /usr/sfw/lib, /usr/local/lib and PREFIX/lib, without anyone needing to set LD_LIBRARY_PATH. In particular, modern versions of Solaris will have a libgcc_s.so in /usr/sfw/lib.

If you copy your GMP and MPFR shared libraries (which seem to be needed by parts of the compiler) into PREFIX/lib, you will also have a self-contained directory tree that you can deploy to any similar system more simply (e.g. via rsync, tar, cpio, "scp -pr", ...)

http://blogs.sun.com/timc/date/20080421 Monday April 21, 2008

Comparing the UltraSPARC T2 Plus to Other Recent SPARC Processors

Update - now the UltraSPARC T2 Plus has been released, and is available in several new Sun servers. Allan Packer has published a new collection of blog entries that provide lots of detail.

Here is my updated table of details comparing a number of current SPARC processors. I cannot guarantee 100% accuracy on this, but I did quite a bit of reading...

Name UltraSPARC IV+® SPARC64™ VI UltraSPARC™ T1 UltraSPARC™ T2 UltraSPARC™ T2 Plus
Codename Panther Olympus-C Niagara Niagara 2 Victoria Falls
Physical
process 90nm 90nm 90nm 65nm 65nm
die size 335 mm² 421 mm² 379 mm² 342 mm²
pins 1368 1933 1831
transistors 295 M 540 M 279 M 503 M
clock 1.5 – 2.1 GHz 2.15 – 2.4 GHz 1.0 – 1.4 GHz 1.0 – 1.4 GHz 1.2 – 1.4 GHz
Architecture
cores 2 2 8 8 8
threads/core 1 2 4 8 8
threads/chip 2 4 32 64 64
FPU : IU 1 : 1 1 : 1 1 : 8 1 : 1 1 : 1
integration 8 × small crypto 8 × large crypto, PCI-E, 2 × 10GbE 8 × large crypto, PCI-E, multi-socket coherency
virtualization domains¹ hypervisor
L1 i$ 64K/core 128K/core 16K/core
L1 d$ 64K/core 128K/core 8K/core
L2 cache (on-chip) 2MB, shared, 4-way, 64B lines 6MB, shared, 10-way, 256B lines 3MB, shared, 12-way, 64B lines 4MB, shared, 16-way, 64B lines
L3 cache 32MB shared, 4-way, tags on-chip, 64B lines n/a n/a
MMU on-chip
on-chip, 4 × DDR2 on-chip, 4 × FB-DIMM on-chip, 2 × FB-DIMM
Memory Models TSO TSO TSO, limited RMO
Physical Address Space 43 bits 47 bits 40 bits
i-TLB 16 FA + 512 2-way SA 64 FA
d-TLB 16 FA + 512 2-way SA 64 FA 128 FA
combined TLB 32 FA + 2048 2-way SA
Page sizes 8K, 64K, 512K, 4M, 32M, 256M 8K, 64K, 512K, 4M, 32M, 256M 8K, 64K, 4M, 256M
Memory bandwidth² (GB/sec) 9.6 25.6 60+ 32

Footnotes

  • 1 - domains are implemented above the processor/chip level
  • 2 - theoretical peak - does not take cache coherency or other limits into account

Glossary

  • FA - fully-associative
  • FPU - Floating Point Unit
  • i-TLB - Instruction Translation Lookaside Buffer (d means Data)
  • IU - Integer (execution) Unit
  • L1 - Level 1 (similarly for L2, L3)
  • MMU - Memory Management Unit
  • RMO - Relaxed Memory Order
  • SA - set-associative
  • TSO - Total Store Order

References:

http://blogs.sun.com/timc/date/20080409 Wednesday April 09, 2008

What Drove Processor Design Toward Chip Multithreading (CMT)?

I thought of a way of explaining the benefit of CMT (or more specifically, interleaved multithreading - see this article for details) using an analogy the other day. Bear with me as I wax lyrical on computer history...

Deep back in the origins of the computer, there was only one process (as well as one processor). There was no operating system, so in turn there were no concepts like:

  • scheduling
  • I/O interrupts
  • time-sharing
  • multi-threading

What am I getting at? Well, let me pick out a few of the advances in computing, so I can explain why interleaved multithreading is simply the next logical step.

The first computer operating systems (such as GM-NAA I/O) simply replaced (automated) some of the tasks that were undertaken manually by a computer operator - load a program, load some utility routines that could be used by the program (e.g. I/O routines), record some accounting data at the completion of the job. They did nothing during the execution of the job, but they had nothing to do - no other work could be done while the processor was effectively idle, such as when waiting for an I/O to complete.

Then multi-processing operating systems were developed. Suddenly we had the opportunity to use the otherwise wasted CPU resource while one program was stalled on an I/O. In this case the O.S. would switch in another program. Generically this is known as scheduling, and operating systems developed (and still develop) more sophisticated ways of sharing out the CPU resources in order to achieve the greatest/fairest/best utilization.

At this point we had enshrined in the OS the idea that CPU resource was precious, not plentiful, and there should be features designed into the system to minimize its waste. This would reduce or delay the need for that upgrade to a faster computer as we continued to add new applications and features to existing applications. This is analogous to conserving water to offset the need for new dams & reservoirs.

With CMT, we have now taken this concept into silicon. If we think of a load or store to or from main (uncached) memory as a type of I/O, then thread switching in interleaved multithreading is just like the idea of a voluntary context switch. We are not giving up the CPU for the duration of the "I/O", but we are giving up the execution unit, knowing that if there is another thread that can use it, it will.
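
A back-of-envelope way to see the payoff is a toy model in Python, shown below. It assumes every thread alternates between a fixed burst of compute cycles and a fixed memory stall, and that switching threads costs nothing. Those assumptions, the cycle counts and the function name are all mine; this is a simplification to show the shape of the idea, not a description of how any UltraSPARC core actually schedules threads.

```python
def core_utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles the execution unit is busy in a toy interleaving model.

    Each thread computes for compute_cycles, then stalls on memory for
    stall_cycles. While one thread is stalled, another can use the execution
    unit, so utilization grows with the thread count until it saturates at 1.0.
    """
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, threads * per_thread)


# Hypothetical workload: 10 cycles of work for every 70 cycles waiting on memory.
for n in (1, 2, 4, 8):
    print(f"{n} thread(s): {core_utilization(n, compute_cycles=10, stall_cycles=70):.1%}")
```

With these made-up numbers a single thread keeps the execution unit busy only one cycle in eight, and it takes eight threads to keep it fully occupied.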

In a way, we are delaying the need to increase the clock rate or pipe-lining abilities of the cores by taking this step.

Now the underlying details of the implementation can be more complex than this (and they are getting more complex as we release newer CPU architectures like the UltraSPARC T2 Plus - see the T5140 Systems Architecture Whitepaper for details), but this analogy to I/O's and context switches works well for me to understand why we have chosen this direction.

To continue to throw engineering resources at faster, more complicated CPU cores seems to be akin to the idea of the mainframe (the closest descendant to early computers) - just make it do more of the same type of workload.

See here for the full collection of UltraSPARC T2 Plus blogs

http://blogs.sun.com/timc/date/20080221 Thursday February 21, 2008

Margins in Consumer Telephony

Here is a little observation on telephone margins that is dear to my heart. Below is a list of rates (in US dollars per minute, taxes and other fees not shown) for various methods of calling from the US to a land-line in Australia. The last four options use VoIP.

Source     Carrier            Add-on Plan              Add-on $/month  Rate
Land-line  AT&T               none – Peak              -               $4.00
Land-line  AT&T               none – Off-peak          -               $2.76
Mobile     AT&T               none                     -               $3.49
Mobile     AT&T               World Connect            $3.99           $0.09
Land-line  AT&T               Occasional Calling       $1.00           $1.75
Land-line  AT&T               Worldwide Value Calling  $5.00           $0.09
Land-line  Time-Warner Cable                           -               $0.10
Land-line  Comcast                                                     $0.09
Land-line  Vonage                                                      $0.05
Land-line  AT&T CallVantage                                            $0.04
Land-line  Callcentric                                                 $0.0231
Land-line  CallWithUs                                                  $0.0148

As you may see, there is a 27000% range in these numbers. Even within one carrier (AT&T) there is a 100x range. Plenty of opportunity for profit.
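
For anyone who wants to check my sums, here is the arithmetic as a short Python sketch. The rates are the per-minute figures from the list above; grouping AT&T CallVantage with the other AT&T rows is my own choice for the "one carrier" comparison.

```python
# (carrier, US dollars per minute), copied from the list of rates above.
RATES = [
    ("AT&T", 4.00),
    ("AT&T", 2.76),
    ("AT&T", 3.49),
    ("AT&T", 0.09),
    ("AT&T", 1.75),
    ("AT&T", 0.09),
    ("Time-Warner Cable", 0.10),
    ("Comcast", 0.09),
    ("Vonage", 0.05),
    ("AT&T CallVantage", 0.04),
    ("Callcentric", 0.0231),
    ("CallWithUs", 0.0148),
]


def spread(rates):
    """Ratio of the most expensive rate to the cheapest."""
    prices = [price for _, price in rates]
    return max(prices) / min(prices)


overall = spread(RATES)
# Assumption: every row whose carrier name starts with "AT&T" counts as one carrier.
att_only = spread([row for row in RATES if row[0].startswith("AT&T")])

print(f"overall: {overall:.0f}x ({overall * 100:.0f}%)")
print(f"AT&T:    {att_only:.0f}x")
```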

Hopefully it is useful to be aware there can be some very steep rates for ex-pat Aussies to call home if they are away from their preferred carrier.

I have been quite satisfied with CallWithUs, if anyone is interested. They even have a call-back feature if I want to call from my mobile.

While I'm on the topic, I should also mention this helpful message I got from my wireless (mobile) provider (although they are no longer my provider):

When you're on the go and don't have the info you need, AT&T 411 is here to help. Whether you're searching for a business or residence - dial 4-1-1 to get quick access to phone numbers and addresses. Plus, with AT&T 411 you can find movie times, driving directions and more. And it's just $1.79 per call plus standard airtime charges.*

Thanks for the reminder - I will be vigilant to avoid that $1.79 charge, and stick to 1-800-FREE411...



This is a personal weblog, I do not speak for my employer.