The Good, the Blog & the Ugly - Tim Cook's Weblog

Main | Next page »

http://blogs.sun.com/timc/date/20090427 Monday April 27, 2009

pstime - a mash-up of ps(1) and ptime(1)

I have done some testing in the past where I needed to know the amount of CPU consumed by a process more accurately than I can get from the standard set of operating system utilities.

Recently I hit the same issue - I wanted to collect CPU consumption of mysqld.

To capture process CPU utilization over an interval on Solaris, about the best I can get is the output from a plain "prstat" command, which might look like:

mashie ) prstat -c -p `pgrep mysqld` 5 2
Please wait...
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
  7141 mysql     278M  208M cpu0    39    0   0:38:13  40% mysqld/45
Total: 1 processes, 45 lwps, load averages: 0.63, 0.33, 0.18
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
  7141 mysql     278M  208M cpu1    32    0   0:38:18  41% mysqld/45
Total: 1 processes, 45 lwps, load averages: 0.68, 0.34, 0.18

I am after data from the second sample only (still not sure exactly how prstat gets data for the fist sample, which comes out almost instantaneously), so you can guess I will need some sed/perl that is a litte more complicated than I would prefer.

pstime reads PROCFS (i.e.. the virtualized file-system mounted on /proc) and captures CPU utilization figures for processes. It will report the %USR and %SYS either for a specific list of processes, or every process running on the system (i.e., running at both sample points). The start sample time is recorded in high resolution at the time a process' data is captured, and then again after N seconds, where N is the first parameter supplied to pstime.

The default output of pstime is expressed as either a percentage of whole system CPU, or CPU seconds, with four significant digits. Solaris itself records the original figures in nanosecond resolution, although we do not expect today's hardware to be that accurate.

Here is an example:

mashie ) pstime 10 `pgrep sysbench\|mysqld`
  UID    PID  %USR  %SYS COMMAND
mysql   7141 44.17 3.391 /u/dist/mysql60-debug/bin/mysqld --defaults-file=/et
mysql  19870 2.517 2.490 sysbench --test=oltp --oltp-read-only=on --max-time=
mysql  19869 0.000 0.000 /bin/sh -p ./run-sysbench

Downloads

http://blogs.sun.com/timc/date/20090406 Monday April 06, 2009

New Feature for Sysbench - Generate Transactions at a Steady Rate

Perhaps I am becoming a regular patcher of sysbench...

I have developed a new feature for sysbench - the ability to generate transactions at a steady rate determined by the user.

This mode is enabled using the following two new options:
--tx-rate
Rate at which sysbench should attempt to send transactions to the database, in transactions per second. This is independent of num_threads. The default is 0, which means to send as many as possible (i.e., do not pause between the end of one transaction and the start of another. It is also independent of other options like --oltp-user-delay-min and --oltp-user-delay-max, which add think time between individual statements generated by sysbench.
--tx-jitter
Magnitude of the variation in time to start transactions at, in microseconds. The default is zero, which asks each thread to vary its transaction period by up to 10 percent (i.e. 10^6 / tx-rate * num-threads / 10). A standard pseudo-random number generator is used to decide each transaction start time.

My need for these options is simple - I want to generate a steady load for my MySQL database. It is one thing to measure the maximum achievable throughput as you change your database configuration, hardware, or num-threads. I am also interested in how the system (or just mysqld's) utilization changes, at the same transaction rate, when I change other variables.

An upcoming post will demonstrate a use of sysbench in this mode.

For the moment my new feature can be added to sysbench 0.4.12 (and probably many earlier versions) via this patch. These changes are tested on Solaris, but I did choose only APIs that are documented as also available on Linux. I have also posted my patch on sourceforge as a sysbench feature enhancement request.

http://blogs.sun.com/timc/date/20081013 Monday October 13, 2008

The Seduction of Single-Threaded Performance

The following is a dramatization. It is used to illustrate some concepts regarding performance testing and architecting of computer systems. Artistic license may have been taken with events, people and time-lines. The performance data I have listed is real and current however.

I got contacted recently by the Systems Architect of latestrage.com. He has been a happy Sun customer for many years, but was a little displeased when he took delivery of a beta test system of one of our latest UltraSPARC servers.

"Not very fast", he said.

"Is that right, how is it not fast?", I inquired eagerly.

"Well, it's a lot slower than one of the LowMarginBrand x86 servers we just bought", he trumpeted indignantly.

"How were you measuring their speed?", I asked, getting wary.

"Ahh, simple - we were compressing a big file. We were careful to not let it be limited by I/O bandwidth or memory capacity, though..."

What then ensues is a discussion about what was being used to test "performance", whether it matches latestrage.com's typical production workload and further details about architecture and objectives.

Data compression utilities are a classic example of a seemingly mature area in computing. Lots of utilities, lots of different algorithms, a few options in some utilities, reasonable portability between operating systems, but one significant shortcoming - there is no commonly available utility that is multi-threaded.

Let me pretend I am still in this situation of using compression to evaluate system performance, and I am wanting to compare the new Sun SPARC Enterprise T5440 with a couple of current x86 servers. Here is my own first observation about such a test, using a single-threaded compression utility:

Single-Threaded Throughput

Now if you browse down to older blog entries, you will see I have written my own multi-threaded compression utility. It consists of a thread to read data, as many threads to compress or decompress data as demand requires, and one thread to write data. Let me see whether I can fully exploit the performance of the T5440 with Tamp...

Well, this turned out to be not quite the end of the story. I designed my tests with my input file located on a TMPFS (in-memory) filesystem, and with the output being discarded. This left the system focusing on the computation of compression, without being obscured by I/O. This is the same objective that latestrage.com had.

What I found on the T5440 was that Tamp would not use more than 12-14 threads for compression - it was limited by the speed at which a single thread could read data from TMPFS.

So, I chose to use another dimension by which we can scale up work on a server - add more sources of workload. This is represented by multiple "Units of Work" in my chart below.

After completing my experiments I discovered that, as expected, the T5440 may disappoint if we restrict ourselves to a workload that can not fully utilize the available processing capacity. If we add more work however, we will find it handily surpasses the equivalent 4-socket quad-core x86 systems.

Multi-Threaded Throughput

Observing Single-Thread Performance on a T5440

A little side-story, and another illustration of how inadequate a single-threaded workload is at determining the capability of the T5440. Take a look at the following output from vmstat, and answer this question:

Is this system "maxed out"?

(Note: the "us", "sy" and "id" columns list how much CPU time is spent in User, System and Idle modes, respectively)

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr d0 d1 d2 d3   in   sy   cs us sy id 
 0 0 0 1131540 12203120 1  8  0  0  0  0  0  0  0  0  0 3359 1552 419  0  0 100 
 0 0 0 1131540 12203120 0  0  0  0  0  0  0  0  0  0  0 3364 1558 431  0  0 100 
 0 0 0 1131540 12203120 0  0  0  0  0  0  0  0  0  0  0 3366 1478 420  0  0 99 
 0 0 0 1131540 12203120 0  0  0  0  0  0  0  0  0  0  0 3354 1500 441  0  0 100 
 0 0 0 1131540 12203120 0  0  0  0  0  0  0  0  0  0  0 3366 1549 460  0  0 99 

Well, the answer is yes. It is running a single-threaded process, which is using 100% of one CPU. For the sake of my argument we will say the application is the critical application on the system. It has reached it's highest throughput and is therefore "maxed out". You see, when one CPU represents less than 0.5% of the entire CPU capacity of a system, then a single saturated CPU will be rounded down to 0%. In the case of the T5440, one CPU is 1/256th or 0.39%.

Here is a tip for watching a system that might be doing nothing, but then again might be doing something as fast as it can:

$ mpstat 3 | grep -v ' 100$'

This is what you might see:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    2   0   48   204    4    2    0    0    0    0   127    1   1   0  99
 32    0   0    0     2    0    3    0    0    0    0     0    0   8   0  92
 48    0   0    0     6    0    0    5    0    0    0     0  100   0   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    1   0   49   205    5    3    0    0    0    0   117    0   1   0  99
 32    0   0    0     4    0    5    0    0    1    0     0    0  14   0  86
 48    0   0    0     6    0    0    5    0    0    0     0  100   0   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0   48   204    4    2    0    0    0    0   103    0   1   0  99
 32    0   0    0     3    0    4    0    0    0    0     3    0  14   0  86
 48    0   0    0     6    0    0    5    0    0    0     0  100   0   0   0

mpstat uses "usr", "sys", and "idl" to represent CPU consumption. For more on "wt" you can read my older blog.

For more on utilization, see the CPU/Processor page on solarisinternals.com

To read more about the Sun SPARC Enterprise T5440 which is announced today, go to Allan Packer's blog listing all the T5440 blogs.

Tamp - a Multi-Threaded Compression Utility

Some more details on this:

  • It uses a freely-available Lempel-Ziv-derived algorithm, optimised for compression speed
  • It was compiled using the same compiler and optimization settings for SPARC and x86.
  • It uses a compression block size of 256KB, so files smaller than this will not gain much benefit
  • I was compressing four 1GB database files. They were being reduced in size by a little over 60%.
  • Browse my blog for more details and a download

http://blogs.sun.com/timc/date/20080926 Friday September 26, 2008

Tamp - a Lightweight Multi-Threaded Compression Utility

UPDATE: Tamp has been ported to Linux, and is now at version 2.5

Packages for Solaris (x86 and SPARC), and a source tarball are available below.

Back Then

Many years ago (more than I care to remember), I saw an opportunity to improve the performance of a database backup. This was before the time of Oracle on-line backup, so the best choice at that time was to:

  1. shut down the database
  2. export to disk
  3. start up the database
  4. back up the export to tape

The obvious thing to improve here is the time between steps 1 and 3. We had a multi-CPU system running this database, so it occurred to me that perhaps compressing the export may speed things up.

I say "may" because it is important to remember that if the compression utility has lower throughput than the output of the database export (i.e. raw output; excluding any I/O operations to save that data) we may just end up with a different bottleneck, and not run any faster; perhaps even slower.

As it happens, this era also pre-dated gzip and other newer compression utilities. So, using the venerable old "compress", it actually was slower. It did save some disk space, because Oracle export files are eminently compressible.

So, I went off looking for a better compression utility. I was now more interested in something that was fast. It needed to not be the bottleneck in the whole process.

What I found did the trick - It reduced the export time by 20-30%, and saved some disk space as well. The reason why it saved time was that it was able to compress at least as fast as Oracle's "exp" utility was able to produce data to compress, and it eliminated some of the I/O - the real bottleneck.

More Recently

I came across a similar situation more recently - I was again doing "cold" database restores and wanted to speed them up. It was a little more challenging this time, as the restore was already parallel at the file level, and there were more files than CPUs involved (72). In the end, I could not speed up my 8-odd minute restore of ~180GB, unless I already had the source files in memory (via the filesystem cache). That would only work in some cases, and is unlikely to work in the "real world", where you would not normally want this much spare memory to be available to the filesystem.

Anyway, it took my restore down to about 3 minutes in cases where all my compressed backup files were in memory - this was because it had now eliminated all read I/O from the set of arrays holding my backup. This meant I had eliminated all competing I/O's from the set of arrays where I was re-writing the database files.

Multi-Threaded Lightweight Compression

I could not even remember the name of the utility I used years ago, but I knew already that I would need something better. The computers of 2008 have multiple cores, and often multiple hardware threads per core. All of the current included-in-the-distro compression utilities (well, almost all utilities) for Unix are still single-threaded - a very effective way to limit throughput on a multi-CPU system.

Now, there are a some multi-threaded compression utilities available, if not widely available:

  • PBZIP2 is a parallel implementation of BZIP2. You can find out more here
  • PIGZ is a parallel implementation of GZIP, although it turns out it is not possible to decompress a GZIP stream with more than one thread. PIGZ is available here.

Here is a chart showing some utilities I have tested on a 64-way Sun T5220. The place to be on this chart is toward the bottom right-hand corner.

Here is a table with some of the numbers from that chart:

Utility Reduction (%) Elapsed (s)
tamp 66.18 0.31
pigz --fast 71.18 1.04
pbzip2 --fast 77.17 4.17
gzip --fast 71.10 16.13
gzip 75.73 40.29
compress 61.61 18.21

To answer your question - yes, tamp really is 50-plus-times faster than "gzip --fast".

Tamp

The utility I have developed is called tamp. As the name suggests, it does not aim to provide the best compression (although it is better than compress, and sometimes beats "gzip --fast").

It is however a proper parallel implementation of an already fast compression algorithm.

If you wish to use it, feel free to download it. I will be blogging in the near future on a different performance test I conducted using tamp.

Compression Algorithm

Tamp makes use of the compression algorithm from Quick LZ version 1.40. I have tested a couple of other algorithms, and the code in tamp.c can be easily modified to use a different algorithm. You can get QuickLZ from here (you will need to download source yourself if you want to build tamp).

Resources

http://blogs.sun.com/timc/date/20080904 Thursday September 04, 2008

Building GCC 4.x on Solaris

I needed to build GCC 4.3.1 for my x86 system running a recent development build of Solaris. I thought I would share what I discovered, and then improved on.

I started with Paul Beach's Blog on the same topic, but I knew it had a couple of shortcomings, namely:

  • No mention of a couple of pre-requisites that are mentioned in the GCC document Prerequisites for GCC
  • A mysterious "cannot compute suffix of object files" error in the build phase
  • No resolution of how to generate binaries that have a useful RPATH (see Shared Library Search Paths for a discussion on the importance of RPATH).

I found some help on this via this forum post, but here is my own cheat sheet.

  1. Download & install GNU Multiple Precision Library (GMP) version 4.1 (or later) from sunfreeware.com. This will end up located in /usr/local.
  2. Download, build & install MPFR Library version 2.3.0 (or later) from mpfr.org. This will also end up in /usr/local.
  3. Download & unpack the GCC 4.x base source (the one of the form gcc-4.x.x.tar.gz) from gcc.gnu.org
  4. Download my example config_make script, edit as desired (you probably want to change OBJDIR and PREFIX, and you may want to add other configure options.
  5. Run the config_make script
  6. "gmake install" as root (although I instead create the directory matching PREFIX, make it writable by the account doing the build, then "gmake install" using that account).

You should now have GCC binaries that look for the shared libraries they need in /usr/sfw/lib, /usr/local/lib and PREFIX/lib, without anyone needing to set LD_LIBRARY_PATH. In particular, modern versions of Solaris will have a libgcc_s.so in /usr/sfw/lib.

If you copy your GMP and MPFR shared libraries (which seem to be needed by parts of the compiler) into PREFIX/lib, you will also have a self-contained directory tree that you can deploy to any similar system more simply (e.g. via rsync, tar, cpio, "scp -pr", ...)

http://blogs.sun.com/timc/date/20080421 Monday April 21, 2008

Comparing the UltraSPARC T2 Plus to Other Recent SPARC Processors

Update - now the UltraSPARC T2 Plus has been released, and is available in several new several Sun servers. Allan Packer has published a new collection of blog entries that provide lots of detail.

Here is my updated table of details comparing a number of current SPARC processors. I can not guarantee 100% accuracy on this, but I did quite a bit of reading...

Name UltraSPARC IV+® SPARC64TM VI UltraSPARCTM T1 UltraSPARCTM T2 UltraSPARCTM T2 Plus
Codename Panther Olympus-C Niagara Niagara 2 Victoria Falls
Physical
process 90nm 90nm 90nm 65nm 65nm
die size 335 mm2 421 mm2 379 mm2 342 mm2
pins 1368 1933 1831
transistors 295 M 540 M 279 M 503 M
clock 1.5 – 2.1 GHz 2.15 – 2.4 GHz 1.0 – 1.4 GHz 1.0 – 1.4 GHz 1.2 – 1.4 GHz
Architecture
cores 2 2 8 8 8
threads/core 1 2 4 8 8
threads/chip 2 4 32 64 64
FPU : IU 1 : 1 1 : 1 1 : 8 1 : 1 1 : 1
integration 8 × small crypto 8 × large crypto, PCI-E, 2 × 10Gbe 8 × large crypto, PCI-E, multi-socket coherency
virtualization domains1 hypervisor
L1 i$ 64K/core 128K/core 16K/core
L1 d$ 64K/core 128K/core 8K/core
L2 cache (on-chip) 2MB, shared, 4-way, 64B lines 6MB, shared, 10-way, 256B lines 3MB, shared, 12-way, 64B lines 4MB, shared, 16-way, 64B lines
L3 cache 32MB shared, 4-way, tags on-chip, 64B lines n/a n/a
MMU on-chip
on-chip, 4 × DDR2 on-chip, 4 × FB-DIMM on-chip, 2 × FB-DIMM
Memory Models TSO TSO TSO, limited RMO
Physical Address Space 43 bits 47 bits 40 bits
i-TLB 16 FA + 512 2-way SA 64 FA
d-TLB 16 FA + 512 2-way SA 64 FA 128 FA
combined TLB 32 FA + 2048 2-way SA
Page sizes 8K, 64K, 512K, 4M, 32M, 256M 8K, 64K, 512K, 4M, 32M, 256M 8K, 64K, 4M, 256M
Memory bandwidth2 (GB/sec) 9.6 25.6 60+ 32

Footnotes

  • 1 - domains are implemented above the processor/chip level
  • 2 - theoretical peak - does not take cache coherency or other limits into account

Glossary

  • FA - fully-associative
  • FPU - Floating Point Unit
  • i-TLB - Instruction Translation Lookaside Buffer (d means Data)
  • IU - Integer (execution) Unit
  • L1 - Level 1 (similarly for L2, L3)
  • MMU - Memory Management Unit
  • RMO - Relaxed Memory Order
  • SA - set-associative
  • TSO - Total Store Order

References:

http://blogs.sun.com/timc/date/20080409 Wednesday April 09, 2008

What Drove Processor Design Toward Chip Multithreading (CMT)?

I thought of a way of explaining the benefit of CMT (or more specifically, interleaved multithreading - see this article for details) using an analogy the other day. Bear with me as I wax lyrical on computer history...

Deep back in the origins of the computer, there was only one process (as well as one processor). There was no operating system, so in turn there were no concepts like:

  • scheduling
  • I/O interrupts
  • time-sharing
  • multi-threading

What am I getting at? Well, let me pick out a few of the advances in computing, so I can explain why interleaved multithreading is simply the next logical step.

The first computer operating systems (such as GM-NAA I/O) simply replaced (automated) some of the tasks that were undertaken manually by a computer operator - load a program, load some utility routines that could be used by the program (e.g. I/O routines), record some accounting data at the completion of the job. They did nothing during the execution of the job, but they had nothing to do - no other work could be done while the processor was effectively idle, such as when waiting for an I/O to complete.

Then muti-processing operating systems were developed. Suddenly we had the opportunity to use the otherwise wasted CPU resource while one program was stalled on an I/O. In this case the O.S. would switch in another program. Generically this is known as scheduling, and operating systems developed (and still develop) more sophisticated ways of sharing out the CPU resources in order to achieve the greatest/fairest/best utilization.

At this point we had enshrined in the OS the idea that CPU resource was precious, not plentiful, and there should be features designed into the system to minimize its waste. This would reduce or delay the need for that upgrade to a faster computer as we continued to add new applications and features to existing applications. This is analogous to conserving water to offset the need for new dams & reservoirs.

With CMT, we have now taken this concept into silicon. If we think of a load or store to or from main (uncached) memory as a type of I/O, then thread switching in interleaved multithreading is just like the idea of a voluntary context switch. We are not giving up the CPU for the duration of the "I/O", but we are giving up the execution unit, knowing that if there is another thread that can use it, it will.

In a way, we are delaying the need to increase the clock rate or pipe-lining abilities of the cores by taking this step.

Now the underlying details of the implementation can be more complex than this (and they are getting more complex as we release newer CPU architectures like the UltraSPARC T2 Plus - see the T5140 Systems Architecture Whitepaper for details), but this analogy to I/O's and context switches works well for me to understand why we have chosen this direction.

To continue to throw engineering resources at faster, more complicated CPU cores seems to be akin to the idea of the mainframe (the closest descendant to early computers) - just make it do more of the same type of workload.

See here for the full collection of UltraSPARC T2 Plus blogs

http://blogs.sun.com/timc/date/20080213 Wednesday February 13, 2008

Utilization - Can I Have More Accuracy Please?

Just thought I would share another Ruby script - this one takes the output of mpstat, and makes it more like the output of mpstat -a, only the values are floating point. I wrote it to process mpstat -a that I got from a customer. It can also cope with the date (in Unix ctime format) being prepended to every line. Here is some sample output:

CPUs minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
 4 7.0 0.0 26.0 114.25 68.0 212.75 16.75 64.75 11.75 0.0 141.25 1.0 1.0 0.0 98.5
 4 0.75 0.0 929.75 2911.5 1954.75 10438.75 929.0 4282.0 715.0 0.0 6107.25 39.25 35.75 0.0 25.25
 4 0.0 0.0 892.25 2830.25 1910.5 10251.5 901.5 4210.0 694.5 0.0 5986.0 38.5 35.0 0.0 26.75
 4 0.0 0.0 941.5 2898.25 1926.75 10378.0 911.75 4258.0 698.0 0.0 6070.5 39.0 35.5 0.0 25.25
 4 0.0 0.0 893.75 2833.75 1917.75 10215.0 873.75 4196.25 715.25 0.0 5925.25 38.0 34.75 0.0 27.25

The script is here.

Interestingly, you can use this to get greater accuracy on things like USR and SYS than you would get if you just used vmstat, sar, iostat or mpstat -a. This depends on the number of CPUs you have in your system though.

Now, if you do not have a lot of CPUs, but still want greater accuracy, I have another trick. This works especially well if you are conducting an experiment and can run a command at the beginning and end of the experiment. This trick is based around the output of vmstat -s:

# vmstat -s
[...]
 54056444 user   cpu
 42914527 system cpu
1220364345 idle   cpu
        0 wait   cpu

Those numbers are "ticks" since the system booted. A tick is usually 0.01 seconds.

NEW: I have now uploaded a script that uses these statistics to track system-wide utilization.

http://blogs.sun.com/timc/date/20071102 Friday November 02, 2007

Comparing the UltraSPARC T2 to Other Recent SPARC Processors

This is now a placeholder. You probably want to read my updated blog on SPARC processor details to get the latest.

http://blogs.sun.com/timc/date/20070831 Friday August 31, 2007

How Event-Driven Utilization Measurement is Better than Sample-Based

...and how to measure both at the same time

With the delivery of Solaris 10, Sun made two significant changes to how system utilization is measured. One change was to how CPU utilisation is measured

Solaris used to (and virtually all other POSIX-like OS'es still) measure CPU utilisation by sampling it. This happened once every "clock tick". A clock tick is a kernel administrative routine which is executed once (on one CPU) for every clock interrupt that is received, which happens once every 10 milliseconds. At this time, the state of each CPU was inspected, and a "tick" would be added to each of the "usr", "sys", "wt" or "idle" buckets for that CPU.

The problem with this method is two-fold:

  • It is statistical, which is to say it is an approximation of something, derived via sampling
  • The sampling happens just before the point when Solaris looks for threads that are waiting to be woken up to do work.

Solaris 10 now uses microstate accounting. Microstates are a set of finer-grained states of execution, including USR, SYS, TRP (servicing a trap), LCK (waiting on an intra-process lock), SLP (sleeping), LAT (on a CPU dispatch queue), although these all fall under one of the traditional USR, SYS and IDLE. These familiar three are still used to report system-wide CPU utilisation (e.g. in vmstat, mpstat, iostat), however you can see the full set of states each process is in via "prstat -m".

The key difference in system-wide CPU utilization comes in how microstate accounting is captured - it is captured at each and every transition from one microstate to another, and it is captured in nanosecond resolution (although the granularity of this is platform-dependent). To put it another way it, it is event-driven, rather than statistical sampling.

This eliminated both of the issues listed above, but it is the second issue that can cause some significant variations in observed CPU utilization.

If we have a workload that does a unit of work that takes less than one clock tick, then yields the CPU to be woken up again later, it is likely to avoid being on a CPU when the sampling is done. This is called "hiding from the clock", and is not difficult to achieve (see "hide from the clock" below).

Other types of workloads that do not explicitly behave like this, but do involve processes that are regularly on and off the CPU can look like they have different CPU utilization on Solaris releases prior to 10, because the timing of their work and the timing of the sampling end up causing an effect which is sort-of like watching the spokes of a wheel or propeller captured on video. Another factor involved in this is how busy the CPUs are - the closer a CPU is to either idle or fully utilized, the more accurate sampling is likely to be.

What This Looks Like in the Wild

I was recently involved in an investigation where a customer had changed only their operating system release (to Solaris 10), and they saw an almost 100% increase (relative) in reported CPU utilization. We suspected that the change to event-based accounting may have been a factor in this.

During our investigations, I developed a DTrace utility which can capture CPU utilization that is like that reported by Solaris 10, then also measure it the same way as Solaris 9 and 8, all at the same time.

The DTrace utility, called util-old-new, is available here. It works by enabling probes from the "sched" provider to track when threads are put on and taken off CPUs. It is event-driven, and sums up nanoseconds the same way Solaris 10 does, but it also tracks the change in a system variable, "lbolt64" while threads are on CPU, to simulate how many "clock ticks" the thread would have accumulated. This should be a close match, because lbolt64 is updated by the clock tick routine, at pretty much the same time as when the old accounting happened.

Using this utility, we were able to prove that the change in observed utilisation was pretty much in line with the way Solaris has changed how it measures utilisation. The up-side for the customer was that their understanding of how much utilisation they had left on their system was now more accurate. the down side was that they now had to re-assess whether, and by how much, this changed the amount of capacity they had left.

Here is some sample output from the utility. I start the script when I already have one CPU-bound thread on a 2-CPU system, then I start up one instance of Alexander Kolbasov's "hide-from-clock", which event-based accounting sees, but sample-based accounting does not:

mashie[bash]# util-old-new 5
NCPUs = 2
Date-time              s8-tk/1000   s9-tk/1000      ns/1000
2007 Aug 16 12:12:14          508          523          540
2007 Aug 16 12:12:19          520          523          553
2007 Aug 16 12:12:24          553          567          754
2007 Aug 16 12:12:29          549          551          798
2007 Aug 16 12:12:34          539          549          810
^C

The Other Change in Utilization Measurement

By the way, the other change was to "hard-wire" the Wait I/O ("%wio" or "wt" or "wait time") statistic to zero. The reasoning behind this is that CPU's do not wait for I/O (or any other asynchronous event) to complete - threads do. Trying to characterize how much a CPU is not doing anything in more than one statistic is like having two fuel gauges on your car - one for how much fuel remains for highway driving, and another for city driving.

References & Resources

P.S. This entry is intended to cover what I have spoken about in my previous two entries. I will soon delete the previous entries.

http://blogs.sun.com/timc/date/20070712 Thursday July 12, 2007

nicstat - Update for Solaris & Linux

I have made a minor change to nicstat on Solaris and Linux. The way it schedules its work has been improved.

Use the links from my earlier entry on nicstat for the latest source and binaries.

I will write up a more detailed explanation along with a treatise on the merits of different scheduling methodologies in a post in the near future.

http://blogs.sun.com/timc/date/20070601 Friday June 01, 2007

Using DTrace to Capture Statement Execution Times in MySQL

Introduction

I have recently been engaged with a customer that is evaluating MySQL, in particular with its Memory (formerly known as Heap) engine, which stores all database data and metadata in RAM.

I was asked to assist with diagnosing if/whether/where statements were taking longer than 5 milliseconds to complete. Now, this was being observed from the viewpoint of the "client" - the client was a synthetic benchmark run as a Java program. It could be run either on a separate system or on the same system as the MySQL database, and a small number of transactions would be measured as taking longer than 5ms.

Now, there is more than one way to skin this cat, and it turns out that Jenny Chen has had a go at putting static probes into MySQL. For my (and the customer's) purposes however, we wanted to skip the step of re-compiling with our own probes, and just use what we can observe via use of the PID provider.

How Do We Do This?

Well, it is not trivial. However as it turns out, I have seen a bit of the MySQL code. I also had someone pretty senior from MySQL next to me, who helped confirm what was happening, as I used some "fishing" bits of DTrace to watch a mysqld thread call functions as we ran the "mysql" client and entered simple statements.

This allowed me to narrow down on a couple of vio_*() routines, and to build some pseudo-code to describe the call flow around reception of requests from a client, processing of the statement, then return of a result to the client.

It is not as simple as looking for the entry and return of a single function, because I wanted to capture the full time from when the mysqld thread returns from a read(2) indicating a request has arrived from a client through to the point where the same thread has completed a write(2) to send the response back. This is the broadest definition of "response time measure at the server".

The Result

The result of all of our measurements showed that there were no statements from the synthetic benchmark that took longer than 5 ms to complete. Here is an example of the output of my DTrace capture (a histogram of microseconds):

bash-3.00# ./request.d -p `pgrep mysqld`
^C

           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@                    10691
             500 |@@@@@@@@@@@@@@@@@                        8677
            1000 |@                                        680
            1500 |                                         31
            2000 |                                         0

The Script

Feel free to use the DTrace script for your own purposes. It should work on MySQL 5.0 and 5.1.

The Upshot - Observability in the JVM

There is a nagging question remaining - why was the Java client seeing some requests run longer than 5 ms?.

There are three possible answers I can think of:

  1. There is latency involved in transmitting the requests and responses between the client and server (e.g. network packet processing).
  2. The thread inside the JVM was being pre-empted (thrown off the CPU) between taking its time measurements.
  3. The measurements (taken using System.nanoTime()) are not reliable.

http://blogs.sun.com/timc/date/20070316 Friday March 16, 2007

Simplifying "lockstat -I" Output (or Ruby Is Better Than Perl)

There, I said it. I have been a Perl scripter for nearly 20 years (since early version 3). I think Ruby has pretty much everything Perl has, and more, like:

  • It is really object-oriented (rather than Perl's bolt-on objects). I am much more likely to end up using this functionality.
  • It has operator overloading
  • It has a "case" statement
  • There is no Obfuscated Ruby Contest (though IMHO there could be one)

Anyway, enough religous argument. I want to thank my colleague Neel for putting me onto it, and now I will provide a simple example.

Post-Processing "lockstat -I"

The Ruby script below post-processes the output of "lockstat -I". Why you ask? - well, because the output of "lockstat -I" can be tens of thousands of lines, and it does not coalesce by CPU.

The script below will coalesce by CPU, and PC within a function (useful if you forgot the "-k" flag to lockstat). A very simple change will also make it coalesce by PIL, but I will leave that as an exercise for the reader.

Ruby Post-Processor Script

One thing about this is that I would write this script almost exactly the same way if I used Perl. That is another plus of Ruby - it is easy to pick-up for a Perl programmer.

#!/bin/env ruby -w
# lockstatI.rb -	Simplify "lockstat -I" output
#
# http://blogs.sun.com/timc, 16 Mar 2007
#
# SYNOPSIS
#	lockstat -I [-i 971 -n <nnnnnn> ] sleep <nnn> | lockstatI.rb

PROG = "lockstatI"

#-- Once we have printed values that cover this proportion, then quit
CUTOFF = 95.0

DASHES = ('-' * 79) + "\n"

#========================================================================
#	MAIN
#========================================================================

print "#{PROG} - will display top #{CUTOFF}% of events\n"

events = 0
period = 0
state = 0
counts = {}

#-- The classic state machine
ARGF.each_line do |line|
    next if line == "\n"
    case state
    when 0
	if line =~ /^Profiling interrupt: (\d+) events in (\d.*\d) seconds/
	then
	    puts line
	    state = 1
	    events = $1
	    period = $2
	    next
	end
    when 1
	if line == DASHES then
	    state = 2
	end
    when 2
	if line == DASHES then
	    break
	end
	f = line.split
	count = f[0]
	cpu, pil = f[5].split('+')

	#-- Coalesce PCs within functions; i.e. do not differentiate by
	#-- offset within a function.  Useful if "-k" was not specified
	#-- on the lockstat command.
	caller = f[6].split('+')[0]

	if pil then
	    caller = caller + "[" + pil + "]"
	end
	counts[caller] = counts[caller].to_i + count.to_i
    end
end

#-- Give me an array of keys sorted by descending value
caller_keys = counts.keys.sort { |a, b| counts[b] <=> counts[a] }

#-- Dump it out
printf "%12s  %5s  %5s  %s\n", "Count", "%", "cuml%", "Caller[PIL]"
cuml = 0.0
caller_keys.each do |key|
    percent = counts[key].to_f * 100.0 / events.to_f
    cuml += percent
    printf "%12d  %5.2f  %5.2f  %s\n", counts[key], percent, cuml, key
    if cuml >= CUTOFF then
	break
    end
end

Example Output

Free beer to the first person to tell me what this is showing. It was not easy to comprehend the 90,000 line "lockstat -I" output prior to post-processing it though. You get this problem when you have 72 CPUs...

lockstatI - will display top 95.0% of events
Profiling interrupt: 4217985 events in 60.639 seconds (69559 events/sec)
       Count      %  cuml%  Caller[PIL]
     1766747  41.89  41.89  disp_getwork[11]
     1015005  24.06  65.95  (usermode)
      502560  11.91  77.86  lock_set_spl_spin
       83066   1.97  79.83  lock_spin_try[11]
       62670   1.49  81.32  mutex_enter
       53883   1.28  82.60  mutex_vector_enter
       40847   0.97  83.57  idle
       40024   0.95  84.51  lock_set_spl[11]
       27004   0.64  85.15  splx
       17432   0.41  85.57  send_mondo_set[13]
       15876   0.38  85.94  atomic_add_int
       14841   0.35  86.30  disp_getwork

http://blogs.sun.com/timc/date/20070222 Thursday February 22, 2007

nicstat - now for Linux, too

Just a quick one to flag that I have released a version of nicstat for Linux (see below).

I do not have a myriad of Linux systems to test it on, so if anyone finds any issues, please let me know.

http://blogs.sun.com/timc/date/20070214 Wednesday February 14, 2007

nicstat - the Solaris Network Monitoring Tool You Did Not Know You Needed

Update

On the prompting of a number of colleagues, I have written a Linux version of nicstat. It is included in the links below.

History

A few years ago, a bloke I know by the name of Brendan Gregg wrote a Solaris kstat-based utility called nicstat. Last year, I decided I needed to use this utility to capture network statistics in testing I do. Then I got a request from a colleague in PAE to do something about nicstat not being aware of "e1000g" interfaces.

I have spent a bit of time adding to nicstat since then, so I thought I would make the improved version available.

Why Should I Still Be Interested?

nicstat is to network interfaces as "iostat" is to disks, or "prstat" is to processes. It is designed as a much better version of "netstat -i". Its differences include:

  • Reports bytes in & out as well as packets.
  • Normalizes these values to per-second rates.
  • Reports on all interfaces (while iterating)
  • Reports Utilization (rough calculation as of now)
  • Reports Saturation (also rough)
  • Prefixes statistics with the current time

How about an example?

eac-t2000-3[bash]# nicstat 5
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
17:05:17      lo0    0.00    0.00    0.00    0.00    0.00    0.00  0.00   0.00
17:05:17  e1000g0    0.61    4.07    4.95    6.63   126.2   628.0  0.04   0.00
17:05:17  e1000g1   225.7   176.2   905.0   922.5   255.4   195.6  0.33   0.00
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
17:05:22      lo0    0.00    0.00    0.00    0.00    0.00    0.00  0.00   0.00
17:05:22  e1000g0    0.06    0.15    1.00    0.80   64.00   186.0  0.00   0.00
17:05:22  e1000g1    0.00    0.00    0.00    0.00    0.00    0.00  0.00   0.00
eac-t2000-3[bash]# nicstat -i e1000g0 5 4
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
17:08:49  e1000g0    0.61    4.07    4.95    6.63   126.2   628.0  0.04   0.00
17:08:54  e1000g0    0.06    0.04    1.00    0.20   64.00   186.0  0.00   0.00
17:08:59  e1000g0   239.2    2.33   174.4   33.60  1404.4   71.11  1.98   0.00
17:09:04  e1000g0    0.01    0.04    0.20    0.20   64.00   186.0  0.00   0.00

References & Resources

The latest source and binaries are available here:

Note - the binaries will probably work on earlier releases of Solaris - as Solaris is just like that...


Valid HTML! Valid CSS!

This is a personal weblog, I do not speak for my employer.