Performance and Scalability Steve Sistare's Blog

Monday Nov 17, 2008

Lockstat(1M) is the Solaris tool of choice for identifying contended kernel locks that need to be optimized to improve system efficiency and scalability. However, until recently, you could not easily use lockstat to quantify the impact of hot locks on system utilization, or to state with certainty that a hot lock is limiting your performance. This could lead to wasted effort spent fixing a lock, with no performance benefit. For example, here is the output of lockstat showing a hot lock:

    # lockstat sleep 10
    Adaptive mutex spin: 9227825 events in 10.302 seconds (895755 events/sec)
    Count indv cuml rcnt     spin Lock                   Caller
    -------------------------------------------------------------------------
    569235   6%   6% 0.00       17 0x6009a065000          timeout_common+0x4
    ...
    

This says that some thread(s) could not immediately acquire the mutex 569235 times, so they had to busy-wait, and took an average of 17 retries (spins) to acquire the mutex. In absolute terms, is this bad? You cannot tell. Higher spin is worse than lower spin, but you cannot easily map this to time, and further, not all spins are created equal because of the exponential backoff algorithm in the mutex implementation.

The good news is that we have enhanced lockstat to show the spin time in Solaris 10 10/08 and OpenSolaris 2008.11. Now the output of lockstat looks like this:

    # lockstat sleep 10
    Adaptive mutex spin: 9181733 events in 10.135 seconds (905943 events/sec)
    Count indv cuml rcnt     nsec Lock                   Caller
    -------------------------------------------------------------------------
    557331   6%   6% 0.00    88454 0x60033047000          timeout_common+0x4
    ...
    

The new nsec column shows the average time spent busy-waiting and retrying to acquire a mutex. From this data, you can compute the CPU utilization spent busy-waiting using the formula:

    CPU_util = (Count * nsec) / (elapsed * ncpu * 1e9)

The elapsed time is shown in the header above, and you need to know the number of CPUs on your platform. In this example, ncpu=256, so we have:

    CPU_util = (557331 * 88454) / (10.135 * 256 * 1e9) = .019

Thus, 1.9% of all CPU time on the system was spent busy-waiting to acquire this lock, and if we eliminate this lock, we can recover 1.9% of the CPU cycles and perhaps increase our throughput by that much.

I have written a handy script called lockutil to do this calculation for you. lockutil reads a file produced by lockstat, and computes the CPU utilization spent busy-waiting for every lock in the file, including adaptive mutexes, spin locks, and thread locks. It can parse output produced by all of lockstat's myriad command line options, except for the -I profiling mode, which is not relevant. You must supply the CPU count. Example:

    # ... Start running a test ...
    # lockstat -kWP sleep 30 > lockstat_kWP.out
    # mpstat | wc -l
    513                       # includes one header line
    # cat lockstat_kWP.out
    Adaptive mutex spin: 1713116 events in 30.214 seconds (56700 events/sec)
    
    Count indv cuml rcnt     nsec Hottest Lock           Caller
    -------------------------------------------------------------------------------
    1170899  68%  68% 0.00 11444141 pcf                  page_create_wait
    202679  12%  80% 0.00   332996 pcf+0x380             page_free
    86959   5%  85% 0.00  8840242 pcf                    page_reclaim
    64565   4%  89% 0.00    25301 0x604ec13c650          ufs_lockfs_begin_getpage
    55914   3%  92% 0.00    24010 0x604ec13c650          ufs_lockfs_end
    29505   2%  94% 0.00     1284 0x70467650040          page_trylock
    24687   1%  95% 0.00     1845 0x70467650040          page_unlock
    ...
    #
    # lockutil -n 512 lockstat_kWP.out
     CPU-util  Lock                 Caller
        0.866  pcf                  page_create_wait
        0.004  pcf+0x380            page_free
        0.050  pcf                  page_reclaim
        0.000  0x604ec13c650        ufs_lockfs_begin_getpage
        0.000  0x604ec13c650        ufs_lockfs_end
        0.000  0x70467650040        page_trylock
        0.000  0x70467650040        page_unlock
        0.920  TOTAL                TOTAL
    

Ouch, busy-waiting for the pcf lock consumes 92% of the system! There's a lock that needs fixing if ever I saw one. We fixed it.

This begs the question: what can you do if you find a hot lock? If you are a kernel developer and the lock is in your code, then change your locking granularity or algorithm. If you are an application developer, then study the results carefully, because hot kernel locks sometimes indicate a non-scalable application architecture, such as too many processes using an IPC primitive such as a message queue or a fifo. Use the lockstat -s option to see the callstack for each hot lock and trace back to your system calls, or use the dtrace lockstat provider to aggregate lock spin time by system call made by your application. The lockstat provider has also been modified to report spin time rather than spin count in its probe arguments.

If you are an end-user, then search for the lock in the support databases, to see if a fix or tuning advice is available. A few resources are
http://sunsolve.sun.com
http://bugs.opensolaris.org
and public search engines. For example a google search for "lockstat timeout_common" finds this:
http://bugs.opensolaris.org/view_bug.do?bug_id=6311743
6311743 callout table lock contention in timeout and untimeout
which my group recently fixed, and will appear in coming Solaris releases. Advice to folks who file CRs: to make these searches more successful, include lockstat snippets in the publicly readable sections of a CR, such as "Description" and "Public Comments".

Monday Nov 03, 2008

What a difference firmware can make! We take it for granted, and as administrators we probably do not update our system's firmware as often as we should, but I was recently involved in a performance investigation where it made a huge difference.

On a 128 CPU T5240 server, the throughput of an application peaked around 90 processes, but declined as more processes were added, until at 128 processes the throughput was just 25% of its peak value. Classic and severe anti-scaling. The puzzling part was that the usual suspects were innocent. mpstat showed that 99% of the time was usr mode, so no kernel issues; plockstat did not show any contended userland mutexes; cpustat did not show increases in cache misses, TLB misses, or any other counter per process; and a collector/analyzer profile did not show hot atomic functions or CAS operations. It did show a marked increase in the cost of the SPARC save instruction at function entry as process count was raised. Curious.

We eventually upgraded the firmware, and the application scaled nicely up to 128 processes. If you want some advice and do not care about gory details, skip the next two paragraphs :)

It turns out that the hypervisor had a global lock that was limiting scalability, and the lock was eliminated by a firmware upgrade. Normally very little time is spent executing code in hyper-privileged mode on the Sun CMT servers. However, the hypervisor is responsible for maintaining "permanent" VA->PA mappings in the TLB. These mappings are used for the Solaris kernel nucleus, one 4MB mapping for text, and one 4MB mapping for data. Solaris cannot handle an MMU miss for these mappings, so when the processor traps to hypervisor for the miss, the hypervisor finds the mapping, stuffs it into the hardware TLB, and returns from the trap, so Solaris never sees the miss.

The above hypervisor action was protected in the old firmware by a single global lock. The application had a high TLB miss rate exceeding 200K/CPU/sec, so the permanent mappings were being continuously evicted - not an issue if we don't use the kernel much. But, the app also had a deep function stack, so it generated lots of spill/fill traps. These trap to kernel text, which is backed by a permanent mapping, which has been evicted, which causes a hypervisor trap, which hits the global lock. A perfect storm limiting scalability! Spill/fill is a lightweight trap that does not change accounting mode to sys, hence I did not see high sys time; instead, I saw high save time. In hindsight, I could have directly observed the instruction and stall cycles spent in hypervisor mode using:

# cpustat -s -c pic0=Instr_cnt,pic1=Idle_strands,hpriv,nouser 10 10

Should you care? This issue is specific to firmware in the CMT server line, and it depends on your model:

  • T5140,T5240 (2-socket 128-CPU): Definitely verify and upgrade your firmware if needed; get the latest version of patch 136936.
  • T5120,T5220 (1-socket, 64-CPU): I have not observed the scalability bottleneck on this smaller system, but you may get a small performance boost by upgrading; get the latest version of patch 136932.
  • T1000, T2000 (1-socket, 32-CPU) - probably not an issue, the system is too small.
  • T5440 (4-socket, 256-CPU): not an issue, as the first units shipped already contained a later version of the firmware containing the fix.

The CR is: 6669222 lock in mmu_miss can be eliminated to reduce contention
It was fixed in Sun System Firmware version 7.1.3.d.
To show the version of firmware installed on your system, log in to the service processor and verify you have version 7.1.3.d or later:

sc> showhost
Sun System Firmware 7.1.3.d 2008/07/11 08:55
...

To upgrade your firmware:

  1. Go to http://sunsolve.sun.com
  2. lick on Patches and Updates link
  3. Type the patch number in the PatchFinder Form (eg 136936 for the T5140 or T5240)
  4. Push Find Patch button
  5. Click on the "Download Patch" link near the top.
  6. Unzip the download and refer to the Install.info file for instructions

If you have never upgraded firmware before, read the documentation and be careful!