Morgan Herrington's Blog: Adventures in Porting and Tuning

Wednesday Sep 03, 2008

After using the CPU performance counters on various processors for the last few years, I'm not surprised to occasionally find one which doesn't work. So, when looking at L1 cache refill statistics on a current Opteron, I assumed the worst when the event count was consistently zero. Of course, it wasn't that simple.

One clue to the complexity of using the many counters on the various CPUs is that the Solaris counter-based performance measurement tools (for example collect/analyzer, cputrack, and cpustat) all include the following footnote in their help messages:

    See Chapter 10 of the "BIOS and Kernel Developer's Guide for the
    Athlon 64 and AMD Opteron Processors", AMD publication #26094.

This document (and its revision, AMD publication #25759) explains that some of the performance counters take a unit mask which further qualifies the event. In the particular case of data cache refills, the unit mask specifies exactly which kinds of refill are counted, as described in the following table:

    0x01 Refill from system memory
    0x02 Refill from Shared-state line from L2 cache
    0x04 Refill from Exclusive-state line from L2
    0x08 Refill from Owned-state line from L2
    0x10 Refill from Modified-state line from L2

The problem for the naive user is that the default mask is 0x0, which means that no events are selected (and thus the counts will always be zero). The performance tools would be more user-friendly if they warned when a monitored counter could not possibly return any useful data because its unit mask is clear. I presume they don't attempt this because of the complexity of tracking the quirks of the many different supported CPUs.
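
As a rough illustration of the kind of warning described above, here is a minimal shell sketch. Everything in it is an assumption for illustration: `check_umask` and `NEEDS_UMASK` are hypothetical names, not part of cputrack or collect, and the list of events needing a mask contains only the one counter discussed in this post.

```shell
# Hypothetical helper, not part of cputrack or collect: warn when a counter
# spec for an event known to need a unit mask has none (or a zero one).
# NEEDS_UMASK is an illustrative list; extend it per CPU as needed.
NEEDS_UMASK="DC_refill_from_L2"

check_umask() {
    spec=$1
    event=${spec%%[,~]*}          # event name: text before the first ',' or '~'
    case " $NEEDS_UMASK " in
        *" $event "*) ;;          # event is on the list, keep checking
        *) return 0 ;;            # no unit mask required for this event
    esac
    case $spec in
        *umask=*) mask=${spec#*umask=}; mask=${mask%%[,~]*} ;;
        *) mask=0 ;;              # attribute absent: treat as the 0x0 default
    esac
    if [ $(( $mask )) -eq 0 ]; then
        echo "warning: $event has an empty unit mask; counts will be zero" >&2
        return 1
    fi
    return 0
}
```

It could be called before launching a measurement, for example `check_umask "DC_refill_from_L2,umask=0x1e" && cputrack -c DC_refill_from_L2,umask=0x1e application`. It accepts both the `,umask=` form and the `~umask=` form of the attribute.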

To see the problem, consider the following command and output:

    $ cputrack -c DC_refill_from_L2  application
       time lwp      event      pic0
      1.015   1       tick         0
      2.015   1       tick         0
      2.178   1       exit         0
    
However, once the unit mask is specified (in this example 0x1e, the union of all four "refill from L2" flags: 0x02 | 0x04 | 0x08 | 0x10), the output becomes:

    $ cputrack -c DC_refill_from_L2,umask=0x1e application
       time lwp      event      pic0
      1.028   1       tick     47981
      2.018   1       tick     47225
      2.144   1       exit    101299
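
The mask value in that command can be derived rather than memorized. A small sketch, using shell arithmetic to OR together the four L2 flag values from the table (the variable name `l2_mask` is just an illustrative choice):

```shell
# Combine the Shared, Exclusive, Owned and Modified "refill from L2" bits.
# The system-memory bit (0x01) is deliberately left out.
l2_mask=$(( 0x02 | 0x04 | 0x08 | 0x10 ))
printf 'umask=0x%x\n' "$l2_mask"    # prints umask=0x1e
```

Adding 0x01 to the expression would also count refills from system memory, giving 0x1f.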
    

The problem is the same for collect/analyzer, but the syntax for specifying the unit mask is slightly different. As the documentation explains, it uses the hardware counter syntax:

    counter_name[~attribute=value]

which translates the example to the following:

    collect -h DC_refill_from_L2~umask=0x1e,hi application
    
This issue isn't a problem for most uses of collect, which rely on the well-known profiling counters such as cycles, insts, and icm; however, you need to pay attention when using counters from the CPU-specific list.