Morgan Herrington's Blog Adventures in Porting and Tuning

Thursday Sep 25, 2008

Most engineers rarely have to write assembly code, yet my experience has been that large applications (particularly if they've been around for a decade or so) contain at least a few functions of hand-written assembly code. Hopefully these functions will be inspected during a porting project to see if they can be rewritten in C or replaced with calls to the appropriate graphics, atomic, or numeric library.

Since working with assembly is not very common, I won't dwell on the topic very often. However, in the last six months I've seen the same mistake in three unrelated projects (porting from 32-bit to 64-bit SPARC), so this particular topic deserves to be mentioned.

Some 64-bit assembly porting is fairly mechanical: converting to the 64-bit calling conventions, using 64-bit registers, adjusting the bias of stack offsets, and accounting for 64-bit sizes and alignments. However, the following snippet of code illustrates a slight quirk of the SPARC architecture that is the root cause of a particular porting problem. In the following code, %o2 and %o3 point to memory buffers and the pointer in %o4 marks the end of the copy:

top_of_loop:
    add     %o2,1,%o2       ! inc to the next destination location
    ldsb    [%o3],%o1       ! load byte from source buffer
    add     %o3,1,%o3       ! inc to the next source location
    cmp     %o2,%o4         ! compare dest pointer with end pointer
    bcs     top_of_loop     ! if dest < end (unsigned), branch to top
    stb     %o1,[%o2-1]     ! store to dest buffer in delay slot of branch

The add instruction updates all 64 bits of the output register, and the memory accesses don't need to change. However, this sequence doesn't quite work for 64-bit. Unfortunately, it works so much of the time that regression tests could easily miss the failure case.
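For comparison, here is a minimal C sketch of what the loop above appears to do. The function name `copy_until` and its parameter names are my own, not taken from any of the original projects. Written this way, the compiler picks a pointer-width comparison on its own, which is one reason to prefer a C rewrite when it is practical:

```c
#include <stddef.h>

/*
 * Hypothetical C equivalent of the assembly loop.
 * dst plays the role of %o2, src of %o3, and end of %o4.
 * Like the assembly, this is a do/while loop, so it always copies
 * at least one byte before testing the end condition.
 */
static void copy_until(unsigned char *dst,
                       const unsigned char *src,
                       const unsigned char *end)
{
    do {
        *dst++ = *src++;      /* load, store, and bump both pointers */
    } while (dst < end);      /* full-width unsigned pointer compare */
}
```

Calling `copy_until(dst, src, dst + n)` copies `n` bytes for any `n` of at least 1.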

The problem is that, unlike the x64 architecture, which has two sizes of compare instructions ("cmpl" and "cmpq"), SPARC has a single compare instruction which sets two different sets of condition codes: one for the 32-bit result (%icc) and one for the 64-bit result (%xcc). The conditional branch in the code sequence above inspects the 32-bit condition codes, so it jumps based on a 32-bit comparison.
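The difference between the two sets of condition codes can be modeled in a few lines of C. This is only a sketch of the carry bit that bcs tests after a compare, not SPARC code, and the function names `icc_carry` and `xcc_carry` are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the 32-bit carry (borrow) bit after "cmp a,b":
 * only the low 32 bits of each operand take part in the comparison.
 */
static bool icc_carry(uint64_t a, uint64_t b)
{
    return (uint32_t)a < (uint32_t)b;
}

/*
 * Model of the 64-bit carry (borrow) bit after the same "cmp a,b":
 * the full 64-bit values are compared as unsigned integers.
 */
static bool xcc_carry(uint64_t a, uint64_t b)
{
    return a < b;
}
```

The two functions agree for most pointer pairs. They disagree only when the low 32 bits of the two addresses are ordered differently from the full 64-bit addresses, which is exactly the case in which the unported branch goes the wrong way.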

To correctly base the branch on the 64-bit condition codes, it needs to be rewritten to use the extended condition codes, %xcc:

    bcs     %xcc,top_of_loop

I can't explain why this particular architecture feature is so easily missed, but I can point out that the original sequence works correctly unless the memory buffer pointed to by %o2 crosses a 4GB boundary, where the low 32 bits of the address wrap around to zero (bcs is an unsigned test, so crossing a 2GB boundary is harmless). At least for one application, the failure only occurred at a single customer site and only once every few weeks (and was, therefore, tough to diagnose).


Wednesday Sep 03, 2008

After using the CPU performance counters on various processors for the last few years, I'm not surprised to occasionally find one which doesn't work. So, when looking at L1 cache refill statistics on a current Opteron, I assumed the worst when the event count was consistently zero. Of course, it wasn't that simple.

One clue to the complexity of using the many counters on the various CPUs is that the Solaris counter-based performance measurement tools (for example collect/analyzer, cputrack, and cpustat) all include the following footnote in their respective help messages:

    See Chapter 10 of the "BIOS and Kernel Developer's Guide for the
    Athlon 64 and AMD Opteron Processors", AMD publication #26094.

This document (and its revision, AMD publication #25759) explains that some of the performance counters use a unit mask which further specifies or qualifies the event. In the particular case of data cache refills, the unit mask specifies exactly which kind of refills are being counted, as described in the following table:

    0x01 Refill from system memory
    0x02 Refill from Shared-state line from L2 cache
    0x04 Refill from Exclusive-state line from L2
    0x08 Refill from Owned-state line from L2
    0x10 Refill from Modified-state line from L2

The problem for the naive user is that the default mask is 0x0, which means that no events are selected (and thus the counts will always be zero). The performance tools would be more user-friendly if they warned that a counter was being monitored which could not possibly return any useful data (since the associated unit mask is clear). I presume they don't attempt this because of the complexity of tracking the quirks of the many different supported CPUs.
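Since the unit mask is just a bit set, the table above can be written down as flags and combined with a bitwise OR. The following is a minimal sketch; the enum and function names are hypothetical, and only the numeric values come from the table:

```c
/* Unit-mask bits for the data cache refill event, copied from the table. */
enum dc_refill_umask {
    DC_REFILL_SYSTEM       = 0x01,  /* refill from system memory       */
    DC_REFILL_L2_SHARED    = 0x02,  /* Shared-state line from L2       */
    DC_REFILL_L2_EXCLUSIVE = 0x04,  /* Exclusive-state line from L2    */
    DC_REFILL_L2_OWNED     = 0x08,  /* Owned-state line from L2        */
    DC_REFILL_L2_MODIFIED  = 0x10   /* Modified-state line from L2     */
};

/* Union of the four "refill from L2" flags. */
static unsigned all_l2_refills(void)
{
    return DC_REFILL_L2_SHARED | DC_REFILL_L2_EXCLUSIVE |
           DC_REFILL_L2_OWNED  | DC_REFILL_L2_MODIFIED;
}

/* Nonzero if a given unit mask would count refills of the given kind. */
static int umask_selects(unsigned umask, unsigned kind)
{
    return (umask & kind) != 0;
}
```

Here `all_l2_refills()` evaluates to 0x1e, while `umask_selects(0x0, kind)` is false for every kind in the table, which is why the default mask never counts anything.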

To see the problem, consider the following command and output:

    $ cputrack -c DC_refill_from_L2  application
       time lwp      event      pic0
      1.015   1       tick         0
      2.015   1       tick         0
      2.178   1       exit         0
    
However, specifying the unit mask (in this example, the union of all of the "refill from L2" flags) changes the output to:

    $ cputrack -c DC_refill_from_L2,umask=0x1e application
       time lwp      event      pic0
      1.028   1       tick     47981
      2.018   1       tick     47225
      2.144   1       exit    101299
    

The problem is the same for collect/analyzer, but the syntax for specifying the unit mask is slightly different. As the documentation explains, it uses the hardware counter syntax:

    counter_name[~attribute=value]

which translates the example to the following:

    $ collect -h DC_refill_from_L2~umask=0x1e,hi application
    
This issue isn't a problem for most uses of collect, which rely on the well-known profiling counters such as cycles, insts, and icm. However, you need to pay attention when using the list of CPU-specific flags.
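If the same counter has to be passed to both tools from a script or test harness, the two spellings can be generated side by side. This is only a sketch: the helper names `cputrack_spec` and `collect_spec` are hypothetical, and the formats simply mirror the two command lines shown above (including the trailing `hi`, which is taken verbatim from the collect example):

```c
#include <stdio.h>

/* Build a counter spec in the form used by the cputrack example:
 * counter,umask=0x.. */
static int cputrack_spec(char *buf, size_t len,
                         const char *counter, unsigned umask)
{
    return snprintf(buf, len, "%s,umask=0x%x", counter, umask);
}

/* Build a counter spec in the form used by the collect example:
 * counter~umask=0x..,interval */
static int collect_spec(char *buf, size_t len,
                        const char *counter, unsigned umask,
                        const char *interval)
{
    return snprintf(buf, len, "%s~umask=0x%x,%s", counter, umask, interval);
}
```

Both helpers return the value of `snprintf`, so a caller can detect truncation by comparing the result against the buffer size.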