Morgan Herrington's Blog Adventures in Porting and Tuning

Thursday Jun 28, 2007

Another tuning situation where I was involved as a sounding-board for my colleague Dan Souder dealt with an application which ran at only 1/10th its expected rate. [Rather than executing a particular test case in 18 seconds, it took 200 seconds.]

The first interesting clue was that when the code was compiled for debug, it ran as expected, but when compiled for production, it ran quite slowly. Unfortunately, collect and analyzer showed that the extra time was spread over a broad set of functions (rather than being concentrated in a few misbehaving hot-spots).

There are plenty of problems (like cache conflicts or code scheduling) which can cause a mostly CPU-bound application to slow down by a few percent (or even several tens of percent). But when that kind of application slows down by a factor of ten, it often means that that it's having some kind of bad interaction with the system (ie. paging, I/O bottleneck, TLB thrashing, FP underflow handling, etc). Rather than immediately jumping in with DTrace, it was simpler to first check for obvious problems using vmstat, truss, and trapstat.

trapstat revealed that the application was generating a huge number of lddf-unalign and stdf-unalign traps (for floating point loads and stores of incorrectly aligned addresses). It looked something like:

    # /usr/sbin/trapstat ./some_application
    vct name                |     cpu0
    ------------------------+---------
     20 fp-disabled         |        5
     24 cleanwin            |      794
     35 lddf-unalign        |  1085199
     36 stdf-unalign        |  1012567
     41 level-1             |       66
    ... rest of output elided ...

The time spent in the trap handler easily explained the performance symptom. Then some follow-on forensics showed that an application specific memory management layer was returning memory blocks which were only aligned to a 4-byte boundary (when compiled for production). Once that was reconfigured, the performance returned to normal.

Still, I was surprised by this because my experience had been that SPARC generated a SIGBUS in response to a misaligned access. The few times that I had ever tried to work around misalignment (rather than actually fixing it), I had to resort to either specifying the -misalign compiler option or issuing a ST_FIX_ALIGN trap to turn on the kernel trap handler.

A review of the compiler docs reminded me of the "i" (for interpret) variations to the -xmemalign option. This suggested that the kernel trapping behavior could be explained if the application had been compiled with the option "-xmemalign=8i". A more subtle explanation, though, is that the compilers have the following defaults:

  • -xmemalign=8i for all v8 architectures
  • -xmemalign=8s for all v9 architectures
which means that for 64-bit builds, the default behavior would be what I expected (a misaligned access would signal and cause a SIGBUS). But for 32-bit compiles, the default would have the kernel interpret any misaligned access (and thus account for the trap handling).

But even this didn't explain some of the tests I tried. In particular, for 64-bit executables, misaligned loads and stores of type double (on 4-byte boundaries) were never generating a SIGBUS, even when compiled with "-xmemalign=8s". The explanation for this is in Section A.25, "Load Floating-Point" of the SPARC Architecture Manual which contains the following note:

    LDDF requires only word alignment.  However, if the effective
    address is word-aligned but not double word-aligned, LDDF may
    cause an LDDF_mem_address_not_aligned exception.  In this case
    the trap handler software shall emulate the LDDF instruction
    and return.
So for the special case of 8-byte floating-point loads on a 4-byte boundary, the SPARC V9 architecture (not just a particular implementation) requires the misalignment to be handled. As far as I know, in all other cases on SPARC, the operand size should be no larger than the operand alignment (and this still holds true for integer accesses).

This has no impact on most applications because the compiler allocates type double on a double word boundary, and memory returned from malloc (or new) will also be at least double word aligned.

One interesting aspect to this is that if there had been fewer misaligned loads, the performance impact might not have been enough to trigger an investigation (thus leaving an undiagnosed performance problem). So, from a performance analysis perspective, it might be better if misaligned loads would signal, since that would immediately alert the developer that something was wrong.

Monday Jun 18, 2007

I will usually blog about my own experiences, but my co-workers Dan Souder and Bogdan Vasiliu included me on the periphery of a couple cases that I think were interesting and worth discussing.

The first of these cases involved a large application which correctly worked on multiple platforms (including Solaris/SPARC and Solaris/x64). But when it was compiled for Solaris 10 for 32-bit x86 using Sun Studio compilers, it generated floating point NaN's (IEEE format "Not A Number" values). The mystery was that when the calculation which exhibited the problem was traced in the debugger, the input values were correct and the assembly instructions looked right, but the result was a NaN. To compound the mystery, when the the same code sequence was transplanted into a small test program, it worked correctly.

After a bit of thrashing around on theories which proved baseless, one of the engineers did a Google search and discovered a bug report which described a similar failure signature. The test case was something like the following:

  #include <stdio.h>
  #include <math.h>

  typedef void (*ptr_to_void_function_t)();

  double
  double_function()
  {
     return 0.0;
  }

  int
  main(int argc, char* argv[])
  {
      ptr_to_void_function_t pfunc = (ptr_to_void_function_t)double_function;

      for (int i=0; i<6; i++) {
        pfunc();
        double dValue = exp(-5.0);
        printf("Iteration %d -> Value %g\r\n", i, dValue);
      }
      return 0;
  }
The main program calls double_function() (via the function pointer pfunc) as well as the the math library function exp(). Since it always uses a constant argument, there doesn't appear to me much of an opportunity for failure. Surprisingly, though, it produces:
  $ cc bug.c -lm
  $ a.out
  Iteration 0 -> Value 0.00673795
  Iteration 1 -> Value 0.00673795
  Iteration 2 -> Value 0.00673795
  Iteration 3 -> Value 0.00673795
  Iteration 4 -> Value 0.00673795
  Iteration 5 -> Value -NaN
Despite the location of the symptom (the "NaN"), there's nothing wrong with the exp() computation. The underlying problem is unique to the 32-bit ABI on the x86 architecture, which uses the 8087 floating point register stack to return floating point function results. The return value mechanism involves two parts: (1)the called function places the return value in st0 and (2) the calling function is responsible for consuming the value and removing it from the register stack. In this example, however, the user tricked the compiler by hiding the real function prototype behind a pointer cast (with no return value specified).
  ptr_to_void_function_t pfunc = (ptr_to_void_function_t)double_function;
Because of this, the compiler doesn't place any code in the calling function to clean up the register stack, so it "leaks" a value onto the stack each time it is called. By the sixth iteration, the register stack overflows.

So that explains what is happening, but how do you find where this occurs in a multi-million line application? To find this specific situation, it might be helpful for the compilers to issue a warning on this kind of cast. However, the example is syntactically legal C, so the tools are certainly not required to complain.

The compiler group explained that the compiler maintains the invariant that the floating-point-register stack must be empty upon entry to any function and should contain no more than one value upon exit. This suggested using DTrace to perform a consistency check on the FP stack at each function entry (or exit). The idea was to check the x86 ftag register, which holds a mask of the valid (active) floating-point-stack registers. In order to check for, say, 1 active entry, you would need to look for the binary mask 10000000 (hex 0x80). In theory, the DTrace command might have been something like:

  dtrace -n 'pid$target:a.out::entry /uregs[R_FTAG] == 0xF0/ { ustack(); }' -c application
This doesn't work, though, because the floating point registers (and corresponding status registers) are not available from within DTrace.

A work-around for this might have been to invoke a program (via "system()") which would inspect the FP status register (via /proc) on each function entry:

  dtrace -wn 'pid$target:a.out::entry {system("fpcheck %d", $pid); }' -c application
The downside to this is that aside from being quite slow (invoking a program for every function call), there also appears to be a conflict inspecting a process via /proc when it is also being controlled by DTrace.

The actual solution Dan used was to manually narrow the interval between the last known point where the FP stack was OK and the first known point where it contained a leaked value. Then he used dbx to look for the problem with some brute-force single-stepping. Because this section of code was not floating point intensive, he checked for any case where the FP stack contained multiple values. This corresponded to the ftag register containing a binary mask of 11000000, so this search was accomplished with a dbx conditional breakpoint of:

  stop cond $ftag==0xC0
As it turns out, the problem was not caused by an abusive cast in the source but rather a bug in a compiler optimization (since fixed) which incorrectly ignored an unused return value.

I found this case intriguing because it highlighted some of the strengths and limitations of DTrace, /proc, and dbx (not to mention demonstrating a bit of quirkiness about the x86).