Morgan Herrington's Blog Adventures in Porting and Tuning

Thursday Jun 28, 2007

Another tuning situation where I was involved as a sounding-board for my colleague Dan Souder dealt with an application which ran at only 1/10th its expected rate. [Rather than executing a particular test case in 18 seconds, it took 200 seconds.]

The first interesting clue was that when the code was compiled for debug, it ran as expected, but when compiled for production, it ran quite slowly. Unfortunately, collect and analyzer showed that the extra time was spread over a broad set of functions (rather than being concentrated in a few misbehaving hot-spots).

There are plenty of problems (like cache conflicts or code scheduling) which can cause a mostly CPU-bound application to slow down by a few percent (or even several tens of percent). But when that kind of application slows down by a factor of ten, it often means that it's having some kind of bad interaction with the system (e.g. paging, an I/O bottleneck, TLB thrashing, FP underflow handling). Rather than immediately jumping in with DTrace, it was simpler to first check for obvious problems using vmstat, truss, and trapstat.

trapstat revealed that the application was generating a huge number of lddf-unalign and stdf-unalign traps (for floating point loads and stores of incorrectly aligned addresses). It looked something like:

    # /usr/sbin/trapstat ./some_application
    vct name                |     cpu0
    ------------------------+---------
     20 fp-disabled         |        5
     24 cleanwin            |      794
     35 lddf-unalign        |  1085199
     36 stdf-unalign        |  1012567
     41 level-1             |       66
    ... rest of output elided ...

The time spent in the trap handler easily explained the performance symptom. Then some follow-on forensics showed that an application specific memory management layer was returning memory blocks which were only aligned to a 4-byte boundary (when compiled for production). Once that was reconfigured, the performance returned to normal.
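To make the fix concrete, here is a minimal sketch of a memory layer that rounds every block up to an 8-byte boundary. It is not the application's actual allocator (which I can't show); the names `arena`, `round_up`, and `arena_alloc` are hypothetical, and the fixed 4096-byte arena is just an assumption to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size arena; _Alignas (C11) pins its base to 8 bytes. */
static _Alignas(8) unsigned char arena[4096];
static size_t arena_used = 0;

/* Round n up to the next multiple of align (align must be a power of two). */
static size_t round_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Return a block whose address is a multiple of 8, or NULL if the arena is full. */
void *arena_alloc(size_t size)
{
    size_t start = round_up(arena_used, 8);

    if (start > sizeof arena || size > sizeof arena - start)
        return NULL;
    arena_used = start + size;
    return arena + start;
}
```

Rounding the start offset (rather than the requested size) keeps each returned address doubleword aligned no matter how odd the previous request was.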

Still, I was surprised by this because my experience had been that SPARC generated a SIGBUS in response to a misaligned access. The few times that I had ever tried to work around misalignment (rather than actually fixing it), I had to resort to either specifying the -misalign compiler option or issuing a ST_FIX_ALIGN trap to turn on the kernel trap handler.

A review of the compiler docs reminded me of the "i" (for interpret) variations to the -xmemalign option. This suggested that the kernel trapping behavior could be explained if the application had been compiled with the option "-xmemalign=8i". A more subtle explanation, though, is that the compilers have the following defaults:

  • -xmemalign=8i for all v8 architectures
  • -xmemalign=8s for all v9 architectures

This means that for 64-bit builds, the default behavior would be what I expected (a misaligned access would signal and cause a SIGBUS). But for 32-bit compiles, the default would have the kernel interpret any misaligned access (and thus account for the trap handling).

But even this didn't explain some of the tests I tried. In particular, for 64-bit executables, misaligned loads and stores of type double (on 4-byte boundaries) never generated a SIGBUS, even when compiled with "-xmemalign=8s". The explanation for this is in Section A.25, "Load Floating-Point", of the SPARC Architecture Manual, which contains the following note:

    LDDF requires only word alignment.  However, if the effective
    address is word-aligned but not double word-aligned, LDDF may
    cause an LDDF_mem_address_not_aligned exception.  In this case
    the trap handler software shall emulate the LDDF instruction
    and return.

So for the special case of 8-byte floating-point loads on a 4-byte boundary, the SPARC V9 architecture (not just a particular implementation) requires the misalignment to be handled. As far as I know, in all other cases on SPARC, an operand must be aligned to at least its own size (and this still holds true for integer accesses).

This has no impact on most applications because the compiler allocates type double on a double word boundary, and memory returned from malloc (or new) will also be at least double word aligned.
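When code really does have to read or write a double at an address that is only word aligned, the portable C idiom is to copy the bytes rather than dereference a `double *`. The sketch below is an illustration, not what the application did; the function names are hypothetical, and how a given compiler lowers the `memcpy` calls is implementation dependent.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a double from memory that may be only 4-byte aligned. */
double load_double_unaligned(const void *p)
{
    double d;

    memcpy(&d, p, sizeof d);
    return d;
}

/* Write a double to memory that may be only 4-byte aligned. */
void store_double_unaligned(void *p, double d)
{
    memcpy(p, &d, sizeof d);
}

/* Round-trip a value through an address that is word aligned
 * but deliberately not doubleword aligned. */
double demo_word_aligned_double(double value)
{
    _Alignas(8) unsigned char buf[16];
    unsigned char *p = buf + 4;

    assert((uintptr_t)p % 4 == 0 && (uintptr_t)p % 8 != 0);
    store_double_unaligned(p, value);
    return load_double_unaligned(p);
}
```

Because the access never goes through a misaligned `double *`, the compiler is not entitled to assume doubleword alignment at that address.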

One interesting aspect to this is that if there had been fewer misaligned loads, the performance impact might not have been enough to trigger an investigation (thus leaving an undiagnosed performance problem). So, from a performance analysis perspective, it might be better if misaligned loads would signal, since that would immediately alert the developer that something was wrong.
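In the absence of a signal from the hardware, an application can raise its own alarm. The following is a rough sketch of an explicit alignment check; `is_aligned` and `require_double_aligned` are hypothetical helpers, not part of any library. Note that the check on the pointer uses `abort()` rather than `assert()`, since in this case the problem only appeared in the production build, where asserts are typically compiled out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Nonzero if p is a multiple of align. */
int is_aligned(const void *p, size_t align)
{
    assert(align != 0);  /* a zero alignment is a bug in the caller */
    return (uintptr_t)p % align == 0;
}

/* Return p as a double pointer, or report and abort if it is not
 * 8-byte aligned.  This check stays active even when NDEBUG is defined. */
double *require_double_aligned(void *p, const char *what)
{
    if (!is_aligned(p, 8)) {
        fprintf(stderr, "%s: %p is not 8-byte aligned\n", what, p);
        abort();
    }
    return (double *)p;
}
```

A call such as `require_double_aligned(block, "pool_alloc")` at the point where a custom memory layer hands out a block would turn a silent slowdown into an immediate, attributable failure.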
