The first interesting clue was that when the code was compiled for debug, it ran as expected, but when compiled for production, it ran quite slowly. Unfortunately, collect and analyzer showed that the extra time was spread over a broad set of functions (rather than being concentrated in a few misbehaving hot-spots).
There are plenty of problems (like cache conflicts or code scheduling) which can cause a mostly CPU-bound application to slow down by a few percent (or even several tens of percent). But when that kind of application slows down by a factor of ten, it often means that it's having some kind of bad interaction with the system (e.g. paging, an I/O bottleneck, TLB thrashing, FP underflow handling, etc.). Rather than immediately jumping in with DTrace, it was simpler to first check for obvious problems using vmstat, truss, and trapstat.
trapstat revealed that the application was generating a huge number of lddf-unalign and stdf-unalign traps (for floating-point loads and stores at incorrectly aligned addresses). It looked something like:
# /usr/sbin/trapstat ./some_application
vct name                |     cpu0
------------------------+---------
 20 fp-disabled         |        5
 24 cleanwin            |      794
 35 lddf-unalign        |  1085199
 36 stdf-unalign        |  1012567
 41 level-1             |       66
... rest of output elided ...
The time spent in the trap handler easily explained the performance symptom. Some follow-on forensics then showed that an application-specific memory management layer was returning memory blocks which were only aligned to a 4-byte boundary (when compiled for production). Once that layer was reconfigured, the performance returned to normal.
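To make the alignment issue concrete, here is a minimal sketch of an allocation layer that rounds every block up to a doubleword (8-byte) boundary. It is not the application's actual allocator, which is not shown here; the names arena_alloc, align_up, is_aligned, ARENA_SIZE and ARENA_ALIGN are all hypothetical, and the sketch assumes a C11 compiler (for _Alignas).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bump allocator; all names here are illustrative only. */
#define ARENA_SIZE  4096
#define ARENA_ALIGN 8    /* doubleword alignment, enough for type double */

/* C11 _Alignas makes the arena itself start on a doubleword boundary. */
static _Alignas(ARENA_ALIGN) unsigned char arena[ARENA_SIZE];
static size_t arena_used = 0;

/* Round n up to the next multiple of align (align must be a power of two). */
static size_t align_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Return nonzero if p sits on an align-byte boundary (power-of-two align). */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hand out a block of at least size bytes, or NULL when the arena is full.
 * Each block starts on an ARENA_ALIGN boundary, so a double stored at the
 * start of a block is never left on a 4-byte-only boundary. */
static void *arena_alloc(size_t size)
{
    size_t start = align_up(arena_used, ARENA_ALIGN);
    if (start > ARENA_SIZE || size > ARENA_SIZE - start)
        return NULL;
    arena_used = start + size;
    return arena + start;
}
```

With ARENA_ALIGN set to 4 instead of 8, a 4-byte request followed by an 8-byte request would place the second block at offset 4, which is the kind of layout that leads to the lddf-unalign and stdf-unalign traps shown above.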
Still, I was surprised by this because my experience had been that SPARC generated a SIGBUS in response to a misaligned access. The few times that I had ever tried to work around misalignment (rather than actually fixing it), I had to resort to either specifying the -misalign compiler option or issuing a ST_FIX_ALIGN trap to turn on the kernel trap handler.
A review of the compiler docs reminded me of the "i" (for interpret) variations to the -xmemalign option. This suggested that the kernel trapping behavior could be explained if the application had been compiled with the option "-xmemalign=8i". A more subtle explanation, though, is that the compilers have the following defaults:
- -xmemalign=8i for all v8 architectures
- -xmemalign=8s for all v9 architectures
But even this didn't explain some of the tests I tried. In particular, for 64-bit executables, misaligned loads and stores of type double (on 4-byte boundaries) were never generating a SIGBUS, even when compiled with "-xmemalign=8s". The explanation for this is in Section A.25, "Load Floating-Point" of the SPARC Architecture Manual which contains the following note:
LDDF requires only word alignment. However, if the effective
address is word-aligned but not double word-aligned, LDDF may
cause an LDDF_mem_address_not_aligned exception. In this case
the trap handler software shall emulate the LDDF instruction
and return.
So for the special case of 8-byte floating-point loads on a 4-byte boundary, the SPARC V9 architecture (not just a particular implementation) requires the misalignment to be handled. As far as I know, in all other cases on SPARC, the operand size should be no larger than the operand alignment (and this still holds true for integer accesses).
This has no impact on most applications because the compiler allocates type double on a double word boundary, and memory returned from malloc (or new) will also be at least double word aligned.
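For code that does place doubles by hand (for example inside a packed buffer that is only word-aligned), one common way to avoid depending on trap emulation is to copy the bytes instead of dereferencing a double pointer. The following is only a sketch of that idea; the helper names load_double_any, store_double_any and is_doubleword_aligned are hypothetical and do not come from any Solaris or compiler API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return nonzero if p is on an 8-byte (doubleword) boundary. */
static int is_doubleword_aligned(const void *p)
{
    return ((uintptr_t)p & 7u) == 0;
}

/* Read a double from an address with any alignment.
 * memcpy makes no alignment assumption about p, so the compiler does not
 * need to issue a single 8-byte floating-point load against p itself. */
static double load_double_any(const void *p)
{
    double v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}

/* Write a double to an address with any alignment. */
static void store_double_any(void *p, double v)
{
    assert(p != NULL);
    memcpy(p, &v, sizeof v);
}
```

A check such as is_doubleword_aligned can also be placed in a debug-only assertion at the point where a custom allocator returns memory, which would flag a 4-byte-aligned block well before it shows up as trap overhead.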
One interesting aspect of this is that if there had been fewer misaligned loads, the performance impact might not have been enough to trigger an investigation (thus leaving an undiagnosed performance problem). So, from a performance analysis perspective, it might be better if misaligned loads raised a signal, since that would immediately alert the developer that something was wrong.