I will usually blog about my own experiences, but my co-workers
Dan Souder and Bogdan Vasiliu included me on the periphery of a couple
cases that I think were interesting and worth discussing.
The first of these cases involved a large application which correctly worked on multiple
platforms (including Solaris/SPARC and Solaris/x64). But when it was compiled
for Solaris 10 for 32-bit x86 using Sun Studio compilers,
it generated floating point NaN's (IEEE format "Not A Number" values).
The mystery was that when the calculation which exhibited
the problem was traced in the debugger, the input values were correct
and the assembly instructions looked right, but the result was a NaN.
To compound the mystery, when the the same code sequence
was transplanted into a small test program, it worked correctly.
After a bit of thrashing around on theories which proved baseless,
one of the engineers did a Google search and discovered a bug
report which described a similar failure signature. The test case
was something like the following:
#include <stdio.h>
#include <math.h>
typedef void (*ptr_to_void_function_t)();
double
double_function()
{
return 0.0;
}
int
main(int argc, char* argv[])
{
ptr_to_void_function_t pfunc = (ptr_to_void_function_t)double_function;
for (int i=0; i<6; i++) {
pfunc();
double dValue = exp(-5.0);
printf("Iteration %d -> Value %g\r\n", i, dValue);
}
return 0;
}
The main program calls
double_function() (via the function pointer
pfunc)
as well as the the math library function
exp(). Since it always uses a
constant argument, there doesn't appear to me much of an opportunity
for failure. Surprisingly, though, it produces:
$ cc bug.c -lm
$ a.out
Iteration 0 -> Value 0.00673795
Iteration 1 -> Value 0.00673795
Iteration 2 -> Value 0.00673795
Iteration 3 -> Value 0.00673795
Iteration 4 -> Value 0.00673795
Iteration 5 -> Value -NaN
Despite the location of the symptom (the "NaN"), there's nothing wrong with
the
exp() computation. The underlying problem is unique to the 32-bit
ABI on the x86 architecture, which uses
the 8087 floating point register stack to return floating point function results.
The return value mechanism involves two parts: (1)the called function places the
return value in
st0 and (2) the calling function is responsible for consuming
the value and removing it from the register stack. In this example, however,
the user
tricked the compiler by hiding the real function
prototype behind a pointer cast (with no return value specified).
ptr_to_void_function_t pfunc = (ptr_to_void_function_t)double_function;
Because of this, the compiler doesn't place any code in the calling function
to clean up the register stack, so it "leaks" a value onto the stack each
time it is called. By the sixth iteration, the register stack overflows.
So that explains what is happening, but how do you find where
this occurs in a multi-million line application? To find this specific
situation, it might be helpful for the compilers to issue a warning on
this kind of cast. However, the example is syntactically legal
C, so the tools are certainly not required to complain.
The compiler group explained that the compiler maintains
the invariant that the floating-point-register stack must be empty upon
entry to any function and should contain no more than one value upon exit.
This suggested using DTrace to perform a consistency check on the FP
stack at each function entry (or exit). The idea was to check the x86
ftag register, which holds a mask of the valid (active)
floating-point-stack registers. In order to check for, say, 1 active
entry, you would need to look for the binary mask 10000000 (hex 0x80).
In theory, the DTrace command might have been something like:
dtrace -n 'pid$target:a.out::entry /uregs[R_FTAG] == 0xF0/ { ustack(); }' -c application
This doesn't work, though, because the floating point registers (and
corresponding status registers) are not available from within DTrace.
A work-around for this might have been to invoke a program (via "system()")
which would inspect the FP status register (via /proc) on each function entry:
dtrace -wn 'pid$target:a.out::entry {system("fpcheck %d", $pid); }' -c application
The downside to this is that aside from being quite slow (invoking a program
for every function call), there also appears to be a conflict inspecting a process
via /proc when it is also being controlled by DTrace.
The actual solution Dan used was to manually narrow the interval between the last
known point where the FP stack was OK and the first known point where it
contained a leaked value. Then he used dbx to look for
the problem with some brute-force single-stepping. Because this
section of code was not floating point intensive, he checked for
any case where the FP stack contained multiple values. This
corresponded to the ftag register containing a binary mask
of 11000000, so this search was accomplished with a dbx
conditional breakpoint of:
stop cond $ftag==0xC0
As it turns out, the problem was not caused by an abusive cast in
the source but rather a bug in a compiler optimization (since fixed)
which incorrectly ignored an unused return value.
I found this case intriguing because it highlighted
some of the strengths and limitations of DTrace, /proc, and dbx
(not to mention demonstrating a bit of quirkiness about the x86).