Jonathan Adams's Weblog
Brokenness Hides Itself
When engineers get together and talk, one of the things they like to bring out and share is war stories; tales of debugging daring-do and the amazing brokenness that can be found in the process. I recently went through an experience that makes good war story material, and I thought I'd share it.
A couple weeks ago, there were multiple reports of svc.configd(1M)
failing repeatedly with one of:
svc.configd: Fatal error: invalid integer "10" in field "id"
svc.configd: Fatal error: invalid integer in database
Since I'm one of the main developers of svc.configd(1M), I started to
investigate. I first had the people hitting it send me their repositories,
but they all checked out as having no problems. The problem was only being
seen on prototype Niagara machines and some Netra T1s; the first
reproducible machine I got console access to was a Niagara box.
Figuring out what happened
Unfortunately, the box was running old firmware, which significantly restrained its usability; I spent more time fighting the machine than working on tracking down the problem. I finally boot neted the machine, mounted the root filesystem, and added a line:
sulg::sysinit:/sbin/sulogin </dev/console 2<>/dev/console >&2to /etc/inittab to get a shell very early in boot. After booting and logging into that shell, I was able attach mdb(1) to svc.configd(1M):
# mdb -p `pgrep svc.configd` Loading modules: [ svc.configd ld.so.1 libumem.so.1 libuutil.so.1 libc.so.1 ] >Now, I knew that svc.configd uses a utility function, uu_strtouint(), to do all of its integer-to-number conversions, and that the most likely cause of the failure seen was some kind of unprintable character in its first argument. So I set a breakpoint there and started it running:
> uu_strtouint::bp > :c mdb: stop at libuutil.so.1`uu_strtouint mdb: target stopped at: uu_strtouint: save %sp, -0x68, %sp > ::step[1] mdb: target stopped at: uu_strtouint+4: ld [%fp + 0x60], %l6 > $C 2 feefb968 libuutil.so`uu_strtouint+4(1cb4e4, feefba9c) feefb9d0 string_to_id+0x24(1cb4e4, feefba9c) feefba38 fill_child_callback+0x20(feefbbe4, 2) feefbaa0 sqlite_exec+0xd8(13de08, 2) feefbb18 backend_run+0x74(89b48, 169848) feefbb80 scope_fill_children+0x38(5e9f00, 1000) ... >So the first argument to uu_strtouint() is 1cb4e4. I attempted to print out the string there:
> 1cb4e4/s mdb: failed to read data from target: no mapping for address 0x2e: >"failed to read data from target"? A reported address of 0x2e instead of 0x1cb4e4? Clearly, the problem is affecting mdb(1) as well. Some additional tests in mdb(1) revealed the magnitude of the problem:
> 1=J
1
> 10=J
1000000001
> 11=J
1000000002
> 111=J
11000000003
> 101=J
11000000002
i.e. the results were completely busted, and had a lot of high-word bits set
unexpectedly. Interestingly, it looked like only multiple-digit numbers were
affected.
So I proceeded to investigate uu_strtouint(). The first step was to see how the function was failing; there are a number of different ways to get to uu_set_error(), which sets up libuutil's equivalent of errno. A simple breakpoint led to the following code segment:
269 if (strtoint(s, &val, base, 0) == -1) 270 return (-1); 271 272 if (val < min) { 273 uu_set_error(UU_ERROR_UNDERFLOW); 274 return (-1); 275 } else if (val > max) { 276 uu_set_error(UU_ERROR_OVERFLOW); 277 return (-1); 278 }The failure was occurring at line 276; i.e. an overflow. The value (for an input string of "10") was 0x10000000a, that is, 1*2^32 + 10. So something was going terribly wrong in strtoint(). Since mdb(1) was also failing, the problem was probably in some shared library routine. Looking at the disassembly, there are very few external calls:
> strtoint::dis ! grep call | sed 's/libuutil.so.1`//g' strtoint+0xc: call +8 <strtoint+0x14> strtoint+0x204: call +0x12a30 <PLT:__udiv64> strtoint+0x2b8: call +0x12988 <PLT:__umul64> strtoint+0x404: call -0xd30 <uu_set_error> strtoint+0x414: call -0xd40 <uu_set_error> strtoint+0x424: call -0xd50 <uu_set_error> strtoint+0x440: call -0xd6c <uu_set_error> strtoint+0x450: call -0xd7c <uu_set_error> strtoint+0x460: call -0xd8c <uu_set_error> >The first call is part of the PIC[2] calling sequence, and uu_set_error() is only called in the failure paths, which we knew weren't being hit. So __umul64() and __udiv64() are the next suspects. These are runtime support routines in libc, which the compiler inserts calls to when it wants to do 64-bit multiplications and divisions. The code for strtoint() only has one multiply and one divide, so it's easy to see where they occur:
103 multmax = (uint64_t)UINT64_MAX / (uint64_t)base;
104
105 for (c = *++s; c != '\0'; c = *++s) {
...
116 if (val > multmax)
117 overflow = 1;
118
119 val *= base;
120 if ((uint64_t)UINT64_MAX - val < (uint64_t)i)
121 overflow = 1;
122
123 val += i;
124 }
The division always occurs, so I looked at the multiply routine first;
disassembling it showed the following suspicious section:
> __umul64::dis ! sed 's/libc.so.1`//g' ... __umul64+0x38: cmp %l7, 0 __umul64+0x3c: call +0xc95e4 <PLT:.umul> __umul64+0x40: mov %i3, %i0 __umul64+0x44: mov %l6, %i1 ...For a function call, %o0-%o6 hold the arguments to the function, and afterwards, %o0 and %o1 hold the results. But here, there's no manipulation of the %os. In fact, the function doesn't reference them anywhere:
> __umul64::dis ! grep '%o' >Here's the smoking gun; we've hit some kind of compiler bug. The relevant source code is:
usr/src/lib/libc/sparc/crt/mul64.c:
36 extern unsigned long long __umul32x32to64(unsigned, unsigned);
...
70 unsigned long long
71 __umul64(unsigned long long i, unsigned long long j)
72 {
...
81 if (i1)
82 result = __umul32x32to64(i1, j1);
88 return (result);
89 }
usr/src/lib/libc/sparc/crt/muldiv64.il:
29 .inline __umul32x32to64,8
30 call .umul,2
31 nop
32 mov %o0, %o2
33 mov %o1, %o0
34 mov %o2, %o1
35 .end
From previous experience with the compiler's inlining (I codereviewed
the fix for another inliner-related bug,
6225876)
I knew that there is an optimization stage after the inline code is generated;
from the looks of it, that stage thought that the call had no effect on
register state, and so optimized away the argument set up. As a result,
the final return value of __umul64 is just some junk register state. So
I updated the original bug, and sent it over to the compiler people:6323803 compiler bug causes __*mul64 failure; svc.configd dies
After some investigation, the compiler folks accepted the bug, and noted that it only effects call statements in inline assembly. Checking the rest of the '.il' files in OS/Net, I verified that this was the only place where we used call.
I still needed to figure out why this was only effecting certain platforms, and how we were going to deal with the problem in the short-term, so that we weren't waiting on a compiler patch.
Why only sun4v and Netras?
I first wrote a small C program, to test:
% cat > math_test.c <<EOF
#include <stdio.h>
int
main(int argc, char *argv[])
{
unsigned long long x, y, z;
x = 10;
y = 1;
z = x * y;
printf("%llx\n", z);
return (0);
}
EOF
% cc -o math_test math_test.c
% ./math_test
a
% truss -t \!all -u '::__*mul64' ./math_test
/1@1: -> libc_psr:__mul64(0x0, 0xa, 0x0, 0x1)
/1@1: <- libc_psr:__mul64() = 0
a
%
... (moving over to niagara machine) ...
% truss -t \!all -u '::__*mul64' ./math_test
/1@1: -> libc:__mul64(0x0, 0xa, 0x0, 0x1)
/1@1: <- libc:__mul64() = 1
10000000a
%
A hah! For sun4u, libc_psr is overriding the libc definitions. So sun4v
won't work for two reasons:
- none of the sun4v libc_psrs override __*mul64
- even if they did, sun4v has a wacky "mount the correct libc_psr over /platform/sun4v/lib/libc_psr.so.1 which wouldn't occur until later in boot anyway.
The fact that these weren't included in the sun4v libc_psr is an oversight, but #2 means that it wouldn't have mattered if they did. The Netra T1's running into this problem are explained by the fact that there are a set of missing /platform/*/lib symlinks for the following platforms:
SUNW,Ultra-1-Engine SUNW,UltraAX-MP SUNW,UltraAX-e SUNW,UltraAX-e2 SUNW,UltraSPARC-IIi-Engine SUNW,UltraSPARC-IIi-cEngine SUNW,UltraSPARCengine_CP-20 SUNW,UltraSPARCengine_CP-40 SUNW,UltraSPARCengine_CP-60 SUNW,UltraSPARCengine_CP-80Which are all various "Netra T1" varieties. Since the link to the sun4u libc_psr is missing, they exhibit the same problem as the sun4v does. I filed 6324790 to cover adding the missing links, since it is a performance problem. (programs will not be able to take advantage of the faster mem{cpy,cmp,move,set} versions contained there)
The final question is "Why now?". What changed to make this a problem? Four days before the first reported incident, 6316914 was putback, which switched the build from the Studio 8 to the Studio 10 compilers. Because of the libc_psr masking, no-one noticed the problem until they put the bits on the (much rarer) platforms with the bug.
The Fix
To fix this, you simply move the __{u,}{mul,div}64 functions from libc_psr back into libc, using the v8plus versions that were in libc_psr. libc's assembly files are already being compiled in v8plus mode due to atomic_ops(3C), so it just required shuffling around some code, removing stuff from makefiles, and deleting the old, out of date code. This was done under bugid 6324631, integrated in the same build as the compiler switch, so only a limited number of OS/Net developers were effected. Life was all better. Well, almost.
The follow-on
In testing my fix, I did a full build, bfued a machine, and just dumped fixed binaries on other machines. The one thing I didn't test was BFUing from broken bits to fixed bits. And, of course, there was an unforeseen problem (bugid 6327152). To understand what went wrong, I'm going to have to do some background on how BFU works.
bfu is a power-tool whose essential job is to dump a full set of binaries over a running system, even if the new bits are incompatible with the ones running the system. To do this, it copies binaries, libraries, and the dynamic linker into a set of subdirectories of /tmp: /tmp/bfubin, /tmp/bfulib, and /tmp/bl. It then uses a tool called "bfuld" to re-write the "interepreter" information for the binaries in /tmp/bfubin, to point at the copied ld.so.1(1). It then sets LD_LIBRARY_PATH in the environment, to re-direct any executed programs to the copied libraries, and sets PATH=/tmp/bfubin. This gives BFU a protected environment to run in.
The problem is that auxiliary filters (like libc_psr) were not disabled, so programs running in the BFU environment were picking up the libc_psr from /platform. Once the *new* libc_psr was extracted, programs were no longer protected from the broken __*mul64() routines. Since things like scanf(3C) use __mul64 internally, this caused breakage all over the place, most noticeably in cpio(1).
The fix for this is reasonably simple; set LD_NOAUXFLTR=1 in the environment to prevent auxiliary filters from being used,[3] make a copy of libc_psr.so.1 into /tmp/bfulib, and use LD_PRELOAD=/tmp/bfulib/libc_psr.so.1 to override the bad libc functions. The latter part of this can be removed once we're sure no broken libcs are running around.
Conclusion
I hope you've enjoyed this. The bug ended up being surprisingly subtle (as many compiler bugs are), but luckily the fix was relatively simple. The Law of Unintended Consequences applies, as always.
Footnotes:
[1]
::steping over the "save" instruction is a standard SPARC debugging
trick; it makes the arguments to the function and the stack trace correct.
[2]
Position-Independent Code, which is how shared libraries are compiled.
[3]
Ironically, if we had done this *before* the compiler switch was done, BFU
would have immediately failed when run on the broken bits, and the whole
problem would have been noticed much more quickly.
[ Technorati
OpenSolaris
Solaris
]
Posted at 08:45PM Sep 28, 2005 by jwadams in Solaris | Comments[5]