Virtual scrolling works on Solaris 10...
Now I'm one step closer to my computing nirvana, thanks to Casper. His new Xorg mouse driver for touchpads supports vertical and horizontal virtual scrolling as well as a 4-way middle button. Now if only Solaris 10 supported suspend-resume on x86 (I heard something's coming in this area...), it would be nearly perfect...
( Dec 23 2004, 12:11:45 PM PST ) Permalink
volatile, thread safety and memory ordering
I came across this excellent write-up by Scott Meyers and Andrei Alexandrescu. A lot of people expect more from volatile than what the standard defines, and this is a good warning for them. It also touches on the multiprocessor memory ordering issue, another area that a lot of casual programmers are not even aware of. Enjoy the reading!
( Dec 22 2004, 11:58:14 AM PST ) Permalink Comments [2]
My other $.01 on Eric and Greg's debate.
From another of Greg's replies:
I really need to write a article/essay about why Linux does
not have driver api stability. I touched on it in my
previous post, but in reading your response, and the
responses by others, you all seem to miss the main points.
It's not that we don't know how to create a binary api with
padding structures out, and offering up new functions, it's
the fact that because we have the source to all of our
drivers, we do not have to.
So now he has changed his mind and admits that driver API stability is possible (see my previous blog entry).
Fine. At least this time it looks a bit more reasonable on the face of it. Let me counter this argument by quoting one open source driver developer. The following is from a posting on comp.unix.solaris by Joerg Schilling (in case you don't know, he's the author of libscg, the generic SCSI driver used by a lot of CD recording software).
From: js@cs.tu-berlin.de (Joerg Schilling) Newsgroups:
comp.unix.solaris Subject: Re: Anyone else seeing this management
trend favoring Linux?
There are other reasons why Linux & Linux are wrong:
Linux folks likes to force me to use /dev/hdc in order to send SCSI
commands to the first CD-ROM.
Well inside cdrecord all SCSI commands are routed though the highly
portable libscg (which started in August 1986 so it is 6 years older
than Linux but it is still binary compatible to the August 1986
version if you use SunOS on a 68020).
Libscg offers services that are just above the SCSI transport level
in the OS while /dev/hdc is a device node of a block device. Why
should I be forced to use a interface that is far too high in
layering and thus uses inapropriate names at /dev/* when Linux
already has a SCSI generic driver system that offers services at the
right layering level and allows (in contrary to /dev/hdc) to map
/dev/sg* entries to SCSI bus,target,lun?
The reason is that the linux kernel is completely non-orthogonal and
offers many superfluous services. It is possible to send SCSI
commands to ATAPI drives by using /dev/sg* fd's but with /dev/sg* and
ATAPI there is no DMA if the transfer length is not a multiple of
512.
---> Linux has no DMA abstraction layer as Solaris. If it had, then
everything would work the same. But the useless /dev/hd* SCSI
transport for ATAPI which has been introduced by Linus Torvalds only
in order to annouy users of CD/DVD writer did not have correct DMA in
the beginning too. Later after I made a bug report, the bug in
/dev/hd* has been fixed but as a fact of pure evilness, the _same_
bug has not been removed from the /dev/sg* interface.
>Then it struck me that the Linux disk driving naming convention (i.e.
>hda, hdb, hdc, etc.) offers a clue on why Solaris had no problem with
>three controllers and why Linux choked. With 3 SCSI controllers, you
>could put a whole scheissload of disks on the system, but how in the
>heck are you going to label them with the /dev/hdx convention? This is
>especially true if the 3 SCSI strings are only partly filled initially
>and more disks are added later on.
This problem has tradition on Linux and is _very_ probable with the /dev/sg*
(Generic SCSI transport) interface. Many people had external CD writers that
have been switched off some time. If you boot with the CD drive switched off,
the /dev/sg* interface that yesterday has been pointing to the CD writer now
may become the second HDD.....
Let me give a final and frustrated comclusion:
From various mail exchanged I had with many "prominent" Linux kernel people
(including Linus Torvalds and Alan Cox) it seems that the main reason why
Linux constantly breaks interfaces is not because the Linux kernel folks
are trying to do the best without a compromise but rather because they
did not yet understand what an interface is :-(
--
EMail:joerg@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
js@cs.tu-berlin.de (uni) If you don't have iso-8859-1
schilling@fokus.fraunhofer.de (work) chars I am J"org Schilling
URL: http://www.fokus.fraunhofer.de/usr/schilling ftp://ftp.berlios.de/pub/schily
Enough said. It's not simply a matter of a binary interface. It's a matter of a "stable" interface and a proper abstraction layer. That doesn't change regardless of whether the driver source code is available.
( Sep 27 2004, 03:50:06 PM PDT ) Permalink Comments [1]
Solaris vs Linux debate between Eric Schrock and Greg K-H
Wow. There's an interesting debate going on between our Eric Schrock and Greg K-H about the different philosophies of Solaris and Linux.
Eric Schrock's first post:
http://blogs.sun.com/roller/page/eschrock/20040924#gpl_clarifications
And Greg K-H's rebuttal:
http://www.kroah.com/log/2004/09/23/#2004_09_23_sun_rebuttal
Then Eric's rebuttal of the rebuttal:
http://blogs.sun.com/roller/page/eschrock/20040924#rebutting_a_rebuttal
Eric's last post is very good, so I don't have much to add except for one thing. From Greg's rebuttal:
Here's why the Linux kernel does not have binary driver
compatibility, and why it never will:
...
* compiler versions and kernel options. If you
select something as simple as CONFIG_SMP, that means that core
kernel structures will be different sizes, and locks will either
be enabled, or compiled away into nothing. So, if you wanted to
ship a binary driver, you would have to build your driver for
that option enabled, and disabled. Now combine that with the
zillion different kernel options that are around that change the
way structures are sized and built, and you have a huge number of
binary drivers that you need to ship. Combine that with the
different versions of gcc which align things differently (and
turn on some kernel options themselves, based on different
features available in the compiler) and there's no way you can
successfully ship a binary kernel driver that will work for all
users. It's just an impossible dream of people who do not
understand the technology.
There you have it. Binary driver compatibility is "an impossible dream of people who do not understand the technology". Yikes. So all those commercial Unix vendors (or even Microsoft) who provide binary-compatible drivers are idiots who do not understand the technology. Also, as a compiler writer, I find it mystifying, to say the least, that he blames compiler versions for structure size or alignment changes. If you have a proper application binary interface (ABI) that defines the structure layout, the layout won't change no matter what compiler you use. If it does, then the compiler has just broken the ABI. If I didn't know better, I would have thought gcc doesn't even provide a proper and stable ABI that it adheres to. It does. So don't blame it on gcc: Linux's binary driver (non-)compatibility issue has nothing to do with gcc.
( Sep 24 2004, 02:22:22 PM PDT ) Permalink Comments [1]
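A small aside on the structure-layout point in the post above. The following is only a sketch, not code from any real kernel: `CONFIG_SMP_DEMO`, `struct fragile_dev`, `struct stable_dev`, and the helper functions are all made-up names. It contrasts a structure whose field offsets depend on a build option with one whose layout is pinned down by fixed-width types, an always-present lock word, and reserved padding, with the expected offsets checked at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fragile layout: the offset of `flags` depends on a build
 * option, so a driver built one way cannot share this structure with a
 * kernel built the other way. */
struct fragile_dev {
    uint64_t handle;
#ifdef CONFIG_SMP_DEMO          /* made-up option, for illustration only */
    uint64_t lock_word;
#endif
    uint64_t flags;
};

/* Hypothetical stable layout: every field is always present, widths are
 * fixed, and spare room is reserved for future growth. */
struct stable_dev {
    uint32_t struct_size;       /* lets either side detect the layout version */
    uint32_t flags;
    uint64_t handle;
    uint64_t lock_word;         /* present even when a build never uses it */
    uint64_t reserved[4];       /* room to grow without moving fields */
};

/* The layout is part of the interface, so state it and let the compiler
 * check it. */
_Static_assert(offsetof(struct stable_dev, handle) == 8, "handle moved");
_Static_assert(offsetof(struct stable_dev, lock_word) == 16, "lock_word moved");
_Static_assert(sizeof(struct stable_dev) == 56, "size changed");

/* Zero the structure and record the size this build knows about. */
void stable_dev_init(struct stable_dev *d)
{
    assert(d != NULL);
    memset(d, 0, sizeof *d);
    d->struct_size = (uint32_t)sizeof *d;
}

/* A consumer accepts any structure at least as large as the part it reads. */
int stable_dev_compatible(const struct stable_dev *d)
{
    assert(d != NULL);
    return d->struct_size >= offsetof(struct stable_dev, reserved);
}
```

If a compiler ever laid out `struct stable_dev` differently, the build would fail at the static assertions instead of producing a silently incompatible binary.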
Dbx tip of the day - tracing malloc/free or new/delete
On comp.unix.solaris, someone asked:
    I want to set a breakpoint such that it will stop when some specific heap address, say, 0x12345678, is allocated or freed. Is there any way to do this?
After I posted an answer, someone else asked me how to do the same for new/delete in C++. It isn't too difficult to do, but it involves optimized code (malloc and free in libc are optimized), so it may not be obvious to someone who's not familiar with SPARC assembly or the calling convention.
Enough intro. Here are my answers:
One possible solution is to set a breakpoint just before malloc returns, and another right when free is called. Following is an example:
    $ cat t.c
    #include <stdio.h>
    #include <stdlib.h>
    int main(void)
    {
      void * p = malloc(100);

      printf("%x\n", p);

      free(p);
      return 0;
    }
    $ cc -g t.c
    $ dbx ./a.out
    Reading a.out
    Reading ld.so.1
    Reading libc.so.1
    Reading libdl.so.1
    Reading libc_psr.so.1
    (dbx 1) stop in main
    (2) stop in main
    (dbx 2) run
    Running: a.out
    (process id 14122)
    stopped in main at line 5 in file "t.c"
        5     void * p = malloc(100);
    (dbx 3) dis malloc
    0x7fac1cb8: malloc       :      save    %sp, -96, %sp
    0x7fac1cbc: malloc+0x0004:      call    malloc+0xc      ! 0x7fac1cc4
    0x7fac1cc0: malloc+0x0008:      sethi   %hi(0x7a000), %o1
    0x7fac1cc4: malloc+0x000c:      inc     844, %o1
    0x7fac1cc8: malloc+0x0010:      add     %o1, %o7, %o3
    0x7fac1ccc: malloc+0x0014:      ld      [%o3 + 3788], %l0
    0x7fac1cd0: malloc+0x0018:      call    _PROCEDURE_LINKAGE_TABLE_+0x3c
    0x7fac1cd4: malloc+0x001c:      mov     %l0, %o0
    0x7fac1cd8: malloc+0x0020:      call    _malloc_unlocked        ! 0x7fac1cf4
    0x7fac1cdc: malloc+0x0024:      mov     %i0, %o0
    (dbx 4) dis
    0x7fac1ce0: malloc+0x0028:      mov     %o0, %i0
    0x7fac1ce4: malloc+0x002c:      call    _PROCEDURE_LINKAGE_TABLE_+0x48
    0x7fac1ce8: malloc+0x0030:      mov     %l0, %o0
    0x7fac1cec: malloc+0x0034:      ret
    0x7fac1cf0: malloc+0x0038:      restore
    0x7fac1cf4: _malloc_unlocked       :        save    %sp, -96, %sp
    0x7fac1cf8: _malloc_unlocked+0x0004:        sethi   %hi(0xffffdc00), %o0
    0x7fac1cfc: _malloc_unlocked+0x0008:        call    _malloc_unlocked+0x10
    0x7fac1d00: _malloc_unlocked+0x000c:        sethi   %hi(0x7a000), %o1
    0x7fac1d04: _malloc_unlocked+0x0010:        inc     999, %o0
    (dbx 5) stopi at 0x7fac1cec -if $i0 == 0x20e38
    (3) stopi at &malloc+0x34 -if $i0 == 0x20e38
    (dbx 6) stop in free -if $o0 == 0x20e38
    dbx: warning: 'free' has no debugger info -- will trigger on first instruction
    (4) stop in free -if $o0 == 0x20e38
    (dbx 7) cont
    stopped in malloc at 0x7fac1cec
    0x7fac1cec: malloc+0x0034:      ret
    (dbx 8) up
    Current function is main
        5     void * p = malloc(100);
    (dbx 9) cont
    20e38
    stopped in free at 0x7fac2b4c
    0x7fac2b4c: free       :        save    %sp, -96, %sp
    Current function is main
        9     free(p);
    (dbx 10) cont
    execution completed, exit code is 0
    (dbx 11)
The first stopi point is at the "ret" instruction of malloc, and it stops only when $i0 (the return value, %i0 in the disassembly) is the value you want (in this case, 0x20e38). The second stop point is at the first instruction of free (stop in free), and it stops only if $o0 (the first parameter) is the value you want. By the (dbx 8) prompt, dbx had caught malloc() returning 0x20e38, and by the (dbx 10) prompt it had caught free() being passed 0x20e38.
Depending on what you want to do, apptrace might also be useful. With the same a.out:
    $ apptrace ./a.out
    apptrace: unexpected version: 3
    a.out    -> libc.so.1:atexit(func = 0x7fbaeaa8) = 0x0
    a.out    -> libc.so.1:atexit(func = 0x10c98) = 0x0
    a.out    -> libc.so.1:malloc(size = 0x64) = 0x20e38
    a.out    -> libc.so.1:printf(20e38
    format = 0x10cb0, ...) = 6
    a.out    -> libc.so.1:free(ptr = 0x20e38)
    a.out    -> libc.so.1:exit(status = 0)
    $
And of course, DTrace in Solaris 10 can do what apptrace can do (and more). Another useful tool is dbx's runtime checking (RTC), if that's what you're trying to do. Just do:
    (dbx 1) check -access
And it will perform a bunch of different runtime error checks related to memory allocation. Type "help check" on the dbx command line to see more detail.
    > Can I do similar things to new and delete?
New and delete call malloc and free, so you'll see some additional calls on top of malloc and free, but it's basically the same:
    $ cat t.c
    int main(void) {
      int * p = new int;
      *p = 10;
      delete p;
      return 0;
    }
    $ CC -g t.c
    $ dbx ./a.out
    Reading a.out
    Reading ld.so.1
    Reading libCstd.so.1
    Reading libCrun.so.1
    Reading libm.so.1
    Reading libc.so.1
    Reading libdl.so.1
    Reading libCstd_isa.so.1
    Reading libc_psr.so.1
    (dbx 1) si main
    (2) stop in main
    (dbx 2) r
    Running: a.out
    (process id 16911)
    stopped in main at line 2 in file "t.c"
        2     int * p = new int;
    (dbx 3) si malloc
    dbx: warning: 'malloc' has no debugger info -- will trigger on first instruction
    (3) stop in malloc
    (dbx 4) c
    stopped in malloc at 0x7f8c1cb8
    0x7f8c1cb8: malloc       :      save    %sp, -96, %sp
    Current function is main
        2     int * p = new int;
    (dbx 5) where
      [1] malloc(0x4, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x7f8c1cb8
      [2] operator new(0x4, 0x7f93c008, 0x13988, 0xb00, 0x7f9eae3c, 0x4), at 0x7f9d74d8
    =>[3] main(), line 2 in "t.c"
    (dbx 6) si free
    More than one identifier 'free'.
    Select one of the following:
     0) Cancel
     1) `libc.so.1`free
     2) `libCstd.so.1`#__1cDstdIvalarray4Cl_Efree6M_v_ [non -g, demangles to: std::valarray<long>::free()]
     3) `libCstd.so.1`#__1cDstdIvalarray4Ci_Efree6M_v_ [non -g, demangles to: std::valarray<int>::free()]
     4) `libCstd.so.1`#__1cDstdIvalarray4CL_Efree6M_v_ [non -g, demangles to: std::valarray<unsigned long>::free()]
     5) `libCstd.so.1`#__1cDstdIvalarray4CI_Efree6M_v_ [non -g, demangles to: std::valarray<unsigned>::free()]
     a) All
    > 1
    dbx: warning: 'free' has no debugger info -- will trigger on first instruction
    (4) stop in free
    (dbx 8) cont
    stopped in free at 0x7f8c2b4c
    0x7f8c2b4c: free       :        save    %sp, -96, %sp
    Current function is main
        4     delete p;
    (dbx 9) where
      [1] free(0x28620, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x7f8c2b4c
      [2] operator delete(0x28620, 0x7f93c008, 0x13988, 0xb00, 0x7f9eae3c, 0x4), at 0x7f9d65b8
    =>[3] main(), line 4 in "t.c"
    (dbx 10)
An alternative is to set a breakpoint at the constructor/destructor that you want to track:
    (dbx) stop in YourClass::YourClass -if this == ...pointer value...
    (dbx) stop in YourClass::~YourClass
For example:
    $ cat t.c
    #include <stdio.h>
    class myclass {
    public:
      myclass(void) { printf("I'm here.\n"); };
      ~myclass(void) { printf("I'm gone.\n"); };
    };
    int main(void) {
      myclass *p = new myclass();

      delete p;
      return 0;
    }
    $ CC -g t.c
    $ dbx ./a.out
    Reading a.out
    Reading ld.so.1
    Reading libCstd.so.1
    Reading libCrun.so.1
    Reading libm.so.1
    Reading libc.so.1
    Reading libdl.so.1
    Reading libCstd_isa.so.1
    Reading libc_psr.so.1
    (dbx 1) si myclass::myclass
    (2) stop in myclass::myclass()
    (dbx 2) si myclass::~myclass
    (3) stop in myclass::~myclass()
    (dbx 3) r
    Running: a.out
    (process id 16929)
    stopped in myclass::myclass at line 4 in file "t.c"
        4     myclass(void) { printf("I'm here.\n"); };
    (dbx 4) w
    =>[1] myclass::myclass(this = 0x28838), line 4 in "t.c"
      [2] main(), line 8 in "t.c"
    (dbx 5) c
    I'm here.
    stopped in myclass::~myclass at line 5 in file "t.c"
        5     ~myclass(void) { printf("I'm gone.\n"); };
    (dbx 6) w
    =>[1] myclass::~myclass(this = 0x28838), line 5 in "t.c"
      [2] main(), line 10 in "t.c"
    (dbx 7) stop in myclass::myclass -if this == 0x28838
    (4) stop in myclass::myclass() -if this == 0x28838
    (dbx 8) stop in myclass::~myclass -if this == 0x28838
    (5) stop in myclass::~myclass() -if this == 0x28838
    (dbx 9) status
     (2) stop in myclass::myclass()
    *(3) stop in myclass::~myclass()
     (4) stop in myclass::myclass() -if this == 0x28838
     (5) stop in myclass::~myclass() -if this == 0x28838
    (dbx 10) handler -disable 2 3
    (dbx 11) run
    Running: a.out
    (process id 16930)
    stopped in myclass::myclass at line 4 in file "t.c"
        4     myclass(void) { printf("I'm here.\n"); };
    (dbx 12) where
    =>[1] myclass::myclass(this = 0x28838), line 4 in "t.c"
      [2] main(), line 8 in "t.c"
    (dbx 13) c
    I'm here.
    stopped in myclass::~myclass at line 5 in file "t.c"
        5     ~myclass(void) { printf("I'm gone.\n"); };
    (dbx 14) where
    =>[1] myclass::~myclass(this = 0x28838), line 5 in "t.c"
      [2] main(), line 10 in "t.c"
    (dbx 15)
UPDATE: Looks like I learn new things whenever I blog something. Our dbx guy told me about the dbx feature "stop returns", which effectively sets a breakpoint at all places that call the function (in reality, it waits until the function is called and then sets a breakpoint at the return address).
Anyway, if you use "stop returns" then $o0 will hold the return value of malloc, since the breakpoint will be at the call site, whereas in the above example I set the breakpoint at the return instruction, so the return value is in $i0.
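When rebuilding the program is an option, the same "stop only for this address" condition can be approximated in source instead of in the debugger. The sketch below is not part of dbx or libc: `traced_malloc`, `traced_free`, `watch_set`, and `watch_hit_count` are hypothetical wrappers made up for illustration, and the program would have to call them in place of malloc and free.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Address of interest, kept as an integer so it can still be compared
 * after the block it once pointed to has been freed. 0 means "none". */
static uintptr_t watch_addr;
static int watch_hits;

/* Choose the address to watch and reset the hit counter. */
void watch_set(const void *addr)
{
    watch_addr = (uintptr_t)addr;
    watch_hits = 0;
}

int watch_hit_count(void)
{
    return watch_hits;
}

/* Called on every allocation and release; an unconditional debugger
 * breakpoint inside the if-body fires only for the watched address. */
static void watch_event(const char *what, const void *p)
{
    assert(what != NULL);
    if (p != NULL && (uintptr_t)p == watch_addr) {
        watch_hits++;
        fprintf(stderr, "%s %p\n", what, (void *)p);
    }
}

/* Same role as the stopi at malloc's "ret": inspect the return value. */
void *traced_malloc(size_t n)
{
    void *p = malloc(n);
    watch_event("malloc returned", p);
    return p;
}

/* Same role as "stop in free -if $o0 == ...": inspect the argument. */
void traced_free(void *p)
{
    watch_event("free called with", p);
    free(p);
}
```

The dbx approach above has the advantage of needing no source changes; a wrapper like this is mainly useful when the check has to survive outside a debugging session.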
( Sep 13 2004, 11:10:09 AM PDT ) Permalink
SunRay and voting machine.
I carpool with my coworker, and having two software geeks in a car for 30 minutes often leads to interesting discussions. This morning's topic was Diebold and voting machines (if "Diebold" and "voting" don't ring a bell, check this, this or this out).
Anyway, we were talking mostly about how to design a voting system that is safe, secure and fail-proof. My coworker suggested using a SunRay based system, but he wasn't sure that installing a SunRay server in every voting station was a very good idea. I told him about our wonderful SunRay over WAN, and it was clear to both of us that SunRay over WAN would be a perfect platform for voting (that is, if SunRay supports a touch screen).
The benefits are countless. It's secure, rugged, reliable and cost effective. Setting up a voting station would be very easy and would not require any skilled technician - just connect power cables and network ports. Moreover, the SunRay itself can be reused for its original purpose after the voting - a good cost saving. The total vote count would be instant - heck, a real-time vote count would be possible (although it would be debatable whether to allow it or not). A power failure in a voting station wouldn't affect the voting data, although it could prevent people from voting in that particular station. Security wouldn't be a concern, since the SunRay server doesn't need to talk to anyone but the SunRays, and the data is kept on the server. WanRay uses a VPN to secure the connection, and I believe SunRay has its own encryption on top of the VPN, so it's quite secure.
With the various security and management features in Solaris 10, I can easily imagine using thousands of SunRays and a couple of SunRay servers as a secure and cost effective voting system. After all, our Trusted Solaris and SunRay are used in highly sensitive and secure environments, so why not use the same technology for voting?
( Sep 08 2004, 09:46:49 AM PDT ) Permalink Comments [4]
Talk about backward compatibility...
On the newsgroup comp.unix.solaris, I encountered the following posting:
    >MANY applications from the SunOS 4.x days will still work,
    >without recompilation, on Solaris 10.
    WordPerfect 5.1 for SunOS worked on my SunBlade 150 with Solaris 9. Took a while to eat all those floppies though....
and here SunOS refers to SunOS 4.x (since Solaris 2 is SunOS 5.x + etc).
When I sat down and thought about this, it wasn't too surprising, but it was still amazing that our Solaris folks could keep backward compatibility for over 10 years and across a major OS version upgrade. After all, the 4.x to 5.x jump was a really big change.
BTW, if you're not familiar with SunOS 4.x, you may want to check this link to see which SunOS version came out when, which SunOS corresponds to which Solaris, etc.
Kudos to our Solaris folks!
( Aug 12 2004, 09:16:09 AM PDT ) Permalink
Debugging your shared library with LD_PRELOAD
This morning, I replied to a posting on comp.unix.solaris about dbx and LD_PRELOAD (Google Groups doesn't seem to have my replies yet). It reminded me of my own experience with dbx and LD_PRELOAD, which I'd like to share.
I needed to debug and test my shared library. LD_PRELOAD seemed to be the easiest way to do so (if you don't know what LD_PRELOAD is, you may want to take a look at the ld.so.1(1) manpage or the wonderful Linker and Libraries Guide).
Without much thinking, I did the following:
    $ LD_PRELOAD=...my library...
    $ dbx ...my program...
And I immediately got the linker error:
    ld.so.1: ....: fatal: ... wrong ELF class: ELFCLASS32
Dang. The real dbx is 64bit on SPARC but my library is 32bit. Obviously a 32bit library and a 64bit executable don't mix very well.
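For readers wondering where "ELFCLASS32" comes from: every ELF file starts with the magic bytes 0x7f 'E' 'L' 'F', and the byte that follows (index 4, EI_CLASS) is 1 for a 32bit object and 2 for a 64bit one. The sketch below shows the kind of comparison that fails in the case above. It is not how ld.so.1 is implemented, and `elf_bits` and `can_preload` are made-up helper names; the constants are defined locally rather than taken from a system header.

```c
#include <assert.h>
#include <stddef.h>

/* Values from the ELF specification, redefined here so the sketch does
 * not depend on any platform's <elf.h>. */
enum {
    DEMO_EI_CLASS   = 4,   /* index of the class byte in e_ident */
    DEMO_ELFCLASS32 = 1,
    DEMO_ELFCLASS64 = 2
};

/* Return 32 or 64 for a buffer that starts like an ELF file, 0 otherwise. */
int elf_bits(const unsigned char *ident, size_t len)
{
    assert(ident != NULL);
    if (len <= DEMO_EI_CLASS)
        return 0;
    if (ident[0] != 0x7f || ident[1] != 'E' ||
        ident[2] != 'L'  || ident[3] != 'F')
        return 0;
    switch (ident[DEMO_EI_CLASS]) {
    case DEMO_ELFCLASS32: return 32;
    case DEMO_ELFCLASS64: return 64;
    default:              return 0;
    }
}

/* A library can only be loaded into a process of the same ELF class. */
int can_preload(const unsigned char *prog, size_t prog_len,
                const unsigned char *lib, size_t lib_len)
{
    int prog_bits = elf_bits(prog, prog_len);
    int lib_bits = elf_bits(lib, lib_len);
    return prog_bits != 0 && prog_bits == lib_bits;
}
```

In practice, `file ...my library...` reports the same information without writing any code.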
So I tried the following:
    $ dbx ...my program...
    (dbx) LD_PRELOAD=...my library...
    (dbx) export LD_PRELOAD
    (dbx) run
and it worked fine, or so it seemed. Dbx loads all the libraries that the executable depends on in advance (if you have used dbx, you'll remember seeing those "Reading ....so.1" lines when dbx reads your program). But an LD_PRELOAD'ed library is loaded when the program is executed, not in advance. That means dbx loaded both the original shared library and my shared library from LD_PRELOAD.
This didn't affect the execution of my program, but it affected which symbols dbx saw. Since both the original and my own shared libraries were loaded, whenever I wanted to inspect a symbol in the library, dbx saw two different copies. Sometimes it asked me which version to use (like when setting breakpoints or calling functions), but sometimes it just used the original (like when printing global variables). This was confusing, so I cried for help to the dbx guy and he taught me the following solution:
    (dbx) loadobject -list
    ...
    (dbx) loadobject -unload ...the original library...
which unloaded the unnecessary original library, and the problem was solved.
Well, technically the above is the ideal way. But I'm lazy and didn't want to do the "loadobject" thing every time I started dbx. So I ended up doing the following:
    $ LD_PRELOAD_32=...my library... dbx
which is sort of a work-around. It wouldn't have worked if my library had been 64bit and had interfered with dbx in some way. I guess this is one of those cases where you prefer a little work-around to a proper solution.
( Jul 30 2004, 11:24:52 AM PDT ) Permalink Comments [1]
Hidden benefits of the register window...
When the SPARC ISA was first designed, the register window feature made sense given the characteristics of the first implementations, such as a unified cache and relatively shallow call depth. But as John Mashey pointed out in his posting here, it doesn't look like such a great idea nowadays.
But the register window has some unexpected benefits that are not well known.
The register window makes the call stack trace explicit, straightforward and transparent, which makes it fairly simple to do a stack unwind and to do some tricks in the exception handling code. This contrasts with some other architectures, where a stack unwind isn't very simple due to a potentially missing frame pointer or the difficulty of recovering the callee-save registers.
Another hidden benefit is simpler register allocation in the compiler. The register window effectively leaves no caller-save registers, so the compiler has fewer things to worry about. There's also less need for interprocedural or link-time register allocation.
Because of the register window and the non-executable stack, it is quite a bit more difficult to exploit a buffer overflow vulnerability on SPARC, although that doesn't make exploits impossible.
( Jul 28 2004, 02:30:29 AM PDT ) Permalink
Don't try to trick the compiler.
This is yet another not-a-compiler-bug-but-a-user-bug story.
Some Sun internal folks built an open source project hosted on sourceforge.net with our compiler, and the program produced different output when compiled with -xO4 or above. So they kindly filed a bug (btw, we're happy to look at any bug filed against us, even if it turns out to be a user error, so please don't hesitate to file a bug if you think it's the compiler's fault).
A short analysis revealed the following:
In a file "r.h", there was a declaration like:
    typedef struct ... {
        ...
    } some_struct_t;
    extern const some_struct_t some_struct;
But in "r.c", it wasn't declared "const", and in that file, many functions modified this global variable some_struct.
The original programmer seemed to have thought that since this global variable is modified only in r.c and all other files should only read it, this was a good way to enforce that.
At -xO4 or above, our compiler starts doing aggressive inlining. The inlining exposed some redundant loads from one field of this global variable "some_struct" in a file "a.c". Of course, the compiler happily eliminated the second, redundant load: since "some_struct" is declared "const", there's nothing for the compiler to worry about. Well, the only problem was that between the two loads there was a call to a function defined in "r.c" which modified the very field that the eliminated load was accessing. So the variable wasn't really "const" at all. Of course, this program works just fine when compiled without optimization or with low-level optimization, since only inlining and redundant code elimination can reveal the problem.
Another interesting tidbit is that there was an #ifdef that removed this "const"ness when compiling on a certain platform. I bet somebody had already been bitten by exactly the same problem and worked around it by removing the constness for that platform. Maybe s/he thought it was a platform-specific problem. I don't know.
Anyway, I guess the moral of this story is: don't try to trick your compiler.
( Jul 23 2004, 01:18:09 PM PDT ) Permalink Comments [1]
Buffer overflow, register window and register allocation.
I work on Sun's compiler, especially the SPARC code generator part. The inevitable (and sometimes boring, and sometimes the most interesting) part of my job is to evaluate bugs and (of course) fix them if I can. But as any engineer working on complex software knows, more often than not, a bug turns out to be a user error. In the compiler's case, that could mean the user code has a bug.
This is a story of one recent case of not-a-bug.
One of our largest ISVs filed a bug where their application received SIGSEGV when the program was compiled at -xO4 or above with our S1S8 compiler. The program worked just fine with WS6U2 at the same optimization level, so the customer naturally thought this was a compiler bug. I can't fault them for that, since they had experienced quite a few compiler bugs in the past.
Because the bug went away whenever you turned off the global register allocator, it was sent to me (since I was the author of the register allocator). This particular ISV application was one of the most difficult ones to deal with, because this ISV, like most other large ISVs, does not allow their code to be shipped to us, so we have to rely on either their engineer or our support engineer working at their site.
Since there's always the possibility of a user error, running dbx's RTC or Purify-like tools is one way to exclude some of the most common programming errors. Unfortunately, this application was too large and complex for dbx RTC or Purify to handle correctly and produce a useful report.
The symptom was quite simple: the program got SEGV, and at the time of the SEGV, the stack trace showed that one pointer parameter had the upper 32 bits of a 64-bit pointer zeroed. So obviously the caller of the function was the first suspect. Upon manual inspection of the disassembly, it was clear that the code was quite correct, because it looked like the following:
    add     %fp,1xxx,%l0
    ...bunch of code including many calls...
    call    problematic_func
    mov     %l0,%o0
In dbx, %l0 contained the correct value right after the add, but somehow the upper 32 bits of %l0 had been zeroed out by the time control reached the problematic call. Subsequent dbx printouts showed that %l0 changed after a call to a certain function.
Assuming save and restore are correctly placed, the only other way to modify %l0 is to change the register window save area. It just so happens that %l0 is the first entry in the register window save area. Since SPARC is big-endian, the upper 32 bits (the most significant half) are stored at the lower address. This all suggested that the function in question was overwriting the first 32 bits of the register window save area. This can happen, among other ways, if there's a buffer overflow on a local array. Because the compiler allocates stack space for local variables from the higher address to the lower address in order of appearance in the source, the first variable is usually placed at the top, right below the current %fp (or the %sp of the caller). Of course, optimization can move stuff around and get rid of variables, and most scalar variables are allocated in registers, so there's no guarantee that the above rule holds.
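To make the big-endian reasoning concrete, here is a small simulation in portable C. It is only a sketch: the 8-byte array stands in for the %l0 slot of the register window save area, the helper names are made up, and the pointer value in the usage note is invented. The value is stored most significant byte first, as a big-endian CPU would store it, and a stray 4-byte zero store then lands on the lowest address of the slot.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store v into slot most significant byte first (big-endian order). */
void store_be64(unsigned char slot[8], uint64_t v)
{
    assert(slot != NULL);
    for (int i = 0; i < 8; i++)
        slot[i] = (unsigned char)(v >> (56 - 8 * i));
}

/* Read the slot back as a 64-bit value, again in big-endian order. */
uint64_t load_be64(const unsigned char slot[8])
{
    uint64_t v = 0;
    assert(slot != NULL);
    for (int i = 0; i < 8; i++)
        v = (v << 8) | slot[i];
    return v;
}

/* Simulate the overflow: one 32-bit zero written at the start of the
 * slot, then the register reloaded from it (as a restore would do). */
uint64_t clobber_first_word(uint64_t saved_l0)
{
    unsigned char slot[8];
    store_be64(slot, saved_l0);
    memset(slot, 0, 4);         /* the stray write at the lowest address */
    return load_be64(slot);
}
```

With an invented 64-bit pointer such as 0xffffffff7fffe8a0, `clobber_first_word` returns 0x000000007fffe8a0: the lower half survives and the upper half is zero, which matches the symptom seen in the stack trace.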
The preprocessed source code for the function in question looked like the following:
    returntype func(something *ptr,...) {
        wchar_t a[81];
        wchar_t b[81];
        ...initialize b by calling some initfunc...
        for(i = 0;i < wslen(b);i++) {
            ...do some operation on b[i]...
        }
        b[i] = 0;
        ...more code...
    }
The array "a" wasn't used in the function, so the compiler didn't bother to allocate it on the stack. Thus "b" was at the top of the stack frame. If b were to overflow, the register window save area could be overwritten, i.e. b[81] = 0 would overwrite the upper 32 bits of the %l0 save slot.
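As an illustration of the failure mode (not the ISV's actual fix, which was in their initfunc and is not shown here), the sketch below bounds both the length scan and the terminating store by the array's capacity. `bounded_wlen` and `terminate_within` are hypothetical helpers; the point is that when all 81 elements are non-zero, an unbounded length function reports at least 81 and the later `b[i] = 0` writes one element past the array.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Length of s, but never reads beyond cap elements. */
size_t bounded_wlen(const wchar_t *s, size_t cap)
{
    size_t n = 0;
    assert(s != NULL);
    while (n < cap && s[n] != L'\0')
        n++;
    return n;
}

/* Make sure buf is terminated inside its cap elements.
 * Returns 1 if the contents had to be truncated, 0 otherwise. */
int terminate_within(wchar_t *buf, size_t cap)
{
    assert(buf != NULL && cap > 0);
    size_t n = bounded_wlen(buf, cap);
    if (n == cap) {
        buf[cap - 1] = L'\0';   /* last valid element, not buf[cap] */
        return 1;
    }
    buf[n] = L'\0';             /* already zero; mirrors the b[i] = 0 store */
    return 0;
}
```

Called as `terminate_within(b, 81)` right after the initialization, this keeps every store inside `b` even when the initializer fills the whole array.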
After hearing the above analysis, our support engineer looked at the code of the initfunc and found a bug, as expected, and the report was closed as not a bug.
One may wonder why this code worked fine in the past. That's because, with the earlier register assignment, %l0 wasn't live across that particular function call, so clobbering its save slot went unnoticed. The moral of the story is that any slight change in register assignment can reveal a user error.
( Jul 16 2004, 03:44:06 PM PDT ) Permalink Comments [6]

