I never got the onboard ethernet working on that new box - kanigix. At times nge0 would come up, but then it decided to stay down. I've mainly stayed in WinXP (remember, it is for playing games with my son or by myself) or gotten things over via a USB hard drive. You can imagine that gets old.
Well, I have a bug filed on this and I'm off to Connectathon 2007 on Tuesday. I want to be able to access the box while I'm gone. I added a serial console, so I would at least be able to do some simple commands to help debug the issue. And then, a thought smacked me, why can't I add an ethernet card to it? So I picked one at random from my pile of old cards - I ended up getting a cheap GigE card. I thought I was getting a solid 10/100 card. But WinXP installed a driver and the Sun Device Detection Tool said there was a driver. And now I've got a working system:
[tdh@kanigix ~]> ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
inet 127.0.0.1 netmask ff000000
rge0: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 2
inet 192.168.2.115 netmask ffffff00 broadcast 192.168.2.255
Aahhh. I'm really glad, because this machine is quiet and builds scream on it.
As last mentioned, I'm seeing this on boot:
Jan 28 20:40:54 kanigix genunix: [ID 104096 kern.warning] WARNING: system call missing from bind file
Jan 28 20:40:54 kanigix genunix: [ID 684969 kern.warning] WARNING: Cannot mount /system/dfs
How do we debug this? Well, first we find the first message in the source: usr/src/uts/common/os/modconf.c. In mod_getsysent():
if ((sysnum = mod_getsysnum(mod_name)) == -1) {
        cmn_err(CE_WARN, "system call missing from bind file");
        return (NULL);
}
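Finding which source file prints a given warning is a plain text search. Here is a minimal sketch of it, assuming a POSIX shell; the function name find_message is made up for illustration and is not part of any Solaris tool.

```shell
# find_message DIR TEXT
# List the C files under DIR that contain the fixed string TEXT.
# (Hypothetical helper; plain POSIX find and grep do the work.)
find_message() {
    find "$1" -type f -name '*.c' -exec grep -l -F -- "$2" {} +
}
```

Something like `find_message usr/src/uts/common/os 'system call missing from bind file'` should then name the file that holds the message.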
Okay, this is actually looking familiar - I think I hit it before. Now I did a diff from my code before and after the merge with the latest source - and I didn't see anything glaring. Bzzt, found it after some searching. If we look at the top of make_syscallname():
cmn_err(CE_WARN, "!Couldn't add system call \"%s %d\". "
"It conflicts with \"%s %d\" in /etc/name_to_sysnum.",
name, sysno, *cp, sysno);
So what does /etc/name_to_sysnum contain for entry 140?
adjtime 138
systeminfo 139
seteuid 141
forksys 142
fork1 143
So this is the reason I'm seeing the error on the machines. Now I have to figure out why BFU is messing up. (Note, I know I ran ACR afterwards on both of the boxes.)
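Checking a name_to_sysnum file for a given number is easy to script. This is a rough sketch, assuming each entry is a "name number" pair on its own line and that lines starting with # are comments (the comment handling is my assumption); sysnum_name is a hypothetical helper name.

```shell
# sysnum_name FILE NUM
# Print the name bound to system call number NUM in FILE,
# or print nothing if no entry uses that number.
sysnum_name() {
    awk -v n="$2" '$1 !~ /^#/ && $2 == n { print $1 }' "$1"
}
```

On the broken machine, `sysnum_name /etc/name_to_sysnum 140` would print nothing, which is the whole problem in one line.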
I say it is BFU, but let's check. Inside the source workspace:
adjtime 138
systeminfo 139
sharefs 140
seteuid 141
[tdh@warlock os]> pwd
/zoo/ws/onnv-gate/usr/src/uts/intel/os
And let's check the proto area:
[tdh@warlock os]> pwd
/zoo/ws/onnv-gate/proto/root_i386/etc
adjtime 138
systeminfo 139
seteuid 141
forksys 142
Ouch, BFU is innocent! What is the deal? I looked in the nightly logs and couldn't find anything. I looked in the previous nightly log, which was the fresh install:
/usr/bin/rm -f /zoo/ws/onnv-gate/proto/root_i386/etc/name_to_sysnum; install -s -m 644 -f /zoo/ws/onnv-gate/proto/root_i386/etc ../../intel/os/name_to_sysnum
Okay, the error is that by not clobbering everything and starting fresh, /etc/name_to_sysnum did not get remade. I'll claim that this is a bug in the makefile - there should be a dependency which gets checked. My solution? Rebuild from scratch.
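Until the makefile grows that dependency, a stale proto copy can be caught by comparing it against the source file before running BFU. A minimal sketch, assuming a POSIX shell; proto_is_stale is a made-up name, not something in the build tools.

```shell
# proto_is_stale SRC PROTO
# Succeed (exit 0) when PROTO is missing or differs from SRC,
# i.e. when the installed copy needs to be remade.
proto_is_stale() {
    [ ! -f "$2" ] || ! cmp -s "$1" "$2"
}
```

In my workspace that would be `proto_is_stale usr/src/uts/intel/os/name_to_sysnum proto/root_i386/etc/name_to_sysnum && echo stale`, run from the top of the workspace.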
One thing I would like you to notice is that I debugged this issue without actually looking at a core file or stepping through a debugger. That would have taken longer. Instead, I mainly relied on grep, diff, find, OpenGrok, and cscope. With a different bug, or with code which I knew was not working, I might have delved in with kmdb. I'm glad I did not have to.
A second thing to note, I can just add in the entry for 140 to /etc/name_to_sysnum to get the system running correctly.
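A sketch of that quick fix, assuming the same "name number" line format. The helper add_sysnum is hypothetical, and it simply appends the entry at the end of the file; I am assuming that placement is acceptable, so try it on a copy of the file first.

```shell
# add_sysnum FILE NAME NUM
# Append "NAME NUM" to FILE unless some entry already uses NUM.
# Returns non-zero (and changes nothing) if NUM is already taken.
add_sysnum() {
    if awk -v n="$3" '$2 == n { found = 1 } END { exit !found }' "$1"; then
        echo "entry $3 already present in $1" >&2
        return 1
    fi
    printf '%s\t%s\n' "$2" "$3" >> "$1"
}
```

For example, `add_sysnum ./name_to_sysnum.copy sharefs 140`, followed by a diff against the original before putting it in place as root.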
I actually did that on one of the two systems, the other is rebuilding the source from scratch. Here are some neat interactions:
[tdh@mrx ~]> ls -la /etc/dfs/sharetab
lrwxrwxrwx 1 root root 25 Jan 28 01:17 /etc/dfs/sharetab -> ../../system/dfs/sharetab
[tdh@mrx ~]> ls -la /system/dfs/sharetab
-r--r--r-- 1 root root 0 Jan 28 22:01 /system/dfs/sharetab
[tdh@mrx ~]> share -F nfs / -o rw
Could not share: /: no permission
[tdh@mrx ~]> sudo !!
sudo share -F nfs / -o rw
[tdh@mrx ~]> ls -la /system/dfs/sharetab
-r--r--r-- 1 root root 13 Jan 28 22:01 /system/dfs/sharetab
[tdh@mrx ~]> cat /system/dfs/sharetab
/ -o nfs rw
Besides the fact that it is down in /system, there is no way for you to tell that this is not really a file, but an interface into memory. Also, this is just the prototype, I recently decided to not use a symlink and instead fix up GFS to understand a file without a parent directory.
Since I'm showcasing my In-Kernel Sharetab project at Connectathon 2007 on OpenSolaris, I thought I would get it into a mercurial download of the source code from OpenSolaris.org. It sounds easy, but I've got 66 files checked out for editing (existing and new) and the source control for OpenSolaris is different than that for Solaris. Okay, the OpenSolaris one is probably pulled over nightly from the Solaris gate, but they are different interfaces to the code.
The good news is that I have no desire to putback from the OpenSolaris codebase. I just need to identify how to get my changes pushed on top of the OpenSolaris codebase. Also, the last time I synced, my changes were up to the codebase of a month ago. We had a major Flag day since then - and as far as I can tell, the OpenSolaris code reflects that change.
So the first thing I have to do is get my code in sync with the current nightly, which I suspect is very close for my purposes to the OpenSolaris code. I reparent my workspace, tell it to bringover, and then resolve the conflicts in 22 files. Not bad: I only had to step in twice during the automatic conflict resolution.
A tool we use at Sun is 'wx' and it can create backups of active files in a workspace. So once I had a good merge, I told it to backup again (I took a backup before the merge as well in case I had to roll it back). That process created a tar file with just the copies of the active fileset.
I took that over to my OpenSolaris build machine and did a diff between all 66 files. It wasn't that bad, I had managed to sync up to a close enough copy. Once I saw there were no problems, I untarred into the workspace, made sure via diff that the files were what I said to use, and then started off a build. I also kicked off a build in the synced Solaris workspace. I wanted to make sure my merge hadn't broken anything.
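The diff pass over the active files can be sketched as a loop over a file list. This is only an illustration, assuming a text file with one workspace-relative path per line; changed_files is a hypothetical helper, not a wx command.

```shell
# changed_files LIST DIR_A DIR_B
# For each relative path named in LIST, print the path if the copy
# under DIR_A differs from the copy under DIR_B (or either is missing).
changed_files() {
    while IFS= read -r f; do
        cmp -s "$2/$f" "$3/$f" 2>/dev/null || printf '%s\n' "$f"
    done < "$1"
}
```

An empty result after untarring means every active file in the workspace matches the backup; anything listed is worth a real diff.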
The OpenSolaris build finished and it was a clean build. The Solaris one is still going. Does it mean OpenSolaris is easier to build, my home machine is faster, etc? No, in the OpenSolaris case, I told it not to clobber the existing stuff (i.e., an incremental build) and in the Solaris case I told it to start from scratch. Why? Well, I'm not checking in the OpenSolaris code, I knew exactly what I was changing. I had no clue what all changed in the merge for the Solaris workspace.
Now I've got to BFU a system with those bits and see if I get what I expect. The big issue is that the OpenSolaris build machine I used is behind a VPN firewall. It is 10 feet from the test box, but I can't connect the two unless I want to lose all of my other sessions (including the work build). Luckily, we don't need complete access to it; we just need the BFU archives:
[tdh@warlock onnv-gate]> ls -la archives/i386/nightly/
total 569421
drwxr-xr-x 2 tdh staff 11 Jan 28 00:24 .
drwxr-xr-x 3 tdh staff 3 Jan 28 00:23 ..
-rw-r--r-- 1 tdh staff 64348 Jan 28 00:24 conflict_resolution.gz
-rw-r--r-- 1 tdh staff 76112168 Jan 28 00:23 generic.kernel
-rw-r--r-- 1 tdh staff 24045508 Jan 28 00:23 generic.lib
-rw-r--r-- 1 tdh staff 2367696 Jan 28 00:23 generic.root
-rw-r--r-- 1 tdh staff 1280000 Jan 28 00:23 generic.sbin
-rw-r--r-- 1 tdh staff 178135616 Jan 28 00:24 generic.usr
-rw-r--r-- 1 tdh staff 2580480 Jan 28 00:23 i86pc.boot
-rw-r--r-- 1 tdh staff 4853760 Jan 28 00:23 i86pc.root
-rw-r--r-- 1 tdh staff 1187840 Jan 28 00:23 i86pc.usr
[tdh@warlock onnv-gate]> tar cf sht_bfu.tar archives
[tdh@warlock onnv-gate]> ls -la sht_bfu.tar
-rw-r--r-- 1 tdh staff 290636800 Jan 28 00:50 sht_bfu.tar
[tdh@warlock onnv-gate]> bzip2 sht_bfu.tar
Time spent in user mode (CPU seconds) : 94.17s
Time spent in kernel mode (CPU seconds) : 0.71s
Total time : 1:36.39s
CPU utilisation (percentage) : 98.4%
[tdh@warlock onnv-gate]> ls -la sht_bfu.tar.bz2
-rw-r--r-- 1 tdh staff 97782474 Jan 28 00:50 sht_bfu.tar.bz2
All I have to do is get this file over, unpack it, and run BFU. Okay, I've done that - no serial console, so I can't show you the steps.
The good news is that it boots and I can get into it. I've failed the brickify test.
The somewhat bad news is this:
[tdh@mrx ~]> dmesg | grep dfs
Jan 28 01:21:33 mrx genunix: [ID 684969 kern.warning] WARNING: Cannot mount /system/dfs
[tdh@mrx dfs]> cd /system/dfs
[tdh@mrx dfs]> ls -la
total 4
dr-xr-xr-x 2 root root 512 Jan 27 23:57 .
drwxr-xr-x 5 root root 512 Jan 23 04:37 ..
I actually have to remove the /system/dfs from my prototype to get the final product. But I still need to know how I horked this all up. It isn't the symlink which is messed up, it is loading the sharefs module which is broken. Was it the merge? Or was it the blind copy into the OpenSolaris workspace? Or did a bug creep in? The other evil thought is that I've tested this on 64 bit sparc and 64 bit amd machines, but not 32 bit x86es. It could be a bug in the code.
Ouch, I put it on my new desktop (64bit AMD) and it panicked the box. I think it was not related - the backtrace was in usb (page fault in usb_ac:usb_ac_setup_connections+450 to be exact). As near as I can figure, it crashed while loading my Logitech QuickCam - see usb_ac.c. I just added that the other day, easy enough to pull for a stable system.
Note that I think it must be my QuickCam because it has a microphone and my SoundBlaster is PCI. And it is in some DEBUG code - which explains why I hadn't seen it before. I filed a new bug for it - Bug ID: 6518469 DEBUG build page faults when booting with an attached Logitech QuickCam.
Anyway, the system does put up a warning about mounting /system/dfs:
DEBUG enabled
WARNING: system call missing from bind file
WARNING: Cannot mount /system/dfs
I should have first booted this machine as a stock OpenSolaris install. Then I could have added my code. Anyway, it came up on the next boot. But I have no idea if it will crash at any time.
I'll look to see if there is a match to the backtrace of the core. As for my problem, I need to get the Solaris build finished and see how it works. I'll just work my way back through the steps (checking my backups for changes). And if that doesn't work, I'll show you the painful way of debugging a live kernel.
By the way, failure is a good experience. I've caught a potential problem with my demo 3 days before I would have if I had waited until the conference. I've also caught a bug before the QA people did - which they would have once I let them play with the code. In all, this has been a good use of 5 hours.