Tom Haynes

Sunday January 28, 2007
nge0 works in WinXP but not under OpenSolaris

I never got the onboard ethernet working on that new box - kanigix. At times nge0 would come up, but then it decided to stay down. I've mainly stayed in WinXP (remember, it is for playing games with my son or by myself) or gotten things over via a USB hard drive. You can imagine that gets old.

Well, I have a bug filed on this and I'm off to Connectathon 2007 on Tuesday. I want to be able to access the box while I'm gone. I added a serial console, so I would at least be able to run some simple commands to help debug the issue. And then a thought smacked me: why can't I add an ethernet card to it? So I picked one at random from my pile of old cards. I thought I was getting a solid 10/100 card, but I ended up with a cheap GigE card. Still, WinXP installed a driver and the Sun Device Detection Tool said there was a driver. And now I've got a working system:

[tdh@kanigix ~]> ifconfig -a
lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000 
rge0: flags=201000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,CoS> mtu 1500 index 2
        inet 192.168.2.115 netmask ffffff00 broadcast 192.168.2.255

Aahhh. I'm really glad, because this machine is quiet and builds scream on it.


Originally posted on Kool Aid Served Daily
Copyright (C) 2007, Kool Aid Served Daily
Debugging that issue with the in-kernel sharetab

As last mentioned, I'm seeing this on boot:

Jan 28 20:40:54 kanigix genunix: [ID 104096 kern.warning] WARNING: system call missing from bind file
Jan 28 20:40:54 kanigix genunix: [ID 684969 kern.warning] WARNING: Cannot mount /system/dfs

How do we debug this? Well, first we find the first message in the source: usr/src/uts/common/os/modconf.c. In mod_getsysent():

        if ((sysnum = mod_getsysnum(mod_name)) == -1) {
                cmn_err(CE_WARN, "system call missing from bind file");
                return (NULL);
        }

Okay, this is actually looking familiar - I think I hit it before. Now I did a diff from my code before and after the merge with the latest source - and I didn't see anything glaring. Bzzt, found it after some searching. If we look at the top of make_syscallname():

                cmn_err(CE_WARN, "!Couldn't add system call \"%s %d\". "
                    "It conflicts with \"%s %d\" in /etc/name_to_sysnum.",
                    name, sysno, *cp, sysno);

So what does /etc/name_to_sysnum contain for entry 140?

adjtime                 138
systeminfo              139
seteuid                 141
forksys                 142
fork1                   143

So this is the reason I'm seeing the error on the machines. Now I have to figure out why BFU is messing up. (Note, I know I ran ACR afterwards on both of the boxes.)
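
A hole like this is easy to spot mechanically. The following is a minimal sketch, assuming the file keeps the two-column "name number" layout shown above; the function names are made up for illustration and are not part of any Solaris tool. A skipped number is not automatically a bug (numbers can be unused on purpose), so treat the output as a list of things to look at.

```shell
#!/bin/sh
# Print the name bound to a system call number in a name_to_sysnum-style
# file (two whitespace-separated columns: name, number). Prints nothing
# when the number has no entry. Lines starting with '#' are skipped.
sysnum_name() {
    # $1 = number to look up, $2 = file to search
    awk -v n="$1" '$1 !~ /^#/ && $2 == n { print $1 }' "$2"
}

# Print every number that is skipped between the lowest and the highest
# entry in the file.
sysnum_gaps() {
    # $1 = file to scan
    awk '$1 !~ /^#/ && NF >= 2 { print $2 }' "$1" | sort -n |
        awk 'NR > 1 { for (i = prev + 1; i < $1; i++) print i } { prev = $1 }'
}

# Demo on the five-line excerpt from the broken machine (sample data only).
sample=$(mktemp)
printf '%s\n' 'adjtime 138' 'systeminfo 139' 'seteuid 141' \
    'forksys 142' 'fork1 143' > "$sample"
sysnum_gaps "$sample"    # prints 140
rm -f "$sample"
```

On the affected box, `sysnum_name 140 /etc/name_to_sysnum` would print nothing, which is the same answer as the listing above.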

I say it is BFU, but let's check. Inside the source workspace:

adjtime                 138
systeminfo              139
sharefs                 140
seteuid                 141
[tdh@warlock os]> pwd
/zoo/ws/onnv-gate/usr/src/uts/intel/os

And let's check the proto area:

[tdh@warlock os]> pwd
/zoo/ws/onnv-gate/proto/root_i386/etc
adjtime                 138
systeminfo              139
seteuid                 141
forksys                 142

Ouch, BFU is innocent! What is the deal? I looked in the nightly logs and couldn't find anything. I looked in the previous nightly log, which was the fresh install:

/usr/bin/rm -f /zoo/ws/onnv-gate/proto/root_i386/etc/name_to_sysnum; install -s -m 644 -f /zoo/ws/onnv-gate/proto/root_i386/etc ../../intel/os/name_to_sysnum

Okay, the error is that by not clobbering everything and starting fresh, /etc/name_to_sysnum did not get remade. I'll claim that this is a bug in the makefile - there should be a dependency which gets checked. My solution? Rebuild from scratch.
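
Until the makefile checks that dependency, a byte-for-byte comparison after an incremental build would catch a stale copy. This is only a sketch: `check_installed` is a hypothetical helper, and it uses `cmp` on contents rather than timestamps, on the assumption that timestamps in a long-lived workspace cannot be trusted.

```shell
#!/bin/sh
# Report whether an installed copy matches its source file byte for byte.
# Returns 0 when the two are identical, 1 when they differ or one of
# them is missing.
check_installed() {
    # $1 = source file, $2 = installed copy
    if cmp -s "$1" "$2"; then
        echo "up to date: $2"
    else
        echo "STALE: $2 differs from $1"
        return 1
    fi
}
```

With the paths from this post, that would be run from /zoo/ws/onnv-gate as `check_installed usr/src/uts/intel/os/name_to_sysnum proto/root_i386/etc/name_to_sysnum`.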

One thing I would like you to notice is that I debugged this issue without actually looking at a core file or stepping through a debugger. That would have taken longer. Instead, I mainly relied on grep, diff, find, OpenGrok, and cscope. With a different bug, or with code which I knew was not working, I might have delved in with kmdb. I'm glad I did not have to.

A second thing to note: I can just add the entry for 140 to /etc/name_to_sysnum to get the system running correctly.

I actually did that on one of the two systems; the other is rebuilding the source from scratch. Here are some neat interactions:
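
A hedged sketch of that manual fix follows. `add_sysnum` is a hypothetical helper, not a Solaris command. It appends the entry at the end of the file, which assumes the order of lines does not matter to the kernel; to match the source copy exactly, put the line between 139 and 141 by hand. Try it on a copy of the file first.

```shell
#!/bin/sh
# Append "name number" to a name_to_sysnum-style file unless that number
# is already bound to some name. Returns 1 and leaves the file untouched
# when the number is already present.
add_sysnum() {
    # $1 = name, $2 = number, $3 = file to edit
    if awk -v n="$2" '$1 !~ /^#/ && $2 == n { found = 1 }
                      END { exit !found }' "$3"; then
        echo "number $2 already present in $3" >&2
        return 1
    fi
    printf '%s\t\t\t%s\n' "$1" "$2" >> "$3"
}
```

For the entry missing here, the call would be `add_sysnum sharefs 140 /etc/name_to_sysnum`, run as root.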


[tdh@mrx ~]> ls -la /etc/dfs/sharetab 
lrwxrwxrwx   1 root     root          25 Jan 28 01:17 /etc/dfs/sharetab -> ../../system/dfs/sharetab
[tdh@mrx ~]> ls -la /system/dfs/sharetab
-r--r--r--   1 root     root           0 Jan 28 22:01 /system/dfs/sharetab
[tdh@mrx ~]> share -F nfs / -o rw 
Could not share: /: no permission
[tdh@mrx ~]> sudo !!
sudo share -F nfs / -o rw
[tdh@mrx ~]> ls -la /system/dfs/sharetab
-r--r--r--   1 root     root          13 Jan 28 22:01 /system/dfs/sharetab
[tdh@mrx ~]> cat /system/dfs/sharetab
/       -o      nfs     rw      

Besides the fact that it is down in /system, there is no way for you to tell that this is not really a file, but an interface into memory. Also, this is just the prototype; I recently decided to not use a symlink and instead fix up GFS to understand a file without a parent directory.


Ported, so to speak, my sharetab prototype to OpenSolaris

Since I'm showcasing my In-Kernel Sharetab project at Connectathon 2007 on OpenSolaris, I thought I would get it into a Mercurial download of the source code from OpenSolaris.org. It sounds easy, but I've got 66 files checked out for editing (existing and new) and the source control for OpenSolaris is different from that for Solaris. Okay, the OpenSolaris one is probably pulled over nightly from the Solaris gate, but they are different interfaces to the code.

The good news is that I have no desire to putback from the OpenSolaris codebase. I just need to identify how to get my changes pushed on top of the OpenSolaris codebase. Also, the last time I synced, my changes were against the codebase of a month ago. We had a major flag day since then - and as far as I can tell, the OpenSolaris code reflects that change.

So the first thing I have to do is get my code in sync with the current nightly, which I suspect is very close for my purposes to the OpenSolaris code. I reparent my workspace, tell it to bringover, and then resolve the conflicts in 22 files. Not bad; the automatic conflict resolution handled most of them and I only had to step in twice.

A tool we use at Sun is 'wx' and it can create backups of active files in a workspace. So once I had a good merge, I told it to back up again (I took a backup before the merge as well in case I had to roll it back). That process created a tar file with just the copies of the active fileset.

I took that over to my OpenSolaris build machine and did a diff between all 66 files. It wasn't that bad, I had managed to sync up to a close enough copy. Once I saw there were no problems, I untarred into the workspace, made sure via diff that the files were what I said to use, and then started off a build. I also kicked off a build in the synced Solaris workspace. I wanted to make sure my merge hadn't broken anything.
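
The diff pass over the active files can be scripted. The sketch below assumes the backup tar has been unpacked into a scratch directory that mirrors the workspace layout; that layout and the name `compare_backup` are assumptions for illustration, not something wx provides.

```shell
#!/bin/sh
# Compare every file under an unpacked backup directory with the file at
# the same relative path in a workspace. Prints "new: PATH" for files the
# workspace lacks and "differs: PATH" for files whose contents differ.
# Files that match produce no output.
compare_backup() {
    # $1 = directory where the backup was unpacked, $2 = workspace root
    (cd "$1" && find . -type f) | while IFS= read -r f; do
        if [ ! -e "$2/$f" ]; then
            echo "new: $f"
        elif ! cmp -s "$1/$f" "$2/$f"; then
            echo "differs: $f"
        fi
    done
}
```

Run before untarring into the workspace, the output is the list of files to review by hand with diff; run afterwards, it should print nothing.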

The OpenSolaris build finished and it was a clean build. The Solaris one is still going. Does it mean OpenSolaris is easier to build, my home machine is faster, etc? No, in the OpenSolaris case, I told it not to clobber the existing stuff (i.e., an incremental build) and in the Solaris case I told it to start from scratch. Why? Well, I'm not checking in the OpenSolaris code, I knew exactly what I was changing. I had no clue what all changed in the merge for the Solaris workspace.

Now I've got to BFU a system with those bits and see if I get what I expect. The big issue for that is that the OpenSolaris build machine I used is behind a VPN firewall. It is 10 feet from the test box, but getting to it directly would mean losing all of my other sessions (including the work build). Luckily, we don't need complete access to it, we just need the BFU archives:

[tdh@warlock onnv-gate]> ls -la archives/i386/nightly/
total 569421
drwxr-xr-x   2 tdh      staff         11 Jan 28 00:24 .
drwxr-xr-x   3 tdh      staff          3 Jan 28 00:23 ..
-rw-r--r--   1 tdh      staff      64348 Jan 28 00:24 conflict_resolution.gz
-rw-r--r--   1 tdh      staff    76112168 Jan 28 00:23 generic.kernel
-rw-r--r--   1 tdh      staff    24045508 Jan 28 00:23 generic.lib
-rw-r--r--   1 tdh      staff    2367696 Jan 28 00:23 generic.root
-rw-r--r--   1 tdh      staff    1280000 Jan 28 00:23 generic.sbin
-rw-r--r--   1 tdh      staff    178135616 Jan 28 00:24 generic.usr
-rw-r--r--   1 tdh      staff    2580480 Jan 28 00:23 i86pc.boot
-rw-r--r--   1 tdh      staff    4853760 Jan 28 00:23 i86pc.root
-rw-r--r--   1 tdh      staff    1187840 Jan 28 00:23 i86pc.usr
[tdh@warlock onnv-gate]> tar cf sht_bfu.tar archives
[tdh@warlock onnv-gate]> ls -la sht_bfu.tar 
-rw-r--r--   1 tdh      staff    290636800 Jan 28 00:50 sht_bfu.tar
[tdh@warlock onnv-gate]> bzip2 sht_bfu.tar
Time spent in user mode   (CPU seconds) : 94.17s
Time spent in kernel mode (CPU seconds) : 0.71s
Total time                              : 1:36.39s
CPU utilisation (percentage)            : 98.4%
[tdh@warlock onnv-gate]> ls -la sht_bfu.tar.bz2
-rw-r--r--   1 tdh      staff    97782474 Jan 28 00:50 sht_bfu.tar.bz2

All I have to do is get this file over, unpack it, and run BFU. Okay, I've done that - no serial console, so I can't show you the steps.

The good news is that it boots and I can get into it. I've failed the brickify test.

The somewhat bad news is this:

[tdh@mrx ~]> dmesg | grep dfs
Jan 28 01:21:33 mrx genunix: [ID 684969 kern.warning] WARNING: Cannot mount /system/dfs
[tdh@mrx dfs]> cd /system/dfs
[tdh@mrx dfs]> ls -la
total 4
dr-xr-xr-x   2 root     root         512 Jan 27 23:57 .
drwxr-xr-x   5 root     root         512 Jan 23 04:37 ..

I actually have to remove the /system/dfs from my prototype to get the final product. But I still need to know how I horked this all up. It isn't the symlink which is messed up, it is loading the sharefs module which is broken. Was it the merge? Or was it the blind copy into the OpenSolaris workspace? Or did a bug creep in? The other evil thought is that I've tested this on 64 bit sparc and 64 bit amd machines, but not 32 bit x86es. It could be a bug in the code.

Ouch, I put it on my new desktop (64bit AMD) and it panicked the box. I think it was not related - the backtrace was in usb (page fault in usb_ac:usb_ac_setup_connections+450 to be exact). As near as I can figure, it crashed while loading my Logitech QuickCam - see usb_ac.c. I just added that the other day, easy enough to pull for a stable system.

Note that I think it must be my QuickCam because it has a microphone and my SoundBlaster is PCI. And it is in some DEBUG code - which explains why I hadn't seen it before. I filed a new bug for it - Bug ID: 6518469 DEBUG build page faults when booting with an attached Logitech QuickCam.

Anyway, the system does put up a warning about mounting /system/dfs:

DEBUG enabled
WARNING: system call missing from bind file
WARNING: Cannot mount /system/dfs

I should have first booted this machine as a stock OpenSolaris install. Then I could have added my code. Anyway, it came up on the next boot. But I have no idea if it will crash at any time.

I'll look to see if there is a match to the backtrace of the core. As for my problem, I need to get the Solaris build finished and see how it works. I'll just work my way back through the steps (checking my backups for changes). And if that doesn't work, I'll show you the painful way of debugging a live kernel.

By the way, failure is a good experience. I've caught a potential problem with my demo three days earlier than I would have if I had waited until the conference. I've also caught a bug before the QA people did - which they would have done once I let them play with the code. In all, this has been a good use of 5 hours.

