Today's Page Hits: 1018
I have more hair and it isn't so grey. :->
This page validates as XHTML 1.0, and will look much better in a browser that supports web standards, but it is accessible to any browser or Internet device. It was created using techniques detailed at glish.com/css/.
I'm going to narrate a massive debug session I just had, in which I used my XDR dump routine to flush out bugs. I'm not going to give all of the code, and my approach is very free form. But that is very much like a massive debug session. Enjoy or not!
Some my upcall from the kernel to the sped daemon is working fine, but the return path is not. My 64 bit gpool-id is off:
# Apr 3 16:13:29 pnfs-3-15 ./sped[100781]: spedit: |35865356092523570| gives |35865356092523570| and |35865356092523570| 10, 16, 64000, 35865356092523570 uid == 200096 20, 32, 2000, 35865356092523571 uid == 1066 30, 64, 1000, 35865356092523572 uid == 0 40, 2, 2000, 35865356092523573 subnet == 10.1.233.0/24 Apr 3 16:13:29 pnfs-3-15 ./sped[100781]: spedit: |35865356092523571| gives |35865356092523571| and |35865356092523571| Apr 3 16:13:29 pnfs-3-15 ./sped[100781]: spedit: |35865356092523572| gives |35865356092523572| and |35865356092523572| Apr 3 16:13:29 pnfs-3-15 ./sped[100781]: spedit: |35865356092523573| gives |35865356092523573| and |35865356092523573| We got a policy 64, -17890057, 18444853557553841283 for /nippy/bogey4 Apr 3 16:13:34 pnfs-3-15 nfs: WARNING: sped: 2 0 2 200096 10 33554432 /nippy/bogey4 0 0 Apr 3 16:13:34 pnfs-3-15 ./sped[100781]: 7: 0 0 0 2 0 0 0 0 Apr 3 16:13:34 pnfs-3-15 ./sped[100781]: 15: 0 0 0 2 0 3 D A0 Apr 3 16:13:34 pnfs-3-15 ./sped[100781]: 23: 0 0 0 A 2 0 0 0 Apr 3 16:13:34 pnfs-3-15 ./sped[100781]: 31: 0 0 0 D 2F 6E 69 70 Apr 3 16:13:34 pnfs-3-15 ./sped[100781]: 39: 70 79 2F 62 6F 67 65 79 Apr 3 16:13:34 pnfs-3-15 ./sped[100781]: 47: 34 0 0 0 0 0 0 0 Apr 3 16:13:34 pnfs-3-15 ./sped[100781]: 52: 0 0 0 0 Apr 3 16:13:34 pnfs-3-15 ./sped[100781]: We got a policy 64, -17890057, 18444853557553841283 for /nippy/bogey4 Apr 3 16:13:34 pnfs-3-15 ./sped[100781]: 4: 0 0 0 5
We can see it change from 35865356092523570 to 18444853557553841283. Hmm, it used to have one more change. Anyway, the value of -17890057 looks off. It should be one of 1000, 2000, or 64000. And perhaps the 64 is off as well.
Anyway, we can see that the input XDR looks okay. I've even added a byte counter for it.
And that lets us see that the last line must be the reply. The counter has started again and 4 bytes as a reply looks too small. Oh crud, what does a 5 mean as a status code?
/* * Return status codes */ #define SPED_OK 0 /* * Caller provided invalid arguments */ #define SPED_INVALID 1 /* * Internal error in daemon e.g. out of memory, can't return result */ #define SPED_INTERNAL 2 /* * Were layouts revoked? */ #define SPED_REVOKED 3 /* * Was there a matching rule, but not enough available gpools? */ #define SPED_MATCH_NOT_ENOUGH 4 /* * Was there no matching rules? */ #define SPED_NO_MATCH 5
Grin, so I thought we were going to find that the buffer was too small and something in the encoding was way off. Looks like the buffer is just right and now I need to debug why my input is not matching a policy. But, my debug code is correct here! What is being printed is the garbage in the local variable.
I added some code to zap the local variable and to also dump the input parameters:
Apr 3 16:37:20 pnfs-3-15 ./sped[100790]: 52: 0 0 0 0 Apr 3 16:37:20 pnfs-3-15 ./sped[100790]: Matching against: 200096 10 0 0 0 0 (N/A) (N/A) (N/A) (N/A) (N/A) (N/A) /nippy/bogey5 (N/A)
Hmm, no XDR dump of the output. That means we are erroring out earlier, probably due to zapping the local variable with a memset!
It is coring now somewhere in here:
if (sar.gpoolids)
u = sar.gpoolids[0];
else
u = (uint64_t)0xe4e5c7c8;
printf("We got a policy %d, %d, %llu for %s\n",
sar.lh_unit_size, sar.lh_stripe_count,
sar.gpoolids[0], saa.path);
syslog(LOG_ERR, "We got a policy %d, %d, %llu for %s",
sar.lh_unit_size, sar.lh_stripe_count,
sar.gpoolids[0], saa.path);
We should be concerned that we generate u but never use it. My guess is that sar.gpoolids is now NULL instead of random junk. If we fix that, we get back to:
Apr 3 21:38:41 pnfs-3-15 ./sped[100841]: Matching against: 200096 10 0 0 33554432 0 (N/A) (N/A) (N/A) (N/A) (N/A) (N/A) /nippy/bogey8 (N/A) Apr 3 21:38:41 pnfs-3-15 ./sped[100841]: We got a policy 0, 0, 3840264136 for /nippy/bogey8
Ignoring the need to fix the receiving code to detect a NULL array, what can we do to get a match?
Another unitialized variable in going from the stand alone policy debugger to the sped daemon. How did I find it?
(gdb) n
spe_eval_thunk (st={st_is_interior = B_FALSE, st_node = 0x8067af0},
pat=0x8047144, sa=0x8047078, prc=0x80470d8, bServer=B_TRUE)
at spedaemon.c:827
827 return (b);
(gdb) p b
$19 = B_TRUE
(gdb) n
spe_eval_attribute (si=0x80674e8, pat=0x8047144, prc=0x80470d8, bServer=B_TRUE)
at spedaemon.c:864
864 if (*prc != 0) {
(gdb) p *prc
$21 = -16785880
Why was *prc such an odd number, no error cases were hit.
int
spe_global_eval(policy_attributes_t *pat, uint64_t **gooies,
uint32_t *n_gooies, uint32_t *unit_size, uint32_t *stripes)
{
spe_policy_t *sp;
boolean_t b = FALSE;
int rc;
If we set rc = 0, it works.
Apr 3 22:54:24 pnfs-3-15 ./sped[100938]: Matching against: 200096 10 0 0 33554432 0 (N/A) (N/A) (N/A) (N/A) (N/A) (N/A) /nippy/bogeya (N/A) Apr 3 22:54:24 pnfs-3-15 ./sped[100938]: Considering rule 0 Apr 3 22:54:24 pnfs-3-15 ./sped[100938]: Considering 200096 vs 200096 Apr 3 22:54:24 pnfs-3-15 ./sped[100938]: 7: 0 0 0 0 8 6 F0 38
An 8 byte reply looks wrong. And we have another braindead problem. (Please realize this is the first time I am testing code I wrote over a month ago. I had to task switch to a P1 bug.):
result = spe_global_eval(&pat, &sar.gpoolids,
&sar.n_gooies, &sar.lh_unit_size, &sar.lh_stripe_count);
if (!result) {
sped_door_return(result, &sar, proc);
}
0 is a valid match, everything else is an error. Either I planned to make it a boolean (in which case spe_global_eval is wrong) or I messed up the check here. I'll assume the check:
Hey, the policy match info is correct, but why no payload?
sped_door_return(result, NULL, NULL); /* NOTREACHED */The two sped_door_return instances are mixed up. Flip them. Okay, back to a result and a 4 byte payoff.
And proc has a bad initialization:<
switch (command) { case (SPED_DS_ZPOOL): proc = xdr_spe_alloc_caller_t;Should be: xdr_spe_alloc_res_t
I think we finally have good results:
Apr 3 23:14:02 pnfs-3-15 ./sped[100957]: Matching against: 200096 10 0 0 33554432 0 (N/A) (N/A) (N/A) (N/A) (N/A) (N/A) /nippy/bogeyd (N/A) Apr 3 23:14:02 pnfs-3-15 ./sped[100957]: We got a policy 64000, 16, 35865356092523570 for /nippy/bogeyd Apr 3 23:14:02 pnfs-3-15 ./sped[100957]: 7: 0 0 0 0 0 0 0 1 Apr 3 23:14:02 pnfs-3-15 ./sped[100957]: 15: 0 7F 6B 59 F1 A3 44 32 Apr 3 23:14:02 pnfs-3-15 ./sped[100957]: 23: 0 0 0 0 0 0 0 0 Apr 3 23:14:02 pnfs-3-15 ./sped[100957]: 31: 0 0 0 0 0 0 FA 0 Apr 3 23:14:02 pnfs-3-15 ./sped[100957]: 39: 0 0 0 10 0 0 0 0 Apr 3 23:14:02 pnfs-3-15 nfssrv: WARNING: sped: 64000 16 0 35865356092523570There are some take home points here:
- Try to unit test/debug your code when you write it.
- You'll remember shortcuts you took.
- You'll remember the code.
- Write debugging tools if necessary.
- If I had to step through a debugger every time, I would have gone crazy.
- The XDR dump routine allowed me to focus on problems.
- Remember, sometimes the debugging tools introduce other bugs (see the stuff with u above.
- You didn't see it, but I created small focused test cases when necessary.
- Small one-off files allow you to eliminate large chunks of code.
- Allowed me to use gdb, in which case I fought the code and not the debugger.
I've been working on a kernel to user space door call which encodes the data in an XDR stream. Normally, I view XDR output as it goes across the wire with snoop. As this stuff is not exposed, I ended up writing a little piece of code to dump the bits on the console. I know I had some problems when I wrote it, but I'm just going to give you the current code:
void
sped_xdr_dump(XDR *x)
{
int i;
int j;
char buf[100];
char str[10];
char cbuf[4];
j = 0;
memset(&cbuf, '\0', 4);
buf[0] = '\0';
for (i = 0; i < x->x_handy; i++) {
cbuf[0] = x->x_private[i];
sprintf(str, " %X", cbuf[0]);
strcat(buf, str);
j++;
if (j == 8) {
syslog(LOG_ERR, "%s", buf);
j = 0;
buf[0] = '\0';
}
}
if (j != 0) {
syslog(LOG_ERR, "%s", buf);
}
}
The crap with cbuf is where I consider this code to be fat. It is a lazy hack because I felt the bytes were running together. It is lazy because I don't need it. And it is where I should be looking for bugs. Anyway, with the following input:
> fec3fce8::print XDR
{
x_op = 1 (XDR_DECODE)
x_ops = 0xfee45bd8
x_public = 0
x_private = 0xfec3fdd0
x_base = 0xfec3fdd0
x_handy = 0x30
}
> 0xfec3fdd0/48B
0xfec3fdd0: 0 0 0 2 0 0 0 0
0 0 0 2 0 3 d a0
0 0 0 a 2 0 0 0
0 0 0 a 2f 6e 69 70
70 79 2f 68 32 62 0 0
0 0 0 0 0 0 0 0
As shown by mdb, which I don't like, I get the following output:
Apr 3 12:04:06 pnfs-3-15 last message repeated 1 time Apr 3 12:04:22 pnfs-3-15 sped[100920]: 0 0 0 2 0 0 0 0 Apr 3 12:04:22 pnfs-3-15 sped[100920]: 0 0 0 2 0 3 D FFFFFFA0 Apr 3 12:04:22 pnfs-3-15 sped[100920]: 0 0 0 A 2 0 0 0 Apr 3 12:04:22 pnfs-3-15 sped[100920]: 0 0 0 A 2F 6E 69 70 Apr 3 12:04:22 pnfs-3-15 sped[100920]: 70 79 2F 68 32 62 0 0 Apr 3 12:04:22 pnfs-3-15 sped[100920]: 0 0 0 0 0 0 0 0
Where is FFFFFFA0 coming from? Okay, I took the code and stuck it in a driver:
> cat y.c
#include <sys/systm.h>
#include <sys/sdt.h>
#include <rpc/types.h>
#include <rpc/svc.h>
#include <rpc/xdr.h>
#include <rpc/auth_sys.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
void
sped_xdr_dump(char *jack, int len)
{
int i;
int j;
char buf[100];
char str[10];
char cbuf[4];
j = 0;
memset(&cbuf, '\0', 4);
buf[0] = '\0';
for (i = 0; i < len; i++) {
cbuf[0] = jack[i];
sprintf(str, " %X", cbuf[0]);
strcat(buf, str);
j++;
if (j == 8) {
printf("%s\n", buf);
j = 0;
buf[0] = '\0';
}
}
if (j != 0) {
printf("%s\n", buf);
}
}
int
main(int argc, char *argv[])
{
char jack[] = {
0x0, 0x0, 0x0, 0x2, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x2, 0x0, 0x3, 0xd, 0xa0,
0x0, 0x0, 0x0, 0xa, 0x2, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0xa, 0x2f, 0x6e, 0x69, 0x70,
0x70, 0x79, 0x2f, 0x68, 0x32, 0x62, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0
};
sped_xdr_dump(jack, sizeof(jack));
return (0);
}
> ./a.out
0 0 0 2 0 0 0 0
0 0 0 2 0 3 D FFFFFFA0
0 0 0 A 2 0 0 0
0 0 0 A 2F 6E 69 70
70 79 2F 68 32 62 0 0
0 0 0 0 0 0 0 0
So, I've captured the problem. Using gdb, I can get to the heart of the issue:
Breakpoint 3, sped_xdr_dump (jack=0x8047134 "", len=48) at x.c:32 32 cbuf[0] = jack[i]; 5: str = " D\000\000\000\000\000\000X\003" 4: j = 7 3: i = 15 2: jack[i] = -96 '?' 1: cbuf = "\r\000\000" (gdb) p jack[i-1] $1 = 13 '\r' (gdb) q
As tempting as it is to say that I'm grabbing a bit from the previous word, it is clear that the fat code is really working. And while I could do shenanigans with bit shifting, the problem isn't in the way I am explicitly pulling out the data. Nor is the " %X" an issue.
The real issue is that I've got the wrong type of data, I'm trying to print out signed data, and 'A0' has a high order bit set.
> diff y.c y1.c 21c21 < char cbuf[4]; --- > unsigned char cbuf[4]; > ./a.out 0 0 0 2 0 0 0 0 0 0 0 2 0 3 D A0 0 0 0 A 2 0 0 0 0 0 0 A 2F 6E 69 70 70 79 2F 68 32 62 0 0 0 0 0 0 0 0 0 0
Okay, we can do better than that. We need to get rid of cbuf:
> diff y.c y2.c 21d20 < char cbuf[4]; 24d22 < memset(&cbuf, '\0', 4); 28,29c26 < cbuf[0] = jack[i]; < sprintf(str, " %X", cbuf[0]); --- > sprintf(str, " %X", (unsigned char)jack[i]); > ./a.out 0 0 0 2 0 0 0 0 0 0 0 2 0 3 D A0 0 0 0 A 2 0 0 0 0 0 0 A 2F 6E 69 70 70 79 2F 68 32 62 0 0 0 0 0 0 0 0 0 0
And we need to clean-up the output:
> diff y2.c y3.c 26c26 < sprintf(str, " %X", (unsigned char)jack[i]); --- > sprintf(str, " %2X", (unsigned char)jack[i]); > ./a.out 0 0 0 2 0 0 0 0 0 0 0 2 0 3 D A0 0 0 0 A 2 0 0 0 0 0 0 A 2F 6E 69 70 70 79 2F 68 32 62 0 0 0 0 0 0 0 0 0 0
I might be able to make it even more either smarter or compact, but it is now throw away code. But even thow away code has to be functionally correct. When it isn't, look for areas where you are doing more work than you need to. If you remember you tried something and weren't convinced it would work, and it only worked partially, go back and question that code.
Okay, my snv 73 box is now a snv 85 box. Everything is working except for my punchin, but only because I need to bypass my Sun Ray 1G (and the Sun Ray Server 4.0 was dead easy to install). But the vnic_setup.sh script is still not working:
# ./vnic_setup.sh 0:1:4a:f2:31:34 Invalid link name: LINK # LD_LIBRARY_PATH=/opt/VirtualBox:/opt/VirtualBox/qtgcc/lib:. ; export LD_LIBRARY_PATH # ./vnic_setup.sh 0:1:4a:f2:31:34 Invalid link name: LINK # ./vnic_setup.sh 0:1:4a:f2:31:34 vnic1 vnic1 # ifconfig -a lo0: flags=2001000849mtu 8232 index 1 inet 127.0.0.1 netmask ff000000 bge0: flags=201000843 mtu 1500 index 2 inet 192.168.2.130 netmask ffffff00 broadcast 192.168.2.255 ether 0:a:e4:34:2f:da lo0: flags=2002000849 mtu 8252 index 1 inet6 ::1/128
Well the VirtualBox manual tells me how to do it manually:
# /usr/lib/vna bge0 0:1:4a:f2:31:34 vnic0 # ifconfig vnic0 plumb # ifconfig -a lo0: flags=2001000849mtu 8232 index 1 inet 127.0.0.1 netmask ff000000 bge0: flags=201000843 mtu 1500 index 2 inet 192.168.2.130 netmask ffffff00 broadcast 192.168.2.255 ether 0:a:e4:34:2f:da vnic0: flags=201000842 mtu 1500 index 3 inet 0.0.0.0 netmask 0 ether 0:1:4a:f2:31:34 lo0: flags=2002000849 mtu 8252 index 1 inet6 ::1/128 # /usr/lib/vna bge0 0:1:4a:f2:31:36 vnic1 # /usr/lib/vna bge0 0:1:4a:f2:31:38 vnic2 # ifconfig vnic1 plumb # ifconfig vnic2 plumb # ifconfig vnic0 192.168.2.150 destination 192.168.2.160 netmask 255.255.255.0 up # ifconfig vnic1 192.168.2.151 destination 192.168.2.161 netmask 255.255.255.0 up # ifconfig vnic2 192.168.2.152 destination 192.168.2.162 netmask 255.255.255.0 up # ifconfig -a lo0: flags=2001000849 mtu 8232 index 1 inet 127.0.0.1 netmask ff000000 bge0: flags=201000843 mtu 1500 index 2 inet 192.168.2.130 netmask ffffff00 broadcast 192.168.2.255 ether 0:a:e4:34:2f:da vnic0: flags=201000851 mtu 1500 index 3 inet 192.168.2.150 --> 192.168.2.160 netmask ffffff00 ether 0:1:4a:f2:31:34 vnic1: flags=201000851 mtu 1500 index 4 inet 192.168.2.151 --> 192.168.2.161 netmask ffffff00 ether 0:1:4a:f2:31:36 vnic2: flags=201000851 mtu 1500 index 5 inet 192.168.2.152 --> 192.168.2.162 netmask ffffff00 ether 0:1:4a:f2:31:38 lo0: flags=2002000849 mtu 8252 index 1 inet6 ::1/128
And when the system comes up, it has an IP of 192.168.2.29. And I can't ping any of the 3.
It looks like I need to learn the CLI for VirtualBox. Here is a related article: Internal network does not work for OpenSolaris guests
As mentioned, I have a minimal install for my OpenSolaris test machines. Evidently I am missing something quite important for getting a host network up and running:
# ./vnic_setup.sh 0:1:4a:f2:31:34 ./vnic_setup.sh: line 42: /usr/lib/vna: No such file or directory
Hmm, I can either load the iso and get the real files or learn how to work with the NAT mode.So either I do a complete reinstall or I figure how to get just a couple of files over. Sounds like I should learn about NAT anyway.
With two machines and NAT, they are both getting the same address. I could use the Internal Network option, but I'm still going to have to reinstall. Hmm, when I do a machine, I select Core Configure and add the following:
BIND DNS Name server and tools BIND Name server Manifest Filebench Freeware Compression Tools Perl-Compatible Regular Expressions Freeware shells Freeware Other Utilities GLIB XCU4 Utilities GNU Autotools Secure Shell GNU which gcmn - Common GNU package gdb ggrep gtar Lint Libraries (root) Lint Libraries (usr) GNU binutils GNU diffutils Perl 5.6.1 (core) Perl 5.6.1 (non-core) GNU textinfo Libevent Get all of System and Network Admin Live Upgrade Software MDB (root) Programming Tools resource pool (root) Resource Pools in core software for resource pools Solaris Zones Vi IMproved autoconf bcc coreutils rsync ROCSEC_GSS Kerberos V5 KDC (root) Kerberos V5 Master KDC (root) Kerberos Version 5 support (kernel) NIS Server for Solaris (root) NIS Server for Solaris (usr) Interprocess Communication
Time to find what I need to add. First we need to look in the ISO image:
[tdh@warlock ~]> sudo lofiadm -a /zoo/isos/x86/snv85/solarisdvd.iso /dev/lofi/1 [tdh@warlock ~]> sudo mount -F hsfs /dev/lofi/1 /mnt [tdh@warlock Product]> cd /mnt/Solaris_11/Product/ [tdh@warlock Product]> grep lib/vna */pkgmap SUNWcsu/pkgmap:1 f none usr/lib/vna 0555 root bin 12592 13767 1204942578
Hmm, wait, I need to find that on my host system and not the guest machine. D'Oh!
[tdh@warlock lib]> uname -a SunOS warlock 5.11 snv_73 i86pc i386 i86pc [tdh@warlock lib]> sudo lofiadm -d /zoo/isos/x86/snv7 snv79/
I have the DVD, but 73 is ancient! But I'll check:
[tdh@warlock Product]> grep lib/vna SUNWcsu/pkgmap [tdh@warlock Product]>
Ugh, the biggest hassle is that I use this machine as a Sun Ray Server. Okay, time for a reinstall!
RTFM - at least the one which comes with the distribution. I think the online one does not have OpenSolaris support. And, I am right it does not.
Anyway, you do not need to retrieve the qt libraries and build, as I just did, it is supplied for you.
cd /opt/VirtualBox LD_LIBRARY_PATH=/opt/VirtualBox:/opt/VirtualBox/qtgcc/lib:. ./VirtualBox
Downloading and installing VirtualBox on my OpenSolaris box was a snap. Running and installing on it was a bit harder. First off, you need qt installed. Easy instructions are at VirtualBox on OpenSolaris. Then you need to tell it that something other than Right-Ctrl is your host key - you need to do this because Sun keyboards do not have that key . I chose my Right-Meta key. Imagine VirtualBox has your mouse and keyboard, your screensaver kicks in, and you can't get the system to understand you want to bring it to life.
Telling the tool how to load my OpenSolaris iso image was a bit counterintuitive. But the biggest problem I ended up having was only giving 8M to my graphics card. The Developer's install seemed okay with that, but the normal graphical install would puke on it. I was able to do a text install, until the point the install decided I was doing a NFS install and did not like the path I was giving. It also would not see the attached iso as a cdrom.
Kicking both tyres together - VirtualBox and Indiana was very helpful for this issue, Alan casually mentions you need 32M of Video Ram. I gave the system that and could then do the graphical install. And it correctly identified the iso as a cdrom. I'll bring that value down once I finish the install. I'm not going to install a GUI on this machine.
I got a pNFS community and client to run under VMWare on a XP box. Okay, so I made sure to make independent clones of the same machine as before. And this time I went from 512M to 1G of available RAM. The other thing I changed is that when building the cthon tests, I changed the config a bit.
I went from 512M to 1G because the archive updates were taking forever on the clones with 512M but went fast on the one with about 2G:
updating /platform/i86pc/boot_archive updating /platform/i86pc/amd64/boot_archive
I'm talking 45 minutes or more. Once I pushed the memory up, these updated much faster.
I used the following configuration for using gcc on a 64bit OpenSolaris:
[tdh@m-client cthon04]> diff tests.init ~/cthon04/tests.init 57c57 < PATH=/opt/SUNWspro/bin:/usr/ccs/bin:/sbin:/bin:/usr/bin:/usr/ucb:/etc:. --- > #PATH=/opt/SUNWspro/bin:/usr/ccs/bin:/sbin:/bin:/usr/bin:/usr/ucb:/etc:. 61c61 < #PATH=/opt/gnu/bin:/usr/ccs/bin:/sbin:/bin:/usr/bin:/usr/ucb:/etc:. --- > PATH=/usr/sfw/bin:/usr/ccs/bin:/sbin:/bin:/usr/bin:/usr/ucb:/etc:. 133c133 < CC=/opt/SUNWspro/bin/cc --- > #CC=/opt/SUNWspro/bin/cc 135c135 < #CC=/opt/gnu/bin/gcc --- > CC=/usr/sfw/bin/gcc 138c138 < CFLAGS=`echo -DSVR4 -DMMAP -DSOLARIS2X -DSTDARG` --- > #CFLAGS=`echo -DSVR4 -DMMAP -DSOLARIS2X -DSTDARG` 145c145 < #CFLAGS=`echo -DSVR4 -DMMAP -DSOLARIS2X -DSTDARG -m64` --- > CFLAGS=`echo -DSVR4 -DMMAP -DSOLARIS2X -DSTDARG -m64 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE` 150c150 < LOCKTESTS=`echo tlocklfs tlock64` --- > #LOCKTESTS=`echo tlocklfs tlock64` 152c152 < #LOCKTESTS=`echo tlocklfs` --- > LOCKTESTS=`echo tlocklfs`
In the previous run, I hadn't set -D_LARGEFILE64_SOURCE and I didn't fix LOCKTESTS correctly. While the -D_LARGEFILE64_SOURCE might have been what was killing me, I don't think so.
The DS hung during the write/read of the 30 MB file. It was unresponsive on the console. I heard the disk chugging, I killed off Thunderbird and Firefox. And it did come back. My guess is that the 512M on the earlier systems was insufficient. I've had problems in the past with virtual machines that were trying large IO. (A real machine wrote a 50G file to a NFS simulator and they complained about the speed. They shut up when I had them go against the real box.)
So the experiment works. I'm working on VirtualBox on the side still.
I'm trying to get a NFSv4.1 (aka pNFS) DS, MDS, and client all running as VMWare machines on my XP box. I took a base nevada 85 system (with a Core Custom load - which only eats up 1.4G of disk space) and loaded the on-pnfs-draft19-onnv85-bfu-20080324.i386.tar.bz2 bfu bits (check out our pNFS Download page in OpenSolaris: http://www.opensolaris.org/os/project/nfsv41/downloads/).
I then cloned the resulting system into m-client, m-ds, and m-mds. I was able to configure everything up okay, but system is locking up during the NFS Cthon tests:
write/read 30 MB file
After some investigation, I don't think this is a pNFS issue. The m-ds machine is hanging, consistently. It will hang even if I don't run the test. It isn't dropping into kmdb and is totally unresponsive on the console.
Either I didn't clone the original machine correctly or I'm running out of resources. I've had at least 3 machines running concurrently in the past, so I doubt it is resources. Also, the machines each are limited to 512M of memory.
I may play with this a bit more or try VirtualBox, which can be hosted under Solaris and OpenSolaris. I can run it on my w2100z. It now has a whopping 16G of RAM and should be able to handle plenty of virtual machines.
If you fail to share a zfs pool on the mds, then nfs will not be enabled and therefore the mds auth records will not load.
# mdsadm -o add -t auth -a ip=10.1.233.117 adding: IP Addr - 10.1.233.117 Mar 20 20:37:41 pnfs-3-15 nfs: NFS Server not loaded # zpool create -f nippy /dev/dsk/c1t0d0s7 # echo '::walk Device_entry_cache | ::print struct rfs4_dbe data | ::print mds_device_t' | mdb -k mdb: failed to perform walk: unknown walk name
Above we see mdsadm clearly telling us we have an issue. And the fact that we can't use our simple mdb command to see the auth records is a worry.Just as I rebooted, I realized what the problem was. And we can check it out:
# mdsadm -o add -t auth -a ip=10.1.233.117 adding: IP Addr - 10.1.233.117 Mar 20 20:42:54 pnfs-3-15 nfs: NFS Server not loaded # echo '::walk Device_entry_cache | ::print struct rfs4_dbe data | ::print mds_device_t' | mdb -k mdb: failed to perform walk: unknown walk name # echo '::walk Device_entry_cache | ::print struct rfs4_dbe data | ::print mds_device_t' | mdb -k mdb: failed to perform walk: unknown walk name # zfs set sharenfs=rw,anon=0 nippy # echo '::walk Device_entry_cache | ::print struct rfs4_dbe data | ::print mds_device_t' | mdb -k
We can see that once the share is enabled, we can walk the structures. We can redo the mdsadm command:
# mdsadm -o add -t auth -a ip=10.1.233.117 adding: IP Addr - 10.1.233.117 # echo '::walk Device_entry_cache | ::print struct rfs4_dbe data | ::print mds_device_t' | mdb -k
What is wrong now? We need to enable the ds:
# dservadm enable
And:
# echo '::walk Device_entry_cache | ::print struct rfs4_dbe data | ::print mds_device_t' | mdb -k
{
dbe = 0xffffff02f4156f08
dev_addr = {
na_r_netid = 0xffffff02f61bfc80 "tcp"
na_r_addr = 0xffffff02efde8a88 "10.1.233.117.147.49"
}
dev_flags = 0x3
dev_infop = 0xffffff02f4157f78
dev_list_next = {
list_next = 0xffffff02f4157fb8
list_prev = 0xffffff02f4157fb8
}
}
In no particular order...
Here is a bad result
mds # echo '::walk Device_entry_cache | ::print struct rfs4_dbe data | ::print mds_device_t' | mdb -k mds #
And here is a good result:
# echo '::walk Device_entry_cache | ::print struct rfs4_dbe data | ::print mds_device_t' | mdb -k
{
dbe = 0xffffff02fbb21f08
dev_addr = {
na_r_netid = 0xffffff02f6d17b20 "tcp"
na_r_addr = 0xffffff02ec14c000 "10.1.233.119.217.62"
}
dev_flags = 0x3
dev_infop = 0xffffff02fbb22f78
dev_list_next = {
list_next = 0xffffff02fbb22fb8
list_prev = 0xffffff02fbb22fb8
}
}
Unfortunately, many problems have this symptom.
client # mount pnfs-3-14:/ /mnt Mar 19 11:29:21 pnfs-3-12 nfs: NOTICE: enabling pNFS on pnfs-3-14 client # cd /mnt client # cd client # snoop pnfs-3-14 pnfs-3-13 Using device bge0 (promiscuous mode) ^Cclient # cd
I typically open another window and do simple copies from /etc:
[pnfs-3-12 zippy]> cp /etc/motd . [pnfs-3-12 zippy]> cp /etc/motd kkk
It is only when I see valid traffic that I know I am set.
# snoop pnfs-3-12 pnfs-3-13 Using device bge0 (promiscuous mode) Mar 19 11:39:07 pnfs-3-12 nfs: NOTICE: enabling pNFS on pnfs-3-14 pnfs-3-12 -> pnfs-3-13 TCP D=55614 S=1014 Ack=93767047 Seq=74796989 Len=0 Win=49640 pnfs-3-12 -> pnfs-3-13 NFS C 4 (exchange_id ) EXCHANGE_ID Verf=11E050B247E141B2 COID=F6C0 PNFS_DS NONE
ds # zpool create -f dpool /dev/dsk/c1t0d0s7 ds # dservadm addmds 10.1.233.120.8.1 ds # dservadm addpool dpool ds # dservadm enable
If you don't, then you won't see traffic going from the client to the DS.
client # snoop pnfs-3-12 pnfs-3-13 Using device bge0 (promiscuous mode) Mar 19 11:19:26 pnfs-3-12 last message repeated 2 times
So go back and do:
ds # zfs set sharenfs=on dpool
Note you will also not see a record for the auth in the mds:
mds # echo '::walk Device_entry_cache | ::print struct rfs4_dbe data | ::print mds_device_t' | mdb -k
ds # dservadm disable
And
# mdsadm -o add -t auth -a ip=10.1.233.119 adding: IP Addr - 10.1.233.119
ds # dservadm enable
mds # mdsadm -o add -t auth -a ip=10.1.233.119 adding: IP Addr - 10.1.233.119 mds # echo '::walk Device_entry_cache | ::print struct rfs4_dbe data | ::print mds_device_t' | mdb -k
Whoops, forgot to disable the ds!
ds # dservadm disable
Still nothing!
mds # echo '::walk Device_entry_cache | ::print struct rfs4_dbe data | ::print mds_device_t' | mdb -k
Okay, issues again:
mds # mdsadm -o add -t auth -a ip=10.1.233.119 adding: IP Addr - 10.1.233.119 nfssys:: File exists mds # echo '::walk Device_entry_cache | ::print struct rfs4_dbe data | ::print mds_device_t' | mdb -k
And finally:
ds # dservadm enable
Which yields:
# echo '::walk Device_entry_cache | ::print struct rfs4_dbe data | ::print mds_device_t' | mdb -k
{
dbe = 0xffffff02f4493f08
dev_addr = {
na_r_netid = 0xffffff02f76f28e0 "tcp"
na_r_addr = 0xffffff02f51ec820 "10.1.233.119.254.189"
}
dev_flags = 0x3
dev_infop = 0xffffff02f4494f78
dev_list_next = {
list_next = 0xffffff02f4494fb8
list_prev = 0xffffff02f4494fb8
}
}
Hmm, I wasn't able to reproduce the double record thing... Oh yes I was!
ds # dservadm disable ds # dservadm enable
And
mds # echo '::walk Device_entry_cache | ::print struct rfs4_dbe data | ::print mds_device_t' | mdb -k
{
dbe = 0xffffff02f4493e18
dev_addr = {
na_r_netid = 0xffffff02f76f2fc0 "tcp"
na_r_addr = 0xffffff02f51ecb20 "10.1.233.119.212.217"
}
dev_flags = 0x3
dev_infop = 0xffffff02f4494f78
dev_list_next = {
list_next = 0xffffff02f4494fb8
list_prev = 0xffffff02f4494fb8
}
}
{
dbe = 0xffffff02f4493f08
dev_addr = {
na_r_netid = 0xffffff02f76f28e0 "tcp"
na_r_addr = 0xffffff02f51ec820 "10.1.233.119.254.189"
}
dev_flags = 0x3
dev_infop = 0xffffff02f4494f78
dev_list_next = {
list_next = 0
list_prev = 0
}
}
The point is with multiple auth records you may not get the results you want. I'd reboot them both at this point. :->
client # ./server -p /nippy -m /mnt/pnfs-3-14 pnfs-3-14 Start tests on path /mnt/pnfs-3-14/pnfs-3-12.test [y/n]? y sh ./runtests -a -t /mnt/pnfs-3-14/pnfs-3-12.test Starting BASIC tests: test directory /mnt/pnfs-3-14/pnfs-3-12.test (arg: -t) mkdir: Failed to make directory "/mnt/pnfs-3-14/pnfs-3-12.test"; Permission denied Can't make directory /mnt/pnfs-3-14/pnfs-3-12.test basic tests failed Tests failed, leaving /mnt/pnfs-3-14 mounted
So:
mds #share -@nippy /nippy rw "" mds # zfs set sharenfs=rw,anon=0 nippy
I've got a bug for which I do not have a reproducible test case. But I'm pretty confident I found what is going wrong. I can run regression tests to show I haven't broken anything - but those same regression tests never tripped the bug in the first place.
The fix is:
------- usr/src/uts/common/fs/nfs/nfs4_stub_vnops.c -------
Index: usr/src/uts/common/fs/nfs/nfs4_stub_vnops.c
23c23
< * Copyright 2007 Sun Microsystems, Inc. All rights reserved.
---
> * Copyright 2008 Sun Microsystems, Inc. All rights reserved.
27c27
< #pragma ident "@(#)nfs4_stub_vnops.c 1.3 07/10/25 SMI"
---
> #pragma ident "%Z%%M% %I% %E% SMI"
1751a1757,1769
> * Someone is already working on it. We
> * need to back off and let them proceed.
> *
> * We return EBUSY so that the caller knows
> * something is going on. Note that by that
> * time, the umount in the other thread
> * may have already occured.
> */
> if (was_locked) {
> return (EBUSY);
> }
>
> /*
1762,1763c1780
< if (was_locked == FALSE &&
< !mutex_tryenter(&net->net_tree_lock)) {
---
> if (!mutex_tryenter(&net->net_tree_lock)) {
1814c1831
< } else if (was_locked == FALSE) {
---
> } else {
In English, the lock detection used to only handle when the lock was not being held.
What I want to do is force was_locked to be true at this point in both the original code (to verify I can trigger the panic at will) and also in my fix (to verify I have fixed the correct bug).
I can do that by adding the following code:
# wx diffs ------- usr/src/uts/common/fs/nfs/nfs4_stub_vnops.c ------- 122a123,124 > int nfsv4_mm_was_locked = FALSE; > 1750a1753,1755 > if (nfsv4_mm_was_locked) > was_locked = TRUE;
This is a global in the nfs module which by default does not force was_locked to be set. I can use mdb to change it on the fly:
# mdb -kw Loading modules: [ unix genunix specfs dtrace cpu.generic cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci ufs mpt ip hook neti sctp arp usba fctl nca lofs cpc random zfs nfs fcip logindmux ptm sppp ] > nfsv4_mm_was_locked::print 0 > nfsv4_mm_was_locked/W 1 nfsv4_mm_was_locked: 0 = 0x1 > nfsv4_mm_was_locked::print 0x1 > $q
Note that I do not want to add special code to check the environment, add something in /etc/default/nfs, or anything else which requires changing anything on a system. I leave it entirely in the kernel and I use mdb to control it.
So I have a new daemon and I have it loaded:
# ./sped 10, 16, 64000, 1647890964l uid == 501 20, 32, 2000, 1647890964l uid == 1066 30, 64, 1000, 1647890964l uid == 0 40, 2, 2000, 1647890964l subnet == 192.168.2.0/24
And I have a door upcall which starts in do_rfs4_op_mknod(). A door is an IPC mechanism used to communicate from the kernel to user land. Also, since the sped hands out policies for open creation, this should be a good point to call it from. Notice the foreshadowing invoked by the word "should" and remember that foreshadowing is the sign of good journalism (I think this is a quote from Bloom County?).
Okay, it didn't work - the sped is not being called. I can whip out kmdb and struggle through trying to see if the door upcall is invoked. But first, we should do a sanity check to see if do_rfs4_op_mknod() is even being called:
# dtrace -m nfssrv > /tmp/dtrace.log dtrace: description 'nfssrv' matched 2262 probes ^C# grep do /tmp/dtrace.log 0 51547 mds_do_lookup:entry 0 51548 mds_do_lookup:return 0 52741 do_rfs4_op_getattr:entry 0 52742 do_rfs4_op_getattr:return 0 52379 mds_findopenowner:entry 0 52380 mds_findopenowner:return 0 51593 mds_do_opennull:entry 0 52127 do_rfs4_set_attrs:entry 0 52128 do_rfs4_set_attrs:return 0 51777 vop_fid_pseudo:entry 0 51778 vop_fid_pseudo:return 0 52641 do_41_deleg_hack:entry 0 52642 do_41_deleg_hack:return 0 51591 mds_do_open:entry 0 51592 mds_do_open:return 0 51594 mds_do_opennull:return 0 52741 do_rfs4_op_getattr:entry 0 52742 do_rfs4_op_getattr:return 0 52741 do_rfs4_op_getattr:entry 0 52742 do_rfs4_op_getattr:return 1 52127 do_rfs4_set_attrs:entry 1 52128 do_rfs4_set_attrs:return 1 51547 mds_do_lookup:entry 1 51548 mds_do_lookup:return 1 52741 do_rfs4_op_getattr:entry 1 52742 do_rfs4_op_getattr:return
Remember, the rfs component of the function name means that is in the nfssrv module and not the nfs module. Also, note that you won't see "mds_" names in code drops from Nevada. You will need to download the pNFS tree over at: OpenSolaris Project: NFS version 4.1 pNFS
Okay, it was never called. So either dtrace is hosed or something is going wrong. What is being called? mds_createfile()
0 51593 mds_do_opennull:entry 0 51987 mds_createfile:entry 0 52647 nfsauth4_access:entry 0 52377 nfsauth_access:entry 0 52378 nfsauth_access:return 0 52648 nfsauth4_access:return 0 52399 nfs4_ntov_table_init:entry 0 52400 nfs4_ntov_table_init:return
I'm off to build a new nfssrv to see if making my upcall from mds_createfile() works.
With the OpenSolaris pnfs binary drops, things are still pretty raw. (Hey, we could hold onto the code until it is fully baked!) A common problem is telling which machines plays which role - i.e., I've got a pNFS community set up, I've done some testing, and I've rebooted the machines. How do I remember which machine is which?
The dservadm command will help you. If it shows nothing, you have either a client or the MDS:
# dservadm listpools zpools:
Okay, we find the DS:
# dservadm listpools
zpools:
train
# uname -a
SunOS silent 5.11 pnfs-curr i86pc i386 i86pc
Now, how do I tell which of the remaining two are the MDS and client? Well, dservadm can tell me a bit more:
# dservadm listmds
mds:
192.168.2.119.8.1
# host 192.168.2.119
119.2.168.192.in-addr.arpa domain name pointer stealth
Don't worry about the extra numbers at the end of the mds identifier - just realize the first four are octets. Okay, we can now start back up with our testing.
I've given up trying to explain to people what it is I do for a living. But I think I'm going to keep on trying to explain what a BakeAThon is to them.
How can I do one without the other? Well, I can abstract the process.
So a BakeAThon (and a ConnectAThon) is an interoperability testing event. And the best way to describe it is that you have 10 different people with a set of rules for a game (if you say AD&D edition 3 rules, you have another set of problems to deal with). Each of them has read the rules and believes that they know all of the intricacies. But none of them have played the game with anyone else.
So they all get together and start to play each other. And they start to argue about each and every move. Sometimes it is pretty obvious whose interpretation is wrong. And sometimes they call someone else over to help decide.
As soon as player A is done with player B, they start with player C. Except sometimes they are also playing with player D at the same time. Or player B comes back to see if they have gotten rule 5.3 correct now.
Sometimes they all vote on how to interpret a rule and even change the rule book. And sometimes player F was at the bathroom when that happened and causes the debate to start back up again.
Then they all go away again for 3 months, promising to play games against each other remotely. They meet back up again at the next BakeAThon - sometimes there is a new player or someone didn't show up. But they are willing to chime in over email.
But they have to start all over from scratch because no one played remotely and they've been busy playing with themselves.
A further complication is that some people only play defense and some only play offense. Sometimes you get a team where they split those duties. So when you talk to one person about how they run their offense, they shrug and say that they only do defense. And the problem that arises here is when the team's offense only plays against their defense - they get pretty good at it and understand some simple shortcuts that make it easy. But when they play another team, those same shortcuts cause problems.
So a BakeAThon is pretty much like that. The major difference is that the competitiveness isn't in winning a game but in getting the game adopted by other people. I.e., Foo Inc. and Bar Inc. may have differences and fight over customers, but while at a BakeAThon, they work together to make NFS a better protocol.
The best way to rewrite the partition table on a Mac Mini is with the OS X boot DVD. Insert it, reboot, and press C until it boots from the DVD. At that point, get to where you can start Disk Utility, then:
I also partitioned it as type Unix, but I think you can skip this as long as you get the partition table written. I couldn't figure how to eject the Tiger disk at this point. So, reboot!
At this point, your disk is hosed as far as booting. You need to get the DVD out and the best way to do that is to be pressing your mouse button while it boots. (I did the left mouse button on a Logitech mouse.) When the DVD ejects, insert your OpenSolaris DVD. I can't remember if it booted then or I had to reboot. If you need to reboot, remember to press C.
The other thing to note is that I selected the standard installation (and not the developer's installation). I partitioned the drive as follows:
0 / 20480 1 swap 4096 3 /a1 20480 4 /a2 20480 5 /zap 38632 6 ---- ----- 7 /export/home 10240
/a1 and /a2 allow me to do Live Upgrades if I need to. I can also scavenge /export/home if needed (i.e., I'd have a NFS homedir server). These are mainly test machines.
And when it boots and is networked:
partition> p Current partition table (original): Total disk cylinders available: 14590 + 2 (reserved cylinders) Part Tag Flag Cylinders Size Blocks 0 root wm 7054 - 9664 20.00GB (2611/0/0) 41945715 1 swap wu 3 - 525 4.01GB (523/0/0) 8401995 2 backup wm 0 - 14589 111.77GB (14590/0/0) 234388350 3 unassigned wm 4443 - 7053 20.00GB (2611/0/0) 41945715 4 unassigned wm 1832 - 4442 20.00GB (2611/0/0) 41945715 5 unassigned wm 9665 - 14589 37.73GB (4925/0/0) 79120125 6 unassigned wm 0 0 (0/0/0) 0 7 home wm 526 - 1831 10.00GB (1306/0/0) 20980890 8 boot wu 0 - 0 7.84MB (1/0/0) 16065 9 alternates wu 1 - 2 15.69MB (2/0/0) 32130