Tuesday October 14, 2008 When taking a Brompton on Eurostar make sure you have a cover for it or they won't let you on the train. Quite why I don't know, but since I do have a cover I could not, believe it or not, be bothered to get into an argument with them.
Once it is covered they have no problem with it. The next gripe would be Google Maps. It does not offer routes for cyclists and, at least in France, does not show hills. Specifically the hill between the station and the Sun office in Velizy.
Friday October 10, 2008 Go on. 7 ½ hours before you need to use SunSolve, log in and then just leave that tab alone until you need it. Why? Because you can!
As promised the horribly short idle timeout has been increased from 30 minutes to 8 hours and the session time from 2 hours to 24.
Also, as I have just been reminded, it affects blogs.sun.com too. Sweet.
While scsi.d is good for looking at SCSI packets and seeing those raw CDBs, not many people are really interested in what a SCSI packet looks like, well not enough people if you ask me. However what is much more interesting is how long the SCSI packets are taking. Now scsi.d tells you this for each packet, but aggregating the data would be more useful.
: e2big.eu TS 81 $; pfexec /usr/sbin/dtrace -Cs scsi.d -D QUIET -D PERF_REPORT -D REPORT_TARGET \
-D REPORT_LUN -n 'tick-1m {printa(@); clear(@); exit(0) }'
Hit Control C to interrupt
qus 1
value ------------- Distribution ------------- count
131072 | 0
262144 |@@@@ 25
524288 |@@@@@@@@@@@@ 68
1048576 |@@@@@@ 34
2097152 | 2
4194304 |@@@ 19
8388608 |@@@@@ 29
16777216 |@@@@ 22
33554432 |@@@@@@ 35
67108864 | 1
134217728 | 0
fp 2
value ------------- Distribution ------------- count
262144 | 0
524288 | 3
1048576 | 1
2097152 |@@ 15
4194304 |@@@@@@@@ 67
8388608 |@@@@@@@@@@ 81
16777216 |@@@@@@@@ 65
33554432 |@@@@@@@@ 66
67108864 |@@@ 27
134217728 | 0
fp 0
value ------------- Distribution ------------- count
65536 | 0
131072 | 27
262144 |@ 485
524288 |@@@@@@ 2901
1048576 |@@@@@ 2203
2097152 |@@@@@@@ 3204
4194304 |@@@@@@@@@ 4087
8388608 |@@@@@@@@ 3978
16777216 |@@@ 1606
33554432 |@ 570
67108864 | 123
134217728 | 45
268435456 | 0
fp 3
value ------------- Distribution ------------- count
65536 | 0
131072 | 41
262144 |@ 493
524288 |@@@@@@ 2926
1048576 |@@@@ 2157
2097152 |@@@@@@ 3228
4194304 |@@@@@@@@@ 4461
8388608 |@@@@@@@@@ 4561
16777216 |@@@ 1634
33554432 |@ 510
67108864 | 116
134217728 | 52
268435456 | 2
536870912 | 0
scsi_vhci 0
value ------------- Distribution ------------- count
131072 | 0
262144 |@ 588
524288 |@@@@@ 4807
1048576 |@@@@@@ 5423
2097152 |@@@@@@@ 6609
4194304 |@@@@@@@@@ 8627
8388608 |@@@@@@@@@ 8641
16777216 |@@@ 3289
33554432 |@ 1088
67108864 | 239
134217728 | 97
268435456 | 2
536870912 | 0
: e2big.eu TS 82 $;
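The distributions above are DTrace quantize() aggregations: each row counts the values that fall in a power-of-two range starting at the value shown, so the row labelled 524288 covers 524288 up to 1048575. As a rough sketch of that bucketing in C (the function name is made up for illustration, and only non-negative values are handled, whereas quantize() also has negative buckets):

```c
#include <stdint.h>

/*
 * Return the label of the power-of-two bucket that a value lands in:
 * the largest power of two that is less than or equal to the value.
 * Zero is given its own bucket, labelled 0.
 */
uint64_t
quantize_bucket(uint64_t value)
{
    uint64_t bucket = 1;

    if (value == 0) {
        return (0);
    }
    /* Comparing against value / 2 avoids overflowing the bucket. */
    while (bucket <= value / 2) {
        bucket *= 2;
    }
    return (bucket);
}
```

So a packet that took 600000 time units would be counted in the 524288 row, and one that took 1100000 in the 1048576 row.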
All the new options are supplied via -D flags to dtrace and they are:
Option Name | Description
QUIET | Be quiet. Don't report any packets seen. Useful when you only want a performance report.
PERF_REPORT | Produce a per HBA performance report when the script completes. The report is an aggregation held in @ so it can be printed at regular intervals using a tick probe, as in the above example but without the call to exit().
REPORT_TARGET | If producing a performance report, include the target to produce a per target report.
REPORT_LUN | If producing a per target report, include the LUN to produce a per LUN report.
DYNVARSIZE | Pass this value to the #pragma D option dynvarsize= option. E.g.: -D DYNVARSIZE=64m
The latest version of the script, version 1.15 is here: http://blogs.sun.com/chrisg/resource/scsi_d/scsi.d-1.15
Monday October 06, 2008 I am forced to have a Windows system at home which, thankfully, only very occasionally gets used. However, even though everything that gets on it is virus scanned, all email is scanned before it gets near it and none of the users are administrators, I still like to keep it backed up.
Given I have a server on the network which has ZFS file systems with spare capacity, I decided that I could do this just using the dd(1) command, which I have written about before. Using that to copy the entire disk image to a file on ZFS allows me to back the system up. However if I snapshot the backup file system and then back up again, every block gets rewritten and so takes up space on the server even if it has not changed (roll on dedup). To stop this I have a tiny program that mmap()s the entire backup file and then only updates the blocks that have changed.
I call it syncer for no good reason:
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <string.h>
#include <sys/mman.h>
#include <stdio.h>
#include <sys/time.h>

/*
 * Build by:
 *     cc -m64 -o syncer syncer.c
 */

/*
 * Match this to the file system record size.
 */
#define BLOCK_SIZE (128 * 1024)
#define KILO 1024
#define MEG (KILO * KILO)

/*
 * Scale factors for the nanosecond values returned by gethrtime().
 * Despite the names, NSEC is 10^6 and USEC is 10^9, which is the
 * number of nanoseconds in a second.
 */
#define MSEC (1000LL)
#define NSEC (MSEC * MSEC)
#define USEC (NSEC * MSEC)

static long block_size;

/* Map the whole backup file read/write. Returns NULL on any failure. */
char *
map_file(const char *file)
{
    int fd;
    char *addr;
    struct stat buf;

    if ((fd = open(file, O_RDWR)) == -1) {
        return (NULL);
    }
    if (fstat(fd, &buf) == -1) {
        close(fd);
        return (NULL);
    }
    block_size = buf.st_blksize;
    addr = mmap(0, buf.st_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    /* mmap() signals failure with MAP_FAILED, not NULL. */
    return (addr == MAP_FAILED ? NULL : addr);
}

/* Read until len bytes have arrived, or until end of file or an error. */
off64_t
read_whole(int fd, char *buf, int len)
{
    int count;
    int total = 0;

    while (total != len &&
        (count = read(fd, &buf[total], len - total)) > 0) {
        total += count;
    }
    return (total);
}

static void
print_amount(const char *str, off64_t value)
{
    if (value < KILO) {
        printf("%s %8lld ", str, (long long)value);
    } else if (value < MEG) {
        printf("%s %8lldK", str, (long long)(value / KILO));
    } else {
        printf("%s %8lldM", str, (long long)(value / MEG));
    }
}

int
main(int argc, char **argv)
{
    char *buf;
    off64_t offset = 0;
    off64_t update = 0;
    off64_t count;
    off64_t tcount = 0;
    char *addr;
    long bs;
    hrtime_t starttime;
    hrtime_t lasttime;

    if (argc == 1) {
        fprintf(stderr, "Usage: %s outfile\n", *argv);
        exit(1);
    }
    if ((addr = map_file(argv[1])) == NULL) {
        perror(argv[1]);
        exit(1);
    }
    bs = block_size == 0 ? BLOCK_SIZE : block_size;
    if ((buf = malloc(bs)) == NULL) {
        perror("malloc failed");
        exit(1);
    }
    print_amount("Block size:", bs);
    printf("\n");
    fflush(stdout);
    starttime = lasttime = gethrtime();
    while ((count = read_whole(0, buf, bs)) > 0) {
        hrtime_t thistime;

        /* Only dirty the mapping when the block really differs. */
        if (memcmp(buf, addr + offset, count) != 0) {
            memcpy(addr + offset, buf, count);
            update += count;
        }
        madvise(addr + offset, count, MADV_DONTNEED);
        offset += count;
        madvise(addr + offset, bs, MADV_WILLNEED);
        /* Count every block, including the one that triggers a report. */
        tcount += count;
        thistime = gethrtime();
        /*
         * Only update the output after a second so that it is readable.
         */
        if (thistime - lasttime > USEC) {
            print_amount("checked", offset);
            printf(" %4lld M/sec ",
                (long long)(((hrtime_t)tcount * USEC) /
                (MEG * (thistime - lasttime))));
            print_amount(" updated", update);
            printf("\r");
            fflush(stdout);
            lasttime = thistime;
            tcount = 0;
        }
    }
    printf(" \r");
    print_amount("Read: ", offset);
    printf(" %lld M/sec ", (long long)((offset * NSEC) /
        (MEG * ((gethrtime() - starttime) / MSEC))));
    print_amount("Updated:", update);
    printf("\n");
    /* If nothing is updated return false */
    exit(update == 0 ? 1 : 0);
}
Then a simple shell function to do the back up and then snapshot the file system:
function backuppc
{
ssh -o Compression=no -c blowfish pc pfexec /usr/local/sbin/xp_backup | time ~/lang/c/syncer /tank/backup/pc/backup.dd && \
pfexec /usr/sbin/zfs snapshot tank/backup/pc@$(date +%F)
}
Running it I see that only 2.5G of data was actually written to disk, and yet thanks to ZFS I have a complete disk image and have not lost the previous disk images.
: pearson FSS 17 $; backuppc
665804+0 records in
665804+0 records out
Read: 20481M 9 M/sec Updated: 2584M

real 35m50.00s
user 6m27.98s
sys 2m43.76s
: pearson FSS 18 $;
Friday October 03, 2008 Getting back on topic, here is a nice short bit of DTrace.
Sometimes by the time I get to see an issue the “where on the object” question is well defined, and in two recent cases that came down to “Why is system call X slow?”. The two system calls were not the same in each case but the bit of D to find the answer was almost identical in both cases.
Faced with a system call that is taking a long time you have to understand the three possible reasons this can happen:
It has to do a lot of processing to achieve its results.
It blocks for a long time waiting for an asynchronous event to occur.
It blocks for a short time but many times waiting for asynchronous events to occur.
So it would be really nice to be able to see where a system call is spending all its time. The starting point for such an investigation is that when in the system call there are only two important states: the thread is either running on a CPU or it is not. Typically when it is not, it is because it is blocked for some reason. So the script uses the DTrace sched provider's on-cpu and off-cpu probes to see how much time the system call spends blocked, and prints out a stack if it is blocked for more than a given amount of time.
Here it is running against a simple mv(1) command:
$ pfexec /usr/sbin/dtrace -s syscall-time.d -c "mv .d .x"
dtrace: script 'syscall-time.d' matched 17 probes
dtrace: pid 26118 has exited
CPU ID FUNCTION:NAME
3 79751 rename:entry rename(.d, .x)
3 21381 resume:on-cpu Off cpu for: 1980302
genunix`cv_timedwait_sig+0x1c6
rpcmod`clnt_cots_kcallit+0x55d
nfs`nfs4_rfscall+0x3a9
nfs`rfs4call+0xb7
nfs`nfs4rename_persistent_fh+0x1eb
nfs`nfs4rename+0x482
nfs`nfs4_rename+0x89
genunix`fop_rename+0xc2
genunix`vn_renameat+0x2ab
genunix`vn_rename+0x2b
3 79752 rename:return
on-cpu
value ------------- Distribution ------------- count
16384 | 0
32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 3
65536 |@@@@@@@@@@ 1
131072 | 0
off-cpu
value ------------- Distribution ------------- count
131072 | 0
262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2
524288 | 0
1048576 |@@@@@@@@@@@@@ 1
2097152 | 0
rename times on: 205680 off: 2625604 total: 2831284
$
From the aggregations at the bottom of the output you can see that the system call went off-cpu three times, and on one of those occasions it was off CPU for long enough that my limit of 1,000,000 nanoseconds (1 ms) was reached and so a stack trace was printed. It also becomes pretty clear where that system call spent all its time. It was a “rename” system call and I'm on an NFS file system, so it has to wait for the server to respond, and that server is going to have to make sure it has updated some non-volatile storage.
Here is the script:
#!/usr/sbin/dtrace -s

/* run using dtrace -p or dtrace -c */

syscall::rename:entry
/ pid == $target /
{
    self->traceme = 1;
    self->ts = timestamp;
    self->on_cpu = timestamp;
    self->total_on = 0;
    self->total_off = 0;
    printf("rename(%s, %s)", copyinstr(arg0), copyinstr(arg1));
}

sched:::off-cpu
/ self->traceme == 1 /
{
    self->off_cpu = timestamp;
    self->total_on += self->off_cpu - self->on_cpu;
}

sched:::off-cpu
/ self->traceme == 1 /
{
    @["on-cpu"] = quantize(self->off_cpu - self->on_cpu);
}

sched:::on-cpu
/ self->traceme == 1 /
{
    self->on_cpu = timestamp;
    @["off-cpu"] = quantize(self->on_cpu - self->off_cpu);
    self->total_off += self->on_cpu - self->off_cpu;
}

/* if off cpu for more than a millisecond (1000*1000 ns) print a stack */
sched:::on-cpu
/ self->traceme == 1 && timestamp - self->off_cpu > 1000*1000 /
{
    printf("Off cpu for: %d", self->on_cpu - self->off_cpu);
    stack(10);
}

/* likewise if on cpu for more than a millisecond */
sched:::off-cpu
/ self->traceme == 1 && timestamp - self->on_cpu > 1000*1000 /
{
    printf("On cpu for: %d", self->off_cpu - self->on_cpu);
    stack(10);
}

syscall::rename:return
/self->traceme/
{
    self->traceme = 0;
    self->total_on += timestamp - self->on_cpu;
    @["on-cpu"] = quantize(timestamp - self->on_cpu);
    printa(@);
    printf("%s times on: %d off: %d total: %d\n", probefunc, self->total_on,
        self->total_off, timestamp - self->ts);
    self->on_cpu = 0;
    self->off_cpu = 0;
    self->total_on = 0;
    self->total_off = 0;
}
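The bookkeeping in the script can be followed with a small replay in C. This is only a sketch: the event type, the function name and the timestamps in the note that follows are invented for illustration, and the real script of course gets its events from the syscall and sched providers rather than from an array.

```c
#include <stddef.h>
#include <stdint.h>

/* The four events the D script reacts to (names are hypothetical). */
enum event_kind { EV_ENTRY, EV_OFF_CPU, EV_ON_CPU, EV_RETURN };

struct event {
    enum event_kind kind;
    uint64_t ts;            /* timestamp in nanoseconds */
};

struct totals {
    uint64_t on;            /* time spent running on a CPU */
    uint64_t off;           /* time spent off CPU, typically blocked */
    uint64_t total;         /* entry to return */
};

/* Replay the events of one system call and total up the time. */
struct totals
account(const struct event *ev, size_t n)
{
    struct totals t = { 0, 0, 0 };
    uint64_t start = 0, on_cpu = 0, off_cpu = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        switch (ev[i].kind) {
        case EV_ENTRY:      /* the thread is on CPU at entry */
            start = on_cpu = ev[i].ts;
            break;
        case EV_OFF_CPU:    /* an on-CPU interval ends */
            off_cpu = ev[i].ts;
            t.on += off_cpu - on_cpu;
            break;
        case EV_ON_CPU:     /* an off-CPU interval ends */
            on_cpu = ev[i].ts;
            t.off += on_cpu - off_cpu;
            break;
        case EV_RETURN:     /* close the last on-CPU interval */
            t.on += ev[i].ts - on_cpu;
            t.total = ev[i].ts - start;
            break;
        }
    }
    return (t);
}
```

Replaying an entry at 100, off-cpu at 150, on-cpu at 400 and return at 420 gives on 70, off 250 and total 320. As in the rename output above (205680 + 2625604 = 2831284), the on and off times add up to the total.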
Thursday October 02, 2008 I've been having a conversation with some colleagues about how to communicate announcements within Sun and rather than do this via email I thought I would go a bit off topic for this blog and post it here.
Why I think a blog is a good place to put announcements.
Email provides a fantastic one to one or one to many communication medium but only if the “many” are all known to you. Sending out large announcements via email is likely to cause large numbers of the potential audience to ignore you. If you want this sort of broadcast medium I would suggest a blog.
Blogs
Blogs, being largely write once and then read only, are a great way to put out announcements. Users can subscribe to them using the blog's RSS feed and read them in a blog reader, or even in Thunderbird so that they look like emails. Or you can read them directly via the web or via a blog reading web site like google.com/reader. They will get indexed by search engines and so can be found by occasional readers, while at the same time those who want to can subscribe to them to get the latest news or views when they are posted.
Wikis
There are wikis and there are wikis, so to some degree the question as to when they are good and bad depends on the wiki. However I've not found a wiki yet that has a really good RSS feed for handling announcements. They are great for cooperative working or for community documentation (e.g. Wikipedia), but for announcements they lack concise RSS feeds or notification methods. The feeds either suffer from too much noise (changing a single typo results in the entire page being in the RSS feed) or don't send enough updates (the feed is per page, so having an announcement per page does not produce a good stream of announcements). Some wikis allow you to build complex RSS feeds based on search criteria, which may allow a suitable feed to be built, but this is really for a power user and so I've not found it suitable for announcements.
Tuesday September 30, 2008 The hot news around here is that the session timeouts for SunSolve and the other tools that use the authentication system on sun.com are going to be increased to something approaching reasonable timeouts. The current 30 minute idle and 2 hour session timeouts will be increased to 8 hours idle and 24 hours for the session. Not quite the 14 days and 90 days I would have liked, but nonetheless a welcome step in the right direction.
If all goes well the change should happen on October 9th. I wish it was sooner, but nonetheless the prospect is exciting enough for me to pre-announce it here, not that anyone will read it!
A big thank you to those who are making it happen.
Wednesday September 24, 2008 This entry has been sitting in my draft queue for over a year, mainly as it is no longer relevant since NFSv4 should have rendered the script useless. The rest of this entry refers to NFSv2 and NFSv3 filehandles only.
How can you decode an NFS filehandle?
NFS file handles are opaque, so only the server that hands them out can draw firm conclusions from them. However since the implementation in SunOS has not changed, it is possible to write a script that will turn a file handle that has been handed out by a server running Solaris into an inode number and device. Way back when, I wrote that script, and only today someone made good use of it, so here it is for everyone.
The script has not been touched in over 10 years until I added the CDDL but should still be able to understand messages files and snoop -v output and then decode the file handles.
This snoop was taken while accessing the file “passwd” that was in /export/home on the server:
: s4u-10-gmp03.eu TS 19 $; /usr/sbin/snoop -p 3,3 -i /tmp/snoop.cg13442 -v | decodefh | grep NFS
RPC: Program = 100003 (NFS), version = 3, procedure = 4
NFS: ----- Sun NFS -----
NFS:
NFS: Proc = 4 (Check access permission)
NFS: File handle = [8CB2]
NFS: 0080000000000002000A000000019DAC03419521000A000000019DA96E637436
decodefh: SunOS NFS server file handle decodes as: maj=32,min=0, inode=105900
NFS: Access bits = 0x0000002d
NFS: .... ...1 = Read
NFS: .... ..0. = (no lookup)
NFS: .... .1.. = Modify
NFS: .... 1... = Extend
NFS: ...0 .... = (no delete)
NFS: ..1. .... = Execute
NFS:
Now taking this information to the server you need to find the file system that is shared and has major number 32 and minor number 0 and then look for the file with the inode number 105900 :
# share
- /export/home rw ""
# df /export/home
/ (/dev/dsk/c0t0d0s0 ):13091934 blocks 894926 files
# ls -lL /dev/dsk/c0t0d0s0
brw-r----- 1 root sys 32, 0 Aug 22 15:11 /dev/dsk/c0t0d0s0
# find /export/home -inum 105900
/export/home/passwd
#
Clearly this is a trivial example but you get the idea.
The script also understands messages files:
$ grep 'nfs:.*702911' /var/adm/messages | head -2 | decodefh
Sep 21 03:14:34 vi64-netrax4450a-gmp03 nfs: [ID 702911 kern.notice] (file handle: d41cd448 a3dd9683 a00 2040000 1000000 a00 2000000 2000000)
decodefh: SunOS NFS server file handle decodes as: maj=13575,min=54344, inode=33816576
Sep 21 08:34:11 vi64-netrax4450a-gmp03 nfs: [ID 702911 kern.notice] (file handle: d41cd448 a3dd9683 a00 2040000 1000000 a00 2000000 2000000)
decodefh: SunOS NFS server file handle decodes as: maj=13575,min=54344, inode=33816576
$
and finally can take the file handle from the command line:
$ decodefh 0080000000000002000A000000019DAC03419521000A000000019DA96E637436
0080000000000002000A000000019DAC03419521000A000000019DA96E637436
decodefh: SunOS NFS server file handle decodes as: maj=32,min=0, inode=105900
$
So here is the script: http://blogs.sun.com/chrisg/resource/decodefh.sh
Remember this will only work for filehandles generated by NFS servers running Solaris and only for NFS versions 2 & 3. It is possible that the format could change in the future but at the time of writing and for the last 13 years it has been stable.
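To give a feel for what the decoding involves, here is a minimal sketch in C that handles only the hex string form shown in the snoop output. The field positions are inferred from the example handles in this entry rather than taken from decodefh itself: it assumes the first 32-bit word is a compressed device number with a 14-bit major and an 18-bit minor, and that the inode number is the fourth 32-bit word. The struct and function names are made up for illustration.

```c
#include <ctype.h>
#include <stdint.h>
#include <string.h>

struct fh_info {
    unsigned int major;
    unsigned int minor;
    uint32_t inode;
};

/* Convert exactly 8 hex digits to a 32-bit value. Returns -1 if invalid. */
static int
hex_word(const char *hex, uint32_t *out)
{
    uint32_t value = 0;
    int i;

    for (i = 0; i < 8; i++) {
        unsigned char c = (unsigned char)hex[i];

        if (!isxdigit(c)) {
            return (-1);
        }
        value = value * 16 +
            (uint32_t)(isdigit(c) ? c - '0' : tolower(c) - 'a' + 10);
    }
    *out = value;
    return (0);
}

/*
 * Decode a file handle given as a hex string, as printed by snoop -v.
 * Assumed layout: hex digits 0-7 are the device, 24-31 are the inode.
 */
int
decode_fh(const char *hex, struct fh_info *info)
{
    uint32_t dev, ino;

    if (strlen(hex) < 32) {
        return (-1);
    }
    if (hex_word(hex, &dev) != 0 || hex_word(hex + 24, &ino) != 0) {
        return (-1);
    }
    info->major = dev >> 18;        /* top 14 bits */
    info->minor = dev & 0x3ffff;    /* bottom 18 bits */
    info->inode = ino;
    return (0);
}
```

Fed the handle from the snoop above, this gives major 32, minor 0 and inode 105900, matching the decodefh output. The same shift and mask applied to d41cd448 from the messages file example gives 13575 and 54344.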
Tuesday September 23, 2008 Today the entire family left the house in the morning by bike.
It does not get much better than that.
Sunday September 21, 2008 We did a 63 mile round trip via Henfold Lakes but out via a strange route taking in Ripley, Newlands Corner and then Cranleigh. When descending what is a very exciting hill towards Cranleigh at 40mph I had the added thrill of hitting a large stone in the road and having my front tyre deflate instantly. As I braked as hard as I could using my back brake, and slowed very slowly, I was able to see the tyre not coming off the rim but also not doing much in the way of letting me steer around the bend that was approaching. The odd thing was that what went through my mind was the question: "Am I using the rear brake?", which I was. Thankfully the rear brake was able to overcome the 1:7 hill and bring me to a halt before the tyre came off or I hit anything. A tribute to the Continental GP4000's ability to be ridden when flat.
When I went to put the wheel back on after fixing the tube the spring on the front brake decided to break. I can't really complain since it is 9 years old, but it is the first time I have ever had a brake spring fail (Campagnolo Record). I should be able to get a new spring if my local bike shop comes good with stocking Campagnolo spares, something they say they are going to do. The failure meant that if I used the front brake I had to manually spring the callipers apart for the rest of the ride. Not hard to do, but enough to mean I won't be commuting on the bike again this year. I now have quite a large number of things to fix on my summer bike, although not enough to let me upgrade to the new 11 speed Campagnolo group set, alas.
The rest of the ride was uneventful and we were able to take advantage of the Indian summer we appear to be having.
Saturday September 20, 2008 There is an amazing advert running on local radio at the moment. The premise is that you don't need to waste money on expensive brand name trainers, but to get the best out of education you have to have Microsoft Office 2008. The irony of the advert is that you would be wasting your money not on brand name trainers but instead on a “brand name” office suite.
Of course for infinitely less, yes free, you can have OpenOffice.org.
I know I've been here before but it is worth repeating.
Sunday September 14, 2008 Six riders, good weather and a great cafe. We went out by what is not the usual route, via Old Woking, Normandy, Ash Green and The Sands. The return trip was south of Guildford via Albury, where we managed to lose two riders. One insisted we not wait for him, and the other did not see us turn back towards Dorking so we could go up Coombe Bottom, and so went up Newlands. Then even on the descent from Coombe Bottom we got split up, but managed to regroup as it turned out the slower riders were in front and so were caught.
Ended up doing 79 miles and No RAIN!!!!!
Saturday September 13, 2008 I've been using samba at home for a while now but would like to migrate over to the new CIFS implementation provided by Solaris. Since there are some subtle differences in what each service provides* this means a slower migration.
Obviously you can't configure both services to run on the same system so to get around this I am going to migrate all the SMB services into a zone running on the server and then allow the global zone to act as the native CIFS service.
So I configured a zone called, rather dully, “samba” with loop back access to all the file systems that I share via SMB and added the additional privilege “sys_smb” so that the daemons could bind to the smb service port.
zonecfg:samba> set limitpriv=default,sys_smb
The end command only makes sense in the resource scope.
zonecfg:samba> commit
zonecfg:samba> exit
Now you can configure the zone in the usual way to run samba. I simply copied the smb.conf and smbpasswd files from the global zone using zcp.
Once that was done and samba enabled in smf I could then enable the native CIFS server in the global zone and have the best of both worlds.
*) The principal difference I see is that the native smb service does not cross file systems mount points. So if you have a hierarchy of file systems you have to mount each one on the client. With samba you can just mount the root and it will see everything below.
Wednesday September 10, 2008 This morning a colleague, let's call him Adrian, popped round to ask me an important question:
Will Lance return to professional riding and ride in the Tour?
I said definitely not. He has just signed up for the doping tests so he can compete in the Leadville 100.....
An hour later another colleague sent me a URL via IM: http://news.bbc.co.uk/sport1/hi/other_sports/cycling/7605378.stm
It is going to be interesting.....but clearly I don't have a career in sports predictions.
Sunday September 07, 2008 Having run out of space in the root file systems and being close to full on the zpool, the final straw was being able to get two 750GB SATA drives for less than £100, that and knowing that snapshots no longer cause resilvering to restart, which greatly simplifies the data migration. So I'm replacing the existing drives with new ones. Since the enclosure I have can only hold three drives this involved a two stage upgrade so that at no point was my data on less than two drives. First stage was to install one drive and label it:
partition> print
Current partition table (unnamed):
Total disk cylinders available: 45597 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0       root    wm   39383 - 41992       39.99GB    (2610/0/0)     83859300
  1 unassigned    wm       0                0         (0/0/0)               0
  2     backup    wu       0 - 45596      698.58GB    (45597/0/0)  1465031610
  3 unassigned    wm       0                0         (0/0/0)               0
  4 unassigned    wm   36773 - 39382       39.99GB    (2610/0/0)     83859300
  5 unassigned    wm   45594 - 45596       47.07MB    (3/0/0)           96390
  6 unassigned    wm   36379 - 36772        6.04GB    (394/0/0)      12659220
  7 unassigned    wm       3 - 36378      557.31GB    (36376/0/0)  1168760880
  8       boot    wu       0 -     0       15.69MB    (1/0/0)           32130
  9 alternates    wm       1 -     2       31.38MB    (2/0/0)           64260

partition>
These map to the partitions from the original set up, only they are bigger. I'm confident that by the time the 40GB root disks are too small I will have migrated to ZFS for root. So this looks like a good long term solution.
pearson # dumpadm -d /dev/dsk/c2d0s6
Dump content: kernel pages
Dump device: /dev/dsk/c2d0s6 (dedicated)
Savecore directory: /var/crash/pearson
Savecore enabled: yes
pearson # metadb -a -c 3 /dev/dsk/c2d0s5
pearson # egrep c2d0 /etc/lvm/md.tab
d12 1 1 /dev/dsk/c2d0s0
d42 1 1 /dev/dsk/c2d0s4
pearson # metainit d12
d12: Concat/Stripe is setup
pearson # metainit d42
d42: Concat/Stripe is setup
pearson # metattach d0 d12
d0: submirror d12 is attached
pearson #
Now wait until the disk has completed resyncing. While you can do this in parallel, this causes the disk heads to move more, so overall it is slower. Left to do just one partition at a time it is really quite quick:
extended device statistics
device r/s w/s kr/s kw/s wait actv svc_t %w %b
cmdk0 357.2 0.0 18321.8 0.0 2.6 1.1 10.4 52 58
cmdk1 0.0 706.4 0.0 36147.4 1.0 0.5 2.2 23 27
cmdk2 350.2 0.0 17929.6 0.0 0.4 0.3 2.1 12 15
md1 70.0 71.0 35859.2 36371.5 0.0 1.0 7.1 0 100
md3 0.0 71.0 0.0 36371.5 0.0 0.3 3.8 0 27
md15 35.0 0.0 17929.6 0.0 0.0 0.6 16.5 0 58
md18 35.0 0.0 17929.6 0.0 0.0 0.1 4.3 0 15
pearson # metastat d0
d0: Mirror
Submirror 0: d10
State: Okay
Submirror 1: d11
State: Okay
Submirror 2: d12
State: Resyncing
Resync in progress: 70 % done
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 20482875 blocks (9.8 GB)
d10: Submirror of d0
State: Okay
Size: 20482875 blocks (9.8 GB)
Stripe 0:
Device Start Block Dbase State Reloc Hot Spare
c1d0s0 0 No Okay Yes
d11: Submirror of d0
State: Okay
Size: 20482875 blocks (9.8 GB)
Stripe 0:
Device Start Block Dbase State Reloc Hot Spare
c5d0s0 0 No Okay Yes
d12: Submirror of d0
State: Resyncing
Size: 83859300 blocks (39 GB)
Stripe 0:
Device Start Block Dbase State Reloc Hot Spare
c2d0s0 0 No Okay Yes
Device Relocation Information:
Device Reloc Device ID
c1d0 Yes id1,cmdk@AST3320620AS=____________3QF09GL1
c5d0 Yes id1,cmdk@AST3320620AS=____________3QF0A1QD
c2d0 Yes id1,cmdk@AST3750840AS=____________5QD36N5M
pearson #
Once complete do the other root disk:
pearson # metattach d4 d42
d4: submirror d42 is attached
pearson #
Finally attach slice 7 to the zpool:
pearson # zpool attach -f tank c1d0s7 c2d0s7
pearson # zpool status
pool: tank
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress for 0h0m, 0.00% done, 252h52m to go
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror ONLINE 0 0 0
c1d0s7 ONLINE 0 0 0
c5d0s7 ONLINE 0 0 0
c2d0s7 ONLINE 0 0 0
errors: No known data errors
pearson #
The initial estimate is more pessimistic than reality but it still took over 11 hours to complete. The next thing was to shut the system down and replace one of the old drives with the new one. Once this was done the final slices in use from the old drive could be detached and, in the case of the metadevices, cleared.
: pearson FSS 4 $; zpool status
pool: tank
state: ONLINE
scrub: scrub completed after 11h8m with 0 errors on Sat Sep 6 20:58:05 2008
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror ONLINE 0 0 0
c5d0s7 ONLINE 0 0 0
c2d0s7 ONLINE 0 0 0
errors: No known data errors
: pearson FSS 5 $;
: pearson FSS 5 $; metastat
d6: Mirror
Submirror 0: d62
State: Okay
Submirror 1: d63
State: Okay
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 12659220 blocks (6.0 GB)
d62: Submirror of d6
State: Okay
Size: 12659220 blocks (6.0 GB)
Stripe 0:
Device Start Block Dbase State Reloc Hot Spare
c5d0s6 0 No Okay Yes
d63: Submirror of d6
State: Okay
Size: 12659220 blocks (6.0 GB)
Stripe 0:
Device Start Block Dbase State Reloc Hot Spare
c2d0s6 0 No Okay Yes
d4: Mirror
Submirror 0: d42
State: Okay
Submirror 1: d43
State: Okay
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 83859300 blocks (39 GB)
d42: Submirror of d4
State: Okay
Size: 83859300 blocks (39 GB)
Stripe 0:
Device Start Block Dbase State Reloc Hot Spare
c5d0s4 0 No Okay Yes
d43: Submirror of d4
State: Okay
Size: 83859300 blocks (39 GB)
Stripe 0:
Device Start Block Dbase State Reloc Hot Spare
c2d0s4 0 No Okay Yes
d0: Mirror
Submirror 0: d12
State: Okay
Submirror 1: d13
State: Okay
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 83859300 blocks (39 GB)
d12: Submirror of d0
State: Okay
Size: 83859300 blocks (39 GB)
Stripe 0:
Device Start Block Dbase State Reloc Hot Spare
c5d0s0 0 No Okay Yes
d13: Submirror of d0
State: Okay
Size: 83859300 blocks (39 GB)
Stripe 0:
Device Start Block Dbase State Reloc Hot Spare
c2d0s0 0 No Okay Yes
Device Relocation Information:
Device Reloc Device ID
c5d0 Yes id1,cmdk@AST3750840AS=____________5QD36N5M
c2d0 Yes id1,cmdk@AST3750840AS=____________5QD3EQEX
: pearson FSS 6 $;
The old drive is still in the system but currently only has a metadb on it:
: pearson FSS 6 $; metadb -i
flags first blk block count
a m p luo 16 8192 /dev/dsk/c1d0s5
a p luo 8208 8192 /dev/dsk/c1d0s5
a p luo 16400 8192 /dev/dsk/c1d0s5
a p luo 16 8192 /dev/dsk/c5d0s5
a p luo 8208 8192 /dev/dsk/c5d0s5
a p luo 16400 8192 /dev/dsk/c5d0s5
a luo 16 8192 /dev/dsk/c2d0s5
a luo 8208 8192 /dev/dsk/c2d0s5
a luo 16400 8192 /dev/dsk/c2d0s5
r - replica does not have device relocation information
o - replica active prior to last mddb configuration change
u - replica is up to date
l - locator for this replica was read successfully
c - replica's location was in /etc/lvm/mddb.cf
p - replica's location was patched in kernel
m - replica is master, this is replica selected as input
t - tagged data is associated with the replica
W - replica has device write errors
a - replica is active, commits are occurring to this replica
M - replica had problem with master blocks
D - replica had problem with data blocks
F - replica had format problems
S - replica is too small to hold current data base
R - replica had device read errors
B - tagged data associated with the replica is not valid
: pearson FSS 7 $;
I'm tempted to leave the third disk in the system so that the disk suite configuration will always have a quorum if a single drive fails. However since the BIOS only seems to be able to boot from the first disk drive this may be pointless.
I'm now keenly interested in bug 6592835 “resilver needs to go faster” since if a disk did fail I don't fancy waiting more than 24 hours, after I have sourced a new drive, for the data to sync when the disks fill. The disk suite devices managed to drive the disk at over 40MB/sec while ZFS achieved 5MB/sec.
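To put rough numbers on that worry, here is a back of the envelope calculation in C. It is only a sketch: the function name is made up, it assumes the rates above are megabytes per second sustained for the whole copy, and it takes the 557.31GB size of slice 7 from the partition table as the amount of data in a full pool.

```c
/*
 * Hours needed to copy the given number of gigabytes at a sustained
 * rate in megabytes per second (1GB taken as 1024MB).
 */
double
hours_to_copy(double gigabytes, double mbytes_per_sec)
{
    double seconds = (gigabytes * 1024.0) / mbytes_per_sec;

    return (seconds / 3600.0);
}
```

With those assumptions hours_to_copy(557.31, 5.0) comes out at a little under 32 hours, while hours_to_copy(557.31, 40.0) is just under 4 hours.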
Sunday August 31, 2008 There were six in Molesey this morning braving the fog and laughing in the face of the weather forecast. Those six made it to Epsom before the fog cleared, to be replaced by the thunder and heavy rain that had been forecast. The idea of climbing and riding over Epsom Downs in a thunderstorm did not fill anyone with joy. So after briefly taking shelter in Epsom and failing to find a café open we rode back as fast as our legs would allow. If the people at the Met Office could have seen us they would have been in stitches.
Got home very very wet having ridden 23 miles.
Next weekend we are supposed to be cycling to Yeovil which is about 120 miles, I hope the weather is better.
Saturday August 30, 2008 I rode my blue bike to work this week for two reasons. First, I needed to carry more things home than usual and so had a pannier, and second, to make sure it was all ready for the winter. It turns out that all is not well. Having replaced the chain and cassette at the end of the winter, I now know that the Shimano Ultegra chain rings are only marginally more hard wearing than the 105 rings.
Prior to the bike's last major rebuild the 105 rings had only lasted one winter (3000 miles or so) and were replaced with TA rings. When the whole chain set was replaced it had Ultegra rings on so I left them. The Ultegra rings have lasted three winters (about 10,000 miles), not that bad until you compare them with my summer bike. It is 9 years old and has done over 36,000 miles and still has the same chain rings (Campagnolo Record). The maintenance schedule of both bikes is the same, new chain and cassette after 3,500 miles. Yes the summer bike is lighter and does not go out in the winter but I still don't think that explains the drastic difference in the wear. I can't help but agree with those that claim Shimano chain rings are made of grey cheese.
Luckily I have a spare set of chain rings, Stronglight ones, that I spent a happy hour fitting.
Thursday August 28, 2008 Since the home server has been snapping regularly I have had to choose between snapshots and scrubbing, and I chose snapshots. User error is more likely than hardware failure, and scrubbing is really about seeing those errors sooner so you don't get an unrecoverable failure due to having two problems at once. However I would rather not have to choose.
So I was particularly pleased to see that build 94 contains the fix for this bug:
6343667 scrub/resilver has to start over when a snapshot is taken
So today the home server had its first scrub in years and it scrubbed up well:
: pearson FSS 5 $; pfexec zpool status
pool: tank
state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
pool will no longer be accessible on older software versions.
scrub: scrub completed after 12h42m with 0 errors on Thu Aug 28 20:12:36 2008
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
mirror ONLINE 0 0 0
c1d0s7 ONLINE 0 0 0
c5d0s7 ONLINE 0 0 0
errors: No known data errors
: pearson FSS 6 $;
When I upgrade the pool, after the other live upgrade boot environment can support this pool version, there is the promise of a faster scrub, although this scrub happened during the day and I also backed up the pool using zfs_backup during the same time.