The dot in ... --- ...

Chris Gerhard's Weblog

Main | Next page »

20091113 Friday November 13, 2009

CIFS, ACls, permissions and iTunes

If you share a file system using the CIFS server (not SAMBA) and create a file in that file system using Windows XP the file ends up with these strange permissions and an ACL like this:

: pearson FSS 12 $; ls -vd Bad
d---------+  2 cjg      staff          2 Nov 13 17:11 Bad
     0:user:cjg:list_directory/read_data/add_file/write_data/add_subdirectory
         /append_data/read_xattr/write_xattr/execute/delete_child
         /read_attributes/write_attributes/delete/read_acl/write_acl
         /write_owner/synchronize:allow
     1:group:2147483648:list_directory/read_data/add_file/write_data

         /add_subdirectory/append_data/read_xattr/write_xattr/execute

         /delete_child/read_attributes/write_attributes/delete/read_acl

         /write_acl/write_owner/synchronize:allow

: pearson FSS 13 $; 


The first thing that riles UNIX some users is the lack of any file permissions, although things seem to work fine. The strange group ACL is for the local WINDOWS SYSTEM group. However the odd thing is for me it renders iTunes on the Windows system unable to see the files that it has created.

The solution is to add a default ACL to the root of the file system (well to every object in the file system if the file system is not new) that looks like this:

A+owner@:full_set:fd:allow,everyone@:read_set/execute:fd:allow

So this has the rather pleasant side effect of setting the UNIX permissions to something more recognisable:

: pearson FSS 20 $; ls -vd Good
drwxr-xr-x+  2 cjg      staff          2 Nov 13 18:16 Good
     0:owner@:list_directory/read_data/add_file/write_data/add_subdirectory
         /append_data/read_xattr/write_xattr/execute/delete_child
         /read_attributes/write_attributes/delete/read_acl/write_acl
         /write_owner/synchronize:file_inherit/dir_inherit/inherited:allow
     1:everyone@:list_directory/read_data/read_xattr/execute/read_attributes
         /read_acl:file_inherit/dir_inherit/inherited:allow
: pearson FSS 21 $; 

and the even more pleasant side effect of making iTunes works again!


( Nov 13 2009, 06:56:00 PM GMT ) Permalink Trackback

   

20091112 Thursday November 12, 2009

The Kings of Computing use dtrace!

I've said many times that dtrace is not just a wonderful tool for developers and performance gurus. The Kings of Computing, which are of course System Admins, also find it really useful.

There is an ancient version of make called Parallel make that occasionally suffers from a bug (1223984) where it gets into a loop like this:

waitid(P_ALL, 0, 0x08047270, WEXITED|WTRAPPED)	Err#10 ECHILD
alarm(0)					= 30
alarm(30)					= 0
waitid(P_ALL, 0, 0x08047270, WEXITED|WTRAPPED)	Err#10 ECHILD
alarm(0)					= 30
alarm(30)					= 0
waitid(P_ALL, 0, 0x08047270, WEXITED|WTRAPPED)	Err#10 ECHILD

This will then consume a CPU and the users CPU shares. The application is never going to be fixed so the normal advice is not to use it. However since it can be NFS mounted from anywhere I can't reliably delete all copies of it so occasionally we will see run away processes on our build server.

It turns out this is a snip to fix with dtrace. Simply look for cases where the wait system call returns an error and errno is set to ECHILD (10) and if that happens 10 times in a row for the same process and that process does not call fork then stop the process.

The script is simple enough for me to just do it on the command line:


# dtrace -wn 'syscall::waitsys:return / arg1 <= 0 && 
execname == "make.bin" && errno == 10  && waitcount[pid]++ > 20 / {

	stop();

	printf("uid %d pid %d", uid, pid) }

syscall::forksys:return / arg1 > 0 / { waitcount[pid] = 0 }'
dtrace: description 'syscall::waitsys:return ' matched 2 probes
dtrace: allowing destructive actions
CPU     ID                    FUNCTION:NAME
  2  20588                   waitsys:return uid 36580 pid 29252
  3  20588                   waitsys:return uid 36580 pid 2522
  5  20588                   waitsys:return uid 36580 pid 28663
  7  20588                   waitsys:return uid 36580 pid 29884
 10  20588                   waitsys:return uid 36580 pid 941
 15  20588                   waitsys:return uid 36580 pid 1098

This was way easier then messing around with prstat, truss and pstop!


( Nov 12 2009, 03:33:23 PM GMT ) Permalink Trackback

   

20091009 Friday October 09, 2009

Preparing for OpenSolaris @ home

Since the "nevada" builds of Solaris next are due to end soon and for some time the upgrade of my home server has involved more than a little bit of TLC to get it to work I will be moving to an OpenSolaris build just as soon as I can.

However before I can do this I need to make sure I have all thesoftware to provide home service. This is really a note to myself to I don't forget anything.

I'm going to see if I can jump through the legal hoops that will allow me to contribute the builds to the contrib repository via Source Juicer. However as this is my spare time I don't know whether the legal reviews will be funded.

Due to the way OpenSolaris is delivered I also need to be more careful about what I install. rather than being able to choose everything. First I need my list from my laptop. Then in addtion to that I'll need

Oh and I'll need the Sun Ray server software.


( Oct 09 2009, 07:07:53 PM BST ) Permalink Trackback

   

20090915 Tuesday September 15, 2009

Moving to an OpenSolaris Sun Ray

Today I took the plunge and moved from working on our Nevada based Sun Ray Servers to one running OpenSolaris. So that I could get the full OpenSolaris look and feel I first purged my home directory of a number of configuration files and directories using a script like1 this:

#!/bin/ksh -p
TARGET=b4OpenSolaris
test -d $HOME/$TARGET || mkdir $HOME/$TARGET
mv $HOME/.ICEauthority $HOME/$TARGET
mv $HOME/.cache $HOME/$TARGET
mv $HOME/.chewing $HOME/$TARGET
mv $HOME/.config $HOME/$TARGET
mv $HOME/.dbus $HOME/$TARGET
mv $HOME/.dmrc $HOME/$TARGET
mv $HOME/.gconf $HOME/$TARGET
mv $HOME/.gconfd $HOME/$TARGET
mv $HOME/.gksu.lock $HOME/$TARGET
mv $HOME/.gnome2 $HOME/$TARGET
mv $HOME/.gnome2_private $HOME/$TARGET
mv $HOME/.gstreamer-0.10 $HOME/$TARGET
mv $HOME/.gtk-bookmarks $HOME/$TARGET
mv $HOME/.iiim $HOME/$TARGET
mv $HOME/.local $HOME/$TARGET
mv $HOME/.nautilus $HOME/$TARGET
mv $HOME/.printer-groups.xml $HOME/$TARGET
mv $HOME/.rnd $HOME/$TARGET
mv $HOME/.sunstudio $HOME/$TARGET
mv $HOME/.sunw $HOME/$TARGET
mv $HOME/.updatemanager $HOME/$TARGET
mv $HOME/.xesam $HOME/$TARGET
mv $HOME/.xsession-errors $HOME/$TARGET

I generated the list by installing OpenSolaris in a VirtualBox and then logging in and doing a bit of browsing and general usage and then seeing was was created. Additionally “.mozilla” was created but I chose to retain that so that I can keep all the history that is in my browser.

Once logged in I have removed the update-manager icon as I am not the administrator. I have also removed the power notification and network monitor as they provide no useful data on a Sun Ray server.

Using “System->Preferences->Startup Applications” I unchecked the codeina update notifier and added my script for updating my IM status.

So far so good but it is taking a while to get used to the menu being a the top and the window list at the bottom of the screen.

1Like as in similar to and not this exact script as mine had my home directory hard coded into it.


( Sep 15 2009, 05:15:55 PM BST ) Permalink Trackback

   

20090909 Wednesday September 09, 2009

Understanding iostat

1Iostat has been around for years and until Dtrace came along and allowed us to look more deeply into the kernel was the tool for analysing how the io subsystem was working in Solaris. However interpreting the output has proved in the past to cause problems.

First if you are looking at latency issues it is vital that you use the smallest time quantum to iostat you can, which as of Solaris 10 is 1 second. Here is a sample of some output produced from “iostat -x 1”:

                  extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
                 extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       5.0 1026.5    1.6 1024.5  0.0 25.6   24.8   0  23 
                 extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 


The first thing to draw your attention to is the Column “%b” which the manual tells you is:

%b percent of time the disk is busy (transactions in progress)

So in this example the disk was “busy”, ie had at least one transaction (command) in progress for 23% of the time period. Ie 0.23 seconds as the time period was 1 second.

Now look at the “actv” column. Again the manual says:

actv average number of transactions actively being serviced (removed from the queue but not yet completed)
This is the number of I/O operations accepted, but not yet serviced, by the device.
In this example the average number of transactions outstanding for this time quantum was 25.6. Now here is the bit that is so often missed. Since we know that all the transactions actually took place within 0.23 seconds and were not evenly spread across the full second the average queue depth when busy was 100/23 * 25.6 or 111.3. Thanks to dtrace and this D script you can see the actual IO pattern2:

Even having done the maths iostat smooths out peaks in the IO pattern and thus under reports the peak number of transactions as 103.0 when the true value is 200.
The same is true for the bandwidth. The iostat above comes reports 1031.5 transactions a second (r/s + w/s) again though this does not take into account that all those IO requests happened in 0.23 seconds. So the true figure for the device would be 1031.5 * 100/23 which is 4485 transations/sec.
If we up the load on the disk a bit then you can conclude more from the iostat:
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
                 extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       5.0 2155.7    1.6 2153.7 30.2 93.3   57.1  29  45 
                 extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       0.0 3989.1    0.0 3989.1 44.6 157.2   50.6  41  83 
                 extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 

Since the %w column is non zero, and from the manual %w is:

%w percent of time there are transactions waiting for service (queue non-empty)

This is telling us that the device's active queue was full. So on the third line of the above output the devices queue was full for 0.41 seconds. Since the queue depth is quite easy to find out3 and in this case was 256, you can deduce that the queue depth for that 0.41 seconds was 256. Thus the average for the 0.59 seconds left was (157.2-(0.41*256))/0.59 which is 88.5. The graph of the dtrace results tells a different story:


These examples demonstrate what can happen if your application dumps a large number of transactions onto a storage device while the through put will be fine and if you look at iostat data things can appear ok if the granularity of the samples is not close to your requirement for latency any problem can be hidden by the statistical nature of iostat.

1Apologies to those who saw a draft of this on my blog briefly.

2The application creating the IO attempts to keep 200 transations in the disk at all the time. It is interesting to see that it fails as it does not get notification of the completion of the IO until all or nearly all the outstanding transactions have completed.

3This command will do it for all the devices on your system:

echo '*sd_state::walk softstate | ::print -d -at "struct sd_lun" un_throttle' | pfexec mdb -k

however be warned the throttle is dynamic so dtrace gives the real answer.


( Sep 09 2009, 02:21:35 PM BST ) Permalink Trackback

   

20090907 Monday September 07, 2009

Recovering /etc/name_to_major

What do you do if you manage to delete or corrupt /etc/name_to_major? Assuming you don't have a backup a ZFS snapshot or an alternative boot environment, in which case you probably are in the wrong job, you would appear to be in trouble.

First thing is not to panic. Do not reboot the system. If you do that it won't boot and your day has just got a whole lot worse. The data needed to rebuild /etc/name_to_major is in the running kernel so it can be rebuilt from that. If your system an x86 system it is also in the boot archive.

However if you have no boot archive or have over written it with the bad name_to_system this script will extract it from the kernel, all be it slowly:

#!/bin/ksh
i=0
while ((i < 1000 ))
do
print "0t$i::major2name" | mdb -k | read x && echo $x $i
let i=i+1 
done

1Redirect that into a file then move the remains of your /etc/name_to_major out of the way and copy the file in place.

Next time make sure you have a back up or snapshot or alternative boot environment!

1You will see lots of errors of the form “mdb: failed to convert major number to name” these are to be expected. They can be limited to just one by adding “|| break” to the mdb line but that assumes that you have no holes in the major number listings which you may have if you have removed a device, so best to not risk that.


( Sep 07 2009, 04:45:50 PM BST ) Permalink Trackback

   

20090903 Thursday September 03, 2009

Improved sd.conf format

Editing sd.conf has always been somewhat difficult thanks to it not being a documented interface and that the interface was never inteded to be exposed and it was even architecture specific. Fortunately Micheal documented it, which meant that it was known even if syntax remained obscure.

However after ARC case 2008/465 was approved and the changes pushed as part of bug 6518995 you can now use more a human readable syntax1:

sd-config-list=
        "ATA     VBOX HARDDISK", "disksort:false";

As it turns out the “disksort”2 option along with the thottle-max and throttle-min are the ones I most often want to tune.

Here is the current list of tunables lifted straight from the ARC case.


Tunable_Name

Commitment

Data_Type

cache-nonvolatile

Private

BOOLEAN

controller-type

Private

UINT32

delay-busy

Committed

UINT32

disksort

Private

BOOLEAN

timeout-releasereservation

Private

UINT32

reset-lun

Private

BOOLEAN

retries-busy

Private

UINT32

retries-timeout

Committed

UINT32

retries-notready

Private

UINT32

retries-reset

Private

UINT32

throttle-max

Private

UINT32

throttle-min

Private

UINT32


1This reminds me of the change to /etc/printcap that allowed you to specify the terminal flags as strings rather than as a bitmap. All the mystery seemed to be removed!

2While I used disksort as an example for this case I can't think of any reason why you would have it enabled for a virtual disk in VirtualBox.


( Sep 03 2009, 04:49:40 PM BST ) Permalink Trackback

   

20090827 Thursday August 27, 2009

Starting remote X applications

Someone has posted a script to start a remote xterm on BigAdmin which exposes a number of issues I thought it would be better if google stood some chance of finding a better answer or at least an answer that does not rely on inherently insecure settings.

Remote X applications should be started using ssh -X so that the X traffic is encrypted and if you add -C compressed which can be a significant performance boost. So a script to do this could be handy although to be honest knowing the ssh options or having them set as the default in your .ssh/config is just as easy:

: exdev.eu FSS 31 $; egrep '^(Compress|ForwardX)' ~/.ssh/config
ForwardX11 yes
Compression yes
: exdev.eu FSS 32 $; ssh -f pearson /usr/X11/bin/xterm         
: exdev.eu FSS 33 $; 

or more usefully to start graphical tools:

: exdev.eu FSS 33 $; ssh -f pearson pfexec /usr/sadm/admin/bin/dhcpmgr
: exdev.eu FSS 34 $; 

However if you really want a script to do it here is one that will and no need to mess with your .ssh/config

#!/bin/ksh
REMOTE_PATH=${REMOTE_PATH:-${PATH}}
APP=${0##*/}
if (( $# < 1 )) 
then
        print "USAGE: ${APP} host [args]" >&2
        exit 1
fi
host=$1
shift
exec /usr/bin/ssh -o ClearAllForwardings=yes -C -Xfn $host \
        PATH=${REMOTE_PATH} pfexec ${APP#r} $@

If you save this into a file called “rxterm” then running “rxterm remotehost” will start an xterm on the system remotehost assuming you can ssh to that system.

More entertainingly you can save it as “rdhcpmgr” and it will start the dhcpmgr program on a remote system and securely display it on your current display (assuming your PATH includes /usr/sadm/admin/bin and your profile allows you access to that application). You can use it to start any application by simple naming it after the application in question with a preceding “r”.


( Aug 27 2009, 01:55:50 PM BST ) Permalink
Trackback

   

20090826 Wednesday August 26, 2009

More OpenSolaris updates

As I have lived with OpenSolaris I've got used to the updates occuring automatically as you would with most modern Operating Systems. What is a real joy is that it creates a new boot environment for the updates so in the event that one was toxic you can always just roll back. It also gives you a handy reference as to how many updates you have done. Number 13 has just happened for me:

cjg@brompton:~$ beadm list
BE               Active Mountpoint Space  Policy Created          
--               ------ ---------- -----  ------ -------          
b4nvidia-bin-fix -      -          86.0K  static 2009-06-06 17:27 
opensolaris-10   -      -          15.68M static 2009-07-18 08:04 
opensolaris-11   -      -          33.73M static 2009-07-26 09:42 
opensolaris-12   N      /          266.5K static 2009-08-24 13:26 
opensolaris-13   R      -          16.42G static 2009-08-26 18:58 
opensolaris-4    -      -          22.19M static 2009-01-30 21:42 
opensolaris-5    -      -          21.30M static 2009-02-25 20:14 
opensolaris-6    -      -          45.87M static 2009-04-10 18:17 
opensolaris-7    -      -          37.83M static 2009-06-01 20:51 
opensolaris-8    -      -          19.03M static 2009-07-02 18:55 
opensolaris-9    -      -          11.56M static 2009-07-10 07:30 
cjg@brompton:~$ 

I'm going to have to bite the bullet on my home server and swith to OpenSolaris soon as the nevada builds stop. Alas with term time approaching I don't think I will be allowed significant down time for a while.


( Aug 26 2009, 07:15:04 PM BST ) Permalink
Trackback

   

20090731 Friday July 31, 2009

Adding a Dtrace provider to the kernel

Since writing scsi.d I have been pondering if there should really be a scsi dtrace provider that allows you to do all that scsi.d does and more. Since the push of 6797025 that both removed the main reason for not doing this and also gave impetus to do it as scsi.d needed incompatible changes to use the new return function as the return “probe”.

This work is very much work in progress and may or may not see the light of day due to some other issues around scsi addressing, however I thought I would document how I added a kernel dtrace provider so if you want to you don't have to do so much searching1.

Adding the probes themselves is simplicity itself using the DTRACE_PROBEN() macros. Following the convention I added this macro:

#define	DTRACE_SCSI_2(name, type1, arg1, type2, arg2)			\
	DTRACE_PROBE2(__scsi_##name, type1, arg1, type2, arg2);

to usr/src/uts/common/sys/sdt.h. Then after including <sys/sdt.h> in each file I put this macro in each of the places I wanted my probes:

 	DTRACE_SCSI_2(transport, struct scsi_pkt *, pkt,
 	    struct scsi_address *, P_TO_ADDR(pkt))

The bit that took a while to find was how to turn these into a provider. To do that edit the file “usr/src/uts/common/dtrace/sdt_subr.c” and create the attribute structure2:

 static dtrace_pattr_t scsi_attr = {
 { DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
 { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
 { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
 { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_ISA },
 { DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
 };

and add it to the sdt_providers array:


	{ "scsi", "__scsi_", &scsi_attr, 0 },

than add the probes to the sdt_args array:

	{ "scsi", "transport", 0, 0, "struct scsi_pkt *", "scsi_pktinfo_t *"},
	{ "scsi", "transport", 1, 1, "struct scsi_address *", "scsi_addrinfo_t *"},
	{ "scsi", "complete", 0, 0, "struct scsi_pkt *", "scsi_pktinfo_t *"},
	{ "scsi", "complete", 1, 1, "struct scsi_address *", "scsi_addrinfo_t *"},

Finally you need to create a file containing the definitions of the output structures, scsi_pktinfo_t and scsi_addrinfo_t and define translators for them. That goes into /usr/lib/dtrace and I called mine scsa.d (there is already one called scsi.d).

/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
 * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

#pragma D depends_on module scsi
#pragma D depends_on provider scsi

inline char TEST_UNIT_READY = 0x0;
#pragma D binding "1.0" TEST_UNIT_READY
inline char REZERO_UNIT_or_REWIND = 0x0001;
#pragma D binding "1.0" REZERO_UNIT_or_REWIND

inline char SCSI_HBA_ADDR_COMPLEX = 0x0040;
#pragma D binding "1.0" SCSI_HBA_ADDR_COMPLEX

typedef struct scsi_pktinfo {
	caddr_t pkt_ha_private;
	uint_t	pkt_flags;
	int	pkt_time;
	uchar_t *pkt_scbp;
	uchar_t *pkt_cdbp;
	ssize_t pkt_resid;
	uint_t	pkt_state;
	uint_t 	pkt_statistics;
	uchar_t pkt_reason;
	uint_t	pkt_cdblen;
	uint_t	pkt_tgtlen;
	uint_t	pkt_scblen;
} scsi_pktinfo_t;

#pragma D binding "1.0" translator
translator scsi_pktinfo_t  < struct scsi_pkt *P > {
	pkt_ha_private = P->pkt_ha_private;
	pkt_flags = P->pkt_flags;
	pkt_time = P->pkt_time;
	pkt_scbp = P->pkt_scbp;
	pkt_cdbp = P->pkt_cdbp;
	pkt_resid = P->pkt_resid;
	pkt_state = P->pkt_state;
	pkt_statistics = P->pkt_statistics;
	pkt_reason = P->pkt_reason;
	pkt_cdblen = P->pkt_cdblen;
	pkt_tgtlen = P->pkt_tgtlen;
	pkt_scblen = P->pkt_scblen;
};

typedef struct scsi_addrinfo {
	struct scsi_hba_tran	*a_hba_tran;
	ushort_t a_target;	/* ua target */
	uchar_t	 a_lun;		/* ua lun on target */
	struct scsi_device *a_sd;
} scsi_addrinfo_t;

#pragma D binding "1.0" translator
translator scsi_addrinfo_t  < struct scsi_address *A > {
	a_hba_tran = A->a_hba_tran;
	a_target = !(A->a_hba_tran->tran_hba_flags & SCSI_HBA_ADDR_COMPLEX) ?
		0 : A->a.spi.a_target;
	a_lun = !(A->a_hba_tran->tran_hba_flags & SCSI_HBA_ADDR_COMPLEX) ?
		0 : A->a.spi.a_lun;
	a_sd = (A->a_hba_tran->tran_hba_flags & SCSI_HBA_ADDR_COMPLEX) ?
		A->a.a_sd : 0;
};

again this is just enough to get going so I can see and use the probes:

jack@v4u-2500b-gmp03:~$ pfexec dtrace -P scsi -l
   ID   PROVIDER            MODULE                          FUNCTION NAME
 1303       scsi              scsi                    scsi_transport transport
 1313       scsi              scsi                 scsi_hba_pkt_comp complete
jack@v4u-2500b-gmp03:~$ 

While this all works well for parallel scsi getting the address of devices on fibre is not clear to me. If you have any suggestions I'm all ears.

1If there is such a document already in existence then please add a comment. I will just wish I could have found it.

2These may not be the right attributes but gets me to the point it compiles and can be used in a PoC.


( Jul 31 2009, 01:38:27 PM BST ) Permalink Trackback

   

20090722 Wednesday July 22, 2009

1,784,593 the highest load average ever?

As I cycled home I realised there was one more thing I could do on the exploring the limits of threads and processes on Solaris. That would be the highest load average ever. Modifying the thread creator program to not have each thread sleep once started but instead wait until all the threads were set up and then go into an infinite compute loop that should get me the highest load average possible on a system or so you would think.

With 784001 threads the load stabilised at:

10:16am  up 18:07,  2 users,  load average: 22114.50, 22022.68, 21245.781

Which was somewhat disappointing. However an earlier run with just 780,000 threads managed to peak the load at 1,784,593 while it was exiting:

 7:44am  up 15:35,  2 users,  load average: 1724593.79, 477392.80, 188985.10

I' still pondering how 780000 thread can result in a load average of more than 1 million.


( Jul 22 2009, 11:48:52 AM BST ) Permalink Trackback

   

20090717 Friday July 17, 2009

10 Steps to OpenSolaris Laptop Heaven

If you have recently come into possession of a Laptop onto which to load Solaris then here are my top tips:

  1. Install OpenSolaris. At the time of writing the release is 2009.06, install that, parts of this advice may become obsolete with later releases. Do not install Solaris 10 or even worse Nevada. You should download the live CD and burn it onto a disk boot that and let it install but before you start the install read the next tip.

  2. Before you start the install open a terminal so that you can turn on compression on the root pool once it it created. You have to keep running “zpool list” until you see the pool is created and then run (pfexec zfs set compression=on rpool). You may think that disk is big but after a few months you will be needing every block you can get. Also laptop drives are so slow that compression will probably make things faster.

  3. Before you do anything after installation take a snapshot of the system so you can always go back (pfexec beadm create opensolaris@initialinstall). I really mean this.

  4. Add the extras repository. It contains virtualbox, the flash plugin for firefox, true type fonts and more. All you need is a sun online account. See https://pkg.sun.com/register/ and http://blogs.sun.com/chrisg/entry/installing_support_certificates_in_opensolaris

  5. Decide whether you want to use the development or support repository. If in doubt choose the supported one. Sun employees get access to the support repository. Customers need to get a support contract. (http://www.opensolaris.com/learn/subscriptions/). Then update to the latest bigs (pfexec pkg image-update).

  6. Add any extra packages you need. Since I am now writing this retrospectively there may be things missing. My starting list is:

    • OpenOffice (pfexec pkg install openoffice)

    • SunStudio (pfexec pkg install sunstudioexpress)

    • Netbeans (pfexec pkg install netbeans)

    • Flash (pkfexec pkg install flash)

    • Virtualbox (pfexec pkg install virtualbox)

    • TrueType fonts (pfxec pkg install ttf-fonts-core)

  7. If you are a Sun Employee install the punchin packages so you can access SWAN. I actually rarely use this as I have a Solaris 10 virtualbox image that I use for punchin so I can be both on and off SWAN at the same time but it is good to have the option.

  8. Add you keys to firefox so that you can browse the extras and support repositories from firefox. See http://wikis.sun.com/display/OpenSolarisInfo200906/How+to+Browse+the+Support+and+Extra+Repositories.

  9. Go to Fluendo and get and install the free mp3 decoder. They also sell a complete and legal set of decoders for the major video formats, I have them and have been very happy with them. They allow me to view the videos I have cycling events.

  10. Go to Adobe and get acroread. I live in hope that at some point this will be in a repository either at Sun or one Adobe runs so that it can be installed using the standard pkg commands but until then do it by hand.

Enjoy.


( Jul 17 2009, 09:42:06 AM BST ) Permalink Trackback

   

20090630 Tuesday June 30, 2009

Using dtrace to track down memory leaks

I've been working with a customer to try and find a memory “leak” in their application. Many things have been tried, libumem, and the mdb ::findleaks command all with no success.

So I was, as I am sure others before me have, pondering if you could use dtrace to do this. Well I think you can. I have a script that puts probes into malloc et al and counts how often they are called by this thread and when they are freed often free is called.

Then in the entry probe of the target application note away how many calls there have been to the allocators and how many to free and with a bit of care realloc. Then in the return probe compare the number of calls to allocate and free with the saved values and aggregate the results. The principle is that you find the routines that are resulting in allocations that they don't clear up. This should give you a list of functions that are possible leakers which you can then investigate1.

Using the same technique I for getting dtrace to “follow fork” that I described here I ran this up on diskomizer, a program that I understand well and I'm reasonably sure does not have systemic memory leaks. The dtrace script reports three sets of results.

  1. A count of how many times each routine and it's descendents have called a memory allocator.

  2. A count of how many times each routine and it's descendents have called free or realloc with a non NULL pointer as the first argument.

  3. The difference between the two numbers above.

Then with a little bit of nawk to remove all the functions for which the counts are zero gives:

# /usr/sbin/dtrace -Z -wD TARGET_OBJ=diskomizer2 -o /tmp/out-us \
	-s /tmp/followfork.d \
	-Cs /tmp/allocated.d -c \
         "/opt/SUNWstc-diskomizer/bin/sparcv9/diskomizer -f /devs -f background \
          -o background=0 -o SECONDS_TO_RUN=1800"
dtrace: failed to compile script /tmp/allocated.d: line 20: failed to create entry probe for 'realloc': No such process
dtrace: buffer size lowered to 25m
dtrace: buffer size lowered to 25m
dtrace: buffer size lowered to 25m
dtrace: buffer size lowered to 25m
 
# nawk '$1 != 0 { print  $0 }' < /tmp/out.3081
allocations
           1 diskomizer`do_dev_control
           1 diskomizer`set_dev_state
           1 diskomizer`set_state
           3 diskomizer`report_exit_reason
           6 diskomizer`alloc_time_str
           6 diskomizer`alloc_time_str_fmt
           6 diskomizer`update_aio_read_stats
           7 diskomizer`cancel_all_io
           9 diskomizer`update_aio_write_stats
          13 diskomizer`cleanup
          15 diskomizer`update_aio_time_stats
          15 diskomizer`update_time_stats
          80 diskomizer`my_calloc
         240 diskomizer`init_read
         318 diskomizer`do_restart_stopped_devices
         318 diskomizer`start_io
         449 diskomizer`handle_write
         606 diskomizer`do_new_write
        2125 diskomizer`handle_read_then_write
        2561 diskomizer`init_buf
        2561 diskomizer`set_io_len
       58491 diskomizer`handle_read
       66255 diskomizer`handle_write_then_read
      124888 diskomizer`init_read_buf
      124897 diskomizer`do_new_read
      127460 diskomizer`expect_signal
freecount
           1 diskomizer`expect_signal
           3 diskomizer`report_exit_reason
           4 diskomizer`close_and_free_paths
           6 diskomizer`update_aio_read_stats
           9 diskomizer`update_aio_write_stats
          11 diskomizer`cancel_all_io
          15 diskomizer`update_aio_time_stats
          15 diskomizer`update_time_stats
          17 diskomizer`cleanup
         160 diskomizer`init_read
         318 diskomizer`do_restart_stopped_devices
         318 diskomizer`start_io
         442 diskomizer`handle_write
         599 diskomizer`do_new_write
        2125 diskomizer`handle_read_then_write
        2560 diskomizer`init_buf
        2560 diskomizer`set_io_len
       58491 diskomizer`handle_read
       66246 diskomizer`handle_write_then_read
      124888 diskomizer`do_new_read
      124888 diskomizer`init_read_buf
      127448 diskomizer`cancel_expected_signal
mismatch_count
     -127448 diskomizer`cancel_expected_signal
          -4 diskomizer`cancel_all_io
          -4 diskomizer`cleanup
          -4 diskomizer`close_and_free_paths
           1 diskomizer`do_dev_control
           1 diskomizer`init_buf
           1 diskomizer`set_dev_state
           1 diskomizer`set_io_len
           1 diskomizer`set_state
           6 diskomizer`alloc_time_str
           6 diskomizer`alloc_time_str_fmt
           7 diskomizer`do_new_write
           7 diskomizer`handle_write
           9 diskomizer`do_new_read
           9 diskomizer`handle_write_then_read
          80 diskomizer`init_read
          80 diskomizer`my_calloc
      127459 diskomizer`expect_signal

#

From the above you can see that there are two functions that create and free the majority of the allocations and the allocations almost match each other, which is expected as they are effectively constructor and destructor for each other. The small mismatch is not unexpected in this context.

However it is the vast number of functions that are not listed at all as they and their children make no calls to the memory allocator or have exactly matching allocation and free that are important here. Those are the functions that we have just ruled out.

From here it is easy now to drill down on the functions that are interesting you, ie the ones where there are unbalanced allocations.


I've uploaded the files allocated.d and followfork.d so you can see the details. If you find it useful then let me know.

1Unfortunately the list is longer than you want as on SPARC it includes any functions that don't have their own stack frame due to the way dtrace calculates ustackdepth, which the script makes use of.

2The script only probes particular objects, in this case the main diskomizer binary, but you can limit it to a particular library or even a particular set of entry points based on name if you edit the script.


( Jun 30 2009, 12:14:33 PM BST ) Permalink Trackback

   

20090627 Saturday June 27, 2009

Follow fork for dtrace pid provider?

There is a ongoing request to have follow fork functionality for the dtrace pid provider but so far no one has stood upto the plate for that RFE. In the mean time my best workaround for this is this:

cjg@brompton:~/lang/d$ cat followfork.d
proc:::start
/ppid == $target/
{
	stop();
	printf("fork %d\n", pid);
	system("dtrace -qs child.d -p %d", pid);
}
cjg@brompton:~/lang/d$ cat child.d
pid$target::malloc:entry
{
	printf("%d %s:%s %d\n", pid, probefunc, probename, ustackdepth)
}
cjg@brompton:~/lang/d$ pfexec /usr/sbin/dtrace -qws followfork.d -s child.d -p 26758
26758 malloc:entry 22
26758 malloc:entry 15
26758 malloc:entry 18
26758 malloc:entry 18
26758 malloc:entry 18
fork 27548
27548 malloc:entry 7
27548 malloc:entry 7
27548 malloc:entry 18
27548 malloc:entry 16
27548 malloc:entry 18

Clearly you can have the child script do what ever you wish.

Better solutions are welcome!


( Jun 27 2009, 01:25:52 PM BST ) Permalink Trackback

   

20090618 Thursday June 18, 2009

Diskomizer Open Sourced

I'm pleased to announce the Diskomizer test suite has been open sourced. Diskomizer started life in the dark days before ZFS when we lived in a world full1 of bit flips, phantom writes, phantom reads, misplaced writes and misplaced reads.

With a storage architecture that does not use end to end data verification the best that you can hope for was that your application will spot errors quickly and allow you to diagnose the broken part or bug quickly. Diskomizer was written to be a “simple” application that could verify all the data paths worked correctly and worked correctly under extreme load. It has been and is used by support, development and test groups for system verification.

For more details of what Diskomizer is and how to build and install read these pages:

http://www.opensolaris.org/os/community/storage/tests/Diskomizer/

You can download the source and precompiled binaries from:

http://dlc.sun.com/osol/test/downloads/current/

and can browse the source here:

http://src.opensolaris.org/source/xref/test/stcnv/usr/src/tools/diskomizer

Using Diskomizer

First remember in most cases Diskomizer will destroy all the data on any target you point it at. So extreme care is advised.

I will say that again.

Diskomizer will destroy all the data on any target that you point it at.

For the purposes of this explanation I am going to use ZFS volumes so that I can create and destroy them with confidence that I will not be destroying someone's data.

First lets create some volumes.

# i=0
# while (( i < 10 ))
do
zfs create -V 10G storage/chris/testvol$i
let i=i+1
done
#

Now write the names of the devices you wish to test into a file after the key “DEVICE=”:

# echo DEVICE= /dev/zvol/rdsk/storage/chris/testvol* > test_opts

Now start the test. When you installed diskomizer it put the standard option files on the system and has a search path so that it can find them. I'm using the options file “background” which will make the test go into the back ground redirecting the output into a file called “stdout” and any errors into a file called “stderr”:


# /opt/SUNWstc-diskomizer/bin/diskomizer -f test_opts -f background
# 

If Diskomizer has any problems with the configuration it will report them and exit. This is to minimize the risk to your data from a typo. Also the default is to open devices and files exclusively to again reduce the danger to your data (and to reduce false positives where it detects data corruption).

Once up and running it will report it's progress for each process in the output file:

# tail -5 stdout
PID 1152: INFO /dev/zvol/rdsk/storage/chris/testvol7 (zvol0:a)2 write times (0.000,0.049,6.068) 100%
PID 1152: INFO /dev/zvol/rdsk/storage/chris/testvol1 (zvol0:a) write times (0.000,0.027,6.240) 100%
PID 1152: INFO /dev/zvol/rdsk/storage/chris/testvol7 (zvol0:a) read times (0.000,1.593,6.918) 100%
PID 1154: INFO /dev/zvol/rdsk/storage/chris/testvol9 (zvol0:a) write times (0.000,0.070,6.158)  79%
PID 1151: INFO /dev/zvol/rdsk/storage/chris/testvol0 (zvol0:a) read times (0.000,0.976,7.523) 100%
# 

meanwhile all the usual tools can be used to view the IO:

# zpool iostat 5 5                                                  
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
storage      460G  15.9T    832  4.28K  6.49M  31.2M
storage      460G  15.9T  3.22K  9.86K  25.8M  77.2M
storage      460G  15.9T  3.77K  6.04K  30.1M  46.8M
storage      460G  15.9T  2.90K  11.7K  23.2M  91.4M
storage      460G  15.9T  3.63K  5.86K  29.1M  45.7M
# 


1Full may be an exaggeration but we will never know thanks to the fact that the data loss was silent. There were enough cases reported where there was reason to doubt whether the data was good to keep me busy.

2The fact that all the zvols have the same name (zvol0:a) is bug 6851545 found with diskomizer.


( Jun 18 2009, 10:42:10 AM BST ) Permalink Trackback

   

20090607 Sunday June 07, 2009

OpenSolaris 2009.06

After a week of running 2009.06 on my Toshiba Tecra M9 having upgraded from 2008.11 I'm in a position to comment on it. I've been able to remove all the workarounds I had on the system. Nwam appears to work even in the face a suspend and resume. Removalbe media also appears to be robust without the occasional panics that would happen when I removed the SD card with 2008.11.

Feature wise the things I have noticed are the new tracker system for searching files, but it seems to be completely non functional. The big improvements are in the support for the special keys on the Toshiba and the volume control, which unlike the volume on the M2 is a logical control so requires software support. 2009.06 has this support along with support fo the number pad, brightness and mute buttons.

The downside was hitting this bug. This pretty much renders resume useless and I was about to go back to 2008.06 when the bug was updated to say it will be fixed in the first update release and in the mean time there are binary packages. So after creating a new boot enviroment so that I have an unpatched one to switch to when the fix gets into the support repository I have applied the patch. Seems to work which is very pleasing as it has not taken me long to get used to the brightness buttons working.


( Jun 07 2009, 05:45:49 PM BST ) Permalink
Trackback

   

20090526 Tuesday May 26, 2009

Why everyone should be using ZFS

It is at times like these that I'm glad I use ZFS at home.


  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME           STATE     READ WRITE CKSUM
        tank           ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c20t0d0s7  ONLINE       6     0     4
            c21t0d0s7  ONLINE       0     0     0
          mirror       ONLINE       0     0     0
            c21t1d0    ONLINE       0     0     0
            c20t1d0    ONLINE       0     0     0

errors: No known data errors
: pearson FSS 14 $; 

The drive with the errors was also throwing up errors that iostat could report and from it's performance was trying heroicially to give me back data. However it had failed. It's performance was terrible and then it failed to give the right data on 4 occasions. Anyother file system would, if that was user data, just had deliviered it to the user without warning. That bad data could then propergate from there on, probably into my backups. There is certainly no good that could come from that. However ZFS detected and corrected the errors.


Now I have offlined the disk the performance of the system is better but I have no redundancy until the new disk I have just ordered arriaves. Now time to check out Seagate's warranty return system.


( May 26 2009, 10:46:56 AM BST ) Permalink Trackback

   

20090508 Friday May 08, 2009

Stopping find searching remote directories.

Grizzled UNIX users look away now.

The find command is a wonderful thing but there are some uses of it that seem to cause confusion enough that it seems worth documenting them for google. Today's is:

How can I stop find(1) searching remote file systems?

On reading the documentation the “-local” option should be just what you want and it is, but not on it's own. If you just do:

$ find . -local -print

It will indeed only report on files that are on local file systems below the current directory. However it will search the entire directory tree for those local files even if the directory tree is on NFS

To get find to stop searching when it finds a remote file system you need:


$ find . \( ! -local -prune \) -o -print

simple.


( May 08 2009, 03:57:16 PM BST ) Permalink Trackback

   

20090501 Friday May 01, 2009

Installing support certificates in OpenSolaris

For some reason you only get the instructions on how to install a certificate to get access to supported or extras updates on your OpenSolaris system after you have downloaded the certificate. Not a big issue as that is generally when you want the instructions. However if you already have your certificates and now want to install them on another system (that you have support for) you can't get the instructions without getting another certificate.

So here are the instructions cut'n'pasted from the support page, as much for me as for you:

How to Install this OpenSolaris 2008.11 standard support Certificate

  1. Download the provided key and certificate files, called OpenSolaris_2008.11_standard_support.key.pem andOpenSolaris_2008.11_standard_support.certificate.pem using the buttons above. Don't worry if you get logged out, or lose the files. You can come back to this site later and re-download them. We'll assume that you downloaded these files into your Desktop folder,~/Desktop/.

  2. Use the following comands to make a directory inside of /var/pkg to store the key and certificate, and copy the key and certificate into this directory. The key files are kept by reference, so if the files become inaccessible to the packaging system, you will encounter errors. Here is how to do it:

            $ pfexec mkdir -m 0755 -p /var/pkg/ssl
            $ pfexec cp -i ~/Desktop/OpenSolaris_2008.11_standard_support.key.pem /var/pkg/ssl
            $ pfexec cp -i ~/Desktop/OpenSolaris_2008.11_standard_support.certificate.pem /var/pkg/ssl 

  3. Add the publisher:

            $ pfexec pkg set-authority \
               -k /var/pkg/ssl/OpenSolaris_2008.11_standard_support.key.pem \
               -c /var/pkg/ssl/OpenSolaris_2008.11_standard_support.certificate.pem \
               -O https://pkg.sun.com/opensolaris/support/ opensolaris.org 
    
  4. To see the packages supplied by this authority, try:

            $ pkg list -a 'pkg://opensolaris.org/*' 
    

If you use the Package Manager graphical application, you will be able to locate the newly discovered packages when you restart Package Manager.

How to Install this OpenSolaris extras Certificate

  1. Download the provided key and certificate files, called OpenSolaris_extras.key.pem and OpenSolaris_extras.certificate.pem using the buttons above. Don't worry if you get logged out, or lose the files. You can come back to this site later and re-download them. We'll assume that you downloaded these files into your Desktop folder, ~/Desktop/.

  2. Use the following comands to make a directory inside of /var/pkg to store the key and certificate, and copy the key and certificate into this directory. The key files are kept by reference, so if the files become inaccessible to the packaging system, you will encounter errors. Here is how to do it:

            $ pfexec mkdir -m 0755 -p /var/pkg/ssl
            $ pfexec cp -i ~/Desktop/OpenSolaris_extras.key.pem /var/pkg/ssl
            $ pfexec cp -i ~/Desktop/OpenSolaris_extras.certificate.pem /var/pkg/ssl
                    
  3. Add the publisher:

            $ pfexec pkg set-authority \
                -k /var/pkg/ssl/OpenSolaris_extras.key.pem \
                -c /var/pkg/ssl/OpenSolaris_extras.certificate.pem \
                -O https://pkg.sun.com/opensolaris/extra/ extra
            
  4. To see the packages supplied by this authority, try:

            $ pkg list -a 'pkg://extra/*'
            

    If you use the Package Manager graphical application, you will be able to locate the newly discovered packages when you restart Package Manager.


( May 01 2009, 12:27:21 PM BST ) Permalink Trackback

   

20090419 Sunday April 19, 2009

User and group quotas for ZFS!

This push will be very popular amoung those who are managing servers with thousands of users:

Repository: /export/onnv-gate
Total changesets: 1

Changeset: f41cf682d0d3

Comments:
PSARC/2009/204 ZFS user/group quotas & space accounting
6501037 want user/group quotas on ZFS
6830813 zfs list -t all fails assertion
6827260 assertion failed in arc_read(): hdr == pbuf->b_hdr
6815592 panic: No such hold X on refcount Y from zfs_znode_move
6759986 zfs list shows temporary %clone when doing online zfs recv

User quotas for zfs has been the feature I have been asked about most when talking to customers. This probably relfects that most customers are simply blown away by the other features of ZFS and the only missing feature was user quotas if you have a large user base.


( Apr 19 2009, 04:18:44 PM BST ) Permalink Trackback

   

Valid HTML! Valid CSS!

Except where otherwise noted, this site is
licensed under a Creative Commons License 2.0

This is a personal weblog, I do not speak for my employer.