Last Post
This is my last day at Sun. Further blogging and Solaris hacks/tips can be found at super-user.org
Cheers,
GaryL.
( Apr 13 2006, 04:10:15 PM BST )
Mounting an iPod on Solaris
This worked for me. I'm using Solaris 11 / Nevada, build 36.
# uname -a
SunOS dhcp-egmp02-36-140 5.11 snv_36 i86pc i386 i86pc
I plug my iPod into the USB port of my laptop (I've not tried it on FireWire), then type devfsadm at the root prompt.
# devfsadm
#
I then use rmformat -l to see if the device is recognised.
# rmformat -l
Looking for devices...
1. Volmgt Node: /vol/dev/aliases/cdrom0
Logical Node: /dev/rdsk/c1t0d0s2
Physical Node: /pci@0,0/pci-ide@1f,1/ide@1/sd@0,0
Connected Device: TOSHIBA DVD-ROM SD-C2612 1F27
Device Type: DVD Reader
Bus: IDE
Size:
Label:
Access permissions:
2. Logical Node: /dev/rdsk/c5t0d0s2
Physical Node: /pci@0,0/pci10cf,11ab@1d,7/storage@1/disk@0,0
Connected Device: Apple iPod 1.53
Device Type: Removable
Bus: USB
Size: 19.1 GB
Label:
Access permissions: Medium is not write protected.
3. Logical Node: /dev/rdsk/c5t0d0p0
Physical Node: /pci@0,0/pci10cf,11ab@1d,7/storage@1/disk@0,0
Connected Device: Apple iPod 1.53
Device Type: Removable
Bus: USB
Size: 19.1 GB
Label:
Access permissions: Medium is not write protected.
Now, since the iPod is formatted as PCFS, I need the following magic command.
# mount -F pcfs /dev/dsk/c5t0d0p0:c /a
I can now see the iPod disk.
# cd /a
# ls
Calendars Contacts iPod_Control Notes
And I can make a file.
# timex mkfile 100m ipod-disk1-100m
real 1:15.47
user 0.01
sys 5.64
# ls -l
total 204832
drwxrwxrwx 1 root root 4096 Jan 1 1970 Calendars
drwxrwxrwx 1 root root 4096 Jan 1 1970 Contacts
-rwxrwxrwx 1 root root 104857600 Mar 31 11:53 ipod-disk1-100m
drwxrwxrwx 1 root root 4096 Jan 1 1970 iPod_Control
drwxrwxrwx 1 root root 4096 Jan 1 1970 Notes
#
Cool!
( Mar 31 2006, 11:54:38 AM BST )
Persistent Resource Controls in S10
In a previous blog entry, I used prctl to change a resource limit on a project-wide basis. It turns out that this is only temporary and will be overwritten on reboot. For persistent resource changes it seems we still need to use the projmod command (or edit the /etc/project file by hand). Initially, my project file looks like this:
-bash-3.00# cat /etc/project
system:0::::
user.root:1::::
noproject:2::::
default:3::::
group.staff:10::::
user.oracle:11::::
Which means that my shared memory limit will be reset on reboot, which is not what we want. To make the change permanent, we use the projmod command like so.
# projmod -s -K "project.max-shm-memory=(priv,4gb,deny)" user.oracle
# cat /etc/project
system:0::::
user.root:1::::
noproject:2::::
default:3::::
group.staff:10::::
user.oracle:11::::project.max-shm-memory=(priv,4294967296,deny)
# bc
4*1024*1024*1024
4294967296
If you want to edit /etc/project by hand, you'll need to enter a plain decimal number. It won't accept 4gb (at least not on my system, I tried). The changes are only seen on reboot. To change the limit dynamically, use
prctl -n project.max-shm-memory -r -v 4gb -i project user.oracle
Then you will see the results immediately. When issuing the prctl command above, at least one process (e.g. a shell) needs to be running in the project user.oracle. The simplest way to do this is to log in to the machine as oracle in another terminal.
( Mar 28 2006, 03:49:09 PM BST )
UK:TV tonight - Solaris powers the universe
Apparently, tonight's Horizon (Wed 9th December) 9pm BBC2 will feature research work and visualisations processed by the Sun Grid on behalf of Durham University.
( Feb 09 2006, 05:35:41 PM GMT )
Scott's Photos
Scott Macdonald has created a small gallery with some of his photography. The macro stuff is particularly good IMHO. He's going to update the site with comments and annotations to the photos soon. Looks like Adrian finally has some competition.
( Jan 27 2006, 02:52:30 PM GMT )
Don't bogart that file my friend...
I spent yesterday at the Sun office in the City of London at a sort of open day for our customers. We were demonstrating the new features in Solaris 10, and someone asked us how they could detect that a user had *attempted* to delete a file (though the same holds true for read, write etc). So, even though the attempt to delete a file will fail due to permissions (either legacy or RBAC), they wanted to know that it had been attempted.
Such a feat *is* achievable using auditing (aka BSM) but is more fun, and more flexible, from DTrace. In the script below, we log a message to the messages file, and for fun kill the process! I'm no expert in DTrace, but it was pretty simple thanks in large part to Chris' blog earlier this month. Anyhow, the interesting thing was that the request from the customer was pretty random, but on the spot we were able to tell them how to achieve their aim with a few lines of 'D'. In the example below, the file is /tmp/fred.
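#!/usr/sbin/dtrace -s
/* Destructive actions (raise, system) must be enabled explicitly. */
#pragma D option destructive
#pragma D option quiet
/* Match unlink() of "fred" from within /tmp, or of the full path /tmp/fred. */
syscall::unlink:entry
/ ((self->path = copyinstr(arg0)) == "fred" && cwd == "/tmp") || (self->path == "/tmp/fred") /
{
	self->prot = 1;			/* flag this thread for the return probe */
	self->path = copyinstr(arg0);
	raise(9);			/* for fun, send SIGKILL to the caller */
}
/* On return from a flagged unlink(), log the attempt via logger. */
syscall::unlink:return
/ self->prot == 1 /
{
	system("logger -p user.err Deletion attempted of %s by user %d", self->path, uid);
	self->prot = 0;			/* release the thread-local variables */
	self->path = 0;
}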
( Jan 26 2006, 03:55:58 PM GMT )
Converting a ZFS pool to be mirrored
So, the ZFS syntax is quite different to that of SVM, which can lead to confusion. Ben Rockwood does a good job of explaining the difference, but does not show how to convert an un-mirrored ZFS pool into a mirrored one. So, here's how to do it.
o We start with a pool called realzfs (because it's made out of real devices rather than files)
# zpool list
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
realzfs 544G 1.17G 543G 0% ONLINE -
o We can see that it is made up of 4 disks
# zpool status
pool: realzfs
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
realzfs ONLINE 0 0 0
c3t0d0 ONLINE 0 0 0
c3t1d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
c3t5d0 ONLINE 0 0 0
o The correct way is to attach a new device to each existing vdev, e.g.
# zpool attach -f realzfs c3t0d0 c3t8d0
# zpool status
pool: realzfs
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress, 99.99% done, 0h0m to go
config:
NAME STATE READ WRITE CKSUM
realzfs ONLINE 0 0 0
mirror ONLINE 0 0 0
c3t0d0 ONLINE 0 0 0
c3t8d0 ONLINE 0 0 0 178.3 resilvered
c3t1d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
c3t5d0 ONLINE 0 0 0
# zpool attach -f realzfs c3t1d0 c3t9d0
# zpool attach -f realzfs c3t2d0 c3t10d0
# zpool attach -f realzfs c3t5d0 c3t11d0
o Finally we see all our vdevs mirrored.
# zpool status
pool: realzfs
state: ONLINE
scrub: resilver completed with 0 errors on Mon Jan 23 15:26:16 2006
config:
NAME STATE READ WRITE CKSUM
realzfs ONLINE 0 0 0
mirror ONLINE 0 0 0
c3t0d0 ONLINE 0 0 0
c3t8d0 ONLINE 0 0 0
mirror ONLINE 0 0 0
c3t1d0 ONLINE 0 0 0
c3t9d0 ONLINE 0 0 0
mirror ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
c3t10d0 ONLINE 0 0 0
mirror ONLINE 0 0 0
c3t5d0 ONLINE 0 0 0
c3t11d0 ONLINE 0 0 0
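The four attach commands above all follow one pattern: pair each existing disk with the new disk that will mirror it. Here is a minimal dry-run sketch of that pattern, assuming a POSIX shell and the same pool and device names as in this example. print_attach_cmds is a made-up helper name, and it only prints the commands rather than running zpool.

```shell
#!/bin/sh
# Dry run: print one "zpool attach" per existing/new disk pair.
# print_attach_cmds is a hypothetical helper, not part of ZFS.
print_attach_cmds() {
    pool=$1
    shift
    # Remaining arguments are pairs: existing disk, then its new mirror.
    while [ $# -ge 2 ]; do
        echo "zpool attach -f $pool $1 $2"
        shift 2
    done
}

print_attach_cmds realzfs \
    c3t0d0 c3t8d0 \
    c3t1d0 c3t9d0 \
    c3t2d0 c3t10d0 \
    c3t5d0 c3t11d0
```

Once the printed lines look right, they can be run by hand as root, one at a time, checking zpool status between each.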
o The WRONG way to do it is as follows:-
# zpool add -f realzfs mirror c3t8d0 c3t9d0 c3t10d0 c3t11d0
# zpool status
pool: realzfs
state: ONLINE
scrub: resilver completed with 0 errors on Mon Jan 23 15:26:16 2006
config:
NAME STATE READ WRITE CKSUM
realzfs ONLINE 0 0 0
c3t0d0 ONLINE 0 0 0
c3t1d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
c3t5d0 ONLINE 0 0 0
mirror ONLINE 0 0 0
c3t8d0 ONLINE 0 0 0
c3t9d0 ONLINE 0 0 0
c3t10d0 ONLINE 0 0 0
c3t11d0 ONLINE 0 0 0
Which is 4 single-disk vdevs and one 4-way mirrored vdev, NOT the 4 mirrored vdevs that we actually wanted.
( Jan 23 2006, 04:13:50 PM GMT )
Allow dtrace for a regular user. (RBAC)
Here's the magic command to allow a regular user to run dtrace. Ideal for your own laptop/workstation. The username here is garyli.
# usermod -K defaultpriv=basic,dtrace_kernel,dtrace_proc,dtrace_user garyli
( Jan 04 2006, 11:18:44 AM GMT )
Living in the future.
Listening to the radio via the interweb really feels like living in the future. Here are some of my favourites; both come under the heading of deep house. Firstly Digitally Imported, which has a pretty slick website. Secondly, one I found by accident, the wonderfully named Deepmix.ru. Both are playable via iTunes.
( Nov 21 2005, 08:16:53 PM GMT )
Detecting data/file corruption
Sometimes I get escalations that go along the lines of '...I moved this application data from machine fred to machine bob and now the application won't read it. What's happened?' Trying to debug the problem from the application down is probably going to be quite long-winded. So, my first action is to verify that the file is actually the same on both machines, i.e. did it get corrupted in the transfer? If it did, then we can forget the application layer stuff and concentrate on the method of transfer. It seems obvious when you think of it, but sometimes in the heat of the moment, the simplest things get forgotten. What follows are some examples of how to use standard Solaris tools to detect data corruption.
For a long time we've had binaries that generate a checksum against a file, which is a simple way to tell if the source and destination copies are the same. There are sum, cksum and now in s10 digest. We also have cmp, which will do a byte-for-byte comparison of two files.
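As a quick illustration of the idea, here is a minimal sketch that uses scratch files rather than a disk slice. It assumes a POSIX shell with mktemp available; the file contents and the same_file helper are made up for the illustration, while cmp -s and cksum are the standard tools.

```shell
#!/bin/sh
# same_file is a hypothetical helper: prints "same" when two files match
# byte-for-byte (cmp -s is silent and only sets the exit status).
same_file() {
    if cmp -s "$1" "$2"; then
        echo same
    else
        echo differ
    fi
}

# Scratch files standing in for the source and destination copies.
src=$(mktemp) || exit 1
dst=$(mktemp) || exit 1
printf 'application data\n' > "$src"
cp "$src" "$dst"

same_file "$src" "$dst"          # the copies match
cksum "$src" "$dst"              # checksum and size columns agree

# Simulate corruption: overwrite the first byte without truncating.
printf 'X' | dd of="$dst" bs=1 count=1 conv=notrunc 2>/dev/null
same_file "$src" "$dst"          # now reports a difference
cksum "$src" "$dst"              # checksums differ, sizes do not

rm -f "$src" "$dst"
```

Wrapping cmp -s in a helper like this makes it easy to drop the check into a transfer script and act on the result.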
Examples
All of these tools can be used on regular files and raw devices.
Copy a raw disk slice to an image file using dd.
# dd if=/dev/rdsk/c0t0d0s3 of=/var/tmp/c0t0d0s3.img bs=1024k
41+1 records in
41+1 records out
Now we can use the comparison tools; they should all come back identical or clean. Remember that cmp gives no output for a matching pair of files. For sum and cksum, the first column is the checksum and the second column is the size (in 512-byte blocks for sum, in bytes for cksum).
# cmp /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img
# sum /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img
28918 85050 /dev/rdsk/c0t0d0s3
28918 85050 /var/tmp/c0t0d0s3.img
# cksum /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img
3185788260 43545600 /dev/rdsk/c0t0d0s3
3185788260 43545600 /var/tmp/c0t0d0s3.img
# digest -a md5 /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img
(/dev/rdsk/c0t0d0s3) = 0616a55e0a4e30ecf49c974f23a56255
(/var/tmp/c0t0d0s3.img) = 0616a55e0a4e30ecf49c974f23a56255
To show what happens when a file is corrupted, we will write a single byte to the front of the file, which is currently all zeros. Here are the first 10 bytes of the file (offsets are in octal).
# od -x -N 10 /var/tmp/c0t0d0s3.img
0000000 0000 0000 0000 0000 0000
0000012
Now we write the first byte of /etc/hosts (any file would do) to the front of the image file, to simulate corruption.
# dd if=/etc/hosts of=/var/tmp/c0t0d0s3.img bs=1 count=1 conv=notrunc
We now see that the file has changed by one byte.
# od -x -N 10 /var/tmp/c0t0d0s3.img
0000000 3100 0000 0000 0000 0000
0000012
Now we will re-run the comparison commands to see what is shown for a corrupted file.
# cmp /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img
/dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img differ: char 1, line 1
# sum /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img
28918 85050 /dev/rdsk/c0t0d0s3
28967 85050 /var/tmp/c0t0d0s3.img
# cksum /dev/rdsk/c0t0d0s3 /var/tmp/c0t0d0s3.img
3185788260 43545600 /dev/rdsk/c0t0d0s3
1666608083 43545600 /var/tmp/c0t0d0s3.img
Again, note that for cksum and sum the second column is identical in the original and the corrupt version, since we have not changed the file length.
Timings, comparing two identical files on a filesystem. Single-disk Ultra 10, Solaris 10. The timings are dominated by waiting for IO.
# timex cmp /dev/dsk/c0t0d0s3 c0t0d0s3.img.bak
real 12.83
user 4.86
sys 1.31
# timex sum /dev/dsk/c0t0d0s3 c0t0d0s3.img.bak
28918 85050 /dev/dsk/c0t0d0s3
28918 85050 c0t0d0s3.img.bak
real 15.17
user 3.89
sys 1.15
# timex cksum /dev/dsk/c0t0d0s3 c0t0d0s3.img.bak
3185788260 43545600 /dev/dsk/c0t0d0s3
3185788260 43545600 c0t0d0s3.img.bak
real 14.57
user 2.73
sys 1.33
# timex digest -a md5 /dev/dsk/c0t0d0s3 c0t0d0s3.img.bak
(/dev/dsk/c0t0d0s3) = 0616a55e0a4e30ecf49c974f23a56255
(c0t0d0s3.img.bak) = 0616a55e0a4e30ecf49c974f23a56255
real 15.82
user 4.07
sys 1.68
( Jul 22 2005, 11:15:35 AM BST )
A Simple way to increase shared memory in Solaris10
Short Version
To increase the shared memory available to a given user on Solaris 10:
- Find out which project the user is in
- Use prctl to raise the limit e.g. to 200mb, using the project ID returned by id -p.
arches $ id -p
uid=90712(garyli) gid=10(staff) projid=10(group.staff)
arches $ su
Password:
# prctl -n project.max-shm-memory -r -v 200mb -i project 10
Long Version
By default the maximum amount of shared memory that a process can use is around 25% of physical memory. If you try to create a shared memory segment larger than the allowable limit, you will see an error in the messages file, and the shmget system call will fail with EINVAL.
Jul 4 17:51:53 arches genunix: [ID 883052 kern.notice] privileged rctl project.max-shm-memory (value 195078144) exceeded by project 10
For instance, on arches we have only 512Mb.
SunOS arches 5.10 s10_43 sun4u sparc SUNW,Ultra-5_10
arches $ prtdiag | head
System Configuration: Sun Microsystems sun4u Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 300MHz)
System clock frequency: 100 MHz
Memory size: 512 Megabytes
And we can see what the default maximum shared memory segment will be by using prctl.
arches $ prctl -n project.max-shm-memory -i project 10
25758: prctl -n project.max-shm-memory -i project 10
project.max-shm-memory [ no-basic deny ]
128100352 privileged deny
18446744073709551615 system deny [ max ]
arches $ bc
128100352/(1024*1024)
122
In the above case we have a maximum of 128100352 (122 mb) which we can allocate using shmat()/shmget()
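To make the arithmetic concrete, here is a small sketch that checks whether a request of a given size in Mb fits under the limit reported by prctl above. It assumes a POSIX shell (ksh or bash rather than the old Bourne /bin/sh). fits_limit is a made-up helper, and the check is a simplification: it looks at a single request and ignores segments the project already holds.

```shell
#!/bin/sh
# Limit taken from the prctl output above, in bytes.
limit=128100352

# fits_limit is a hypothetical helper:
# does a request of $1 Mb fit under a limit of $2 bytes?
fits_limit() {
    bytes=$(( $1 * 1024 * 1024 ))
    if [ "$bytes" -le "$2" ]; then
        echo "$1 Mb ($bytes bytes): fits"
    else
        echo "$1 Mb ($bytes bytes): exceeds limit"
    fi
}

fits_limit 122 "$limit"   # 127926272 bytes, under the limit
fits_limit 123 "$limit"   # 128974848 bytes, over the limit
```

So 122 Mb is the largest whole number of megabytes that fits under this particular limit.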
We can now demonstrate that it is the case, by trying to allocate first 122mb, then 123mb of shared memory. The program shm_var takes a single value as its input, which is the size in Mb of a shared memory segment that we want to create
arches $ ./shm_var 122
Attempting attach of 122 Mb
shm base address = F7000000
shmid = 5
shmat time = 1 sec
arches $ ./shm_var 123
Attempting attach of 123 Mb
shm base address = FFFFFFFF
shmid = FFFFFFFF
shmat time = 0 sec
In the above example, shmget() fails to create the segment we asked for and returns -1 (shown as shmid = FFFFFFFF), so the subsequent shmat() fails too. Using truss, we see shmget fail...
shmget(25851, 128974971, 0777|IPC_CREAT) Err#22 EINVAL
So, how to change all this is actually quite simple, and can be done on the fly. IMHO the prctl command doesn't do us any favours with what looks to me like an overly complex syntax. However, here's a cook-book approach.
Firstly, because the shared memory resource is controlled on a per-project basis, we need to know which project to adjust. In the simple case, this will be the project that the user belongs to. So, in the case of an Oracle install, su to oracle and issue id -p. Unless you have changed things manually, the project will be 3 (default). However, in the example below, my project is based on my group, so my project ID is 10.
arches $ id -p
uid=90712(garyli) gid=10(staff) projid=10(group.staff)
Then we issue the magic prctl command to raise the value.
# prctl -n project.max-shm-memory -r -v 200mb -i project 10
We can now allocate 200mb, but NOT 201mb.
roxy $ ./shm_var 200
Attempting attach of 200 Mb
shm base address = F2800000
shmid = 2
shmat time = 1 sec
roxy $ ./shm_var 201
Attempting attach of 201 Mb
shm base address = FFFFFFFF
shmid = FFFFFFFF
shmat time = 0 sec
Interestingly, the max-shm-memory limit is cumulative across all the segments in the project, and so does away with the confusing shmmax, shmseg etc.
roxy $ ./shm_var 100
Attempting attach of 100 Mb
shm base address = F8C00000
shmid = 3
shmat time = 0 sec
^C
roxy $ ./shm_var 100
Attempting attach of 100 Mb
shm base address = F8C00000
shmid = 4
shmat time = 1 sec
^C
roxy $ ./shm_var 1
Attempting attach of 1 Mb
shm base address = FFFFFFFF
shmid = FFFFFFFF
shmat time = 0 sec
Note that in the above test we did not do a shmdt() between each run of shm_var, and so in ipcs -a we see 200Mb of shared memory across two segments.
IPC status from <running system> as of Mon Jul 4 17:59:50 BST 2005
T ID KEY MODE OWNER GROUP CREATOR CGROUP CBYTES QNUM QBYTES LSPID LRPID STIME RTIME CTIME
Message Queues:
T ID KEY MODE OWNER GROUP CREATOR CGROUP NATTCH SEGSZ CPID LPID ATIME DTIME CTIME
Shared Memory:
m 4 0x2f8b --rw-rw-rw- garyli staff garyli staff 0 104857600 12171 12171 17:58:16 17:58:18 17:58:15
m 3 0x2f86 --rw-rw-rw- garyli staff garyli staff 0 104857600 12166 12166 17:58:07 17:58:12 17:58:07
m 1 0x43cb9a88 --rw-r----- oracle dba oracle dba 6 46235648 791 809 13:25:22 13:25:53 13:25:14
T ID KEY MODE OWNER GROUP CREATOR CGROUP NSEMS OTIME CTIME
Semaphores:
s 5 0xe49024ec --ra-r----- oracle dba oracle dba 39 17:56:57 13:25:14
s 1 0x71000b51 --ra-ra-ra- root root root root 1 13:18:56 13:18:34
s 0 0x187cf --ra-ra-ra- root sys root sys 1 13:17:55 13:17:54
roxy $
Notice also that Oracle has 46Mb that is not affected by our allocation (or vice versa).
# su - oracle
Sun Microsystems Inc. SunOS 5.10 Generic January 2005
-bash-3.00$ id -p
uid=101(oracle) gid=1001(dba) projid=11(user.oracle)
-bash-3.00$ prctl -n project.max-shm-memory -i project user.oracle
project: 11: user.oracle
NAME PRIVILEGE VALUE FLAG ACTION RECIPIENT
project.max-shm-memory
privileged 186MB - deny -
system 16.0EB max deny
Notice also I used user.oracle, rather than the project ID, although -i project 11 would have achieved the same thing.
( Jul 05 2005, 11:00:36 AM BST )
Erin's portrait of me on Father's Day
Erin took this photo of me using her camera on Father's Day. We're having a bit of a heatwave in the UK at the moment, which explains the reckless shirtlessness displayed here. I've not cropped or otherwise edited the picture and I love the kid's-eye view of the world that her pictures give.
g.
( Jun 23 2005, 08:33:01 AM BST )
fsync performance
Not the most exciting title I know, but here goes anyway... OK, so fsync() is used to ensure that dirty pages that have been written to a file actually go down to disk. The same sort of thing can be done by opening the file with one of the O_SYNC options. fsync() however allows greater flexibility, since the programmer can specify when the synchronisation to disk takes place, perhaps in a separate thread. Anyhow, generally fsync() is goodness, and 'cheap' since it only syncs the data that is dirty. So far so good.
However there is (or rather was) a subtle problem that shows up when very large files are mapped into the memory of systems with reasonable amounts of memory. The problem is not to do with large memory systems as such, just that you need a lot of memory to really cache a large file. The problem is that the file is searched linearly (from beginning to end), from the first page that is mapped in right through to the last. This can take quite a lot of time. Given a big enough file and a big enough physical memory, the time taken can be measured in seconds (yes really!). Since many developers think of fsync() as a 'free' system call, it is often called quite indiscriminately, and so fsync() performance can really make a BIG difference (see the test results below for a pathological case).
The good news is that this behavior is changed in Solaris 9 (I think of this as the version of Solaris designed for large systems like the 15/12/25K StarCat range) so that all the dirty pages are put at the head of the list, and we need only search the list until we find the first non-dirty page. This is logged as CR 4336082, fixed in patch 112233-09 and later.
So how can you tell if your application is suffering this problem? I would use truss -c against a running process, and see if fdsync()* is using some appreciable amount of CPU time. Note that this is 'real' cpu time spent examining pages rather than 'sleeping' or time that is spent waiting for the actual page to be written to disk.
In the experiments below, I used Solaris 9 with the latest recommended patch set, and with the same file on UFS and VxFS on the same system. The results are quite dramatic. VxFS at the time of writing does not incorporate the fixes that are in UFS, so it serves as a good counter-example to the speed of UFS. This test was performed on a 16Gb file with 16Gb RAM.
The initial run shows a fast time since there are few pages mapped into the page cache. After the first run, we read in the file using 'dd', making sure to keep below the threshold used by VxFS to do 'discovered direct IO', which seems not to populate the cache, as you would expect.
Firstly we run the test (read then write 100 blocks at a size of 4Kb).
# timex /var/tmp//write_random_fsync_loopsome testfile 4096 100
New block size 4096
Trying to open testfile blocks blocksize = 4096
real 0.91
user 0.00
sys 0.10
Then we read in the entire file in blocks of 4Mb.
# dd if=./testfile of=/dev/null bs=4096k
4000+1 records in
4000+1 records out
We see that the performance has not changed. We didn't populate the cache, because we read in the blocks at a size that triggers VxFS discovered direct IO.
# timex /var/tmp//write_random_fsync_loopsome testfile 4096 100
New block size 4096
Trying to open testfile blocksize = 4096
real 0.92
user 0.00
sys 0.07
Next we read the file in blocks of 4Kb (rather than 4Mb).
# dd if=./testfile of=/dev/null bs=4096
4096001+0 records in
4096001+0 records out
Now we see the problem. Note that we are still only writing 100 blocks of 4Kb, as we have been all along. Note also that the time is accurately attributed to sys, as expected.
# timex /var/tmp//write_random_fsync_loopsome testfile 4096 100
New block size 4096
Trying to open testfile blocksize = 4096
real 6:47.77
user 0.00
sys 6:45.85
Using truss -c we can see where the time has gone.
# truss -c /var/tmp//write_random_fsync_loopsome testfile 4096 100
New block size 4096
Trying to open testfile
The size of the file is -402649088 bytes -98303 application blocks blocksize = 4096
syscall seconds calls errors
_exit .000 1
read .012 101
write .018 106
open .000 5 1
close .000 4
brk .000 2
stat .000 4
lseek .007 202
fstat .000 3
ioctl .000 1
fdsync 403.329 101 <--- Here's where all the time went, in fsync() as expected
execve .000 1
getcontext .000 1
evsys .000 1
evtrapret .000 1
mmap .000 11
munmap .000 1
getrlimit .000 1
memcntl .000 1
resolvepath .000 5
-------- ------ ----
sys totals: 403.371 553 1
usr time: .006
elapsed: 404.210
In the above truss -c each fdsync() takes around 4 SECONDS to complete
On UFS, having used 'dd' to map in the file
# dd if=testfile of=/dev/null bs=4096
4096001+0 records in
4096001+0 records out
# timex /var/tmp//write_random_fsync_loopsome testfile 4096 100
New block size 4096
Trying to open testfile blocksize = 4096
real 1.33
user 0.00
sys 0.06
# truss -c /var/tmp//write_random_fsync_loopsome testfile 4096 100
New block size 4096
Trying to open testfile blocksize = 4096
syscall seconds calls errors
_exit .000 1
read .005 101
write .006 106
open .000 5 1
close .000 4
brk .000 2
stat .000 4
lseek .006 202
fstat .000 3
ioctl .000 1
fdsync .094 101
execve .000 1
getcontext .000 1
evsys .000 1
evtrapret .000 1
mmap .000 11
munmap .000 1
getrlimit .000 1
memcntl .000 1
resolvepath .000 5
-------- ------ ----
sys totals: .115 553 1
usr time: .005
elapsed: 1.370
The fsync() calls above are still super-quick despite having a lot of the file cached in RAM
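Dividing the fdsync totals from the two truss -c runs by the 101 calls gives the per-call cost on each filesystem. Here is a small sketch of that arithmetic, assuming a POSIX shell and an awk that accepts -v (nawk on older Solaris); per_call is a made-up helper.

```shell
#!/bin/sh
# per_call is a hypothetical helper: total seconds divided by number of calls.
per_call() {
    awk -v total="$1" -v calls="$2" 'BEGIN { printf "%.4f\n", total / calls }'
}

# Totals and call counts are taken from the fdsync rows of the truss -c output.
echo "VxFS seconds per fdsync: $(per_call 403.329 101)"
echo "UFS seconds per fdsync:  $(per_call .094 101)"
```

The VxFS figure comes out at just under 4 seconds per call, while the UFS figure is under a millisecond.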
*fsync() is the call made from C, but it shows up in truss as fdsync().
I tried reducing discovered_direct_iosz to 2k, and it seems to follow that if you allow VxFS to do direct IO then the fsync issue is not hit. However, on the read side you will not get any benefit from caching, whereas on UFS you get both cached data and fast fsyncs.
( Jun 11 2005, 02:15:23 AM BST )

