The view from the Engine Room: Bart Smaalders' weblog
Tuesday Feb 12, 2008
Indiana Preview 2 - my new desktop
This weekend I decided to bite the bullet and convert my desktop to Indiana Preview 2. Unlike most people at Sun, I use my desktop machine to receive my email and to host both my home directory and my calendar server, so the switch-over needed some quiet concentration on my part to ensure nothing important got left behind.
The installation of Preview 2 (now available here) went smoothly – not surprising, since I'd tested many trial builds on the same machine, a 2 x 2.8GHz Ultra 40. After installation completed and the machine rebooted, I created a second zpool with the two remaining drives; I use this for my home directory, mail spool, tunes and pkg server. This isolates me from any difficulties with the new installer or possible future upgrade problems. ZFS of course makes this all very easy:
    : barts@cyber[227]; zfs list
    NAME                              USED  AVAIL  REFER  MOUNTPOINT
    rpool                            2.63G   224G  49.5K  /rpool
    rpool@install                        0      -  49.5K  -
    rpool/ROOT                       2.62G   224G    18K  none
    rpool/ROOT@install                   0      -    18K  -
    rpool/ROOT/preview2              2.62G   224G  2.09G  legacy
    rpool/ROOT/preview2@install      66.8M      -  1.94G  -
    rpool/ROOT/preview2/opt           483M   224G   483M  /opt
    rpool/ROOT/preview2/opt@install    77K      -  3.61M  -
    rpool/export                     2.44M   224G    19K  /export
    rpool/export@install               15K      -    19K  -
    rpool/export/home                2.41M   224G  2.39M  /export/home
    rpool/export/home@install          19K      -    21K  -
    zfs                               177G  51.8G    21K  /zfs
    zfs/home                          133G  51.8G   133G  /export/home/cyber
    zfs/local                         291M  51.8G   291M  /usr/local
    zfs/mail                          110M  51.8G   110M  /var/mail
    zfs/music                        43.4G  51.8G  43.3G  /zfs/music
    zfs/music@2.1.2008               2.54M      -  42.3G  -
    zfs/repo                           18K  51.8G    18K  /zfs/repo
    : barts@cyber[228];

I then got to thinking about having a mirrored root pool; I hunted up one more 250GB drive, hot-plugged it into the machine (love those SATA features) and used cfgadm -al and cfgadm -c to get Solaris to find the drive. A zpool attach took care of establishing the mirror; the mirror was resilvered in just a few minutes, since ZFS knows what's data and what's empty space.
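Some rough arithmetic shows why the resilver was so quick. The root pool held only about 2.63G of data, and ZFS copies only allocated blocks, whereas a mirror that knows nothing about the filesystem has to copy the whole 250GB device. The 60 MB/sec sustained copy rate used below is an assumed figure for illustration, not a measurement from this machine.

```c
#include <assert.h>

/*
 * Rough time, in seconds, to copy `megabytes` of data at an ASSUMED
 * sustained rate of `mb_per_sec`. Seeks and metadata overhead are ignored,
 * so the result is a lower bound.
 */
double
copy_seconds(double megabytes, double mb_per_sec)
{
	assert(mb_per_sec > 0.0);
	return (megabytes / mb_per_sec);
}
```

At the assumed 60 MB/sec, copying just the allocated data, copy_seconds(2.63 * 1024, 60), comes to about 45 seconds, while copying the whole device, copy_seconds(250000, 60), takes over an hour (roughly 69 minutes).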
Now I needed dovecot, since I run an IMAP server to allow remote access to my mail. Off to dovecot.org for a tarball: download, configure and hmm, no C compiler. pkg search -r gcc told me that I needed SUNWgcc installed, so pkg install SUNWgcc grabbed the compiler, assembler and binutils. Cool. Run configure again and whoops, no headers! pkg search -r stdlib.h said I needed SUNWhea, so after a pkg install SUNWhea I was compiling dovecot.... For a quick look at the packages available in Indiana so far, browse over to http://pkg.opensolaris.org. I wrote this blog post using OpenOffice, which you'll find in a package called openoffice. Indiana and IPS are usable, but we've still got a lot of work to do:
However, it's coming together, and being able to upgrade to preview2 from preview1 without running any postinstall scripts helps us feel better about the assertions that started the project....
Posted at 07:23PM Feb 12, 2008 by barts in General
Monday Nov 05, 2007
A programmer's ABCs
Several years ago, before blogging, I cons'd up a programmer's ABC for Stephen Hahn's first child, Benjamin. I'd forgotten about this until Stephen mentioned it last week and mailed me a link to the image; I'd lost the original. It's a little SPARC-centric, but so was I at the time:
A Programmer's ABC
A is for algorithm, patented or not.
B is for break, to jump out of this rot.
C is for continue, to jump to the top of one's loops.
D is for default, the case that handles the oops.
E is for else, the predicate's inversion.
F is for for, of the loops the most popular version.
G is for goto, a jump oft considered dubious.
H is for hardware, for profits salubrious.
I is for if, a conditional statement.
J is for jmpl, an indirect jump causing performance abatement.
K is for thousands in powers of two.
L is for long, whose size recently grew.
M is for membar, whose use can confound.
N is for NOP, which in delay slots often is found.
O is for operator, whose overloading is oft unsupportable.
P is for pragma, with usage unportable.
Q is for quadword, the largest of all.
R is for return, when we make the stack not so tall.
S is for switch, a computed goto for which we all yearn.
T is for trap, from which we may never return.
U is for unsigned, to avoid two's complement extension.
V is for volatile, whose presence incites apprehension.
W is for WSTATE register, for window traps most topical.
X is for XOR, bitwise not logical.
Y is for Y register, deprecated for years.
Z is for zero, whose dereferencing everyone fears.

Posted at 08:52PM Nov 05, 2007 by barts in General
Wednesday Jul 25, 2007
Rethinking patching
As Stephen mentioned recently, several of us have been thinking about revising the way we manage software change on Solaris. I've been particularly focused on the difficulties Sun and its customers have with the patching process, and the kinds of changes we need to make as a result in our technology and development processes. Today, most customers don't run OpenSolaris; they run a supported version of Solaris such as Solaris 8, 9 or 10. A supported release means that someone will answer the phone, and that patches for problems are available. Patches are a software change control mechanism separate from package versions in Solaris. Each patch may affect portions of several packages; patches are intended to include all the files necessary to fix one or more problems, either directly or by specifying dependencies. If a patch affects packages which are not installed on this system (typically because it has been minimized), those portions of the patch are not installed. If the administrator later adds the missing package, he must remember (good luck) to re-apply the patches, since the packaging code knows nothing of patches. Customers are today free to install whichever patches they feel are appropriate for their environment, consistent with the built-in dependency requirements. This customization is a technique I refer to as Dim Sum patching, and it is a major cause of patching difficulties. Many customers pick and choose amongst the thousands of patches available for Solaris 10, for example; this means that customers are often pioneering new configurations. Note that each Solaris release consists of a single source base; all Solaris 10 updates, for example, are but snapshots of the same Solaris patch gate at different times. As a result, the developers are working on a cumulative set of all previous changes; when a new patch is created, the files in the patch contain not only the desired fix, but all previous fixes as well.
Thus, software change is constructed as a linear stream, but customers install selected binaries from the various builds via patches.
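A back-of-the-envelope count shows why picking and choosing leads into untested territory. If n independently patched components can each sit at any of k revision levels, there are k^n possible combinations, whereas a linear stream of cumulative builds produces only k snapshots, which are the configurations developers actually built. The sketch below does that count; the component and revision numbers in the usage note are hypothetical, and treating the pieces as fully independent ignores the dependency requirements, so the result is an upper bound.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of distinct configurations reachable by choosing any of
 * `revisions` levels independently for each of `components` parts
 * (an upper bound: real patch dependencies rule some combinations out).
 * Saturates at UINT64_MAX instead of overflowing.
 */
uint64_t
dim_sum_configurations(unsigned components, unsigned revisions)
{
	uint64_t total = 1;

	assert(revisions > 0);
	for (unsigned i = 0; i < components; i++) {
		if (total > UINT64_MAX / revisions)
			return (UINT64_MAX);	/* would overflow */
		total *= revisions;
	}
	return (total);
}
```

For a hypothetical 10 components at 5 revision levels each, dim_sum_configurations(10, 5) is 9,765,625, against just 5 snapshots along the linear stream of change.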
When I've discussed the hazards of Dim Sum patching with customers, the reasons given for it can typically be characterized as:
To these, I reply:
For our new packaging system, there is a powerful incentive to eliminate Dim Sum patching: since we wish to use a single version numbering space for any package, attempting to support fine-grained Dim Sum patching would require very small packages, hurting the performance of packaging operations and significantly increasing the workload of OpenSolaris developers. Instead, we can set package boundaries according to what makes sense for minimization purposes. This implies that future (post Solaris 11) patches will be completely cumulative (aside from some exceptions for urgent security fixes), at least for the core OS. Your system will be able to determine automatically what is needed to bring the installed software up to the desired revision level; needing to pick and choose patches will be a thing of the past.
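With a single version space per package and fully cumulative updates, deciding what to install reduces to comparing each installed package's version with the version in the target release. A minimal sketch of that comparison follows; the struct, the version numbers and the use of a single integer per version are simplifications for illustration, not how the new packaging system actually encodes versions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified package record: one integer version each. */
struct pkg {
	const char *name;
	unsigned version;
};

/*
 * For each installed package, find it in the target release and record
 * its name in `out` if the installed version is older. Packages that are
 * not installed (e.g. on a minimized system) are simply not considered.
 * Returns the number of packages needing an update.
 */
size_t
packages_to_update(const struct pkg *installed, size_t ninstalled,
    const struct pkg *target, size_t ntarget, const char **out)
{
	size_t count = 0;

	for (size_t i = 0; i < ninstalled; i++) {
		for (size_t j = 0; j < ntarget; j++) {
			if (strcmp(installed[i].name, target[j].name) == 0) {
				if (installed[i].version < target[j].version)
					out[count++] = installed[i].name;
				break;
			}
		}
	}
	return (count);
}
```

A package added later would be installed straight from the target release at that release's version, so in this model there is nothing to remember to re-apply.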
Posted at 03:56PM Jul 25, 2007 by barts in packaging and patching
Friday Jun 01, 2007
Friday afternoon SPARKFUN
Mike Pogue mentioned he was considering using a USB-connected PIC controller to drive some stepper motors from a PC. He'd ordered a USB Bit Whacker, which plugs right into a USB port, looks like a serial port to the host OS, and gives you 14 ports that can each be digital in, digital out or analog in. The host sends simple ASCII commands down, and the Bit Whacker sends back status/port data. Cool! Earlier this afternoon, he brought it in and we plugged it into my Tyan whitebox running Nevada 64a using a mini-USB cable. I was running a tail -f on /var/adm/messages, and saw:

    Jun 1 15:29:34 cyber usba: [ID 912658 kern.info] USB 2.0 device (usb4d8,a) operating at full speed (USB 1.x) on USB 1.10 root hub: communications@1, usb_mid6 at bus address 3

Well, that looked promising! Sure enough,

    % ls -l /dev/term

Ok. I added a quick entry to /etc/remote:

    sparkfun:\

and ran tip:

    # tip sparkfun

At this point I used various commands to read the inputs/buttons, flash the LED, etc. If needed, one can also write and download alternate firmware directly to the on-board PIC18F2455; there's a built-in bootloader that cannot be flashed over, so a PIC programmer is never needed. This unit seems ideally suited to wiring up coffee makers, sprinkler systems, etc., so they can be connected to a host computer. Looks like I could write some firmware to drive servo motors directly, using the Bit Whacker to count encoder pulses and close the velocity loop; the host computer would read position updates at, say, 100Hz and update the desired velocity of the motor... and w/o any firmware work at all, it seems directly applicable to our next Burning Man project, which will have an OpenSolaris computer controlling things for the first time on the playa AFAIK... more about that later. Here's a picture of the device:
(The picture was taken w/ my Razr 3i and uploaded using Solaris as well.... )
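Because the Bit Whacker shows up as an ordinary serial port, any program that can talk to a tty can drive it, not just tip. Below is a minimal C sketch using POSIX termios to open the port in raw mode and send one command. The device path (some entry under /dev/term), the carriage-return line terminator and any command strings are assumptions here; the board's firmware documentation is the authority on its actual command set.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <termios.h>
#include <unistd.h>

/*
 * Put a termios structure into raw mode by hand (cfmakeraw() is not
 * universally available): no echo, no line editing, no CR/NL translation,
 * 8-bit characters, and read() returns as soon as one byte arrives.
 */
void
ubw_make_raw(struct termios *t)
{
	t->c_iflag &= ~(IGNBRK | BRKINT | PARMRK | ISTRIP | INLCR | IGNCR |
	    ICRNL | IXON);
	t->c_oflag &= ~OPOST;
	t->c_lflag &= ~(ECHO | ECHONL | ICANON | ISIG | IEXTEN);
	t->c_cflag &= ~(CSIZE | PARENB);
	t->c_cflag |= CS8 | CREAD | CLOCAL;
	t->c_cc[VMIN] = 1;
	t->c_cc[VTIME] = 0;
}

/* Open the (assumed) serial device and switch it to raw mode; -1 on error. */
int
ubw_open(const char *path)
{
	struct termios t;
	int fd = open(path, O_RDWR | O_NOCTTY);

	if (fd < 0)
		return (-1);
	if (tcgetattr(fd, &t) != 0) {
		(void) close(fd);
		return (-1);
	}
	ubw_make_raw(&t);
	if (tcsetattr(fd, TCSANOW, &t) != 0) {
		(void) close(fd);
		return (-1);
	}
	return (fd);
}

/* Send one ASCII command followed by a carriage return (assumed terminator). */
int
ubw_send(int fd, const char *cmd)
{
	size_t len = strlen(cmd);

	if (write(fd, cmd, len) != (ssize_t)len)
		return (-1);
	return (write(fd, "\r", 1) == 1 ? 0 : -1);
}
```

A caller would open the port once with ubw_open(), send commands with ubw_send(), and read() the replies from the same descriptor.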
Posted at 06:31PM Jun 01, 2007 by barts in General
Wednesday Jan 17, 2007
New home server
Building a new household server.... Like a lot of families these days, our household IT infrastructure has had to adapt as we all became more and more fond of computers for work, school and recreation. With digital photography, ripping hundreds of CDs, describing our various activities and travels on web pages, two teenagers and the heavy use of email, and the need to provide stable storage for homework and digital art, we've been playing catch-up for a while. This led us directly to designing and building a new server to handle storage of all the digital media, web serving and email. At the same time, I was tired of the whine from the surplus X1 rack-mount server I had stuffed in the closet, and decided to merge my home desktop and server together to reduce power consumption. With some thinking we arrived at the following hardware design:
So far things are working very well. The 4x500 GB drives are in a RAID-Z configuration with ZFS; we can sustain 120 MB/sec or so reading or writing to the 20-odd filesystems configured on the single pool. Samba works pretty well; we managed to feed 10 different files to 10 different clients at nearly 100Mb/sec apiece during one of the kids' LAN parties. Dovecot in particular seems very fast on top of ZFS, and other than a glitch caused by my forgetting to set the maximum user mailbox size, Postfix has been trouble-free. I use this machine as my desktop as well in the evenings.
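The LAN party numbers line up with the pool's measured throughput: ten streams at nearly 100 Mb/sec apiece add up to about 1 Gb/sec, or roughly 125 MB/sec, close to the 120 MB/sec or so the pool sustains. The sketch below does that unit conversion, along with the approximate usable capacity of a 4x500 GB single-parity RAID-Z pool. It ignores metadata, padding and protocol overhead, so treat the results as estimates.

```c
#include <assert.h>

/* Convert a rate in megabits/sec to megabytes/sec (8 bits per byte). */
double
mbit_to_mbyte(double mbit_per_sec)
{
	return (mbit_per_sec / 8.0);
}

/* Aggregate MB/sec needed to feed `clients` streams of `mbit_each` Mb/sec. */
double
aggregate_mbyte_per_sec(unsigned clients, double mbit_each)
{
	return (mbit_to_mbyte(clients * mbit_each));
}

/*
 * Approximate usable space of a single-parity RAID-Z vdev: roughly one
 * disk's worth of capacity goes to parity. Overheads are ignored.
 */
double
raidz1_usable_gb(unsigned ndisks, double disk_gb)
{
	assert(ndisks >= 2);
	return ((ndisks - 1) * disk_gb);
}
```

Here aggregate_mbyte_per_sec(10, 100.0) gives 125 MB/sec, and raidz1_usable_gb(4, 500.0) gives about 1500 GB.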
Posted at 03:53PM Jan 17, 2007 by barts in General
Wednesday Nov 16, 2005
Some thoughts on ZFS's impact on Solaris
So ZFS is now available, and we've put together lots of blogs and demos to show everyone the neat things ZFS supports: snapshots, writable snapshots (clones), simple disk management, protection against hardware and firmware errors, etc. Rather than discuss some other neat feature of ZFS or do some extreme performance demos, I thought it might be interesting to mull over some of the possible implications of this new technology for the rest of Solaris and for other applications. We don't support booting from ZFS quite yet, so some of the ideas below will have to wait a bit for implementation, but it's certainly time to start thinking about them. First of all, in years past we've been moving more and more to the "one giant filesystem" model for installing systems; it's just been too much of a PITA to anticipate how much space we might need for root, var, opt, usr, etc.... but now, with ZFS divorcing space allocation from filesystem boundaries, it's easy to use separate filesystems to make administration easier. It's ok to use lots of filesystems: they're essentially free, and since snapshots are taken at the filesystem level, filesystems are also the "undo" boundary. Clearly, we don't want mail delivery or the growth of logs to prevent one from rolling back an ill-considered change to other parts of the system, so separating directories containing files that are modified "automatically" from those containing binaries or configuration files seems like a good idea. Once this separation occurs, using zfs rollback to undo the effects of changes such as an ill-considered patch or administrative action becomes simple. It also appears that Live Upgrade could be a lot easier with ZFS: just snapshot and clone the filesystems being upgraded. Perhaps we should always take snapshots before adding a patch... hmmm, given that we have ZFS, how would we redesign patching if we could always use ZFS for root filesystems?
Another area of possible change is initial installation. Right now we use bzip2 compression on each package on the install media to squeeze our packages into the smallest possible footprint; this has been needed in order to fit localization information onto the first CD. Since we access the CD/DVD as a filesystem and then uncompress each package as it's being installed, we often have trouble reading data from the install media fast enough to keep the device spun up at anywhere near full speed, esp. for small packages. With ZFS compression, we could store a ZFS filesystem image on the CD/DVD, stream that onto a ZFS filesystem on the hard disk in one (fast) shot, and then install packages from the hard disk as needed. Afterwards the extra packages could be deleted, or left there to facilitate later installation of additional components; disk space is amazingly cheap these days.

Another area of interest is providing backups. ZFS makes it very easy to determine the differences between snapshots; this makes incremental backups of even very large, slightly modified filesystems inexpensive in terms of disk and CPU utilization. It sure would be nice if my Ferrari laptop could just upload its changes since the last backup automatically when I plugged it into the building nets; yes, I could do that today, but without the smarts of ZFS, traversing a 50GB data set to see what has changed just isn't very practical on slow laptop drives. There are more ideas to consider, but it's clear that today is just the beginning of the ZFS revolution; ZFS will change the way Solaris works.

Posted at 09:00AM Nov 16, 2005 by barts in General
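Part of what makes those incremental backups cheap is that every ZFS block pointer records the transaction group (txg) in which the block was written, and under copy-on-write a change to any block also rewrites all of its ancestors. A walk looking for changes since a snapshot can therefore skip any subtree born at or before the snapshot's txg without reading it. The toy model below illustrates only that pruning; the struct is hypothetical and far simpler than ZFS's real block pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy block tree. Under copy-on-write, rewriting a block also rewrites
 * every ancestor, so a parent's birth txg is never older than its children's.
 */
struct block {
	uint64_t birth_txg;		/* txg in which this block was written */
	struct block **children;	/* NULL for a leaf (data) block */
	size_t nchildren;
};

/*
 * Count data blocks written after `since_txg`, descending only into
 * subtrees that can contain such blocks. `*visited` is incremented for
 * every block actually examined.
 */
size_t
changed_leaves(const struct block *b, uint64_t since_txg, size_t *visited)
{
	size_t changed = 0;

	(*visited)++;
	if (b->birth_txg <= since_txg)
		return (0);		/* nothing below is newer: prune */
	if (b->nchildren == 0)
		return (1);		/* a data block written since then */
	for (size_t i = 0; i < b->nchildren; i++)
		changed += changed_leaves(b->children[i], since_txg, visited);
	return (changed);
}
```

In a tree where one half was last written before the snapshot, the walk looks at that half's top block, sees its old birth txg, and never reads anything beneath it.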
Friday Aug 05, 2005
libMicro opensourced
I've been busy with libMicro lately. libMicro is a set of portable OS (kernel/library) benchmarks developed as part of the Solaris 10 performance effort. I've been working on a set of changes for a while to fix up some of the statistics, improve repeatability and prepare for open sourcing under the CDDL license. You can find libMicro in the performance community at opensolaris.org. There's still a lot to do on libMicro, starting with documentation on how to add new benchmarks and how the whole thing works... but that will have to wait until I'm back from a brief vacation.

Posted at 10:12AM Aug 05, 2005 by barts in Performance
Wednesday Jun 15, 2005
Most common recent benchmarking mistake
Most frequently asked performance question of late: why is this trivial piece of code slower on Solaris than on Linux, with both OSes running on the same Opteron box? Often this is because the default compilation mode of /usr/sfw/bin/gcc (or Studio, for that matter) on Solaris always produces a portable 32-bit binary; 64-bit Solaris x86 isn't treated as a different architecture, since all the 32-bit programs still work just fine. So

    gcc -o foo foo.c

produces a 32-bit binary on Solaris amd64, but a 64-bit binary when compiled on 64-bit Linux. No wonder there's a performance difference! To get 64-bit compilation using gcc, simply use the -m64 flag:

    gcc -o foo foo.c -m64

When in doubt, specify what you want; you'll more likely get what you need.

Posted at 08:40PM Jun 15, 2005 by barts in Performance
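When in doubt, a program can also report which data model it was actually built for. Under ILP32 (what the plain gcc -o foo foo.c above produces on Solaris) pointers and longs are both 32 bits; under LP64 (gcc -m64, or the default on 64-bit Linux) both are 64 bits. A minimal sketch:

```c
#include <assert.h>
#include <limits.h>

/* Width of a pointer in bits: 32 for an ILP32 binary, 64 for LP64. */
int
pointer_bits(void)
{
	return ((int)(sizeof (void *) * CHAR_BIT));
}

/* Width of long in bits; it grows from 32 to 64 under LP64. */
int
long_bits(void)
{
	return ((int)(sizeof (long) * CHAR_BIT));
}
```

Printing pointer_bits() from main() in each build makes the difference obvious before any timing is done.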
Tuesday Jun 14, 2005
Doing the Jitter Bug
One of the nice things about finally having OpenSolaris properly launched is that I can share Solaris source code with everyone when talking about Solaris. Since Solaris is the code, not being able to show people the code has been like having a discussion about various aspects of the flavors of wine without actually tasting any - possible perhaps, but a rather dry (heh) and uninteresting narrative....
Like most performance problems I run into, this looked like a job for DTrace. As Bryan has pointed out many times, DTrace is best at quickly evaluating various hypotheses... so I needed some testable ideas as to what might be going on. Well, what could make the program's timers fire late?
Time for some hypothesis testing. First of all, the test program clearly created a processor set containing CPU 1 and placed itself as the only thread in that set. It even disabled interrupts on that CPU. A quick check with a script using the sched provider's on-cpu probe verified that there was only a single thread (the expected one) running on CPU 1. Hmmm. intrstat verified that CPU 1 wasn't getting any device interrupts, either. So what's happening here? I had some experience with timekeeping on Solaris, but not enough to know offhand where to look for this problem. Confused? DTrace will help! A quick change to the above script lets us use the enqueue probe in the sched provider to discover exactly how our process is made runnable when the timer expires:

    sched:::enqueue

The output is pretty straightforward; lots of entries look like this:

    1   2561   setbackdq:enqueue

This gave me a place to get started.... timer_fire is clearly important, so let's instrument it and see when it fires:

    #pragma D option quiet

This script just prints a line with a timestamp in microseconds since we started, followed by the number of microseconds by which the interval exceeded the desired 20 msec. Interestingly enough, the output looks like this:

    13822657: interval is 0 usecs late

Whoops! Turns out our test program was leading us down the garden path: one timer firing appears to be early, and the next one makes up the difference, so the net error is 0 over time. Given the comments in cyclic.c, this isn't that surprising, but it does give us a place to start looking. At this point, I noticed that I could suspend the test program and still get the same intermittent late behavior. How often are we late?

    #pragma D option quiet

Interesting: once a second. Reading the cyclic.c source code, we see that cyclic_fire handles all the interesting timer work. Let's trace cyclic_fire in addition to timer_fire:

    2248600: interval is 0 usecs late

So did we misprogram the clock, or is something else going on? On line 929 of cyclic.c, we see that we reprogram the interrupt source through a function pointer. Those are a pain.
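The early/late pattern can be reproduced with arithmetic alone. The sketch below computes each interval's deviation from the nominal period from a list of firing times, which is the quantity the timer_fire script reports. The timestamps in the test are made up: one firing 80 usecs early followed by one that makes up the difference, so individual intervals are off while the total error returns to 0.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Given n firing times (usecs) of a periodic timer, store in late[i] how
 * far the interval between firings i and i+1 deviates from `period_us`
 * (positive = late, negative = early). late[] needs room for n - 1 entries.
 * Returns the sum of the deviations, which equals the drift of the last
 * firing from where a perfect timer starting at fire_us[0] would be.
 */
long
interval_lateness(const long *fire_us, size_t n, long period_us, long *late)
{
	long total = 0;

	assert(n >= 2);
	for (size_t i = 0; i + 1 < n; i++) {
		late[i] = (fire_us[i + 1] - fire_us[i]) - period_us;
		total += late[i];
	}
	return (total);
}
```

With firings at 0, 19920, 40000 and 60000 usecs and a 20 msec period, the intervals are 80 usecs early, 80 usecs late and on time, and the cumulative error is 0.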
Dropping into mdb:

    # mdb -k

So we need to trace calls to cbe_reprogram:

    #pragma D option quiet

    3839612: cyclic_fire: 80 usecs late

Ahaha! Turns out we're late all of the time: normally by 80 usecs when there's a full 20 msec between cyclic firings, but by less when the deadman cyclic has fired recently. Computing the error as a percentage of sleep time, we see that it is a relatively constant 0.4% of the interval slept (80 usecs out of 20 msec); it's the varying time between firings (due to multiple cyclics with different periods) that's responsible for the jitter. Ok, so what is cbe_reprogram doing? Rats: it calls through another function pointer. This time we'll use DTrace rather than mdb to figure out where we're going:

    cbe_reprogram:entry

Ok, now let's look at the source for apic_timer_reprogram. This is introducing a 0.4% error... I wonder what the value of apic_nsec_per_tick really is?

    # echo 'apic_nsec_per_tick/D' | mdb -k

That's it. We're suffering from an imprecise conversion between nsecs and APIC ticks; the appropriate value cannot be represented as an integer accurately enough to keep excessive jitter from appearing. Since the deadman cyclic runs every second, if we want to keep worst-case jitter due to quantization errors down to 1 usec, we'll need 1 ppm (1 usec in 1 sec) resolution on the conversion factor. This also explains why the bug is sometimes invisible: if a 20 msec cyclic always fires just before the deadman cyclic does, there is almost no change in the amount of time between cyclic firings. At this point I filed a bug, 6266961. The fix involves rewriting the conversion to use a different factor and redoing the calibration of the APIC timer to use the kernel's idea of hi-res time rather than the PIT timer.

    /*

After I get this finished up, it's back to malloc... more about that later on.
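To see how an integer nanoseconds-per-tick factor yields a steady fraction-of-a-percent error, consider the sketch below. The timer frequency used in the test, 3,984,222 Hz, is hypothetical and is not the frequency of the machine in question: it was picked because its true tick period, about 250.99 nsec, truncates to 250, an error of about 0.4%. The 32.32 fixed-point version shows one way to get well under the 1 ppm needed; it is not necessarily what the fix for 6266961 does.

```c
#include <assert.h>
#include <stdint.h>

#define	NANOSEC	1000000000ULL

/*
 * Convert nanoseconds to timer ticks via a truncated integer
 * nanoseconds-per-tick factor. Truncation makes each tick look slightly
 * short, so too many ticks are programmed and the timer fires late.
 */
uint64_t
ticks_integer(uint64_t nsec, uint64_t hz)
{
	uint64_t nsec_per_tick = NANOSEC / hz;	/* truncated */

	assert(nsec_per_tick > 0);
	return (nsec / nsec_per_tick);
}

/*
 * Same conversion with a 32.32 fixed-point ticks-per-nanosecond factor.
 * Its relative error is below 1e9 / (hz * 2^32), i.e. well under 1 ppm
 * for MHz-range timers. Requires hz < 2^32 and nsec * scale < 2^64.
 */
uint64_t
ticks_fixed_point(uint64_t nsec, uint64_t hz)
{
	uint64_t scale = (hz << 32) / NANOSEC;	/* ticks per nsec, 32.32 */

	return ((nsec * scale) >> 32);
}
```

For a 20 msec interval at that hypothetical frequency, ticks_integer() asks for 80000 ticks where 79684 are correct; the 316 extra ticks of about 251 nsec each add up to roughly 79 usecs of lateness on every interval, while ticks_fixed_point() returns 79684.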