The view from the Engine Room
Fattening packages - supporting multiple variants in a single package
Dealing with parts of a package
Traditionally, packaging systems have placed optional components of a package in separate packages, and established conventions for naming such components, such as -localization, -locale, -devel, -doc, etc. This method of ad-hoc decomposition makes it more difficult for GUI programs to offer the appropriate choices when selecting components, makes the introduction of new optional components difficult and makes installing documentation after the fact a painful process.
Packaging options also exist which are mutually exclusive; the typical example is which architecture the package supports. One cannot select both sparc and x86, since the two architecture's files collide - /usr/bin/ls is either a sparc binary or a i386 binary. Other examples of such colliding options include debug vs non-debug kernel binaries, and global vs nonglobal zones.
Initially we started solving the problem of selecting parts of a package using more or less ad-hoc client-side filtering, but it became clear that the publisher of the package needed to be able to describe what components intersected or were optional, and that we needed to be able to clearly distinguish between the two cases. Thus, we refer to options that may be selected or not selected in any combination, such as various locales, documentation, etc., as facets. Options which must be mutually exclusive are called variants. Both variants and facet appear as tags on IPS actions, and result in the action being selected or de-selected for installation.
An example of variant tags on an action is:
dir group=sys mode=0755 opensolaris.zone=global owner=root path=kernel/drv/amd64 variant.arch=i386 variant.opensolaris.zone=global
Here this directory action is tagged with variant.arch=i386, indicating that this directory should only be installed on i386 machines (Solaris doesn't distinguish between 32 and 64 bit architectures) and variant.opensolaris.zone=global, which shows that this directory should only be installed in the global zone and not in local zones as those don't contain kernel components.
Note that any action in the repository can be so tagged, including dependencies.
During installation planning, all actions not applicable to the current image are removed from consideration. If a user with a down-rev version of IPS that doesn't understand variants attempts to install a fat package, either an exception will occur indicating that duplicate actions have been found, or if a set action is found with an architecture tag an assertion will trigger. In either case the solution is either to stop attempting to install fat packages, or upgrade your version of IPS to the latest available for your build.
The initial implementation in build 106 supports only variants (facets are coming later on). To faciliate detection of attempts to install packages on the wrong architecture, our publication code now inserts set actions into packages to indicate what architectures the package supports:
set name=variant.arch value=sparc value=i386
Publishing Fat packages
We produce fat packages by publishing separately on each architecture, and then merging the resulting manifests. Actions which are identical for both architectures are not tagged w/ variant tags; those that differ or exist only for one architecture are tagged. Packages that exist only on one architecture are tagged to indicate that as well. This step can be repeated for different variants, and more than two variants types can be combined at once, so packages containing debug kernel binaries could be merged w/ non-debug versions, and then merged across several architectures. An example of a merged manifest can be seen here. Of course, those using pkgsend to create packages can simply insert the appropriate tags directly. For the curious, the code to merge packages can be seen here. Further work is needed to improve the performance of publishing the merged packages; allowing multiple repositories to share the filestore would help greatly here.
Posted at 03:56PM Feb 04, 2009 by barts in packaging and patching | Comments[9]
Indiana Preview 2 - my new desktop
This weekend I decided to bite the bullet and convert my desktop to Indiana Preview 2. Since unlike most people at Sun my desktop machine also receives my email, and hosts both my home directory and calendar server, the switch-over needed some quiet concentration on my part to insure nothing important got left behind.
The installation of Preview 2 (now available here) went smoothly – not surprising, since I'd tested many trial builds on the same machine, a 2 x 2.8GHz Ultra 40. After installation completed and the machine rebooted, I created a second zpool with the two remaining drives; I use this for my home directory, mail spool, tunes and pkg server. This isolates me from any difficulties with the new installer or possible future upgrade problems. ZFS of course makes this all very easy:
: barts@cyber[227]; zfs list NAME USED AVAIL REFER MOUNTPOINT rpool 2.63G 224G 49.5K /rpool rpool@install 0 - 49.5K - rpool/ROOT 2.62G 224G 18K none rpool/ROOT@install 0 - 18K - rpool/ROOT/preview2 2.62G 224G 2.09G legacy rpool/ROOT/preview2@install 66.8M - 1.94G - rpool/ROOT/preview2/opt 483M 224G 483M /opt rpool/ROOT/preview2/opt@install 77K - 3.61M - rpool/export 2.44M 224G 19K /export rpool/export@install 15K - 19K - rpool/export/home 2.41M 224G 2.39M /export/home rpool/export/home@install 19K - 21K - zfs 177G 51.8G 21K /zfs zfs/home 133G 51.8G 133G /export/home/cyber zfs/local 291M 51.8G 291M /usr/local zfs/mail 110M 51.8G 110M /var/mail zfs/music 43.4G 51.8G 43.3G /zfs/music zfs/music@2.1.2008 2.54M - 42.3G - zfs/repo 18K 51.8G 18K /zfs/repo : barts@cyber[228];
I then got to thinking about having a mirrored root pool; I hunted up one more 250GB drive, hot plugged it into the machine (love those SATA features) and used cfgadm -al and cfgadm -c to get Solaris to find the drive. Zpool attach took care of establishing the mirror; the mirror was resilvered in just a few minutes since ZFS knows what's data and what's empty space.
Now I needed dovecot, since I run an IMAP server to allow remote access of my mail. Off to dovecot.org for a tarball, download, configure and hmm – no C compiler. pkg search -r gcc told me that I needed SUNWgcc installed, so pkg install SUNWgcc grabbed the compiler, assembler and binutils. Cool. Run configure again and whoops – no headers! pkg search -r stdlib.h said I needed SUNWhea, so pkg install SUNWhea and I was compiling dovecot.... For a quick look at the packages available in Indiana so far browse over to http://pkg.opensolaris.org. I wrote this blog post using openoffice – which you'll find in a package called openoffice.
Indiana and IPS are usable, but we've still got a lot work to do:
The packages need to be re-factored into smaller pieces and renamed and tagged to better support minimization, searching, etc.
The command line pkg interface needs some attention...
and here's still much work to be done in constraints, simplifying pkg publishing and there are lots of bugs, tracebacks and other issues to address.
However, it's coming together – and being able to upgrade from preview2 from preview1 without running any postinstall scripts helps use feel better about the assertions that started the project....
Posted at 07:23PM Feb 12, 2008 by barts in General |
A programmer's ABCs
Several years ago, before blogging, I cons'd up a programmer's ABC for Stephen Hahn's first child, Benjamin. I'd forgotten about this until Stephen mentioned last week and mailed me a link to the image; I'd lost the original. It's a little SPARC-centric, but so was I at the time:
A Programmer's ABC
A is for algorithm, patented or not.
B is for break, to jump out of this rot.
C is for continue, to jump to the top of one's loops.
D is for default, the case that handles the oops.
E is for else, the predicate's inversion.
F is for for, of the loops the most popular version.
G is for goto, a jump oft considered dubious.
H is for hardware, for profits salubrious.
I is for if, a conditional statement.
J is for jmpl, an indirect jump causing performance abatement.
K is for thousands in powers of two.
L is for long, whose size recently grew.
M is for membar, whose use can confound.
N is for NOP, which in delay slots often is found.
O is for operator, whose overloading is oft unsupportable.
P is for pragma, with usage unportable.
Q is for quadword, the largest of all.
R is for return, when we make the stack not so tall.
S is for switch, a computed goto for which we all yearn.
T is for trap, from which we may never return.
U is for unsigned, to avoid two's complement extension.
V is for volatile, whose presence incites apprehension.
W is for WSTATE register, for window traps most topical.
X is for XOR, bitwise not logical.
Y is for Y register, deprecated for years.
Z is for zero, whose dereferencing everyone fears.
Posted at 08:52PM Nov 05, 2007 by barts in General | Comments[3]
Rethinking patching
As Stephen mentioned recently, several of us have been thinking about revising the way we manage software change on Solaris. I've been particularly focused on the difficulties Sun and it's customers have with the patching process, and the kinds of changes we need to make as a result in our technology and development processes.
Today, most customers don't run OpenSolaris; they run a supported version of Solaris such as Solaris 8, 9 or 10. A supported release means that someone will answer the phone, and that patches for problems are available.
Patches are a separate software change control mechanism distinct from package versions in Solaris. Each patch may affect portions of several packages; patches are intended to include all the files necessary to fix one or more problems, either directly or by specifying dependencies. If a patch affects packages which are not installed on this system (typically because it has been minimized), those portions of the patch are not installed. If the administrator later adds the missing package, he must remember (good luck) to re-apply the patches since the packaging code knows nothing of patches.
Customers are today free to install which ever patches they feel are appropriate for their environment, consistent with the built-in dependency requirements. This customization is a technique I refer to as Dim Sum patching, and is a major cause of patching difficulties. Many customers pick and choose amongst the thousands of patches available for Solaris 10, for example; this means that customers are often pioneering new configurations. Note that each Solaris release consists of a single source base; all Solaris 10 updates, for example, are but snapshots of the same Solaris patch gate at different times. As a result, the developers are working on a cumulative set of all previous changes; when a new patch is created, the files in the patch not only contain the desired fix, but all previous fixes as well. Thus, the software change is constructed as a linear stream of change, but customers installs selected binaries from the various builds via patches.
When I've discussed the hazards of Dim Sum patching with customers, the reasons given are typically characterizable as :
- we don't need all those patches, we don't have those drivers loaded
- we're reducing downtime by not installing so many patches
- the less change, the less risk.
To these, I reply:
- If you don't need those drivers, then remove them them w/ pkgrm rather than leaving them in an unpatched state awaiting the introduction of new hardware or software to expose problems. Minimization, not spotty patching, is the answer. This is akin to disposing of an unused car, rather than simply leaving it unmaintained.
- Today, you should be using Live Upgrade and patching the alternate boot image to reduce downtime. This allows machines in production to be safely patched, and will not leave the system in an inconsistent or unbootable state in the case of power failure during patching operations. In the future, the new packaging system will always patch a clone of the current system to avoid the potential for disaster in case of power failure.
- Our experience has been that customers running all of the changes in an update are generally far less likely to experience problems than those who select only the fixes and features that appeal to them, and hope that our QA processes found all hidden dependencies on previous changes.
For our new packaging system, there is a powerful incentive to eliminate Dim Sum patching: since we wish to use a single version numbering space for any package, attempting to support fine-grain Dim Sum patching would require very small packages - affecting the performance of packaging operations, and significantly increasing the workload of OpenSolaris developers. Instead, we can set package boundaries according to what makes sense for minimization purposes.
This implies that future (post Solaris 11) patches will be completely cumulative (aside from some exceptions for urgent security fixes), at least for the core OS. Your system will be able to determine what is needed to bring the installed software up to the desired revision level automatically; needing to pick and choose patches will be a thing of the past.

Posted at 03:56PM Jul 25, 2007 by barts in packaging and patching | Comments[10]
Friday afternoon SPARKFUN
Mike Pogue mentioned he was considering using a USB-connected PIC controller to drive some stepper motors from a PC. He'd ordered out a USB Bit Wacker , which plugs right into a USB port and looks like a serial port to the host OS, and gives you 14 ports that can either be digital in, out or analog in. The host sends simple ASCII commands down, and the Bit Whacker sends back status/port data. Cool! Earlier this afternoon, he brought it in and we plugged it into my Tyan whitebox running Nevada 64a using a mini-USB cable. I was running a tail -f on /var/adm/messages, and saw:
Jun 1 15:29:34 cyber usba: [ID 912658 kern.info] USB 2.0 device (usb4d8,a) operating at full speed (USB 1.x) on USB 1.10 root hub: communications@1, usb_mid6 at bus address 3
Jun 1 15:29:34 cyber usba: [ID 349649 kern.info] Microchip Technology Inc. CDC RS-232 Emulation Demo
Jun 1 15:29:34 cyber genunix: [ID 936769 kern.info] usb_mid6 is /pci@0,0/pci1022,7460@6/pci1022,7464@0/communications@1
Jun 1 15:29:34 cyber genunix: [ID 408114 kern.info] /pci@0,0/pci1022,7460@6/pci1022,7464@0/communications@1 (usb_mid6) online
Jun 1 15:29:34 cyber usba: [ID 349649 kern.info] usba: no driver found for interface 1 (nodename: 'data') of Microchip Technology Inc. CDC RS-232 Emulation Demo
Jun 1 15:29:34 cyber usba: [ID 912658 kern.info] USB 2.0 interface (usbif4d8,a.config1.0) operating at full speed (USB 1.x) on USB 1.10 root hub: modem@0, usbsacm3 at bus address 3
Jun 1 15:29:34 cyber usba: [ID 349649 kern.info] Microchip Technology Inc. CDC RS-232 Emulation Demo
Jun 1 15:29:34 cyber genunix: [ID 936769 kern.info] usbsacm3 is /pci@0,0/pci1022,7460@6/pci1022,7464@0/communications@1/modem@0
Jun 1 15:29:34 cyber genunix: [ID 408114 kern.info] /pci@0,0/pci1022,7460@6/pci1022,7464@0/communications@1/modem@0 (usbsacm3) online
Well, that looked promising! Sure enough,
% ls -l /dev/term
total 6
lrwxrwxrwx 1 root root 78 Jun 1 15:29 0 ->
../../devices/pci@0,0/pci1022,7460@6/pci1022,7464@0/communications@1/modem@0:0
lrwxrwxrwx 1 root root 29 Dec 21 14:43 a -> ../../devices/isa/asy@1,3f8:a
lrwxrwxrwx 1 root root 29 Dec 21 14:43 b -> ../../devices/isa/asy@1,2f8:b
%
Ok. I added a quick entry to /etc/remote:
sparkfun:\
:dv=/dev/term/0:br#38400:el=^C^S^Q^U^D:ie=%$:oe=^D:
# tip sparkfun
connected
UBW FW D Version 1.4.1
At this point I used various commands to read the inputs/buttons, flash the LED, etc. If needed, one can also write and download alternate firmware directly to the on-board PIC18F2455; there's a built-in bootloader that cannot be flashed over, so a PIC programmer is never needed. This unit seems ideally suited to wiring up coffee makers, sprinkler systems, etc, so they can be connected to a host computer. Looks like I could write some firmware to directly drive servo motors using the Bit Whacker to count encoder pulses and close the velocity loop; the host computer would read say 100Hz update of position and update the desired velocity of the motor... and w/o any firmware work at all, it seems directly applicable to our next Burning Man project, which will have an OpenSolaris computer controlling things for the first time on the playa AFAIK... more about that later.
Here's a picture of the device:

(The picture was taken w/ my Razr 3i and uploaded using Solaris as well.... )
Posted at 06:31PM Jun 01, 2007 by barts in General |
New home server
Building a new household server....
Like a lot of families these days, our household IT infrastructure has had to adapt as we all became more and more fond of computers for work, school and recreation. With digital photograpy, ripping hundreds of CDs, describing our various activities and travels on web pages, two teenagers and the heavy use of email, and the need to provide stable storage for homework and digital art, we've been playing catchup for a while. This led us directly to designing and building a new server to handle storage of all the digital media, web-serving and email. At the same time, I was tired of the whine from the surplus X1 rack-mount server I had stuffed in the closet, and decided to merge my home desktop and server together to reduce power consumption. With some thinking we arrived at the following hardware design:
- Tyan 2865
motherboard - 4 SATA ports that worked w/ Solaris in compatibility
mode, sound and an on-board nvidia gigabit ethernet chip, both
supported by Solaris as these were the motherboards in the original Ultra 20s.
- 2 x 2.6 GHZ AMD CPU; this was rather more CPU than I originally planned, but sometimes things turn up...
- 1 GB ECC RAM, later upgraded to 2 GB.
- 4 x 500 GB Seagate drives for redundant data storage, plus a leftover 40GB IDE drive for root and a DVD-RW drive for movies and burning DVDs.
- NVIdia 6600 GT video card - available for a reasonable price and
with good 3D graphics for playing with those video game ports to
Solaris. With the bundled NVidia OpenGL drivers in 55, the cool
screen savers work out of the box, too!
- 450 W Antec power supply
- inexpensive second NIC for our external interface; I happened to
have an $7 Gb Airlink card which uses the Realtek part; this works out of
the box as well.
- A small case - I used an Antec Lanboy as they're small and
light. In retrospect, a larger case such as an Antec P180
would have been a better choice. I do like the disk cooling on
this design, though - the 120 MM (blue, heh) fan keeps the drive temps
very moderate.
- Solaris Nevada build 55
- the bundled Bind 9 for handling internal, DMZ and external views
of our various domains.
- the bundled Apache 2.2 handles our various web pages and our 3 virtual domains.
- the bundled dhcp server handles DHCP service. I configured
this with dhcpmgr; since all the other machines in the house use DHCP there's little admin work to do anywhere else. This is a real boon during the kid's LAN parties; things just work.
- The bundled Samba server provides NAS for the various Microsoft
and Mac clients in the house. With the latest OS-X builds, NFS
works well enough to use; we may switch. I use swat to configure
Samba.
- Printing is provided by the stock lp and ipp-listeners aided by
Samba for the kid's microsoft game platforms. I
configured this with printmgr. We print to an HP1200 that has
worked flawlessly for years w/ just a couple of toner changes.
- the bundled IPFilter provides carefully controlled port filtering
on both the external and internal network interfaces.
- Dovecot (compiled from
source) provides IMAP service for both our LAN and (with SSL) over the
Internet. This service is under SMF control so it starts
automatically when the machine boots. I chose Dovecot since it also cooperates with Postfix to handle SMTP authentication.
- Postfix (compiled from
source + Berkeley DBM libraries). This provides SMTP service with
TLS and AUTH needed to control relaying by family members using
external connections. The anti-SPAM features are also excellent
and pretty easy to setup given the numerous how-to guides available on
the web. With Postfix I feel actually in control; I always had to rely on John Beck to customize my sendmail configuration.
- The unbundled SUNcry{r} encryption packages were needed to get the
bundled OpenSSL libraries working correctly w/ TLS and Thunderbird. Darren Moffat promises to have the high grade support bundled into OpenSolaris soon.
So far things are working very well. The 4x500 GB drives are in a
RAID-Z configuration with ZFS; we can sustain 120 MB/sec or so reading
or writing to the 20 odd filesystems configured on the single
pool. Samba works pretty well; we managed to feed 10 different
files to 10 different clients at nearly 100Mb/sec apiece during one of
the kids' Lan parties. Dovecot in particular seems very fast on
top of ZFS, and other than a glitch with my forgetting to set the maximum user mailbox size Postfix has been trouble free. I use this machine as my desktop as well in the evenings.
What are we still thinking about changing? Well:
- My son wants a Ventrilo
audio server for his Guild Wars games. Another manifest, some
more testing.
- I may add greylisting to our SPAM prevention techniques; this
will require a bit more configuration work. We reject perhaps 700 SPAM emails per day for our small number of email users; we each still get a handleful each day.
- I'm going to add a Slim Devices
Squeezebox to our stereo system; we outgrew the 200 CD changer some
time ago. That's another server to run, but we get access to the
65+ GB of music anywhere in the house.
- I'm thinking about adding a separate Ethernet interface to plug in the wireless router This will let me experiment with more secure ways to configure wireless services.
- When ZFS root and boot is supported , I'll configure another pair
of drives for a mirrored root pool using a SI3114 card. Right now we just back up the small root partition to our ZFS pool.
Posted at 03:53PM Jan 17, 2007 by barts in General | Comments[3]
Some thoughts on ZFS's impact on Solaris
So ZFS is now available and we've put together lots of blogs and demos to show everyone the neat kinds of things ZFS supports - snapshots, writable snapshots (clones), simple disk management, protection against hardware and firmware errors, etc. Rather than discuss some other neat feature of ZFS or do some extreme performance demos, I thought it might be interesting to mull over some of the possible implications of this new technology on the rest of Solaris and other applications. We don't yet support booting ZFS quite yet, so some of the ideas below will have to wait a bit for implementation - but it's certainly time to start thinking about them.
First of all, in years past we've been moving more and more to the "one giant filesystem" model for installing systems; it's just been to much of a PITA to anticipate how much space we might need for root, var, opt, usr, etc.... but now with ZFS divorcing space allocation from filesystem boundaries, it's easy to use separate filesystems to make administration easier. It's ok to use lots of filesystems - they're essentially free, and since snapshots are at the filesystem level, filesystems also the "undo" boundary. Clearly, we don't want to delivery of mail or the growth of logs from inhibiting one from rolling back a ill-considered change to other parts of the system, so separating directories that contain files that are modified "automatically" from those that containing binaries or configuration files seems like a good idea.
Once this separation occurs, using zfs rollback to undo the effects of changes such as a ill-considered patch or administrative actions becomes simple. It also appears that Live Upgrade could be a lot easier with ZFS - just snapshot and clone the filesystems being upgraded. Perhaps we should take snapshots always before adding a patch... hmmm, given that we have ZFS, how would we redesign patching if we could always use ZFS for root filesystems?
Another area of possible change is initial installation. Right now we use bzip2 compression on on each package on the install media to compress our packages into the smallest possible footprint; this has been needed in order to fit localization information onto the first CD. Since we access the CD/DVD as a filesystem and then uncompress the package as it's being installed, we often have trouble reading data fast enough from the install media to keep the device spun up at anywhere near full speed, esp. for small packages. With zfs compression, we could store a ZFS filesystem image on the CD/DVD and stream that onto a ZFS filesystem on the hard disk in one (fast)shot, and then install packages from the hard disk as needed. Afterwards the extra packages could be deleted, or left there to facilitate later installation of additional components; disk space is amazingly cheap these days.
Another area of interest is providing backups. ZFS makes it very easy to determine the differences between snapshots; this makes doing incremental updates of even very large slightly modified filesystems inexpensive in terms of disk and CPU utilization. It sure would be nice if my Ferrari laptop could just upload it's changes since the last backup automatically when I plugged it into the building nets; yes, I could do that today, but without the smarts of ZFS, traversing a 50GB data set to see what has changed just isn't very practical on the slow laptop drives.
There are more ideas to consider, but it's clear that today is just the beginning of the ZFS revolution; ZFS will change the way Solaris works.
Posted at 09:00AM Nov 16, 2005 by barts in General | Comments[0]
libMicro opensourced
I've been busy with libMicro lately. LibMicro is a set of portable OS (kernel/library) benchmarks developed as part of the Solaris 10 performance effort. I've been working on a set of changes for a while to fix up some of the statistics and improve repeatability and prepare for open sourcing under the CDDL license.
You can find libMicro in the performance community at opensolaris.org.There's still a lot to do on libMicro, starting with documentation on how to add new benchmarks, and how the whole thing works... but that will have to wait until I'm back from a brief vacation.
Posted at 10:12AM Aug 05, 2005 by barts in Performance | Comments[0]
Most common recent benchmarking mistake
Most frequently asked performance question of late: Why is this trivial piece of code slower on Solaris than on Linux with both OSes running on the same Opteron box?
Often this is because the default compilation mode using /usr/sfw/bin/gcc (or Studio, for that matter) on Solaris is always a portable binary - and 64 bit Solaris x86 isn't seen as a different architecture, since all the 32 bit programs still work just fine. So
gcc -o foo foo.c
produces a 32 bit binary on Solaris amd64, and a 64 bit binary when compiled on a 64 bit Linux. No wonder there's a performance difference! To get 64 bit compilation using gcc, simply use the -m64 flag:
gcc -o foo foo.c -m64
When in doubt, specify what you want - you'll more likely get what you need.
Posted at 08:40PM Jun 15, 2005 by barts in Performance | Comments[3]
Doing the Jitter Bug
One of the nice things about finally having OpenSolaris properly launched is that I can share Solaris source code with everyone when talking about Solaris. Since Solaris is the code, not being able to show people the code has been like having a discussion about various aspects of the flavors of wine without actually tasting any - possible perhaps, but a rather dry (heh) and uninteresting narrative....
I was happily (as I don't get to write nearly enough code) whacking
away on a revised version of malloc
to fix some longstanding performance and scalabilty issues when one of
our marketing folks darted in
asking about telcos, real time, and latency bubbles [I'll write about
our malloc findings later on]. After getting the relevant email trail
and a test
program from the customer contact, it turned out that indeed the
customer was seeing a real problem - 20 millisecond timers was
apparently firing up to 60 microseconds late occasionally on his
Opteron box, but no problem was seen on
SPARC machines. What's weirder, this problem appeared sometimes, but
not always
- any given instance of the test
program might display the problem, or it might not.
Hmmm. The customer's test program would carefully store
information about intervals between timer firings that exceeded 40
usecs + 20 msecs and report them every 20 seconds or so... and
when the problem appeared, it was happening once a second...
Like most performance problems I run into, this looked like a job
for DTrace.
As Bryan has
pointed out many times, DTrace is best at quickly evaluating
various hypotheses...so I needed some testable ideas as to what might
be going on. Well, what could make the program's timers fire late?
- Other threads or interrupts - perhaps these are interfering with our program somehow?
- Other timers firing on this cpu. Maybe some other system timers are causing problems on CPU 1?
Well, hypothesis testing time. First of all, the test program clearly created a processor set containing CPU 1 and placed itself as the only thread in that set. It even disabled interrupts on that CPU. A quick check with the following script:
on-cpuverified that there was only a single thread (the expected one) running on CPU 1. Hmmm. interstat verified that CPU 1 wasn't getting any device interrupts, either. So what's happening here? I had some experience with time keeping on Solaris, but not enough to know offhand where to look for this problem. Confused? DTrace will help! A quick change to the above script lets us use the enqueue probe in the sched provider to discover exactly how our process is made runnable when the timer expires:
/curcpu->cpu_id == 1/
{
trace(curthread);
}
sched:::enqueueThe output is pretty straightforward - lots of entries that look like this:
/curcpu->cpu_id == 1/
{
stack(10);
}
1 2561 setbackdq:enqueueThis gave me a place to get started.... timer_fire is clearly important, so let's instrument that and see when it fires:
TS`ts_setrun+0xf1
genunix`cv_unsleep+0x6e
genunix`setrun_locked+0x6d
genunix`eat_signal+0x89
genunix`sigtoproc+0x420
genunix`sigaddqa+0x4a
genunix`timer_fire+0xb9
genunix`clock_highres_fire+0x3f
genunix`cyclic_softint+0xb2
unix`cbe_low_level+0x14
#pragma D option quietThis script just prints a line with a timestamp in microseconds since we started, with the number of microseconds late (more than the 20 msec) interval that was desired.
BEGIN
{
start = timestamp;
last = timestamp;
}
fbt::timer_fire:entry
{
printf("%d: interval is %d usecs late\n",
(timestamp - start)/1000,
(int) (((timestamp - last) - 20000000))/1000);
last = timestamp;
}
Interestingly enough, the output looks like this:
13822657: interval is 0 usecs lateWhoops! Turns out our test program was leading us down the garden path; the problem is that one timer fire appears to be early, and the next one makes up the difference; the net error is 0 over time. Reading the comments in cyclic.c, this isn't that surprising but does give us a place to start looking. At this point, I noticed that I can suspend the test program and still get the same intermittant late behavior. How often are we late?
13842656: interval is 0 usecs late
13862657: interval is 0 usecs late
13882657: interval is 0 usecs late
13902657: interval is 0 usecs late
13922599: interval is -57 usecs late
13942657: interval is 57 usecs late
13962657: interval is 0 usecs late
13982657: interval is 0 usecs late
14002657: interval is 0 usecs late
#pragma D option quietInteresting - once a second.
BEGIN
{
start = timestamp;
last = timestamp;
}
fbt::timer_fire:entry
/(int) (((timestamp - last) - 20000000))/1000 != 0/
{
printf("%d: interval is %d usecs late\n",
(timestamp - start)/1000,
(int) (((timestamp - last) - 20000000))/1000);
}
fbt::timer_fire:entry
{
last = timestamp;
}
5974796: interval is -57 usecs late
5994854: interval is 56 usecs late
6974796: interval is -57 usecs late
6994854: interval is 56 usecs late
Reading the cyclic.c source code we see that cyclic_fire handles all the interesting timer stuff. Let's trace cyclic_fire in addition to timer_fire:
2248600: interval is 0 usecs lateSo did we misprogram the clock, or is something else going on? On line 929 ofcyclic.c, we see that we reprogram the interrupt source through a function pointer. Those are a pain. Dropping into mdb:
2263098: cyclic_fire
2268539: cyclic_fire
2268543: interval is -57 usecs late
2288597: cyclic_fire
2288600: interval is 57 usecs late
# mdb -kSo we need to trace calls to cbe_reprogram:
Loading modules: [ unix krtld genunix specfs dtrace ufs ip sctp usba
s1394 nca fctl lofs nfs audiosup random sppp ipc ptm logindmux ]
> ::cpuinfo
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 fffffffffbc1ede0 1b 3 0 59 no no t-0 ffffffff8584e220 mdb
1 ffffffff825cb800 b 0 0 -1 no no t-516001 fffffe8000275c80 (idle)
> ffffffff825cb800::print cpu_t cpu_cyclic
cpu_cyclic = 0xffffffff80f45600
> 0xffffffff80f45600::print cyc_cpu_t cyp_backend
cyp_backend = 0xffffffff803d65a8
> 0xffffffff803d65a8::print cyc_backend_t
{
cyb_configure = cbe_configure
cyb_unconfigure = 0
cyb_enable = cbe_enable
cyb_disable = cbe_disable
cyb_reprogram = cbe_reprogram
cyb_softint = cbe_softint
cyb_set_level = cbe_set_level
cyb_restore_level = cbe_restore_level
cyb_xcall = cbe_xcall
cyb_suspend = 0
cyb_resume = 0
cyb_arg = 0xffffffff825cb800
}
>
#pragma D option quiet
BEGIN
{
start = timestamp;
last = timestamp;
}
cbe_reprogram:entry
/curcpu->cpu_id == 1/
{
firetime = (arg1 - start)/1000;
printf("%d: cbe_reprogram: set timer to fire at %d\n",
(timestamp - start)/1000, firetime);
}
cyclic_fire:entry
/curcpu->cpu_id == 1/
{
now = (timestamp - start)/1000;
printf("%d: cyclic_fire: %d usecs late\n",
now, now - firetime );
}
fbt::timer_fire:entry
/curcpu->cpu_id == 1/
{
printf("%d: interval is %d usecs late\n",
(timestamp - start)/1000,
(int) (((timestamp - last) - 20000000))/1000);
last = timestamp;
}
3839612: cyclic_fire: 80 usecs lateAhaha! Turns out we're late all of the time; normally 80 usecs when there's a full 20 msec between cyclic firings, but less when the deadman cyclic has fired recently. Computing error as a percentage of sleep time, we see that our error is a relatively constant .4% of the interval slept; it's the varying time bewteen firings (due to multiple cyclics with different periods) that's responsible for the jitter. Ok, so what is cbe_reprogram doing? Rats, calling through another function pointer. Ok, this time we'll use DTrace to figure out where we're going rather than mdb:
3839614: cbe_reprogram: set timer to fire at 3859532
3839615: interval is 0 usecs late
3859613: cyclic_fire: 81 usecs late
3859614: cbe_reprogram: set timer to fire at 3879532
3859616: interval is 0 usecs late
3879612: cyclic_fire: 80 usecs late
3879614: cbe_reprogram: set timer to fire at 3894055
3879615: interval is 0 usecs late
3894114: cyclic_fire: 59 usecs late
3894115: cbe_reprogram: set timer to fire at 3899532
3899555: cyclic_fire: 23 usecs late
3899556: cbe_reprogram: set timer to fire at 3919532
3899558: interval is -57 usecs late
3919613: cyclic_fire: 81 usecs late
3919614: cbe_reprogram: set timer to fire at 3939532
3919616: interval is 57 usecs late
3939612: cyclic_fire: 80 usecs late
3939614: cbe_reprogram: set timer to fire at 3959532
3939616: interval is 0 usecs late
cbe_reprogram:entryOk, now let's look at the source for apic_timer_reprogram
/curcpu->cpu_id == 1/
{
self->trace = 1;
}
fbt:::entry,
fbt:::return
/self->trace/
{
}
cbe_reprogram:return
/curcpu->cpu_id == 1/
{
self->trace = 0;
}
1 -> cbe_reprogram
1 -> apic_timer_reprogram
1 -> gethrtime
1 -> tsc_gethrtime
1 <- tsc_gethrtime
1 <- gethrtime
1 <- apic_timer_reprogram
1 <- cbe_reprogram
This is introducing a .4 % error... I wonder what the value of apic_nsecs_per_tick really is?
ticks = (uint_t)((time - now) / apic_nsec_per_tick);
# echo 'apic_nsec_per_tick/D' | mdb -k
apic_nsec_per_tick:
apic_nsec_per_tick: 10
That's it. We're suffering from an imprecise conversion between nsecs and apic ticks; the appropriate value cannot be represented as an integer with this conversion accurately enough to keep excessive jitter from appearing. Since the deadman cyclic runs every second, if we want to keep worst case jitter due to quantization errors down to 1 usec we'll need 1 ppm resolution on the conversion factor.
This also explained why sometimes the bug isn't visible - if the a 20 msec cyclic always occurs just before the deadman cyclic fires, there is almost no change in the amount of time between cyclic firings.
At this point I filed a bug, 6266961. The fix involves rewriting the conversion to use a different factor and redoing the calibration of the apic timer to use the kernel's idea of hi-res time rather than the pit timer.
/*After I get this finished up, it's back to malloc... more about that later on.
* deal with resolution issues by converting between ticks and
* nanoseconds using 64 bit math.
* use multiplication cause it's faster; divide by power of 2
* for speed.
*/
static int64_t apic_ticks_per_128Mnsecs; /* ticks/128*1024*1024 nsecs */
#define APIC_TICKS_TO_NSECS(ticks) (((long long)(ticks) * 134217728LL)/\
apic_ticks_per_128Mnsecs)
#define APIC_NSECS_TO_TICKS(nsecs) ((long long)(nsecs) * \
apic_ticks_per_128Mnsecs / 134217728LL)
Technorati Tag: OpenSolaris
Technorati Tag: Solaris
Technorati Tag: DTrace
Technorati Tag: mdb
Wednesday Feb 04, 2009