The view from the Engine Room

pageicon Wednesday Feb 04, 2009

Fattening packages - supporting multiple variants in a single package

Dealing with parts of a package

Traditionally, packaging systems have placed optional components of a package in separate packages, and established conventions for naming such components, such as -localization, -locale, -devel, -doc, etc. This method of ad-hoc decomposition makes it more difficult for GUI programs to offer the appropriate choices when selecting components, makes the introduction of new optional components difficult and makes installing documentation after the fact a painful process.

Packaging options also exist which are mutually exclusive; the typical example is which architecture the package supports. One cannot select both sparc and x86, since the two architecture's files collide - /usr/bin/ls is either a sparc binary or a i386 binary. Other examples of such colliding options include debug vs non-debug kernel binaries, and global vs nonglobal zones.

Initially we started solving the problem of selecting parts of a package using more or less ad-hoc client-side filtering, but it became clear that the publisher of the package needed to be able to describe what components intersected or were optional, and that we needed to be able to clearly distinguish between the two cases. Thus, we refer to options that may be selected or not selected in any combination, such as various locales, documentation, etc., as facets. Options which must be mutually exclusive are called variants. Both variants and facet appear as tags on IPS actions, and result in the action being selected or de-selected for installation.

An example of variant tags on an action is:

dir group=sys mode=0755 opensolaris.zone=global owner=root path=kernel/drv/amd64
  variant.arch=i386 variant.opensolaris.zone=global

Here this directory action is tagged with variant.arch=i386, indicating that this directory should only be installed on i386 machines (Solaris doesn't distinguish between 32 and 64 bit architectures) and variant.opensolaris.zone=global, which shows that this directory should only be installed in the global zone and not in local zones as those don't contain kernel components.

Note that any action in the repository can be so tagged, including dependencies.

During installation planning, all actions not applicable to the current image are removed from consideration. If a user with a down-rev version of IPS that doesn't understand variants attempts to install a fat package, either an exception will occur indicating that duplicate actions have been found, or if a set action is found with an architecture tag an assertion will trigger. In either case the solution is either to stop attempting to install fat packages, or upgrade your version of IPS to the latest available for your build.

The initial implementation in build 106 supports only variants (facets are coming later on). To faciliate detection of attempts to install packages on the wrong architecture, our publication code now inserts set actions into packages to indicate what architectures the package supports:

set name=variant.arch value=sparc value=i386 

Publishing Fat packages

We produce fat packages by publishing separately on each architecture, and then merging the resulting manifests. Actions which are identical for both architectures are not tagged w/ variant tags; those that differ or exist only for one architecture are tagged.  Packages that exist only on one architecture are tagged to indicate that as well. This step can be repeated for different variants, and more than two variants types can be combined at once, so packages containing debug kernel binaries could be merged w/ non-debug versions, and then merged across several architectures. An example of a merged manifest can be seen here. Of course, those using pkgsend to create packages can simply insert the appropriate tags directly. For the curious, the code to merge packages can be seen here. Further work is needed to improve the performance of publishing the merged packages; allowing multiple repositories to share the filestore would help greatly here.



pageicon Tuesday Feb 12, 2008

Indiana Preview 2 - my new desktop

This weekend I decided to bite the bullet and convert my desktop to Indiana Preview 2. Since unlike most people at Sun my desktop machine also receives my email, and hosts both my home directory and calendar server, the switch-over needed some quiet concentration on my part to insure nothing important got left behind.


The installation of Preview 2 (now available here) went smoothly – not surprising, since I'd tested many trial builds on the same machine, a 2 x 2.8GHz Ultra 40. After installation completed and the machine rebooted, I created a second zpool with the two remaining drives; I use this for my home directory, mail spool, tunes and pkg server. This isolates me from any difficulties with the new installer or possible future upgrade problems. ZFS of course makes this all very easy:


: barts@cyber[227]; zfs list
NAME                              USED  AVAIL  REFER  MOUNTPOINT
rpool                            2.63G   224G  49.5K  /rpool
rpool@install                        0      -  49.5K  -
rpool/ROOT                       2.62G   224G    18K  none
rpool/ROOT@install                   0      -    18K  -
rpool/ROOT/preview2              2.62G   224G  2.09G  legacy
rpool/ROOT/preview2@install      66.8M      -  1.94G  -
rpool/ROOT/preview2/opt           483M   224G   483M  /opt
rpool/ROOT/preview2/opt@install    77K      -  3.61M  -
rpool/export                     2.44M   224G    19K  /export
rpool/export@install               15K      -    19K  -
rpool/export/home                2.41M   224G  2.39M  /export/home
rpool/export/home@install          19K      -    21K  -
zfs                               177G  51.8G    21K  /zfs
zfs/home                          133G  51.8G   133G  /export/home/cyber
zfs/local                         291M  51.8G   291M  /usr/local
zfs/mail                          110M  51.8G   110M  /var/mail
zfs/music                        43.4G  51.8G  43.3G  /zfs/music
zfs/music@2.1.2008               2.54M      -  42.3G  -
zfs/repo                           18K  51.8G    18K  /zfs/repo
: barts@cyber[228]; 

I then got to thinking about having a mirrored root pool; I hunted up one more 250GB drive, hot plugged it into the machine (love those SATA features) and used cfgadm -al and cfgadm -c to get Solaris to find the drive. Zpool attach took care of establishing the mirror; the mirror was resilvered in just a few minutes since ZFS knows what's data and what's empty space.


Now I needed dovecot, since I run an IMAP server to allow remote access of my mail. Off to dovecot.org for a tarball, download, configure and hmm – no C compiler. pkg search -r gcc told me that I needed SUNWgcc installed, so pkg install SUNWgcc grabbed the compiler, assembler and binutils. Cool. Run configure again and whoops – no headers! pkg search -r stdlib.h said I needed SUNWhea, so pkg install SUNWhea and I was compiling dovecot.... For a quick look at the packages available in Indiana so far browse over to http://pkg.opensolaris.org. I wrote this blog post using openoffice – which you'll find in a package called openoffice.

Indiana and IPS are usable, but we've still got a lot work to do:

  • The packages need to be re-factored into smaller pieces and renamed and tagged to better support minimization, searching, etc.

  • The command line pkg interface needs some attention...

  • and here's still much work to be done in constraints, simplifying pkg publishing and there are lots of bugs, tracebacks and other issues to address.

     

However, it's coming together – and being able to upgrade from preview2 from preview1 without running any postinstall scripts helps use feel better about the assertions that started the project....

pageicon Monday Nov 05, 2007

A programmer's ABCs

Several years ago, before blogging, I cons'd up a programmer's ABC for Stephen Hahn's first child, Benjamin.  I'd forgotten about this until Stephen mentioned last week and mailed me a link to the image; I'd lost the original.  It's a little SPARC-centric, but so was I at the time:


 

A Programmer's ABC



A is for algorithm, patented or not.

B is for break, to jump out of this rot.

C is for continue, to jump to the top of one's loops.

D is for default, the case that handles the oops.

E is for else, the predicate's inversion.

F is for for, of the loops the most popular version.

G is for goto, a jump oft considered dubious.

H is for hardware, for profits salubrious.

I is for if, a conditional statement.

J is for jmpl, an indirect jump causing performance abatement.

K is for thousands in powers of two.

L is for long, whose size recently grew.

M is for membar, whose use can confound.

N is for NOP, which in delay slots often is found.

O is for operator, whose overloading is oft unsupportable.

P is for pragma, with usage unportable.

Q is for quadword, the largest of all.

R is for return, when we make the stack not so tall.

S is for switch, a computed goto for which we all yearn.

T is for trap, from which we may never return.

U is for unsigned, to avoid two's complement extension.

V is for volatile, whose presence incites apprehension.

W is for WSTATE register, for window traps most topical.

X is for XOR, bitwise not logical.

Y is for Y register, deprecated for years.

Z is for zero, whose dereferencing everyone fears.



pageicon Wednesday Jul 25, 2007

Rethinking patching

As Stephen mentioned recently, several of us have been thinking about revising the way we manage software change on Solaris.  I've been particularly focused on the difficulties Sun and it's customers have with the patching process, and the kinds of changes we need to make as a result in our technology and development processes.

 Today, most customers don't run OpenSolaris; they run a supported version of Solaris such as Solaris 8, 9 or 10.  A supported release means that someone will answer the phone, and that patches for problems are available.

Patches are a separate software change control mechanism distinct from package versions in Solaris.  Each patch may affect portions of several packages; patches are intended to include all the files necessary to fix one or more problems, either directly or by specifying dependencies.  If a patch affects packages which are not installed on this system (typically because it has been minimized), those portions of the patch are not installed.  If the administrator later adds the missing package, he must remember (good luck) to re-apply the patches since the packaging code knows nothing of patches.

Customers are today free to install which ever patches they feel are appropriate for their environment, consistent with the built-in dependency requirements.  This customization is a technique I refer to as Dim Sum patching, and is a major cause of patching difficulties.  Many customers pick and choose amongst the thousands of patches available for Solaris 10, for example; this means that customers are often pioneering new configurations.  Note that each Solaris release consists of a single source base; all Solaris 10 updates, for example, are but snapshots of the same Solaris patch gate at different times.  As a result, the developers are working on a cumulative  set of all previous changes; when a new patch is created, the files in the patch not only contain the desired fix, but all previous fixes as well.  Thus, the software change is constructed as a linear stream of change, but customers installs selected binaries from the various builds via patches.

 

When I've discussed the hazards of  Dim Sum patching with customers, the reasons given are typically characterizable as :

 

 

  1. we don't need all those patches,  we don't have those drivers loaded
  2. we're reducing downtime by not installing so many patches
  3. the less change, the less risk.

To these, I reply:

  1. If you don't need those drivers, then remove them them w/ pkgrm rather than leaving them in an unpatched state awaiting the introduction of new hardware or software to expose problems.  Minimization, not spotty patching, is the answer.  This is akin to disposing of an unused car, rather than simply leaving it unmaintained.
  2. Today, you should be using Live Upgrade and patching the alternate boot image to reduce downtime.  This allows machines in production to be safely patched, and will not leave the system in an inconsistent or unbootable state in the case of power failure during patching operations.  In the future, the new packaging system will always patch a clone of the current system to avoid the potential for disaster in case of power failure.
  3. Our experience has been that customers running all of the changes in an update are generally far less likely to experience problems than those who select only the fixes and features that appeal to them, and hope that our QA processes found all hidden dependencies on previous changes.

For our new packaging system, there is a powerful incentive to eliminate Dim Sum patching:  since we wish to use a single version numbering space for any package, attempting to support fine-grain Dim Sum patching would require very small packages - affecting the performance of packaging operations, and significantly increasing the workload of OpenSolaris developers.   Instead, we can set package boundaries according to what makes sense for minimization purposes. 

This implies that future (post Solaris 11) patches will be completely cumulative (aside from some exceptions for urgent security fixes), at least for the core OS.  Your system will be able to determine what is needed to bring the installed software up to the desired revision level automatically; needing to pick and choose patches will be a thing of the past. 


 

No Dim Sum Patching!


pageicon Friday Jun 01, 2007

Friday afternoon SPARKFUN

Mike Pogue mentioned he was considering using a USB-connected PIC controller to drive some stepper motors from a PC.  He'd ordered out a USB Bit Wacker , which plugs right into a USB port and looks like a serial port to the host OS, and gives you 14 ports that can either be digital in, out or analog in.  The host sends simple ASCII commands down, and the Bit Whacker sends back status/port data. Cool!  Earlier this afternoon, he brought it in and we plugged it into my Tyan whitebox running Nevada 64a using a mini-USB cable.  I was running a tail -f on /var/adm/messages, and saw:

Jun  1 15:29:34 cyber usba: [ID 912658 kern.info] USB 2.0 device (usb4d8,a) operating at full speed (USB 1.x) on USB 1.10 root hub: communications@1, usb_mid6 at bus address 3
Jun  1 15:29:34 cyber usba: [ID 349649 kern.info]       Microchip Technology Inc. CDC RS-232 Emulation Demo
Jun  1 15:29:34 cyber genunix: [ID 936769 kern.info] usb_mid6 is /pci@0,0/pci1022,7460@6/pci1022,7464@0/communications@1
Jun  1 15:29:34 cyber genunix: [ID 408114 kern.info] /pci@0,0/pci1022,7460@6/pci1022,7464@0/communications@1 (usb_mid6) online
Jun  1 15:29:34 cyber usba: [ID 349649 kern.info] usba: no driver found for interface 1 (nodename: 'data') of Microchip Technology Inc. CDC RS-232 Emulation Demo
Jun  1 15:29:34 cyber usba: [ID 912658 kern.info] USB 2.0 interface (usbif4d8,a.config1.0) operating at full speed (USB 1.x) on USB 1.10 root hub: modem@0, usbsacm3 at bus address 3
Jun  1 15:29:34 cyber usba: [ID 349649 kern.info]       Microchip Technology Inc. CDC RS-232 Emulation Demo
Jun  1 15:29:34 cyber genunix: [ID 936769 kern.info] usbsacm3 is /pci@0,0/pci1022,7460@6/pci1022,7464@0/communications@1/modem@0
Jun  1 15:29:34 cyber genunix: [ID 408114 kern.info] /pci@0,0/pci1022,7460@6/pci1022,7464@0/communications@1/modem@0 (usbsacm3) online

Well, that looked promising!  Sure enough,

 % ls -l /dev/term
 total 6
 lrwxrwxrwx   1 root     root          78 Jun  1 15:29 0 ->
../../devices/pci@0,0/pci1022,7460@6/pci1022,7464@0/communications@1/modem@0:0
 lrwxrwxrwx   1 root     root          29 Dec 21 14:43 a -> ../../devices/isa/asy@1,3f8:a
 lrwxrwxrwx   1 root     root          29 Dec 21 14:43 b -> ../../devices/isa/asy@1,2f8:b

Ok.  I added a quick entry to /etc/remote:

sparkfun:\
        :dv=/dev/term/0:br#38400:el=^C^S^Q^U^D:ie=%$:oe=^D:
# tip sparkfun
connected
UBW FW D Version 1.4.1

At this point I used various commands to read the inputs/buttons, flash the LED, etc.  If needed, one can also write and download alternate firmware directly to the on-board PIC18F2455; there's a built-in bootloader that cannot be flashed over, so a PIC programmer is never needed.  This unit seems ideally suited to wiring up coffee makers, sprinkler systems, etc, so they can be connected to a host computer.  Looks like I could write some firmware to directly drive servo motors using the Bit Whacker to count encoder  pulses and close the velocity loop; the host computer would read say 100Hz update of position and update the desired velocity of the motor... and w/o any firmware work at all, it seems directly applicable to our next Burning Man project, which will have an OpenSolaris computer controlling things for the first time on the playa AFAIK... more about that later.

Here's a picture of the device:

Bit Whacker next to a quarter


(The picture was taken w/ my Razr 3i and uploaded using Solaris as well.... )

 

 

 

pageicon Wednesday Jan 17, 2007

New home server

Building a new household server....


Like a lot of families these days, our household IT infrastructure has had to adapt as we all became more and more fond of computers for work, school and recreation.  With digital photograpy, ripping hundreds of CDs, describing our various activities and travels on web pages,  two teenagers and the heavy use of email, and the need to provide stable storage for homework and digital art, we've been playing catchup for a while.  This led us directly to designing and building a new server to handle storage of all the digital media, web-serving and email.  At the same time, I was tired of the whine from the surplus X1 rack-mount server I had stuffed in the closet, and decided to merge my home desktop and server together to reduce power consumption.  With some thinking we arrived at the following hardware design:

  • Tyan 2865 motherboard - 4 SATA ports that worked w/ Solaris in compatibility mode, sound and an on-board nvidia gigabit ethernet chip, both supported by Solaris as these were the motherboards in the original Ultra 20s.
  • 2 x 2.6 GHZ AMD CPU; this was rather more CPU than I originally planned, but sometimes things turn up...
  • 1 GB ECC RAM, later upgraded to 2 GB.
  • 4 x 500 GB Seagate drives for redundant data storage, plus a leftover 40GB IDE drive for root and a DVD-RW drive for movies and burning DVDs.
  • NVIdia 6600 GT video card - available for a reasonable price and with good 3D graphics for playing with those video game ports to Solaris.  With the bundled NVidia OpenGL drivers in 55, the cool screen savers work out of the box, too!
  • 450 W Antec power supply
  • inexpensive second NIC for our external interface; I happened to have an $7 Gb Airlink card which uses the Realtek part; this works out of the box as well.
  • A small case - I used an Antec Lanboy as they're small and light.  In retrospect, a larger case such as an Antec P180 would have been a better choice.  I do like the disk cooling on this design, though - the 120 MM (blue, heh) fan keeps the drive temps very moderate.
Well, this went together pretty easily (especially since I had my son do all the work...).  The software selection and configuration was actually considerably more effort than picking the hardware components and building the system.  We're now running:
  • Solaris Nevada build 55
  • the bundled Bind 9 for handling internal, DMZ and external views of our various domains.
  • the bundled Apache 2.2 handles our various web pages and our 3 virtual domains.
  • the bundled dhcp server handles DHCP service.  I configured this with dhcpmgr; since all the other machines in the house use DHCP there's little admin work to do anywhere else.  This is a real boon during the kid's LAN parties; things just work.
  • The bundled Samba server provides NAS for the various Microsoft and Mac clients in the house.  With the latest OS-X builds, NFS works well enough to use; we may switch.  I use swat to configure Samba.
  • Printing is provided by the stock lp and ipp-listeners aided by Samba for the kid's microsoft game platforms.  I configured this with printmgr.  We print to an HP1200 that has worked flawlessly for years w/ just a couple of toner changes.
  • the bundled IPFilter provides carefully controlled port filtering on both the external and internal network interfaces.
  • Dovecot (compiled from source) provides IMAP service for both our LAN and (with SSL) over the Internet.  This service is under SMF control so it starts automatically when the machine boots. I chose Dovecot since it also cooperates with Postfix to handle SMTP authentication.
  • Postfix (compiled from source + Berkeley DBM libraries).  This provides SMTP service with TLS and AUTH needed to control relaying by family members using external connections.  The anti-SPAM features are also excellent and pretty easy to setup given the numerous how-to guides available on the web.  With Postfix I feel actually in control; I always had to rely on John Beck to customize my sendmail configuration.
  • The unbundled SUNcry{r} encryption packages were needed to get the bundled OpenSSL libraries working correctly w/ TLS and Thunderbird. Darren Moffat promises to have the high grade support bundled into OpenSolaris soon.

So far things are working very well.  The 4x500 GB drives are in a RAID-Z configuration with ZFS; we can sustain 120 MB/sec or so reading or writing to the 20 odd filesystems configured on the single pool.  Samba works pretty well; we managed to feed 10 different files to 10 different clients at nearly 100Mb/sec apiece during one of the kids' Lan parties.  Dovecot in particular seems very fast on top of ZFS, and other than a glitch with my forgetting to set the maximum user mailbox size Postfix has been trouble free.  I use this machine as my desktop as well in the evenings.

 
What are we still thinking about changing? Well:

  • My son wants a Ventrilo audio server for his Guild Wars games.  Another manifest, some more testing.
  • I may add greylisting to our SPAM prevention techniques; this will require a bit more configuration work. We reject perhaps 700 SPAM emails per day for our small number of email users; we each still get a handleful each day.
  • I'm going to add a Slim Devices Squeezebox to our stereo system; we outgrew the 200 CD changer some time ago.  That's another server to run, but we get access to the 65+ GB of music anywhere in the house.
  • I'm thinking about adding a separate Ethernet interface to plug in the wireless router  This will let me experiment with more secure ways to configure wireless services.
  • When ZFS root and boot is supported , I'll configure another pair of drives for a mirrored root pool using a SI3114 card. Right now we just back up the small root partition to our ZFS pool.
pageicon Wednesday Nov 16, 2005

Some thoughts on ZFS's impact on Solaris

So ZFS is now available and we've put together lots of blogs and demos to show everyone the neat kinds of things ZFS supports - snapshots, writable snapshots (clones), simple disk management, protection against hardware and firmware errors, etc. Rather than discuss some other neat feature of ZFS or do some extreme performance demos, I thought it might be interesting to mull over some of the possible implications of this new technology on the rest of Solaris and other applications. We don't yet support booting ZFS quite yet, so some of the ideas below will have to wait a bit for implementation - but it's certainly time to start thinking about them.

First of all, in years past we've been moving more and more to the "one giant filesystem" model for installing systems; it's just been to much of a PITA to anticipate how much space we might need for root, var, opt, usr, etc.... but now with ZFS divorcing space allocation from filesystem boundaries, it's easy to use separate filesystems to make administration easier. It's ok to use lots of filesystems - they're essentially free, and since snapshots are at the filesystem level, filesystems also the "undo" boundary. Clearly, we don't want to delivery of mail or the growth of logs from inhibiting one from rolling back a ill-considered change to other parts of the system, so separating directories that contain files that are modified "automatically" from those that containing binaries or configuration files seems like a good idea.

Once this separation occurs, using zfs rollback to undo the effects of changes such as a ill-considered patch or administrative actions becomes simple. It also appears that Live Upgrade could be a lot easier with ZFS - just snapshot and clone the filesystems being upgraded. Perhaps we should take snapshots always before adding a patch... hmmm, given that we have ZFS, how would we redesign patching if we could always use ZFS for root filesystems?

Another area of possible change is initial installation. Right now we use bzip2 compression on on each package on the install media to compress our packages into the smallest possible footprint; this has been needed in order to fit localization information onto the first CD. Since we access the CD/DVD as a filesystem and then uncompress the package as it's being installed, we often have trouble reading data fast enough from the install media to keep the device spun up at anywhere near full speed, esp. for small packages. With zfs compression, we could store a ZFS filesystem image on the CD/DVD and stream that onto a ZFS filesystem on the hard disk in one (fast)shot, and then install packages from the hard disk as needed. Afterwards the extra packages could be deleted, or left there to facilitate later installation of additional components; disk space is amazingly cheap these days.

Another area of interest is providing backups. ZFS makes it very easy to determine the differences between snapshots; this makes doing incremental updates of even very large slightly modified filesystems inexpensive in terms of disk and CPU utilization. It sure would be nice if my Ferrari laptop could just upload it's changes since the last backup automatically when I plugged it into the building nets; yes, I could do that today, but without the smarts of ZFS, traversing a 50GB data set to see what has changed just isn't very practical on the slow laptop drives.

There are more ideas to consider, but it's clear that today is just the beginning of the ZFS revolution; ZFS will change the way Solaris works.

pageicon Friday Aug 05, 2005

libMicro opensourced

I've been busy with libMicro lately. LibMicro is a set of portable OS (kernel/library) benchmarks developed as part of the Solaris 10 performance effort. I've been working on a set of changes for a while to fix up some of the statistics and improve repeatability and prepare for open sourcing under the CDDL license.

You can find libMicro in the performance community at opensolaris.org.

There's still a lot to do on libMicro, starting with documentation on how to add new benchmarks, and how the whole thing works... but that will have to wait until I'm back from a brief vacation.

pageicon Wednesday Jun 15, 2005

Most common recent benchmarking mistake

Most frequently asked performance question of late: Why is this trivial piece of code slower on Solaris than on Linux with both OSes running on the same Opteron box?

Often this is because the default compilation mode using /usr/sfw/bin/gcc (or Studio, for that matter) on Solaris is always a portable binary - and 64 bit Solaris x86 isn't seen as a different architecture, since all the 32 bit programs still work just fine. So

gcc -o foo foo.c 

produces a 32 bit binary on Solaris amd64, and a 64 bit binary when compiled on a 64 bit Linux. No wonder there's a performance difference! To get 64 bit compilation using gcc, simply use the -m64 flag:

gcc -o foo foo.c -m64

When in doubt, specify what you want - you'll more likely get what you need.

pageicon Tuesday Jun 14, 2005

Doing the Jitter Bug


One of the nice things about finally having OpenSolaris properly launched is that I can share Solaris source code with everyone when talking about Solaris. Since Solaris is the code, not being able to show people the code has been like having a discussion about various aspects of the flavors of wine without actually tasting any - possible perhaps, but a rather dry (heh) and uninteresting narrative....


I was happily (as I don't get to write nearly enough code) whacking away on a revised version of malloc to fix some longstanding performance and scalabilty issues when one of our marketing folks darted in asking about telcos, real time, and latency bubbles [I'll write about our malloc findings later on]. After getting the relevant email trail and a test program from the customer contact, it turned out that indeed the customer was seeing a real problem - 20 millisecond timers was apparently firing up to 60 microseconds late occasionally on his Opteron box, but no problem was seen on SPARC machines. What's weirder, this problem appeared sometimes, but not always - any given instance of the test program might display the problem, or it might not.  Hmmm.   The customer's test program would carefully store information about intervals between timer firings that exceeded 40 usecs +  20 msecs and report them every 20 seconds or so... and when the problem appeared, it was happening once a second...

Like most performance problems I run into, this looked like a job for DTrace.  As Bryan has pointed out many times, DTrace is best at quickly evaluating various hypotheses...so I needed some testable ideas as to what might be going on. Well, what could make the program's timers fire late?

  • Other threads or interrupts - perhaps these are interfering with our program somehow?
  • Other timers firing on this cpu.  Maybe some other system timers are causing problems on CPU 1?

Well, hypothesis testing time.  First of all, the test program clearly created a processor set containing CPU 1 and placed itself as the only thread in that set.  It even disabled interrupts on that CPU.  A quick check with the following script:

on-cpu
/curcpu->cpu_id == 1/
{
trace(curthread);
}
verified that there was only a single thread (the expected one) running on CPU 1.  Hmmm.  interstat verified that  CPU 1 wasn't getting any device interrupts, either.  So what's happening here? I had some experience with  time keeping on Solaris, but not enough to know offhand where to look for this problem.  Confused?  DTrace will help!  A quick change to the above script lets us use the enqueue probe in the sched provider to discover exactly how our process is made runnable when the timer expires:

sched:::enqueue
/curcpu->cpu_id == 1/
{
stack(10);
}
The output is pretty straightforward - lots of entries that look like this:
 1   2561                setbackdq:enqueue
              TS`ts_setrun+0xf1
              genunix`cv_unsleep+0x6e
              genunix`setrun_locked+0x6d
             genunix`eat_signal+0x89
              genunix`sigtoproc+0x420
              genunix`sigaddqa+0x4a
              genunix`timer_fire+0xb9
              genunix`clock_highres_fire+0x3f
              genunix`cyclic_softint+0xb2
              unix`cbe_low_level+0x14
This gave me a place to get started.... timer_fire is clearly important, so let's instrument that and see when it fires:

#pragma D option quiet
BEGIN
{
start = timestamp;
last = timestamp;
}

fbt::timer_fire:entry
{
printf("%d: interval is %d usecs late\n",
(timestamp - start)/1000,
(int) (((timestamp - last) - 20000000))/1000);
last = timestamp;
}

This script just prints a line with a timestamp in microseconds since we started, with the number of microseconds late (more than the 20 msec) interval that was desired.

Interestingly enough, the output looks like this:
13822657: interval is 0 usecs late
13842656: interval is 0 usecs late
13862657: interval is 0 usecs late
13882657: interval is 0 usecs late
13902657: interval is 0 usecs late
13922599: interval is -57 usecs late
13942657: interval is 57 usecs late
13962657: interval is 0 usecs late
13982657: interval is 0 usecs late
14002657: interval is 0 usecs late

Whoops!  Turns out our test program was leading us down the garden path; the problem is that one timer fire appears to be early, and the next one makes up the difference; the net error is 0 over time. Reading the comments in cyclic.c, this isn't that surprising but does give us a place to start looking. At this point, I noticed that I can suspend the test program and still get the same intermittant late behavior. How often are we late?
#pragma D option quiet
BEGIN
{
start = timestamp;
last = timestamp;
}


fbt::timer_fire:entry
/(int) (((timestamp - last) - 20000000))/1000 != 0/
{
printf("%d: interval is %d usecs late\n",
(timestamp - start)/1000,
(int) (((timestamp - last) - 20000000))/1000);
}
fbt::timer_fire:entry
{
last = timestamp;
}
5974796: interval is -57 usecs late
5994854: interval is 56 usecs late
6974796: interval is -57 usecs late
6994854: interval is 56 usecs late

Interesting - once a second.

Reading the cyclic.c source code we see that cyclic_fire handles all the interesting timer stuff.  Let's trace cyclic_fire in addition to timer_fire:
2248600: interval is 0 usecs late
2263098: cyclic_fire
2268539: cyclic_fire
2268543: interval is -57 usecs late
2288597: cyclic_fire
2288600: interval is 57 usecs late

So did we misprogram the clock, or is something else going on? On line 929 ofcyclic.c, we see that we reprogram the interrupt source through a function pointer. Those are a pain. Dropping into mdb:

# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs ip sctp usba
s1394 nca fctl lofs nfs audiosup random sppp ipc ptm logindmux ]
> ::cpuinfo
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 fffffffffbc1ede0 1b 3 0 59 no no t-0 ffffffff8584e220 mdb
1 ffffffff825cb800 b 0 0 -1 no no t-516001 fffffe8000275c80 (idle)
> ffffffff825cb800::print cpu_t cpu_cyclic
cpu_cyclic = 0xffffffff80f45600
> 0xffffffff80f45600::print cyc_cpu_t cyp_backend
cyp_backend = 0xffffffff803d65a8
> 0xffffffff803d65a8::print cyc_backend_t
{
cyb_configure = cbe_configure
cyb_unconfigure = 0
cyb_enable = cbe_enable
cyb_disable = cbe_disable
cyb_reprogram = cbe_reprogram
cyb_softint = cbe_softint
cyb_set_level = cbe_set_level
cyb_restore_level = cbe_restore_level
cyb_xcall = cbe_xcall
cyb_suspend = 0
cyb_resume = 0
cyb_arg = 0xffffffff825cb800
}
>

So we need to trace calls to cbe_reprogram:
#pragma D option quiet
BEGIN
{
start = timestamp;
last = timestamp;
}


cbe_reprogram:entry
/curcpu->cpu_id == 1/
{
firetime = (arg1 - start)/1000;
printf("%d: cbe_reprogram: set timer to fire at %d\n",
(timestamp - start)/1000, firetime);
}

cyclic_fire:entry
/curcpu->cpu_id == 1/
{
now = (timestamp - start)/1000;
printf("%d: cyclic_fire: %d usecs late\n",
now, now - firetime );
}

fbt::timer_fire:entry
/curcpu->cpu_id == 1/
{
printf("%d: interval is %d usecs late\n",
(timestamp - start)/1000,
(int) (((timestamp - last) - 20000000))/1000);
last = timestamp;
}


3839612: cyclic_fire: 80 usecs late
3839614: cbe_reprogram: set timer to fire at 3859532
3839615: interval is 0 usecs late
3859613: cyclic_fire: 81 usecs late
3859614: cbe_reprogram: set timer to fire at 3879532
3859616: interval is 0 usecs late
3879612: cyclic_fire: 80 usecs late
3879614: cbe_reprogram: set timer to fire at 3894055
3879615: interval is 0 usecs late
3894114: cyclic_fire: 59 usecs late
3894115: cbe_reprogram: set timer to fire at 3899532
3899555: cyclic_fire: 23 usecs late
3899556: cbe_reprogram: set timer to fire at 3919532
3899558: interval is -57 usecs late
3919613: cyclic_fire: 81 usecs late
3919614: cbe_reprogram: set timer to fire at 3939532
3919616: interval is 57 usecs late
3939612: cyclic_fire: 80 usecs late
3939614: cbe_reprogram: set timer to fire at 3959532
3939616: interval is 0 usecs late

Ahaha!  Turns out  we're late all of the time; normally 80 usecs when there's a full 20 msec between cyclic firings, but less when the deadman cyclic has fired recently.  Computing error as a percentage of sleep time, we see that our error is a relatively constant .4% of the interval slept; it's the varying time bewteen firings (due to multiple cyclics with different periods) that's responsible for the jitter.  Ok, so what is  cbe_reprogram doing? Rats, calling through another function pointer.  Ok, this time we'll use DTrace to figure out where we're going rather than mdb:


cbe_reprogram:entry
/curcpu->cpu_id == 1/
{
self->trace = 1;
}


fbt:::entry,
fbt:::return
/self->trace/
{
}
cbe_reprogram:return
/curcpu->cpu_id == 1/
{
self->trace = 0;
}

1 -> cbe_reprogram
1 -> apic_timer_reprogram
1 -> gethrtime
1 -> tsc_gethrtime
1 <- tsc_gethrtime
1 <- gethrtime
1 <- apic_timer_reprogram
1 <- cbe_reprogram

Ok, now let's look at the source for apic_timer_reprogram

ticks = (uint_t)((time - now) / apic_nsec_per_tick);

This is introducing a .4 % error... I wonder what the value of apic_nsecs_per_tick really is?

# echo 'apic_nsec_per_tick/D' | mdb -k 
apic_nsec_per_tick:
apic_nsec_per_tick: 10

That's it. We're suffering from an imprecise conversion between nsecs and apic ticks; the appropriate value cannot be represented as an integer with this conversion accurately enough to keep excessive jitter from appearing. Since the deadman cyclic runs every second, if we want to keep worst case jitter due to quantization errors down to 1 usec we'll need 1 ppm resolution on the conversion factor.

This also explained why sometimes the bug isn't visible - if the a 20 msec cyclic always occurs just before the deadman cyclic fires, there is almost no change in the amount of time between cyclic firings.

At this point I filed a bug, 6266961.  The fix involves rewriting the conversion to use a different factor and redoing the calibration of the apic timer to use the kernel's idea of hi-res time rather than the pit timer.
/*
* deal with resolution issues by converting between ticks and
* nanoseconds using 64 bit math.
* use multiplication cause it's faster; divide by power of 2
* for speed.
*/

static int64_t apic_ticks_per_128Mnsecs; /* ticks/128*1024*1024 nsecs */
#define APIC_TICKS_TO_NSECS(ticks) (((long long)(ticks) * 134217728LL)/\
apic_ticks_per_128Mnsecs)
#define APIC_NSECS_TO_TICKS(nsecs) ((long long)(nsecs) * \
apic_ticks_per_128Mnsecs / 134217728LL)

After I get this finished up, it's back to malloc... more about that later on.


Technorati Tag:
Technorati Tag:
Technorati Tag:
Technorati Tag:
pageicon Wednesday Apr 27, 2005

Putting developer-defined DTrace probe points in an application

Well, it's been a while since blogging - time to post another example. As part of the Solaris 10 developer BOF at Usenix a couple of weeks ago, Liane asked me to put together a quick DTrace demo, so I cons'ed up a quick example of how to put static probes into an application. This technique is simple and can avoid the need for debug flags, conditional logging, etc in the application.

In order to have something to work with, I wrote a quick version of wc that I instrumented with two dtrace probe points.  These two probe points are defined in a file called simple_probes.d:

provider simple {
probe saw__word(int);
probe saw__line(int);
};

Note that dtrace treats double underscores specially in such definitions; they're converted to dash "-" characters in the finished probe names. Here's the source for the wc program, which I called simple.c:


#include <stdio.h>
#include <sys/sdt.h>

/*
 * simple example of defining sdt probes in a trivial program
 * Sdt probes can often completely replace debug levels, optional
 * log files, etc, in daemons... you can leverage the power of dtrace
 * to make your server/application more readily debuggable.
 */

int
main(int argc, char *argv[])
{
        int i;
        int characters, lines, words;
        characters = lines = words = 0;


        while (1) {

                if ((i = getchar()) == EOF) {
                        /*
                         * here we specify the name of the module,
                         * the name of the probe (modulo mapping
                         * '__' to '-') and pass in the parameter to be
                         * traced which in this case is the number of
                         * lines seen so far.
                         */
                        DTRACE_PROBE1(simple, saw__line, lines);
                        break;
                }

                characters++;

                if (i == '\n') {
                        lines++;
                        DTRACE_PROBE1(simple, saw__line, lines);
                        continue;
                }

                if (isblank(i)) /* eating white space */
                        continue;

                words++; /* in a word now */

                while (1) {

                        if ((i = getchar()) == EOF) {
                                DTRACE_PROBE1(simple, saw__word, words);
                                break;
                        }

                        characters++;

                        if (i == '\n') { /* EOL? ends word implicitly */
                                DTRACE_PROBE1(simple, saw__word, words);
                                lines++;
                                DTRACE_PROBE1(simple, saw__line, lines);
                                break;
                        }

                        if (isblank(i)) { /* white space ends words too */
                                DTRACE_PROBE1(simple, saw__word, words);
                                break;
                        }
                }
        }

        printf("%8d %8d %8d\n", lines, words, characters);

        exit(0);
}

Now we need a makefile to process the code and build the executable:

CC=gcc
CFLAGS=-m64
DTRACE=dtrace

simple: simple_probes.o simple.o
$(CC) -o simple -m64 simple_probes.o simple.o

simple_probes.o: simple.o simple_probes.d
$(DTRACE) -G -64 -s simple_probes.d simple.o

Note that we use dtrace to build a .o file that defines our new probes,
and we link this into our (64 bit amd) application:

barts@cyber:/home/barts/demos 144% /usr/ccs/bin/make
gcc -m64  -c  simple.c
dtrace -G -64 -s simple_probes.d simple.o
gcc -o simple -m64 simple_probes.o simple.o
barts@cyber:/home/barts/demos 145%

We can now take a look at the probes we've added into our program with
dtrace:

# dtrace -P simple'$target' -c ./simple -l
   ID   PROVIDER            MODULE                          FUNCTION NAME
48630 simple19946            simple                              main saw-line
48631 simple19946            simple                              main saw-word
#

And we can now demonstrate the probes firing as follows.  First, running normally:

# echo "Hello World" | ./simple
       1        2       12
#

and now running with probes enabled from dtrace:

# echo "Hello World" | dtrace -P simple'$target' -c ./simple

dtrace: description 'simple$target' matched 2 probes
       1        2       12
CPU     ID                    FUNCTION:NAME
  0  48631                    main:saw-word
  0  48631                    main:saw-word
  0  48630                    main:saw-line
  0  48630                    main:saw-line
dtrace: pid 19917 has exited

Simple enough really, and this works with Forte, gcc, and Studio compilers for all Solaris platforms...

Hmmm... perhaps I should change the nscd to use this rather than generating log
files....



pageicon Wednesday Sep 15, 2004

Burning Man - what a foreign place!

I've been back from my first trip to Burning Man for more than a week, and I'm still finding it coming up in conversations and mid-meeting daydreams/doodles. Art show, rave fest, communal desert camping experience - I'm still not sure exactly what it is... I just get the feeling I'll have forgotten all about the heat, dust and porta-potties come end of August, and I'll go back for another visit. The folks are friendly there.

For some images of this year's events, try here, here and here. I guess you really need a nice techno dub to get the full flavor (not to mention a snootful of playa dust :-), but I'm not really sure how to do either in roller :-).

Burning Man is worth a visit.

- Bart
pageicon Friday Jul 30, 2004

Photos of OSCON SOlaris 10 BOF from jimgris

Look here for some candid shots of some of us kernel types talking at the OSCON Solaris 10 BOF. I was sure whacked by this point if the pictures are any indication... We'd spent a while sitting in the sun during lunch and then the airconditioning couldn't keep up in our 16th floor meeting room - it was more of a group sauna than . Kudos to everyone for dealing with the heat...

Back from OSCON - misc. thoughts



Spent the last few days at OSCON attending keynotes, presentations, BOFs and meetings.
Thanks to the Apple folks providing a lounge w/ good old wired internet service, I was able to stay sort of caught up w/ the usual firehose of email back at Sun using my Tecra 9100 laptop running Solaris 10 & a vpn tunnel running via IPSec through NAT. Such a nice feature - another S10 special :-)....

Now I really want to get the Cisco Aironet card that's built into the Tecra working w/ Solaris so I can be like the cool kids and surf related sites while attending talks... Also, the hotel several of us stayed in had only wireless in our rooms so I couldn't do my email at night - this was the Embassy Suites downtown, formerly the Multnomah. Cool place; I like buildings that have a history.

Adam and Eric already talked about the Solaris 10 BOF; they didn't mention the last folks left past 11:30, long after the beer, tee-shirts and pizza ran out. I guess there really wasn't any controversy - just questions about the license OpenSolaris will have... we don't know the answer to that one, either, until we see the white smoke from the execs :-)
We had a nice walk back to the hotel along the river - Portland is a nice place for a conference. I went to Powell's Books; what a massive place. Have to get back there... maybe after the trip to Lopez, WA. We ate at McCormick & Schmick's Seafood Restaurant (twice, the first night amazingly good (just at closing time), the second time merely good (place was packed). Also ate at Oba's; my cumin-coriander was tasty but not stellar and not nearly as hot as described... the seared rare ahi appetizers were primo though :-).

Used the MAX line to get from the airport to our hotel - very handy, and a bargin at $1.60 :-). Figure out the schedules, maps, etc from trimet.org.
pageicon Tuesday Jul 27, 2004

Off to Portland for OSCON



Several Sun kernel people will be at OSCON this week, learning about some of the open source communities and talking to people about open sourcing Solaris. We're sponsoring a BOF on Thursday night to talk about Solaris 10 and Open Source. If you're there, stop by and talk to Andy Tucker, Adam Leventhal, Eric Scrock or me... as the BOF description says:

There will be plenty of food, T-shirts, and demos. And perhaps some controversy, too.

« March 2009
SunMonTueWedThuFriSat
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
    
       
Today

Feeds

Search this blog

Links

Weblog menu

Today's referrers

Today's Page Hits: 180