Reflections on OS integration Eric Schrock's Weblog
Musings about Fishworks, Operating Systems, and the software that runs on them.

Friday Jul 30, 2004

As you may have noticed from Adam's blog, our time at OSCON was a rousing success. Unfortunately, I don't have enough time to write up a real post, since I'm on vacation for the next few days. Adam summed things up pretty well; the two points I'd reiterate are:

  1. We are eager to learn how to do open Solaris right.

    Sun has a lot of experience with open source projects, with varying degrees of success. Our meeting with open source leaders was extremely informative; I myself never realized how difficult it is to build a developer community that really works. We're not just throwing source over the wall as a PR stunt or to get free labor; we're doing it (among other reasons) to build a thriving community centered around Solaris. And we need you to help us get it right.

  2. Solaris 10 technology sells itself.

    Before our BOF, most people we met were skeptical of Solaris. Because we're a proprietary UNIX, we've gained a reputation for being an old dinosaur: Linux is fast and new and evolving, Solaris is slow and old and stagnant. This couldn't be further from the truth, and it doesn't take a marketing campaign to convince the world otherwise. Once people see DTrace, Zones, Service Management Facility, Predictive Self Healing, ZFS, and all the other great features in Solaris 10, there's really no question that Solaris is alive and well. Whether you are an administrator or a developer, there will be something in Solaris that will blow you away. If you haven't seen Solaris 10 in action, get your Solaris Express today and spread the word.

Tuesday Jul 27, 2004

Sun is sending a contingent to the O'Reilly Open Source Convention in Portland this week. In a last-minute change of schedule, I will be attending (even though my name is not in the official BOF description), along with fellow kernel engineers Bart, Andy, and Adam. We will be learning about open source development and discussing Solaris. There will be a BOF Thursday night for all to attend. Come learn about Solaris 10 and open source Solaris, or at least show up for the free food and T-shirts.

Saturday Jul 24, 2004

In build 60 (Beta 5 or SX 7/04), I fixed a long-standing Solaris bug: mount points could not contain spaces. We would happily mount the filesystem, but then all consumers of /etc/mnttab would fail. This resulted in sad situations like:

# df -h
Filesystem             size   used  avail capacity  Mounted on
/dev/dsk/c0d0s0         36G    13G    22G    38%    /
/devices                 0K     0K     0K     0%    /devices
/dev/dsk/c0d0p0:boot    11M   2.3M   8.4M    22%    /boot
/proc                    0K     0K     0K     0%    /proc
mnttab                   0K     0K     0K     0%    /etc/mnttab
fd                       0K     0K     0K     0%    /dev/fd
swap                  1002M    24K  1002M     1%    /var/run
swap                  1003M   1.3M  1002M     1%    /tmp
# mount -F lofs /export/space\ dir /mnt/space\ mnt
/export/space dir       /mnt/space mnt  lofs    dev=1980000     1090718041
# df -h
df: a line in /etc/mnttab has too many fields
#

Luckily you could unmount the filesystem, but it was quite annoying to say the least. The resulting fix was really an exploration into bad interface design.

/etc/mnttab

This file has been around since the early days of Unix (at least as far back as SVR3). Each line is a whitespace-delimited set of fields, including special device, mount point, filesystem type, mount options, and mount time (see mnttab(4) for more information). Historically, this was a plain text file. This meant that the user programs mount(1M) and umount(1M) were responsible for making sure its contents were kept up to date. This could be very problematic: imagine what would happen if the program died partway through adding an entry, or root accidentally removed an entry without actually unmounting the filesystem. Once the contents were corrupted, the admin usually had to resort to rebooting, rather than trying to guess what the proper contents should be. Not to mention that it made mounting filesystems from within the kernel unnecessarily complicated.

In Solaris 8, we solved part of the problem by creating the mntfs pseudo filesystem. From this point onward, /etc/mnttab was no longer a regular text file, but a mounted filesystem. The contents are generated on-the-fly from the kernel data structures. This means that the contents are always in sync with the kernel[1], and that the user can't accidentally change the contents. However, we still had the problem that the mount points could not contain spaces, because space was a delimiter with special meaning.
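To make the delimiter problem concrete, here is a minimal sketch of the kind of whitespace splitting a naive /etc/mnttab consumer might do. The helper name count_fields is hypothetical and this is not the actual Solaris parsing code; it only illustrates why a space inside a path inflates the field count.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Count the whitespace-delimited fields in one line of text.
 * Hypothetical helper for illustration only: any run of blanks or
 * tabs ends a field, so a space inside a path starts a new one.
 */
static int
count_fields(const char *line)
{
	int nfields = 0;
	int in_field = 0;

	for (; *line != '\0'; line++) {
		if (isspace((unsigned char)*line)) {
			in_field = 0;
		} else if (!in_field) {
			in_field = 1;
			nfields++;
		}
	}
	return (nfields);
}
```

A well-formed entry yields five fields, while the lofs entry from the example above yields seven, which is the sort of mismatch behind the "too many fields" complaint from df.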

getmntent() and friends

On top of this broken interface, a C API was developed that had even worse problems. Consider getmntent(3c):

int getmntent(FILE *fp, struct mnttab *mp);

There are several problems with this interface:

  1. The user is responsible for opening and closing the file

    There is only one mount state for the kernel; why should the user have to know that /etc/mnttab is the place where the entries are stored?

  2. The first parameter is a FILE *

    If you're developing a system interface, you should not enforce using the C stdio library. Every other system API takes a normal file descriptor instead.

  3. The memory is allocated by the function on demand

    This causes all sorts of problems, including making multithreaded programming difficult, and preventing the user from controlling the size of the buffer used to read in the data.

  4. There is no relationship between the memory and the open file

    Because of this, a lazy programmer can close the file after the last call to getmntent() while still using the memory, so it must be kept around indefinitely.

By now, it should be obvious that this was an ill-conceived API built on top of a broken interface. Off the top of my head, if I were to re-design these interfaces I would come up with something more like:

mnttab_t *mnttab_init(void);
int mnttab_get(mnttab_t *mnttab, struct mntent *ent, void *scratch, size_t scratchlen);
void mnttab_fini(mnttab_t *mnttab);
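As a rough sketch of how prototypes like these could behave, the following fills them in against a small in-memory table. Everything beyond the three prototypes is an assumption made for illustration: the fields of struct mntent, the fake_mounts table standing in for kernel state, and the return-value convention.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumed entry layout; the original only names the type. */
struct mntent {
	const char *mnt_special;
	const char *mnt_mountp;
	const char *mnt_fstype;
};

/* Stand-in for kernel mount state (hypothetical data). */
static const char *const fake_mounts[][3] = {
	{ "/dev/dsk/c0d0s0", "/", "ufs" },
	{ "/export/space dir", "/mnt/space mnt", "lofs" },
};

typedef struct mnttab {
	size_t mt_next;		/* index of the next entry to return */
} mnttab_t;

mnttab_t *
mnttab_init(void)
{
	return (calloc(1, sizeof (mnttab_t)));
}

/*
 * Assumed convention: 0 on success, 1 at end of table, -1 if the
 * scratch buffer is too small (the iterator does not advance, so
 * the caller can retry with a larger buffer). All strings are
 * copied into the caller's scratch space, so the caller alone
 * decides how long they live.
 */
int
mnttab_get(mnttab_t *mt, struct mntent *ent, void *scratch, size_t scratchlen)
{
	const size_t nmounts = sizeof (fake_mounts) / sizeof (fake_mounts[0]);
	const char **out[3] = {
		&ent->mnt_special, &ent->mnt_mountp, &ent->mnt_fstype
	};
	char *p = scratch;
	size_t i, len;

	if (mt->mt_next >= nmounts)
		return (1);

	for (i = 0; i < 3; i++) {
		len = strlen(fake_mounts[mt->mt_next][i]) + 1;
		if (len > scratchlen)
			return (-1);
		(void) memcpy(p, fake_mounts[mt->mt_next][i], len);
		*out[i] = p;
		p += len;
		scratchlen -= len;
	}
	mt->mt_next++;
	return (0);
}

void
mnttab_fini(mnttab_t *mt)
{
	free(mt);
}
```

The point of the shape is that the caller never sees a FILE * or a path, the handle ties the iteration state to one object, and the memory for each entry belongs to whoever supplied the scratch buffer.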

The solution

Once /etc/mnttab became a filesystem, we could add ioctl(2) calls to do whatever we wanted. Once we're in the kernel, we know exactly how long each field of the structure is. We create a set of NUL-terminated strings directly in user space, and simply return pointers to them. This was more complicated than it sounds for the reasons outlined above. We also had to maintain the ability to read the file directly. With this fix, all C consumers "just work". Scripted programs will still choke on a mnttab entry with spaces, but these are a minority by far.

Note that the files /etc/vfstab and /etc/dfs/sharetab still suffer from this problem. There has been some discussion about how to resolve these issues, with the new Service Management Facility being touted as a possible solution. And ZFS (Sun's next generation filesystem) is avoiding /etc/vfstab altogether.


[1] There is always the possibility that the mounted filesystems change between the time the file is opened and the data is read.

Just thought I'd call attention to the fact that the Service Management Facility (SMF) has successfully integrated into build 64 of Solaris 10. Stephen has posted some teasers, and will hopefully continue with more examples, as well as encouraging his fellow team members to get into the blogging mood. This is one of the most visible Solaris 10 features, and brings reliability, availability, and ease of administration to new levels. It is supposed to hit the streets as build 65, aka Beta 7, aka Solaris Express 9/04. Stay tuned...

Tuesday Jul 20, 2004

In a departure from recent musings on the inner workings of Solaris, I thought I'd examine one of the issues that Bryan has touched on in his blog. Bryan has been looking at some of the larger issues regarding OS innovation, commoditization, and academic research. I thought I'd take a direct approach by examining our nearest competitor, Linux.

Bryan probably said it best: We believe that the operating system is a nexus of innovation.

I don't have a lot of experience with the Linux community, but my impression is that the OS is perceived as a commodity. As a result, Linux is just another OS, albeit one with open source and a large community to back it up. I see a lot of comments like "Linux basically does everything Solaris does" and "Solaris has a lot more features, but Linux is catching up." Very rarely do I see mention of features that blow Solaris (or other operating systems) out of the water. Linus himself has said:

A lot of the exciting work ends up being all user space crap. I mean, exciting in the sense that I wouldn't car [sic], but if you look at the big picture, that's actually where most of the effort and most of the innovation goes.

So Linus seems to agree with my intuition, but I'm in unfamiliar territory here. So, I pose the question:

Is the Linux operating system a source of innovation?

This is a specific question: I'm interested only in software innovation relating to the OS. Issues such as open source, ISV support, and hardware compatibility are irrelevant, as is software which is not part of the kernel or doesn't depend on its facilities. I consider software such as the Solaris ptools as falling under the purview of the operating system, because they work hand-in-hand with the /proc filesystem, a kernel facility. Software such as GNOME, KDE, X, GNU tools, etc, are all independent of the OS and not germane to this discussion. I'm also less interested in purely academic work; one of the pitfalls of academic projects is that they rarely see the light of day in a real-world commercial setting. Of course, most innovative work must begin as research before it can be viable in the industry, but certainly proven technologies make better examples.

I can name dozens of Solaris innovations, but only a handful of Linux ones. This could simply be because I know so much about Solaris and so little about Linux; I freely acknowledge that I'm no Linux expert. So are there great Linux OS innovations out there that I'm just not aware of?

Saturday Jul 17, 2004

In my last post I described how watchpoints work in Solaris, or how they're supposed to work. The reality is that there have been some small problems that have prevented a large number of watchpoints from being practical for complicated programs. I've made some changes in Solaris 10 so that they work in all situations, which made it onto Adam's Top 11-20 Features in Solaris 10.

How watchpoints are used

Typically, watchpoints are used in one of two ways. First, they are used for debugging userland applications. If you know that memory is getting corrupted, or know that a variable is being modified from an unknown location, you can set a watchpoint through a debugger and be notified when the variable changes. In this case, we only have to keep track of a handful of watchpoints. But they are also used for memory allocator redzones, to prevent buffer overflows and memory corruption. For every allocation, you put a watched region on either end, so that if the program tries to access unknown territory, a SIGTRAP signal is sent so the program can be debugged. In this case, we have to deal with thousands of watchpoints (two for every allocation), and we fault on virtually every heap access[1].

Watchpoints in strange places

Watchpoints have worked for the most part since they were put into Solaris. Whenever a watchpoint is tripped, we end up in the kernel, where we have to look at the instruction we faulted on and take appropriate action. There were some instructions that we didn't quite decode properly when there were watchpoints present. On SPARC, the cas and casx instructions (used heavily in recent C++ libraries) could cause a SEGV if they tried to access a watched page. On x86, instructions that accessed the stack (pushl and movl, for example) would cause a similar segfault if there was a watchpoint on a stack page.

Multithreaded programs

There has been a particularly nasty watchpoint problem for a while when dealing with lots of watchpoints in multithreaded programs. When one thread hits a watchpoint, we have to stop all the other threads. But in the process of stopping, those threads may trigger a watchpoint of their own, at which point we try to stop the original watchpoint thread. We end up spinning in the kernel, where the only solution is to reboot the system.

Scalability

In the past, watchpoints were kept in a linked list for each process. This means that every time a program added a watchpoint or accessed a watched page, it would spend a linear amount of time trying to find the watchpoint. This is fine when you only have a handful of watchpoints, but can be a real problem when you have thousands of them. These linked lists have since been replaced with AVL trees. Individual watchpoints may be slow, but 10,000 watchpoints have nearly the same impact as 10 watchpoints. This can result in as much as a 100x improvement for large numbers of watchpoints.
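The difference between the two lookup strategies can be sketched in a few lines. Note the substitution: a sorted array with binary search stands in for the kernel's AVL tree here, since both give logarithmic lookups and the array is much shorter to show. The watched_area_t type and function names are hypothetical, and the areas are assumed to be sorted by start address and non-overlapping.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct watched_area {
	uintptr_t wa_start;	/* first watched address (inclusive) */
	uintptr_t wa_end;	/* end of watched range (exclusive) */
} watched_area_t;

/* Old approach: walk every entry until one contains addr. O(n). */
static const watched_area_t *
wa_find_linear(const watched_area_t *wa, size_t n, uintptr_t addr)
{
	for (size_t i = 0; i < n; i++) {
		if (addr >= wa[i].wa_start && addr < wa[i].wa_end)
			return (&wa[i]);
	}
	return (NULL);
}

/*
 * Ordered approach: halve the candidate range at each step. O(log n).
 * Requires wa[] sorted by wa_start with no overlapping areas.
 */
static const watched_area_t *
wa_find_sorted(const watched_area_t *wa, size_t n, uintptr_t addr)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (addr < wa[mid].wa_start)
			hi = mid;
		else if (addr >= wa[mid].wa_end)
			lo = mid + 1;
		else
			return (&wa[mid]);
	}
	return (NULL);
}
```

With 10,000 areas the linear version may touch all 10,000 entries on a miss, while the ordered version needs about 14 comparisons, which is why the watchpoint count stops mattering much.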

All of the above problems have been fixed in Solaris 10. The end result is that tools like watchmalloc(3malloc) and dbx's memory checking features are actually practical on large programs.


[1] Remember that we have to fault on every access to a page that contains a watchpoint, even if it's not the address we're actually interested in.

Tuesday Jul 13, 2004

As Adam noted in the Solaris Top 11-20, watchpoints are now much more useful in Solaris 10. Before I go into specific details regarding Solaris 10 improvements, I thought I'd give a little technical background on how watchpoints actually work. This will be my second highly technical entry in as many days; in my next post I promise to tie this into some real-world applications and noticeable improvements in S10.

The idea of watchpoints has been around for a long time. The basic idea is to allow a debugger to set a watchpoint on a region of memory within a process. When that region of memory is accessed, the debugger is notified in order to take appropriate action. This typically serves two purposes. First, it's useful for interactive debuggers when determining when a region of memory gets modified. Second, it can be used as a protection mechanism to avoid buffer overflows (more on this later).

As with most modern operating systems, Solaris implements a virtual memory system. A complete explanation of how virtual memory works is beyond the scope of a single blog post. The simplest way to explain it is that each process refers to memory by a virtual address, which corresponds to a physical piece of memory. Each piece of memory is called a page, which can be either mapped (resident in RAM) or unmapped (possibly stored on disk). The operating system has control over when and how pages get mapped in or out of memory. If a program tries to access memory that is unmapped, the OS will map in the necessary pages as needed. Once pages are mapped, accesses will be handled directly in hardware until the OS decides to unmap the memory[1]. There are many benefits of this, including the ability for processes to see a unified flat memory space, inability to access other processes' memory, and the ability to store unused pages on disk until needed.

To implement watchpoints, we need a way for the operating system to intercept accesses to a specific virtual page within a process. If we leave pages mapped, then accesses will be handled in hardware and the OS will have no say in the matter. So we keep pages with watchpoints unmapped until they are actually accessed. When the process tries to read/write/execute from the watched page, the OS gets notified via a trap[2]. At this point, we temporarily map in the page and single step over the instruction that triggered the trap. If the instruction touches a watched area (note that there can be more than one watched area within a page), then we notify the debugger through a SIGTRAP signal. Otherwise, the instruction executes normally and the process continues.
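The decision described above can be modeled in user space as a pure function. This is only a sketch of the logic, not kernel code: the 4096-byte page size, the watch_t type, and the wp_classify name are all assumptions, and the access and watched lengths are assumed to be nonzero.

```c
#include <stddef.h>
#include <stdint.h>

#define	WP_PAGESIZE	((uintptr_t)4096)	/* assumed page size */

typedef struct watch {
	uintptr_t	w_addr;		/* start of watched area */
	size_t		w_size;		/* length in bytes, nonzero */
} watch_t;

enum wp_result {
	WP_NO_TRAP,		/* page stays mapped; hardware handles it */
	WP_TRAP_CONTINUE,	/* trapped, but no watched area touched */
	WP_TRAP_NOTIFY		/* trapped and a watched area was touched */
};

static uintptr_t
page_of(uintptr_t addr)
{
	return (addr / WP_PAGESIZE);
}

/* Classify an access of len bytes at addr against all watched areas. */
enum wp_result
wp_classify(const watch_t *w, size_t nwatch, uintptr_t addr, size_t len)
{
	enum wp_result res = WP_NO_TRAP;
	uintptr_t last = addr + len - 1;	/* last byte accessed */
	size_t i;

	for (i = 0; i < nwatch; i++) {
		uintptr_t wlast = w[i].w_addr + w[i].w_size - 1;

		/* Byte ranges overlap: the debugger must be told. */
		if (addr <= wlast && w[i].w_addr <= last)
			return (WP_TRAP_NOTIFY);

		/* Shares a page only: we still trap, then carry on. */
		if (page_of(addr) <= page_of(wlast) &&
		    page_of(w[i].w_addr) <= page_of(last))
			res = WP_TRAP_CONTINUE;
	}
	return (res);
}
```

The middle case is the expensive one in practice: an access that merely lands on the same page as a watched area still pays for the trap and the single step, even though the debugger never hears about it.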

Things become a little more complicated in a multithreaded program. If we map in a page for a single thread, then all other threads in the process will be able to access that memory without OS intervention. If another thread accesses the memory while we're stepping over the instruction, we can miss triggering a watchpoint. To avoid this, we have to stop every thread in the process while we step over the faulting instruction. This can be very expensive; we're looking into more efficient methods. I won't spend too much time discussing how the debugger communicates with the OS when setting and reacting to watchpoints. Most of this information can be found in the proc(4) manpage.

With my next post I'll examine some of the specific enhancements made to watchpoints in Solaris 10.


[1] This is obviously a very simplistic view of virtual memory. Curious readers should try a good OS textbook or two for more detailed information.

[2] Traps are quite an interesting subject by themselves. On Solaris SPARC, you can see what traps are occurring with the very cool trapstat(1M) utility that Bryan wrote.

Monday Jul 12, 2004

I'm finally back from vacation, and I'm here to help out with Adam's Top 11-20 Solaris Features. I'll be going into some details regarding one of the features I integrated into Solaris 10, pfiles with pathnames (which was edged out by libumem for the #11 spot by a lean at the finish line). This will be a technical discussion; for a good overview of why it's so useful, see my previous entry.

There were several motivations for this project:

  1. Provide path information for MDB and DTrace.
  2. Make pathname information available for pfiles(1).
  3. Improve the performance of getcwd(3c).

First of all, we needed to record the information in the kernel somewhere. In Solaris, we have what's known as the Virtual File System (VFS) layer. This is an abstract interface, where each file system fills in the implementation details so that no other consumer has to know. Each file is represented by a vnode, which can be thought of as a superclass if you're familiar with inheritance. The end result of this is that we can open a UFS file in the same way we open a /proc file, and the only one who knows the difference is the underlying filesystem. We can also change things at the VFS layer and not have to worry about each individual filesystem.

To address concerns over performance and the difficulty of bookkeeping, it was necessary to adjust the constraints of the problem appropriately. It is extremely difficult, if not impossible, to ensure that the path is always correct (consider hard links, unlinked files, and directory restructuring). To make the problem easier, we make no claim that the path is currently correct, only that it was correct at one time. Whenever we translate from a path to a vnode (known as a lookup) for the first time, we store the path information within the vnode. The performance hit is negligible (a memory allocation and a few string copies) and it only occurs when first looking up the vnode. We must be prepared for situations where no pathname is available, as some files have no meaningful path (sockets, for example).

With the magic of CTF, MDB and DTrace need no modification. Crash dumps now have pathnames for every open file, and with a little translator magic we end up with a stable DTrace interface like the io provider. We also use this to improve getcwd performance. Normally, we would have to look up "..", iterate over each entry until we find the matching vnode, record the entry name, lather, rinse, repeat. Now, we take a first stab at it by doing a forward lookup of the cached pathname, and if it's the same vnode, then we simply return the pathname. getcwd has very stringent correctness requirements, so we have to fall back to the old method when our shortcut fails.
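The getcwd shortcut has a simple user-level analogy. In this sketch, the device and inode numbers from stat(2) play the role of vnode identity; the kernel compares vnodes directly, so this is an illustration of the idea rather than the real implementation, and cached_path_lookup is a hypothetical name.

```c
#define	_POSIX_C_SOURCE	200809L	/* for stat() under strict compilers */

#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Try the fast path: look up the cached name and check that it
 * still refers to the same object as `target`. Returns 0 and copies
 * the name into buf on success. Returns -1 if the name is stale,
 * missing, or too long, in which case the caller must fall back to
 * the slow ".." walk.
 */
static int
cached_path_lookup(const char *cached, const struct stat *target,
    char *buf, size_t buflen)
{
	struct stat st;

	if (cached == NULL || strlen(cached) >= buflen)
		return (-1);
	if (stat(cached, &st) != 0)
		return (-1);
	if (st.st_dev != target->st_dev || st.st_ino != target->st_ino)
		return (-1);

	(void) strcpy(buf, cached);	/* length checked above */
	return (0);
}
```

Because the cached name is only trusted after the forward lookup confirms it, a stale path costs one extra lookup and never produces a wrong answer.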

The only remaining difficulty was exporting this information to userland for programs like pfiles to use. For those of you familiar with /proc, this is exactly the type of problem it was designed to solve. We added symbolic links in /proc/<pid>/path for the current working directory, the root directory, each open file descriptor, and each object mapped in the address space. This allows you to run ls -l in the directory and see the pathname for each file. More importantly, the modifications to pfiles become trivial. The only tricky part is security. Because a vnode can have only one name, and there can be hard links to files or changing permissions, it's possible for the user to be unable to access the path as it was originally saved. To avoid this, we do the equivalent of a resolvepath(2) in the kernel, and reject any paths which cannot be accessed or do not map to the same vnode. The end result of this is that we may lose this information in some exceptional circumstances (the directory layout of a filesystem is relatively static) but as Bart is fond of reminding us: performance is a goal, correctness is a constraint.
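Since the entries are ordinary symbolic links, a consumer can read them with readlink(2). The sketch below shows one way that might look; the function names are hypothetical, it is not the pfiles source, and on systems without the /proc/<pid>/path directory the readlink call simply fails.

```c
#define	_POSIX_C_SOURCE	200809L	/* for readlink() under strict compilers */

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Build "/proc/<pid>/path/<fd>". Returns 0, or -1 if buf is too small. */
int
proc_path_link(pid_t pid, int fd, char *buf, size_t buflen)
{
	int n = snprintf(buf, buflen, "/proc/%ld/path/%d", (long)pid, fd);

	return ((n < 0 || (size_t)n >= buflen) ? -1 : 0);
}

/*
 * Resolve the saved pathname for a descriptor of another process.
 * Returns the name's length, or -1 if the link is absent or the
 * kernel rejected the path. A name that fills the buffer may have
 * been truncated.
 */
ssize_t
fd_pathname(pid_t pid, int fd, char *buf, size_t buflen)
{
	char link[64];
	ssize_t n;

	if (buflen == 0 || proc_path_link(pid, fd, link, sizeof (link)) != 0)
		return (-1);

	n = readlink(link, buf, buflen - 1);
	if (n >= 0)
		buf[n] = '\0';	/* readlink() does not NUL-terminate */
	return (n);
}
```

A missing link is a normal outcome rather than an error to report loudly, since some files (sockets, for example) have no meaningful path.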

Thursday Jul 01, 2004

In a departure from my usual Solaris propaganda, I thought I'd try a little bit of history. This entry is aimed at all of you C programmers out there who enjoy the novelty of Obfuscated C. If you think you're a real C hacker, and haven't heard of the obfuscated C contest, then you need to spend a few hours browsing their archives of past winners[1].

If you've been reading manpages on your UNIX system, you've probably been using some form of troff[2]. This is an early typesetting language processor, dating back to pre-UNIX days. You can find some history here. The nroff and troff commands are essentially the same; they are built largely from the same source and differ only in their options and output formats.

The original troff was written by Joe F. Ossanna in assembly language for the PDP-11 in the early 70s. Along came this whizzy portable language known as C, so Ossanna rewrote his formatting program. However, it was less of a rewrite and more of a direct translation of the assembly code. The result is a truly incomprehensible tangle of C code, almost completely uncommented. To top it off, Ossanna was tragically killed in a car accident in 1977. Rumour has it that attempts were made to enhance troff, before Brian Kernighan caved in and rewrote it from scratch as ditroff.

If you're curious just how incomprehensible 7000 lines of uncommented C code can be, you can find a later version of it from The Unix Tree, an invaluable resource for the nostalgic among us. To begin with, the files are named n1.c, n2.c, etc. To quote from 'n6.c':

setch(){
	register i,*j,k;
	extern int chtab[];

	if((i = getrq()) == 0)return(0);
	for(j=chtab;*j != i;j++)if(*(j++) == 0)return(0);
	k = *(++j) | chbits;
	return(k);
}
find(i,j)
int i,j[];
{
	register k;

	if(((k = i-'0') >= 1) && (k <= 4) && (k != smnt))return(--k);
	for(k=0; j[k] != i; k++)if(j[k] == 0)return(-1);
	return(k);
}

If this doesn't convince you to write well-structured, well-commented code, I don't know what will. The scary thing is that there are at least 18 bugs in our database open against nroff or troff; one of the side-effects of promising full backwards compatibility. Anyone who has the courage to putback nroff changes earns a badge of honor here - it is a dark place that has claimed the free time of a few brave programmers[3]. Whenever an open bug report includes such choice phrases as this, you know you're in trouble:

I've seen this problem on non-Sun Unix as well, like Ultrix 3.1 so the problem likely came from Berkeley. The System V version of *roff (ditroff ?) doesn't have this problem.


[1] One of my personal favorites is this little gem, a 2000 winner 'natori'. It should be a full moon tomorrow night...

#include <stdio.h>
#include <math.h>
double l;main(_,o,O){return putchar((_--+22&&_+44&&main(_,-43,_),_&&o)?(main(-43,
++o,O),((l=(o+21)/sqrt(3-O*22-O*O),l*l<4&&(fabs(((time(0)-607728)%2551443)/
405859.-4.7+acos(l/2))<1.57))[" #"])):10);}

[2] On Solaris, most manpages are written in SGML, and can be found in /usr/share/man/sman*.

[3] I'd like to think that the x86 disassembler is a close second, but maybe that's just because I'm a survivor.

During S10 development, there have been numerous enhancements to the ptools (see proc(1)). Here are two recent additions that may have slipped through the cracks with all the hype surrounding Solaris 10. They're not quite as groundbreaking as DTrace or Zones, but well-suited for some blog exposure.

pargs -l

The pargs command has a new option to display the command and all its arguments on a single line. This makes it possible to cut and paste to restart running commands with the same set of arguments.

$ pargs -l `pgrep sleep`
/usr/bin/sleep 10 here are some args
$

Java support in pstack

This one won't hit the streets until build 59, which is due out as the next Solaris Express build. Thanks to the JVM guys, we've added support for pstack to display Java frames. If you're using the latest Java release (1.5, err... 5.0) and you run pstack on a java process, you'll get to see all the Java functions, including line numbers. Note the Java frames with an asterisk in the example below.

$ cat Main.java
public class Main {
        
        public static int go(int a) {

                if (a == 0) {
                        for (;;)
                                continue;
                }

                return (1 + go(a - 1));
        }

        public static void main(String[] argv) {
                System.out.println("running...");
                go(10);
        }
}
$ pstack `pgrep java`
144381: /usr/jdk/instances/jdk1.5.0/bin/java Main
-----------------  lwp# 1 / thread# 1  --------------------
 9940dfae * Main.go(I)I+0
 99402a3f * Main.go(I)I+11 (line 10)
 99402a3f * Main.go(I)I+11 (line 10)
 99402a3f * Main.go(I)I+11 (line 10)
 99402a3f * Main.go(I)I+11 (line 10)
 99402a3f * Main.go(I)I+11 (line 10)
 99402a3f * Main.go(I)I+11 (line 10)
 99402a3f * Main.go(I)I+11 (line 10)
 99402a3f * Main.go(I)I+11 (line 10)
 99402a3f * Main.go(I)I+11 (line 10)
 99402a3f * Main.go(I)I+11 (line 10)
 99402a3f * Main.main([Ljava/lang/String;)V+10 (line 15)
 9f4dbbe4 * StubRoutines (1)
 9f4dbbe4 __1cCosUos_exception_wrapper6FpFpnJJavaValue_pnMmethodHandle_pnRJavaCa
llArguments_pnGThread__v2468_v_ (8047130, 8047038, 8047068, 8074538, 804702c, 9f
4dbee8) + 14
 9f4dbbe4 __1cCosUos_exception_wrapper6FpFpnJJavaValue_pnMmethodHandle_pnRJavaCa
llArguments_pnGThread__v2468_v_ (9f4dbc90, 8047130, 8047038, 8047068, 8074538) +
 14
 9f4dbee8 __1cJJavaCallsEcall6FpnJJavaValue_nMmethodHandle_pnRJavaCallArguments_
pnGThread__v_ (8047130, 8074ac4, 8047068, 8074538) + 28
 9f6ee200 __1cRjni_invoke_static6FpnHJNIEnv__pnJJavaValue_pnI_jobject_nLJNICallT
ype_pnK_jmethodID_pnSJNI_ArgumentPusher_pnGThread__v_ (80745f4, 8047130, 0, 0, 8
072681, 804713c) + 180
 9f59f7af jni_CallStaticVoidMethod (80745f4, 80750a0, 8072681, 80750b0) + 10f
 080526ee main     (0, 806fbf8, 8047a04) + a4c
 08051c0a ???????? (2, 8047ae0, 8047b05, 0, 8047b0a, 8047b31) + 8051c0a
-----------------  lwp# 2 / thread# 2  --------------------
 9fb53e3c lwp_cond_wait (8160740, 8160728, 0, 0)
 9f4b6182 __1cHMonitorEwait6Mil_i_ (81208c8, 1, 0) + 432
 9f6bfcad __1cNGCTaskManagerIget_task6MI_pnGGCTask__ (81606c0, 0) + 90

[ ... ]