This past week at OSCON I've spent my time trying to understand open source processes, talking about Solaris, and trying to figure out what OpenSolaris is going to look like.
Learning from Linux
I attended a talk by Greg Kroah-Hartman about Linux kernel development. As we work towards open sourcing Solaris, we're trying to figure out how to do it right -- source control, process, licenses, community, etc. As I didn't know much about how Linux development works, I was hoping to learn from a largely successful open source operating system.
Linux development is built around fiefdoms maintained by folks like Greg. Ordinary folks can contribute to the repositories they maintain (either directly or by proxy based on some sort of Linux-street-cred it seems). Those repositories are then fed up to a combined unstable repository, and from there Linus himself ordains the patches and welcomes them into the circle of Linux 2.6.x. This all seemed to make some sense and work alright. That is, until someone asked about firewire support. The answer: "I wouldn't run firewire -- it should build [laughter], but I wouldn't run it." The discussion then led to Linux testing, which, it seems, is highly ad hoc and unreliable. IBM and Novell are working on nightly testing runs, but very little exists today in terms of quality control tests or general tests that developers themselves can run before they integrate their changes.
In Solaris, testing can be arduous. Some changes are obvious and can be tested on just one architecture, but others require extensive tests on a variety of SPARC and x86 platforms. And Linux supports so many more platforms! I have no idea how a developer working on his x86 box can ever be sure that some seemingly innocuous change hasn't broken 64-bit PPC (or whatever). Clearly this is something we have to solve for OpenSolaris -- reliability and testing are at the core of our DNA in the Solaris kernel group, and we need to export not only that idea to the community, but also some subset of facilities so that contributors can adhere to the same levels of quality.
OpenSolaris
Later in the day a bunch of us from Solaris met with some open source leaders (I don't know quite how one earns that title, but that's what our liaison told us they were). We first told them where we were: Yes, we really are going to open source Solaris; no, we don't know the license yet; no, we don't know if it's going to be GPL compatible; no, we aren't planning on moving the cool stuff in Solaris over to Linux ourselves; and, no, we do not know what the license is going to be, but we promise to tell you when we do.
We got a lot of helpful suggestions from folks involved with Apache and other projects. "Bite sized bugs" sound like a great way to get new people involved with Solaris and contributing code without a huge investment of effort. Documentation, partitioning and documenting that partitioning will all be much more important than we had previously anticipated. We get the message: OpenSolaris will be easy to download, build, and install, and we'll make sure it's as easy as possible to get started with development.
The one thing that disappointed me was the lack of knowledge about Solaris 10 -- some comments suggested that the members of the panel think Solaris doesn't really have anything interesting. Fortunately we had the BOF that night...
The Solaris Community
Andy, Bart, Eric and I held a BOF session last night to talk about OpenSolaris and Solaris 10. After satiating the crowd's curiosity about open sourcing Solaris (no, we don't know what the license is going to be), we gave some Solaris 10 demonstrations.
Since we were tight on time, I buzzed through the DTrace demo in about 15 minutes touching on the syscall provider, aggregations, the ustack() action, user-land tracing with the pid provider and kernel tracing with the fbt provider. Whew. Then came the questions -- some of the audience members had used DTrace, others had heard of it; almost everyone had a question. When I demo DTrace, there's always this great moment of epiphany that people go through. I can see it on their faces. After the initial demo they look like people who just got off a roller coaster -- windblown and trying to understand what just happened, but there's always this moment, this ah-ha moment, when something -- the answer to a question, an additional example, an anecdote -- sparks them into understanding. It's great to see someone suddenly sit up in her chair and start nodding vigorously at every new sight I point out in the DTrace guided tour.
My personal favorite piece of input about OpenSolaris was someone's claim that six months after Solaris goes open source there will be a port to PowerBook hardware. If that's true then everyone in the Solaris kernel group is going to have PowerBooks in 6 months plus a day.
Before the BOF, I was worried that we might not find our community for open source Solaris. Not only was there a good crowd at the BOF, but they were interested and impressed. That's our community, and those are the people who are going to be contributing to OpenSolaris. And I can't wait for it to happen.
Two members of the ZFS team have joined the blogging fray. Check out Matt Ahrens's and Val Henson's weblogs.
For the uninitiated, ZFS is the brand new file system that's going to be in Solaris 10. ZFS is incredibly fast, reliable, and easy to manage. I recently moved my home directory from our UFS file server to an experimental ZFS file server. Opening my (extensive) mail spool went from 20 seconds to 3; doing an ls(1) sped up by more than a factor of two in my home directory; and the repository of crash dumps I keep went from 40G to 4G. This is really cool technology both under the hood and from the point of view of users and administrators -- stay tuned to their weblogs for all the details.
eWEEK has an article on DTrace. The analysis of DTrace is pretty accurate, but they refer opaquely to Bryan and me as "Sun officials". I'm looking forward to their in-depth comparison of DTrace and DProbes...
This afternoon, I'm leaving for OSCON (easily confused with the bi-mon-sci-fi-con). Here in Solaris Kernel Development we've been talking a bunch about the impending open sourcing of Solaris and what that's going to look like. I'm very excited about OpenSolaris itself, and I'm looking forward to talking to folks at OSCON to hear what they think.
The part they're going to find most surprising is that this is for real. Within a year or two, there are going to be people from outside of Sun contributing to Solaris. Period. It's going to be a little scary, but I'm excited about the possibilities of what this might mean for Solaris (my dream of Solaris on my PowerBook might even come true).
We're holding a BOF session on Thursday at 9pm. So come by if you want to talk to Andy, Eric, Bart or me about the cool stuff in Solaris 10 or about open sourcing Solaris. Stay for the free (as in beer) beer.
I haven't been as prolific a blog writer as I'd like for the last few days because I've been working morning, noon, and night on some pretty cool new stuff for DTrace. Here's a teaser; I promise I'll give you more later when I have it all working:
bash-2.05b# dtrace -l -n plockstat100694:::
ID PROVIDER MODULE FUNCTION NAME
37394 plockstat694 libc.so.1 mutex_lock_queue mutex-block
37395 plockstat694 libc.so.1 rwlock_lock rw-block
37396 plockstat694 libc.so.1 __mutex_trylock mutex-acquire
37397 plockstat694 libc.so.1 __mutex_unlock mutex-release
37398 plockstat694 libc.so.1 mutex_unlock_internal mutex-release
37399 plockstat694 libc.so.1 mutex_trylock_adaptive mutex-spin
37400 plockstat694 libc.so.1 rwlock_lock rw-block
37401 plockstat694 libc.so.1 cond_wait_queue mutex-acquire
37402 plockstat694 libc.so.1 _pthread_spin_trylock mutex-acquire
37403 plockstat694 libc.so.1 lmutex_lock mutex-acquire
37404 plockstat694 libc.so.1 __mutex_trylock mutex-acquire
37405 plockstat694 libc.so.1 mutex_lock_impl mutex-acquire
37406 plockstat694 libc.so.1 fast_process_lock mutex-acquire
...
In case you need a hint, note that plockstat rhymes with lockstat(1M)...
Another linker alien has joined the b.s.c. fray. Mike Walker already has some useful stuff about shared libraries that you should check out. If you have linker questions, do what I do: ask Mike.
go to the Solaris 10 top 11-20 list for more
Bart Smaalders has written some great stuff about event ports including an extensive coding example. Event ports provide a single API for tying together disparate sources of events. We had baby steps in the past with poll(2) and select(3C), but event ports let you monitor file descriptors and timers as well as deal with asynchronous I/O and your own custom events.
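For contrast, here is a minimal sketch of the older poll(2) style, which only knows about file descriptors. This is not Bart's example and it does not use the event port API; the helper name read_with_timeout() is made up for illustration, and restarting the full timeout after an interrupted wait is a simplification.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper (the name is an assumption, not a Solaris interface):
 * wait up to timeout_ms for fd to become readable, then read from it.
 * Returns the byte count from read(2), or -1 with errno set; a timeout is
 * reported as -1 with errno == ETIMEDOUT.
 */
ssize_t
read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
	struct pollfd pfd;
	int ready;

	assert(buf != NULL);

	pfd.fd = fd;
	pfd.events = POLLIN;
	pfd.revents = 0;

	/* Retry if a signal interrupts the wait (the timeout starts over). */
	do {
		ready = poll(&pfd, 1, timeout_ms);
	} while (ready == -1 && errno == EINTR);

	if (ready == -1)
		return (-1);
	if (ready == 0) {
		errno = ETIMEDOUT;
		return (-1);
	}

	/* Readable, hung up, or in error: let read(2) report the details. */
	return (read(fd, buf, len));
}
```

Anything that isn't a file descriptor (a timer, an asynchronous I/O completion, an application-defined event) has to be bolted on separately in this style, which is the gap event ports are meant to close.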
Corporate shill that I am, there's now a little article I wrote on the meet the architects page. The DTrace team already has a column as a group, but in this one I focus on application tracing which was my primary contribution to DTrace.
Here's a little secret about software development: different groups usually aren't that good at working with one another. That's probably not such a shocker for most of you, but the effects can be seen everywhere, and that's why tight integration can be such a distinguishing feature for a collection of software.
About a year and a half ago, we had the DTrace prototype working on much of the system: from kernel functions, through system calls, to every user-land function and instruction. But we were focused completely on C and C++ based applications and this java thing seemed to be catching on. In a radical move, we worked with some of the java guys to take the first baby step in making DTrace and Solaris's other observability tools begin to work with java.
ustack() action for java
One of the most powerful features of DTrace is its ability to correlate low level events in the kernel -- disk I/O, scheduler events, networking, etc. -- with user-land activity. What application is generating all this I/O to this disk? DTrace makes answering that a snap. But what about when you want to dive deeper? What is that application actually doing to generate all that kernel activity? The ustack() action records the user-land stack backtrace so even in that prototype over a year ago, you could hone in on the problem.
Java, however, was still a mystery. Stacks in C and C++ are fairly easy to record, but in java, some methods are interpreted and just-in-time (JIT) compilation means that other methods can move around in the java virtual machine's (JVM) address space. DTrace needed help from the JVM. Working with the java guys, we built a facility where the JVM actually contains a little bit of D (DTrace's C-like language) machinery that knows how to interpret java stacks. We enhanced the ustack() action to take an optional second argument for the number of bytes to record (we've also recently added the jstack() action; see the DTrace Solaris Express Schedule for when it will be available) so when we use the ustack() action in the kernel on a thread in the JVM, that embedded machinery takes over and fills in those bytes with the symbolic interpretation for those methods. Either Bryan or I will give a more complete (and comprehensible) description in the future, but an example should speak volumes:
# dtrace -n profile-100'/execname == "java"/{ @[ustack(50, 512)] = count() }'
...
java/security/AccessController.doPrivileged
java/net/URLClassLoader.findClass
java/lang/ClassLoader.loadClass
sun/misc/Launcher$AppClassLoader.loadClass
java/lang/ClassLoader.loadClass
java/lang/ClassLoader.loadClassInternal
StubRoutines (1)
...
It seems simple, but there's a lot of machinery behind this simple view, and this is actually an incredibly powerful and unique view of the system. Maybe you've had a java application that generated a lot of I/O or had some unexpected latency -- using DTrace and its java-enabled ustack() action, you can finally track the problem down.
pstack(1) for java
While we had the java guys in the room, we couldn't pass up the opportunity to collaborate on getting stacks working in another observability tool: pstack(1). The pstack(1) utility can print out the stack traces of all the threads in a live process or a core file. We implemented it slightly differently than DTrace's ustack() action, but pstack(1) now works on java processes and java core files.
Collaboration is a great thing, and I hope you find the fruits of collaborative effort useful. These are just the first steps -- we have much more planned for integrating Solaris and DTrace with java.
pmap(1)
For the uninitiated, pmap(1) is a tool that lets you observe the mappings in a process. Here's some typical output:
311981: /usr/bin/sh
08046000 8K rw--- [ stack ]
08050000 80K r-x-- /sbin/sh
08074000 4K rwx-- /sbin/sh
08075000 16K rwx-- [ heap ]
C2AB0000 64K rwx-- [ anon ]
C2AD0000 752K r-x-- /lib/libc.so.1
C2B9C000 28K rwx-- /lib/libc.so.1
C2BA3000 16K rwx-- /lib/libc.so.1
C2BB1000 4K rwxs- [ anon ]
C2BC0000 132K r-x-- /lib/ld.so.1
C2BF1000 4K rwx-- /lib/ld.so.1
C2BF2000 8K rwx-- /lib/ld.so.1
total 1116K
You can use this to understand various addresses you might see from a debugger, or you can use other modes of pmap(1) to see the page sizes being used for various mappings, how much of the mappings have actually been faulted in, the attached ISM, DISM or System V shared memory segments, etc. In Solaris 10, pmap(1) has some cool new features -- after a little more thought, I'm not sure that this really belongs on the top 11-20 list, but this is a very cool tool and gets some pretty slick new features; anyways the web affords me the chance for some revisionist history if I feel like updating the list...
thread and signal stacks
When a process creates a new thread, that thread needs a stack. By default, that stack comes from an anonymous mapping. Before Solaris 10, those mappings just appeared as [ anon ] -- undifferentiated from other anonymous mappings; now we label them as thread stacks:
311992: ./mtpause.x86 2
08046000 8K rwx-- [ stack ]
08050000 4K r-x-- /home/ahl/src/tests/mtpause/mtpause.x86
08060000 4K rwx-- /home/ahl/src/tests/mtpause/mtpause.x86
C294D000 4K rwx-R [ stack tid=3 ]
C2951000 4K rwxs- [ anon ]
C2A5D000 4K rwx-R [ stack tid=2 ]
...
That can be pretty useful if you're trying to figure out what some address means in a debugger; before you could tell that it was from some anonymous mapping, but what the heck was that mapping all about? Now you can tell at a glance that it's the stack for a particular thread.
Another kind of stack is the alternate signal stack. Alternate signal stacks let threads handle signals like SIGSEGV which might arise due to a stack overflow of the main stack (leaving no room on that stack for the signal handler). You can establish an alternate signal stack using the sigaltstack(2) interface. If you allocate the stack by creating an anonymous mapping using mmap(2), pmap(1) can now identify the per-thread alternate signal stacks:
...
FEBFA000 8K rwx-R [ stack tid=8 ]
FEFFA000 8K rwx-R [ stack tid=4 ]
FF200000 64K rw--- [ altstack tid=8 ]
FF220000 64K rw--- [ altstack tid=4 ]
...
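Here is a minimal sketch of setting up an alternate signal stack on an anonymous mapping, the arrangement described above. The names install_altstack() and segv_handler() are made up for illustration, the 64K size in the usage note is simply borrowed from the output above, and the feature-test macro at the top is an assumption for building with glibc.

```c
/* Assumption: needed on glibc under strict -std modes; harmless elsewhere. */
#ifndef _DEFAULT_SOURCE
#define	_DEFAULT_SOURCE 1
#endif

#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical handler: it runs on the alternate stack, so it can still
 * run after the thread has overflowed its main stack.
 */
static void
segv_handler(int sig)
{
	static const char msg[] = "caught SIGSEGV on the alternate stack\n";

	(void) sig;
	(void) write(STDERR_FILENO, msg, sizeof (msg) - 1);
	_exit(1);
}

/*
 * Hypothetical helper: give the calling thread an alternate signal stack of
 * size bytes backed by an anonymous mapping, and deliver SIGSEGV on it.
 * Returns 0 on success, -1 on failure.
 */
int
install_altstack(size_t size)
{
	stack_t ss;
	struct sigaction sa;
	void *base;

	assert(size > 0);

	base = mmap(NULL, size, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);
	if (base == MAP_FAILED)
		return (-1);

	ss.ss_sp = base;
	ss.ss_size = size;
	ss.ss_flags = 0;
	if (sigaltstack(&ss, NULL) != 0) {
		(void) munmap(base, size);
		return (-1);
	}

	/* SA_ONSTACK asks for delivery on the alternate stack. */
	(void) memset(&sa, 0, sizeof (sa));
	sa.sa_handler = segv_handler;
	sa.sa_flags = SA_ONSTACK;
	(void) sigemptyset(&sa.sa_mask);
	return (sigaction(SIGSEGV, &sa, NULL));
}
```

Each thread that wants its own alternate stack would call something like install_altstack(64 * 1024) once, early on.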
core file content
Core files have always contained a partial snapshot of a process's memory mappings. Now that you can manually adjust the content of a core file (see my previous entry) some ptools will give you warnings like this:
pargs: core 'core' has insufficient content
So what's in that core file? pmap(1) now lets you see that easily; mappings whose data is missing from the core file are marked with a *:
$ coreadm -P heap+stack+data+anon
$ cat
^\Quit - core dumped
$ pmap core
core 'core' of 312077: cat
08046000 8K rw--- [ stack ]
08050000 8K r-x--* /usr/bin/cat
08062000 4K rwx-- /usr/bin/cat
08063000 40K rwx-- [ heap ]
C2AB0000 64K rwx--
C2AD0000 752K r-x--* /lib/libc.so.1
C2B9C000 28K rwx-- /lib/libc.so.1
C2BA3000 16K rwx-- /lib/libc.so.1
C2BC0000 132K r-x--* /lib/ld.so.1
C2BF1000 4K rwx-- /lib/ld.so.1
C2BF2000 8K rwx-- /lib/ld.so.1
total 1064K
If you're looking at a core file from an earlier release or from a customer in the field, you can quickly tell if you're going to be able to get the data you need out of the core file or if the core file can only be interpreted on the original machine or whatever.
core files
Core files are snapshots of a process's state. They contain some of the memory segments (e.g. the stack and heap) as well as some of the in-kernel state associated with the process (e.g. the signal masks and register values). When a process gets certain signals, the kernel, by default, kills the process and produces a core file. You can also create core files from running processes -- without altering the process -- using Solaris's gcore(1) utility.
So when your application crashed in the field, you could just take the core file and debug it right? Well, not exactly. Core files contained a partial snapshot of the process's memory mappings -- in particular they omitted the read-only segments which contained the program text (instructions). As a result you would have to recreate exactly the environment from the machine where the core file was produced -- identical versions of the libraries, application binary and loadable modules. Consequently, core files were mostly useful for developers in development (and even then, an old core file could be useless after a recompilation). And this isn't just Solaris -- every OS I've ever worked with has omitted program text from core files, making those core files of marginal utility once they've left the machine that produced them.
coreadm(1M)
In Solaris 7 we introduced coreadm(1M) to let users and system administrators control the location and name of core files. Previously, core files had always been named "core" and resided in the current working directory of the process that dumped the core. With coreadm(1M) you can name core files whatever you want including meta characters that expand when the core is created; for example, "core.%f.%n" would expand to "core.staroffice.dels" if staroffice were to dump core on my desktop (named dels). System administrators can also set up a global repository for all cores produced on the system to keep an eye on programs unexpectedly dumping core (naturally in Solaris 10, zone administrators can set up per-zone core file repositories).
In Solaris 10, coreadm(1M) becomes an even more powerful tool. Now you can specify which parts of the process's image go into the core file. Program text is there by default, and you can also choose to omit or include the stack, heap, anonymous data, mapped files, System V shared memory segments, ISM, DISM, etc. Let's say you've got some multi-processed database that contains a big DISM segment; rather than having each process include the shared segment in its core file, you can set up just one of the processes (or none of them) to include the segment in the core file.
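To make the pattern expansion concrete, here is a rough sketch of how a name like "core.%f.%n" could be expanded. It is purely illustrative and is not how coreadm(1M) or the kernel implements it: the function name expand_core_pattern() is made up, and it handles only %f (executable name), %n (node name) and %% (a literal percent sign).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative only: expand a core file name pattern into out[outlen].
 * Handles just %f, %n and %%; any other %-sequence is copied through
 * unchanged. Returns 0 on success or -1 if the result does not fit.
 */
int
expand_core_pattern(const char *pattern, const char *fname,
    const char *nodename, char *out, size_t outlen)
{
	size_t used = 0;
	const char *p;

	assert(pattern != NULL && fname != NULL && nodename != NULL);
	assert(out != NULL);

	if (outlen == 0)
		return (-1);

	for (p = pattern; *p != '\0'; p++) {
		char lit[2] = { *p, '\0' };	/* default: copy one char */
		const char *piece = lit;
		size_t len;

		if (*p == '%') {
			switch (p[1]) {
			case 'f':
				piece = fname;
				p++;
				break;
			case 'n':
				piece = nodename;
				p++;
				break;
			case '%':
				p++;	/* lit already holds a single '%' */
				break;
			default:
				break;	/* unknown token: keep the '%' */
			}
		}

		/* Always leave room for the terminating NUL. */
		len = strlen(piece);
		if (len >= outlen - used)
			return (-1);
		(void) memcpy(out + used, piece, len);
		used += len;
	}

	out[used] = '\0';
	return (0);
}
```

Called with the pattern "core.%f.%n", the executable name "staroffice" and the node name "dels", this fills the buffer with "core.staroffice.dels", matching the example above.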
debugging core files from the field
Now that program text is included by default, core files from failures in the field can be useful without the incredibly arduous task of exactly replicating the original environment. The program text also includes a partial symbol table -- the dynsym -- so you can get accurate stack back traces, and correctly disassemble functions in your favorite post-mortem debugger. If the dynsym doesn't cut it, you can use coreadm(1M) to configure your process to include the full symbol table in its core dumps as well -- so don't strip those binaries!
Also new to Solaris 10, we've started building many libraries with embedded type information in a compressed format. This is more of a teaser, since we're not quite ready to ship the tools to generate that type information, but that type information is included in core files by default. So now not only can we in Solaris actually make headway on core files we get from customers, but we can make progress much more quickly.
If you've installed Solaris Express, go check out the man page for coreadm(1M) and figure out how to get the right content in your core files. Once you get your first core file from a Solaris 10 machine in the field I hope you'll appreciate how much easier it was to debug.
Solaris Express 7/04 is out and includes the I/O provider in DTrace. The I/O provider has just a few probes, but with them you can determine the source of I/O on your system (which processes, or users, or disks, etc.) as well as which files are being accessed, and many other facts that were previously difficult or impossible to find out. As always, you can find the documentation in the Solaris Dynamic Tracing Guide available on the DTrace home page. Here's some more good stuff about the I/O provider.
Eric Schrock has tagged in to talk about file names in pfiles(1). This is something we've wanted for forever; here's a teaser:
bash-2.05# pfiles 100354
100354: /usr/lib/nfs/mountd
Current rlimit: 256 file descriptors
0: S_IFCHR mode:0666 dev:267,0 ino:6815752 uid:0 gid:3 rdev:13,2
O_RDONLY
/devices/pseudo/mm@0:null
1: S_IFCHR mode:0666 dev:267,0 ino:6815752 uid:0 gid:3 rdev:13,2
O_WRONLY
/devices/pseudo/mm@0:null
...
11: S_IFCHR mode:0000 dev:267,0 ino:33950 uid:0 gid:0 rdev:105,45
O_RDWR
/devices/pseudo/tl@0:ticots
12: S_IFREG mode:0644 dev:32,0 ino:583850 uid:0 gid:1 size:364
O_RDWR|O_CREAT|O_TRUNC
/etc/rmtab
libumem
In Solaris 2.4 we replaced the old buddy allocator [1] with the slab allocator [2] invented by Jeff Bonwick. The slab allocator is covered in pretty much every operating systems text book -- and that's because most operating systems are now using it. In Solaris 10 [3], Jonathan Adams brought the slab allocator to user-land in the form of libumem [4].
Getting started with libumem is easy; just do the linker trick of setting LD_PRELOAD to "libumem.so" and any program you execute will use libumem's malloc(3C) and free(3C) (or new and delete if you're into that sort of thing). Alternatively, if you like what you see, you can start linking your programs against libumem by passing -lumem to your compiler or linker. But I'm getting ahead of myself; why is libumem so great?
Scalability
The slab allocator is designed for systems with many threads and many CPUs. Memory allocation with naive allocators can be a serious bottleneck (in fact we recently used DTrace to find such a bottleneck; using libumem got us a 50% improvement). There are other highly scalable allocators out there, but libumem is about the same or better in terms of performance, has compelling debugging features, and it's free and fully supported by Sun.
Debugging
The scalability and performance are impressive, but not unique to libumem; where libumem really sets itself apart is in debugging. If you've ever spent more than 20 seconds debugging heap corruption or chasing down a memory leak, you need libumem. Once you've used libumem it's hard to imagine debugging this sort of problem without it.
You can use libumem to find double-frees, use-after-free, and many other problems, but my favorite is memory leaks. Memory leaks can really be a pain especially in large systems; libumem makes leaks easy to detect, and easy to diagnose. Here's a simple example:
$ LD_PRELOAD=libumem.so
$ export LD_PRELOAD
$ UMEM_DEBUG=default
$ export UMEM_DEBUG
$ /usr/bin/mdb ./my_leaky_program
> ::sysbp _exit
> ::run
mdb: stop on entry to _exit
mdb: target stopped at:
libc.so.1`exit+0x14: ta 8
mdb: You've got symbols!
mdb: You've got symbols!
Loading modules: [ ld.so.1 libumem.so.1 libc.so.1 ]
> ::findleaks
CACHE LEAKED BUFCTL CALLER
0002c508 1 00040000 main+4
----------------------------------------------------------------------
Total 1 buffer, 24 bytes
> 00040000::bufctl_audit
ADDR BUFADDR TIMESTAMP THR LASTLOG CONTENTS CACHE SLAB NEXT
DEPTH
00040000 00039fc0 3e34b337e08ef 1 00000000 00000000 0002c508 0003bfb0 00000000
5
libumem.so.1`umem_cache_alloc+0x13c
libumem.so.1`umem_alloc+0x60
libumem.so.1`malloc+0x28
main+4
_start+0x108
Obviously, this is a toy leak, but you get the idea, and it's really that simple to find memory leaks. Other utilities exist for debugging memory leaks, but they dramatically impact performance (to the point where it's difficult to actually run the thing you're trying to debug), and can omit or incorrectly identify leaks. Do you have a memory leak today? Go download Solaris Express, slap your app on it and run it under libumem. I'm sure it will be well worth the time spent.
You can use other mdb dcmds like ::umem_verify to look for corruption. The kernel versions of these dcmds are described in the Solaris Modular Debugger Guide today; we'll be updating the documentation for Solaris 10 to describe all the libumem debugging commands.
Programmatic Interface
In addition to offering the well-known malloc() and free(), libumem also has a programmatic interface for creating your own object caches backed by the heap or memory mapped files or whatever. This offers additional flexibility and precision and allows you to further optimize your application around libumem. Check out the man pages for umem_alloc() and umem_cache_alloc() for all the details.
Summary
Libumem is a hugely important feature in Solaris 10 that just slipped off the top 10 list, but I doubt there's a Solaris user (or soon-to-be Solaris user) that won't fall in love with it. I've only just touched on what you can do with libumem, but Jonathan Adams (libumem's author) will soon be joining the ranks of blogs.sun.com to tell you more. Libumem is fast, it makes debugging a snap, it's easy to use, and you can get down and dirty with its expanded API -- what else could anyone ask for in an allocator?
1. Jeff's USENIX paper is definitely worth a read
2. For more about Solaris history, and the internals of the slab allocator check out Solaris Internals
3. Actually, Jonathan slipped libumem into Solaris 9 Update 3 so you might have had libumem all this time and not known...
4. Jeff and Jonathan wrote a USENIX paper about some additions to the allocator and its extension to user-land in the form of libumem