Jonathan Adams's Weblog

Thursday Jun 30, 2005

The FBT provider, opensolaris source, and fun with the environment

The FBT provider, opensolaris source, and fun with the environment

Now that opensolaris is out there, it's quite a bit easier for folks to use DTrace's FBT provider. FBT provides "function boundary tracing", i.e. it has probes for the entry and exit for almost every function in the kernel. This is amazingly powerful and flexible, but it leads to it being hard to use: with over 20,000 functions on a typical Solaris box, it's very hard to know where to start, especially without access to the source code.

With OpenSolaris, the source code is available. So to illustrate how you can use this newly available information to get something done, I thought I'd use a classic question: How can I examine and muck with the environment?

To start off, we need a plan of attack; since dtrace doesn't have any way of looping over datastructures, typically if you want to walk some datastructure, you just find a place where the kernel is already doing so, and use probes there to sneak a peak at the data as it goes by. The initial environment for a process is set up in the exec(2) system call, so lets see if we can find where we read in the data from the user process.

Looking at the source, exece() calls exec_common(), which is the main workhorse. There doesn't seem to be any direct mucking of the environment, but there is:

    222 	ua.fname = fname;
    223 	ua.argp = argp;
    224 	ua.envp = envp;
    225 
    226 	if ((error = gexec(&vp, &ua, &args, NULL, 0, &execsz,
    227 	    exec_file, p->p_cred)) != 0) {
Now we could continue to track all of this down, but it's much easier to just search usr/src/uts for references to the envp symbol. That leads us to stk_copyin(), the routine responsible for copying in everything which is needed on the stack. Here's the environment processing loop:
   1303 	if (envp != NULL) {
   1304 		for (;;) {
   1305 			if (stk_getptr(args, envp, &sp))
   1306 				return (EFAULT);
   1307 			if (sp == NULL)
   1308 				break;
   1309 			if ((error = stk_add(args, sp, UIO_USERSPACE)) != 0)
   1310 				return (error);
   1311 			envp += ptrsize;
   1312 		}
   1313 	}
Without even knowing what stk_getptr() and stk_add() do, we now have enough information to write a D script to get the environment of a process as it exec()s. Here's the basic outline:
  1. First, use fbt::stk_copyin:entry to stash a copy of the envp pointer.
  2. Second, use fbt::stk_getptr:entry to watch for reads from the stored envp address.
  3. Third, use fbt::stk_add:entry and fbt::stk_add:return to print out the environment.
  4. Lastly, use fbt::stk_copyin:return to clean up.
And here's our first script:
#!/usr/sbin/dtrace -s

#pragma D option quiet

fbt::stk_copyin:entry
{
	self->envp = (uintptr_t)args[0]->envp;
}

fbt::stk_getptr:entry
/ self->envp != 0 /
{
	/* check if we're looking at envp or envp+1 */
	self->on = ((arg1 - self->envp) <= sizeof (uint64_t));

	/* update envp if we're on */
	self->envp = self->on ? arg1 : self->envp;
}

fbt::stk_add:entry
/ self->on && args[2] == UIO_USERSPACE /
{
	self->ptr = arg1;
}

fbt::stk_add:return
/ self->ptr != 0 /
{
	printf("%.79s\n", copyinstr(self->ptr));
	self->ptr = 0;
	self->on = 0;
}

fbt::stk_copyin:return
{
	self->envp = 0;
	self->on = 0;
}
Note that we delay the copyinstr of stk_add()'s second argument until fbt::stk_add:return. This is due to the fact that dtrace(1M) cannot fault in pages; so if a probe tries to copyinstr an address which has not yet been touched, you'll get a runtime error like:
dtrace: error on enabled probe ID 4 (ID 12535: fbt:genunix:stk_add:entry):
invalid address (0x1818000) in action #1 at DIF offset 28
By waiting until the return probe, we avoid this problem; we know that the kernel just touched the page to read in its copy.

Now, looking at the environment is fun, but it would be even more interesting to change the environment of a process while it is being execed. This requires a bit more work, and access to the destructive action copyout(). I'm going to start with a script which requires a recent version of Solaris (snv_15+, or the OpenSolaris release), because Bryan introduced some nice string handling stuff recently. We'll adapt the script to S10 afterwards.

Lets start by saying we want to change the name of the environment variable "FOO" to "BAR", but leave the value the same. The basic idea is simple; copyin() the string in the fbt::stk_add:entry probe, and if it's the one we want to change, copyout() the changes. The kernel will then proceed to copyin() the changed string, and use it for the environment of the process. The complication is the same as before; what if the page hasn't yet been touched, or the copyout() operation fails (for example, if the string isn't writable)?

There's no simple solution, so I'm just going to check *afterwards* that we didn't miss changing it, and kill -9 the process if we did. It's vile, but effective. Here's the script:

#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option destructive

self uintptr_t ptr;

inline int interesting = (strtok(copyinstr(self->ptr), "=") == "FOO");

fbt::stk_copyin:entry
{
	self->envp = (uintptr_t)args[0]->envp;
}

fbt::stk_getptr:entry
/ self->envp != 0 /
{
	/* check if we're looking at envp or envp+1 */
	self->on = ((arg1 - self->envp) <= sizeof (uint64_t));

	/* update envp if we're on */
	self->envp = self->on ? arg1 : self->envp;
	self->ptr = 0;
}

fbt::stk_add:entry
/ self->on && args[2] == UIO_USERSPACE /
{
	self->ptr = arg1;
	self->didit = 0;
}

fbt::stk_add:entry
/ self->ptr != 0 && interesting /
{
	printf("%d: %s: changed env \"%s\"\n",
	    pid, execname, copyinstr(self->ptr));
	copyout("BAR", self->ptr, 3);		/* 3 == strlen("BAR") */
	self->didit = 1;
}

fbt::stk_add:return
/ self->ptr != 0 && interesting && !self->didit /
{
	printf("%d: %s: killed, env \"%s\" couldn't be changed\n",
	    pid, execname, copyinstr(self->ptr));
	raise(9);
}

fbt::stk_copyin:return
{
	self->envp = 0;
	self->on = 0;
	self->ptr = 0;
}
The above works great on Solaris Nevada and OpenSolaris, but doesn't work on Solaris 10, because it uses "strtok". So to use it on Solaris 10, we'll have to do things slightly more manually. The only thing that needs to change is the definition of the "interesting" inline, and some more cleanup in fbt::stk_copyin:return:
inline int interesting =
    ((self->str = copyinstr(self->ptr), "=")[0] == 'F' &&
    self->str[1] == 'O' &&
    self->str[2] == 'O' &&
    self->str[3] == '=');
...
fbt::stk_copyin:return
{
	self->envp = 0;
	self->on = 0;
	self->ptr = 0;
	self->str = 0;
}
A final note is on stability; we're using private implementation details of Solaris to make this all work, and they are subject to change without notice at any time. This particular part of solaris isn't likely to change much, but you never know. A reasonable RFE would be for more Stable probe-points in the exec(3C) family, so that people can write things like this more stably.

Technorati Tag:
Technorati Tag:
Technorati Tag:

Wednesday Jun 15, 2005

macros and powers of two

Unlike many userland applications, there are large portions of the kernel that need to do bit-level manipulations. These generally fall into a few distinct areas:

  • Drivers communicating with hardware
  • Hash functions, bitmasks, and other high-performance algorithms
  • Memory allocators, page manipulators, etc.
In this note, I'm going to peek into the last group, which typically needs to do a lot of power-of-2 based alignment work, and look at a set of macros used to abstract away some of the bit-fiddling.

The main low-level allocators in Solaris are kmem and vmem; kmem handles your standard page-backed memory allocation, while vmem handles the more abstract concept of allocating from spaces of integers. kmem actually layers on top of vmem, but the details are beyond the scope of this blog entry.

While implementing vmem, Jeff Bonwick introduced a set of handy macros for dealing with things relative to powers of two. The macros live in uts/common/sys/sysmacros.h, and are defined thusly:

/*
 * Macros for various sorts of alignment and rounding when the alignment
 * is known to be a power of 2.
 */
#define	P2ALIGN(x, align)		((x) & -(align))
#define	P2PHASE(x, align)		((x) & ((align) - 1))
#define	P2NPHASE(x, align)		(-(x) & ((align) - 1))
#define	P2ROUNDUP(x, align)		(-(-(x) & -(align)))
#define	P2END(x, align)			(-(~(x) & -(align)))
#define	P2PHASEUP(x, align, phase)	((phase) - (((phase) - (x)) & -(align)))
#define	P2CROSS(x, y, align)		(((x) ^ (y)) > (align) - 1)
/*
 * Determine whether two numbers have the same high-order bit.
 */
#define	P2SAMEHIGHBIT(x, y)		(((x) ^ (y)) < ((x) & (y)))
The purpose of this blog entry is to talk about what they do and how they work; if you don't want the challenge spoiled, you should stop reading now.








Alright, now that they've all left, we can dive in. First, remember the basic equation of two's complement arithmetic:
-x = ~x + 1
That is, to negate a number, complement all of its bits and add one. If I assume that the number has a lowest set bit,[1] and capital Xs represent complements of the corresponding xs, here are some relations for any non-zero binary number:
        x = ...xxxxx10000...
      x-1 = ...xxxxx01111...
       ~x = ...XXXXX01111...
-x = ~x+1 = ...XXXXX10000...
Now, for powers of two, there is only one set bit, so all of the xs are 0. This means that (x-1) is a mask for the offset modulo x, and -x is a mask which clears the offset. In addition, all numbers are unsigned, align is a power-of-2, and offset is less than align. These relations in mind, we can look at what each of the above macros do:
macroformulaaction
P2ALIGN x & -align clears all bits below align, rounding x down to the next lower multiple of align
P2PHASE x & (align - 1) clears all bits above or equal to align, getting (x % align), the phase of x with regards to align
P2NPHASE (-x) & (align - 1) This is the first tricky formula. We negate x, and get its phase. The result is the same as (align - P2PHASE(x, align)) % align, but is much faster to compute.
P2ROUNDUP -((-x) & -align) This can be re-written as -P2ALIGN(-x, align), which shows how the operation works; "negate, round down, negate" is the same as "round up". So we round x up to the next align boundary.
P2END -((~x) & -align) This can be re-written as -((-x-1) & -align) = P2ROUNDUP(x+1, align); round up (x+1) to the next align boundary. This gets the end address of the current align block.
P2PHASEUP (phase -
 ((phase - x) &
  -align))
This can be re-written as (phase + P2ROUNDUP(x - phase, align)). So what we've done is rounded up x to have a specific phase with regards to align. (If you've got some linear algebra under your belt, this is similar to transforming f(y) = P2ROUNDUP(y, align) into a coordinate system with x = y + phase)
P2CROSS (x ^ y) > (align-1) (x ^ y) has bits set wherever x and y differ. This just tests to see if they differ in any position above the alignment, which will only be the case if [x, y] crosses an align boundary.
P2SAMEHIGHBIT (x ^ y) < (x & y) Similar to P2CROSS; here, we're checking that the first bit that differs is less than the highest set bit the two have in common.

From the above, here are some trivial relations between the various functions:

P2ALIGN(x, align)   = x - P2PHASE(x, align)
P2ROUNDUP(x, align) = x + P2NPHASE(x, align)
P2END(x, align)     = P2ALIGN(x, align) + align

Finally, the one drawback of these macros is the fact that they are macros. This means that (as usual) the argument parsing can leave much to be desired. The biggest problem is that there is no type enforcement; if x, y, or align are not unsigned integers of the same width, bad results can occur. Hence the (recently added) macros, P2*_TYPED, defined underneath the originals, which allow an explicit type to be specified.

Despite this drawback, these macros tend to be much clearer than the same things done by hand in the code. They probably aren't used enough, in fact.

Footnotes

[1] Of course, zero doesn't have a lowest set bit, but you can treat it as being 2bits, and get the same answer.

Technorati Tag:
Technorati Tag:

Tuesday Jun 14, 2005

The implementation of ::findleaks

Now that OpenSolaris is available, it's time to explain how it all works. I thought I'd start with some of the mdb(1) dcmds I've worked on, since they are relatively self-contained, and have very strong interface boundaries.

I'll start with my favorite dcmd, ::findleaks, a memory leak detector originally written by Bryan Cantrill, which I re-factored substantially late in Solaris 10. There were a few reasons the refactoring was necessary:

  • When I'd done the original ::findleaks for libumem implementation, I simply copied the original leaky.c and re-wrote it into submission. This worked fine, but was an additional maintenance burden; two dissimilar copies of the same code is to be avoided if possible.
  • The original code reported oversize leaks in an obscure way:
    findleaks: Oversize leak at seg 0x12345678!
    which, unless you are very familiar with the way the allocator works internally, was unlikely to lead you to the stack trace or size of the leaked buffer. (for the curious, 12345678$<vmem_seg was the necessary incantation in the old days)
  • The original ::findleaks algorithm was designed with running in a user process in mind. In particular, it assumed it had a lot of memory to play with. Unfortunately, with the coming of kmdb(1), memory space for dcmds was getting very tight. For ::findleaks to work under kmdb(1), it needed substantial belt-tightening.
  • There were some enhancements I wanted to make; in particular, there was no simple way to dump all of the leaked stack traces. Internally, the "leakbot" tool (also written by Bryan) automatically extracted the set of stack traces, but this didn't help people without access to that tool.
So I started a side project to re-work ::findleaks. The following bugs were part of it:
  4840780 ::findleaks oversized leak reporting needs overhaul
  4849669 ::findleaks' cleanup code needs work
  4931271 kmem walkers and ::findleaks use large amounts of memory
  5030758 ::findleaks should have an option to display the stack traces
but this note only covers the generic ::findleaks implementation.

The files of ::findleaks

Let's start with a lay of the land. The following files encompass the implementation of ::findleaks (to shorten things, I'm assuming a prefix of usr/src/cmd/mdb/common for all file references):

.../modules/genunix/leaky.h The dcmd, dcmd help, and walker function declarations, used by .../modules/genunix/genunix.c and .../modules/libumem/libumem.c to create the dcmd linkage for mdb(1).
.../modules/genunix/leaky_impl.h An implementation header, which defines (and documents, via a large block comment) the interface between the main engine, .../modules/genunix/leaky.c and the two target implementations.
.../modules/genunix/leaky.c The common ::findleaks engine.
.../modules/genunix/leaky_subr.c
.../modules/libumem/leaky_subr.c
The two target implementations, one for the kernel, one for libumem(3lib).

I'll use the term "target" when talking about the other side of ::findleaks's implementation. The non-static target interfaces all start with leaky_subr_.

The target interface is a link-time, functional binding; during the compilation process, alternate versions of the target routines are linked against essentially the same "generic" framework. This is less flexible than doing some kind of dynamic binding (i.e. using function pointers), but is sufficiently general as long as only one target interfaces is required for a given dmod.

The public interface

The top layer of interface consists of the dcmds (D commands) and walkers exported by the genunix and libumem dmods. In the genunix dmod, these are defined in .../modules/genunix/genunix.c:

static const mdb_dcmd_t dcmds[] = {
...
	/* from leaky.c + leaky_subr.c */
	{ "findleaks", FINDLEAKS_USAGE,
	    "search for potential kernel memory leaks", findleaks,
	    findleaks_help },
...
static const mdb_walker_t walkers[] = {
...
	/* from leaky.c + leaky_subr.c */
	{ "leak", "given a leaked bufctl or vmem_seg, find leaks w/ same "
	    "stack trace",
		leaky_walk_init, leaky_walk_step, leaky_walk_fini },
	{ "leakbuf", "given a leaked bufctl or vmem_seg, walk buffers for "
	    "leaks w/ same stack trace",
		leaky_walk_init, leaky_buf_walk_step, leaky_walk_fini },
(the structures used are part of the Debugger Module Linkage, defined in the Solaris Modular Debugger Guide)
The walkers are straightforward; they just walk the cached leak table. ::findleaks is the main event.

First up: initialize, estimate, slurp in the buffers

When ::findleaks is invoked, mdb(1) calls findleaks(), which validates its arguments, calls leaky_cleanup() to clean up any interrupted state, then checks if there is cached data. If not, we start the run in earnest.

The first call into the target interface is here:

	if ((ret = leaky_subr_estimate(&est)) != DCMD_OK)
		return (ret);
This calls over to here or here, which first check that finding memory leaks is possible (i.e. savecore -L dumps are not a consistent snapshot, so we refuse to run on them), then calculate an upper bound on the number of allocated segments in the system.

Shiny new estimate in hand, ::findleaks allocates a (possibly huge) array of leak_mtab_ts, calls into the target to fill it, and sorts the result:

	lk_mtab = mdb_zalloc(est * sizeof (leak_mtab_t), UM_SLEEP | UM_GC);
	lmp = lk_mtab;

	if ((ret = leaky_subr_fill(&lmp)) != DCMD_OK)
		return (ret);

	lk_nbuffers = lmp - lk_mtab;

	qsort(lk_mtab, lk_nbuffers, sizeof (leak_mtab_t), leaky_mtabcmp);
leaky_subr_fill(), defined here and here, fills in the array with the details of every allocated buffer in the system. Each entry has a base, limit, and control pointer, the last being completely target-defined. The qsort(3C) call sorts the array by base address.

Stage two: find everything reachable

::findleaks is, at its heart, a simulation of a conservative garbage collection run. All of the lk_mtab entries start in the "unmarked" state; that is, they are not known to be reachable. We call into the target:

	if ((ret = leaky_subr_run()) != DCMD_OK)
		return (ret);
( here and here) to scan the "root set"; that is, all statically defined objects. For each such object, the target implementation calls leaky_grep() or one of its variants. These look for pointers, and mark each lk_mtab entry they find, along with any (recursively) reachable entries. When the run is finished, any unmarked buffers must be unreachable leaks.

While the interface for leaky_grep() has a simple description, it invites a spectrum of implementation possibilities, each with different trade-offs in speed, memory usage, and maintainability.[1] Any findleaks run spends virtually all of its time and energy in leaky_grep(), so it is worth the effort of optimizing. The most straightforward implementation is something like:

leaky_grep(addr, size)
	allocate a buffer of size bytes
	read [addr, size) into the buffer
	for each pointer in buffer {
		look up mtab entry
		if it is marked, continue
		mark it
		leaky_grep(mtab's base, mtab's size);
	}
	free the buffer
The immediate problem with this is the recursive depth (hundreds of thousands of call frames are typical in a normal dump.)[2] The next is memory usage; each recursive step requires additional storage for the entire contents of a memory buffer; with hundreds of thousands of recursive invocations occurring at once, that quickly adds up to tens of megabytes of peak storage.

Instead of storing a buffer for each recursive step, the current algorithm uses a single small (16 kbyte) buffer. The following comment describes how we use the buffer:

		/*
		 * The buffer looks like:  ('+'s are unscanned data)
		 *
		 * -----------------------------++++++++++++++++
		 * |				|		|
		 * buf				cur		end
		 *
		 * cur scans forward.  When we encounter a new buffer, and
		 * it will fit behind "cur", we read it in and back up cur,
		 * processing it immediately.
		 */
The idea is that we read in a chunk of data at the end of the buffer, then start scanning for pointers. When we find a pointer to an unmarked mtab entry, we mark it, then either read it in to the buffer immediately, or put it on a stack of unfinished work. Since each stack entry only has to name an entry in the lk_mtab array, the stack is quite compact. Small buffers will almost always be processed in line, and the overall memory usage is quite small; the only large allocation is the lk_mtab array itself.

Stage three: collect the leaks

Once leaky_subr_run() returns successfully, we're on the home stretch. First, we scan the lk_mtab array for unmarked buffers, and report them to the target implementation:

	for (i = 0; i < lk_nbuffers; i++) {
		if (LK_MARKED(lk_mtab[i].lkm_base))
			continue;
		leaky_subr_add_leak(&lk_mtab[i]);
	}
leaky_subr_add_leak(), defined here and here, reads in the available debugging information about the leak, and invokes leaky_add_leak() with the details. leaky_add_leak() adds each leak to a global hash table, coalescing any duplicates. Each entry in the table is a leak_bufctl_t, which is used later for leak reporting. Once all of the leaks have been processed, the final processing is done:
	leaky_sort();
which groups all of the representative leaks by type, and sorts each type's list of buffers using another target interface, leaky_subr_bufctl_cmp(), which defines a "display" order for the leaks.

Final stage: report the leaks

Once leaky_sort has completed, the results have been stored and cached in static data. leaky_dump() dumps the data, using the algorithm explained in .../modules/genunix/leaky_impl.h:

 * void leaky_subr_dump_start(type)
 * void leaky_subr_dump(lkb, verbose)
 * void leaky_subr_dump_end(type)
 *	Used to dump the table of discovered leaks.  invoked as:
 *
 *		for i in 0 .. LK_NUM_TYPES
 *			leaky_subr_dump_start(i)
 *			for lkb in (possibly a subset of) the type i leaks
 *				leaky_subr_dump(lkb, 0)
 *			leaky_subr_dump_end(i)
 *
 *	leaky_subr_dump_start()/end() are always invoked for each type, even
 *	if there are no leaks of that type.  If '-d' was passed to
 *	::findleaks, then we loop through the sequence of lkbs:
 *
 *		for i in 0 .. LK_NUM_TYPES
 *			for lkb of type i, same subset and order as above
 *				leaky_subr_dump(lkb, 1)
 *
 *	leaky_subr_dump() can use the leaks chained off of lkb_next to
 *	access coalesced leaks.  lkb_dups is the length of the dup list.
The targets use these invocations to produce tabular output.

Conclusion and Final Thoughts

The generic findleaks code is compact (913 lines), and for the most part without external dependency.[3] It is a useful shell for running the leak processing, but knows nothing of how to discover allocated buffers, root sets, or any other implementation details. All such external dependencies are covered by the target implementations. (It should be possible to do a target implementation for any allocator which has enough state to track all allocated buffers.)

One interesting aspect of this re-working is that the "target" interface places a huge set of requirements on the generic framework; given just the contents of .../modules/genunix/leaky_impl.h, someone writing a clean-room implementation of leaky.c would be constrained to writing something extremely similar to what is there today. On the other hand, the target implementation is less constrained by the target interface. Instead, it is mostly determined by how it must retrieve the needed data from the process, memory image, or core file mdb(1) is running against. This trade off of flexibility on two sided of a software interface is a common pattern, and interesting to see on a smaller scale.

Footnotes:

[1] Interestingly, speed is pretty low on the priority list, especially in the post-mortem case: ::findleaks, when run against a dump, is almost entirely I/O bound.
[2] Which will blow out the stack, causing a core dump. Simple recursion removal techniques lead you to the original leaky_grep() implementation: an iterative form of the same algorithm, with the stack of pending work stored on the heap. This implementation served ::findleaks well, until kmdb's memory constraints came along.
[3] Ignoring the "verbose" output, the only external calls not part of the target interface is the call to mdb_vread() in leaky_grep(), which reads from the program's address space, and mdb_getopts(), used to parse the arguments to ::findleaks.

Technorati Tag:
Technorati Tag:
Technorati Tag:

Calendar

Feeds

Search

Navigation

Referrers