|
Using libdtrace
Using libdtrace
On this occasion of the full release of OpenSolaris I thought I would
share some of my experiences using libdtrace. libdtrace is the library to
programmatically control DTrace,
and I have gained some familiarity with it in the course of developing
a Java DTrace API (available soon!). While libdtrace is still private (and
therefore still subject to change), I thought it would be worthwhile to
at least describe what's there and some of the issues I encountered.
I first approached libdtrace
with the obvious questions: How do I use it? What functionality is
available? How do I get data out of it? I started with
existing programs that use libdtrace.
From the source for the dtrace
command (dtrace.c)
I learned the natural life cycle of a DTrace consumer:
open() returns a handle
needed to access libdtrace
functionality (a dtrace_hdl_t pointer, the
first argument to all consumer functions). Two compile() functions both take a
D program, one as a string and the other as a file. A single
consumer may compile multiple programs, as happens when you specify
multiple -n or -s options to the dtrace command. I was at
first confused by exec() and go(), which sound like the same
thing, until I learned that exec()
is what enables probes, instrumenting code at matching probe
points. It is called once for each compiled D program and as an
extra favor gives back information about the resulting program enablings
such as the number of matching probes. go() is a non-blocking call
issued once to begin tracing data from all enabled probes, writing the
output of probe actions into per-consumer DTrace buffers whenever a
probe fires.
stop() on second thought
isn't exactly a natural stage in the life of a DTrace consumer. A
consumer dies of natural causes when it encounters the exit() action in a D program,
and there is no need to call stop(). stop() is a programmatic way of
interrupting a consumer outside of any D program action. Whether
or not the consumer dies of natural causes, it leaves any unconsumed
data available to functions like dtrace_aggregate_snap() and dtrace_aggregate_walk().
Finally, close() releases
all system resources held by the consumer.
After dtrace.c, I looked
at the source for the intrstat
command (intrstat.c),
another example of a DTrace consumer. The biggest difference between it
and the dtrace command is
the absence of calls to dtrace_sleep() and dtrace_work().
Modeling the Java DTrace API after intrstat
was a bit of a false start, based partly on the assumption that API
users were in most cases interested only in aggregations. It
took a while before I understood the difference between these two
approaches to consuming DTrace data. In particular, I remember
going back to these examples one day and wondering why dtrace.c did
not call dtrace_aggregate_print()
until after exiting its
consumer loop:
do {
if (!g_intr && !done)
dtrace_sleep(g_dtp);
if (g_newline) {
/*
* Output a newline just to make the output look
* slightly cleaner. Note that we do this even in
* "quiet" mode...
*/
oprintf("\n");
g_newline = 0;
}
if (done || g_intr || (g_psc != 0 && g_pslive == 0)) {
done = 1;
if (dtrace_stop(g_dtp) == -1)
dfatal("couldn't stop tracing");
}
switch (dtrace_work(g_dtp,
g_ofp, chew, chewrec, NULL)) {
case DTRACE_WORKSTATUS_DONE:
done = 1;
break;
case DTRACE_WORKSTATUS_OKAY:
break;
default:
if (!g_impatient && dtrace_errno(g_dtp) != EINTR)
dfatal("processing aborted");
}
if (fflush(g_ofp) == EOF)
clearerr(g_ofp);
} while (!done);
oprintf("\n");
if (!g_impatient) {
if (dtrace_aggregate_print(g_dtp,
g_ofp, NULL) == -1 &&
dtrace_errno(g_dtp) != EINTR)
dfatal("failed to print aggregations");
}
dtrace_close(g_dtp);
whereas intrstat.c was calling dtrace_aggregate_snap(), dtrace_aggregate_walk(),and dtrace_aggregate_clear() repeatedly
inside of its consumer loop:
while
(indefinite || iter) {
(void) sigsuspend(&set);
(void) dtrace_status(g_dtp);
if (g_intr == 0)
continue;
iter--;
g_intr--;
check_pset();
now = gethrtime();
g_interval = now - last;
last = now;
if (dtrace_aggregate_snap(g_dtp)
!= 0)
fatal("failed to add to aggregate");
g_start = g_end = 0;
do {
g_header = 1;
if (dtrace_aggregate_walk_keyvarsorted(g_dtp,
walk, NULL) != 0)
fatal("failed to sort aggregate");
if (g_start == g_end)
break;
} while ((g_start = g_end) < g_max_cpus);
dtrace_aggregate_clear(g_dtp);
}
The answer is that the various aggregate
functions act on all aggregations at once - all aggregations from all
enabled D programs - regardless of the actions in the probe clauses of
those D programs (a programmatic alternative to relying on D
actions). That approach is not appropriate for the dtrace command, which generates
output only from D program actions until interrupted with Ctrl-C, when it prints all
unconsumed aggregation data before exiting. Consider the following
D program:
syscall:::entry
{
@calls[execname] = count();
@fcalls[probefunc] = count();
}
profile:::tick-1sec
{
printa(@calls);
clear(@calls);
}
Without the tick probe,
this program produces no output when run with dtrace -s (except the number of
matching probes) until Ctrl-C,
when dtrace_aggregate_print()
prints both the @calls and
the @fcalls aggregations
together. With the tick
probe, however, dtrace prints the @calls
aggregation once every second, but waits to print the @fcalls aggregation until Ctrl-C. In fact, it
skips all but unprinted
aggregations (i.e. aggregations that have not already been printed at
least once as a result of the printa()
action), in this case printing only the @fcalls aggregation after Ctrl-C.
The D program in intrstat.c, on the other
hand, has no need of a tick
probe, because intrstat
relies completely on aggregate
functions to process all aggregations at once:
static
const char *g_prog =
"interrupt-start"
"{"
"
self->ts = vtimestamp;"
"}"
""
"interrupt-complete"
"/self->ts/"
"{"
"
this->devi = (struct dev_info *)arg0;"
"
@counts[stringof(`devnamesp[this->devi->devi_major].dn_name),"
"
this->devi->devi_instance] = count();"
"
@times[stringof(`devnamesp[this->devi->devi_major].dn_name),"
"
this->devi->devi_instance] = sum(vtimestamp - self->ts);"
"
self->ts = 0;"
"}"
This is convenient for a command that collects aggregated
statistics on its own programmatic interval. It works because intrstat does not need to
support DTrace options like aggrate
and switchrate (so no need
for dtrace_sleep() ), and
because it handles all output itself and does not need any
output-producing actions in its D program.
The Java DTrace API, however, needs to support all kinds of D programs,
those that use output-producing actions and those that do not.
Like dtrace (and unlike intrstat) it needs to call dtrace_work(), because that
is the only way to consume the output of D actions like trace(), printf(), and printa(). It also needs
to support options like aggrate
and switchrate, so it uses dtrace_sleep() to regulate
its consumer loop, which
wakes up as often as needed to honor all D options that control
consumption rates. However, this introduces another problem,
because dtrace_work()
sends its output to a file (stdout
by default), whereas the Java DTrace API must not print to a file but
rather generate data instances for the user of the API and leave that
user in control of how (or whether) to output that data. For
example, an application using the API might not print any text at all,
but instead display the information in a graph. Like intrstat, the Java API needs to
take control of what to do with the data it consumes.
libdtrace provides for this
need with dtrace_handle_buffered(),
which lets a user register a function to be called whenever there is
buffered output that would otherwise be written to a file. This is
similar to the callback mechanism libdtrace
employs to handle data drops, runtime errors, and target process
completion (of interest when running dtrace -c). For example,
in dtrace.c:
if
(g_mode != DMODE_LIST) {
if (dtrace_handle_err(g_dtp, &errhandler, NULL) == -1)
dfatal("failed to establish error handler");
if (dtrace_handle_drop(g_dtp, &drophandler, NULL) == -1)
dfatal("failed to establish drop handler");
if (dtrace_handle_proc(g_dtp, &prochandler, NULL) == -1)
dfatal("failed to establish proc handler");
}
These calls register three user-defined functions (implemented in dtrace.c) to call under the
appropriate circumstances: errhandler(), drophandler(), and prochandler(). When
the Java DTrace API registers a similar callback for buffered output
with dtrace_handle_buffered(),
it has the added effect of turning off output to the usual output
file. In all of these callback registration examples, the
resulting calls to the registered handler function occur in the same
thread that calls dtrace_work().
Here I found something that I consider especially considerate on the
part of the libdtrace
developers. The NULL
argument to each of the three handler registrations (above) is passed to
a parameter (unused in this case) that could instead take a pointer to a user argument that is forwarded to
every handler invocation. Anything special needed by your handler
function can be provided via the user arg set when the handler is
registered. In the case of the Java API, this gives the handler
access to the JNI environment, including cached classes and method IDs,
needed to pass data back to interested Java listeners.
Once the Java API registers a buffered output handler, it can use dtrace_work() to get probe
data printed to an in-memory buffer instead of to a file. Now that
the Java API can support output-producing D actions like trace(), printf(), and printa(), we still want it to
support the convenience enjoyed by intrstat,
that is, the ability to request all aggregations at once on a
programmatic interval outside of any D actions. Supporting both
approaches simultaneously requires a per-consumer lock to prevent the
needed dtrace_aggregate_print()
from interfering with dtrace_work().
(These two approaches to data consumption are inherently asynchronous.)
While on the subject of threads, it is also possible to run multiple
consumers simultaneously, as long as you protect calls to libdtrace with a single global
lock. Fortunately exempt from this requirement are the most
time-consuming calls, dtrace_work()
and its competing dtrace_aggregate_print(),
which only need to be kept from interfering with each other. Note
that the API uses dtrace_aggregate_print()
instead of dtrace_aggregate_walk()
in order to leverage the buffered output handler (which otherwise does
not get called during aggregate traversal).
Using the libdtrace Buffered Output Handler
The purpose of the buffered output handler (bufhandler) is to make the
output of the libdtrace
print routines available without writing it to a file (such as stdout). This is needed
for the stack(), ustack(), and jstack() actions to get
human-readable stack values, since there is no public function in libdtrace to convert the
internal representation of stack values. It is also used to get
the formatted string resulting from the D printf() action. The
mechanism anticipates other similar formatting requirements for which a
public libdtrace conversion
function is not suitable.
dtrace_handle_buffered()
takes a handler function of type dtrace_handle_buffered_f
and a user argument forwarded to each callback. In addition to the
user argument, the bufhandler takes a pointer to a dtrace_bufdata_t
structure:
typedef
struct dtrace_bufdata {
dtrace_hdl_t
*dtbda_handle;
/* handle to DTrace library */
const char
*dtbda_buffered;
/* buffered output */
dtrace_probedata_t
*dtbda_probe; /* probe data */
dtrace_recdesc_t
*dtbda_recdesc; /* record
description */
dtrace_aggdata_t
*dtbda_aggdata; /* aggregation
data, if agg. */
} dtrace_bufdata_t;
In addition to the formatted output (dtbda_buffered) of the libdtrace print routines
normally sent to a file, the bufdata
structure contains any probedata (including the current probedata
record) and any aggregation data processed while generating that output.
The bufhandler is called once for each output-producing,
non-aggregating D action, such as trace()
or printf(), and once for
each libdtrace aggregation
record (whether in response to the D printa() action or an
asynchronous aggregate request). There is one libdtrace aggregation record per
tuple element plus one for the corresponding value. The bufhandler
is called with a null
record at the end of each complete tuple/value pair.
Using the buffered output handler leads to a few interesting
observations:
- The buffered output handler (bufhandler) is never called with
action type DTRACEACT_PRINTA.
- The bufhandler never sees the D exit() action
- The base probedata pointer (bufdata->dtbda_probe->dtpda_data)
is overwritten during probedata consumption and needs to be cached
ahead of the bufhandler callback (in the probe-handling function, see
below)
The points listed above highlight the need to coordinate use of the
bufhandler with two other callback functions passed to dtrace_work():
Both of these functions also take a pointer to a user arg optionally
passed to dtrace_work() and
forwarded to each callback.
The probe-consuming function, dtrace_consume_probe_f,
is called once each time a probe fires and ahead of any calls to the
record consuming function or the bufhandler (for that probe's
data). It takes a probedata argument that includes an array of
records corresponding to actions and statements in that probe. In
the case of the printf()
action it includes a record for each unformatted printf element. A
statement that is not a D action, such as assignment to a variable, can
be distinguished from an action by the fact that it has no size (rec->dtrd_size == 0).
The record-consuming function, dtrace_consume_rec_f,
is called once for each unskipped probedata record and ahead of any
call to the bufhandler (resulting from output associated with that
record). It differs from the bufhandler as follows:
- It does not have access to libdtrace-formatted
output.
- It is called for each D program statement, including non-actions
(size in probedata buffer is zero); the bufhandler is not bothered with
non-actions, since they produce no output.
- It is called for the D exit() action; the bufhandler is not.
- In response to the printa()
action, it is called with a record having an action of type DTRACEACT_PRINTA. The
bufhandler never sees that action type, only the output-producing
aggregation records (with action types such as DTRACEACT_STACK and DTRACEAGG_COUNT).
- It is called with a NULL
record at the end of each probedata.
I said the record handler is called for each "unskipped" record because
it is called only once for the printf()
action, even though there may be multiple probedata records
corresponding to the unformatted elements of that printf(). The
knowledge of how many probedata records to consume to process a single printf() is private to libdtrace, so we cannot know
ahead of time how many probedata records belong to the current printf(). Instead, we
need to look back at how many records since the previous call were
skipped to reach the current probedata buffer offset, for example:
for
(r = prev_probedata_index;
((r < edesc->dtepd_nrecs) &&
(edesc->dtepd_rec[r].dtrd_offset < rec->dtrd_offset));
++r);
Here edesc refers
to the enabled probe description (metadata that includes an array of all
the probedata records). Records from prev_probedata_index up to and
not including the current record index, r, are associated with the
previous printf()
action. For proof of these observations, the following dtrace command gives a glimpse
into the internal workings just described:
$
dtrace -n 'BEGIN { n = 1; printf("%s %d\n", "cat", 9); trace(1/--n);
exit(0); }'
dtrace: description 'BEGIN '
matched 1 probe
dtrace: error on enabled probe ID
1 (ID 1: dtrace:::BEGIN): divide-by-zero in action #4 at DIF offset 20
$
The divide-by-zero error message says it failed in action #4, even
though there are only two actions in the program specified with the -n option: printf() and trace(). Internally,
however, the assignment n = 1
is the first probedata record with action type DTRACEACT_DIFEXPR and size
zero. The printf()
accounts for the next two probedata records, one for each unformatted
element: "cat" and 9. The fourth record (with action type DTRACEACT_DIFEXPR)
is the trace() action with
the divide-by-zero error (action type DTRACEACT_DIFEXPR indicates the trace() action unless size is
zero). From an implementation point of view, trace() is in this case "action #4".
By the time the action type DTRACEACT_PRINTF reaches the
bufhandler, the probe- and record-consuming functions have already had
their chance to see all information about that printf() except for the final formatted
string. That is now available as bufdata->dtbda_buffered.
The bufhandler should normally return DTRACE_HANDLE_OK, but in
the case of an unexpected program error, returning DTRACE_HANDLE_ABORT stops
the consumer prematurely. Similarly, the record- and
probe-consuming functions normally return DTRACE_CONSUME_THIS, but
may return DTRACE_CONSUME_ABORT to stop
the consumer. If the current record is the exit() action or
is NULL (to delimit
probedata), return DTRACE_CONSUME_NEXT.
Unlike the probe- and record-consuming functions, the bufhandler is
called not only from dtrace_work()
but also from dtrace_aggregate_print()
(in an intrstat-style
aggregate request).
This means that the user arg passed to dtrace_handle_buffered() is
forwarded to the bufhandler in both contexts, even though it may contain
state that is invalid outside of the thread that called dtrace_handle_buffered() (the
same thread that calls dtrace_work()
). To avoid the resulting (probably surprising) errors in
callbacks from dtrace_aggregate_print(),
you can use pthread_key_create()
to create a user arg capable of obtaining user data specific to each
context. You just need to associate user data with the created key
in each thread using pthread_setspecific(),
then in the bufhandler use pthread_getspecific()
to obtain the user data specific to the current thread (dtrace_work() or dtrace_aggregate_print()
). In general, take care that state recorded in the bufhandler
during dtrace_work() does
not interfere with state recorded in the bufhandler during dtrace_aggregate_print() and
vice-versa.
Technorati Tag: OpenSolaris
Technorati Tag: Solaris
Technorati Tag: DTrace
( Jun 14 2005, 08:06:26 AM PDT / Jun 14 2005, 07:28:42 AM PDT )
Permalink
Trackback: http://blogs.sun.com/tomee/entry/tom_erickson_s_weblog
|
Posted by Simon on January 05, 2007 at 02:56 AM PST #