So I'll be at OSCON this week, where I'll be giving two presentations on DTrace.
The first is a free tutorial that Keith and I will be giving on OpenSolaris development with DTrace.
This tutorial is on Tuesday from 1:30p to 5p in room D140, and did I mention that this tutorial is free?
So even if you didn't plan to attend any of the tutorials, if you're going to be in Portland on Tuesday afternoon, you should
feel free to swing by -- no need to preregister. This tutorial should be fun -- we're going to keep it very hands-on, and
it will be a demo-intensive introduction to both DTrace and to our larger tool-set that we use to both build OpenSolaris
and to diagnose it when it all goes horribly wrong. Hopefully you can join us!
The second session is a presentation exclusively on DTrace. This is
quite a bit shorter (it's 45 minutes), so this presentation is going to give a quick review of the DTrace fundamentals,
and then focus on the confluence of DTrace and open source -- both what DTrace can do for your open source project,
and what you can do (if you're so inclined) for the DTrace open source project. This session is on Thursday at 4:30 in Portland
room 255.
Other than that, my schedule is filled only with odds and ends; if you're going to be at OSCON and you want to connect, drop
me a line or leave a message for me at my hotel, the 5th Avenue Suites. See you in Portland!
Brian Utterback has a great blog entry describing
using DTrace to debug a really nasty problem in NTP.
This problem is a good object lesson for two reasons:
- The pathology -- a signal of mysterious origin killing an app -- is a canonically nasty problem and (before DTrace) it
was very difficult (or damned near impossible?) to diagnose.
- While Brian was able to use DTrace to get a big jump on the problem, completely understanding it took
some great insight on Brian's part. This has been said before, but it merits reemphasis: DTrace is a tool, not a magician. That is,
DTrace still needs to be used by someone with a brain. And not, by the way, because DTrace is difficult to use, but rather
because systems -- especially misbehaving or suboptimal ones -- have endemic complexity. We have tried to make it as easy
as possible to use DTrace to understand that complexity, but the complexity exists nonetheless.
Summary: DTrace allows you to solve problems that were previously unsolvable (or damned near) -- but it
means you'll be using your brain more, not less, so you'd better stop stabbing it with Q-tips.
DTrace is a big piece of technology, and it can be easy to lose the
principles in the details.
But understanding these principles is key to understanding the design
decisions that we have made -- and to understanding the design decisions
that we will make in the future.
Of these principles, the most fundamental is the principle of safety:
DTrace must not be able to accidentally induce system failure. It is
our strict
adherence to this principle that allows DTrace to be used with confidence
on production systems -- and it is its use on production systems that
most fundamentally separates DTrace from what has come before it.
Of course, it's easy to say that this should be true, but what does the
safety constraint
mean? First and foremost,
given that DTrace allows for dynamic instrumentation, this means
that the user must not be allowed to instrument
code and contexts that are unsafe to instrument. In any
sufficiently dynamic
instrumentation framework, such code and contexts exist (if nothing
else, the framework itself cannot be instrumented without inducing
recursion), and this must be dealt with architecturally to assure
safety.
We have designed DTrace such that
probes are provided by instrumentation providers that guarantee their
safety.
That is, instead of the user picking some random point to instrument,
instrumentation providers make available only the points that can be
safely instrumented -- and the user is restricted to selecting among
these published probes. This puts the responsibility for instrumentation
safety where it belongs: in the provider. The specific techniques that
the providers use to assure safety are a bit too arcane to discuss
here,[1]
but suffice it to say that the providers are very conservative
in their instrumentation.
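As a rough illustration of that division of responsibility, here is a minimal sketch in C. Everything in it is hypothetical: the toy_probe structure, the toy_published table and toy_enable() are invented for this example and are not DTrace's actual interfaces. The only point it tries to make is that an enabling can name a published probe and nothing else.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one published probe (not a DTrace structure). */
struct toy_probe {
    const char *provider;
    const char *function;
    const char *name;
    int enabled;
};

/*
 * The provider decides what appears in this table. A point that the
 * provider considers unsafe to instrument is simply never listed.
 */
static struct toy_probe toy_published[] = {
    { "fbt", "read", "entry", 0 },
    { "fbt", "read", "return", 0 },
    { "fbt", "write", "entry", 0 },
};

#define TOY_NPROBES (sizeof (toy_published) / sizeof (toy_published[0]))

/*
 * Enable a probe by name. Returns 0 on success, or -1 if no such probe
 * has been published; there is no way to ask for an arbitrary address.
 */
int toy_enable(const char *provider, const char *function, const char *name)
{
    assert(provider != NULL && function != NULL && name != NULL);

    for (size_t i = 0; i < TOY_NPROBES; i++) {
        struct toy_probe *p = &toy_published[i];

        if (strcmp(p->provider, provider) == 0 &&
            strcmp(p->function, function) == 0 &&
            strcmp(p->name, name) == 0) {
            p->enabled = 1;
            return 0;
        }
    }
    return -1;
}
```

In this sketch, asking for ("fbt", "read", "entry") succeeds, while asking for a function that the provider never listed fails outright rather than instrumenting something unsafe.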
This addresses one aspect of instrumentation safety -- instrumenting
wholly unsafe contexts -- but it doesn't address the recursion issue,
where the code required to process a probe (a context that we call
probe context) ends up itself encountering an
enabled probe.
This kind of recursion can be dealt with in one of two ways: lazily (that is,
the recursion can be detected when it happens, and processing of the
probe that induced the recursion can be aborted) or proactively (the
system can be designed such that recursion is impossible). For
a myriad of reasons, we elected for the second
approach: to make recursion architecturally impossible. We achieve
this by mandating that
while in probe context, DTrace itself must not call into any
facilities in the kernel
at-large.
This means both implicit and explicit transfers of control into
the kernel-at-large -- so just as DTrace must avoid (for example) allocating
memory in probe context, it must also avoid inducing scheduler activity
by blocking.[2]
Once the fundamental safety issues of instrumentation are addressed,
focus turns to the safety of user-specified actions and predicates.
Very early in our thinking on DTrace, we knew that we wanted actions
and predicates to be completely programmable, giving rise to
a natural question: how are they executed? For us, the answer was
so clear that it was almost unspoken: we knew that we needed to develop
a virtual machine that could act as a target instruction set for a
custom compiler. Why was this the clear choice? Because the
alternative -- to execute user-specified code natively in the kernel --
is untenable from a safety perspective. Executing user-specified
code natively in the kernel is untenable for
many reasons:
- Explicit stores to memory would have to be forbidden.
To allow for user-defined variables (as we knew we wanted to do), one
must clearly allow data to be stored. But if one executes natively,
one has no way of differentiating a store to legal variable memory from
a stray store to arbitrary kernel state. One is reduced to
forbidding stores completely (destroying a critical architectural
component in the process), rewriting the binary to add checks around
each store, or emulating stores and manually checking
the target address in the emulation code. The first option is
unacceptable from a feature perspective, the second option is a
non-trivial undertaking rife with new failure modes, and the
third option -- emulating stores -- shouldn't be thought of as native
execution but rather as executing on
a virtual machine that happens to match the underlying instruction
set architecture.
- Loops would have to be dynamically detected. One cannot allow user-defined
code to spin on the CPU in the kernel, so loops must be dynamically detected
and halted. Static analysis might be tempting, but
in a Turing-complete system such analysis will always be heuristic --
one cannot solve the
Halting Problem.
While heuristics are fine
when trying to achieve performance, they are not acceptable when
correctness is the constraint. The problem must therefore be
solved dynamically,
and dynamic detection of loops isn't
simple -- one must detect loops using any control transfer mechanism,
including procedure calls.
As with stores, one is reduced to either
forbidding calls and backwards branches, rewriting the binary to add
dynamic checks before all control transfers, or emulating them and manually
checking the target address in the emulation code. And again, emulating
these instructions negates the putative advantages of native execution.
- Code would have to be very carefully validated for illegal
instructions.
For example, any instruction that operates on floating point state
is (generally) illegal to execute in the kernel (Solaris -- like
most operating systems -- doesn't save and restore the floating point
registers on a kernel/user context switch; using the floating point
registers in the kernel would corrupt user floating point state);
floating point operations would have to be detected and code containing
them rejected.
There are many such examples (e.g. register-indirect transfers of control must
be prohibited to prevent user-specified code from transferring
control into the kernel at large, privileged instructions must be
prohibited to prevent user-specified code from hijacking the operating system,
etc. etc.), and detecting them isn't
fail-safe:
if one fails to detect
so much as one of these cases, the entire system is vulnerable.
- Executing natively isn't portable. This might seem counterintuitive
because executing natively seems to
"automatically" work on any instruction set architecture
that the operating system supports -- it leverages the existing tool
chain for compiling the user-specified code. But this leverage is
Fools' Gold:
as described above, the techniques to assure safety for native execution are
profoundly specific to the instruction set -- and any new instruction set
would require completely new validation code. And again, this isn't
fail-safe: so much as one slip-up in a new instruction set architecture
means that the entire system is at risk on the new architecture.
We left these many drawbacks of native execution largely
unspoken because the
alternative -- a
purpose-built virtual machine for executing user-specified code -- was
so clearly the better choice.
The virtual machine that we designed, the
D Intermediate Format (DIF) virtual machine,
has the following safety properties:
- It has no mechanism for storing to arbitrary addresses; the only store
opcodes represent stores to specific kinds of user-defined variables.
This solves two problems in a single design decision: it prohibits
stores to arbitrary kernel memory, and it allows us to distinguish
stores to different kinds of variables (global, thread-local,
clause-local, etc.) from the virtual instruction opcode itself. This allows
us to off-load intelligence about variable management from the instruction
set and into the runtime where it belongs.
- It has no backwards branches, and supports only calls to defined
runtime support routines -- eliminating the possibility of user-defined
loops altogether. This may seem unnecessarily severe (this makes loops
an impossibility
by architecture), but to us it was an acceptable tradeoff to achieve
absolute safety.
- It is sufficiently simple to be easily (and rigorously) validated,
as can be seen from the straightforward
DIF object validation code.
- It is completely portable, allowing the validation and emulation
code to be written and debugged once -- accelerating bringup of DTrace
on new platforms.
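To see why these restrictions make validation tractable, consider a minimal sketch in C of a toy virtual machine with the same two restrictions. This is emphatically not the DIF instruction set or the DIF validation code: the opcodes, the toy_insn layout, and the toy_validate() and toy_run() functions are all assumptions made up for illustration. Because every branch must go strictly forward and the final instruction must be a return, a validated program of n instructions can execute at most n instructions; and because the only store opcode names a variable slot, there is no way to express a store to an arbitrary address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy opcodes, invented for this sketch (not the real DIF set). */
enum toy_op {
    TOY_SETI,       /* reg[rd] = imm */
    TOY_ADD,        /* reg[rd] = reg[r1] + reg[r2] */
    TOY_BRANCHZ,    /* if (reg[r1] == 0) jump to instruction imm */
    TOY_STGLOBAL,   /* global variable number imm = reg[r1] */
    TOY_RET         /* stop, yielding reg[r1] */
};

struct toy_insn {
    enum toy_op op;
    uint8_t rd, r1, r2;
    uint32_t imm;
};

#define TOY_NREGS    8
#define TOY_NGLOBALS 4

/* Returns 0 if the program may be run, -1 if it must be rejected. */
int toy_validate(const struct toy_insn *prog, size_t len)
{
    assert(prog != NULL);

    /* The last instruction must be RET so execution cannot run off the end. */
    if (len == 0 || prog[len - 1].op != TOY_RET)
        return -1;

    for (size_t pc = 0; pc < len; pc++) {
        const struct toy_insn *i = &prog[pc];

        if (i->rd >= TOY_NREGS || i->r1 >= TOY_NREGS || i->r2 >= TOY_NREGS)
            return -1;

        switch (i->op) {
        case TOY_SETI:
        case TOY_ADD:
        case TOY_RET:
            break;
        case TOY_BRANCHZ:
            /* Forward branches only: the target must lie strictly ahead. */
            if (i->imm <= pc || i->imm >= len)
                return -1;
            break;
        case TOY_STGLOBAL:
            /* A store names a variable slot, never an address. */
            if (i->imm >= TOY_NGLOBALS)
                return -1;
            break;
        default:
            return -1;      /* unknown opcode */
        }
    }
    return 0;
}

/* Runs a program that toy_validate() has accepted. */
uint32_t toy_run(const struct toy_insn *prog, size_t len,
    uint32_t globals[TOY_NGLOBALS])
{
    uint32_t reg[TOY_NREGS] = { 0 };
    size_t pc = 0;

    assert(toy_validate(prog, len) == 0);

    while (pc < len) {
        const struct toy_insn *i = &prog[pc++];

        switch (i->op) {
        case TOY_SETI:     reg[i->rd] = i->imm; break;
        case TOY_ADD:      reg[i->rd] = reg[i->r1] + reg[i->r2]; break;
        case TOY_BRANCHZ:  if (reg[i->r1] == 0) pc = i->imm; break;
        case TOY_STGLOBAL: globals[i->imm] = reg[i->r1]; break;
        case TOY_RET:      return reg[i->r1];
        }
    }
    return 0;   /* not reached for a validated program */
}
```

Note that the validator is a single pass over the program with a handful of comparisons per instruction. A program whose branch targets itself or any earlier instruction is rejected before it ever runs, so the emulator needs no loop detection at all.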
Just having an appropriately restricted virtual machine addressed
many safety issues, but several niggling safety issues still had to be
dealt with explicitly:
- Runtime errors like division-by-zero or misaligned loads. While
a virtual machine doesn't solve these in and of itself, it makes them
trivial to solve: the emulator simply refuses to perform such operations,
aborting processing and indicating a runtime error.
- Loads from I/O space. In the Solaris kernel, devices have memory
that can be mapped into the kernel's address space. Loads from these
memory ranges can have
side-effects; they must be prohibited. This is only slightly more
complicated than dealing with divisions-by-zero; before performing a load, the
emulator checks that the virtual address does not fall in a range reserved
for memory mapped devices, aborting processing if it does.
- Loads from unmapped memory. Given that we wanted to allow user-defined
code to chase pointers in the kernel, we knew that we had to deal with
user-defined code attempting to load from an unmapped address. This can't
be dealt with strictly in the emulator, as it would require probing
kernel VM structures from probe context (which, if allowed, would prohibit
instrumentation of the VM system). We dealt with this instead by
modifying the kernel's page fault handler to check if a load has been
DIF-directed before vectoring into the VM system to handle the fault.
If the fault is DIF-directed, the kernel sets a bit indicating that the
load has faulted, increments the trapping instruction pointer past the
faulting load, and returns from the trap. The emulation code checks the
faulted bit after emulating each instruction, aborting processing if it
is set.
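The shape of these three checks can be sketched in a few lines of C. This is a user-level toy under loudly stated assumptions, not the DTrace emulator: the address ranges, the toy_state structure and the toy_div() and toy_load32() functions are all invented here. In particular, toy_raw_load32() merely stands in for the real mechanism, in which the kernel's page fault handler sets the fault bit and skips the faulting load; a user-level sketch can only imitate that with a bounds check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum toy_err { TOY_OK = 0, TOY_DIVZERO, TOY_BADALIGN, TOY_IOSPACE, TOY_BADADDR };

/* Hypothetical address-space layout for the sketch. */
#define TOY_MEM_SIZE 64u        /* "mapped" memory is [0, 64) */
#define TOY_IO_BASE  0x1000u    /* "device" memory is [0x1000, 0x2000) */
#define TOY_IO_LIMIT 0x2000u

struct toy_state {
    uint8_t mem[TOY_MEM_SIZE];
    int faulted;                /* set when a load touches unmapped memory */
};

/*
 * Stand-in for the trap path: a load from unmapped memory sets the
 * fault flag and yields 0 instead of bringing anything down.
 */
static uint32_t toy_raw_load32(struct toy_state *s, uint32_t addr)
{
    uint32_t v = 0;

    if (addr > TOY_MEM_SIZE - sizeof (v)) {
        s->faulted = 1;
        return 0;
    }
    memcpy(&v, &s->mem[addr], sizeof (v));
    return v;
}

/* Division that refuses to divide by zero. */
enum toy_err toy_div(uint32_t a, uint32_t b, uint32_t *out)
{
    assert(out != NULL);

    if (b == 0)
        return TOY_DIVZERO;
    *out = a / b;
    return TOY_OK;
}

/* Load that checks alignment and I/O space first, then the fault flag. */
enum toy_err toy_load32(struct toy_state *s, uint32_t addr, uint32_t *out)
{
    assert(s != NULL && out != NULL);

    if (addr % sizeof (uint32_t) != 0)
        return TOY_BADALIGN;
    if (addr >= TOY_IO_BASE && addr < TOY_IO_LIMIT)
        return TOY_IOSPACE;

    s->faulted = 0;
    uint32_t v = toy_raw_load32(s, addr);
    if (s->faulted)
        return TOY_BADADDR;     /* abort processing; *out is left untouched */

    *out = v;
    return TOY_OK;
}
```

The ordering is the interesting part: the alignment and I/O-space checks happen before the load is attempted, while the unmapped case can only be noticed after the fact, which is why it needs a flag rather than a check.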
So DTrace is not safe by accident -- DTrace is safe by deliberate design and
by careful execution. DTrace's safety comes from
a probe discovery process that assures safe instrumentation,
a purpose-built virtual machine that assures safe execution, and
a careful implementation that assures safe exception handling.
Could a safe system for dynamic instrumentation
be built with a different set of design decisions?
Perhaps -- but we believe that were such a system to be as safe, it would
either be so under-powered or so over-complicated as to invalidate
those design decisions.
[1] The best place to see this provider-based safety is in the implementation of the
FBT provider (x86, SPARC) and in the implementation of the pid provider (x86, SPARC).
[2] While it isn't a safety issue per se, this led us to two other important
design decisions: probe context is lock-free (and almost always wait-free),
and interrupts are disabled in probe context.
Technorati tags: OpenSolaris, Solaris, DTrace
I read with some interest about the GNOME startup bounty.
As Stephen O'Grady pointed out, this problem is indeed perfect for DTrace.
To get a feel for the problem, I wrote a very simple D script:
#!/usr/sbin/dtrace -s

#pragma D option quiet

/* Start the clock (and all other tracing) once gnome-session has been exec'd. */
proc:::exec-success
/execname == "gnome-session"/
{
    start = timestamp;
    go = 1;
}

/* Note each I/O as it is issued, with its millisecond offset from start. */
io:::start
/go/
{
    printf("%10d { -> I/O %d %s %s %s }\n",
        (timestamp - start) / 1000000, pid, execname,
        args[0]->b_flags & B_READ ? "reads" : "writes",
        args[2]->fi_pathname);
}

/* Note each I/O as it completes. */
io:::done
/go/
{
    printf("%10d { <- I/O to %s }\n",
        (timestamp - start) / 1000000, args[2]->fi_pathname);
}

/*
 * If the vnode behind the I/O has no cached path, print the vnode
 * pointer so that the path can be recovered later.
 */
io:::start
/go && ((struct buf *)arg0)->b_file != NULL &&
    ((struct buf *)arg0)->b_file->v_path == NULL/
{
    printf("%10s (vp %p)\n", "", ((struct buf *)arg0)->b_file);
}

/* Count I/Os by application, by file, and by (application, file) pair. */
io:::start
/go/
{
    @apps[execname] = count();
    @files[args[2]->fi_pathname] = count();
    @appsfiles[execname, args[2]->fi_pathname] = count();
}

/* Remember the name of the program that is calling exec. */
proc:::exec
/go/
{
    self->parent = execname;
}

/* On a successful exec, report the new program and where it came from. */
proc:::exec-success
/self->parent != NULL/
{
    printf("%10d -> %d %s (from %d %s)\n",
        (timestamp - start) / 1000000, pid, execname,
        curpsinfo->pr_ppid, self->parent);
    self->parent = NULL;
}

/* Note each process exit. */
proc:::exit
/go/
{
    printf("%10d <- %d %s\n",
        (timestamp - start) / 1000000, pid, execname);
}

/* Sample at 101 hertz; a non-NULL arg1 means user-level code was running. */
profile-101hz
/go && arg1 != NULL/
{
    printf("%10d [ %d %s ]\n",
        (timestamp - start) / 1000000, pid, execname);
}

/* A sample that lands in the kernel on an idle thread means the CPU was idle. */
profile-101hz
/go && arg1 == NULL &&
    (curlwpsinfo->pr_flag & PR_IDLE)/
{
    printf("%10d [ idle ]\n",
        (timestamp - start) / 1000000);
}

/* Print the I/O summaries. */
END
{
    printf("\n %-72s %s\n", "APPLICATION", "I/Os");
    printa(" %-72s %@d\n", @apps);
    printf("\n %-72s %s\n", "FILE", "I/Os");
    printa(" %-72s %@d\n", @files);
    printf("\n %-16s %-55s %s\n", "APPLICATION", "FILE", "I/Os");
    printa(" %-16s %-55s %@d\n", @appsfiles);
}
This script uses a combination of CPU sampling and I/O tracing to determine
roughly what's going on over the course of logging in. I ran the above
script on my Acer Ferrari 3400 laptop running OpenSolaris by dropping to the
command-line login after a reboot and running:
# nohup ./login.d > /var/tmp/login.out &
I then logged in, and made sure that the first thing I did was launch a
gnome-terminal to pkill dtrace.
(This could obviously be made quite a bit more precise, but it works as
a crude way of indicating the completion of login.)
Here is the output from performing this experiment.
The first column is the millisecond offset from the start of gnome-session.
CPU samples are contained within square brackets, I/O is contained within
curly braces, and launched and exiting programs are marked with "->" and "<-".
An I/O summary is provided at the end.
A couple of observations:
- Nearly one third of all I/O is to shared object text and read-only data.
This is classic death-of-a-thousand-cuts, and it's hard to see that there's
an easy way to fix this. But perhaps text could be reordered to save some
I/O? More investigation is probably merited here.
- Over a quarter of all I/O is from GConf -- and many of these are from
wandering around an expansive directory hierarchy looking for configuration
information. It is well-known that the XML backend is a big performance
problem, and a better backend is apparently being worked on. At any rate,
solving this problem is clearly in GConf's future.
- Of the remaining I/O, a bunch is to icon files. Glynn Foster
pointed out that this looks to be addressed by the
GTK+ gtk-update-icon-cache, new in GTK+ 2.6 and contained within
GNOME 2.10, so this experiment will obviously need to
be repeated on GNOME 2.10.
- CPU utilization doesn't look like a big issue -- or it's one that is at
least dwarfed by the cost of performing I/O. That said, gconfd-2 looks to
be a bit piggy in this department as well. We're only using CPU sampling --
and we're using a pretty coarse grained sample -- and gconfd-2 still
showed up. More precise investigation into CPU utilization can be
performed with the sched provider and its on-cpu and
off-cpu probes.
And an end-note: you might note many files named "<unknown>" in the output.
This is due to I/Os being induced by lookups that are going through
directories and/or symlinks that haven't been explicitly opened. For these
I/Os, my script also gives you the pointer to the vnode structure in
the kernel; you can get the path to these by using ::vnode2path in MDB:
# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs ip
sctp usba uhci s1394 random nca lofs nfs audiosup sppp ptm ipc ]
> ffffffff8f6851c0::vnode2path
/usr/sfw/lib/libXrender.so.1
>
Yes, having to do this sucks a bit, and it's a known issue.
And rumor has it that Eric even has a workspace
with the fix, so stay tuned...
Update: Eric pointed me to a prototype with the fix, and I reran the script on
a GNOME cold start;
here is the output
from that run. Interestingly, because the symlinks now show up, a little postprocessing
reveals that we chased nearly eighty symlinks on startup! From the output, reading many
of these took 10 to 20 milliseconds; we might be spending as much as one second of GNOME
startup blocked on I/O just to chase symlinks! Ouch! Again, it's not clear how one
would fix this; having an app link to libpango-1.0.so.0 and not libpango-1.0.so.400.1 is clearly a Good Thing, and having this be a symlink
instead of a hardlink is clearly a Good Thing -- but all of that goodness leaves you
with a read dependency that's hard to work around. Anyway, be looking for Eric's fix
in OpenSolaris and then in an upcoming Solaris Express release;
it makes this kind of analysis much easier -- and thanks Eric for the quick prototype!
Technorati tags: OpenSolaris, Solaris, DTrace, GNOME