| |
Adam Leventhal's Weblog
inside the sausage factory
All
|
DTrace
|
OpenSolaris
|
Other
|
Software
|
ZFS
Monday March 19, 2007 |
|
Ian Murdock has left the Linux Foundation to lead the operating systems
strategy here at Sun. The last few years have seen some exciting changes
at Sun: releasing Solaris 10 (which includes several truly revolutionary
technologies), embracing x86, leading on x64, and taking Solaris open source.
That a luminary of the Linux world was enticed by the changes we've made
and the technologies we're creating is a huge vote of confidence. From my
(admittedly biased) view, OpenSolaris has been breaking away from the pack with
technologies like DTrace, ZFS, Zones, SMF, FMA and others. I'm looking forward
to the contributions Ian will bring from his experience with Debian, and to
the wake-up call this may be to Linux devotees: OpenSolaris is breaking away.
(2007-03-19 11:33:51.0/2007-03-19 10:24:02.0)
Permalink
Trackback: http://blogs.sun.com/ahl/entry/linux_defection
|
|
|
Tuesday December 13, 2005 |
|
With BrandZ, it's now possible to use DTrace on
Linux applications. For the uninitiated, DTrace is the dynamic tracing facility
in OpenSolaris; it allows for systemic analysis of a scope and precision unequalled in the industry. With DTrace, administrators and
developers can trace low
level services like I/O and scheduling, up the system stack through
kernel functions calls, system calls, and system library calls, and into
applications written in C and C++ or any of a host of dynamic languages like
Java, Ruby, Perl or php. One of my contributions to BrandZ was to extend
DTrace support for Linux binaries executed in a branded Zone.
DTrace has several different instrumentation providers that
know how to instrument a particular part of the system and provide relevant
probes for that component. The io provider lets you trace disk
I/O, the fbt (function boundary tracing) provider lets you trace
any kernel function call, etc. A typical system will start with more than 30,000 probes but providers can create probes dynamically to trace new kernel modules
or user-land processes. When strictly focused on a user-land application, the
most useful providers are typically the syscall provider to
examine system calls and
the pid provider that can trace any instruction in a any process
executing on the system.
For Linux processes, the pid provider just worked (well, once Russ built a library to understand the Linux run-time linker),
and we introduced a new
provider -- the lx-syscall provider -- to trace entry and return for
emulated Linux system calls. With these providers it's possible to understand
every facet of a Linux application's behavior and with the other DTrace probes
it's possible to reason about an application's use of system resources. In other words, you can take that sluggish Linux application, stick it in a branded
Zone, dissect it using Solaris tools, and then bring it back to a native Linux
system with the fruits of your DTrace investigation[1].
To give an example of using DTrace on Linux applications, I needed an
application to examine. I wanted a well known program that either didn't run
on Solaris or operated sufficiently differently such examining the Linux
version rather than the Solaris port made sense. I decided on
/usr/bin/top partly because of the dramatic differences between how it
operates on Linux vs. Solaris (due to the differences in /proc), but mostly
because of what I've heard my colleague, Bryan, refer to as the
"top problem": your system is slow, so you run top. What's the top
process? Top!
Running top in the Linux branded zone, I opened a shell in the
global (Solaris) zone to use DTrace.
I started as I do on Solaris applications: I looked at system calls. I was
interested to see which system calls were being executed most frequently
which is easily expressed in DTrace:
bash-3.00# dtrace -n lx-syscall:::entry'/execname == "top"/{ @[probefunc] = count(); }'
dtrace: description 'lx-syscall:::entry' matched 272 probes
^C
fstat64 322
access 323
gettimeofday 323
gtime 323
llseek 323
mmap2 323
munmap 323
select 323
getdents64 1289
lseek 1291
stat64 3545
rt_sigaction 5805
write 6459
fcntl64 6772
alarm 8708
close 11282
open 14827
read 14830
Note the use of the aggregation denoted with the '@'.
Aggregations are the mechanism by which DTrace allows users to examine
patterns of system behavior rather than examining each individual datum
-- each system call for example.
(In case you also noticed the strange discrepancy between the number of open
and close system calls, many of those opens are failing so it makes sense that
they would have no corresponding close. I used the lx-syscall provider to suss
this out, but I omitted that investigation in a vain appeal for brevity.)
There may well be something fishy about this output, but nothing struck me as so compellingly fishy to explore immediately. Instead, I fired up vi and wrote a
short D script to see which system calls were taking the most time:
lx-sys.d
#!/usr/sbin/dtrace -s
lx-syscall:::entry
/execname == "top"/
{
self->ts = vtimestamp;
}
lx-syscall:::return
/self->ts/
{
@[probefunc] = sum(vtimestamp - self->ts);
self->ts = 0;
}
This script creates a table of system calls and the time spent in them (in
nanoseconds). The results were fairly interesting.
bash-3.00# ./lx-sys.d
dtrace: script './lx-sys.d' matched 550 probes
^C
llseek 4940978
gtime 5993454
gettimeofday 6603844
fstat64 14217312
select 26594875
lseek 30956093
mmap2 43463946
access 49033498
alarm 72216971
fcntl64 188281402
rt_sigaction 197646176
stat64 268188885
close 417574118
getdents64 781844851
open 1314209040
read 1862007391
write 2030173630
munmap 2195846497
That seems like a lot of time spent in munmap for top. In fact, I'm
rather surprised that there's any mapping and unmapping going on at all (I
guess that should have raised an eyebrow after my initial system call count).
Unmapping memory is a pretty expensive operation that gets more expensive
on bigger systems as it requires the kernel to do some work on every CPU
to completely wipe out the mapping.
I then modified lx-sys.d to record the total amount of time top spent on the
CPU and the total amount of time spent in system calls to see how large a chunk
of time these seemingly expensive unmap operations were taking:
lx-sys2.d
#!/usr/sbin/dtrace -s
lx-syscall:::entry
/execname == "top"/
{
self->ts = vtimestamp;
}
lx-syscall:::return
/self->ts/
{
@[probefunc] = sum(vtimestamp - self->ts);
@["- all syscalls -"] = sum(vtimestamp - self->ts);
self->ts = 0;
}
sched:::on-cpu
/execname == "top"/
{
self->on = timestamp;
}
sched:::off-cpu
/self->on/
{
@["- total -"] = sum(timestamp - self->on);
self->on = 0;
}
I used the sched provider to see when top was going on and off
of the CPU, and I added a row to record the total time spent in all
system call. Here were the results (keep in mind I was just hitting ^C to
end the experiment after a few seconds so it's expected that these numbers
would be different from those above; there are ways to have more accurately
timed experiments):
bash-3.00# ./lx-sys2.d
dtrace: script './lx-sys2.d' matched 550 probes
^C
llseek 939771
gtime 1088745
gettimeofday 1090020
fstat64 2494614
select 4566569
lseek 5186943
mmap2 7300830
access 8587484
alarm 11671436
fcntl64 31147636
rt_sigaction 33207341
stat64 45223200
close 69338595
getdents64 131196732
open 220188139
read 309764996
write 340413183
munmap 365830103
- all syscalls - 1589236337
- total - 3258101690
So system calls are consuming roughly half of top's time on the CPU and
the munmap syscall is consuming roughly a quarter of that. This was enough
to convince me that there was probably room for improvement and further
investigation might bear fruit.
Next, I wanted to understand what this mapped memory was being used for so
I wrote a little script that traces all the functions called in the process
between when memory is mapped using the mmap2(2) system call and when it's
unmapped and returned to the system through the munmap(2) system call:
map.d
#!/usr/sbin/dtrace -s
#pragma D option quiet
lx-syscall::mmap2:return
/pid == $target/
{
self->ptr = arg1;
self->depth = 10;
printf("%*.s <- %s syscall\n", self->depth, "", probefunc);
}
pid$target:::entry
/self->ptr/
{
self->depth++;
printf("%*.s -> %s`%s\n", self->depth, "", probemod, probefunc);
}
pid$target:::return
/self->ptr/
{
printf("%*.s <- %s`%s\n", self->depth, "", probemod, probefunc);
self->depth--;
}
lx-syscall::munmap:entry
/arg0 == self->ptr/
{
self->depth++;
printf("%*.s -> %s syscall\n", self->depth, "", probefunc);
self->ptr = 0;
self->depth = 0;
exit(0);
}
This script uses the $target variable which means that we need to run it with
the -p option where is the process ID of top. When mmap2 returns,
we set a thread local variable, 'ptr', which stores the address at the base
of the mapped region; for every function entry and return in the process we
call printf() if self->ptr is set; finally, we exit DTrace when
munmap is called with that same address. Here are the results:
bash-3.00# ./map.d -p `pgrep top`
<- mmap2 syscall
<- LM2`libc.so.1`syscall
<- LM2`lx_brand.so.1`lx_emulate
<- LM2`lx_brand.so.1`lx_handler
<- libc.so.6`mmap
<- libc.so.6`malloc
-> libc.so.6`memset
<- libc.so.6`memset
-> libc.so.6`cfree
<- libc.so.6`cfree
-> libc.so.6`munmap
<- LM2`lx_brand.so.1`lx_handler_table
-> LM2`lx_brand.so.1`lx_handler
-> LM2`lx_brand.so.1`lx_emulate
-> LM2`libc.so.1`syscall
-> munmap syscall
I traced the probemod (shared object name) in addition to probefunc (function
name) so that we'd be able to differentiate proper Linux functions (in this
case all in libc.so.6) from functions in the emulation library
(LM2`lx_brand.so.1). What we can glean from this is that the mmap call is a
result of a call to malloc() and the unmap is due to a call to free(). What's
unfortunate is that we're not seeing any function calls in top itself. For some
reason, the top developer chose to strip this binary (presumably to save
precious 2k the symbol table would have used on disk) so we're stuck with no symbolic information for top's
functions and no ability to trace the individual function calls[2], but we can
still reason about this a bit more.
A little analysis in mdb revealed that cfree (an alias for free)
makes a tail-call to a function that calls munmap. It seems strange to me that
freeing memory would immediately results in an unmap operation (since it would
cause exactly the high overhead we're seeing here. To explore this, I wrote
another script which looks at what proportion of calls to malloc result in a
call to mmap():
malloc.d
#!/usr/sbin/dtrace -s
pid$target::malloc:entry
{
self->follow = arg0;
}
lx-syscall::mmap2:entry
/self->follow/
{
@["mapped"] = count();
self->follow = 0;
}
pid$target::malloc:return
/self->follow/
{
@["no map"] = count();
self->follow = 0;
}
Here are the results:
bash-3.00# ./malloc.d -p `pgrep top`
dtrace: script './malloc.d' matched 11 probes
^C
mapped 275
no map 3024
So a bunch of allocations result in a mmap, but not a huge number. Next I
decided to explore if there might be a correlation between the size of the
allocation and whether or not it resulted in a call to mmap using the following
script:
malloc2.d
#!/usr/sbin/dtrace -s
pid$target::malloc:entry
{
self->size = arg0;
}
lx-syscall::mmap2:entry
/self->size/
{
@["mapped"] = quantize(self->size);
self->size = 0;
}
pid$target::malloc:return
/self->size/
{
@["no map"] = quantize(self->size);
self->size = 0;
}
Rather than just counting the frequency, I used the quantize
aggregating action to built a power-of-two histogram on the number of bytes
being allocated (self->size). The output was quite illustrative:
bash-3.00# ./malloc2.d -p `pgrep top`
dtrace: script './malloc2.d' matched 11 probes
^C
no map
value ------------- Distribution ------------- count
2 | 0
4 |@@@@@@@ 426
8 |@@@@@@@@@@@@@@@ 852
16 |@@@@@@@@@@@ 639
32 |@@@@ 213
64 | 0
128 | 0
256 | 0
512 |@@@@ 213
1024 | 0
mapped
value ------------- Distribution ------------- count
131072 | 0
262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 213
524288 | 0
All the allocations that required a mmap were huge -- between 256k and
512k. Now it makes sense why the Linux libc allocator would treat these
allocations a little differently than reasonably sized allocations. And this is
clearly a smoking gun for top performance: it would do much better to
preallocate a huge buffer and grow it as needed (assuming it actually needs it
at all) than to malloc it each time. Tracking down the offending line of code
would just require a non-stripped binary and a little DTrace invocation like
this:
# dtrace -n pid`pgrep top`::malloc:entry'/arg0 >= 262144/{@[ustack()] = count()}'
From symptoms to root cause on a Linux application in a few DTrace scripts --
and it took me approximately 1000 times longer to cobble together some vaguely
coherent prose describing the scripts than it did for me to actually perform
the investigation. BrandZ opens up some pretty interesting new vistas for
DTrace. I look forward to seeing Linux applications being brought in for
tune-ups on BrandZ and then reaping those benefits either back on their
mother Linux or sticking around to enjoy the fault management, ZFS,
scalability, and, of course, continued access to DTrace in BrandZ.
[1] Of course, results may vary since the guts of the Linux kernel differ
significantly from those of the Solaris kernel, but they're often fairly
similar. I/O or scheduling problems will be slightly different, but often
not so different that the conclusions lack applicability.
[2] Actually, we can can still trace function calls -- in fact, we can trace any instruction -- but it takes something of a heroic effort. We could disassemble parts of top to identify calls sites and then use esoteric pid123::-:address probe format to trace the stripped function. I said we could do it; I never said it would be pretty.
Technorati Tags:
BrandZ
DTrace
Solaris
OpenSolaris
(2006-05-02 09:14:30.0/2005-12-13 13:00:00.0)
Permalink
Trackback: http://blogs.sun.com/ahl/entry/dtrace_for_linux
|
|
|
Sunday November 27, 2005 |
|
The good folks over at LugRadio (that LUG; not the other LUG of course) invited me on their show to answer some of their questions about OpenSolaris. You can find it in the latest episode. Check your volume before you start playing the episode: the first reference to masturbation is approximately 6 seconds in.
Understandably, the LugRadio guys didn't have much exposure to OpenSolaris, but they were certainly interested and recognize that it represents a pretty exciting development in the world of open source. Mostly I addressed some of the differences between Solaris and Linux. They seemed quite keen on Nexenta and I recommend that Solaris and Debian users alike try it out. There's one small correction I'd like to make regarding my comments on the show: I erroneously claimed that Solaris 7 -- the first version of Solaris to support 64-bit processors -- came out in 1996. In fact, Solaris 7 came out in November of 1998 (64-bit support having been first integrated in October of 1997). That cuts our 64-bit experience to 7 or 8 years from the 10 that I claimed.
On a vaguely related note, I'll be in London speaking at the OpenSolaris user group meeting on 12/19/2005. That's after a tromp accross Europe which includes a DTrace cum Java talk and an OpenSolaris BoF at JavaPolis in Belgium.
Technorati tags:
OpenSolaris
LugRadio
JavaPolis
(2005-11-27 22:31:35.0/2005-11-27 18:59:59.0)
Permalink
Trackback: http://blogs.sun.com/ahl/entry/opensolaris_on_lugradio
|
|
|
Wednesday October 19, 2005 |
|
Actually, Fall, Winter and Spring of code. Sun just announced the Solaris 10 Univesity Challenge Content. That's right, it's a challenge and a contest and with three times the staying power of single season of code. Apparently in a modern day treaty of Brest-Litovsk, Google ceded the rest of the year to Sun. Perhaps this was the real beef of the recent Sun/Google agreement.
This is actually pretty cool: be a college student, do something cool on OpenSolaris, take a shot at winning the grand prize of $5k and a sweet box (I imagine there might be prizes for other interesting entries). There are some ideas projects on the contest/challenge/seasonal coding page ranging from good (MythTV), to mundane (support for inkjet printers from Epson), to confusing (internet gaming -- I thought online gamling was its own reward), to inane ("A joystick driver - for gaming", oh for gaming? I've been using it for system administration). Here's my list off the top of my head -- if you want more ideas, feel free to drop me a line.
Work on an existing open source project
- pearpc runs ppc applications on x86. I started working on porting it over and was able to boot Darwin.
- valgrind is very cool (I've only just seen it). It would be great to port it or to use pieces like KCacheGrind and plug in DTrace as a backend.
- Port over your favorite application, or get it running in some emulation environment of your own devising.
- Make something go faster. MySQL, Gnome, mozilla, some random system call, whatever; there's a lot of inefficiency out there.
Write something new (using cools stuff in Solaris)
- I'd love to see more dynamic languages with native DTrace support. We've already got support for
Java,
php,
Ruby, and
Perl in some form; make it better or add support for some other language you know and love (TCL, python, scheme, LISP, ML, etc.).
- Build another kind of analysis tool on top of DTrace. We're working on a Java binding which is going to make this easier.
- Write a device driver for your favorite crazy device (which I assume is your new iPod nano or something; you're such a hipster Apple fanboy).
- Build a tool to simulate a distributed environment on Zones and use DTrace to monitor the communication. WARNING: your distributed systems professor will be your new best friend.
That's what I'd do, but if you have a better idea, go do that instead.
Technorati tags:
OpenSolaris
Summer o' Code
(2005-10-20 00:21:25.0/2005-10-19 16:15:00.0)
Permalink
Trackback: http://blogs.sun.com/ahl/entry/fall_of_code
Today at EuroOSCON, I attended a introductory talk on svn by Chia-liang Kao. I was hopeful that svk might address some of the issues that I thought would prevent us from adopting Subversion for OpenSolaris. In particular, Subversion requires a centralized repository whereas svk, which is built on top of Subversion, provides the distributed revision control system that we'd need. After the talk, my overall impression was that svk seemed to lack a certain polish, but after struggling to phrase that in less subjective terms, I'm coming around a bit.
I got a little nervous when the first example of svk's use was for keeping the contents of /etc under revision control. The big problem that svk solved was having random directories (.svc, SCCS, whatever) in, for example, /etc/skel. Talk about trivia (almost as relevant as a demo implementing a weblog in Ruby on Rails). I guess it's nice that svk solves a problem for that particularly esoteric scenario, but wasn't there some mention that this might be used to, you know, hold onto source code? Was that actually the design center for svk?
Fortunately we did get to the details of using svk for a source code repository. I think this is just my bias coming from teamware, but some of the mechanisms seem a bit strange. In particular, you do svk mirror to make one kind of copy of the main repository (the kind that's a "local repository"), and svk checkout to make a different kind of copy (the kind that's the "working area"). In other words, you have a tree structure, but the branches and leaves are different entities entirely and editing can only be done on the leaves. I guess that's not such a huge pain, but I think this reflects the origins: taking a local revision control system and making it distributed. Consequentially, there's a bunch of stuff left over from Subversion (like branches) that seem rather redundant in a distributed revision control system (don't take branch, make another local repository, right?); it's not that these actually hurt anything, it just means that there's a bunch of complexity for what's essentially ancillary functionality.
Another not-a-big-deal-that-rubs-me-the-wrong-way is that svk is a pile of perl modules (of course, there's probably a specific perlism for that; "epocalyptus" or something I'm sure). I suppose we'll only have to wrestle that bear to the ground once, and stuff it in a tar ball (yes, Allan, or a package). To assuage my nervousness, I'd at least like to be confident that we could maintain this ourselves, but I don't think we have the collective perl expertise (except for the aforementioned Alan) to be able to fix a bug or add a feature we desperately need.
I want this thing to work, because svk seems to be the best option currently available, but I admit that I was a bit disappointed. If we were going to use this for OpenSolaris, we'd certainly need to distribute it in a self-contained form, and really take it through the paces to make sure it could do all the things we need in the strangest edge cases. As I mentioned, we currently use teamware which is an old Sun product that's been in constant use despite being end-of-lifed several years ago. While I like it's overall design, there's a bunch of work that would need to happen for it to be suitable for OpenSolaris. In particular, it currently requires a shared NFS filesystem, and we require you to be on the host machine when committing a change. Adding networked access capabilities to it would probably be a nightmare. It also relies on SCCS (an ancient version control system) for managing individual files; not a problem per se, but a little crufty. Teamware is great and has withstood the test of time, but svk is probably closer to satisfying our requirements.
Update: But there are quite a few other options I hadn't looked into. Svk no longer looks like a front runner. If there are other systems you think are worth considering, let me know.
I'll play with svk some more and try to psych myself up for this brave new world. I'd appreciate any testimonials or feedback, and, of course, please correct all the factual errors I'm sure I committed.
Technorati tags:
EuroOSCON
OpenSolaris
svk
(2005-10-22 09:26:52.0/2005-10-19 09:30:00.0)
Permalink
Trackback: http://blogs.sun.com/ahl/entry/opensolaris_and_svk
|
|
|
Monday October 17, 2005 |
|
OpenSolaris and Subversion
I just attended Brian W. Fitzpatrick's talk on Subversion at EuroOSCON. Brian did a great job and Subversion looks like a really complete replacement for cvs -- the stated goal of the project. What I was particularly interested in was the feasibility of using Subversion as the revision control system for OpenSolaris; according to the road map we still have a few months to figure it out, but, as my grandmother always said while working away at her mechanical Turing machine, time flies when you're debating the merits of various revision control systems.
While Subversion seems like some polished software, I don't think it's a solution -- or at least a complete solution -- to the problem we have with OpenSolaris. In particular, it's not a distributed revision control system meaning that there's one master repository that manages everything including branches and sub-branches. This means that if you have a development team at the distal point on the globe from the main repository (we pretty much do), all that team's work has to traverse the globe. Now the Subversion folks have ensured that the over the wire protocol is lean, but that doesn't really address the core of the problem -- the concern isn't potentially slow communication, it's that it happens at all. Let's say a bunch of folks -- inside or outside of Sun -- start working on a project; under Subversion there's a single point of failure -- if the one server in Menlo Park goes down (or the connection to it does down), the project can't accept any more integrations. I'm also not clear if branches can have their own policies for integrations. There are a couple other issues we'd need to solve (e.g. comments are made per-integration rather than per-file), but this is by far the biggest.
Brian recommeded a talk on svk later this week; svk is a distributed revision control and source management system that's built on Subversion. I hope svk solves the problems OpenSolaris would have with Subversion, but it would be great if Subversion could eventually migrate to a distributed model. I'd also like to attend this BoF on version control systems, but I'll be busy at the OpenSolaris User Group meeting -- where I'm sure you'll be as well.
Technorati tags:
EuroOSCON
OpenSolaris
(2005-10-18 16:02:04.0/2005-10-17 08:30:00.0)
Permalink
Trackback: http://blogs.sun.com/ahl/entry/opensolaris_and_subversion
|
|
|
Sunday October 16, 2005 |
|
This week I'll be in Amsterdam attending and speaking at Euro OSCON. I'll be running a session on DTrace on Thursday at 1:30. For the uninitiated, DTrace is the systemic observability framework in OpenSolaris that allows for tracing of the kernel and applications on production systems. At OSCON in Portland, Bryan and Wez extended DTrace into php (it already worked on C/C++ and Java); since then, the DTrace community has extended DTrace into Ruby and Perl -- we'll see if we can come up with something new this week.
I'll also be tagging in on the OpenSolaris Users' Group Meeting and the OpenSolaris Technologies BoF along with Casper, Alan, Patrick, Simon, and others.
I've had a bit of a chance to explore Amsterdam a bit and it's spectacular.
My California-thinned blood won't let me forgo a comment on the rather cold weather, but my Northeastern upbringing won't let me complain too loudly.
Technorati tags:
OSCON
EuroOSCON
DTrace
(2005-10-16 13:36:47.0/2005-10-16 13:30:00.0)
Permalink
Trackback: http://blogs.sun.com/ahl/entry/at_euro_oscon
|
|
|
Thursday September 15, 2005 |
|
I hadn't been fully aware that I felt this way, but I recently had a
realization: I love the linker. It's a technology that's amazing in both its
simplicity and its complexity. I'm sure my feelings are influenced in no small
way by the caliber of the engineers working on it -- Rod
and Mike are always eager to explain how the some facet of
the linker works or to add something new and whizzy if it can't quite do what
I need.
Over the course of developing
user-level
statically defined tracing USDT, I've worked (and continue to work) with
the linker guys to figure out the best way to slot the two technologies
together. Recently, some users of USDT have run into a problem where
binaries compiled with USDT probes weren't actually making them available to
the system. We eventually tracked it down to incorrect use of the linker. I
thought it would be helpful to describe the problem and the solution in case
other people bump into something similar.
First a little bit on initialization. In a C compiler, you can specify an
initialization function like this: #pragma init(my_init). The
intention of this is to have the specified function (e.g. my_init)
called when the binary is loaded into the program. This is a good place to do
initialization like memory allocation or other set up used in the rest of the
binary. What the compiler actually does when you specify this is create a
".init" section which contains a call to the specified function.
As a concrete example (and the example relevant to this specific
manifestation of the problem), take a look at this code in
usr/src/lib/libdtrace/common/drti.c:
88 #pragma init(dtrace_dof_init)
89 static void
90 dtrace_dof_init(void)
91 {
When we compile this into an object file (which we then deliver in
/usr/lib/dtrace/drti.o), the compiler generates a .init ELF section that
contains a call to dtrace_dof_init() (actually it contains a call with
a relocation that gets filled into to be the address of
dtrace_dof_init(), but that's a detail for another blog entry).
The linker doesn't really do anything special with .init ELF sections -- it
just concatenates them like it does all other sections with the same name. So
when you compile a bunch of object files with .init sections, they just get
crammed together -- there's still nothing special that causes them to get
executed with the binary is loaded.
Here's the clever part, when a compiler invokes the linker, it provides two
special object files: crti.o at the beginning, and crtn.o at the end. You can
find those binaries on your system in /usr/lib/ or in /usr/sfw/lib/gcc/...
for the gcc version. Those binaries are where the clever part happens;
crti.o's .init section contains effectively an open brace and crtn.o contains
the close brace (the function prologue and epilogue respectively):
$ dis -t .init /usr/lib/crti.o
section .init
_init()
_init: 55 pushl %ebp
1: 8b ec movl %esp,%ebp
3: 53 pushl %ebx
4: e8 00 00 00 00 call +0x5 <0x9>
9: 5b popl %ebx
a: 81 c3 03 00 00 00 addl $0x3,%ebx
$ dis -t .init /usr/lib/crtn.o
section .init
0: 5b popl %ebx
1: c9 leave
2: c3 ret
By now you may see the punch-line: by bracketing the user-generated object
files with these crti.o and crtn.o the resulting .init section is the
concatenation of the function prologue, all the calls in the user's object
files, and finally the function epilogue. All of this is contained in the
symbol called _init.
The linker then has some magic that identifies the _init function as special
and includes a dynamic entry (DT_INIT) that causes _init to be called by the
the run-time linker (ld.so.1) when the binary is loaded. In the binary that
was built with USDT but wasn't working properly, there was a .init section
with the call to dtrace_dof_init(), but no _init symbol. The
problem was, of course, that crti.o and crtn.o weren't being specified in the
linker invocation resulting in a .init section, but no _init symbol so no
DT_INIT section, so no initialization and no USDT.
(2005-11-20 17:17:56.0/2005-09-15 14:30:00.0)
Permalink
Trackback: http://blogs.sun.com/ahl/entry/the_mysteries_of_init
|
|
|
Tuesday July 26, 2005 |
|
I really never thought I'd be sticking a picture of a penguin on my weblog, but here it is...
Joined by Bryan and Eric, I'll be speaking (and listening) at the OpenSolaris BoF at LinuxWorld (August 9, 5:30-7:00). We'll be talking about using and developing on OpenSolaris, how you can participate, and most likely comparing Solaris and Linux as well as giving the mandatory DTrace pitch. I expect it will be a little like the times I've been to Yankee stadium in full Red Sox regalia.
|
|
Technorati Tags:
OpenSolaris
LinuxWorld
DTrace
(2005-07-27 04:58:19.0/2005-07-26 00:00:01.0)
Permalink
Trackback: http://blogs.sun.com/ahl/entry/linuxworld
|
|
|
Tuesday June 14, 2005 |
|
Debugging cross calls on OpenSolaris
I think the thing I love most about debugging software is that each tough
bug can seem like an insurmountable challenge -- until you figure it out.
But until you do, each tough bugs is the hardest problem you've
ever had to solve. There are weeks when every morning presents me with
a seemingly impossible challenge, and each afternoon I get to spike my
keyboard and do a little victory dance before running for the
train.
For my first OpenSolaris blog post, I
thought I talk about one of my favorite such bugs.
This particularly
nasty bug
had to do with a tricky interaction between
VMware and
DTrace (pre-production
versions of each to be clear). My buddy
Keith --
a fellow Brown
CS Grad -- gave me a call and told me
about some strange behavior he was seeting running DTrace inside of a VMware
VM. Keith is a big fan
of DTrace, but an intermittant, but reproducable problem was putting a
damper on his DTrace enthusiasm. Every once in a while, running DTrace would
cause the system to freeze. Because Solaris was running in the virtual
environment, he could see that both virtual CPUs where spinning away, but
they weren't making any forward progress. After a couple of back and forths
over email, I made the trip down to Palo Alto so we could work on the problem
together.
Using some custom debugging tools, we were able to figure out where the two
virtual CPUs were spinning. One CPU was in xc_common() and the other
was in xc_serv() -- code to handle cross calls. So what was going on?
cross calls
Before I can really delve into the problem, I want to give just a brief
overview of cross calls. In general terms, a cross call (xcall) is used in a
multi-processor (MP) system when one CPU needs to get another CPU to do some
work. It works by sending a special type of interrupt which the remote CPU
handles. You may have heard the term interprocessor interrupt (IPI) -- same
thing. One example of when xcalls are used is when unmapping a memory region.
To unmap a region, a process will typically call the munmap(2)
system call. Remember that in an MP system, any processor may have run
threads in this process so those mappings may be present in that any CPU's
TLB. The unmap operation executes on one CPU, but the other CPUs need to
remove the relevant mappings from their own TLBs. To accomplish this
communication, the kernel uses a xcall.
DTrace uses xcalls synchronize data used by all CPUs by ensuring that all
CPUs have reached a certain point in the code. DTrace executes actions with
interrupts disabled (an explanation of why this must be so is well beyond the
bounds of this discussion) so we can tell that a CPU isn't in probe context
if its able to handle our xcall. When DTrace is stopping tracing activity,
for example, it will update some data that affects all CPUs and then use a
xcall to make sure that every CPU has seen its effects before proceeding:
dtrace_state_stop()
10739 /*
10740 * We'll set the activity to DTRACE_ACTIVITY_DRAINING, and issue a sync
10741 * to be sure that every CPU has seen it. See below for the details
10742 * on why this is done.
10743 */
10744 state->dts_activity = DTRACE_ACTIVITY_DRAINING;
10745 dtrace_sync();
dtrace_sync() sends a xcall to all other CPUs and has them spin in a
holding pattern until all CPUs have reached that point at which time the CPU
which sent the xcall releases them all (and they go back to whatever they had
been doing when they received the interrupt). That's the high level overview;
let's go into a little more detail on how xcalls work (well, actually a lot
more detail).
xcall implementation
If you follow the sequence of functions called by
dtrace_sync()
(and
I encourage you to do so), you'll find that this eventually calls
xc_common() to do the heavy lifting. It's important to note that
this call to
xc_common()
will have the sync argument set to
1. What's that mean? In a text book example of good software
engineering, someone did a
good job documenting what this value means:
xc_common()
411 /*
412 * Common code to call a specified function on a set of processors.
413 * sync specifies what kind of waiting is done.
414 * -1 - no waiting, don't release remotes
415 * 0 - no waiting, release remotes immediately
416 * 1 - run service locally w/o waiting for remotes.
417 * 2 - wait for remotes before running locally
418 */
419 static void
420 xc_common(
421 xc_func_t func,
422 xc_arg_t arg1,
423 xc_arg_t arg2,
424 xc_arg_t arg3,
425 int pri,
426 cpuset_t set,
427 int sync)
Before you start beating your brain out trying to figure out what you're
missing here, in this particular case, this destinction bewteen sync having the
value of 1 and 2 is nil: the service (function pointer specified by the
func argument) that we're running is
dtrace_sync_func()
which does literally nothing.
Let's start picking apart xc_common():
xc_common()
446 /*
447 * Request service on all remote processors.
448 */
449 for (cix = 0; cix < NCPU; cix++) {
450 if ((cpup = cpu[cix]) == NULL ||
451 (cpup->cpu_flags & CPU_READY) == 0) {
452 /*
453 * In case CPU wasn't ready, but becomes ready later,
454 * take the CPU out of the set now.
455 */
456 CPUSET_DEL(set, cix);
457 } else if (cix != lcx && CPU_IN_SET(set, cix)) {
458 CPU_STATS_ADDQ(CPU, sys, xcalls, 1);
459 cpup->cpu_m.xc_ack[pri] = 0;
460 cpup->cpu_m.xc_wait[pri] = sync;
461 if (sync > 0)
462 cpup->cpu_m.xc_state[pri] = XC_SYNC_OP;
463 else
464 cpup->cpu_m.xc_state[pri] = XC_CALL_OP;
465 cpup->cpu_m.xc_pend[pri] = 1;
466 send_dirint(cix, xc_xlat_xcptoipl[pri]);
467 }
468 }
We take a first pass through all the processors; if the processor is ready to
go and is in the set of processors we care about (they all are in the case of
dtrace_sync()) we set the remote CPU's ack flag to 0, it's
wait flag to sync (remember, 1 in this case), and it's
state flag to XC_SYNC_OP and then actually send the interrupt
with the call to send_dirint().
Next we wait for the remote CPUs to acknowledge that they've executed the
requested service which they do by setting the ack flag to 1:
xc_common()
479 /*
480 * Wait here until all remote calls complete.
481 */
482 for (cix = 0; cix < NCPU; cix++) {
483 if (lcx != cix && CPU_IN_SET(set, cix)) {
484 cpup = cpu[cix];
485 while (cpup->cpu_m.xc_ack[pri] == 0) {
486 ht_pause();
487 return_instr();
488 }
489 cpup->cpu_m.xc_ack[pri] = 0;
490 }
491 }
That while loop spins waiting for ack to become 1. If you
look at the definition of
return_instr()
it's name is actually more descriptive that you might imagine: it's just a
return instruction -- the most trivial function possible. I'm not absolutely
certain, but I think it was put there so the compiler wouldn't "optimize" the
loop away. The call to the inline function
ht_pause()
is so that the thread spins in such a way that's considerate on an
hyper-threaded
CPU. The call to ht_pause() is probably sufficient to prevent the
compiler from being overly clever, but the legacy call to
return_instr() remains.
Now let's look at the other side of this conversation: what happens on a
remote CPU as a result of this interrupt? This code is in xc_serv()
xc_serv()
138 /*
139 * Acknowledge that we have completed the x-call operation.
140 */
141 cpup->cpu_m.xc_ack[pri] = 1;
142
I'm sure it comes as no surprise that after executing the given function, it
just sets the ack flag.
Since in this case we're dealing with a synchronous xcall, the remote CPU
then needs to just chill out until the CPU that initiated the xcall discovers
that all remote CPUs have executed the function and are ready to be released:
xc_serv()
146 /*
147 * for (op == XC_SYNC_OP)
148 * Wait for the initiator of the x-call to indicate
149 * that all CPUs involved can proceed.
150 */
151 while (cpup->cpu_m.xc_wait[pri]) {
152 ht_pause();
153 return_instr();
154 }
155
156 while (cpup->cpu_m.xc_state[pri] != XC_DONE) {
157 ht_pause();
158 return_instr();
159 }
And here's the code on the initiating side that releases all the remote CPUs
by setting the wait and state flags to the values that the
remote CPUs are waiting to see:
xc_common()
502 /*
503 * Release any waiting CPUs
504 */
505 for (cix = 0; cix < NCPU; cix++) {
506 if (lcx != cix && CPU_IN_SET(set, cix)) {
507 cpup = cpu[cix];
508 if (cpup != NULL && (cpup->cpu_flags & CPU_READY)) {
509 cpup->cpu_m.xc_wait[pri] = 0;
510 cpup->cpu_m.xc_state[pri] = XC_DONE;
511 }
512 }
513 }
there's a problem
Wait! Without reading ahead in the code, does anyone see the problem?
Back at VMware, Keith hacked up a version of the virtual machine monitor
which allowed us to trace certain points in the code and figure out the
precise sequence in which they occurred. We traced the entry and return to
xc_common() and xc_serv(). Almost every time we'd see
something like this:
- enter xc_common() on CPU 0
- enter xc_serv() on CPU 1
- exit xc_serv() on CPU 1
- exit xc_common() on CPU 0
or:
- enter xc_common() on CPU 0
- enter xc_serv() on CPU 1
- exit xc_common() on CPU 0
- exit xc_serv() on CPU 1
But the problem happened when we saw a sequence like this:
- enter xc_common() on CPU 0
- enter xc_serv() on CPU 1
- exit xc_common() on CPU 0
- enter xc_common() on CPU 0
And nothing futher. What was happening was that after releasing remote CPUs,
CPU 0 was exiting from the call to xc_common() and calling it again
before the remote invocation of xc_serv() on the other CPU had a
change to exit.
Recall that one of the the first things that xc_common() does is set
the state flag. If the first call to xc_common() sets the
state flag to release the remote CPU from xc_sync(), but when
things go wrong, xc_common() was overwriting that flag before the
remote CPU got a change to see it.
the problem
We were seeing this repeatably under VMware, but no one had seen this at all
on real hardware (yet). The machine Keith and I were using was a 2-way box
running Linux. On VMware, each virtual CPU is represented by a thread on the
native OS so rather than having absolute control of the CPU, the execution
was more or less at the whim of the Linux scheduler.
When this code is running unadulterated on physical CPUs, we won't see this
problem. It's just a matter of timing -- the remote CPU has many many fewer
instructions to execute before the state flag gets overwritten by a
second xcall so there's no problem. On VMware, the Linux scheduler might
decide that's your second virtual CPU's right to the physical CPU is trumped
by moving the hands on your xclock (why not?) so there are no garuantees
about how long these operations can take.
the fix
There are actually quite a few ways to fix this problem -- I'm sure you can
think of at least one or two off the top of your head. We just need to make
sure that subsequent xcalls can't interfere with each other. When we found
this, Solaris 10 was wrapping up -- we were still making changes, but only
those deemed of the absolute highest importance. Making changes to the xcall
code (which is rather delicate and risky to change) for a bug that only
manifests itself on virtual hardware (and which VMware could work around
using some clever trickery[1]) didn't seem worthy of being designated a show-
stopper.
Keith predicted a few possible situations where this same bug could
manifest itself on physical CPUs: on hyper-threaded CPUs, or in the presence
of service management interrupts. And that prediction turned out to be spot
on: a few weeks after root causing the bug under VMware, we hit the same
problem on a system with four hyper-threaded chips (8 logical CPUs).
Since at that time we were even closer to shipping Solaris 10, I chose the
fix I thought was the safest and least likely to have nasty side effects.
After releasing remote CPUs, the code in xc_common() would now wait
for remote CPUs to check in -- wait for them to acknowledge receipt of the
directive to proceed.
xc_common()
515 /*
516 * Wait for all CPUs to acknowledge completion before we continue.
517 * Without this check it's possible (on a VM or hyper-threaded CPUs
518 * or in the presence of Service Management Interrupts which can all
519 * cause delays) for the remote processor to still be waiting by
520 * the time xc_common() is next invoked with the sync flag set
521 * resulting in a deadlock.
522 */
523 for (cix = 0; cix < NCPU; cix++) {
524 if (lcx != cix && CPU_IN_SET(set, cix)) {
525 cpup = cpu[cix];
526 if (cpup != NULL && (cpup->cpu_flags & CPU_READY)) {
527 while (cpup->cpu_m.xc_ack[pri] == 0) {
528 ht_pause();
529 return_instr();
530 }
531 cpup->cpu_m.xc_ack[pri] = 0;
532 }
533 }
534 }
In that comment, I tried to summarize in 6 lines what has just taken me
several pages to describe. And maybe I should have said "livelock" -- oh
well. Here's the complementary code in xc_serv():
xc_serv()
170 /*
171 * Acknowledge that we have received the directive to continue.
172 */
173 ASSERT(cpup->cpu_m.xc_ack[pri] == 0);
174 cpup->cpu_m.xc_ack[pri] = 1;
conclusions
That was one of my favorite bugs to work on, and it's actually fairly typical
of a lot of the bugs I investigate: something's going wrong; figure out why.
I think the folks who work on Solaris tend to love that kind of stuff as a
rule. We spend tons of time building facilities like DTrace, mdb(1), kmdb,
CTF, fancy core files, and libdis so that the hart part of investigating
mysterious problems isn't gathering data or testing hypotheses, it's
thinking of the questions to answer and inventing new hypotheses.
It's my hope that OpenSolaris will attract those types of inquisitive minds
that thrive on the (seemingly) unsolvable problem.
[1] This sort of problem is hardly unique to DTrace or to Solaris. Apparently
(and not surprisingly) there are problems like this in nearly every operating
system where the code implicitly or explicitly relies on the relative timing
of certain operations. In these cases, VMware has hacks to do things like
execute the shifty code in lock step.
Technorati Tag: OpenSolaris
Technorati Tag: Solaris
Technorati Tag: DTrace
Technorati Tag: mdb
(2005-06-14 09:15:43.0/2005-06-14 09:15:00.0)
Permalink
Trackback: http://blogs.sun.com/ahl/entry/debugging_cross_calls_on_opensolaris
|
|
|
Wednesday March 02, 2005 |
|
Beautiful S10/RHEL drag race
Network World Fusion has an article comparing Solaris 10 to Red Hat Enterprise Linux 4.0 (the self-proclaimed Solaris killer. The sweetest part of this article is the graph showing relative performance numbers. Solaris gets beat in 32-bit mode, but edges out Linux in 64-bit mode on identical hardware. AND, we've got a bunch more x86-64 performance through the pipeline. This benchmark is just in one very specific area of the system, but it's exciting nonetheless.
(2005-03-02 10:38:54.0/2005-03-02 10:40:00.0)
Permalink
Trackback: http://blogs.sun.com/ahl/entry/beautiful_s10_rhe_drag_race
|
|
|
Thursday February 17, 2005 |
|
With all the FUD coming out of Red Hat, I thought the online rags would fall all over themselves to call shotgun on the anti-Solaris bandwagon. But LinuxInsider seems to have planted their flag on rational reporting:
Stacey Quandt, Open Source Practice Leader for the Robert Frances Group, agreed. "Although many organizations continue to migrate Solaris workloads from UltraSPARC to Linux on x86 and other platforms, the latest version of Red Hat does not eliminate the need for Solaris in the enterprise," she said.
"The performance characteristics of Solaris on AMD Opteron licensing of Solaris widens the choice of open-source operating systems on commodity hardware," Quandt told LinuxInsider.
Did you catch that? Not only is Solaris a performance leader, but it's an open source operating system.
But not so fast! In another FUD barrage, it turns out there are too many OSI-approved licenses -- guess what newly opened sourced operating system is using one of the apparently extraneous licenses. Kudos to slashdot for transmogrifying the title "OSI Urged To Reform Open-Source Licensing" to "OSI Hopes To Decrease Number of Licenses" -- turns out a change in voice can have a radical change in meaning...
(2005-02-17 17:25:20.0/2005-02-17 18:00:00.0)
Permalink
Trackback: http://blogs.sun.com/ahl/entry/sanity_and_fud
|
|
|
Thursday November 25, 2004 |
|
Solaris 10 in the news: week 1
It's been just over a week since we officially launched Solaris 10 and the reactions from the press have been all over the map. Consider these two declarations:
Not many open source aficionados will realize the impact, but by making Solaris 10 free and capable of operating on any kind of hardware, Sun is making a coup in the server market.
...
As a result Linux will probably not grow much beyond its current market share of about 10% leaving Red Hat and especially Novell with a big problem.
link
Sun's announcement of the launch of free open source Solaris 10 has garnered a mild positive response from the investment community. However it also has raised a lot of skepticism from the technical community. The current plans are not good enough.
link
In case you didn't notice the URLs: those are the same guy posted about 20ns apart. Why the sudden reversal? slashdot! Yes, the Linus-loyal quickly rose to Linux's defence citing the multitude of reasons Open Solaris will fail.
It will be a while before anyone knows the relative success or failure of Open Solaris -- we're not even sure of the terms of the license -- so this is last challenge I'm going to rise to. But look: the Linux that exists today is a direct result literally and philosophically of its community. And while Solaris has so far been developed by a closed community, it also reflects the timbre of that community. For example: Linux adheres to a spartan philosophy of debugging help -- ergo no crash dumps; Solaris takes a different tack -- we require projects to come with debugger commands to examine crash dumps. It's the community that defines the project and, as is evident, it's the community that defends it, but don't discount the very real Solaris community.
The dumbest slight is against Solaris's hardware compatabiIity: do you think Linux always ran on as much stuff as it does now? Of course not. A month ago, did Solaris run on the (sick) new laptop I'm typing this on? Nope. The hardware these two operating systems support doesn't define their communities, bur rather the reverse. When people can hack up Solaris to work on their gear, they will just as they've been doing on Linux for years.
I can't wait for the license to be announced, for opensolaris.org to open for bidness, and to have people start contributing code from outside of Sun. Does anyone really doubt that these things will happen? The only question is if Open Solaris will take off -- wait and see.
(2005-03-01 12:27:06.0/2004-11-25 08:00:00.0)
Permalink
Trackback: http://blogs.sun.com/ahl/entry/solaris_10_in_the_news
|
|
|
Wednesday November 17, 2004 |
|
I was at the Solaris 10 launch for most of Monday, and it was a pretty fantastic day for everyone working on Solaris 10. I spent about two hours helping to answer questions in an online chat -- here's the transcript -- about Solaris 10 in what was dubbed a webchat sweatshop. There were a bunch of us from the Solaris group as well as Scott and some other execs all huddled around laptops while the HR folks beat the drum at a slow, but steady pace: These answers have to be on the streets of Hong Kong by morning.
I spent the rest of the afternoon talking to customers and the press -- mostly about DTrace -- and the exciting thing was that even more than before, they're getting it and they're excited about Solaris. One of the most indelible moments was when a group of us from the kernel group were talking to Jem Matzan of The Jem Report and he challenged the claim that Solaris 10 was the most innovative operating system ever. As we all painted the picture of Solaris 10, I realized that this was true. Solaris 10 isn't just a random collection of neat crap, but it comes together as a cohesive whole that innovates in each and every place an operating system can innovate. Maybe that's a stretch, but it's not far off..
(2004-11-17 17:15:43.0/2004-11-17 14:30:41.0)
Permalink
Trackback: http://blogs.sun.com/ahl/entry/solaris_10_launch
|
|
|
Thursday November 11, 2004 |
|
Check out LinuxInsider's article on Solaris 10. In particular, the last paragraph has high praise for Solaris 10 and Sun's forthcoming hardware ventures:
Bring them together in production systems and what do you get? The ability to organize your business around a single physical computer redundantly implemented in processors spread across your network -- meaning that a lot of business processes now limited by technology costs and software complexity can be simplified right down to affordability. That's what Solaris 10 is really about, and the 10-year impact is likely to be like nothing we've seen before.
Come Monday, I hope to see similar thoughtful analysis coming from other publications.
(2004-11-11 16:11:58.0/2004-11-11 16:10:55.0)
Permalink
Trackback: http://blogs.sun.com/ahl/entry/linuxinsider_on_solaris
|
|
|
|
| « May 2008 | | Sun | Mon | Tue | Wed | Thu | Fri | Sat |
|---|
| | | | | 1 | 2 | 3 | 4 | | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | | | | | | | | | | Today |
ahl at sun
Recent Entries
|