Adam Leventhal's Weblog

inside the sausage factory

All | DTrace | OpenSolaris | Other | Software | ZFS

20070319 Monday March 19, 2007

 Linux Defection

Ian Murdock has left the Linux Foundation to lead the operating systems strategy here at Sun. The last few years have seen some exciting changes at Sun: releasing Solaris 10 (which includes several truly revolutionary technologies), embracing x86, leading on x64, and taking Solaris open source. That a luminary of the Linux world was enticed by the changes we've made and the technologies we're creating is a huge vote of confidence. From my (admittedly biased) view, OpenSolaris has been breaking away from the pack with technologies like DTrace, ZFS, Zones, SMF, FMA and others. I'm looking forward to the contributions Ian will bring from his experience with Debian, and to the wake-up call this may be to Linux devotees: OpenSolaris is breaking away.



(2007-03-19 11:33:51.0/2007-03-19 10:24:02.0) Permalink Comments [3]
Trackback: http://blogs.sun.com/ahl/entry/linux_defection

20051213 Tuesday December 13, 2005

 DTrace for Linux

With BrandZ, it's now possible to use DTrace on Linux applications. For the uninitiated, DTrace is the dynamic tracing facility in OpenSolaris; it allows for systemic analysis of a scope and precision unequalled in the industry. With DTrace, administrators and developers can trace low level services like I/O and scheduling, up the system stack through kernel functions calls, system calls, and system library calls, and into applications written in C and C++ or any of a host of dynamic languages like Java, Ruby, Perl or php. One of my contributions to BrandZ was to extend DTrace support for Linux binaries executed in a branded Zone.

DTrace has several different instrumentation providers that know how to instrument a particular part of the system and provide relevant probes for that component. The io provider lets you trace disk I/O, the fbt (function boundary tracing) provider lets you trace any kernel function call, etc. A typical system will start with more than 30,000 probes but providers can create probes dynamically to trace new kernel modules or user-land processes. When strictly focused on a user-land application, the most useful providers are typically the syscall provider to examine system calls and the pid provider that can trace any instruction in a any process executing on the system.

For Linux processes, the pid provider just worked (well, once Russ built a library to understand the Linux run-time linker), and we introduced a new provider -- the lx-syscall provider -- to trace entry and return for emulated Linux system calls. With these providers it's possible to understand every facet of a Linux application's behavior and with the other DTrace probes it's possible to reason about an application's use of system resources. In other words, you can take that sluggish Linux application, stick it in a branded Zone, dissect it using Solaris tools, and then bring it back to a native Linux system with the fruits of your DTrace investigation[1].

To give an example of using DTrace on Linux applications, I needed an application to examine. I wanted a well known program that either didn't run on Solaris or operated sufficiently differently such examining the Linux version rather than the Solaris port made sense. I decided on /usr/bin/top partly because of the dramatic differences between how it operates on Linux vs. Solaris (due to the differences in /proc), but mostly because of what I've heard my colleague, Bryan, refer to as the "top problem": your system is slow, so you run top. What's the top process? Top!

Running top in the Linux branded zone, I opened a shell in the global (Solaris) zone to use DTrace. I started as I do on Solaris applications: I looked at system calls. I was interested to see which system calls were being executed most frequently which is easily expressed in DTrace:

bash-3.00# dtrace -n lx-syscall:::entry'/execname == "top"/{ @[probefunc] = count(); }'
dtrace: description 'lx-syscall:::entry' matched 272 probes
^C

  fstat64                                                         322
  access                                                          323
  gettimeofday                                                    323
  gtime                                                           323
  llseek                                                          323
  mmap2                                                           323
  munmap                                                          323

  select                                                          323
  getdents64                                                     1289
  lseek                                                          1291
  stat64                                                         3545
  rt_sigaction                                                   5805
  write                                                          6459
  fcntl64                                                        6772
  alarm                                                          8708
  close                                                         11282
  open                                                          14827
  read                                                          14830

Note the use of the aggregation denoted with the '@'. Aggregations are the mechanism by which DTrace allows users to examine patterns of system behavior rather than examining each individual datum -- each system call for example. (In case you also noticed the strange discrepancy between the number of open and close system calls, many of those opens are failing so it makes sense that they would have no corresponding close. I used the lx-syscall provider to suss this out, but I omitted that investigation in a vain appeal for brevity.)

There may well be something fishy about this output, but nothing struck me as so compellingly fishy to explore immediately. Instead, I fired up vi and wrote a short D script to see which system calls were taking the most time:

lx-sys.d

#!/usr/sbin/dtrace -s

lx-syscall:::entry
/execname == "top"/
{
        self->ts = vtimestamp;
}

lx-syscall:::return
/self->ts/
{
        @[probefunc] = sum(vtimestamp - self->ts);
        self->ts = 0;
}

This script creates a table of system calls and the time spent in them (in nanoseconds). The results were fairly interesting.

bash-3.00# ./lx-sys.d
dtrace: script './lx-sys.d' matched 550 probes
^C

  llseek                                                      4940978
  gtime                                                       5993454
  gettimeofday                                                6603844
  fstat64                                                    14217312
  select                                                     26594875
  lseek                                                      30956093
  mmap2                                                      43463946
  access                                                     49033498
  alarm                                                      72216971
  fcntl64                                                   188281402
  rt_sigaction                                              197646176
  stat64                                                    268188885
  close                                                     417574118
  getdents64                                                781844851
  open                                                     1314209040
  read                                                     1862007391
  write                                                    2030173630
  munmap                                                   2195846497

That seems like a lot of time spent in munmap for top. In fact, I'm rather surprised that there's any mapping and unmapping going on at all (I guess that should have raised an eyebrow after my initial system call count). Unmapping memory is a pretty expensive operation that gets more expensive on bigger systems as it requires the kernel to do some work on every CPU to completely wipe out the mapping.

I then modified lx-sys.d to record the total amount of time top spent on the CPU and the total amount of time spent in system calls to see how large a chunk of time these seemingly expensive unmap operations were taking:

lx-sys2.d

#!/usr/sbin/dtrace -s

lx-syscall:::entry
/execname == "top"/
{
        self->ts = vtimestamp;
}

lx-syscall:::return
/self->ts/
{
        @[probefunc] = sum(vtimestamp - self->ts);
        @["- all syscalls -"] = sum(vtimestamp - self->ts);
        self->ts = 0;
}

sched:::on-cpu
/execname == "top"/
{
        self->on = timestamp;
}

sched:::off-cpu
/self->on/
{
        @["- total -"] = sum(timestamp - self->on);
        self->on = 0;
}

I used the sched provider to see when top was going on and off of the CPU, and I added a row to record the total time spent in all system call. Here were the results (keep in mind I was just hitting ^C to end the experiment after a few seconds so it's expected that these numbers would be different from those above; there are ways to have more accurately timed experiments):

bash-3.00# ./lx-sys2.d
dtrace: script './lx-sys2.d' matched 550 probes
^C

  llseek                                                       939771
  gtime                                                       1088745
  gettimeofday                                                1090020
  fstat64                                                     2494614
  select                                                      4566569
  lseek                                                       5186943
  mmap2                                                       7300830
  access                                                      8587484
  alarm                                                      11671436
  fcntl64                                                    31147636
  rt_sigaction                                               33207341
  stat64                                                     45223200
  close                                                      69338595
  getdents64                                                131196732
  open                                                      220188139
  read                                                      309764996
  write                                                     340413183
  munmap                                                    365830103
  - all syscalls -                                         1589236337
  - total -                                                3258101690

So system calls are consuming roughly half of top's time on the CPU and the munmap syscall is consuming roughly a quarter of that. This was enough to convince me that there was probably room for improvement and further investigation might bear fruit.

Next, I wanted to understand what this mapped memory was being used for so I wrote a little script that traces all the functions called in the process between when memory is mapped using the mmap2(2) system call and when it's unmapped and returned to the system through the munmap(2) system call:

map.d

#!/usr/sbin/dtrace -s

#pragma D option quiet

lx-syscall::mmap2:return
/pid == $target/
{
        self->ptr = arg1;
        self->depth = 10;

        printf("%*.s <- %s syscall\n", self->depth, "", probefunc);
}

pid$target:::entry
/self->ptr/
{
        self->depth++;
        printf("%*.s -> %s`%s\n", self->depth, "", probemod, probefunc);
}

pid$target:::return
/self->ptr/
{
        printf("%*.s <- %s`%s\n", self->depth, "", probemod, probefunc);
        self->depth--;
}

lx-syscall::munmap:entry
/arg0 == self->ptr/
{
        self->depth++;
        printf("%*.s -> %s syscall\n", self->depth, "", probefunc);

        self->ptr = 0;
        self->depth = 0;
        exit(0);
}

This script uses the $target variable which means that we need to run it with the -p option where is the process ID of top. When mmap2 returns, we set a thread local variable, 'ptr', which stores the address at the base of the mapped region; for every function entry and return in the process we call printf() if self->ptr is set; finally, we exit DTrace when munmap is called with that same address. Here are the results:

bash-3.00# ./map.d -p `pgrep top`
           <- mmap2 syscall
           <- LM2`libc.so.1`syscall
          <- LM2`lx_brand.so.1`lx_emulate
         <- LM2`lx_brand.so.1`lx_handler
        <- libc.so.6`mmap
       <- libc.so.6`malloc
       -> libc.so.6`memset
       <- libc.so.6`memset
       -> libc.so.6`cfree
       <- libc.so.6`cfree
       -> libc.so.6`munmap
       <- LM2`lx_brand.so.1`lx_handler_table
       -> LM2`lx_brand.so.1`lx_handler
        -> LM2`lx_brand.so.1`lx_emulate
         -> LM2`libc.so.1`syscall
          -> munmap syscall

I traced the probemod (shared object name) in addition to probefunc (function name) so that we'd be able to differentiate proper Linux functions (in this case all in libc.so.6) from functions in the emulation library (LM2`lx_brand.so.1). What we can glean from this is that the mmap call is a result of a call to malloc() and the unmap is due to a call to free(). What's unfortunate is that we're not seeing any function calls in top itself. For some reason, the top developer chose to strip this binary (presumably to save precious 2k the symbol table would have used on disk) so we're stuck with no symbolic information for top's functions and no ability to trace the individual function calls[2], but we can still reason about this a bit more.

A little analysis in mdb revealed that cfree (an alias for free) makes a tail-call to a function that calls munmap. It seems strange to me that freeing memory would immediately results in an unmap operation (since it would cause exactly the high overhead we're seeing here. To explore this, I wrote another script which looks at what proportion of calls to malloc result in a call to mmap():

malloc.d

#!/usr/sbin/dtrace -s

pid$target::malloc:entry
{
        self->follow = arg0;
}

lx-syscall::mmap2:entry
/self->follow/
{
        @["mapped"] = count();
        self->follow = 0;
}

pid$target::malloc:return
/self->follow/
{
        @["no map"] = count();
        self->follow = 0;
}

Here are the results:

bash-3.00# ./malloc.d -p `pgrep top`
dtrace: script './malloc.d' matched 11 probes
^C

  mapped                                                          275
  no map                                                         3024

So a bunch of allocations result in a mmap, but not a huge number. Next I decided to explore if there might be a correlation between the size of the allocation and whether or not it resulted in a call to mmap using the following script:

malloc2.d

#!/usr/sbin/dtrace -s

pid$target::malloc:entry
{
        self->size = arg0;
}

lx-syscall::mmap2:entry
/self->size/
{
        @["mapped"] = quantize(self->size);
        self->size = 0;
}

pid$target::malloc:return
/self->size/
{
        @["no map"] = quantize(self->size);
        self->size = 0;
}

Rather than just counting the frequency, I used the quantize aggregating action to built a power-of-two histogram on the number of bytes being allocated (self->size). The output was quite illustrative:

bash-3.00# ./malloc2.d -p `pgrep top`
dtrace: script './malloc2.d' matched 11 probes
^C

  no map
           value  ------------- Distribution ------------- count
               2 |                                         0
               4 |@@@@@@@                                  426
               8 |@@@@@@@@@@@@@@@                          852
              16 |@@@@@@@@@@@                              639
              32 |@@@@                                     213
              64 |                                         0
             128 |                                         0
             256 |                                         0
             512 |@@@@                                     213
            1024 |                                         0

  mapped
           value  ------------- Distribution ------------- count
          131072 |                                         0
          262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 213
          524288 |                                         0

All the allocations that required a mmap were huge -- between 256k and 512k. Now it makes sense why the Linux libc allocator would treat these allocations a little differently than reasonably sized allocations. And this is clearly a smoking gun for top performance: it would do much better to preallocate a huge buffer and grow it as needed (assuming it actually needs it at all) than to malloc it each time. Tracking down the offending line of code would just require a non-stripped binary and a little DTrace invocation like this:

# dtrace -n pid`pgrep top`::malloc:entry'/arg0 >= 262144/{@[ustack()] = count()}'

From symptoms to root cause on a Linux application in a few DTrace scripts -- and it took me approximately 1000 times longer to cobble together some vaguely coherent prose describing the scripts than it did for me to actually perform the investigation. BrandZ opens up some pretty interesting new vistas for DTrace. I look forward to seeing Linux applications being brought in for tune-ups on BrandZ and then reaping those benefits either back on their mother Linux or sticking around to enjoy the fault management, ZFS, scalability, and, of course, continued access to DTrace in BrandZ.


[1] Of course, results may vary since the guts of the Linux kernel differ significantly from those of the Solaris kernel, but they're often fairly similar. I/O or scheduling problems will be slightly different, but often not so different that the conclusions lack applicability.
[2] Actually, we can can still trace function calls -- in fact, we can trace any instruction -- but it takes something of a heroic effort. We could disassemble parts of top to identify calls sites and then use esoteric pid123::-:address probe format to trace the stripped function. I said we could do it; I never said it would be pretty.


Technorati Tags:



(2006-05-02 09:14:30.0/2005-12-13 13:00:00.0) Permalink Comments [9]
Trackback: http://blogs.sun.com/ahl/entry/dtrace_for_linux

20051127 Sunday November 27, 2005

 OpenSolaris on LugRadio

The good folks over at LugRadio (that LUG; not the other LUG of course) invited me on their show to answer some of their questions about OpenSolaris. You can find it in the latest episode. Check your volume before you start playing the episode: the first reference to masturbation is approximately 6 seconds in.

Understandably, the LugRadio guys didn't have much exposure to OpenSolaris, but they were certainly interested and recognize that it represents a pretty exciting development in the world of open source. Mostly I addressed some of the differences between Solaris and Linux. They seemed quite keen on Nexenta and I recommend that Solaris and Debian users alike try it out. There's one small correction I'd like to make regarding my comments on the show: I erroneously claimed that Solaris 7 -- the first version of Solaris to support 64-bit processors -- came out in 1996. In fact, Solaris 7 came out in November of 1998 (64-bit support having been first integrated in October of 1997). That cuts our 64-bit experience to 7 or 8 years from the 10 that I claimed.

On a vaguely related note, I'll be in London speaking at the OpenSolaris user group meeting on 12/19/2005. That's after a tromp accross Europe which includes a DTrace cum Java talk and an OpenSolaris BoF at JavaPolis in Belgium.


Technorati tags:



(2005-11-27 22:31:35.0/2005-11-27 18:59:59.0) Permalink
Trackback: http://blogs.sun.com/ahl/entry/opensolaris_on_lugradio

20051019 Wednesday October 19, 2005

 Fall of Code?

Actually, Fall, Winter and Spring of code. Sun just announced the Solaris 10 Univesity Challenge Content. That's right, it's a challenge and a contest and with three times the staying power of single season of code. Apparently in a modern day treaty of Brest-Litovsk, Google ceded the rest of the year to Sun. Perhaps this was the real beef of the recent Sun/Google agreement.

This is actually pretty cool: be a college student, do something cool on OpenSolaris, take a shot at winning the grand prize of $5k and a sweet box (I imagine there might be prizes for other interesting entries). There are some ideas projects on the contest/challenge/seasonal coding page ranging from good (MythTV), to mundane (support for inkjet printers from Epson), to confusing (internet gaming -- I thought online gamling was its own reward), to inane ("A joystick driver - for gaming", oh for gaming? I've been using it for system administration). Here's my list off the top of my head -- if you want more ideas, feel free to drop me a line.

Work on an existing open source project

  • pearpc runs ppc applications on x86. I started working on porting it over and was able to boot Darwin.
  • valgrind is very cool (I've only just seen it). It would be great to port it or to use pieces like KCacheGrind and plug in DTrace as a backend.
  • Port over your favorite application, or get it running in some emulation environment of your own devising.
  • Make something go faster. MySQL, Gnome, mozilla, some random system call, whatever; there's a lot of inefficiency out there.

Write something new (using cools stuff in Solaris)

  • I'd love to see more dynamic languages with native DTrace support. We've already got support for Java, php, Ruby, and Perl in some form; make it better or add support for some other language you know and love (TCL, python, scheme, LISP, ML, etc.).
  • Build another kind of analysis tool on top of DTrace. We're working on a Java binding which is going to make this easier.
  • Write a device driver for your favorite crazy device (which I assume is your new iPod nano or something; you're such a hipster Apple fanboy).
  • Build a tool to simulate a distributed environment on Zones and use DTrace to monitor the communication. WARNING: your distributed systems professor will be your new best friend.

That's what I'd do, but if you have a better idea, go do that instead.


Technorati tags:



(2005-10-20 00:21:25.0/2005-10-19 16:15:00.0) Permalink Comments [2]
Trackback: http://blogs.sun.com/ahl/entry/fall_of_code

 OpenSolaris and svk

Today at EuroOSCON, I attended a introductory talk on svn by Chia-liang Kao. I was hopeful that svk might address some of the issues that I thought would prevent us from adopting Subversion for OpenSolaris. In particular, Subversion requires a centralized repository whereas svk, which is built on top of Subversion, provides the distributed revision control system that we'd need. After the talk, my overall impression was that svk seemed to lack a certain polish, but after struggling to phrase that in less subjective terms, I'm coming around a bit.

I got a little nervous when the first example of svk's use was for keeping the contents of /etc under revision control. The big problem that svk solved was having random directories (.svc, SCCS, whatever) in, for example, /etc/skel. Talk about trivia (almost as relevant as a demo implementing a weblog in Ruby on Rails). I guess it's nice that svk solves a problem for that particularly esoteric scenario, but wasn't there some mention that this might be used to, you know, hold onto source code? Was that actually the design center for svk?

Fortunately we did get to the details of using svk for a source code repository. I think this is just my bias coming from teamware, but some of the mechanisms seem a bit strange. In particular, you do svk mirror to make one kind of copy of the main repository (the kind that's a "local repository"), and svk checkout to make a different kind of copy (the kind that's the "working area"). In other words, you have a tree structure, but the branches and leaves are different entities entirely and editing can only be done on the leaves. I guess that's not such a huge pain, but I think this reflects the origins: taking a local revision control system and making it distributed. Consequentially, there's a bunch of stuff left over from Subversion (like branches) that seem rather redundant in a distributed revision control system (don't take branch, make another local repository, right?); it's not that these actually hurt anything, it just means that there's a bunch of complexity for what's essentially ancillary functionality.

Another not-a-big-deal-that-rubs-me-the-wrong-way is that svk is a pile of perl modules (of course, there's probably a specific perlism for that; "epocalyptus" or something I'm sure). I suppose we'll only have to wrestle that bear to the ground once, and stuff it in a tar ball (yes, Allan, or a package). To assuage my nervousness, I'd at least like to be confident that we could maintain this ourselves, but I don't think we have the collective perl expertise (except for the aforementioned Alan) to be able to fix a bug or add a feature we desperately need.

I want this thing to work, because svk seems to be the best option currently available, but I admit that I was a bit disappointed. If we were going to use this for OpenSolaris, we'd certainly need to distribute it in a self-contained form, and really take it through the paces to make sure it could do all the things we need in the strangest edge cases. As I mentioned, we currently use teamware which is an old Sun product that's been in constant use despite being end-of-lifed several years ago. While I like it's overall design, there's a bunch of work that would need to happen for it to be suitable for OpenSolaris. In particular, it currently requires a shared NFS filesystem, and we require you to be on the host machine when committing a change. Adding networked access capabilities to it would probably be a nightmare. It also relies on SCCS (an ancient version control system) for managing individual files; not a problem per se, but a little crufty. Teamware is great and has withstood the test of time, but svk is probably closer to satisfying our requirements.

Update: But there are quite a few other options I hadn't looked into. Svk no longer looks like a front runner. If there are other systems you think are worth considering, let me know.

I'll play with svk some more and try to psych myself up for this brave new world. I'd appreciate any testimonials or feedback, and, of course, please correct all the factual errors I'm sure I committed.


Technorati tags:

(2005-10-22 09:26:52.0/2005-10-19 09:30:00.0) Permalink Comments [5]
Trackback: http://blogs.sun.com/ahl/entry/opensolaris_and_svk

20051017 Monday October 17, 2005

 OpenSolaris and Subversion

I just attended Brian W. Fitzpatrick's talk on Subversion at EuroOSCON. Brian did a great job and Subversion looks like a really complete replacement for cvs -- the stated goal of the project. What I was particularly interested in was the feasibility of using Subversion as the revision control system for OpenSolaris; according to the road map we still have a few months to figure it out, but, as my grandmother always said while working away at her mechanical Turing machine, time flies when you're debating the merits of various revision control systems.

While Subversion seems like some polished software, I don't think it's a solution -- or at least a complete solution -- to the problem we have with OpenSolaris. In particular, it's not a distributed revision control system meaning that there's one master repository that manages everything including branches and sub-branches. This means that if you have a development team at the distal point on the globe from the main repository (we pretty much do), all that team's work has to traverse the globe. Now the Subversion folks have ensured that the over the wire protocol is lean, but that doesn't really address the core of the problem -- the concern isn't potentially slow communication, it's that it happens at all. Let's say a bunch of folks -- inside or outside of Sun -- start working on a project; under Subversion there's a single point of failure -- if the one server in Menlo Park goes down (or the connection to it does down), the project can't accept any more integrations. I'm also not clear if branches can have their own policies for integrations. There are a couple other issues we'd need to solve (e.g. comments are made per-integration rather than per-file), but this is by far the biggest.

Brian recommeded a talk on svk later this week; svk is a distributed revision control and source management system that's built on Subversion. I hope svk solves the problems OpenSolaris would have with Subversion, but it would be great if Subversion could eventually migrate to a distributed model. I'd also like to attend this BoF on version control systems, but I'll be busy at the OpenSolaris User Group meeting -- where I'm sure you'll be as well.


Technorati tags:

(2005-10-18 16:02:04.0/2005-10-17 08:30:00.0) Permalink Comments [13]
Trackback: http://blogs.sun.com/ahl/entry/opensolaris_and_subversion

20051016 Sunday October 16, 2005

 At Euro OSCON

This week I'll be in Amsterdam attending and speaking at Euro OSCON. I'll be running a session on DTrace on Thursday at 1:30. For the uninitiated, DTrace is the systemic observability framework in OpenSolaris that allows for tracing of the kernel and applications on production systems. At OSCON in Portland, Bryan and Wez extended DTrace into php (it already worked on C/C++ and Java); since then, the DTrace community has extended DTrace into Ruby and Perl -- we'll see if we can come up with something new this week.

I'll also be tagging in on the OpenSolaris Users' Group Meeting and the OpenSolaris Technologies BoF along with Casper, Alan, Patrick, Simon, and others.

I've had a bit of a chance to explore Amsterdam a bit and it's spectacular. My California-thinned blood won't let me forgo a comment on the rather cold weather, but my Northeastern upbringing won't let me complain too loudly.


Technorati tags:

(2005-10-16 13:36:47.0/2005-10-16 13:30:00.0) Permalink
Trackback: http://blogs.sun.com/ahl/entry/at_euro_oscon

20050915 Thursday September 15, 2005

 The mysteries of _init

I hadn't been fully aware that I felt this way, but I recently had a realization: I love the linker. It's a technology that's amazing in both its simplicity and its complexity. I'm sure my feelings are influenced in no small way by the caliber of the engineers working on it -- Rod and Mike are always eager to explain how the some facet of the linker works or to add something new and whizzy if it can't quite do what I need.

Over the course of developing user-level statically defined tracing USDT, I've worked (and continue to work) with the linker guys to figure out the best way to slot the two technologies together. Recently, some users of USDT have run into a problem where binaries compiled with USDT probes weren't actually making them available to the system. We eventually tracked it down to incorrect use of the linker. I thought it would be helpful to describe the problem and the solution in case other people bump into something similar.

First a little bit on initialization. In a C compiler, you can specify an initialization function like this: #pragma init(my_init). The intention of this is to have the specified function (e.g. my_init) called when the binary is loaded into the program. This is a good place to do initialization like memory allocation or other set up used in the rest of the binary. What the compiler actually does when you specify this is create a ".init" section which contains a call to the specified function.

As a concrete example (and the example relevant to this specific manifestation of the problem), take a look at this code in usr/src/lib/libdtrace/common/drti.c:

     88 #pragma init(dtrace_dof_init)
     89 static void
     90 dtrace_dof_init(void)
     91 {

When we compile this into an object file (which we then deliver in /usr/lib/dtrace/drti.o), the compiler generates a .init ELF section that contains a call to dtrace_dof_init() (actually it contains a call with a relocation that gets filled into to be the address of dtrace_dof_init(), but that's a detail for another blog entry).

The linker doesn't really do anything special with .init ELF sections -- it just concatenates them like it does all other sections with the same name. So when you compile a bunch of object files with .init sections, they just get crammed together -- there's still nothing special that causes them to get executed with the binary is loaded.

Here's the clever part, when a compiler invokes the linker, it provides two special object files: crti.o at the beginning, and crtn.o at the end. You can find those binaries on your system in /usr/lib/ or in /usr/sfw/lib/gcc/... for the gcc version. Those binaries are where the clever part happens; crti.o's .init section contains effectively an open brace and crtn.o contains the close brace (the function prologue and epilogue respectively):

$ dis -t .init /usr/lib/crti.o
section .init
_init()
    _init:                  55                 pushl  %ebp
    1:                      8b ec              movl   %esp,%ebp
    3:                      53                 pushl  %ebx
    4:                      e8 00 00 00 00     call   +0x5      <0x9>
    9:                      5b                 popl   %ebx
    a:                      81 c3 03 00 00 00  addl   $0x3,%ebx

$ dis -t .init /usr/lib/crtn.o
section .init
    0:                      5b                 popl   %ebx
    1:                      c9                 leave
    2:                      c3                 ret

By now you may see the punch-line: by bracketing the user-generated object files with these crti.o and crtn.o the resulting .init section is the concatenation of the function prologue, all the calls in the user's object files, and finally the function epilogue. All of this is contained in the symbol called _init.

The linker then has some magic that identifies the _init function as special and includes a dynamic entry (DT_INIT) that causes _init to be called by the the run-time linker (ld.so.1) when the binary is loaded. In the binary that was built with USDT but wasn't working properly, there was a .init section with the call to dtrace_dof_init(), but no _init symbol. The problem was, of course, that crti.o and crtn.o weren't being specified in the linker invocation resulting in a .init section, but no _init symbol so no DT_INIT section, so no initialization and no USDT.



(2005-11-20 17:17:56.0/2005-09-15 14:30:00.0) Permalink Comments [2]
Trackback: http://blogs.sun.com/ahl/entry/the_mysteries_of_init

20050726 Tuesday July 26, 2005

 Believe it or not...

I really never thought I'd be sticking a picture of a penguin on my weblog, but here it is...
Joined by Bryan and Eric, I'll be speaking (and listening) at the OpenSolaris BoF at LinuxWorld (August 9, 5:30-7:00). We'll be talking about using and developing on OpenSolaris, how you can participate, and most likely comparing Solaris and Linux as well as giving the mandatory DTrace pitch. I expect it will be a little like the times I've been to Yankee stadium in full Red Sox regalia.

Technorati Tags:



(2005-07-27 04:58:19.0/2005-07-26 00:00:01.0) Permalink
Trackback: http://blogs.sun.com/ahl/entry/linuxworld

20050614 Tuesday June 14, 2005

 Debugging cross calls on OpenSolaris

I think the thing I love most about debugging software is that each tough bug can seem like an insurmountable challenge -- until you figure it out. But until you do, each tough bugs is the hardest problem you've ever had to solve. There are weeks when every morning presents me with a seemingly impossible challenge, and each afternoon I get to spike my keyboard and do a little victory dance before running for the train.

For my first OpenSolaris blog post, I thought I talk about one of my favorite such bugs. This particularly nasty bug had to do with a tricky interaction between VMware and DTrace (pre-production versions of each to be clear). My buddy Keith -- a fellow Brown CS Grad -- gave me a call and told me about some strange behavior he was seeting running DTrace inside of a VMware VM. Keith is a big fan of DTrace, but an intermittant, but reproducable problem was putting a damper on his DTrace enthusiasm. Every once in a while, running DTrace would cause the system to freeze. Because Solaris was running in the virtual environment, he could see that both virtual CPUs where spinning away, but they weren't making any forward progress. After a couple of back and forths over email, I made the trip down to Palo Alto so we could work on the problem together.

Using some custom debugging tools, we were able to figure out where the two virtual CPUs were spinning. One CPU was in xc_common() and the other was in xc_serv() -- code to handle cross calls. So what was going on?

cross calls

Before I can really delve into the problem, I want to give just a brief overview of cross calls. In general terms, a cross call (xcall) is used in a multi-processor (MP) system when one CPU needs to get another CPU to do some work. It works by sending a special type of interrupt which the remote CPU handles. You may have heard the term interprocessor interrupt (IPI) -- same thing. One example of when xcalls are used is when unmapping a memory region. To unmap a region, a process will typically call the munmap(2) system call. Remember that in an MP system, any processor may have run threads in this process so those mappings may be present in that any CPU's TLB. The unmap operation executes on one CPU, but the other CPUs need to remove the relevant mappings from their own TLBs. To accomplish this communication, the kernel uses a xcall.

DTrace uses xcalls synchronize data used by all CPUs by ensuring that all CPUs have reached a certain point in the code. DTrace executes actions with interrupts disabled (an explanation of why this must be so is well beyond the bounds of this discussion) so we can tell that a CPU isn't in probe context if its able to handle our xcall. When DTrace is stopping tracing activity, for example, it will update some data that affects all CPUs and then use a xcall to make sure that every CPU has seen its effects before proceeding:

dtrace_state_stop()

10739    	/*
10740    	 * We'll set the activity to DTRACE_ACTIVITY_DRAINING, and issue a sync
10741    	 * to be sure that every CPU has seen it.  See below for the details
10742    	 * on why this is done.
10743    	 */
10744    	state->dts_activity = DTRACE_ACTIVITY_DRAINING;
10745    	dtrace_sync();

dtrace_sync() sends a xcall to all other CPUs and has them spin in a holding pattern until all CPUs have reached that point at which time the CPU which sent the xcall releases them all (and they go back to whatever they had been doing when they received the interrupt). That's the high level overview; let's go into a little more detail on how xcalls work (well, actually a lot more detail).

xcall implementation

If you follow the sequence of functions called by dtrace_sync() (and I encourage you to do so), you'll find that this eventually calls xc_common() to do the heavy lifting. It's important to note that this call to xc_common() will have the sync argument set to 1. What's that mean? In a text book example of good software engineering, someone did a good job documenting what this value means:

xc_common()

 411    /*
 412     * Common code to call a specified function on a set of processors.
 413     * sync specifies what kind of waiting is done.
 414     *	-1 - no waiting, don't release remotes
 415     *	0 - no waiting, release remotes immediately
 416     *	1 - run service locally w/o waiting for remotes.
 417     *	2 - wait for remotes before running locally
 418     */
 419    static void
 420    xc_common(
 421    	xc_func_t func,
 422    	xc_arg_t arg1,
 423    	xc_arg_t arg2,
 424    	xc_arg_t arg3,
 425    	int pri,
 426    	cpuset_t set,
 427    	int sync)

Before you start beating your brain out trying to figure out what you're missing here, in this particular case, this destinction bewteen sync having the value of 1 and 2 is nil: the service (function pointer specified by the func argument) that we're running is dtrace_sync_func() which does literally nothing.

Let's start picking apart xc_common():

xc_common()

 446    	/*
 447    	 * Request service on all remote processors.
 448    	 */
 449    	for (cix = 0; cix < NCPU; cix++) {
 450    		if ((cpup = cpu[cix]) == NULL ||
 451    		    (cpup->cpu_flags & CPU_READY) == 0) {
 452    			/*
 453    			 * In case CPU wasn't ready, but becomes ready later,
 454    			 * take the CPU out of the set now.
 455    			 */
 456    			CPUSET_DEL(set, cix);
 457    		} else if (cix != lcx && CPU_IN_SET(set, cix)) {
 458    			CPU_STATS_ADDQ(CPU, sys, xcalls, 1);
 459    			cpup->cpu_m.xc_ack[pri] = 0;
 460    			cpup->cpu_m.xc_wait[pri] = sync;
 461    			if (sync > 0)
 462    				cpup->cpu_m.xc_state[pri] = XC_SYNC_OP;
 463    			else
 464    				cpup->cpu_m.xc_state[pri] = XC_CALL_OP;
 465    			cpup->cpu_m.xc_pend[pri] = 1;
 466    			send_dirint(cix, xc_xlat_xcptoipl[pri]);
 467    		}
 468    	}

We take a first pass through all the processors; if the processor is ready to go and is in the set of processors we care about (they all are in the case of dtrace_sync()) we set the remote CPU's ack flag to 0, it's wait flag to sync (remember, 1 in this case), and it's state flag to XC_SYNC_OP and then actually send the interrupt with the call to send_dirint().

Next we wait for the remote CPUs to acknowledge that they've executed the requested service which they do by setting the ack flag to 1:

xc_common()

 479    	/*
 480    	 * Wait here until all remote calls complete.
 481    	 */
 482    	for (cix = 0; cix < NCPU; cix++) {
 483    		if (lcx != cix && CPU_IN_SET(set, cix)) {
 484    			cpup = cpu[cix];
 485    			while (cpup->cpu_m.xc_ack[pri] == 0) {
 486    				ht_pause();
 487    				return_instr();
 488    			}
 489    			cpup->cpu_m.xc_ack[pri] = 0;
 490    		}
 491    	}

That while loop spins waiting for ack to become 1. If you look at the definition of return_instr() it's name is actually more descriptive that you might imagine: it's just a return instruction -- the most trivial function possible. I'm not absolutely certain, but I think it was put there so the compiler wouldn't "optimize" the loop away. The call to the inline function ht_pause() is so that the thread spins in such a way that's considerate on an hyper-threaded CPU. The call to ht_pause() is probably sufficient to prevent the compiler from being overly clever, but the legacy call to return_instr() remains.

Now let's look at the other side of this conversation: what happens on a remote CPU as a result of this interrupt? This code is in xc_serv()

xc_serv()

 138    	/*
 139    	 * Acknowledge that we have completed the x-call operation.
 140    	 */
 141    	cpup->cpu_m.xc_ack[pri] = 1;
 142    

I'm sure it comes as no surprise that after executing the given function, it just sets the ack flag.

Since in this case we're dealing with a synchronous xcall, the remote CPU then needs to just chill out until the CPU that initiated the xcall discovers that all remote CPUs have executed the function and are ready to be released:

xc_serv()

 146    	/*
 147    	 * for (op == XC_SYNC_OP)
 148    	 * Wait for the initiator of the x-call to indicate
 149    	 * that all CPUs involved can proceed.
 150    	 */
 151    	while (cpup->cpu_m.xc_wait[pri]) {
 152    		ht_pause();
 153    		return_instr();
 154    	}
 155    
 156    	while (cpup->cpu_m.xc_state[pri] != XC_DONE) {
 157    		ht_pause();
 158    		return_instr();
 159    	}

And here's the code on the initiating side that releases all the remote CPUs by setting the wait and state flags to the values that the remote CPUs are waiting to see:

xc_common()

 502    	/*
 503    	 * Release any waiting CPUs
 504    	 */
 505    	for (cix = 0; cix < NCPU; cix++) {
 506    		if (lcx != cix && CPU_IN_SET(set, cix)) {
 507    			cpup = cpu[cix];
 508    			if (cpup != NULL && (cpup->cpu_flags & CPU_READY)) {
 509    				cpup->cpu_m.xc_wait[pri] = 0;
 510    				cpup->cpu_m.xc_state[pri] = XC_DONE;
 511    			}
 512    		}
 513    	}

there's a problem

Wait! Without reading ahead in the code, does anyone see the problem?

Back at VMware, Keith hacked up a version of the virtual machine monitor which allowed us to trace certain points in the code and figure out the precise sequence in which they occurred. We traced the entry and return to xc_common() and xc_serv(). Almost every time we'd see something like this:

  • enter xc_common() on CPU 0
  • enter xc_serv() on CPU 1
  • exit xc_serv() on CPU 1
  • exit xc_common() on CPU 0

or:

  • enter xc_common() on CPU 0
  • enter xc_serv() on CPU 1
  • exit xc_common() on CPU 0
  • exit xc_serv() on CPU 1

But the problem happened when we saw a sequence like this:

  • enter xc_common() on CPU 0
  • enter xc_serv() on CPU 1
  • exit xc_common() on CPU 0
  • enter xc_common() on CPU 0

And nothing futher. What was happening was that after releasing remote CPUs, CPU 0 was exiting from the call to xc_common() and calling it again before the remote invocation of xc_serv() on the other CPU had a change to exit.

Recall that one of the the first things that xc_common() does is set the state flag. If the first call to xc_common() sets the state flag to release the remote CPU from xc_sync(), but when things go wrong, xc_common() was overwriting that flag before the remote CPU got a change to see it.

the problem

We were seeing this repeatably under VMware, but no one had seen this at all on real hardware (yet). The machine Keith and I were using was a 2-way box running Linux. On VMware, each virtual CPU is represented by a thread on the native OS so rather than having absolute control of the CPU, the execution was more or less at the whim of the Linux scheduler.

When this code is running unadulterated on physical CPUs, we won't see this problem. It's just a matter of timing -- the remote CPU has many many fewer instructions to execute before the state flag gets overwritten by a second xcall so there's no problem. On VMware, the Linux scheduler might decide that's your second virtual CPU's right to the physical CPU is trumped by moving the hands on your xclock (why not?) so there are no garuantees about how long these operations can take.

the fix

There are actually quite a few ways to fix this problem -- I'm sure you can think of at least one or two off the top of your head. We just need to make sure that subsequent xcalls can't interfere with each other. When we found this, Solaris 10 was wrapping up -- we were still making changes, but only those deemed of the absolute highest importance. Making changes to the xcall code (which is rather delicate and risky to change) for a bug that only manifests itself on virtual hardware (and which VMware could work around using some clever trickery[1]) didn't seem worthy of being designated a show- stopper.

Keith predicted a few possible situations where this same bug could manifest itself on physical CPUs: on hyper-threaded CPUs, or in the presence of service management interrupts. And that prediction turned out to be spot on: a few weeks after root causing the bug under VMware, we hit the same problem on a system with four hyper-threaded chips (8 logical CPUs).

Since at that time we were even closer to shipping Solaris 10, I chose the fix I thought was the safest and least likely to have nasty side effects. After releasing remote CPUs, the code in xc_common() would now wait for remote CPUs to check in -- wait for them to acknowledge receipt of the directive to proceed.

xc_common()

 515    	/*
 516    	 * Wait for all CPUs to acknowledge completion before we continue.
 517    	 * Without this check it's possible (on a VM or hyper-threaded CPUs
 518    	 * or in the presence of Service Management Interrupts which can all
 519    	 * cause delays) for the remote processor to still be waiting by
 520    	 * the time xc_common() is next invoked with the sync flag set
 521    	 * resulting in a deadlock.
 522    	 */
 523    	for (cix = 0; cix < NCPU; cix++) {
 524    		if (lcx != cix && CPU_IN_SET(set, cix)) {
 525    			cpup = cpu[cix];
 526    			if (cpup != NULL && (cpup->cpu_flags & CPU_READY)) {
 527    				while (cpup->cpu_m.xc_ack[pri] == 0) {
 528    					ht_pause();
 529    					return_instr();
 530    				}
 531    				cpup->cpu_m.xc_ack[pri] = 0;
 532    			}
 533    		}
 534    	}

In that comment, I tried to summarize in 6 lines what has just taken me several pages to describe. And maybe I should have said "livelock" -- oh well. Here's the complementary code in xc_serv():

xc_serv()

 170    	/*
 171    	 * Acknowledge that we have received the directive to continue.
 172    	 */
 173    	ASSERT(cpup->cpu_m.xc_ack[pri] == 0);
 174    	cpup->cpu_m.xc_ack[pri] = 1;

conclusions

That was one of my favorite bugs to work on, and it's actually fairly typical of a lot of the bugs I investigate: something's going wrong; figure out why. I think the folks who work on Solaris tend to love that kind of stuff as a rule. We spend tons of time building facilities like DTrace, mdb(1), kmdb, CTF, fancy core files, and libdis so that the hart part of investigating mysterious problems isn't gathering data or testing hypotheses, it's thinking of the questions to answer and inventing new hypotheses. It's my hope that OpenSolaris will attract those types of inquisitive minds that thrive on the (seemingly) unsolvable problem.


[1] This sort of problem is hardly unique to DTrace or to Solaris. Apparently (and not surprisingly) there are problems like this in nearly every operating system where the code implicitly or explicitly relies on the relative timing of certain operations. In these cases, VMware has hacks to do things like execute the shifty code in lock step.

Technorati Tag:
Technorati Tag:
Technorati Tag:
Technorati Tag:



(2005-06-14 09:15:43.0/2005-06-14 09:15:00.0) Permalink Comments [1]
Trackback: http://blogs.sun.com/ahl/entry/debugging_cross_calls_on_opensolaris

20050302 Wednesday March 02, 2005

 Beautiful S10/RHEL drag race

Network World Fusion has an article comparing Solaris 10 to Red Hat Enterprise Linux 4.0 (the self-proclaimed Solaris killer. The sweetest part of this article is the graph showing relative performance numbers. Solaris gets beat in 32-bit mode, but edges out Linux in 64-bit mode on identical hardware. AND, we've got a bunch more x86-64 performance through the pipeline. This benchmark is just in one very specific area of the system, but it's exciting nonetheless.



(2005-03-02 10:38:54.0/2005-03-02 10:40:00.0) Permalink
Trackback: http://blogs.sun.com/ahl/entry/beautiful_s10_rhe_drag_race

20050217 Thursday February 17, 2005

 Sanity and FUD

With all the FUD coming out of Red Hat, I thought the online rags would fall all over themselves to call shotgun on the anti-Solaris bandwagon. But LinuxInsider seems to have planted their flag on rational reporting:

Stacey Quandt, Open Source Practice Leader for the Robert Frances Group, agreed. "Although many organizations continue to migrate Solaris workloads from UltraSPARC to Linux on x86 and other platforms, the latest version of Red Hat does not eliminate the need for Solaris in the enterprise," she said.

"The performance characteristics of Solaris on AMD Opteron licensing of Solaris widens the choice of open-source operating systems on commodity hardware," Quandt told LinuxInsider.

Did you catch that? Not only is Solaris a performance leader, but it's an open source operating system.

But not so fast! In another FUD barrage, it turns out there are too many OSI-approved licenses -- guess what newly opened sourced operating system is using one of the apparently extraneous licenses. Kudos to slashdot for transmogrifying the title "OSI Urged To Reform Open-Source Licensing" to "OSI Hopes To Decrease Number of Licenses" -- turns out a change in voice can have a radical change in meaning...



(2005-02-17 17:25:20.0/2005-02-17 18:00:00.0) Permalink Comments [8]
Trackback: http://blogs.sun.com/ahl/entry/sanity_and_fud

20041125 Thursday November 25, 2004

 Solaris 10 in the news: week 1

It's been just over a week since we officially launched Solaris 10 and the reactions from the press have been all over the map. Consider these two declarations:

Not many open source aficionados will realize the impact, but by making Solaris 10 free and capable of operating on any kind of hardware, Sun is making a coup in the server market.
...
As a result Linux will probably not grow much beyond its current market share of about 10% leaving Red Hat and especially Novell with a big problem.

link

Sun's announcement of the launch of free open source Solaris 10 has garnered a mild positive response from the investment community. However it also has raised a lot of skepticism from the technical community. The current plans are not good enough.

link

In case you didn't notice the URLs: those are the same guy posted about 20ns apart. Why the sudden reversal? slashdot! Yes, the Linus-loyal quickly rose to Linux's defence citing the multitude of reasons Open Solaris will fail.

It will be a while before anyone knows the relative success or failure of Open Solaris -- we're not even sure of the terms of the license -- so this is last challenge I'm going to rise to. But look: the Linux that exists today is a direct result literally and philosophically of its community. And while Solaris has so far been developed by a closed community, it also reflects the timbre of that community. For example: Linux adheres to a spartan philosophy of debugging help -- ergo no crash dumps; Solaris takes a different tack -- we require projects to come with debugger commands to examine crash dumps. It's the community that defines the project and, as is evident, it's the community that defends it, but don't discount the very real Solaris community.

The dumbest slight is against Solaris's hardware compatabiIity: do you think Linux always ran on as much stuff as it does now? Of course not. A month ago, did Solaris run on the (sick) new laptop I'm typing this on? Nope. The hardware these two operating systems support doesn't define their communities, bur rather the reverse. When people can hack up Solaris to work on their gear, they will just as they've been doing on Linux for years.

I can't wait for the license to be announced, for opensolaris.org to open for bidness, and to have people start contributing code from outside of Sun. Does anyone really doubt that these things will happen? The only question is if Open Solaris will take off -- wait and see.



(2005-03-01 12:27:06.0/2004-11-25 08:00:00.0) Permalink Comments [4]
Trackback: http://blogs.sun.com/ahl/entry/solaris_10_in_the_news

20041117 Wednesday November 17, 2004

 Solaris 10 Launch

I was at the Solaris 10 launch for most of Monday, and it was a pretty fantastic day for everyone working on Solaris 10. I spent about two hours helping to answer questions in an online chat -- here's the transcript -- about Solaris 10 in what was dubbed a webchat sweatshop. There were a bunch of us from the Solaris group as well as Scott and some other execs all huddled around laptops while the HR folks beat the drum at a slow, but steady pace: These answers have to be on the streets of Hong Kong by morning.

I spent the rest of the afternoon talking to customers and the press -- mostly about DTrace -- and the exciting thing was that even more than before, they're getting it and they're excited about Solaris. One of the most indelible moments was when a group of us from the kernel group were talking to Jem Matzan of The Jem Report and he challenged the claim that Solaris 10 was the most innovative operating system ever. As we all painted the picture of Solaris 10, I realized that this was true. Solaris 10 isn't just a random collection of neat crap, but it comes together as a cohesive whole that innovates in each and every place an operating system can innovate. Maybe that's a stretch, but it's not far off..



(2004-11-17 17:15:43.0/2004-11-17 14:30:41.0) Permalink
Trackback: http://blogs.sun.com/ahl/entry/solaris_10_launch

20041111 Thursday November 11, 2004

 LinuxInsider on Solaris

Check out LinuxInsider's article on Solaris 10. In particular, the last paragraph has high praise for Solaris 10 and Sun's forthcoming hardware ventures:

Bring them together in production systems and what do you get? The ability to organize your business around a single physical computer redundantly implemented in processors spread across your network -- meaning that a lot of business processes now limited by technology costs and software complexity can be simplified right down to affordability. That's what Solaris 10 is really about, and the 10-year impact is likely to be like nothing we've seen before.

Come Monday, I hope to see similar thoughtful analysis coming from other publications.



(2004-11-11 16:11:58.0/2004-11-11 16:10:55.0) Permalink
Trackback: http://blogs.sun.com/ahl/entry/linuxinsider_on_solaris

XML

« May 2008
SunMonTueWedThuFriSat
    
1
2
3
4
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
       
Today

ahl at sun







Recent Entries