The View from the Moon

20080404 Friday April 04, 2008

Finding bugs in python code, using DTrace

I have spent a lot of time in recent months helping the IPS project.  This week I integrated usability improvements, and tracked down two different performance problems.  The first was notable because I think it's the kind of thing that would have been hard to find without DTrace.   Here's a writeup I did, cribbed from the bug report:

While doing work on IPS I discovered a large number of system calls being made during the "creating plan" phase of a fresh installation of the 'redistributable' cluster of packages.

I used the following crude dtrace script to track down the problem:

#!/usr/sbin/dtrace -Fs

python$1:::function-entry
/copyinstr(arg1) == $$2/
{
self->t=1;
calls++;
}

python$1:::function-return
/copyinstr(arg1) == $$2 && self->t/
{
self->t=0;
}

syscall:::entry
/self->t/
{
@a[probefunc]=count()
}

END
{
printf("calls: %d", calls);
}

No points for elegance but it works.

Successive guesses allowed me to narrow this down to the get_link_actions() routine. Here's an example of its system call impact during the following sequence:

# rm -fr /tmp/foo
# export PKG_IMAGE=/tmp/foo
# pkg image-create -a foo=http://ipkg.eng /tmp/foo
# pkg install -n redistributable
I ran the D script as follows:
$ dtrace -s /s4pyfunc.d `pgrep client` get_link_actions
And once the command was done, got this output:
CPU     ID                    FUNCTION:NAME
0 2 :END calls: 880

brk 26
stat64 774400
close 775280
fcntl 775280
fsat 775280
fstat64 775280
getdents64 1553200
So here we can see that something is amiss-- this routine is the source of 4.7 million syscalls. Here's the code:
        def get_link_actions(self):
""" return a dictionary of hardlink action lists indexed by
target """
(a) if self.link_actions:
return self.link_actions

d = {}
for act in self.gen_installed_actions():
if act.name == "hardlink":
t = act.get_target_path()
if t in d:
d[t].append(act)
else:
d[t] = [act]

self.link_actions = d
return d

When Bart and I first looked at this, it seemed by inspection to be working as designed-- the first time the routine is called, we populate a cache, and subsequently the cache is returned. At some point Bart went "Aha!" and we realized that the problem here was that the comparison at (a) wasn't working right, and that, for a not-yet-installed image, the link_actions dictionary is exactly equal to {}, the empty dictionary. That evaluates to "false" in python. So we always think that we haven't yet filled out the link_actions cache, and go to fill it. That sends us into gen_installed_actions(), which is expensive. It seems to get more and more expensive as we accumulate more manifests, as well.

So, while there are exactly zero link actions we need to worry about, we rescan all manifests for every new manifest we evaluate.

This is a situation which will crop up most often with zones, since that's the place we're most likely to create new images from scratch.

The fix is very simple: set self->link_actions to None initially (in the object constructor), and change the comparison at (a) to test against None explictly.

Having fixed this, the impact of get_link_actions is:
CPU     ID                    FUNCTION:NAME
0 2 :END calls: 880

stat64 3
close 4
fcntl 4
fsat 4
fstat64 4
getdents64 8
Or, basically, nothing. In some informal testing I got these numbers with ptime(1): Before:
real     1:26.730
user 1:10.988
sys 11.624
After:
real     1:12.932
user 1:02.475
sys 6.361

Or about a 19% improvement in real time. And quite a bit in sys time. Not a huge win, but not too shabby either.

I used the same technique to find a similar bug in another part of the code, but alas, Danek had already found the same bug, so I can't claim it.

(2008-04-04 18:10:51.0) Permalink Comments [3]
Tags:
Trackback: http://blogs.sun.com/dp/entry/finding_bugs_in_python_code
 

Comments:

Yeah, but Danek probably won't blog about it. :)

Posted by Danek Duvall on April 04, 2008 at 10:36 PM PDT #

Maybe the python profiler can help too in such situations ?

http://docs.python.org/lib/module-profile.html

Here is how I call it on my projects, from the main, basically.

http://code.google.com/p/pytof/source/browse/trunk/ctof.py

Posted by bsergean on April 06, 2008 at 09:35 AM PDT #

bsergean, I've been using the python profiler, as well. We spotted that Action (one of our classes) creation was taking a long time by using the profiler, and subsequent improvements got about a 40% improvement on a simple benchmark we put together.

Posted by Dan Price on April 06, 2008 at 06:01 PM PDT #

Post a Comment:

Comments are closed for this entry.
Dan Price's Weblog
[about me]