Today, we were seeing a very odd crash in some xen code. The core dump wasn't of great use,
since both %eip and %ebp were zeroed, which means no backtrace. Instead I attached mdb to the
process and started stepping through to see what was happening. It soon transpired that we
were crashing after successfully executing a C function called xspy_introduce_domain(), but
before we got back to the Python code that calls into it. After a little bit of head-scratching,
I looked closer at the assembly for this function:
xs.so`xspy_introduce_domain: pushl %ebp
xs.so`xspy_introduce_domain+1: movl %esp,%ebp
...
xs.so`xspy_introduce_domain+0x57: subl $0xc,%esp
xs.so`xspy_introduce_domain+0x5a: leal -0xc(%ebp),%eax
xs.so`xspy_introduce_domain+0x5d: pushl %eax
xs.so`xspy_introduce_domain+0x5e: leal -0x8(%ebp),%eax
xs.so`xspy_introduce_domain+0x61: pushl %eax
xs.so`xspy_introduce_domain+0x62: leal -0x2(%ebp),%eax
xs.so`xspy_introduce_domain+0x65: pushl %eax
xs.so`xspy_introduce_domain+0x66: pushl $0xc4b13114
xs.so`xspy_introduce_domain+0x6b: pushl 0xc(%ebp)
xs.so`xspy_introduce_domain+0x6e: call +0x43595220
...
xs.so`xspy_introduce_domain+0x11f: leave
xs.so`xspy_introduce_domain+0x120: ret
Seems OK - we're pushing three pointers onto the stack (+0x5a-0x65) and two other arguments. Let's look
at the sources:
static PyObject *xspy_introduce_domain(XsHandle *self, PyObject *args)
{
domid_t dom;
unsigned long page;
unsigned int port;
struct xs_handle *xh = xshandle(self);
bool result = 0;
if (!xh)
return NULL;
if (!PyArg_ParseTuple(args, "ili", &dom, &page, &port))
return NULL;
Looking up the definition of PyArg_ParseTuple()
gave me the clue as to the problem. The format string specifies that we're giving the addresses of an int, long,
and int. Yet in the assembly, the offsets of the leal instructions indicate we're pushing addresses to
two 32-bit storage slots, and one 16-bit slot. So when PyArg_ParseTuple() writes its 32-bit quantity,
it's going to overwrite two more bytes than it should.
As it happens, we're at the very top of the local stack storage space (-0x2(%ebp)). So those two bytes
actually end up over-writing the bottom two bytes of the old %ebp we pushed at the start of the function.
Then we pop that corrupted value back into the %ebp register via the leave. This has no effect
until our caller calls leave itself. We move %ebp into %esp, then attempt to pop from
the top of this stack into %ebp again. As it happens, the memory pointed to by the corrupt %ebp
is zeroed; thus, we end up setting %ebp to zero. Finally, our caller does a ret, which pops
another zero, but this time into %eip. Naturally this isn't a happy state of affairs, and we find ourselves
with a core dump as described earlier.
Presumably this bug happened in the first place because someone didn't notice that domid_t was a 16-bit quantity. What's amazing is that nobody else has been hitting this problem!
Tags: Xen
Trackback URL: http://blogs.sun.com/levon/entry/fun_with_stack_corruption
Posted by ALan Burlison on April 13, 2006 at 10:42 PM BST #