Binu Jose Philip's Weblog

Binu Jose Philip's Weblog


« Previous month (Apr 2008) | Main | Next month (Jun 2008) »
Saturday Jun 28, 2008

Happiness and patching

A few weeks ago I picked up bug#6713370 and it made me happy. 
This entry is about how it made me happy.

First, about ioctls and PxFS. Assume you are on a PxFS client
node and call say _FIOSATIME ioctl. The file system primary
is on another node. The copyin for the ioctl happens on
another machine. The user address passed to the ioctl is not
valid there. You are deep in the kernel in non-cluster code
and must find a way to solve the wrongness in address
space. How is this achieved? Enter t_copyops.

Every kernel thread has the provision for storing a pointer
to array of functions to call when a fault is incurred
during a copyxxx routine. If this field is non-zero, the
appropriate vector from the array will be called.

Before calling any ioctl, cluster will register it's own
copyops vectors, mc_copyops. In addition the thread issuing
the ioctl is also given a tsd entry that has the pid and
nodeid of the node from which the call originated.

If there is a fault during copyxxx the appropriate vector
from mc_copyops is called after the original fault handler is
restored by copyxxx. User address space access will always
result in mc_copyops vector getting called. Put in simple
terms mc_copyops routines goes to the node where the ioctl
was issued and does a copyin from the process address space,
copies this over to memory in server and allows copyin to
proceed.

mc_copyops factors in the case where a failover or node
death caused the thread which isused the ioctl to die. There
is no process or node to access the user address from. All
such ioctls will first copy over parameters to the kernel
and issue the ioctl after setting the tsd entry to point to
pid 0. mc_copyops vectors identifies such cases and treats
the passed address as a local kernel address.

The amazing thing is, to talk about why bug#6713370 made be
happy the above is not necessary. But it is good to know.

If you look at the stack in the bug you can see we panicked
while enabling logging on the underlying UFS filesystem via
_FIOLOGENABLE.  This is done whenever a new pxfs primary for
a filesystem is created.

Trap pc shows '0'! Much badness. That could happen only if
deep dark internals of copyin and fault handlers messed
things up. My suspision was ASI problems or something
similar happening in new code for sun4v.  After getting all
down and dirty trawling through the core, I had this
brainwave that I should verify parameters passed to the
ioctl call.  That is the first thing to do for any core
dump, better late than never. Sure enough for past many many
years, we were calling VOP_IOCTL() with DATAMODEL_NATIVE as
the flag instead of FKIOCTL.

FKIOCTL tells ddi_copyin() to use kcopy() instead of
bcopy(). The ioctl is called by a kernel thread and the
argument is in a kernel address. So passing DATAMODEL_NATIVE
should be a sure fire way to mess everything up. Until now
this code worked because mc_copyops identifies failover or
node death retry by a caller pid of '0'. That code kicks-in
in this case since for kernel ioctls we set the caller pid
to zero via the above mentioned tsd entry .

So why didn't it work this time?

There was another bug 6673119 which messed up the lofault
handler.  This is the error recovery handler for bcopy and
must be valid. For user<->kernel address copy lofault should
be valid after a fault recovery. copyops vector is accessed
only on the error handler. But due to 6673119 we had an
invalid handler. That is where we panicked.  kcopy should
not have this problem. Thus passing FKIOCTL to the ioctl as
intended should make everything work.

Amazingly even the above explanation is not necessary to
explain my happiness due to bug#6713370.

I had a probable root cause. Testing this should be easy
enough. But the tests were running on a different patch
level than the code I was working on. It is too much trouble
to build the whole thing after finding the correct date and
thus source gate. I thought of solutions.

FKIOCTL is a constant: 0x80000000. DATAMODEL_NATIVE for 64
bit is also a constant with the value 0x00200000. The code
that loads this into a register should be easily visible as
a sethi instruction. It should be easy to edit the binary
and change the instruction to load 0x80000000 instead of
0x00200000. Binary patching. The last time I did this was 12
years ago. Minor thrill developing!

First ask dis to disassemble the appropriate routine.

...
kernel_ioctl+0xbc:  9a 07 a7 f3  add       %fp, 0x7f3, %o5
kernel_ioctl+0xc0:  a9 2d 70 0c  sllx      %l5, 0xc, %l4
kernel_ioctl+0xc4:  17 00 08 00  sethi     %hi(0x200000), %o3	<<< DATAMODEL_NATIVE
kernel_ioctl+0xc8:  93 3e a0 00  sra       %i2, 0x0, %o1
...
kernel_ioctl+0xfc:  9a 07 a7 f3  add       %fp, 0x7f3, %o5
kernel_ioctl+0x100: af 2e 30 0c  sllx      %i0, 0xc, %l7
kernel_ioctl+0x104: 17 00 08 00  sethi     %hi(0x200000), %o3	<<< DATAMODEL_NATIVE
kernel_ioctl+0x108: 93 3e a0 00  sra       %i2, 0x0, %o1
...

17 00 08 00 is the sequence we are looking for. Now open
pxfs module in emacs and M-x hexl-mode. Search for 
a92d 700c 1700 0800

...
0005b8e0: 4000 0000 9010 0007 8090 0008 1240 0014  @............@..
0005b8f0: 3b00 0000 4000 0000 2d00 0000 aa15 a000  ;...@...-.......
0005b900: d05f a7e7 9a07 a7f3 a92d 700c 1700 0800  ._.......-p.....
   		          here it is ---^^^^^^^^^
0005b910: 933e a000 d85d 2000 4000 0000 9410 001b  .>...] .@.......
0005b920: b610 0008 4000 0000 d05f a7e7 4000 0000  ....@...._..@...
0005b930: 9010 0007 1080 000f d006 6000 b017 6000  ..........`...`.
0005b940: d05f a7e7 9a07 a7f3 af2e 300c 1700 0800  ._........0.....
   		            and here ---^^^^^^^^^
0005b950: 933e a000 d85d e000 4000 0000 9410 001b  .>...]..@.......
...

Now change and disassemble till you get the correct bits for
"sethi 0x80000000, %o3" This took a few iterations since I
couldn't bother myself to learn the binary layout for sethi.

...
0005b8f0: 3b00 0000 4000 0000 2d00 0000 aa15 a000  ;...@...-.......
0005b900: d05f a7e7 9a07 a7f3 a92d 700c 1720 0000  ._.......-p.. ..
   		          here it is ---^^^^^^^^^
0005b910: 933e a000 d85d 2000 4000 0000 9410 001b  .>...] .@.......
0005b920: b610 0008 4000 0000 d05f a7e7 4000 0000  ....@...._..@...
0005b930: 9010 0007 1080 000f d006 6000 b017 6000  ..........`...`.
0005b940: d05f a7e7 9a07 a7f3 af2e 300c 1720 0000  ._........0.. ..
   		            and here ---^^^^^^^^^
0005b950: 933e a000 d85d e000 4000 0000 9410 001b  .>...]..@.......
...

1700 0800 became 1720 0000. Disassemble and check.

...
kernel_ioctl+0xbc:  9a 07 a7 f3  add       %fp, 0x7f3, %o5
kernel_ioctl+0xc0:  a9 2d 70 0c  sllx      %l5, 0xc, %l4
kernel_ioctl+0xc4:  17 20 00 00  sethi     %hi(0x80000000), %o3  <<< FKIOCTL
kernel_ioctl+0xc8:  93 3e a0 00  sra       %i2, 0x0, %o1
...
kernel_ioctl+0xfc:  9a 07 a7 f3  add       %fp, 0x7f3, %o5
kernel_ioctl+0x100: af 2e 30 0c  sllx      %i0, 0xc, %l7
kernel_ioctl+0x104: 17 20 00 00  sethi     %hi(0x80000000), %o3   <<< FKIOCTL
kernel_ioctl+0x108: 93 3e a0 00  sra       %i2, 0x0, %o1
...

Run the tests. Everything hunky dory. I had successfuly
binary patched something after eons. And that, that made
me happy ;-)

Thursday Jun 19, 2008

raftman vs walle

I had sat down to find if I can extract part of an elf section
transpose it to another section and still keep the object file
functional. As usual one thing led to another and before I knew
I was watching a short movie called "The Raftman's Razor". I
lovvvved it. From "how uninteresting this is", I went to rapt
attention and a contented sigh at the end.



Elf editing was a receding dot in the horizon now. I forget the
association, but from Raftman I reached WALL.E. Fantastic
characterization. Now I need to watch some sort of walle excerpt
everyday. My daughter too has declared she likes walle and goes 
"Walleeeeee..", this from someone who cried while watching Donald 
Duck get hit on the head. Can't wait for wall.e to hit the screen.


Wednesday Jun 04, 2008

Deciding culture

Last few weeks I was obsessing about decisions. I had to think about
culture. I had to think about our responsibilities 
to our kid. I am  in the US now  and we have to go
back  soon. There are  no un-thwartable reasons to
go back.  The reasons I  quote are  taking care of
parents, being at a place which is really home and
belongs  to us and  allowing our  kid to  grow up
immersed in our culture and education. She is happy here enjoying
the parks  and play school and long trips.

We have  friends here who are  questioning our sanity about  going back.
We have our parents, relative and friends in India not believing we will
come back. I  have  colleagues who  counsel  me about opportunities  and
exploiting them  and giving my kid  the best possible education.  I have
other friends who tell me how much more money I can  make and how stress
free and easy life is here.

Am I really denying my a daughter good education and  a better life only
because I value  something as ephemeral  and shifting as  culture? Would
she be a better person because of  "our" culture? Do my responsibilities
have an inflated perception factor  and a very small  reality factor?  I
thought a lot, rather I obsessed a lot and went  around with a dour face
and an irritable disposition.

Finally, I have decided. I absolved myself all sins so to say. Thinking
back how I grew up, memories I cherish, where I studied, what I enjoyed, I
don't see it being very different for my daughter. For every
luxury  I  may  have missed   during childhood  there is  an
enjoyable story, some learning, an event, a reason making it
special  and  enriching  than a  lack  of  some   sort. From
mosquito  bites to crowded   open  markets, from  overloaded
public    buses to  lack  of  shopping   malls,  from  dirty
rest-rooms to inadequate infrastructure, everything has left
an impression on me. Whether good or bad or inconsequential,
all that is what has made me.

The "could have been"s are  endless, I have decided for her.
I know I am giving her what she deserves, only the best.


Today's Page Hits: 12