Binu Jose Philip's Weblog
Happiness and patching
A few weeks ago I picked up bug#6713370 and it made me happy. This entry is about how it made me happy. First, about ioctls and PxFS. Assume you are on a PxFS client node and call say _FIOSATIME ioctl. The file system primary is on another node. The copyin for the ioctl happens on another machine. The user address passed to the ioctl is not valid there. You are deep in the kernel in non-cluster code and must find a way to solve the wrongness in address space. How is this achieved? Enter t_copyops. Every kernel thread has the provision for storing a pointer to array of functions to call when a fault is incurred during a copyxxx routine. If this field is non-zero, the appropriate vector from the array will be called. Before calling any ioctl, cluster will register it's own copyops vectors, mc_copyops. In addition the thread issuing the ioctl is also given a tsd entry that has the pid and nodeid of the node from which the call originated. If there is a fault during copyxxx the appropriate vector from mc_copyops is called after the original fault handler is restored by copyxxx. User address space access will always result in mc_copyops vector getting called. Put in simple terms mc_copyops routines goes to the node where the ioctl was issued and does a copyin from the process address space, copies this over to memory in server and allows copyin to proceed. mc_copyops factors in the case where a failover or node death caused the thread which isused the ioctl to die. There is no process or node to access the user address from. All such ioctls will first copy over parameters to the kernel and issue the ioctl after setting the tsd entry to point to pid 0. mc_copyops vectors identifies such cases and treats the passed address as a local kernel address. The amazing thing is, to talk about why bug#6713370 made be happy the above is not necessary. But it is good to know. If you look at the stack in the bug you can see we panicked while enabling logging on the underlying UFS filesystem via _FIOLOGENABLE. This is done whenever a new pxfs primary for a filesystem is created. Trap pc shows '0'! Much badness. That could happen only if deep dark internals of copyin and fault handlers messed things up. My suspision was ASI problems or something similar happening in new code for sun4v. After getting all down and dirty trawling through the core, I had this brainwave that I should verify parameters passed to the ioctl call. That is the first thing to do for any core dump, better late than never. Sure enough for past many many years, we were calling VOP_IOCTL() with DATAMODEL_NATIVE as the flag instead of FKIOCTL. FKIOCTL tells ddi_copyin() to use kcopy() instead of bcopy(). The ioctl is called by a kernel thread and the argument is in a kernel address. So passing DATAMODEL_NATIVE should be a sure fire way to mess everything up. Until now this code worked because mc_copyops identifies failover or node death retry by a caller pid of '0'. That code kicks-in in this case since for kernel ioctls we set the caller pid to zero via the above mentioned tsd entry . So why didn't it work this time? There was another bug 6673119 which messed up the lofault handler. This is the error recovery handler for bcopy and must be valid. For user<->kernel address copy lofault should be valid after a fault recovery. copyops vector is accessed only on the error handler. But due to 6673119 we had an invalid handler. That is where we panicked. kcopy should not have this problem. Thus passing FKIOCTL to the ioctl as intended should make everything work. Amazingly even the above explanation is not necessary to explain my happiness due to bug#6713370. I had a probable root cause. Testing this should be easy enough. But the tests were running on a different patch level than the code I was working on. It is too much trouble to build the whole thing after finding the correct date and thus source gate. I thought of solutions. FKIOCTL is a constant: 0x80000000. DATAMODEL_NATIVE for 64 bit is also a constant with the value 0x00200000. The code that loads this into a register should be easily visible as a sethi instruction. It should be easy to edit the binary and change the instruction to load 0x80000000 instead of 0x00200000. Binary patching. The last time I did this was 12 years ago. Minor thrill developing! First ask dis to disassemble the appropriate routine. ... kernel_ioctl+0xbc: 9a 07 a7 f3 add %fp, 0x7f3, %o5 kernel_ioctl+0xc0: a9 2d 70 0c sllx %l5, 0xc, %l4 kernel_ioctl+0xc4: 17 00 08 00 sethi %hi(0x200000), %o3 <<< DATAMODEL_NATIVE kernel_ioctl+0xc8: 93 3e a0 00 sra %i2, 0x0, %o1 ... kernel_ioctl+0xfc: 9a 07 a7 f3 add %fp, 0x7f3, %o5 kernel_ioctl+0x100: af 2e 30 0c sllx %i0, 0xc, %l7 kernel_ioctl+0x104: 17 00 08 00 sethi %hi(0x200000), %o3 <<< DATAMODEL_NATIVE kernel_ioctl+0x108: 93 3e a0 00 sra %i2, 0x0, %o1 ... 17 00 08 00 is the sequence we are looking for. Now open pxfs module in emacs and M-x hexl-mode. Search for a92d 700c 1700 0800 ... 0005b8e0: 4000 0000 9010 0007 8090 0008 1240 0014 @............@.. 0005b8f0: 3b00 0000 4000 0000 2d00 0000 aa15 a000 ;...@...-....... 0005b900: d05f a7e7 9a07 a7f3 a92d 700c 1700 0800 ._.......-p..... here it is ---^^^^^^^^^ 0005b910: 933e a000 d85d 2000 4000 0000 9410 001b .>...] .@....... 0005b920: b610 0008 4000 0000 d05f a7e7 4000 0000 ....@...._..@... 0005b930: 9010 0007 1080 000f d006 6000 b017 6000 ..........`...`. 0005b940: d05f a7e7 9a07 a7f3 af2e 300c 1700 0800 ._........0..... and here ---^^^^^^^^^ 0005b950: 933e a000 d85d e000 4000 0000 9410 001b .>...]..@....... ... Now change and disassemble till you get the correct bits for "sethi 0x80000000, %o3" This took a few iterations since I couldn't bother myself to learn the binary layout for sethi. ... 0005b8f0: 3b00 0000 4000 0000 2d00 0000 aa15 a000 ;...@...-....... 0005b900: d05f a7e7 9a07 a7f3 a92d 700c 1720 0000 ._.......-p.. .. here it is ---^^^^^^^^^ 0005b910: 933e a000 d85d 2000 4000 0000 9410 001b .>...] .@....... 0005b920: b610 0008 4000 0000 d05f a7e7 4000 0000 ....@...._..@... 0005b930: 9010 0007 1080 000f d006 6000 b017 6000 ..........`...`. 0005b940: d05f a7e7 9a07 a7f3 af2e 300c 1720 0000 ._........0.. .. and here ---^^^^^^^^^ 0005b950: 933e a000 d85d e000 4000 0000 9410 001b .>...]..@....... ... 1700 0800 became 1720 0000. Disassemble and check. ... kernel_ioctl+0xbc: 9a 07 a7 f3 add %fp, 0x7f3, %o5 kernel_ioctl+0xc0: a9 2d 70 0c sllx %l5, 0xc, %l4 kernel_ioctl+0xc4: 17 20 00 00 sethi %hi(0x80000000), %o3 <<< FKIOCTL kernel_ioctl+0xc8: 93 3e a0 00 sra %i2, 0x0, %o1 ... kernel_ioctl+0xfc: 9a 07 a7 f3 add %fp, 0x7f3, %o5 kernel_ioctl+0x100: af 2e 30 0c sllx %i0, 0xc, %l7 kernel_ioctl+0x104: 17 20 00 00 sethi %hi(0x80000000), %o3 <<< FKIOCTL kernel_ioctl+0x108: 93 3e a0 00 sra %i2, 0x0, %o1 ... Run the tests. Everything hunky dory. I had successfuly binary patched something after eons. And that, that made me happy ;-)
Posted at 01:04PM Jun 28, 2008 by binujp in cluster and PxFS |
Today's Page Hits: 68
| « November 2009 | ||||||
| Sun | Mon | Tue | Wed | Thu | Fri | Sat |
|---|---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | 6 | 7 |
8 | 9 | 10 | 11 | 12 | 13 | 14 |
15 | 16 | 17 | 18 | 19 | 20 | 21 |
22 | 23 | 24 | 25 | 26 | 27 | 28 |
29 | 30 | |||||
| Today | ||||||