Binu Jose Philip's Weblog
Happiness and patching
A few weeks ago I picked up bug#6713370 and it made me happy. This entry is about how it made me happy.

First, about ioctls and PxFS. Assume you are on a PxFS client node and call, say, the _FIOSATIME ioctl. The file system primary is on another node, so the copyin for the ioctl happens on another machine, where the user address passed to the ioctl is not valid. You are deep in the kernel in non-cluster code and must find a way to solve the wrongness in address space. How is this achieved?

Enter t_copyops. Every kernel thread has provision for storing a pointer to an array of functions to call when a fault is incurred during a copyxxx routine. If this field is non-zero, the appropriate vector from the array is called. Before calling any ioctl, cluster registers its own copyops vectors, mc_copyops. In addition, the thread issuing the ioctl is given a tsd entry that holds the pid and the nodeid from which the call originated. If there is a fault during copyxxx, the appropriate vector from mc_copyops is called after copyxxx restores the original fault handler. User address space access will always result in an mc_copyops vector getting called. Put in simple terms, the mc_copyops routines go to the node where the ioctl was issued, do a copyin from the process address space, copy the data over to memory on the server and allow copyin to proceed.

mc_copyops also factors in the case where a failover or node death caused the thread which issued the ioctl to die. There is then no process or node to access the user address from. All such ioctls first copy their parameters over to the kernel and issue the ioctl after setting the tsd entry to point to pid 0. The mc_copyops vectors identify such cases and treat the passed address as a local kernel address.

The amazing thing is, to talk about why bug#6713370 made me happy the above is not necessary. But it is good to know.
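The dispatch described above can be modelled in a few lines of Python. This is only a toy sketch, not cluster code: the memory maps, the Origin and KThread classes, the Fault exception and the addresses are all invented for illustration. Only the names t_copyops, mc_copyops and the "pid 0 means local kernel address" rule come from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Toy memory maps. All addresses and contents are made up for illustration.
REMOTE_USER_MEMORY: Dict[Tuple[int, int], Dict[int, bytes]] = {
    (1, 4242): {0x1000: b"atime-args"},   # (nodeid, pid) -> user memory
}
SERVER_KERNEL_MEMORY: Dict[int, bytes] = {0x2000: b"kernel-args"}


class Fault(Exception):
    """Models a fault taken inside a copyxxx routine."""


@dataclass
class Origin:
    """Stands in for the tsd entry: where the ioctl was issued."""
    nodeid: int
    pid: int


@dataclass
class KThread:
    origin: Origin
    copyops: Optional[Dict[str, Callable]] = None   # models t_copyops


def mc_copyin(thread: KThread, addr: int, length: int) -> bytes:
    """Toy version of the copyin vector in mc_copyops."""
    if thread.origin.pid == 0:
        # Retry after failover/node death: treat addr as a local kernel address.
        return SERVER_KERNEL_MEMORY[addr][:length]
    # Otherwise fetch the bytes from the process on the originating node.
    key = (thread.origin.nodeid, thread.origin.pid)
    return REMOTE_USER_MEMORY[key][addr][:length]


MC_COPYOPS = {"copyin": mc_copyin}


def direct_user_copy(addr: int, length: int) -> bytes:
    """The user address is never mapped on the server, so this always faults."""
    raise Fault(hex(addr))


def copyin(thread: KThread, addr: int, length: int) -> bytes:
    """Toy copyin on the server: on a fault, hand over to the copyops vector."""
    try:
        return direct_user_copy(addr, length)
    except Fault:
        if thread.copyops is None:
            raise
        return thread.copyops["copyin"](thread, addr, length)


if __name__ == "__main__":
    from_client = KThread(Origin(nodeid=1, pid=4242), MC_COPYOPS)
    retried = KThread(Origin(nodeid=1, pid=0), MC_COPYOPS)
    print(copyin(from_client, 0x1000, 5))   # b'atime'
    print(copyin(retried, 0x2000, 6))       # b'kernel'
```

A thread with no copyops registered simply sees the fault, which is the behaviour you would expect from plain non-cluster code.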
If you look at the stack in the bug you can see we panicked while enabling logging on the underlying UFS filesystem via _FIOLOGENABLE. This is done whenever a new PxFS primary for a filesystem is created. Trap pc shows '0'! Much badness. That could happen only if the deep dark internals of copyin and the fault handlers messed things up. My suspicion was ASI problems or something similar happening in new code for sun4v.

After getting all down and dirty trawling through the core, I had this brainwave that I should verify the parameters passed to the ioctl call. That is the first thing to do for any core dump; better late than never. Sure enough, for many, many years we had been calling VOP_IOCTL() with DATAMODEL_NATIVE as the flag instead of FKIOCTL. FKIOCTL tells ddi_copyin() to use kcopy() instead of copyin(). The ioctl is called by a kernel thread and the argument is at a kernel address, so passing DATAMODEL_NATIVE should be a sure-fire way to mess everything up. Until now this code worked because mc_copyops identifies a failover or node death retry by a caller pid of '0'. That code kicks in here too, since for kernel ioctls we set the caller pid to zero via the above-mentioned tsd entry.

So why didn't it work this time? There was another bug, 6673119, which messed up the lofault handler. This is the error recovery handler for bcopy and must be valid. For a user<->kernel address copy, lofault should be valid after a fault recovery. The copyops vector is accessed only from the error handler, but due to 6673119 we had an invalid handler. That is where we panicked. kcopy should not have this problem. Thus passing FKIOCTL to the ioctl as intended should make everything work.

Amazingly, even the above explanation is not necessary to explain my happiness due to bug#6713370. I had a probable root cause. Testing this should be easy enough. But the tests were running on a different patch level than the code I was working on.
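To make the flag mix-up concrete, here is a tiny Python model of the decision a ddi_copyin()-style helper makes. It is a sketch, not the kernel source: copy_routine_for is a hypothetical name, and the assumption that the non-FKIOCTL path is the ordinary user copy (copyin) is mine. The two constants are the values quoted in this entry.

```python
FKIOCTL = 0x80000000            # "the caller is the kernel" flag
DATAMODEL_NATIVE = 0x00200000   # 64-bit data model flag


def copy_routine_for(flags: int) -> str:
    """Name the copy routine a ddi_copyin-style helper would pick."""
    if flags & FKIOCTL:
        return "kcopy"    # kernel-to-kernel copy
    return "copyin"       # user-to-kernel copy, the path that consults copyops


if __name__ == "__main__":
    print(copy_routine_for(DATAMODEL_NATIVE))             # copyin
    print(copy_routine_for(FKIOCTL | DATAMODEL_NATIVE))   # kcopy
```

The two flags share no bits, so passing only DATAMODEL_NATIVE can never select the kernel copy path, no matter what the address really is.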
It is too much trouble to build the whole thing after finding the correct date and thus source gate. I thought of solutions. FKIOCTL is a constant: 0x80000000. DATAMODEL_NATIVE for 64 bit is also a constant, with the value 0x00200000. The code that loads this into a register should be easily visible as a sethi instruction. It should be easy to edit the binary and change the instruction to load 0x80000000 instead of 0x00200000. Binary patching. The last time I did this was 12 years ago. Minor thrill developing!

First ask dis to disassemble the appropriate routine.

    ...
    kernel_ioctl+0xbc:      9a 07 a7 f3  add       %fp, 0x7f3, %o5
    kernel_ioctl+0xc0:      a9 2d 70 0c  sllx      %l5, 0xc, %l4
    kernel_ioctl+0xc4:      17 00 08 00  sethi     %hi(0x200000), %o3    <<< DATAMODEL_NATIVE
    kernel_ioctl+0xc8:      93 3e a0 00  sra       %i2, 0x0, %o1
    ...
    kernel_ioctl+0xfc:      9a 07 a7 f3  add       %fp, 0x7f3, %o5
    kernel_ioctl+0x100:     af 2e 30 0c  sllx      %i0, 0xc, %l7
    kernel_ioctl+0x104:     17 00 08 00  sethi     %hi(0x200000), %o3    <<< DATAMODEL_NATIVE
    kernel_ioctl+0x108:     93 3e a0 00  sra       %i2, 0x0, %o1
    ...

17 00 08 00 is the sequence we are looking for. Now open the pxfs module in emacs and M-x hexl-mode. Search for a92d 700c 1700 0800

    ...
    0005b8e0: 4000 0000 9010 0007 8090 0008 1240 0014  @............@..
    0005b8f0: 3b00 0000 4000 0000 2d00 0000 aa15 a000  ;...@...-.......
    0005b900: d05f a7e7 9a07 a7f3 a92d 700c 1700 0800  ._.......-p.....
                              here it is ---^^^^^^^^^
    0005b910: 933e a000 d85d 2000 4000 0000 9410 001b  .>...] .@.......
    0005b920: b610 0008 4000 0000 d05f a7e7 4000 0000  ....@...._..@...
    0005b930: 9010 0007 1080 000f d006 6000 b017 6000  ..........`...`.
    0005b940: d05f a7e7 9a07 a7f3 af2e 300c 1700 0800  ._........0.....
                                and here ---^^^^^^^^^
    0005b950: 933e a000 d85d e000 4000 0000 9410 001b  .>...]..@.......
    ...

Now change and disassemble till you get the correct bits for "sethi 0x80000000, %o3". This took a few iterations since I couldn't be bothered to learn the binary layout for sethi.

    ...
    0005b8f0: 3b00 0000 4000 0000 2d00 0000 aa15 a000  ;...@...-.......
    0005b900: d05f a7e7 9a07 a7f3 a92d 700c 1720 0000  ._.......-p.. ..
                              here it is ---^^^^^^^^^
    0005b910: 933e a000 d85d 2000 4000 0000 9410 001b  .>...] .@.......
    0005b920: b610 0008 4000 0000 d05f a7e7 4000 0000  ....@...._..@...
    0005b930: 9010 0007 1080 000f d006 6000 b017 6000  ..........`...`.
    0005b940: d05f a7e7 9a07 a7f3 af2e 300c 1720 0000  ._........0.. ..
                                and here ---^^^^^^^^^
    0005b950: 933e a000 d85d e000 4000 0000 9410 001b  .>...]..@.......
    ...

1700 0800 became 1720 0000. Disassemble and check.

    ...
    kernel_ioctl+0xbc:      9a 07 a7 f3  add       %fp, 0x7f3, %o5
    kernel_ioctl+0xc0:      a9 2d 70 0c  sllx      %l5, 0xc, %l4
    kernel_ioctl+0xc4:      17 20 00 00  sethi     %hi(0x80000000), %o3    <<< FKIOCTL
    kernel_ioctl+0xc8:      93 3e a0 00  sra       %i2, 0x0, %o1
    ...
    kernel_ioctl+0xfc:      9a 07 a7 f3  add       %fp, 0x7f3, %o5
    kernel_ioctl+0x100:     af 2e 30 0c  sllx      %i0, 0xc, %l7
    kernel_ioctl+0x104:     17 20 00 00  sethi     %hi(0x80000000), %o3    <<< FKIOCTL
    kernel_ioctl+0x108:     93 3e a0 00  sra       %i2, 0x0, %o1
    ...

Run the tests. Everything hunky dory. I had successfully binary patched something after eons. And that, that made me happy ;-)
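For anyone who would rather compute the bits than iterate, the SPARC sethi layout is short enough to fit in a few lines of Python: op (2 bits, 00), rd (5 bits), op2 (3 bits, 100), then a 22-bit immediate holding the upper 22 bits of the constant. The helper name encode_sethi is made up for this sketch; the register number for %o3 follows from %o0..%o7 being r8..r15.

```python
def encode_sethi(value: int, rd: int) -> int:
    """Encode SPARC `sethi %hi(value), rd` as a 32-bit instruction word."""
    imm22 = (value >> 10) & 0x3FFFFF   # %hi() keeps the top 22 bits
    return (0b00 << 30) | (rd << 25) | (0b100 << 22) | imm22


O3 = 11   # %o3 is register r11

if __name__ == "__main__":
    print(f"{encode_sethi(0x00200000, O3):08x}")   # 17000800 (DATAMODEL_NATIVE)
    print(f"{encode_sethi(0x80000000, O3):08x}")   # 17200000 (FKIOCTL)
```

Both words match the bytes seen in the hex dumps above, before and after the patch.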
Posted at 01:04PM Jun 28, 2008 by binujp in cluster and PxFS |
Free Flow PxFS
I drove an Ikon 1.6 in Bangalore. It is a poor man's driver's car and handled beautifully. After 3 years of great fun, I appealed to the home ministry, composed entirely of my wife Sapna, for more fun with the car. I got sanction for alloys and a performance filter, also called a free flow filter. It's not that the power was inadequate; it had all the power needed to wait powerfully in traffic and at the frequent signals. This was similar to removing white spaces or refactoring or rewriting code that works perfectly. It is difficult for some of us to leave something that works as it is. There has to be change.

Alloys and tubeless tires were easy. Even my wife complimented me on the difference and on how maybe, just maybe, there was an iota of sense left in me. The air filter, a K&N, was another matter, although it didn't earn home ministry ire. For the money I paid, the gain in power was small. To hear the subtle whoosh of it sucking air I would have to open the bonnet and listen carefully. Not a good posture while driving. Well, I did get some increase in power and a lot more mental satisfaction from looking at the beautiful cut-off cone once in a while.

What I gave the engine was the ability to suck in a lot more air and thus burn fuel better. The engine does not face forward alone. There is a rear to it. The increased air flow and the larger volume of exhaust were still going through the old rear of the engine, the 'exhaust sub-system'. If I could allow the exhaust to flow free, the engine would become a much more efficient air pump. I would get more power and better mileage. Home ministry personnel realized that for every sane decision there is an equal and opposite decision. Thus, the alloys had to be balanced by a free flow exhaust, which I demanded in a manly manner by groveling and sniveling and composing long sentences consisting entirely of nonsensical words.

With a free flow exhaust, the standard exhaust manifold is replaced with custom headers.
Each header manifold is of the same length and (hopefully) tuned such that exhaust pulses from each cylinder help the others along. Here's some theory. The catalytic converter may go, the standard muffler does go, and the tail-pipe and the exhaust pipes become a little bigger. With all this, from the moment the exhaust valves open, the exhaust "pulse" has a much freer path to the outside world. For me, behind the wheel, all this means 15-20% more power and much better throttle response. For a 950kg car, an increase of power by 15-20 bhp is significant. To quote a like-minded friend, there is no restriction between flooring the pedal and red-lining the engine. Wow! The moral of the story was that free inflow *and* free outflow were required for better performance. Oh, and yes, it also gives a nice deep throaty exhaust note.

Now let me get to PxFS, the crime of working on which I had pleaded guilty to earlier. PxFS is slow. PxFS is slow for large files. PxFS is slow for small files. PxFS is better used read-only. Many Solaris Cluster customers and developers have heard this or spoken this. Can it be made faster? It is, after all, a data pump. Yes, it can be made faster.

Before I expound on the easy modifications, I should tell you about Ellard Roush and the huge and substantial modifications he made to PxFS. That is the equivalent of designing a better car. Till Solaris Cluster 3.1u4, PxFS was a dog and it was a glutton for memory. Ellard held it by the neck and shook it till it started behaving itself; that project was called the Object Consolidation project. It was a rewrite of PxFS and made most of the work that followed easier. Somewhat like customizing an Evo is easier compared to almost every other car on the road.

One of the main bottlenecks for allocating writes is making sure there is backing store, i.e. blocks on disk. This typically means executing a stateful operation. The allocation should survive a failover and be guaranteed to be available.
In PxFS's case, the allocation is requested from a client and executed on the server. Even more overhead. Asking for and getting a page is easy and lightweight. To get around the allocation bottleneck, we implemented a space cache reservation strategy for PxFS: the "Fastwrite" project. The idea itself is not new and not unique to filesystems either. You allocate a local backing store pool and when it is exhausted you request more. The server or primary knows how much space is free and retains global control. The clients write merrily to pages after reserving space from the local cache. It is the page out which operates on filesystem metadata, which in turn affects on-disk structures. In real world terms, applications doing extending writes will see big speedups. Solaris Cluster 3.2 introduced fastwrites.

What I described above is the equivalent of the performance air filter for my car. Applications can now write unhindered, but they can also starve the system of memory. One of the key requirements of a filesystem is how fast it can get data into stable storage. Runaway memory use can be fixed by throttling. Getting data to stable storage fast requires a better, "freer flowing", back end. With Solaris Cluster 3.2u1, or 3.2 and the latest patches, you get that. We made write chunks bigger and gave more threads to write with. Where data used to be written serially, we parallelised. We split locks and reduced hold times. We also introduced a semi-heuristic flow control for clients. All of the above is the equivalent of my tuned headers and straighter exhaust for the car.

Unlike the car, I can take this for a spin from my desk. Let me do so. Tests are with 3.2 with the latest patches (or 3.2u1, to be released). For all these tests I mounted the metadevice as non-global to test UFS. Disk speed concerns can be put to rest. PxFS's strength is its ease of use. If you create mount directories on all nodes, global mounting is as easy as "mount -g".
Similarly, if I take off the "-g" I get a local mount. Feel free to over-estimate my efforts. I'll do the scientificest of all file system tests. "mkfile!" Writing a 2G file to UFS takes this long in Solaris 10.

    -bash-3.00# timex mkfile 2g /mnt/kuntham
    real     1:45.79
    user        0.22
    sys        11.32

Doing the same on a PxFS mount takes this long.

    -bash-3.00# timex mkfile 2g /global/xxxx/kuntham
    real     1:50.87
    user        0.21
    sys         8.27

It is comparable! To hammer home the significance: not only are PxFS writes going through two complete file system layers (like NFS), PxFS is also checkpointing metadata changes and making sure every page and transaction related to the file hits the disk before close() returns.

Let's repeat this with dd, the scientificestest of fs tests. For UFS:

    -bash-3.00# timex dd if=/dev/zero of=/mnt/kuntham bs=524288 count=4096
    4096+0 records in
    4096+0 records out
    real     1:47.01
    user        0.02
    sys        12.50

For PxFS:

    -bash-3.00# timex dd if=/dev/zero of=/global/xxxx/kuntham bs=524288 count=4096
    4096+0 records in
    4096+0 records out
    real     1:46.71
    user        0.01
    sys         7.87

Ooooo, PxFS is faster. Knowing the insides of PxFS, I gave dd a block size the same as the default page kluster size for PxFS. All is fair in love and tuning.

Now I'll brew my own slightly un-scientific test to quantify the statement that PxFS makes sure everything is on disk before closing. This is a small python script that does the same as the scientific dd test above, but breaks up the time for open, write, sync and close. Here is the script and here are the results.

On UFS

    -bash-3.00# /cal/chunk.py /mnt/kuntham
    Time in seconds
    open 0.228416
    write 106.314831
    fsync 0.704131
    close 0.000030
    Total time: 107.247408 seconds

On PxFS

    -bash-3.00# chunk.py /global/xxxx/kuntham
    Time in seconds
    open 0.012415
    write 80.609731
    fsync 30.116318
    close 0.272743
    Total time: 111.011207 seconds

Notice how writes are much faster than for UFS but fsync and close contribute significantly to overall time?
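The original chunk.py is not reproduced in this page. A minimal sketch of what such a script might look like follows; the function name time_phases and its defaults are my assumptions, chosen to match the dd run above (4096 blocks of 524288 bytes), and the output format only imitates the results shown.

```python
import os
import sys
import time


def time_phases(path, block_size=524288, count=4096):
    """Write `count` zero-filled blocks and time open, write, fsync and close."""
    block = bytes(block_size)
    timings = {}

    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    timings["open"] = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(count):
        written = 0
        while written < block_size:          # cope with short writes
            written += os.write(fd, block[written:])
    timings["write"] = time.perf_counter() - start

    start = time.perf_counter()
    os.fsync(fd)                             # force data to stable storage
    timings["fsync"] = time.perf_counter() - start

    start = time.perf_counter()
    os.close(fd)
    timings["close"] = time.perf_counter() - start
    return timings


if __name__ == "__main__" and len(sys.argv) > 1:
    results = time_phases(sys.argv[1])
    print("Time in seconds")
    for phase in ("open", "write", "fsync", "close"):
        print(f"{phase} {results[phase]:.6f}")
    print(f"Total time: {sum(results.values()):.6f} seconds")
```

Run it with the target file as the only argument, once against the local mount and once against the global one, and compare the per-phase numbers rather than the total.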
Overall, in spite of having to honor its data integrity guarantees, PxFS performs quite well.

Am I done with the car? Absolutely not! Don't tell my wife, but there are iridium spark plugs, porting and polishing, maybe ECU tuning. Similarly, there are huge tuning opportunities for PxFS too, and there must be someone who should not be told. Directory operations, small files, checkpoints and so on. Since directory operations and other metadata operations require checkpoints, they are still slow. In another blog I'll explain checkpoints and recovery of PxFS. You can see the internals of PxFS and the rest of Solaris Cluster very soon. I did mention it is going to be open sourced. Maybe one of you is a master tuner and can then do much more.
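Looking back at the space reservation idea behind Fastwrite described earlier in this entry, a toy model shows why it cuts down trips to the primary. Everything here is hypothetical: the Primary and Client classes, the chunk size of 64 blocks and the method names are invented for illustration and say nothing about how PxFS actually implements it.

```python
class Primary:
    """Toy server side: knows the free space and retains global control."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks

    def reserve(self, wanted):
        """Grant up to `wanted` blocks; stands in for a stateful remote call."""
        granted = min(wanted, self.free_blocks)
        self.free_blocks -= granted
        return granted


class Client:
    """Toy client side: writes consume a locally cached reservation."""

    def __init__(self, primary, chunk=64):
        self.primary = primary
        self.chunk = chunk
        self.local = 0          # blocks reserved but not yet used
        self.round_trips = 0    # how often we had to ask the primary

    def write_blocks(self, n):
        """Account for n newly allocated blocks; False means out of space."""
        while self.local < n:
            self.round_trips += 1
            granted = self.primary.reserve(max(self.chunk, n - self.local))
            if granted == 0:
                return False
            self.local += granted
        self.local -= n
        return True


if __name__ == "__main__":
    primary = Primary(free_blocks=1000)
    client = Client(primary)
    for _ in range(100):
        client.write_blocks(1)
    print(client.round_trips)    # 2
    print(primary.free_blocks)   # 872
```

One hundred single-block extending writes cost two requests to the primary instead of one hundred, which is the whole point of the reservation cache.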
Posted at 08:14AM Dec 12, 2007 by binujp in cluster and PxFS |
A rave for PxFS
Sacrilege. I realized only recently that I haven't blogged nice things about the technology that I work on. While I am at it, may I also draw your attention to the side bar on the left^h^h^h^h right, which is full of possibilities and not much in realization of those possibilities. That side bar had led me to CSS and hours of wonderful time in front of the monitor creating rectangles of various colors and sizes and overlaps. The psychedelics started with a visit to http://www.csszengarden.com. I still remember the plans and grand designs I had for the next web creation of mine. Don't you fret, those thoughts and designs are still locked up somewhere in there. But what I actually put in place is what you see here: a div that doesn't break or justify lines and a side bar full of defaults. What fun to create vaporware, eh?

Talking about vaporware, I still haven't said a single thing about PxFS. Aha, the cat is out of the bag and the probability wave hasn't collapsed yet. So, about PxFS. PxFS is the general purpose distributed filesystem used internal to Solaris Cluster nodes. More cats out of the bag now. By this time next year you will have heard much more about Solaris Cluster in the Open. Haha .. in the open, all of the code for Solaris Cluster will be open. As of today http://opensolaris.org/os/community/ha-clusters/ohac will tell you what is open in Solaris Cluster and what is not. PxFS is not open yet, but I can talk about it.

What does the big "Distributed HA Filesystem" suit-speak really tell? PxFS is a Highly Available, Distributed and *POSIX compliant* file system layer above some disk based file system. The disk based file system can be UFS or VxFS for now. Layering it above something better like ZFS is technically feasible. Okay. Now for details about what each of the above terms really means.
Before I go into the explanation: I am explaining the real basics of PxFS here, so total newbies can also understand and I can pretend I know much, much more than what I talk about here.

Distributed. PxFS is a distributed file system internal to cluster nodes. To explain distributed, take the analogy of the electric supply to a house. If you have only one socket in the house, then the supply at your house is not distributed. If you add more outlets, the supply becomes distributed. Similarly, a Solaris Cluster can have 1 to err.. 8 or 16 nodes. No, I am not going to quote an exact number, I like vague. PxFS allows the filesystem hosted on one of the nodes to be accessible on any of the other nodes. *Any* of the other nodes. It is like NFS in that it does not need a disk path, yeah, so maybe it is just a file access protocol. To restate, if you globally mount a UFS or VxFS filesystem on a cluster node and the mount directory exists on all cluster nodes, you can access that file system on all cluster nodes at the mount point. Distributed.

Now for the Highly Available part. Let's go back to the analogy of vibrating electrons in a linear conductor. If your house's electricity supply has an inverter to back it up, then your electricity supply is highly available. If the main line goes down, the inverter (battery) kicks in and you don't notice any down time. For exactness, there are the few milliseconds the inverter needs to cut in when there is no power. Similarly, in a Solaris Cluster setup, if you have more than one node with a path to the storage hosting the underlying filesystem for PxFS, you have a highly available PxFS file system. If the node hosting PxFS goes down, the other node with a path to storage will automatically take over and your applications will not notice any down time. Similar to the inverter takeover delay, there will be a brief period when your fs operations are delayed, but there will be no errors or retries. And that is the highly available part.
What about POSIX compliant? Take writes to any POSIX compliant single node filesystem. There is a guarantee that every write is atomic. If there are multiple writes to the same file without synchronization between the writers, you have the guarantee that no writes will overlap. The only unknown is the order of writes. Similarly, in a PxFS filesystem, writers from the same node or from multiple nodes can do writes with the guarantee that their writes will not get corrupted. That is one example of POSIX compliance. Guarantees like space for async writes and fsync semantics, everything POSIX (as far as I know), are guaranteed on PxFS. And that is POSIX compliance.

And the administration overhead? .. adding a "-g" to your mount command and making sure there are mount directories on each node. "man mount" will tell you about "-g". That part, the administrative simplicity, is worth many paragraphs of prose. The value of simplicity has already been proven by "zpool create tank mirror c1d0 c2d0 mirror c3d0 c4d0" and all of ZFS's other possibilities, which save you a lot of wear and tear on fingertips and neurons compared to using SVM and metaxxxx.
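If you want to poke at the "no writes will overlap" guarantee on a mount of your own, a small Python check might look like the sketch below. It is an illustration under assumptions, not a PxFS test: the function names, the 64-byte record size and the use of O_APPEND with threads on one node are all my choices. Each writer appends records made of a single repeated byte, and afterwards every record must still consist of one byte value only.

```python
import os
import threading

RECORD_SIZE = 64   # arbitrary small record size for this sketch


def append_records(path, tag, count):
    """Append `count` records, each RECORD_SIZE copies of the byte `tag`."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        record = tag * RECORD_SIZE
        for _ in range(count):
            os.write(fd, record)
    finally:
        os.close(fd)


def records_intact(path):
    """True if the file is a whole number of records and none is mixed."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) % RECORD_SIZE:
        return False
    return all(
        len(set(data[i:i + RECORD_SIZE])) == 1
        for i in range(0, len(data), RECORD_SIZE)
    )


def run_writers(path, writers=4, count=200):
    """Start unsynchronized writers on the same file, then verify the result."""
    threads = [
        threading.Thread(target=append_records,
                         args=(path, bytes([65 + i]), count))
        for i in range(writers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return records_intact(path)
```

Call run_writers() with a path on the file system under test. The order of records in the file is not checked, since the order of writes is exactly the one thing that is not guaranteed.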
Posted at 07:08AM Oct 12, 2007 by binujp in cluster and PxFS |