Binu Jose Philip's Weblog
Free Flow PxFS
I drove an Ikon 1.6 in Bangalore. It is a poor man's driver's car and handled beautifully. After 3 years of great fun, I appealed to the home ministry, composed entirely of my wife Sapna, for more fun with the car. I got sanction for alloys and a performance filter, also called a free flow filter. It's not that the power was inadequate; it had all the power needed to wait powerfully in traffic and at the frequent signals. This was similar to removing white space, or refactoring, or rewriting code that works perfectly. It is difficult for some of us to leave something that works as it is. There has to be change.

Alloys and tubeless tires were easy. Even my wife complimented me on the difference and on how maybe, just maybe, there was an iota of sense left in me. The air filter, a K&N, was another matter, although it didn't earn the home ministry's ire. For the money I paid, the gain in power was small. To hear the subtle whoosh of it sucking air I would have to open the bonnet and listen carefully. Not a good posture while driving. Well, I did get some increase in power and a lot more mental satisfaction from looking at the beautiful cut-off cone once in a while.

What I gave the engine was the ability to suck in a lot more air and thus burn fuel better. But the engine does not face forward alone. There is a rear to it. The increased air flow and the larger volume of exhaust were still going through the old rear of the engine, the 'exhaust sub-system'. If I could allow the exhaust to flow free, the engine would become a much more efficient air pump. I would get more power and better mileage.

Home ministry personnel realized that for every sane decision there is an equal and opposite decision. Thus, the alloys had to be balanced by a free flow exhaust, which I demanded in a manly manner by groveling and sniveling and composing long sentences consisting entirely of nonsensical words.

With a free flow exhaust, the standard exhaust manifold is replaced with custom headers.
Each header is of the same length and (hopefully) tuned such that the exhaust pulses from each cylinder help the others along. Here's some theory. The catalytic converter may go, the standard muffler does go, and the tail-pipe and the exhaust pipes become a little bigger. With all this, from the moment the exhaust valves open, the exhaust "pulse" has a much freer path to the outside world. For me, behind the wheel, all this means 15-20% more power and much better throttle response. For a 950 kg car, an increase of power by 15-20 bhp is significant. To quote a like-minded friend, there is no restriction between flooring the pedal and red-lining the engine. Wow! The moral of the story is that free inflow *and* free outflow are required for better performance. Oh, and yes, it also gives a nice deep throaty exhaust note.

Now let me get to PxFS, which I earlier pleaded guilty to the crime of working on. PxFS is slow. PxFS is slow for large files. PxFS is slow for small files. PxFS is better used read-only. Many Solaris Cluster customers and developers have heard this or said this. Can it be made faster? It is, after all, a data pump. Yes, it can be made faster.

Before I expound on the easy modifications, I should tell you about Ellard Roush and the huge, substantial modifications he made to PxFS. That is the equivalent of designing a better car. Till Solaris Cluster 3.1u4, PxFS was a dog and a glutton for memory. Ellard held it by the neck and shook it till it started behaving itself. That project was called the Object Consolidation project. It was a rewrite of PxFS and made most of the work that followed easier, somewhat like customizing an Evo is easier compared to almost every other car on the road.

One of the main bottlenecks for allocating writes is making sure there is backing store, i.e., blocks on disk. This typically means executing a stateful operation. The allocation should survive a failover and be guaranteed to be available.
In PxFS's case, the allocation is requested from a client and executed on the server. Even more overhead. Asking for and getting a page, by contrast, is easy and lightweight.

To get around the allocation bottleneck, we implemented a space cache reservation strategy for PxFS: the "Fastwrite" project. The idea itself is not new and not unique to filesystems either. You allocate a local backing store pool and, when it is exhausted, you request more. The server, or primary, knows how much space is free and retains global control. The clients write merrily to pages after reserving space from the local cache. It is the page-out that operates on filesystem metadata, which in turn affects on-disk structures. In real world terms, applications doing extending writes will see big speedups. Solaris Cluster 3.2 introduced fastwrites.

What I described above is the equivalent of the performance air filter for my car. Applications can now write unhindered, but they can also starve the system of memory. One of the key requirements of a filesystem is how fast it can get data into stable storage. Runaway memory use can be fixed by throttling. Getting data to stable storage fast requires a better, "freer flowing", back end. With Solaris Cluster 3.2u1, or 3.2 and the latest patches, you get that. We made write chunks bigger and gave PxFS more threads to write with. Where data used to be written serially, we parallelised. We split locks and reduced hold times. We also introduced a semi-heuristic flow control for clients. All of the above is the equivalent of my tuned headers and straighter exhaust for the car.

Unlike the car, I can take this for a spin from my desk. Let me do so. Tests are on 3.2 with the latest patches (or the soon-to-be-released 3.2u1). For all these tests I mounted the metadevice as non-global to test UFS, so disk speed concerns can be put to rest. PxFS's strength is its ease of use. If you create mount directories on all nodes, global mounting is as easy as "mount -g".
Similarly, if I take off the "-g" I get a local mount. Feel free to overestimate my efforts.

I'll do the scientificest of all file system tests: "mkfile"! Writing a 2G file to UFS takes this long in Solaris 10.

    -bash-3.00# timex mkfile 2g /mnt/kuntham
    real 1:45.79
    user 0.22
    sys 11.32

Doing the same on a PxFS mount takes this long.

    -bash-3.00# timex mkfile 2g /global/xxxx/kuntham
    real 1:50.87
    user 0.21
    sys 8.27

It is comparable! To hammer home the significance: not only are PxFS writes going through two complete file system layers (like NFS), PxFS is also checkpointing metadata changes and making sure every page and transaction related to the file hits the disk before close() returns.

Let's repeat this with dd, the scientificestest of fs tests. For UFS:

    -bash-3.00# timex dd if=/dev/zero of=/mnt/kuntham bs=524288 count=4096
    4096+0 records in
    4096+0 records out
    real 1:47.01
    user 0.02
    sys 12.50

For PxFS:

    -bash-3.00# timex dd if=/dev/zero of=/global/xxxx/kuntham bs=524288 count=4096
    4096+0 records in
    4096+0 records out
    real 1:46.71
    user 0.01
    sys 7.87

Ooooo, PxFS is faster. Knowing the insides of PxFS, I gave dd a block size the same as the default page kluster size for PxFS. All is fair in love and tuning.

Now I'll brew my own slightly un-scientific test to quantify the statement that PxFS makes sure everything is on disk before closing. This is a small Python script that does the same as the scientific dd test above, but breaks up the time into open, write, sync and close. Here is the script and here are the results.

On UFS:

    -bash-3.00# /cal/chunk.py /mnt/kuntham
    Time in seconds
    open 0.228416
    write 106.314831
    fsync 0.704131
    close 0.000030
    Total time: 107.247408 seconds

On PxFS:

    -bash-3.00# chunk.py /global/xxxx/kuntham
    Time in seconds
    open 0.012415
    write 80.609731
    fsync 30.116318
    close 0.272743
    Total time: 111.011207 seconds

Notice how writes are much faster than for UFS, but fsync and close contribute significantly to the overall time?
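The script itself did not survive with this post. For anyone who wants to repeat the measurement, here is a minimal sketch of what such a script might look like. It is an assumption, not the original chunk.py: it targets Python 3, the names timed_write and report are made up for illustration, and the block size and count are simply copied from the dd runs above.

```python
import os
import time

BLOCK_SIZE = 524288   # same as bs= in the dd runs above
BLOCK_COUNT = 4096    # 4096 blocks of 512 KiB = 2 GiB


def timed_write(path, block_size=BLOCK_SIZE, count=BLOCK_COUNT):
    """Write `count` zero-filled blocks to `path`.

    Returns a dict with the seconds spent in open, write, fsync and close.
    (Hypothetical helper, not taken from the original chunk.py.)
    """
    block = memoryview(bytes(block_size))
    timings = {}

    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    timings["open"] = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(count):
        written = 0
        # os.write may write fewer bytes than asked for, so loop until done.
        while written < block_size:
            written += os.write(fd, block[written:])
    timings["write"] = time.perf_counter() - start

    start = time.perf_counter()
    os.fsync(fd)  # force dirty pages for this file out to stable storage
    timings["fsync"] = time.perf_counter() - start

    start = time.perf_counter()
    os.close(fd)
    timings["close"] = time.perf_counter() - start

    return timings


def report(timings):
    """Format the timings in the same layout as the results above."""
    lines = ["Time in seconds"]
    for phase in ("open", "write", "fsync", "close"):
        lines.append("%s %f" % (phase, timings[phase]))
    lines.append("Total time: %f seconds" % sum(timings.values()))
    return "\n".join(lines)
```

To use it, call print(report(timed_write("/mnt/kuntham"))) for the UFS mount and the same with the PxFS path. Timing each phase separately is the whole point: a single wall clock number, like the one timex gives for dd, hides how much of the cost is deferred to fsync and close.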
Overall, in spite of having to honor its data integrity guarantees, PxFS performs quite well.

Am I done with the car? Absolutely not! Don't tell my wife, but there are iridium spark plugs, porting and polishing, maybe ECU tuning. Similarly, there are huge tuning opportunities for PxFS too, and there must be someone who should not be told. Directory operations, small files, checkpoints and so on. Since directory operations and other metadata operations require checkpoints, they are still slow. In another blog I'll explain checkpoints and recovery of PxFS.

You can see the internals of PxFS and the rest of Solaris Cluster very soon. I did mention it is going to be open sourced. Maybe one of you is a master tuner and can then do much more.
Posted at 08:14AM Dec 12, 2007 by binujp in cluster and PxFS |