http://blogs.sun.com/csg/date/20070630 Saturday June 30, 2007

Deleting 14TB, and a Finished Prototype

Before I finished my prototype of merging processor sets and pools, this past week was already off to a great start:

The first, and possibly the most amusing, bug of the week was found by Dan K. (and not in his own code). When we left last Friday evening, Dan had kicked off a recursive find ... -exec ... to grep basically every source file for calls to a particular function he was changing for his time-independent zones. He redirected the output to a file inside the search area, so once find hit the log file and started exec'ing grep on it... you get the idea.

We arrived on Monday to several emails about the build server being out of space. People had deleted old workspaces once this happened, and were now working away again. Dan of course deleted the file (which, due to compression, was only 654 GB or so on disk). This was all well and good.

Then at some point I started a file operation, and it wasn't completing. Sending kill signals to the process yielded no results, so it was clearly stuck in the kernel somewhere (actual signal handling for things like kill signals only occurs when threads are entering or exiting syscalls). My officemate was having a similar problem. I walked to Dan's office, and he had the same problem. We went to Dan Price, who determined there was a filesystem issue. The three of us went to see Matt Ahrens, who was already working on the bug, as another engineer had already tracked it down to ZFS. After some amount of standing there, most of us left, as we weren't really contributing anything.

We went to lunch, came back, and about an hour later the server began working again. My chmod had finally finished, after two and a half hours. A short time later we got an email from Matt Ahrens saying that the problem was that a very large file had been deleted, and the delete was just taking a long time. ZFS tries to cache all the level 1 indirect blocks for a file in memory when deleting. For a 14TB file, that's roughly 16GB, on a machine with 12 GB of RAM. And everyone else's threads were blocked on the thread busy deleting Dan's log file.

The ZFS guys knew this sort of thing would be problematic, but had never actually run into it before... and so, they filed Bug 6573681.
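
As a rough sanity check on those numbers, here is a back-of-the-envelope sketch in Python. It assumes 128 KiB records (the ZFS default recordsize) and 128-byte block pointers; the post doesn't say what record size the build server actually used, and the function and constant names are made up for illustration. Under those assumptions a 14 TiB file needs about 14 GiB of level 1 indirect data, the same ballpark as the roughly 16GB quoted above and still more than the machine's 12 GB of RAM.

```python
# Back-of-the-envelope estimate of the level 1 indirect block data
# needed to address every data block of a very large ZFS file.
# Assumptions (not stated in the post): 128 KiB records and
# 128-byte block pointers.

KIB = 1024
GIB = 1024 ** 3
TIB = 1024 ** 4

RECORD_SIZE = 128 * KIB  # assumed recordsize (the ZFS default)
BLKPTR_SIZE = 128        # assumed size of one block pointer, in bytes


def l1_indirect_bytes(file_size, record_size=RECORD_SIZE,
                      blkptr_size=BLKPTR_SIZE):
    """Return the bytes of block pointers needed for one file."""
    # One block pointer per data block; round up for a partial last block.
    data_blocks = -(-file_size // record_size)
    return data_blocks * blkptr_size


if __name__ == "__main__":
    needed = l1_indirect_bytes(14 * TIB)
    ram = 12 * GIB
    print(f"Level 1 indirect data: {needed / GIB:.1f} GiB")
    print(f"More than 12 GiB of RAM: {needed > ram}")
```

The ratio is the interesting part: with these sizes, every 1 TiB of file data costs about 1 GiB of level 1 block pointers, so the cache requirement grows linearly with the size of the file being deleted.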

At this point, I have a mostly functional version of a kernel in which processor sets and pools are merged, and both are always enabled. There are a couple of minor cases I haven't handled yet, but it's basically finished. I've sent out a proposal for the change for review by a few people who have some experience with those systems, and am gathering feedback. Soon I'll start going through the formal process of filing for a significant change to the kernel. In the meantime, I'm also looking at other possible projects for the rest of the summer - perhaps one of the projects I had considered earlier for my first project, and perhaps a new RFE...



Posted by csg [Sun] (June 30, 2007 09:08 PM)