Monday August 30, 2004
Matthew Ahrens' Weblog
what I have to say

What is ZFS?

It occurs to me that before I can really talk much about ZFS, you need to know what it is, and how it's generally arranged. So here's an overview of what ZFS is, reproduced from our internal webpage:

ZFS is a vertically integrated storage system that provides end-to-end data integrity, immense (128-bit) capacity, and very simple administration.

What exactly is a "pooled storage model"? Basically it means that rather than stacking one FS on top of one volume on top of some disks, you stack many filesystems on top of one storage pool on top of lots of disks.

Take a home directory server with a few thousand users and a few terabytes of data. Traditionally, you'd probably set it up so that there are a few filesystems, each a few hundred gigabytes, and put a couple hundred users on each filesystem. That seems odd -- why is there an arbitrary grouping of users into filesystems? It would be more logical to have either one filesystem for all users, or one filesystem for each user. We can rule out the latter because it would require that we statically partition our storage and decide up front how much space each user got -- ugh.

Using one big filesystem would be plausible, but performance may become a problem with large filesystems -- both common run-time performance and performance of administrative tasks. Many backup tools are filesystem-based. The run-time of fsck(1m) is not linear in the size of the filesystem, so it could take a lot longer to fsck that one 8TB filesystem than it would to fsck 80 100GB filesystems. Furthermore, some filesystems simply don't support more than a terabyte or so of storage.

It's inconvenient to run out of space in a traditional filesystem, and it happens all too often. You might have lots of free space in a different filesystem, but you can't easily use it. You could manually migrate users to different filesystems to balance the free space... (hope your users don't mind downtime! hope you find the right backup tape when it comes time to restore!) Eventually you'll have to install new disks, make a new volume and filesystem out of them, and then migrate some users over to the new filesystem, incurring downtime.

I experienced these kinds of problems with my home directory when I was attending school (using VxFS on VxVM on Solaris), and they still plague some home directory and other file servers at Sun (using UFS on SVM on Solaris).

With ZFS, you can have one storage pool which encompasses all the storage attached to your server. Then you can easily create one filesystem for each user. When you run low on storage, simply attach more disks and add them to the pool. No downtime. This is the scenario on the home directory server that I use at Sun, which uses ZFS on Solaris.

Thus concludes ZFS lesson 1.

(2004-08-30 15:38:27.0)
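The pooled model described in the post can be pictured with a small toy model. This is only an illustrative sketch, not ZFS code: the class name `StoragePool`, its methods, and the sizes are all hypothetical. It shows two things the post claims: filesystems do not get a fixed share of space up front, and adding a disk makes space available to every filesystem at once.

```python
class StoragePool:
    """Toy model: many filesystems draw space from one shared pool."""

    def __init__(self, disk_sizes_gb):
        self.capacity_gb = sum(disk_sizes_gb)
        self.used_gb = {}  # filesystem name -> space used

    def create_filesystem(self, name):
        # Creating a filesystem sets nothing aside up front.
        self.used_gb[name] = 0

    def add_disk(self, size_gb):
        # New space is immediately available to every filesystem.
        self.capacity_gb += size_gb

    @property
    def free_gb(self):
        return self.capacity_gb - sum(self.used_gb.values())

    def write(self, name, size_gb):
        if size_gb > self.free_gb:
            raise OSError("pool is out of space")
        self.used_gb[name] += size_gb


pool = StoragePool([500, 500])
for user in ("alice", "bob"):
    pool.create_filesystem(user)

pool.write("alice", 900)      # one user may consume most of the pool
try:
    pool.write("bob", 200)    # only 100 GB left, so this fails
except OSError:
    pool.add_disk(500)        # grow the pool; no filesystem is resized
    pool.write("bob", 200)

print(pool.free_gb)
```

In the traditional layout each filesystem would carry its own fixed capacity, so free space stranded in one filesystem could not satisfy a write in another.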
I guess you could say that "volumes" are a bit like hardware domains, but without the benefit of hardware isolation, while ZFS filesystems are like Zones in Solaris 10, from the point of view of the CPUs and memory. An imperfect but helpful comparison? Not that many people are familiar with volumes, compared with general system management I'd think (I'm certainly not)
Can different file-systems be optimised differently?
What kind of compression is supported?
Do you see ZFS as something that'll be used absolutely everywhere including very high-end, or is it more low-mid range? What about Sun's high-end storage software like SamFS and QFS?
Are there any types of applications you would not recommend ZFS for? (ie any workload types it would be poorly suited towards?)
Posted by Chris Rijk on August 30, 2004 at 04:14 PM PDT #
Thanks for "thinking outside the box" - easier said than done!
Posted by Chris Rijk on August 30, 2004 at 04:44 PM PDT #
I hadn't thought of the analogy with Zones, but you're absolutely right. Zones and ZFS both allow you to control different logical groupings of things (application environments and files, respectively) administratively, without having to statically partition your hardware. That sounds a bit technical, but both shifts open the door to lots of exciting new features.
Filesystems can be configured differently -- in fact, that's the entire reason for having different filesystems in ZFS. In ZFS, a filesystem is simply an administrative control point. Some of the things you can set on filesystems are quotas, reservations and compression. More about those in future posts!
Currently we have a very fast but very simple compression algorithm, called LZJB. It's what Solaris uses to compress crash dumps, so the code was already there, and it turns out to be a good balance. Eventually we'll probably add the ability to turn on more powerful but slower compression.
Personally, I'd like to make ZFS the best at everything, thus relegating other filesystems and volume managers to legacy applications. We aren't there yet (eg. support for HSM solutions like SAM), but hopefully someday...
I think that describing storage solutions as "high-end" or "mid-range" can be a bit misleading. Some filesystems are better at performing certain tasks than others. SAM/QFS is better than ZFS at dealing with tape drives. ZFS is better at other tasks. I don't think that SAM/QFS supports snapshots (at least not snapshots as efficient and flexible as ZFS's). Depending on your criteria, one or the other may be a better solution.
Posted by Matthew Ahrens on August 30, 2004 at 04:57 PM PDT #
Hi,
ZFS sounds nice, indeed.
What I would like to know is: if you use ZFS, then you don't need software like SVM or Veritas for volume management at all? Is there anything you can think of that could be realized with one of these and couldn't with ZFS? Is it even possible to use a volume manager if you have ZFS?
If answering these would mean getting into too much detail for now we'll all wait for the ExpertExchange session.
Posted by Vlad Grama on August 30, 2004 at 07:25 PM PDT #
Posted by Matthew Ahrens on August 30, 2004 at 07:58 PM PDT #
Posted by Jaime Cardoso on August 31, 2004 at 02:47 AM PDT #
Posted by Nick Towers on August 31, 2004 at 04:46 AM PDT #
I have a couple of questions though:
Does ZFS have something similar to the VxVM concept of diskgroups that can be deported/imported to other systems? I.e., how is access control to the diskpools going to work in a SAN environment?
Speaking of clustered systems, will ZFS support multiple writers like Veritas' Cluster File System?
On the examples on page 10 of the white paper, it shows creating multiple file systems in the same disk pool. Will there be any way to set the maximum size a file system can grow to, so it doesn't consume the whole pool? (Or will that be the job of quotas?)
Will there be utilities to show how much space is used in each chunk of physical space that makes up a pool so that you can easily see if a chunk of space can be removed or not?
In another example on page 10 it shows corrupting one side of a mirror but the data stays intact. When and how does ZFS detect that corruption? It would seem to have to do it on more than just the read, otherwise a failed disk might not be detected until the data is read, giving plenty of time for the redundant copies to fail as well.
Thanks
Posted by Charles on August 31, 2004 at 08:40 AM PDT #
Regarding performance and simplicity and "many administrative problems are caused by poor performance" - I think you could (sort of) reverse the statement as well: simple administration (but sufficiently cunning behind the scenes) can make performance less of an issue. From what I've seen so far, it seems with ZFS that it'd be quite hard to configure the system poorly. So typical real world performance will likely be much closer to optimal than before.
Regarding "mid-range" and "high-end" storage again. Somewhat arbitrary I know, but for the sake of an example, do you think ZFS would be well suited to customers who buy things like StorEdge 9980 systems?
Do you think the availability of ZFS will change how Sun's storage group design hardware in the future? Putting it another way, would ZFS make some storage hardware features redundant?
One thing I'm pretty curious about with regards to the implementation of ZFS is how the "copy on write" function works in the long term. Specifically, what happens when you get to the "end" of the physical storage and need to wrap around? Is it like Java's garbage collection model in HotSpot where you have multiple groups, based on age (and other parameters) - eg eden space, longer-lasting and "permanent" space...? This certainly works for main memory management, but does the same apply to ZFS? Is there a background thread that cleans things up during low activity?
Posted by Chris Rijk on August 31, 2004 at 09:26 AM PDT #
Yes, ZFS makes it harder to shoot yourself in the foot performance-wise :-)
ZFS will work well on all sorts of storage hardware. When you were asking about "high-end" systems, I was thinking about different storage requirements (eg. lots of small files vs. databases vs. streaming large files) rather than different storage hardware (SATA vs FC, JBOD vs hardware RAID, etc), which can be ranked without much controversy (complexity, cost, and performance tend to be correlated).
That said, one of our goals with ZFS is to reduce the manageability and performance gap between simple, inexpensive storage hardware (eg. JBOD's with SATA disks) and complicated, expensive hardware (eg. huge hardware RAID boxes).
We believe that it's easier to administrate a storage system with ZFS and a hundred plain disks than a system with a hardware RAID box. No matter the filesystem you use with the hardware RAID box, the hardware contains an additional layer of software that you have to set up and take care of, and it isn't well integrated with the rest of the storage stack (eg. try adding more disks).
Secondly, while ZFS will perform well on hardware RAID boxes, it will also perform extremely well on plain disks, which cost 5-10x less. If we can get the performance of plain disks to within 90% of hardware RAID for most applications, I think that a lot of our customers will value ease of management and cost more than getting that last 10% of performance. As we continue to improve ZFS' performance, we'll see how much we can close that gap.
To answer your next question straightforwardly: Yes, ZFS will make some hardware features redundant in some situations (the more the better!). I can't really comment on what Sun's storage hardware will look like in the future.
As far as how we reuse space, it isn't garbage collected. Like a traditional filesystem, we know exactly when a block is no longer used, and mark it as available at that point in time. However, given the increased churn rate of a COW system (compared to a traditional overwriting fs), it may be worth investigating some of the garbage collecting techniques you mentioned.
That said, we've found that write performance with our current algorithms is in general quite good. Writes are mostly sequential (or at least, localized to one section of each disk), and generally nobody is waiting for them to complete, so latency is not a big concern. A bigger concern for us is read performance -- there's always someone waiting for a read, and we have to do the I/O to a specific spot on a specific disk (at best we'll have a choice of two disks in a mirrored configuration). So we've spent a good deal of time on I/O scheduling to try and improve reads. For example, we have a deadline I/O scheduling algorithm which allows us to give priority to reads over writes (and more generally, to I/Os that are being waited for synchronously over asynchronous I/O). We still have a lot of research to do on how to lay out data on disk to make it faster to read in later. This is quite challenging because you're trying to predict future behavior and plan for it...
ps. I've enjoyed your articles on Ace's.
Posted by Matthew Ahrens on August 31, 2004 at 01:53 PM PDT #
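The deadline scheduling idea in the comment above (I/Os that someone is waiting on go ahead of background I/Os) can be sketched with a priority queue ordered by deadline. This is a minimal illustration, not the ZFS scheduler: the class `DeadlineQueue`, the I/O kinds, and the slack values in `SLACK_MS` are all assumptions made up for the example.

```python
import heapq
import itertools

# Hypothetical slack values (milliseconds): I/Os that someone is waiting on
# get a near deadline, background I/Os a far one.
SLACK_MS = {"sync_read": 10, "sync_write": 20, "async_write": 500}


class DeadlineQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order

    def submit(self, now_ms, kind, block):
        deadline = now_ms + SLACK_MS[kind]
        heapq.heappush(self._heap, (deadline, next(self._seq), kind, block))

    def next_io(self):
        # Always issue the I/O whose deadline is earliest.
        _deadline, _seq, kind, block = heapq.heappop(self._heap)
        return kind, block

    def __len__(self):
        return len(self._heap)


q = DeadlineQueue()
q.submit(0, "async_write", 100)
q.submit(1, "async_write", 101)
q.submit(5, "sync_read", 7)      # arrives later, but someone is waiting
q.submit(6, "sync_write", 42)

order = []
while q:
    order.append(q.next_io())
print(order)
```

Ordering by deadline rather than by a fixed priority class means background writes are not starved forever: once an old write's deadline is earlier than that of a newly arrived read, the write goes first.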
As of yet we can not use ZFS on root disks but this is a problem that we're actively working on and it will be possible at some point. As you may be aware, booting is quite a challenge (especially on x86 hardware).
ZFS certainly provides high availability ("HA"), but I'm not sure exactly what problem you're trying to solve. ZFS does provide mirroring. If you have your storage attached to multiple machines, you will be able (at some point) to switch control of the pool over when the current owner dies.
Posted by Matthew Ahrens on August 31, 2004 at 02:03 PM PDT #
Snapshots are of entire filesystems. However, remember that in ZFS, creating a filesystem is nearly as easy as creating a directory. So it's easy to create a different filesystem for each subtree which you want to snapshot differently. You'll be able to 'roll back', resetting the contents of a filesystem to what it was when a snapshot was taken. We're still figuring out how to best make the snapshots accessible to the user. Certainly you'll be able to access them each as a separate read-only filesystem. We're looking into adding an option to turn on '.snapshots' directories, like what NetApp's WAFL has. We're also investigating zero-copy restore of an individual file.
Look for more posts about snapshots in the future :-)
Posted by Matthew Ahrens on August 31, 2004 at 02:12 PM PDT #
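The snapshot behavior described above (a snapshot covers an entire filesystem, can be read as a read-only view, and can be rolled back to) can be illustrated with a toy model. This sketch only mirrors the semantics; it says nothing about how ZFS stores snapshots on disk, and the `Filesystem` class and its method names are hypothetical.

```python
class Filesystem:
    """Toy copy-on-write filesystem: names map to immutable bytes."""

    def __init__(self):
        self.files = {}
        self.snapshots = {}

    def write(self, name, data):
        # Never modify old data in place; bind the name to new bytes.
        self.files[name] = bytes(data)

    def snapshot(self, snap_name):
        # Copies only the name -> data mapping; file contents are shared.
        self.snapshots[snap_name] = dict(self.files)

    def read_snapshot(self, snap_name, name):
        # Read-only access to the filesystem as it was at snapshot time.
        return self.snapshots[snap_name][name]

    def rollback(self, snap_name):
        # Reset the live contents to what they were at snapshot time.
        self.files = dict(self.snapshots[snap_name])


fs = Filesystem()
fs.write("notes.txt", b"version 1")
fs.snapshot("monday")
fs.write("notes.txt", b"version 2")
fs.write("scratch.txt", b"temporary")

old = fs.read_snapshot("monday", "notes.txt")
fs.rollback("monday")
```

Because writes never overwrite existing data in this model, taking a snapshot costs only a copy of the name table, which is the intuition behind cheap snapshots on a copy-on-write design.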
I'm not an expert in VxVM or Veritas' Cluster Filesystem, so please excuse me if I misunderstand your question. It will be possible to export a pool from one machine and then import it to another. This can be done without touching any cables if the storage is connected to both machines simultaneously (eg. on a SAN).
ZFS does not support multiple writers. ZFS does support NFS.
You can use quotas to put an upper limit on the space used by a filesystem. You can also use reservations to guarantee that you will be able to store at least a certain amount of data in a filesystem before running out of space (in spite of any other filesystem's space usage). Expect more posts on this in the future!
ZFS will provide a way to show how much space is used in each disk. If you want to reduce the amount of space in a pool by removing a disk, you could use this to choose the least-full disk, thus minimizing the time it will take to migrate that data to other disks.
ZFS detects and corrects corruption when data is read. It will also 'scrub' the disks by periodically reading all the data on them. This serves to actively seek out and repair corruption, minimizing the amount of time that any data is not stored redundantly. Of course repair is usually only possible if the data is stored redundantly (eg. mirror or RAID5).
Posted by Matthew Ahrens on August 31, 2004 at 02:25 PM PDT #
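The quota and reservation semantics in the comment above can be sketched as simple space accounting on a shared pool. This is one plausible reading rather than ZFS's actual accounting: the `Pool` class, its field names, and the rule that each filesystem claims the larger of its usage and its reservation are assumptions made for illustration.

```python
class Pool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.fs = {}  # name -> {"used", "quota", "reservation"}

    def create(self, name, quota=None, reservation=0):
        self.fs[name] = {"used": 0, "quota": quota, "reservation": reservation}

    def _available_to(self, name):
        # Every *other* filesystem claims max(used, reservation) from the
        # pool; what remains, minus our own usage, is what we may still write.
        claimed = sum(
            max(f["used"], f["reservation"])
            for n, f in self.fs.items()
            if n != name
        )
        return self.capacity - claimed - self.fs[name]["used"]

    def write(self, name, size):
        f = self.fs[name]
        if f["quota"] is not None and f["used"] + size > f["quota"]:
            return False  # would exceed this filesystem's quota
        if size > self._available_to(name):
            return False  # would eat into another filesystem's reservation
        f["used"] += size
        return True


pool = Pool(capacity=100)
pool.create("home", quota=30)
pool.create("db", reservation=40)
pool.create("scratch")

results = [
    pool.write("home", 30),     # fits exactly within the quota
    pool.write("home", 1),      # refused: quota reached
    pool.write("scratch", 40),  # refused: only 30 is free outside db's reservation
    pool.write("scratch", 30),  # accepted
    pool.write("db", 40),       # accepted: the reserved space was kept free
]
print(results)
```

The last write succeeds even though the other filesystems tried to fill the pool, which is the guarantee a reservation is meant to give.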
Maybe the biggest issue for ZFS adoption will simply be that it's "different" - it's not simply "more of the same" compared to what came before. In other words, it requires users to think in a different way. Hope these questions here will help in writing documentation or FAQs!
PS Glad you liked my articles at Ace's Hardware. I have another one coming soon (probably next month).
Posted by Chris Rijk on August 31, 2004 at 03:59 PM PDT #
Posted by c0t0d0s0.org on August 31, 2004 at 10:14 PM PDT #
Thanks.
Posted by Vlad Grama on September 01, 2004 at 02:20 PM PDT #
Posted by Ryan Matteson on September 01, 2004 at 06:59 PM PDT #
ZFS is still under development, so while we perform lots of benchmarks, we don't have any to publish at this time. That said, it's possible to make some generalizations about the natural tradeoffs between statically-laid-out filesystems (eg. VxFS, UFS) compared to copy-on-write filesystems (eg. ZFS, WAFL). For example, doing writes to random offsets in a pre-allocated file will perform very poorly on a SLO fs, since it has to seek the disk head to a different location for every I/O; on a COW fs all the writes will be as contiguous as possible and thus very fast. On the other hand, if you try to read that randomly-written file sequentially, the SLO fs will be doing sequential reads whereas the COW filesystem will have to seek the disk head for every I/O. We believe that *most* applications don't behave this way (random writes, sequential reads), although there definitely are some important ones that do. We're investigating various ways to improve performance for this workload.
I'm working under the assumption that ZFS will be part of Solaris and licensed as Solaris is. However, licensing is not up to me and it could change. Personally, I'd like ZFS to be as easy to acquire as possible.
As far as mirroring, you will always have the option to explicitly control which pairs of disks are mirrored. We're also investigating how to make this task more automated. Ideally, the system would know about how all the storage is connected, and be able to reduce single points of failure by automatically mirroring disks with the least path in common (eg. different JBODs, different controllers, different I/O boards, etc).
Posted by Matthew Ahrens on September 01, 2004 at 07:53 PM PDT #
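The random-write, sequential-read tradeoff described above can be made concrete by counting disk seeks in a toy model. The sketch below is heavily simplified and assumes a single disk, no caching, one block per I/O, and a COW allocator that always hands out the next free physical block; the function `count_seeks` and the block count `N` are hypothetical.

```python
import random


def count_seeks(physical_positions):
    """Count I/Os that do not land right after the previous one."""
    return sum(
        1
        for prev, cur in zip(physical_positions, physical_positions[1:])
        if cur != prev + 1
    )


N = 1000
rng = random.Random(0)
write_order = list(range(N))   # logical block numbers...
rng.shuffle(write_order)       # ...overwritten in random order

# Statically laid out: logical block i always lives at physical block i.
slo_write_seeks = count_seeks(write_order)
slo_read_seeks = count_seeks(list(range(N)))

# Copy-on-write: each overwrite goes to the next free physical block.
cow_location = {logical: N + i for i, logical in enumerate(write_order)}
cow_write_seeks = count_seeks([cow_location[b] for b in write_order])
cow_read_seeks = count_seeks([cow_location[b] for b in range(N)])

print("SLO: write seeks", slo_write_seeks, "read seeks", slo_read_seeks)
print("COW: write seeks", cow_write_seeks, "read seeks", cow_read_seeks)
```

In this model the two designs are mirror images: the statically-laid-out file seeks on nearly every random write and never on the sequential read, while the copy-on-write file never seeks while writing and seeks on nearly every block of the later sequential read.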
Posted by Iwan Ang on September 01, 2004 at 11:00 PM PDT #
More documentation is certainly forthcoming. I plan to post some usage scenarios in the coming weeks.
We're absolutely trying to make disk storage more like memory, and often use that analogy in our presentations. For example, when you add DIMMS to your computer, you don't run some 'dimmconfig' program or worry about how the new memory will be allocated to various applications; the computer just does the right thing. Applications don't have to worry about where their memory comes from.
Likewise with ZFS, when you add new disks to the system, their space is available to any ZFS filesystems, without the need for any further configuration. In most scenarios it's fairly straightforward for the software to make the unequivocally best choices about how to use the storage. If you want to tell the system more about how you want the storage used, you'll be able to do that too (eg. this data should be mirrored but that data should not; it's more important for this data to be accessed quickly but that can be slower). We hope that with relatively modern hardware, all but the most complicated and demanding configurations will be handled adequately without any administrator intervention.
I believe that the "19 nines" guarantee you're asking about is a reference to our end-to-end data integrity. We use 64-bit checksums on all of our data, so we can detect with extremely high probability any difference between what we wrote and what we later read.
Unlike some other checksum schemes, ours is truly end-to-end, in that we verify that what we read back to the computer's main memory is the same as what was in memory when we wrote it to disk. This protects against errors that may occur at many different levels: bit rot on the disk, disk firmware bugs, signal degradation in the cables, bugs in the I/O controller card, or problems in the PCI bus.
I'm not aware of any insurance policy against data loss. If there was one, it would be pretty easy to scam, which may discourage potential insurers :-)
Posted by Matthew Ahrens on September 02, 2004 at 12:00 AM PDT #
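The checksum discussion above, together with the earlier remark in this thread that repair needs a redundant copy, can be sketched as a checksummed two-way mirror. Several things here are assumptions: `blake2b` truncated to 64 bits is only a stand-in and is not the checksum ZFS uses, the `Mirror` class is hypothetical, and keeping the checksum in a separate table is a modeling choice. The final lines show one plausible reading of "19 nines": if a corrupted block produced a uniformly random 64-bit checksum, it would go undetected with probability 2^-64.

```python
import hashlib
from decimal import Decimal, getcontext


def checksum64(data: bytes) -> bytes:
    # Stand-in 64-bit checksum; NOT the algorithm ZFS uses.
    return hashlib.blake2b(data, digest_size=8).digest()


class Mirror:
    """Two copies of each block, plus a checksum kept apart from the data."""

    def __init__(self):
        self.disks = [{}, {}]
        self.checksums = {}

    def write(self, block_id, data):
        # The checksum is computed from the data as it sits in memory.
        self.checksums[block_id] = checksum64(data)
        for disk in self.disks:
            disk[block_id] = data

    def read(self, block_id):
        expected = self.checksums[block_id]
        for disk in self.disks:
            data = disk[block_id]
            if checksum64(data) == expected:
                # Repair any copy that does not match the checksum.
                for other in self.disks:
                    if checksum64(other[block_id]) != expected:
                        other[block_id] = data
                return data
        raise OSError("all copies are corrupt; cannot repair")

    def scrub(self):
        # Read every block so latent corruption is found and repaired early.
        for block_id in self.checksums:
            self.read(block_id)


m = Mirror()
m.write(1, b"hello")
m.disks[0][1] = b"hellp"          # simulate silent corruption on one side
recovered = m.read(1)             # detected, served from the good copy

# Chance of catching a corruption, under the uniform-checksum assumption.
getcontext().prec = 40
p_detect = 1 - Decimal(1) / Decimal(2 ** 64)
digits = str(p_detect)[2:]
nines = len(digits) - len(digits.lstrip("9"))
print(recovered, nines)
```

With a single unmirrored disk the same check would still detect the bad block, but `read` would have nothing to repair it from.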
Of course, if you have lots of main memory, relative to the size of the files (or large files but small "hotspots" of access), it should cache pretty well, helping to reduce the problem. Right?
Some kind of automatic de-fragmenter...?
PS A number of years ago, I wrote a little library in Perl to do data persistence on a hash (associative array). This was due to limitations (portability, size, limitation on extras we could install) for a particular project. I used a pretty simple algorithm where all data inserts and updates were simply appended to the end of the file. You could call a little routine to re-create the data cleanly, which would effectively de-frag it and eliminate old data.
It was seriously fast. I was rather amused to see it beat compiled C programs by about 5-10x on creating a database from scratch. (Perl is interpreted). Basically, the I/O was far more efficient. Since the files were pretty small, it also cached well on reads.
All this talk reminded me of it...
Posted by Chris Rijk on September 02, 2004 at 07:40 AM PDT #
Matthew, you unleashed a monster.
Today I had 5 requests for more information about ZFS. Customers read your weblog and word got around.
I have a specific case where the customer was evaluating IBM vs Sun (with SAM FS) and, now, ZFS is on the list.
I'm having a bigger reaction to ZFS than to Containers AND dtrace together, people are going crazy (and, of course, that it's up to the reseller to get information about something that doesn't even appear in Solaris eXpress).
Thought you'd like to know.
PS. 5 requests a day may not appear much but, I live in Portugal, a country with less than 10 million people, you do the math for your scale.
Posted by Jaime Cardoso on September 03, 2004 at 05:54 AM PDT #
Posted by Matthew Ahrens on September 03, 2004 at 01:20 PM PDT #
I've been studying what you wrote in lesson 1 and, I have a couple questions more. Imagine I have a 10MB file in my ZFS disk. I open the file with my application and, I change a single byte.
- From what I understood about copy-on-write, only 1 block of information is actually written to the disk (instead of the 10MB). Is that correct? It looks a little like SciFi.
- Are you planning on adding something like undo in the filesystem? If what I understood about copy-on-write is correct, you could make older versions of a changed file available to the storage admin. Is that correct?
Posted by Jaime Cardoso on September 06, 2004 at 02:50 PM PDT #
As with most block-based filesystems, modifying a single byte only requires you to write out the block containing it, not the entire file. However, in a copy-on-write filesystem like ZFS where that block is written to a new location, the block that points to it must be "modified" (ie. written to a new location) as well, and so on up the tree of blocks.
If implemented naively, this would require writing out a lot more data. In ZFS, we batch a whole bunch of transactions (a few seconds worth) into a "transaction group", and write them all out together. This technique allows us to write each modified indirect block once per transaction *group*, instead of per transaction. Most of the modified indirect blocks will be modified in many different transactions, but we only need to write that block to disk once every transaction group (ie. at most once every few seconds), so most of the changes to it telescope away.
You can recover the older versions of a changed file using snapshots.
Posted by Matthew Ahrens on September 08, 2004 at 12:08 PM PDT #
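The effect of transaction groups described above can be estimated by counting block writes in a toy block tree. This is a back-of-the-envelope sketch, not ZFS's on-disk layout: the values of `FANOUT` and `DEPTH`, and the helper names, are hypothetical.

```python
FANOUT = 128  # hypothetical: block pointers per indirect block
DEPTH = 3     # hypothetical: levels of indirect blocks above the data


def blocks_on_path(leaf):
    """Return the data block and every indirect block above it."""
    path = [(0, leaf)]                 # (level, index within that level)
    index = leaf
    for level in range(1, DEPTH + 1):
        index //= FANOUT
        path.append((level, index))
    return path


def writes_per_transaction(leaves):
    # Naive: every transaction rewrites its whole path up the tree.
    return sum(len(blocks_on_path(leaf)) for leaf in leaves)


def writes_per_group(leaves):
    # Batched: each dirty block is written once for the whole group.
    dirty = set()
    for leaf in leaves:
        dirty.update(blocks_on_path(leaf))
    return len(dirty)


# 1000 single-block modifications to neighbouring blocks of one file.
modified = list(range(1000))
naive = writes_per_transaction(modified)
batched = writes_per_group(modified)
print(naive, batched)
```

With these numbers the naive scheme writes 4000 blocks while the batched scheme writes 1010: the 1000 data blocks plus only 10 indirect blocks, because the shared ancestors are written once per group instead of once per transaction.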
Posted by Val Henson on September 09, 2004 at 09:18 PM PDT #
Posted by Kevin Martin on September 28, 2004 at 09:48 AM PDT #
Posted by Matthew Ahrens on September 28, 2004 at 10:18 AM PDT #
Posted by doug on October 06, 2004 at 07:33 PM PDT #
Currently, we support HA storage+, which means that ZFS can use shared disks, as long as only a single machine is using the pool at a time.
Does this mean multiple machines can't read or write from their own filesystems (which are part of the same pool) simultaneously? That sounds like a huge performance hit.
Another question--we want to use the x86 Solaris 10 port. Are there any HBA's supported so that we can use a SAN to create a pool? I couldn't find any in the compatibility list.
Posted by JAT on November 04, 2004 at 04:10 PM PST #
Posted by JAT on November 04, 2004 at 04:39 PM PST #
I attended the Solaris 10 BOF at LISA and heard about some of the features of 5.10 and ZFS. I was wondering specifically about "ZFS also ... supports the full range of NFSv4/NT-style ACLs."
What does this mean regarding the NT-style ACLs, especially from the standpoint of accessing them? For example, I don't see or expect that ZFS will provide CIFS, so how would they be interpreted or managed? Or is this a reference to the similar capability of the extended attributes in UFS, with setfacl and getfacl?
Thanks for the info!
Posted by David Pullman on November 23, 2004 at 10:30 AM PST #