2009.Q3 Update for Size Calculator
With today's release of the 2009.Q3 update for 7000 Series Unified Storage Appliances, there are a few enhancements to storage profiles that we can take advantage of. The trusty size calculator tool needed modification to use them, so I've spent some time putting them together. The resulting tool can be downloaded here; details about what has changed follow below.
The most obvious change from a size calculator perspective with the 2009.Q3 release is that the appliance has some new storage profiles (namely Triple Parity RAID with wide stripes, and Three-Way Mirroring). When run against an updated appliance or simulator, the tool will show these new profiles:
$ ./sizecalc.py 172.16.131.131 *** 12
Sun Storage 7000 Size Calculator Version 2009.Q3
type          NSPF  width  spares  data drives  raw (TB)  usable (TiB)
mirror        False     2       4          284    142.00        127.13
mirror        True      2       4          284    142.00        127.13
mirror3       False     3       6          282     94.00         84.16
mirror3       True      3       6          282     94.00         84.16
raidz1        False     4       4          284    213.00        190.70
raidz1        True      4       4          284    213.00        190.70
raidz2        False    14       8          280    240.00        214.87
raidz2        True     14       8          280    240.00        214.87
raidz2 wide   False    47       6          282    270.00        241.73
raidz2 wide   True     20       8          280    252.00        225.61
raidz3 wide   False    56       8          280    265.00        237.25
raidz3 wide   True     35       8          280    256.00        229.19
stripe        False     0       0          288    288.00        257.84
** As of 2009.Q3, the raidz2 wide profile has been deprecated.
** New configurations should use the raidz3 wide profile.
In addition to supporting these new storage profiles, I wanted to enhance the calculator to better handle the existing variety of drives we support in the 7210 (the 2009.Q2 release could only model 1TB disks), and prepare to model configurations with drives larger than 1TB in the 7310 and 7410. The revised help message explains how to declare different disk sizes in more detail (run sizecalc.py without any arguments to read it), but the examples below should help to highlight the new capability and how it ties in with the previous enhancements.
Here is an example modelling a 7210 configuration using 500GB disks; you can see that the size keyword and argument are prepended to the JBOD layout:
$ ./sizecalc.py 172.16.131.131 *** size 500G 1 t1
Sun Storage 7000 Size Calculator Version 2009.Q3
type          NSPF  width  spares  data drives  raw (TB)  usable (TiB)
mirror        False     2       3           42     10.50          9.40
mirror3       False     3       3           42      7.00          6.27
raidz1        False     4       5           40     15.00         13.43
raidz2        False    14       3           42     18.00         16.12
raidz2 wide   False    43       2           43     20.50         18.35
raidz3 wide   False    43       2           43     20.00         17.91
stripe        False     0       0           45     22.50         20.14
** As of 2009.Q3, the raidz2 wide profile has been deprecated.
** New configurations should use the raidz3 wide profile.
When combined with Eric Schrock's modeling feature, we can specify a different drive size each time we 'add' a JBOD layout (we're not actually selling 2TB drives yet, but the tool will allow us to model that configuration anyway):
$ ./sizecalc.py 172.16.131.131 *** 1 add size 2T 1
Sun Storage 7000 Size Calculator Version 2009.Q3
type          NSPF  width  spares  data drives  raw (TB)  usable (TiB)
mirror        False     2       4           44     33.00         29.54
mirror3       False     3       6           42     21.00         18.80
raidz1        False     4       8           40     45.00         40.29
raidz2        False    11       4           44     54.00         48.35
raidz2 wide   False    23       2           46     63.00         56.40
raidz3 wide   False    23       2           46     60.00         53.72
stripe        False     0       0           48     72.00         64.46
** As of 2009.Q3, the raidz2 wide profile has been deprecated.
** New configurations should use the raidz3 wide profile.
In this example, we have supplied the configuration "1 add size 2T 1", which tells the calculator to initially model a single JBOD with the default drive size, which is 1TB, and then add a new JBOD with 2TB drives. You can see that the total number of disks is only 48; however, the results are based on 72TB of raw capacity.
As always, you can find the latest 7000 information in the Fishworks wiki. Happy calculating!
EOF
Posted at 03:32PM Sep 16, 2009 by Ryan Matthews in 7000Series | Comments[0] | permalink
2009.Q2 Update for Size Calculator
After introducing new storage expansion options on the 7210, and shipping a large number of 7000 series Unified Storage Systems, we discovered that the size calculator tool was in need of a couple of improvements.
First, we needed to be able to model 7210 configurations with expansion modules. With the previous software, no expansion was possible, so I just maintained a simple table of the shipping configurations and left the size calculator to handle the more intricate 7410 configurations. Adapting that table to the new matrix of potential configurations, however, is unwieldy at best.
Second, we wanted to be able to model the growth of systems over time. Customers who bought systems have been coming back and asking us questions like "If I add two more J4400's to my system, how much usable capacity will I have?". Previously, we would have run the calculator with the original configuration, recorded the output and then run the calculator again with the new configuration and added both capacities together. This was becoming annoying if you were mucking around with configurations to see the impact over time.
So, out of necessity the 2009.Q2 release of the size calculator was born. Eric Schrock decided to spend a little time to build a modeling feature into the tool to help answer complicated questions, and I built on his modifications to add 7210 support with configuration modeling.
The revised help message is a little more detailed than this (run sizecalc.py without any arguments to read it), but the examples should highlight the new features clearly. The coolest new feature is Eric's configuration modeling - it allows you to submit a number of configurations together and have the calculator add them together for you. Here is an example of the modeling feature in action:
$ ./sizecalc.py 172.16.131.131 *** 1 h1 add 1 h add 1
Sun Storage 7000 Size Calculator Version 2009.Q2
type          NSPF  width  spares  data drives  raw (TB)  usable (TiB)
mirror        False     2       5           42     21.00         18.80
raidz1        False     4      11           36     27.00         24.17
raidz2        False 10-11       4           43     35.00         31.33
raidz2 wide   False 10-23       3           44     38.00         34.02
stripe        False     0       0           47     47.00         42.08
We have supplied the configuration "1 h1 add 1 h add 1" which tells the size calculator that the initial configuration consists of 1 JBOD which is half full (the h in h1), and has one Log device (the 1 in h1), and that we will then add a second half JBOD (add 1 h), and later, we will add another full JBOD (add 1). The calculator then computes all three, and adds them together to produce the output shown. What's interesting is that you can see that the dual parity wide configurations have different widths, but the calculator just summarizes those as '10-23' and adds their contributed capacity to the pool.
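As a rough cross-check, the drive total in the stripe row of the output above can be reproduced by hand. The sketch below is not taken from sizecalc.py: the 24-slot JBOD and 12-slot half JBOD figures are assumptions inferred from that output, and the `Stage` helper is hypothetical.

```python
from dataclasses import dataclass

# Assumed slot counts, inferred from the example output (not from sizecalc.py).
FULL_JBOD_SLOTS = 24
HALF_JBOD_SLOTS = 12


@dataclass
class Stage:
    """One step of a modeled configuration (hypothetical helper)."""
    jbods: int
    half: bool = False
    log_devices: int = 0

    def data_drives(self) -> int:
        slots = HALF_JBOD_SLOTS if self.half else FULL_JBOD_SLOTS
        # Log devices occupy drive slots, so they reduce the data drive count.
        return self.jbods * slots - self.log_devices


# The configuration "1 h1 add 1 h add 1", one Stage per step.
stages = [
    Stage(jbods=1, half=True, log_devices=1),  # 1 h1
    Stage(jbods=1, half=True),                 # add 1 h
    Stage(jbods=1),                            # add 1
]
total_drives = sum(stage.data_drives() for stage in stages)
print(total_drives)  # 47
```

With these assumptions the three steps contribute 11, 12 and 24 drives, which agrees with the 47 drives reported for the stripe profile.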
When Eric's modeling feature is combined with my 7210 enhancements, we can also model the 7210 like this:
$ ./sizecalc.py 172.16.131.131 *** 1 t1 add 1 t
Sun Storage 7000 Size Calculator Version 2009.Q2
type          NSPF  width  spares  data drives  raw (TB)  usable (TiB)
mirror        False     2       4           89     45.00         40.29
raidz1        False     4       6           87     66.00         59.09
raidz2        False    11       6           87     72.00         64.46
raidz2 wide   False 44-46       4           89     86.00         76.99
stripe        False     0       0           93     94.00         84.16
In this example, we have supplied the configuration "1 t1 add 1 t", which tells the calculator to start with a single 7210 controller containing a single Log device (1 t1) and to add a J4500 expansion module (add 1 t).
As always, you can find the latest 7000 information in the Fishworks wiki. Happy calculating!
EOF
Posted at 05:13PM Jun 15, 2009 by Ryan Matthews in 7000Series | Comments[0] | permalink
Q&A on Hybrid Storage and SSDs
In a previous post I scratched the surface of how ZFS uses the ZFS Intent Log (ZIL), and how the 7000 Series uses Solid State Disk (SSD) to accelerate its performance. After having presented the Hybrid Storage Pool to more than a hundred customers, I can say that questions around how the 7000 leverages SSDs, and how it handles SSD failure are among the most frequently asked. I hope that I can expand on my previous entry here and explain things in clear detail. I apologize in advance that my artwork is not nearly what it could be, but I wanted to share the information I have.
Background
Before we can cover the detail of how the file system leverages SSD and handles SSD failure, we need to understand the basic components of ZFS, and how the data flows between them. The ZFS file system is made up of a number of modules and layers. The interfaces that we use to store data run as modules at the top of the stack. In the 7000, we know these as Filesystems and LUNs in the Shares section of the BUI.
Both user level interfaces connect using transactions to a layer called the Data Management Unit (DMU). The DMU manages the storage and retrieval of data independent of its structure (the structure is implemented above by the modules that give us Filesystems and LUNs); it is the coordinator, orchestrating the movement of data between the various components below.
One of the key components it manages is called the Adaptive Replacement Cache (ARC). The ARC is used as a cache for both read and write operations as well as key file system data and metadata. With the exception of the cached copy of the ZIL (more on that later) and the actual write data cache, anything that can live in the ARC can also live in the Level 2 ARC (L2ARC), which is a 'disk' based extension of the primary cache designed to operate as a second tier in the system storage model. I will cover the L2ARC as it relates to the 7000 in more detail later, but if you're itching for details, check out Brendan's blog entry on it here.
Another component managed by the DMU is the ZIL. As I discussed in my previous entry, the ZIL is the journal that allows the file system to recover from system failures. The ZIL must always exist on non-volatile storage in order to ensure it will be there to recover from. By default, the ZIL is stored inside the storage pool, however it can also be stored on a dedicated disk device called a log device. Regardless of how the system is configured to store the ZIL, it is always cached in system memory while running in order to improve performance.
Below all of the caching tiers is the disk pool itself. It is built from groups of disk devices. In the Hybrid Storage Pool, this is where the data protection happens.
Translating This to the 7000 Series
In the Sun Storage 7000 Series, we use
SSD to accelerate some of the components of the storage
infrastructure. First, we use Write-Optimized SSDs to store the
non-volatile copy of the ZIL. For our use case the devices we are
shipping with the system today are capable of about 10,000 operations
per second and use a supercapacitor to ensure that the device can
stay powered long enough to write all data to the flash chips.
Second, we use Read-Optimized SSDs to store the L2ARC. The devices
we are shipping with the system today vary in read performance
depending on the size of the operation being used, but are somewhere
between 16 and 64 times faster than a standard disk device for read
operations.
Q: How does data get into the Write Optimized SSD?
A: First, either a filesystem or LUN receives the new data to be written. That module then creates a transaction to add the new data to the currently open transaction group (TXG) in the DMU. As part of the transaction, the data is sent to the ARC while the Write Optimized SSD containing the ZIL is updated to reflect the changes. As new transactions continue to happen, they are logged sequentially to the ZIL.
Q: I've heard that SSDs can "wear out" if you write to them too many times. How do you prevent that from happening to the Write Optimized SSD?
The system treats the SSD as a circular buffer: it starts writing at the beginning of the disk, continues in order until it reaches the end, and then resumes at the beginning of the disk. This sequential pattern helps to minimize the risk of 'wearing out' the SSD over time. Some people I have explained this to express concern that the system could overwrite data required for recovery in this model; however, the system is very aware of which parts of the disk contain active data and which parts contain inactive data.
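To make the wrap-around behaviour concrete, here is a toy model of a log written as a ring. It is only an illustration under simplifying assumptions, not ZFS code: the `CircularLog` class, its block granularity and its `commit` method are all hypothetical.

```python
from typing import List


class CircularLog:
    """Toy model of a log device written as a ring (not ZFS code)."""

    def __init__(self, size: int) -> None:
        self.size = size    # number of blocks on the device
        self.head = 0       # next block to be written
        self.active = 0     # blocks still needed for recovery

    def append(self, blocks: int = 1) -> List[int]:
        """Write blocks sequentially, wrapping at the end of the device."""
        if self.active + blocks > self.size:
            # Refuse to overwrite records that recovery still depends on.
            raise RuntimeError("would overwrite active log records")
        written = [(self.head + i) % self.size for i in range(blocks)]
        self.head = (self.head + blocks) % self.size
        self.active += blocks
        return written

    def commit(self) -> None:
        """Once the data is safely in the pool, old records become inactive."""
        self.active = 0


log = CircularLog(size=8)
first = log.append(6)    # blocks 0 through 5
log.commit()             # those records are no longer needed
second = log.append(4)   # blocks 6 and 7, then wraps to 0 and 1
```

Because the model tracks how many blocks are still active, a write that would run over live records is rejected rather than silently wrapping onto them.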
Q: So how does the data get from the Write Optimized SSD to disk?
Surprisingly, the answer is that it doesn't -- at least not in the way most people think. The trick here is that the ZIL is actually cached in the ARC for performance reasons. So, every few seconds when the system begins a commit cycle for the current transaction group it reads the copy of the ZIL in memory. This is the point at which the data will be integrated into the pool. If the data requires compression, it will be compressed and then a checksum for the data is generated. The system decides where the data should live, and then finally the data is synchronized from the ARC to disk.
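The flow described in this answer can be sketched as a small simulation. Everything in it is a stand-in chosen for illustration: `ToyPool` is hypothetical, zlib and SHA-256 merely represent "compress" and "checksum", and block placement is reduced to a counter. The point it shows is that the commit reads from memory and never from the log device.

```python
import hashlib
import zlib


class ToyPool:
    """Toy model of the write path (hypothetical, not ZFS code)."""

    def __init__(self) -> None:
        self.log_device = []   # non-volatile copy of the intent log
        self.open_txg = []     # in-memory copy held in the "ARC"
        self.disk = {}         # block address -> (checksum, payload)

    def write(self, data: bytes) -> None:
        # A new write is logged and cached in memory at the same time.
        self.log_device.append(data)
        self.open_txg.append(data)

    def commit_txg(self) -> None:
        # The commit works from the in-memory copy, not the log device.
        for data in self.open_txg:
            payload = zlib.compress(data)
            checksum = hashlib.sha256(payload).hexdigest()
            address = len(self.disk)          # stand-in for block allocation
            self.disk[address] = (checksum, payload)
        self.open_txg.clear()
        self.log_device.clear()               # logged records are now inactive


pool = ToyPool()
pool.write(b"hello")
pool.commit_txg()
stored_checksum, stored_payload = pool.disk[0]
```

After the commit, the log holds nothing that is still needed, which is why the log device is only ever read during recovery.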
Q: What happens if a Write Optimized SSD fails?
From my previous post:
"If the ZIL is stored on a single SSD, and that device fails, the system has a window to flush the ZIL from memory to disk (the Transaction Group Commit I mentioned earlier). Typically in the 7000 Series, this flush happens every 1-5 seconds, but it can take up to 30 seconds on an extremely busy system. Once the data is flushed from memory to disk, the system will use the disk pool to store the ZIL for the next transaction group. This window is the only time in a 7000 series where there is a chance for data loss. We mitigate this risk by mirroring the Write Optimized SSD's in the system."
Q: How does data get into the Read Optimized SSD?
As I mentioned earlier, the Read Optimized SSD is used in the 7000 Series to hold the L2ARC. Since we would prefer to return the most popular data directly from our first level cache in DRAM, we use L2ARC to hold data that has a history of being useful, but hasn't been accessed as recently or as frequently as other data. As the ARC fills up, the system begins to scan the cache for the data that has been accessed least frequently or recently. After finding enough candidates, it begins to copy those blocks from ARC to L2ARC. While this process is happening, the data is still active in the ARC, so if a client did request it, it could be returned. The process that fills the L2ARC operates in batches, so that there are a few larger writes rather than frequent smaller writes, which improves performance.
Q: How do you prevent the Read Optimized SSD from "wearing out"?
Similar to the ZIL, the system writes to the L2ARC in a circular fashion to reduce the risk of wear over time.
Q: When does the system read from the Read Optimized SSD instead of Memory or Disk?
When the system starts to run out of space in the ARC, it will attempt to evict the data that has been accessed least recently or frequently, the same data we copied to the L2ARC earlier. Now that the data has been evicted from the ARC, the lowest latency copy is living in L2ARC. When the next read request comes for that data, the system will find that the data is no longer available in the ARC, and will check the L2ARC to see if it has a copy. If a copy does exist in L2ARC, the checksum will be compared to ensure that there has been no corruption, and then the data will be returned at microsecond latencies. If during the checksum comparison the system finds that the data has for some reason become corrupt in the L2ARC, it releases that copy of the data and reads the correct data from the disk pool.
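The lookup order described here (ARC, then L2ARC with a checksum check, then the disk pool) can be sketched as follows. This is a toy model under stated assumptions, not ZFS code: `ToyCache` is hypothetical, SHA-256 stands in for the real checksum, and caching a block in the ARC after a disk read is a simplification of mine.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Stand-in checksum for the sketch."""
    return hashlib.sha256(data).hexdigest()


class ToyCache:
    """Toy two-tier read cache (hypothetical, not ZFS code)."""

    def __init__(self, disk: dict) -> None:
        self.arc = {}      # key -> data, held in DRAM
        self.l2arc = {}    # key -> (checksum, data), held on the read SSD
        self.disk = disk   # key -> data, the authoritative copy

    def evict(self, key: str) -> None:
        """Copy a block to the L2ARC, then drop it from the ARC."""
        data = self.arc.pop(key)
        self.l2arc[key] = (checksum(data), data)

    def read(self, key: str):
        """Return (data, tier that served the read)."""
        if key in self.arc:
            return self.arc[key], "arc"
        if key in self.l2arc:
            expected, data = self.l2arc[key]
            if checksum(data) == expected:
                return data, "l2arc"
            del self.l2arc[key]    # corrupt copy: release it
        data = self.disk[key]
        self.arc[key] = data       # simplification: cache what we just read
        return data, "disk"


cache = ToyCache({"a": b"block"})
served = [cache.read("a")[1], cache.read("a")[1]]   # disk, then arc
cache.evict("a")
served.append(cache.read("a")[1])                   # l2arc
cache.l2arc["a"] = (cache.l2arc["a"][0], b"bad")    # simulate corruption
recovered, tier = cache.read("a")                   # falls back to disk
```

Since every block in the L2ARC also exists in the pool, dropping a bad or missing copy costs latency but never data.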
Q: What about the Read Optimized SSD, what happens if it fails?
The L2ARC is what we call a clean cache, meaning that all of the data stored in the L2ARC is available somewhere on disk. So if an L2ARC device fails, the system continues to operate returning read requests that would have been cached by that device directly from disk.
EOF
Posted at 02:58PM Apr 08, 2009 by Ryan Matthews in 7000Series | Comments[1] | permalink
Enhancing the Size Calculator
Adam Leventhal produced a really useful tool shortly after we launched the 7000 series which, in combination with our Storage VM, would allow the user to see the usable capacity of various hardware configurations. You can read Adam's original blog entry here. The tool is fantastic, and I use it all the time, but the usable capacities were in raw TB, as reported by the drive manufacturer, which is not what the 7410 sees... You may have run across this with your own PC when you bought that shiny new 1TB drive and powered up the machine to find your OS asking you to format 931GB.
When I used the tool, I found myself manually converting the usable capacity into binary units to find the true usable capacity. With a little effort, I collaborated with Adam to enhance the tool to show a new column that gives the true usable capacity in binary units and accounts for a small filesystem metadata reserve that the 7410 will hold back. The resulting tool can be downloaded here. Of course, for the 7110 and 7210, you can continue to use the tables posted on the 7000 Series wiki here.
For anyone who cares to know the difference between a base 10 TB and a base 2 TB (labeled in the updated tool as TiB):
One TB (as described by disk manufacturers) is 1000 to the 4th power, or 1,000,000,000,000 bytes
One TiB is 1024 to the 4th power, or 1,099,511,627,776 bytes
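For anyone who wants to repeat the conversion by hand, here is a minimal sketch. The 1/64 metadata reserve is an assumption on my part: it is not taken from the tool, but it does reproduce figures the calculator prints (for example 288.00 TB raw shown as 257.84 TiB usable in the 2009.Q3 output), and the function name is hypothetical.

```python
TB = 1000 ** 4    # bytes in a disk manufacturer's terabyte
TIB = 1024 ** 4   # bytes in a tebibyte
GIB = 1024 ** 3   # bytes in a gibibyte


def tb_to_usable_tib(raw_tb: float, reserve: float = 1 / 64) -> float:
    """Convert decimal TB to TiB, holding back an assumed metadata reserve."""
    return raw_tb * TB / TIB * (1 - reserve)


# The "1TB drive shows up as 931GB" effect, with no reserve involved.
print(round(1 * TB / GIB))               # 931

# Raw capacities from the calculator output, converted with the assumed reserve.
print(round(tb_to_usable_tib(288), 2))   # 257.84
print(round(tb_to_usable_tib(142), 2))   # 127.13
```

The unit conversion alone accounts for roughly a 9% drop; the assumed reserve takes off a further 1.5% or so.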
EOF
Posted at 06:28PM Feb 04, 2009 by Ryan Matthews in 7000Series | Comments[0] | permalink
ZIL, SSD, and Other Fun Acronyms
The ZFS Intent Log (or ZIL) is always written to non-volatile storage.
The ZIL allows the file system to recover from crashes without data
loss. In a 7000 Series with Write Optimized SSD, the ZIL is stored on the
Write Optimized SSD, otherwise it is stored in the disk pool. Either
way, it is also available in system memory. The ZIL flushes to the disk pool
every once in a while (this is called a Transaction Group Commit).
In a 7410 cluster, if a fail over occurs under normal conditions the pool is imported by
the alternate node, the ZIL is replayed against the pool, and the pool
is online and ready. You can think of the Write Optimized SSD and ZIL as our NVRAM
if that helps, but we don't need batteries.
If the ZIL is stored on a single SSD, and that device fails, the system
has a window to flush the ZIL from memory to disk (the Transaction Group Commit I mentioned earlier). Typically in the 7000 Series, this flush happens every 1-5 seconds, but it can take up to 30 seconds on an extremely busy system. Once the data is flushed from memory to disk, the system will use the disk pool to store the ZIL for the next transaction group.
This window is the only time in a 7000 series where there is a
chance for data loss. We mitigate this risk by mirroring the Write
Optimized SSD's in the system.
ZFS performance on asynchronous writes is good, and SSD is not required
in these configurations (although it will help improve performance and
is recommended). However, in configurations that require synchronous
writes (many iSCSI configurations, NFS with O_DSYNC, etc.), Write SSD is
almost mandatory.
Write SSD Sizing Rules of Thumb:
-Each device supports about 9000-10000 Write IOPS (Sequential
writes stream directly to disk for better performance)
-If devices are mirrored, they only count for 1x Write IOPS (i.e. two
devices at 9000 IOPS each when mirrored together support 9000 IOPS total)
-If aiming to support No Single Point of Failure configurations, more trays with fewer SSDs
per tray will have higher usable capacities. Clusters will only allow
SSDs in pairs.
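The first two rules of thumb reduce to simple arithmetic, sketched below. The function name is hypothetical, the 9000 IOPS default is the conservative end of the range quoted above, and the idea that several mirrored pairs add up linearly is my assumption rather than something stated in the rules.

```python
def log_write_iops(devices: int, mirrored: bool, per_device_iops: int = 9000) -> int:
    """Estimate write IOPS for a set of Write SSDs (rule-of-thumb sketch)."""
    if mirrored:
        if devices % 2:
            raise ValueError("mirrored log devices are counted in pairs")
        # Each mirrored pair only counts as one device's worth of IOPS.
        return (devices // 2) * per_device_iops
    return devices * per_device_iops


print(log_write_iops(2, mirrored=True))    # 9000
print(log_write_iops(2, mirrored=False))   # 18000
```

In other words, mirroring buys protection for the log at the cost of half the write IOPS the same devices would otherwise provide.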
EOF
Posted at 01:54AM Jan 11, 2009 by Ryan Matthews in 7000Series | Comments[0] | permalink
Blogging about Wikis about Blogs...
Since we launched the 7000 Series and revealed Fishworks to the world, there have been a lot of questions both inside (via our overflowing internal mailing list) and outside Sun. I've personally been involved with presenting the product dozens of times to staff and customers, and usually feel like a broken record as I share the answers to the usual questions, so I decided to create a central repository of information on the Fishworks wiki to help share the answers more freely.
The first component of this was an FAQ. It works, and contains a lot of useful stuff, but I was finding that the answers to many questions were actually in blogs written by the Fishworks team, and various other engineers within and outside Sun. The next step was to create a central repository of links to these useful blog entries where people could go to find answers to their questions. I finished creating it the other day and have been rearranging and adding entries. It's not perfect, and it may never be, but now is the time to reveal FishBlog Central, and I thought what better way to reveal a wiki about blogs than to blog about it.
If you have some information or a link to add to either the FAQ or FishBlog Central, please share it so that it can be shared with the world.
EOF
Posted at 12:05PM Jan 09, 2009 by Ryan Matthews in Fishworks | Comments[0] | permalink