Sun is coming out today with a line of Sun Storage 7000 systems that
use ZFS as the integrated volume and filesystem manager, together with
both read-optimized and write-optimized SSDs. What is this Hybrid
Storage Pool, and why is it a good performance architecture for
storage?
A write-optimized SSD is a custom-designed device whose purpose is to
accelerate operations of the ZFS intent log (ZIL). The ZIL is the
part of ZFS that manages the important synchronous operations,
guaranteeing that such writes are acknowledged quickly to applications
while also guaranteeing persistence in case of an outage. Data stored
in the ZIL is also kept in memory until ZFS issues the next transaction
group (TXG), which happens every few seconds.
The ZIL is what stores data urgently (when an application is waiting)
but the TXG is what stores data permanently. The ZIL on-disk blocks are
only ever re-read after a failure such as a power outage. So the SSDs
that are used to accelerate the ZIL are write-optimized: they need to
handle data at low latency on writes; read performance is unimportant.
The TXG is an operation that is asynchronous to applications: apps
are generally not waiting for transaction groups to commit. The
exception is when data is generated at a rate that exceeds the
TXG rate for a sustained period of time. In this case, we become
throttled by the pool throughput. In NAS storage this will rarely
happen, since network connectivity even at GB/s is still much less than
what storage is capable of, and so we do not generate the imbalance.
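
The throttling condition can be sketched as a toy model. This is only an
illustration, not how ZFS computes anything: the function names and all
the rates below are assumptions picked to show the two regimes.

```python
def backlog_after(ingest_mb_s, pool_mb_s, seconds):
    """Dirty data (MB) piling up in memory after `seconds` of sustained load."""
    surplus = max(0.0, ingest_mb_s - pool_mb_s)
    return surplus * seconds


def app_visible_rate(ingest_mb_s, pool_mb_s):
    """Under sustained load, writers proceed at the slower of the two rates."""
    return min(ingest_mb_s, pool_mb_s)


# Assumed figures: the pool sustains 500 MB/s; the network delivers
# either 110 MB/s (ingest below the pool rate) or 800 MB/s (above it).
print(app_visible_rate(110, 500))   # network-bound: no throttling
print(app_visible_rate(800, 500))   # pool-bound: writers are throttled
print(backlog_after(800, 500, 10))  # MB that could not be committed in 10 s
```

As long as the ingest rate stays below the pool rate, the backlog is zero
and applications never wait on a transaction group.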
The important thing now is that in a NAS server, the controller is also
running a file level protocol (NFS or CIFS) and so is knowledgeable
about the nature (synchronous or not) of the requested writes. As such
it can use the accelerated path (the SSD) only for the necessary
component of the workloads. Less competition for these devices means
we can deliver both high throughput and low latency together in the
same consolidated server.
But here is where it gets nifty. At times, a NAS server might
receive a huge synchronous request. We've observed this, for instance,
due to fsflush running on clients, which will turn non-synchronous
writes into a massive synchronous one. I note here that one way to
reduce this effect is to tune up fsflush (its autoup interval, to say
600). This is commonly done to reduce the CPU usage of fsflush but will
also be welcome in the case of clients interacting with NAS storage. We
can also disable page flushing entirely by setting dopageflush to 0.
But that is a client issue. From the perspective of the server, we
still need, as a NAS, to manage large commit requests.
When subject to such a workload, say a 1GB commit, ZFS, being fully
aware of the situation, can decide to bypass the SSD device and issue
requests straight to disk-based pool blocks. It would do so for two
reasons. One is that the pool of disks in its entirety has more
throughput capability than the few write-optimized SSDs, and so we will
service this request faster. But more importantly, the value of the
SSD is in its latency reduction. Leaving the SSDs available to
service many low-latency synchronous writes is considered valuable
here. Another way to say this is that large writes are generally well
served by regular disk operations (they are throughput bound) whereas
small synchronous writes (latency bound) can and will get help from
the SSDs.
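
The routing decision described above can be sketched as follows. This is
a simplified illustration and not the ZFS implementation: the function
name, the returned labels and the 1 MiB cutoff are all hypothetical.

```python
# Hypothetical cutoff, chosen only for illustration.
LARGE_COMMIT_BYTES = 1 << 20  # 1 MiB


def choose_target(commit_bytes, synchronous, cutoff=LARGE_COMMIT_BYTES):
    """Pick where a write lands, following the reasoning in the text."""
    if not synchronous:
        # Nobody is waiting: the data stays in memory until the next TXG.
        return "txg"
    if commit_bytes > cutoff:
        # Large synchronous commit: throughput bound, so use the whole
        # pool of disks and leave the SSD free for small writes.
        return "pool-disks"
    # Small synchronous write: latency bound, so use the log SSD.
    return "log-ssd"


print(choose_target(8 * 1024, synchronous=True))   # log-ssd
print(choose_target(1 << 30, synchronous=True))    # pool-disks
print(choose_target(1 << 30, synchronous=False))   # txg
```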
Caches at work
On the read path we also have custom-designed read-optimized
SSDs to fit in these OpenStorage platforms. At Sun, we believe
that many workloads naturally lend themselves to caching technologies.
In a consolidated storage solution, we can offer up to 128GB of primary
memory-based caching and approximately 500GB of SSD-based caching.
We also recognized that the latency delta between a memory-cached
response and a disk response was just too steep. By inserting a layer
of SSD between memory and disk, we have an intermediate step
providing lower-latency access than disk to a working set which is
now many times larger than memory.
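
To see why the intermediate layer matters, here is a small worked
calculation of average read latency across the three tiers. The
per-tier latencies and hit rates are assumed values for illustration
only, not measurements of these systems.

```python
def mean_read_latency_ms(hit_ram, hit_ssd,
                         ram_ms=0.05, ssd_ms=0.5, disk_ms=10.0):
    """Average latency when reads are served by RAM, then SSD, then disk.

    hit_ram and hit_ssd are fractions of all reads served by each cache
    tier; whatever is left goes to disk. All latencies are assumptions.
    """
    miss = 1.0 - hit_ram - hit_ssd
    if miss < 0:
        raise ValueError("hit rates must not sum to more than 1")
    return hit_ram * ram_ms + hit_ssd * ssd_ms + miss * disk_ms


# Same memory hit rate, without and with an SSD tier absorbing 40% of reads.
print(mean_read_latency_ms(0.5, 0.0))
print(mean_read_latency_ms(0.5, 0.4))
```

With these assumed numbers the average is dominated by the fraction of
reads that still reach disk, which is what the SSD layer shrinks.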
It's important here to understand how and when these read-optimized
SSDs will work. The first thing to recognize is that the SSDs
have to be primed with data. They feed off data being evicted
from the primary caches, so their effect will not be seen immediately
at the start of a benchmark. Second, the value of a read-optimized
SSD is truly in low-latency responses to small requests. Small requests
here means things of the order of 8K in size. Such requests will occur
either when dealing with small files (~8K) or when dealing with larger
files accessed by a fixed-record application, typically a database. For
those applications it is customary to set the recordsize, and this will
allow the new SSDs to become more effective.
Our read-optimized SSD can service up to 3000 read IOPS (see Brendan's
work on the L2 ARC), and this is close to or better than what a JBOD of
24 x 7.2K RPM disks can do. But the
key point is that the low-latency response means it can do so using
far fewer threads than would be necessary to reach the same level on
a JBOD. Brendan demonstrated here that the response time of these
devices can be 20 times faster than disks, and 8 to 10 times faster
from the client's perspective. So once data is installed in the
SSD, users will see their requests serviced much faster,
which means we are less likely to be subject to queuing delays.
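
The thread-count argument follows from Little's law (requests in flight
= throughput x response time). A rough sketch, assuming each thread
keeps one read outstanding: 3000 IOPS spread over 24 disks is 125 IOPS
per disk, or about 8 ms per read, and a device 20 times faster would
answer in about 0.4 ms. These service times are derived from the figures
above, not measured.

```python
def threads_needed(target_iops, latency_ms):
    """Little's law: concurrency = throughput * response time.

    Assumes every thread issues one read at a time and waits for it.
    """
    return target_iops * latency_ms / 1000.0


disk_ms = 1000.0 / (3000 / 24)  # 125 IOPS per disk -> 8 ms per read
ssd_ms = disk_ms / 20           # "20 times faster than disks"

print(threads_needed(3000, disk_ms))  # concurrency needed on the JBOD
print(threads_needed(3000, ssd_ms))   # concurrency needed on the SSD
```

Under these assumptions the JBOD needs on the order of two dozen
concurrent readers to reach 3000 IOPS, while the SSD gets there with one
or two, which is why lightly threaded workloads benefit the most.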
The use of read-optimized SSDs is configurable in the Appliance. Users
should learn to identify the parts of their datasets that end up gated
by lightly threaded read response time. For those workloads, enabling
the secondary cache is one way to deliver the value of the
read-optimized SSD. For those filesystems, if the workload contains
small files (such as 8K) there is no need to tune anything; however,
for large files accessed in small chunks, setting the filesystem
recordsize to 8K is likely to produce the best response time.
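
One way to see the recordsize effect is to count how much data has to be
moved to satisfy a small read. The sketch below assumes that whole
records are read and cached, and compares the default 128K recordsize
with 8K; the function names are made up for this illustration.

```python
def records_touched(offset, length, recordsize):
    """Number of records that a read of `length` bytes at `offset` spans."""
    first = offset // recordsize
    last = (offset + length - 1) // recordsize
    return last - first + 1


def amplification(offset, length, recordsize):
    """Bytes fetched divided by bytes requested, assuming whole-record I/O."""
    return records_touched(offset, length, recordsize) * recordsize / length


K = 1024
print(amplification(0, 8 * K, 128 * K))  # 8K read from a 128K record
print(amplification(0, 8 * K, 8 * K))    # 8K read from an 8K record
```

With matching record and request sizes, each cached record holds exactly
what the application asks for, so the same amount of SSD covers a larger
useful working set.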
Another benefit of these SSDs will be in the $/IOPS case. Some
workloads are just IOPS hungry while not necessarily huge block
consumers. SSD technology offers great advantages in this space,
where a single SSD can deliver the IOPS of a full JBOD at a fraction
of the cost. So for workloads that are more modestly sized but IOPS
hungry, a test drive of the SSD will be very interesting.
It's also important to recognize that these systems are used in
consolidation scenarios. It can be that some part of the applications
will be sped up by read or write optimized SSD, or by the large memory
based caches while other consolidated workloads can exercise other
components.
There is another interesting implication of using SSDs in the
storage with regard to clustering. The read-optimized SSDs, acting as
caching layers, never actually contain critical data. This means those
SSDs can go into disk slots of the head nodes, since there is no data
to be failed over. On the other hand, write-optimized SSDs store data
associated with the critical synchronous writes. But since those are
located in dual-ported backend enclosures, not the head nodes, it
implies that, during clustered operations, storage head nodes do not
have to exchange any user-level data.
So by using ZFS with read- and write-optimized SSDs, we can deliver low
latency writes for applications that rely on them, and good throughput
for both the synchronous and non-synchronous cases using cost-effective
SATA drives. Similarly on the read side, the large amount of primary
and secondary cache enables delivering high IOPS at low latency (even
if the workload is not highly threaded), and it can do so using the
more cost- and energy-efficient SATA drives.
Our architecture allows us to take advantage of the latency
accelerators while never being gated by them.
Having L2ARCs (Readzillas) on a shared SSD has an advantage: when you do a failover, the other node could re-use the L2ARC with all the data which is already there, possibly providing better performance after failover while main memory is being filled up. IIRC there is an RFE to implement it. Thanks to checksumming it is safe to do so.
Imagine a MySQL database which is smaller than, let's say, 60GB of SSD: if basically the entire database has been cached on the L2ARC, after failover the new node won't basically even have to touch the physical drives...
Posted by Robert Milkowski on November 12, 2008 at 05:20 PM MET
Hi,
Thanks for this interesting description of performance improvements!
But what is the use if the data is lost because the server is dead for any reason?
What I really have to guarantee is that the data is synchronously stored in another building on the network.
But I can't find an efficient way to do a ZFS mirror on both local disks and 1 or 2 remote iSCSI targets. I may have 1G or 10G links, so no real bandwidth problem, but the network is more prone to small cuts.
So I would prefer that the server use the remote iSCSI targets for writes and "management/checking" reads, but not for "production" reads.
Can I do it today with current ZFS?
If not, when will it be available?
Best regards!
Posted by Bernard Dugas on November 25, 2008 at 07:42 PM MET