
Wednesday May 14, 2008
The new ZFS write throttle
A very significant improvement is coming soon to ZFS. A change that
will increase the general quality of service delivered by ZFS.
Interestingly it's a change that might also slow down your
microbenchmark but nevertheless it's a change you should be eager for.
Write throttling
For a filesystem, write throttling designates the act of blocking
an application for some amount of time, as short as possible, while waiting for
the proper conditions to allow the write system calls to succeed.
Write throttling is normally required because applications can write
to memory (dirty memory pages) at a rate significantly faster than the
kernel can flush the data to disk. Many workloads dirty memory pages
by writing to the filesystem page cache at near memory copy speed,
possibly using multiple threads issuing high rates of filesystem
writes. Concurrently, the filesystem is doing its best to drain all
that data to the disk subsystem.
Given the constraints, the time to empty the filesystem cache to disk
can be longer than the time required for applications to dirty the
cache. Even if one considers storage with fast NVRAM, under sustained
load, that NVRAM will fill up to a point where it needs to wait for a
slow disk I/O to make room for more data to get in.
When committing data to a filesystem in bursts, it can be quite
desirable to push the data at memory speed and then drain the cache to
disk during the lapses between bursts. But when data is generated at
a sustained high rate, lack of throttling leads to total memory
depletion. We thus need at some point to try and match the application
data rate with that of the I/O subsystem. This is the primary goal of
write throttling.
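To make the burst versus sustained distinction concrete, here is a minimal sketch of a toy model, not anything taken from the ZFS code. The function name, the rates and the one-second time step are all assumptions picked for illustration.

```python
def simulate_dirty_memory(write_rates, drain_rate):
    """Return the dirty-memory level (MB) at the end of each 1-second step.

    write_rates: MB dirtied by the application during each step
    drain_rate:  MB the storage can flush during each step
    """
    dirty = 0
    levels = []
    for rate in write_rates:
        dirty += rate                     # application dirties memory
        dirty -= min(dirty, drain_rate)   # storage drains what it can
        levels.append(dirty)
    return levels


# Bursty load: 2 seconds at 500 MB/s, then 8 idle seconds, repeated 3 times.
bursty = ([500] * 2 + [0] * 8) * 3
# Sustained load: 500 MB/s for the same 30 seconds.
sustained = [500] * 30

bursty_levels = simulate_dirty_memory(bursty, drain_rate=100)
sustained_levels = simulate_dirty_memory(sustained, drain_rate=100)

print("bursty:    peak", max(bursty_levels), "MB, final", bursty_levels[-1], "MB")
print("sustained: peak", max(sustained_levels), "MB, final", sustained_levels[-1], "MB")
```

In this toy model the bursty load peaks at 800 MB and is fully drained before the next burst, while the sustained load keeps growing by 400 MB every second until something stops it.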
A secondary goal of write throttling is to prevent massive data loss.
When applications do not manage I/O synchronization (i.e. don't use
O_DSYNC and fsync), data ends up cached in the filesystem and the
contract is that there is no guarantee that the data will still be
there if a system crash were to occur. So even if the filesystem
cannot be blamed for such data loss, it is still a nice feature to
help prevent such massive losses.
Case in point : UFS Write throttling
For instance UFS would use the fsflush daemon to try to keep data
exposed for no more than 30 seconds (default value of autoup). Also,
UFS would keep track of the amount of I/O outstanding for each
file. Once too much I/O was pending, UFS would throttle writers for
that file. This was controlled through ufs_HW, ufs_LW and their
values were commonly tuned (a bad sign). Eventually the old default
values were updated and seem to work nicely today. UFS write
throttling thus operates on a per-file basis. While there are some
merits to this approach, it can be defeated as it does not manage the
imbalance between memory and disks at a system level.
ZFS Previous write throttling
ZFS is designed around the concept of transaction groups (txg).
Normally, every 5 seconds an _open_ txg goes to the quiesced
state. From that state the quiesced txg will go to the syncing state
which sends dirty data to the I/O subsystem. For each pool, there is
at most 1 txg in each of the 3 states: open, quiescing, syncing. Write
throttling used to occur when the 5 second txg clock would fire while
the syncing txg had not yet completed. The open group would wait on
the quiesced one which waits on the syncing one. Application writers
(write system call) would block, possibly a few seconds, waiting for a
txg to open. In other words, if a txg took more than 5 seconds to
sync to disk, we would globally block writers thus matching their
speed with that of the I/O. But if a workload had a bursty write
behavior that could be synced during the allotted 5 seconds,
applications would never be throttled.
The Issue
But ZFS did not sufficiently control the amount of data that could
get into an open txg. As long as the ARC cache was no more than half
dirty, ZFS would accept data. For a large memory machine or one with
weak storage, this was likely to cause long txg sync times. The
downsides were many :
- if we did end up throttled, long sync times meant the system
behavior would be sluggish for seconds at a time.
- long txg sync times also meant that our granularity at which
we could generate snapshots would be impacted.
- we ended up with lots of pending data in the cache all of
which could be lost in the event of a crash.
- the ZFS I/O scheduler which prioritizes operations was also
negatively impacted.
- By not throttling we had the possibility that
sequential writes on large files could displace from the ARC
a very large number of smaller objects. Refilling
that data meant very large number of disk I/Os.
Not throttling can paradoxically end up as very
costly for performance.
- the previous code could also, at times, not be issuing I/Os
to disk for seconds even though the workload was
critically dependent on storage speed.
- And foremost, lack of throttling depleted memory and prevented
ZFS from reacting to memory pressure.
That ZFS is considered a memory hog is most likely the result of
the previous throttling code. Once a proper solution is in place, it will
be interesting to see if we behave better on that front.
The Solutions
The new code keeps track of the amount of data accepted in a TXG and
the time it takes to sync. It dynamically adjusts that amount so that
each TXG sync takes about 5 seconds (txg_time variable). It also
clamps the limit to no more than 1/8th of physical memory.
And to avoid the system wide and seconds long throttle effect, the new
code will detect when we are dangerously close to that situation
(7/8th of the limit) and will insert 1 tick delays for applications
issuing writes. This prevents a write intensive thread from hogging
the available space starving out other threads. This delay should
also generally prevent the system wide throttle.
So the new steady state behavior of write intensive workloads is that,
starting with an empty TXG, all threads will be allowed to dirty
memory at full speed until a first threshold of bytes in the TXG is
reached. At that time, every write system call will be delayed by 1
tick thus significantly slowing down the pace of writes. If the
previous TXG completes its I/Os, then the current TXG will be
allowed to resume at full speed. But in the unlikely event that a
workload, despite the per write 1-tick delay, manages to fill up the
TXG up to the full threshold we will be forced to throttle all writes
in order to allow the storage to catch up.
It should make the system much better behaved and generally more
performant under sustained write stress.
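The behavior described above can be sketched as two small decisions. This is only an illustration under stated assumptions: the function names are hypothetical, and the proportional scaling used to size the next TXG is my guess at one way to "dynamically adjust" the amount, not a description of the actual ZFS code. The 5 second target, the 1/8th of physical memory clamp and the 7/8th delay threshold are the values quoted in the text.

```python
def next_txg_limit(bytes_synced, sync_seconds, physmem_bytes, target_seconds=5.0):
    """Estimate how many bytes the next TXG may accept (hypothetical helper).

    Assumption: scale the amount just synced so that a sync would take
    about target_seconds, then clamp to 1/8th of physical memory.
    """
    limit = int(bytes_synced * target_seconds / sync_seconds)
    return min(limit, physmem_bytes // 8)


def throttle_action(txg_dirty_bytes, txg_limit):
    """Decide what happens to a write given the fill level of the open TXG."""
    if txg_dirty_bytes >= txg_limit:
        return "block"      # full threshold: all writers wait for the storage
    if txg_dirty_bytes * 8 >= txg_limit * 7:
        return "delay"      # past 7/8th of the limit: 1 tick delay per write
    return "proceed"        # below the first threshold: full speed


GIB = 1 << 30
# A sync of 2 GiB took 10 seconds: aim for about half as much next time.
print(next_txg_limit(2 * GIB, 10.0, physmem_bytes=64 * GIB) // GIB, "GiB")
# On a 4 GiB machine the 1/8th clamp (512 MiB) takes over.
print(next_txg_limit(2 * GIB, 10.0, physmem_bytes=4 * GIB) // (1 << 20), "MiB")

for dirty in (100, 700, 800):
    print(dirty, throttle_action(dirty, txg_limit=800))
```

Writing the threshold test as an integer comparison avoids any rounding question at exactly 7/8th of the limit.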
If you are the owner of an unlucky workload that ends up slowed by more
throttling, do consider the other benefits that you get from the new
code. If that does not compensate for the loss, get in touch and tell
us what your needs are on that front.

Monday January 08, 2007
NFS and ZFS, a fine combination
No doubt there is still a lot to learn about ZFS as an NFS server and
this will not delve deeply into that topic. What I'd like to dispel
here is the notion that ZFS can cause some NFS workloads to exhibit
pathological performance characteristics.
The Sightings
Since there have been a few perceived 'sightings' of such slowdowns, a
little clarification is in order. Large slowdowns would
typically be reported when looking at a single threaded load, probably
doing small file creation such as 'tar xf many_small_files.tar'.
For instance, I've run a small such test over a 72G SAS drive.
tmpfs : 0.077 sec
ufs : 0.25 sec
zfs : 0.12 sec
nfs/zfs : 12 sec
There are a few things to observe here. Local filesystem services
have a huge advantage for this type of load: in the absence of
specific request by the application (e.g. tar), local filesystems can
lose your data and no one will complain. This is data loss, not data
corruption, and this generally accepted data loss will occur in the
event of a system crash. The argument being that if you need a higher
level of integrity, you need to program it in applications either
using O_DSYNC, fsync etc. Many applications are not that critical and
avoid such burden.
NFS and COMMIT
On the other hand, the nature of the NFS protocol is such that the
client _must_ at some specific point request to the server to place
previously sent data onto stable storage. This is done through an
NFSv3 or NFSv4 COMMIT operation. The COMMIT operation is a contract
between clients and servers that allows the client to forget about
its previous historical interaction with the file. In the event of a
server crash/reboot, the client is guaranteed that previously committed
data will be returned by the server. Operations since the last COMMIT
can be replayed after a server crash in a way that ensures a coherent
view between everybody involved.
But this all topples over if the COMMIT contract is not honored. If a
local filesystem does not properly commit data when requested to do
so, there is no more guarantee that the client's view of files will be
what it would otherwise normally expect. Despite the fact that the
client has completed the 'tar x' with no errors, it can happen
that some of the files are missing in full or in part.
With local filesystems, a system crash is plainly obvious to users and
requires applications to be restarted. With NFS, a server crash is
not obvious to users of the service (the only sign being a lengthy
pause), and applications are not notified. The fact that files or
parts of files may go missing in the absence of errors can be
considered as plain corruption of the client's side view.
When the underlying filesystem serving NFS ignores COMMIT requests, or
when the storage subsystem acknowledges I/Os before they reach stable
storage, what is potential data loss on the server becomes
corruption from the client's point of view.
It turns out that in NFSv3/NFSv4 the client will request a COMMIT on
close; moreover, the NFS server itself is required to commit on
meta-data operations; for NFSv3 that is on:
SETATTR, CREATE, MKDIR, SYMLINK, MKNOD,
REMOVE, RMDIR, RENAME, LINK
and a COMMIT may be required on the containing directory.
Expected Performance
Let's imagine we find a way to run our load at 1 COMMIT (on close) per
extracted file. The COMMIT means the client must wait for at least a
full I/O latency and since 'tar x' processes the tar file from a
single thread, that implies that we can run our workload at the
maximum rate (assuming infinitely fast networking) of one extracted file
per I/O latency or about 200 extracted files per second (on modern
disks). If the files to be extracted are 1K in average size, the tar
x will proceed at a pace of 200K/sec. If we are required to issue 2
COMMIT operations per extracted file (for instance due to a
server-side COMMIT on file create), that would further halve that
throughput number.
However, if we had lots of threads extracting individual files
concurrently the performance would scale up nicely with the number of
threads.
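The back-of-the-envelope numbers above can be reproduced with a few lines. This is a rough sketch: the function name is made up for the occasion, the 200 I/Os per second figure is the one assumed in the text (roughly 5 ms per I/O), and the linear scaling with threads is an idealization that only holds until the storage itself saturates.

```python
def files_per_second(disk_iops, commits_per_file, threads=1):
    """Upper bound on extracted files per second.

    Each COMMIT costs one full I/O latency, so a single thread completes
    at most disk_iops / commits_per_file files per second. Multiplying by
    threads assumes the storage is not yet saturated.
    """
    return threads * disk_iops / commits_per_file


DISK_IOPS = 200      # about 200 I/Os per second, as assumed in the text
FILE_SIZE_KB = 1     # average size of an extracted file

for commits in (1, 2):
    rate = files_per_second(DISK_IOPS, commits)
    print(f"{commits} COMMIT(s) per file: {rate:.0f} files/sec, "
          f"{rate * FILE_SIZE_KB:.0f} KB/sec")

rate = files_per_second(DISK_IOPS, 1, threads=8)
print(f"8 threads, 1 COMMIT per file: {rate:.0f} files/sec")
```

With one COMMIT per file this gives the 200 files/sec (200K/sec) quoted above, and half of that with two COMMITs per file.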
But tar is single threaded, so what is actually going on here ? The
need to COMMIT frequently means that our thread must frequently pause
for a full server side I/O latency. Because our single threaded tar is
blocked, nothing is able to process the rest of our workload. If we
allow the server to ignore COMMIT operations, then NFS responses will
be sent earlier allowing the single thread to proceed down the tar
file at greater speed. One must realise that the extra performance is
obtained at the risk of causing corruption from the client's point of
view in the event of a crash.
Whether or not the client or the server needs to COMMIT as often as it
does is a separate issue. The existence of other clients that would be
accessing the files needs to be considered in that discussion. The
point being made here is that this issue is not particular to ZFS, nor
does ZFS necessarily exacerbate the problem. The performance of single
threaded writes to NFS will be throttled as a result of the
NFS-imposed COMMIT semantics.
ZFS Relevant Controls
ZFS has two controls that come into this picture. The disk write
caches and the zil_disable tunable. ZFS is designed to work correctly
whether or not the disk write caches are enabled. This is achieved
through explicit cache flush requests, which are generated (for
example) in response to an NFS COMMIT. Enabling the write caches is
then a performance consideration, and can offer performance gains for
some workloads. This is not the same with UFS which is not aware of
the existence of a disk write cache and is not designed to operate
with such cache enabled. Running UFS on a disk with write cache
enabled can lead to corruption of the client's view in the event of a
system crash.
ZFS also has the zil_disable control. ZFS is not designed to operate
with zil_disable set to 1. Setting this variable (before mounting a
ZFS filesystem) means that O_DSYNC writes, fsync as well as NFS COMMIT
operations are all ignored! We note that, even without a ZIL, ZFS will
always maintain a coherent local view of the on-disk state. But by
ignoring NFS COMMIT operations, it will cause the client's view to
become corrupted (as defined above).
Comparison with UFS
In the original complaint, there was no comparison between a
semantically correct NFS service delivered by ZFS to another
similar NFS service delivered by another filesystem. Let's gather
some more data:
Local and memory based filesystems :
tmpfs : 0.077 sec
ufs : 0.25 sec
zfs : 0.12 sec
NFS service with risk of corruption of client's side view :
nfs/ufs : 7 sec (write cache enable)
nfs/zfs : 4.2 sec (write cache enable,zil_disable=1)
nfs/zfs : 4.7 sec (write cache disable,zil_disable=1)
Semantically correct NFS service :
nfs/ufs : 17 sec (write cache disable)
nfs/zfs : 12 sec (write cache disable,zil_disable=0)
nfs/zfs : 7 sec (write cache enable,zil_disable=0)
We note that with most filesystems we can easily produce an
improper NFS service by enabling the disk write caches. In this
case, a server-side filesystem may think it has committed data to
stable storage but the presence of an enabled disk write cache causes this
assumption to be false. With ZFS, enabling the write caches is not
sufficient to produce an improper service.
Disabling the ZIL (setting zil_disable to 1 using mdb and then
mounting the filesystem) is one way to generate an improper NFS
service. With the ZIL disabled, commit requests are ignored, with
potential corruption of the client's view.
Intelligent Storage
A different topic is running ZFS on intelligent storage arrays.
One known pathology is that some arrays will _honor_ the ZFS
request to flush the write caches despite the fact that their caches
are qualified as stable storage. In this case, NFS performance will
be much, much worse than otherwise expected. On this topic and ways to
work around this specific issue, see Jason's .Plan: Shenanigans with ZFS.
Conclusion
In many common circumstances, ZFS offers a fine NFS service that
complies with all NFS semantics even with write caches enabled.
If another filesystem appears much faster, I suggest first making sure
that this other filesystem complies in the same way.
This is not to say that ZFS performance cannot be perfected as clearly
it can. The performance of ZFS is still evolving quite rapidly. In
many situations, ZFS provides the highest throughput of any
filesystem. In others, ZFS performance is highly competitive with
other filesystems. In some cases, ZFS can be slower than other
filesystems -- while in all cases providing end-to-end data integrity,
ease of use and integrated services such as compression, snapshots
etc.
See also Eric's fine entry on zil_disable.

Friday September 22, 2006
ZFS and OLTP
ZFS and Databases
Given that we had started to gain enough understanding of the internal
dynamics of ZFS,
I figured it was time to tackle the next hurdle: running a database
management system (DBMS). Now I know very little myself about DBMS,
so I teamed up with people that have tons of experience with it, my
colleagues from Performance Engineering (PAE), Neelakanth (Neel) Nagdir and
Sriram Gummuluru, getting occasional words of wisdom from Jim Mauro as
well.
Note that UFS (with DIO) has been heavily tuned over the years to
provide very good support for DBMS. We are just beginning to explore
the tweaks and tunings necessary to achieve comparable performance
from ZFS in this specialized domain.
We knew that running a DBMS would be a challenge since a database
tickles filesystems in ways that are quite different from other types
of loads. We had 2 goals. Primarily, we needed to understand how
ZFS performs in a DB environment and in what specific area it needs to
improve. Secondly, we figured that whatever would come out of the
work, could be used as blog-material, as well as best practice
recommendations. You're reading the blog material now; also watch this
space for Best Practise updates.
Note that it was not a goal of this exercise to generate data for a world record press-release. (There is always a metric where this can be achieved.)
Workload
The workload we use in PAE to characterize DBMSes is called OLTP/Net.
This benchmark was developed inside Sun for the purpose of
engineering performance into DBMS. Modeled on common transaction
processing benchmarks, it is OLTP-like but with a higher network-to-disk ratio. This makes it more representative of real-world applications. Quoting from Neel's prose:
"OLTP/Net, the New-Order transaction involves multi-hops as it
performs Item validation, and inserts a single item per hop as
opposed to block updates "
I hope that means something to you; Neel will be blogging on his own,
if you need more info.
Reference Point
The reference performance point for this work would be UFS (with VxFS
being also an interesting data point, but I'm not tasked with
improving that metric). For DB loads we know that UFS directio (DIO)
provides a significant performance boost and that would be our target
as well.
Platform & Configuration
Our platform was a Niagara T2000 (8 cores @ 1.2 GHz, 4 HW threads or
strands per core) with 130 x 36GB disks attached in JBOD
fashion. Each disk was partitioned in 2 equal slices, with half of
the surface given to a Solaris Volume Manager (SVM) volume onto which UFS
would be built and the other half given to a ZFS pool.
The benchmark was designed to not fully saturate either the CPU or the disks. While we know that performance varies between inner & outer disk surface, we don't expect the effect to be large enough to require attention here.
Write Cache Enabled (WCE)
ZFS is designed to work safely, whether or not a disk write-cache is enabled (WCE). This stays true if ZFS is operating on a disk slice. However,
when given a full disk, ZFS will turn _ON_ the write cache as part of
the import sequence. That is, it won't enable write cache when given only a
slice. So, to be fair to ZFS capabilities we manually turned ON WCE when
running our test over ZFS.
UFS is not designed to work with WCE and will put data at risk if WCE
is set, so we needed to turn it off for the UFS runs. We needed to do
this to get around the fact that we did not have enough disks to
give each filesystem full disks of its own. Therefore the performance we measured is what
would be expected when giving full disks to either filesystem. We note
that, for the FC devices we used, WCE does not provide ZFS a
significant performance boost on this setup.
No Redundancy
For this initial effort we also did not configure any form of
redundancy for either filesystem. ZFS RAID-Z does not really have an
equivalent feature in UFS and so we settled on a simple stripe. We could
eventually configure software mirroring on both filesystems, but we
don't expect that will change our conclusions. But still this will be
interesting in follow-up work.
DBMS logging
Another thing we know already is that a DBMS's log writer latency is
critical to OLTP performance. So in order to improve on that metric,
it's good practice to set aside a number of disks for the DBMS's
logs. So with this in hand, we managed to run our benchmark and get our
target performance number (in relative terms, higher is better):
UFS/DIO/SVM : 42.5
Separate Data/log volumes
Recordsize
OK, so now we're ready. We load up Solaris 10 Update 2 (S10U2/ZFS),
build a log pool and a
data pool and get going. Note that log writers actually generate a pattern
of sequential I/O of varying sizes. That should map quite well with
ZFS out of the box. But for the DBMS' data pool, we expect a very
random pattern of reads and writes to DB records. A commonly known ZFS
best practice when servicing fixed record access is to match the ZFS
recordsize property to the record size of the application. We note that UFS, by
chance or by design, also works (at least on sparc) using 8K records.
2nd run ZFS/S10U2
So for a fair comparison, we set the recordsize to 8K for the data
pool and run our OLTP/Net and....gasp!:
ZFS/S10U2 : 11.0
Data pool (8K record on FS)
Log pool (no tuning)
So that's no good and we have our work cut out for us.
The role of Prefetch in this result
To some extent we already knew of a subsystem that commonly misbehaves (which is being fixed as we speak), the vdev level prefetch code (that I also
refer to as the software track buffer). In this code, whenever ZFS
issues a small read I/O to a device, it will, by default, go and fetch
quite a sizable chunk of data (64K) located at the physical location
being read. In itself, this should not increase the I/O latency which
is dominated by the head-seek and since the data is stored in a small
fixed sized buffer we don't expect this is eating up too much memory
either. However in a heavy-duty environment like we have here, every
extra byte that moves up or down the data channel occupies valuable
space. Moreover, for a large DB, we really don't expect the
speculatively read data to be used very much. So for our next attempt
we'll tune down the prefetch buffer to 8K.
And the role of the vq_max_pending parameter
But we don't expect this to be quite sufficient here. My DBMS savvy
friends would tell me that the I/O latency of reads was quite large in
our runs. Now ZFS prioritizes reads over writes and so we thought we
should be ok. However during a pool transaction group sync, ZFS will
issue quite a number of concurrent writes to each device. This is the
vq_max_pending parameter which defaults to 35. Clearly during this
phase the read latency even if prioritized will take a somewhat longer
time to complete.
3rd run, ZFS/S10U2 - tuned
So I wrote up a script to tune those 2 ZFS knobs. We could then run
with a vdev prefetch buffer of 8K and a vq_max_pending of 10. This
boosted our performance almost 2X:
ZFS/S10U2 : 22.0
Data pool (8K record on FS)
Log pool (no tuning)
vq_max_pending : 10
vdev prefetch : 8K
But not quite satisfying yet.
ZFS/S10U2 known bug
We know of something else about ZFS. In the last few builds before
S10U2, a little bug made its way into the code base. The effect of
this bug was that for a full record rewrite, ZFS would actually input
the old block even though the data is actually not needed at all.
Shouldn't be too bad, perfectly aligned block rewrites of uncached
data are not that common....except for databases, bummer.
So S10U2 is plagued with this issue affecting DB performance with no
workaround. So our next step was to move on to the latest ZFS bits.
4th run ZFS/Build 44
Build 44 of our next Solaris version has long had this particular
issue fixed. There we topped our past performance with:
ZFS/B44 : 33.0
Data pool (8K record on FS)
Log pool (no tuning)
vq_max_pending : 10
vdev prefetch : 8K
As we compare to umpty-years of super tuned UFS:
UFS/DIO/SVM : 42.5
Separate Data/log volumes
Summary
I think at this stage of ZFS, the results are neither great nor
bad. We have achieved:
UFS/DIO : 100 %
UFS : xx no directio (to be updated)
ZFS Best : 75% best tuned config with latest bits.
ZFS S10U2 : 50% best tuned config.
ZFS S10U2 : 25% simple tuning.
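For reference, the percentages in this summary are rounded figures. Here is a quick sketch that recomputes them from the scores reported in the runs above; the helper name and the run labels are mine, the scores are the ones shown earlier.

```python
def relative_score(score, baseline):
    """Express a benchmark score as a percentage of the baseline score."""
    return 100.0 * score / baseline


UFS_DIO = 42.5   # UFS/DIO/SVM reference run

runs = {
    "ZFS/S10U2, 8K recordsize only": 11.0,
    "ZFS/S10U2, tuned prefetch and vq_max_pending": 22.0,
    "ZFS/B44, tuned prefetch and vq_max_pending": 33.0,
}

for name, score in runs.items():
    print(f"{name}: {relative_score(score, UFS_DIO):.0f}% of UFS/DIO")
```

The exact ratios come out at about 26%, 52% and 78%, which the summary rounds to 25%, 50% and 75%.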
To achieve acceptable performance levels:
- The latest ZFS code base. ZFS improves fast these days. We
will need to keep tracking releases for a little while. The
current OpenSolaris release as well as the upcoming Solaris 10
Update 3 (this fall) should perform, for these tests, as well
as the Build 44 results shown here.
- 1 data pool and 1 log pool: it is common practice to partition HW
resources when we want proper isolation. Going forward I think
that we will eventually get to the point where this will not be
necessary but it seems an acceptable constraint for now.
- Tuned vdev prefetch: the code is being worked on. I expect
that in the near future this will not be necessary.
- Tuned vq_max_pending: that may take a little longer. In a DB
workload, latency is key and throughput secondary. There are
a number of ideas that need to be tested which will help ZFS
improve on both average latency as well as latency
fluctuations. This will help both the Intent log (O_DSYNC
write) latency as well as reads.
Parting Words
As those improvements come out, they may well allow ZFS to catch up with or
surpass our best UFS numbers. When you match that kind of performance
with all the usability and data integrity features of ZFS, that's a
proposition that becomes hard to pass up.

Wednesday May 31, 2006
WHEN TO (AND NOT TO) USE RAID-Z
RAID-Z is the technology used by ZFS to implement a data-protection scheme
which is less costly than mirroring in terms of block
overhead.
Here, I'd like to go over, from a theoretical standpoint, the
performance implication of using RAID-Z. The goal of this technology
is to allow a storage subsystem to be able to deliver the stored data
in the face of one or more disk failures. This is accomplished by
joining multiple disks into an N-way RAID-Z group. Multiple RAID-Z
groups can be dynamically striped to form a larger storage pool.
To store file data onto a RAID-Z group, ZFS will spread a filesystem
(FS) block onto the N devices that make up the group. So for each FS
block, (N - 1) devices will hold file data and 1 device will hold
parity information. This information would eventually be used to
reconstruct (or resilver) data in the face of any device failure. We
thus have 1 / N of the available disk blocks that are used to store
the parity information. A 10-disk RAID-Z group has 9/10th of the
blocks effectively available to applications.
A common alternative for data protection is the use of mirroring. In
this technology, a filesystem block is stored onto 2 (or more) mirror
copies. Here again, the system will survive a single disk failure (or
more with N-way mirroring). So a 2-way mirror actually delivers similar
data-protection at the expense of providing applications access to
only one half of the disk blocks.
Now let's look at this from the performance angle, in particular that
of delivered filesystem blocks per second (FSBPS). An N-way RAID-Z
group achieves its protection by spreading a ZFS block onto the N
underlying devices. That means that a single ZFS block I/O must be
converted to N device I/Os. To be more precise, in order to access a
ZFS block, we need N device I/Os for output and (N - 1) device I/Os for
input as the parity data need not generally be read in.
Now after a request for a ZFS block has been spread this way, the I/O
scheduling code will take control of all the device I/Os that need to
be issued. At this stage, the ZFS code is capable of aggregating
adjacent physical I/Os into fewer ones. Because of the ZFS
Copy-On-Write (COW) design, we actually do expect this reduction in
number of device level I/Os to work extremely well for just about any
write intensive workloads. We also expect it to help streaming input
loads significantly. The situation of random inputs is one that needs
special attention when considering RAID-Z.
Effectively, as a first approximation, an N-disk RAID-Z group will
behave as a single device in terms of delivered random input
IOPS. Thus a 10-disk group of devices each capable of 200-IOPS, will
globally act as a 200-IOPS capable RAID-Z group. This is the price to
pay to achieve proper data protection without the 2X block overhead
associated with mirroring.
With 2-way mirroring, each FS block output must be sent to 2 devices.
Half of the available IOPS are thus lost to mirroring. However, for
inputs each side of a mirror can service read calls independently from
one another since each side holds the full information. Given a
proper software implementation that balances the inputs between sides
of a mirror, the number of FS blocks delivered by a mirrored group is actually
no less than what a simple non-protected RAID-0 stripe would give.
So, looking at a random access input load and the number of FS blocks
per second (FSBPS) delivered, given N devices to be grouped either in RAID-Z, 2-way
mirrored or simply striped (a.k.a. RAID-0, no data protection!), the
equations would be (where dev represents the capacity in terms of
blocks or IOPS of a single device):
                                Random
          Blocks Available      FS Blocks / sec
          ----------------      ---------------
RAID-Z    (N - 1) * dev         1 * dev
Mirror    (N / 2) * dev         N * dev
Stripe    N * dev               N * dev
Now let's take 100 disks of 100 GB, each capable of 200 IOPS, and
look at different possible configurations. In the table below the
configuration labeled:
"Z 5 x (19+1)"
refers to a dynamic striping of 5 RAID-Z groups, each group made of 20
disks (19 data disk + 1 parity). M refers to a 2-way mirror and S to a
simple dynamic stripe.
                                     Random
Config           Blocks Available    FS Blocks / sec
-------------    ----------------    ---------------
Z  1 x (99+1)     9900 GB               200
Z  2 x (49+1)     9800 GB               400
Z  5 x (19+1)     9500 GB              1000
Z 10 x (9+1)      9000 GB              2000
Z 20 x (4+1)      8000 GB              4000
Z 33 x (2+1)      6600 GB              6600
M  2 x (50)       5000 GB             20000
S  1 x (100)     10000 GB             20000
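The rows of this table follow directly from the first-approximation model given above (one parity disk per RAID-Z group, each group delivering the random input IOPS of a single device). Here is a small sketch that recomputes them; the function names are mine and the defaults are the 100 GB, 200 IOPS disks of this example.

```python
def raidz_pool(groups, disks_per_group, disk_gb=100, disk_iops=200):
    """Usable GB and random-input FS blocks/sec for striped RAID-Z groups."""
    capacity = groups * (disks_per_group - 1) * disk_gb   # 1 parity disk per group
    fsbps = groups * disk_iops                            # a group acts as 1 device
    return capacity, fsbps


def mirror_pool(disks, disk_gb=100, disk_iops=200):
    """2-way mirroring: half the blocks, every disk can serve reads."""
    return (disks // 2) * disk_gb, disks * disk_iops


def stripe_pool(disks, disk_gb=100, disk_iops=200):
    """Plain dynamic stripe: all blocks, all IOPS, no protection."""
    return disks * disk_gb, disks * disk_iops


for groups, width in [(1, 100), (2, 50), (5, 20), (10, 10), (20, 5), (33, 3)]:
    gb, fsbps = raidz_pool(groups, width)
    print(f"Z {groups:2d} x ({width - 1}+1): {gb:5d} GB {fsbps:6d} FS blocks/sec")

print("M 2 x (50):  %5d GB %6d FS blocks/sec" % mirror_pool(100))
print("S 1 x (100): %5d GB %6d FS blocks/sec" % stripe_pool(100))
```

Note that the 33 x (2+1) configuration only uses 99 of the 100 disks.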
So RAID-Z gives you at most 2X the number of blocks that mirroring
provides but hits you with much fewer delivered IOPS. That means
that, as the number of devices in a group N increases, the expected
gain over mirroring (disk blocks) is bounded (to at most 2X) but the
expected cost in IOPS is not bounded (cost in the range of [N/2, N]
fewer IOPS).
Note that for wide RAID-Z configurations, ZFS takes into account the
sector size of devices (typically 512 Bytes) and dynamically adjusts
the effective number of columns in a stripe. So even if you request a
99+1 configuration, the actual data will probably be stored on much
fewer data columns than that. Hopefully this article will contribute
to steering deployments away from those types of configuration.
In conclusion, when preserving IOPS capacity is important, the size of
RAID-Z groups should be restrained to smaller sizes and one must
accept some level of disk block overhead.
When performance matters most, mirroring should be highly favored. If
mirroring is considered too costly but performance is nevertheless
required, one could proceed like this:
Given N devices each capable of X IOPS.
Given a target of delivered Y FS blocks per second
for the storage pool.
Build your storage using (Y / X) dynamically striped RAID-Z groups,
each made of N / (Y / X) devices.
For instance:
Given 50 devices each capable of 200 IOPS.
Given a target of delivered 1000 FS blocks per second
for the storage pool.
Build your storage using (1000 / 200) = 5 dynamically striped
RAID-Z groups of 50 / 5 = 10 devices.
In that system we then would have 10% block overhead lost to maintain
RAID-Z level parity.
RAID-Z is a great technology not only when disk blocks are your most
precious resources but also when your available IOPS far exceed your
expected needs. But beware that if you get your hands on fewer very
large disks, the IOPS capacity can easily become your most precious
resource. Under those conditions, mirroring should be strongly favored
or alternatively a dynamic stripe of RAID-Z groups each made up of a
small number of devices.

Tuesday, May 16, 2006
128K Suffice
I argue that a 128K I/O size is sufficient to extract the most out of
a disk, given enough concurrent I/Os.

Wednesday, November 16, 2005
ZFS to UFS Performance Comparison on Day 1
With special thanks to Chaoyue Xiong for her help in this work.
In this paper I'd like to review the performance data we have gathered
comparing this initial release of ZFS (Nov 16 2005) with the Solaris
legacy, optimized beyond reason, UFS filesystem. The data we will be
reviewing is based on 14 unit tests that were designed to stress some
specific usage patterns of filesystem operations. Working with these
well-contained usage scenarios greatly facilitates subsequent
performance engineering analysis.
Our focus was to make a fair head-to-head comparison between UFS and
ZFS, not to produce the biggest, meanest marketing numbers.
Since ZFS is also a Volume Manager, we actually compared ZFS to a
UFS/SVM combination. In cases where ZFS underperforms UFS, we wanted
to figure out why and how to improve ZFS.
We are currently focusing on data-intensive operations.
Metadata-intensive tests are being developed and we will report on
those in a later study.
Looking ahead to our results, we find that of our 12 filesystem unit
tests that were successfully run:
- ZFS outpaces UFS in 6 tests by a mean factor of 3.4
- UFS outpaces ZFS in 4 tests by a mean factor of 3.0
- ZFS equals UFS in 2 tests.
In this paper, we will be taking a closer look at the tests where UFS
is ahead and propose ways to improve those numbers.
THE SYSTEM UNDER TEST
Our testbed is a hefty V890 with 8 x 1200 MHz US-IV CPUs (16 cores). At
this point we are not yet monitoring the CPU utilization of the
different tests although we plan to do so in the future. The storage
is an insanely large 300-disk array; the disks are rather old
technology, small and slow 9 GB disks. None of the tests currently
stresses the array very much and the idea was mostly to take
the storage configuration out of the equation. Since we are working
with old-technology disks, the absolute throughput numbers are not
necessarily of interest; they are presented in an appendix.
Every disk in our configuration is partitioned into 2 slices and a
simple SVM or zpool striped volume is made across all spindles. We
then build a filesystem on top of the volume. All commands are run
with default parameters. Both filesystems are mounted and we can run
our test suite on either one.
Every test is rerun multiple times in succession; the tests are
defined and developed to avoid variability between instances. Some of
the current test definitions require that file data not be present in
the filesystem cache. Since we currently do not have a convenient way
to control this for ZFS, the results for those tests are omitted from
this report.
THE FILESYSTEM UNIT TESTS
Here are the definitions of the 14 data-intensive tests we have
currently identified. Note that we are very open to new test
definitions; if you know of a data-intensive application that uses a
filesystem in a very different pattern (and there must be tons of
them), we would dearly like to hear from you.
Test 1
This is the simplest way to create a file; we open/creat a file then
issue 1MB writes until the filesize reaches 128 MB; we then close the file.
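Since the original C test code is not available, here is a minimal
Python sketch of what Test 1 describes, including a throughput figure
in MB/s. The function name, the defaults and the choice to time
open() through close() are assumptions made for illustration; the
demo at the bottom uses a much smaller file than the real test.

```python
import os
import tempfile
import time


def create_file(path, file_size=128 * 1024 * 1024, write_size=1024 * 1024):
    """Create `path` with sequential writes of `write_size` bytes; return MB/s."""
    buf = b"\0" * write_size
    written = 0
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        while written < file_size:
            # The last write is trimmed so the file ends at exactly file_size.
            written += os.write(fd, buf[: file_size - written])
    finally:
        os.close(fd)  # close() is inside the timed region (assumption)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return (written / (1024 * 1024)) / elapsed


# Small demo: a 4 MB file written with 1 MB writes in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    rate = create_file(os.path.join(tmp, "test1.dat"), file_size=4 * 1024 * 1024)
    print(f"{rate:.1f} MB/s")
```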
Test 2
In this test, we also create a new file, although here we work with a
file opened with the O_DSYNC flag. We work with 128K write system
calls. This maps to some database file creation schemes.
Test 3
This test also relates to file creation but with writes that are
much smaller and of varying sizes. In this test, we create a 50MB file
using writes of size picked randomly between [1K,8K]. The file is opened
with default flags (no O_*SYNC) but every 10 MB of written data we
issue an fsync() call for the whole file. This form of access can be
used for log files that have data integrity requirements.
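The access pattern of Test 3 can be sketched in Python as follows.
This is an illustration, not the original test code: the function
name, the seeded random generator and the rule "fsync() each time
another `sync_every` bytes have been written" are assumptions.

```python
import os
import random
import tempfile


def write_log_like(path, file_size=50 * 1024 * 1024,
                   sync_every=10 * 1024 * 1024,
                   min_write=1024, max_write=8 * 1024, seed=0):
    """Create a file with random-size writes, calling fsync() periodically."""
    rng = random.Random(seed)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    written = 0
    syncs = 0
    next_sync = sync_every
    try:
        while written < file_size:
            # Write size picked uniformly in [min_write, max_write],
            # trimmed at the end of the file.
            size = min(rng.randint(min_write, max_write), file_size - written)
            written += os.write(fd, b"x" * size)
            if written >= next_sync:
                os.fsync(fd)  # push all outstanding data to stable storage
                syncs += 1
                next_sync += sync_every
    finally:
        os.close(fd)
    return written, syncs


# Small demo: 100 KB file, fsync() every 20 KB.
with tempfile.TemporaryDirectory() as tmp:
    print(write_log_like(os.path.join(tmp, "test3.dat"),
                         file_size=100 * 1024, sync_every=20 * 1024))
```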
Test 4
Moving now to a read test; we read a 1 GB file (assumed in cache) with
32K read system calls. This is a rather simple test to keep everybody
honest.
Test 5
This is the same test as Test 4 but when the file is assumed not present
in the filesystem cache. We currently have no control over this on ZFS
and so we will not be reporting performance numbers for this test.
This is a basic streaming read sequence that should test the readahead
capacity of a filesystem.
Test 6
Our previous write tests were allocating writes. In this test we will
verify the ability of a filesystem to rewrite over an existing file.
We will look at 32K writes to a file opened with O_DSYNC.
Test 7
Here we also test the ability to rewrite existing files. The write sizes
are randomly picked in the [1K,8K] range. No special control over data
integrity (no O_*SYNC, no fsync()).
Test 8
In this test we create a very large file (10 GB) with 1MB writes
followed by 2 full-pass sequential reads. This test is still evolving
but we want to verify the ability of the filesystem to work with files
whose size is close to or larger than available free memory.
Test 9
In this test, we issue 8K writes at random 8K aligned offsets in a 1 GB
file. When 128 MB of data is written we issue an fsync().
Test 10
Here, we issue 2K writes at random (unaligned) offsets to a file
opened O_DSYNC.
Test 11
Same test as 10 but using 4 cooperating threads all working on a
single file.
Test 12
Here we attempt to simulate a mixed read/write pattern. Working with
an existing file, we loop through a pattern of 3 reads at 3 randomly
selected 8K aligned offsets followed by an 8K write to the last read
block.
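A minimal Python sketch of the Test 12 loop is shown below. Two
details are assumptions, since the description does not settle them:
both reads and writes count toward the byte total, and the write
simply puts back the data that was just read. The function name is
illustrative.

```python
import os
import random
import tempfile

BLOCK = 8 * 1024


def mixed_read_write(path, total_bytes, seed=0):
    """Repeat: 3 reads at random 8K-aligned offsets, then write the last block."""
    rng = random.Random(seed)
    blocks = os.path.getsize(path) // BLOCK
    fd = os.open(path, os.O_RDWR)
    moved = 0
    try:
        while moved < total_bytes:
            for _ in range(3):
                offset = rng.randrange(blocks) * BLOCK  # 8K-aligned offset
                data = os.pread(fd, BLOCK, offset)
                moved += len(data)
            # Rewrite the block that was read last (same data, same offset).
            moved += os.pwrite(fd, data, offset)
    finally:
        os.close(fd)
    return moved


# Small demo on a 64-block (512 KB) file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test12.dat")
    with open(path, "wb") as f:
        f.write(os.urandom(64 * BLOCK))
    print(mixed_read_write(path, total_bytes=32 * BLOCK))
```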
Test 13
In this test we issue 2K pread() calls (to a random unaligned
offset). File is asserted to not be in the cache. Since we currently
have no such control, we won't report data for this test.
Test 14
We have 4 cooperating threads (working on a single file) issuing 2K
pread() calls to random unaligned offsets. The file is present in the
cache.
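The shared-reader pattern of Test 14 could look roughly like this in
Python. The names and the per-thread seeding are assumptions, and
Python's interpreter lock limits how much real concurrency this
exercises, so treat it as an illustration of the access pattern rather
than as a benchmark.

```python
import os
import random
import tempfile
import threading


def reader(fd, file_size, bytes_per_thread, read_size, seed, totals, index):
    """Issue pread() calls at random unaligned offsets until the quota is met."""
    rng = random.Random(seed)
    done = 0
    while done < bytes_per_thread:
        offset = rng.randrange(file_size - read_size + 1)  # any byte offset
        done += len(os.pread(fd, read_size, offset))
    totals[index] = done


def shared_preads(path, threads=4, bytes_per_thread=5 * 1024 * 1024,
                  read_size=2048):
    """Run `threads` readers against one shared file descriptor."""
    file_size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    totals = [0] * threads
    try:
        # pread() takes an explicit offset, so sharing the descriptor is safe.
        workers = [
            threading.Thread(target=reader,
                             args=(fd, file_size, bytes_per_thread,
                                   read_size, i, totals, i))
            for i in range(threads)
        ]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
    finally:
        os.close(fd)
    return totals


# Small demo: 4 threads, 8 KB each, on a 64 KB file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test14.dat")
    with open(path, "wb") as f:
        f.write(os.urandom(64 * 1024))
    print(shared_preads(path, bytes_per_thread=8 * 1024))
```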
THE RESULTS
We have a common testing framework to generate the performance data.
Each test is written as a simple C program and the framework is
responsible for creating threads, files, timing the runs and
reporting. We are currently discussing merging this test framework
with the Filebench suite. We regret that we cannot easily share the
test code; however, the above descriptions should be sufficiently
precise to allow someone to reproduce our data. In my mind a simple
10 to 20 disk array and any small server should be enough to generate
similar numbers. If anyone finds very different results, I would be
very interested in knowing about it.
Our framework reports all timing results as a throughput
measure. Absolute values of throughput are highly test case dependent.
A 2K O_DSYNC write will not have the same throughput as a 1MB cached
read. Some tests would be better described in terms of operations per
second. However since our focus is a relative ZFS to UFS/SVM
comparison, we will focus here on the delta in throughput between the
2 filesystems (for the curious the full throughput data is posted in
the appendix).
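For readers who prefer operations per second, the conversion from a
throughput figure is simple. The sketch below assumes the framework
counts 1 MB as 1024 KB, which the text does not state; the input is
the ZFS throughput reported for Test 10 (2K O_DSYNC writes) in the
appendix.

```python
def ops_per_second(throughput_mb_s, io_size_kb):
    """Convert MB/s into operations/s, assuming 1 MB = 1024 KB."""
    return throughput_mb_s * 1024 / io_size_kb


# Test 10, ZFS column of the appendix: 0.117925 MB/s with 2K writes.
print(round(ops_per_second(0.117925, 2), 1))
```

Under that assumption the Test 10 figure corresponds to roughly 60
synchronous writes per second.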
Drumroll....
| Task ID | Description | Winning FS / Performance Delta |
|---|---|---|
| 1 | open() and allocation of a 128.00 MB file with write(1024K) then close(). | ZFS / 3.4X |
| 2 | open(O_DSYNC) and allocation of a 5.00 MB file with write(128K) then close(). | ZFS / 5.3X |
| 3 | open() and allocation of a 50.00 MB file with write() of size picked uniformly in [1K,8K] issuing fsync() every 10.00 MB | UFS / 1.8X |
| 4 | Sequential read(32K) of a 1024.00 MB file, cached. | ZFS / 1.1X |
| 5 | Sequential read(32K) of a 1024.00 MB file, uncached. | no data |
| 6 | Sequential rewrite(32K) of a 10.00 MB file, O_DSYNC, uncached | ZFS / 2.6X |
| 7 | Sequential rewrite() of a 1000.00 MB cached file, size picked uniformly in the [1K,8K] range, then close(). | UFS / 1.3X |
| 8 | create a file of size 1/2 of freemem using write(1MB) followed by 2 full-pass sequential read(1MB). No special cache manipulation. | ZFS / 2.3X |
| 9 | 128.00 MB worth of random 8K-aligned write to a 1024.00 MB file; followed by fsync(); cached. | UFS / 2.3X |
| 10 | 1.00 MB worth of 2K write to 100.00 MB file, O_DSYNC, random offset, cached. | draw (UFS == ZFS) |
| 11 | 1.00 MB worth of 2K write to 100.00 MB file, O_DSYNC, random offset, uncached. 4 cooperating threads each writing 1 MB | ZFS / 5.8X |
| 12 | 128.00 MB worth of 8K aligned read&write to 1024.00 MB file, pattern of 3 X read, then write to last read page, random offset, cached. | draw (UFS == ZFS) |
| 13 | 5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, uncached | no data |
| 14 | 5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, cached 4 threads. | UFS / 6.9X |
As stated in the abstract
- ZFS outpaces UFS in 6 tests by a mean factor of 3.4
- UFS outpaces ZFS in 4 tests by a mean factor of 3.0
- ZFS equals UFS in 2 tests.
The performance differences can be sizable; let's have a closer look
at some of them.
PERFORMANCE DEBRIEF
Let's look at each test to try to understand the cause of the
performance differences.
Test 1 (ZFS 3.4X)
open() and allocation of a
128.00 MB file with
write(1024K) then close().
This test is not fully analyzed. We note that in this situation UFS
will regularly kick off some I/O from the context of the write system
call. This would occur whenever a cluster of writes (typically of
size 128K or 1MB) has completed. The initiation of I/O by UFS slows
down the process. On the other hand ZFS can zoom through the test at
a rate much closer to a memcopy. The ZFS I/Os to disks are actually
generated internally by the ZFS transaction group mechanism: every few
seconds a transaction group will come and flush the dirty data to disk
and this occurs without throttling the test.
Test 2 (ZFS 5.3X)
open(O_DSYNC) and
allocation of a
5.00 MB file with
write(128K) then close().
Here ZFS shows an even bigger advantage. Because of its design and
complexity, UFS is actually somewhat limited in its capacity to
write-allocate files in O_DSYNC mode. Every new UFS write requires some
disk block allocation, which must occur one block at a time when
O_DSYNC is set. ZFS can easily outperform UFS for this test.
Test 3 (UFS 1.8X)
open() and allocation of a
50.00 MB file with write() of
size picked uniformly in
[1K,8K] issuing fsync()
every 10.00 MB
Here ZFS pays for the advantage it had in Test 1. In this test, we issue
very many writes to a file. Those are cached as the process is racing
along. When the fsync() hits (every 10 MB of outstanding data per the
test definition) the FS must now guarantee that all the data is sent to
stable storage. Since UFS kicks off I/O more regularly, when the
fsync() hits UFS has a smaller amount of data left to sync up. What
saves the day for ZFS is that, for that leftover data, UFS slows down to
a crawl. ZFS, on the other hand, has accumulated a large amount of data
in the cache by the time the fsync() hits. Fortunately ZFS is able to
issue much larger I/Os to disk and catches up on some of the lag that
has built up. But the final result shows that UFS wins the horse race
(at least in this specific test); details of the test will influence
the final result here.
However the ZFS team is working on ways to make the fsync() much
better. We actually have 2 possible avenues of improvements. We can
borrow from the UFS behavior and kick off some I/Os when too much
outstanding data is cached. UFS does this at a very regular interval
which does not look right either. But clearly if a file has many MB
of outstanding dirty data sending them off to disk might be
beneficial. On the other hand, keeping the data in cache is
interesting when the pattern of writing is such that the same file
offsets are written and re-written over and over again. Sending the
data to disk is wasteful if data is subsequently rewritten shortly
after. Basically the FS must place a bet on whether a future fsync()
will occur before a new write to the block. We cannot win this bet
on all tests all the time.
Given that fsync() performance is important, I would like to see us
asynchronously kick off I/O when we reach many MB of outstanding
data to a file. This is nevertheless debatable.
Even if we don't do this, we have another area of improvement that the
ZFS team is looking into. When the fsync() finally hits the fan, even
with a lot of outstanding data, the current implementation does not
issue disk I/Os very efficiently. The proper way to do this is to
kick off all required I/Os and then wait for them all to complete.
Currently in the intricacies of the code, some I/Os are issued and
waited upon one after the other. This is not yet optimal but we
certainly should see improvements coming in the future and I truly
expect ZFS fsync() performance to be ahead all the time.
Test 4 (ZFS 1.1X)
Sequential read(32K) of a 1024.00
MB file, cached.
Rather simple test, mostly close to memcopy speed between the
Filesystem cache and the user buffer. Contest is almost a wash with
ZFS slightly on top. Not yet analyzed.
Test 5 (N/A)
Sequential read(32K) of a 1024.00
MB file, uncached.
No results due to lack of control over the ZFS file-level caching.
Test 6 (ZFS 2.6X)
Sequential rewrite(32K) of a
10.00 MB file, O_DSYNC,
uncached
Due to the WAFL-like (Write Anywhere File Layout) design of ZFS, a
rewrite is not very different from an initial write and ZFS seems to
perform very well on this test. Presumably UFS performance is hindered
by the need to synchronize the cached data. Result not yet analyzed.
Test 7 (UFS 1.3X)
Sequential rewrite() of a 1000.00
MB cached file, size picked
uniformly in the [1K,8K]
range, then close().
In this test we are not timing any of the disk I/O. This is merely a
test about unrolling the filesystem code for 1K to 8K cached writes.
The UFS codepath wins in simplicity and years of performance tuning.
The ZFS codepath here somewhat suffers from its youth. Understandably
the current ZFS implementation is very well layered and we easily
imagine that the locking strategies of the different layers are
independent of one another. We have found (thanks dtrace) that a small
ZFS cached write would use about 3 times as many lock acquisitions as
an equivalent UFS call. Mutex rationalization within or between
layers certainly seems to be an area of potential improvement for ZFS
that would help this particular test. We also realised that the very
clean and layered code implementation is causing the callstack to
take very many elevator rides up and down between layers. On a SPARC
CPU going up and down 6 or 7 layers deep in the callstack causes a
spill/fill trap and one additional trap for every additional floor
travelled. Fortunately there are very many areas where ZFS will be
able to merge different functions into a single one or possibly exploit
the technique of tail calls to regain some of the lost performance.
All in all, we find that the performance difference is small enough to
not be worrisome at this point, especially in view of the possible
improvements we already have identified.
Test 8 (ZFS 2.3X)
create a file of size 1/2 of
freemem using write(1MB)
followed by 2 full-pass
sequential read(1MB). No
special cache manipulation.
This test needs to be analyzed further. We note that UFS will
proactively freebehind read blocks. While this is a very responsible
use of memory (give it back after use) it potentially impacts the
UFS re-read performance. While we're happy to see ZFS performance on
top, some investigation is warranted to make sure that ZFS does not
overconsume memory in some situations.
Test 9 (UFS 2.3X)
128.00 MB worth of random 8
K-aligned write to a
1024.00 MB file; followed
by fsync(); cached.
In this test we expect a rationale similar to the one of Test 3 to take
effect. The same cure should also apply.
Test 10 (draw)
1.00 MB worth of 2K write to
100.00 MB file, O_DSYNC,
random offset, cached.
Both FS must issue and wait for a 2K I/O on each write. They both do
this as efficiently as possible.
Test 11 (ZFS 5.8X)
1.00 MB worth of 2K write to
100.00 MB file, O_DSYNC,
random offset, uncached. 4
cooperating threads each
writing 1 MB
This test is similar to the previous one except for the 4 cooperating
threads. ZFS being on top highlights a key feature of ZFS, the lack of
a single-writer lock. UFS can only allow a single write thread working
per file. The only exception is when directio is enabled and then
only with rather restrictive conditions. UFS with directio would allow
concurrent writers with the implied restriction that it did not honor
full POSIX semantics regarding write atomicity. ZFS, out of the box,
is able to allow concurrent writers without requiring any special
setup or giving up full POSIX semantics. All great news for
simplicity of deployment and great database performance.
Test 12 (draw)
128.00 MB worth of 8K aligned
read&write to 1024.00 MB
file, pattern of 3 X read,
then write to last read
page, random offset,
cached.
Both filesystems perform appropriately. The test still requires analysis.
Test 13 (N/A)
5.00 MB worth of pread(2K) per
thread within a shared
1024.00 MB file, random
offset, uncached
No results due to lack of control over the ZFS file-level caching.
Test 14 (UFS 6.9X)
5.00 MB worth of pread(2K) per
thread within a shared
1024.00 MB file, random
offset, cached 4 threads.
This test inexplicably shows UFS on top. The UFS code can perform
rather well given that the FS cache is stored in the page cache.
Servicing reads from cache can be made very scalable. We are just
starting our analysis of the performance characteristics of ZFS for
this test. We have identified some serialization constructs in the
buffer management code, where we find that reclaiming the buffers into
which to put the cached data is acting as a serial throttle. This is
truly the only test where the ZFS performance disappoints, although
there is no doubt that we will find a cure for this
implementation issue.
THE TAKEAWAY
ZFS is on top in many of our tests, often by a significant
factor. Where UFS is ahead we have a clear view on how to improve the
ZFS implementation. The case of shared readers to a single file will
be the test that requires special attention.
Given the youth of the ZFS implementation, the performance outline
presented in this paper shows that the ZFS design decisions are totally
validated from a performance perspective.
FUTURE DIRECTIONS
Clearly, we should now expand the unit test coverage. We would like
to study more metadata-intensive workloads. We also would like to see
how ZFS features such as compression and RAID-Z perform. Other
interesting studies could focus on CPU consumption and memory
efficiency. We also need to find a solution to running the existing
unit tests that require the files to not be cached in the filesystem.
APPENDIX/ THROUGHPUT MEASURE
Here are the raw throughput measures for each of the 14 unit tests.
| Task ID | Description | ZFS latest+nv25 (MB/s) | UFS+nv25 (MB/s) | Result |
|---|---|---|---|---|
| 1 | open() and allocation of a 128.00 MB file with write(1024K) then close(). | 486.01572 | 145.94098 | ZFS 3.4X |
| 2 | open(O_DSYNC) and allocation of a 5.00 MB file with write(128K) then close(). | 4.5637 | 0.86565 | ZFS 5.3X |
| 3 | open() and allocation of a 50.00 MB file with write() of size picked uniformly in [1K,8K] issuing fsync() every 10.00 MB | 27.3327 | 50.09027 | UFS 1.8X |
| 4 | Sequential read(32K) of a 1024.00 MB file, cached. | 674.77396 | 612.92737 | ZFS 1.1X |
| 5 | Sequential read(32K) of a 1024.00 MB file, uncached. | 1756.57637 | 17.53705 | XXXXXXXXX |
| 6 | Sequential rewrite(32K) of a 10.00 MB file, O_DSYNC, uncached | 2.20641 | 0.85497 | ZFS 2.6X |
| 7 | Sequential rewrite() of a 1000.00 MB cached file, size picked uniformly in the [1K,8K] range, then close(). | 204.31557 | 257.22829 | UFS 1.3X |
| 8 | create a file of size 1/2 of freemem using write(1MB) followed by 2 full-pass sequential read(1MB). No special cache manipulation. | 698.18182 | 298.25243 | ZFS 2.3X |
| 9 | 128.00 MB worth of random 8K-aligned write to a 1024.00 MB file; followed by fsync(); cached. | 42.75208 | 100.35258 | UFS 2.3X |
| 10 | 1.00 MB worth of 2K write to 100.00 MB file, O_DSYNC, random offset, cached. | 0.117925 | 0.116375 | ==== |
| 11 | 1.00 MB worth of 2K write to 100.00 MB file, O_DSYNC, random offset, uncached. 4 cooperating threads each writing 1 MB | 0.42673 | 0.07391 | ZFS 5.8X |
| 12 | 128.00 MB worth of 8K aligned read&write to 1024.00 MB file, pattern of 3 X read, then write to last read page, random offset, cached. | 264.84151 | 266.78044 | ===== |
| 13 | 5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, uncached | 75.98432 | 0.11684 | XXXXXXXX |
| 14 | 5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, cached 4 threads. | 56.38486 | 386.70305 | UFS 6.9X |
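The summary figures quoted in this paper can be rechecked from the raw
numbers above. The Python sketch below does so for the 12 tests with
valid results. The 5% band used to call a test a draw is an
assumption; the paper does not say what threshold was used.

```python
# (ZFS MB/s, UFS MB/s) for the 12 tests that produced valid results.
results = {
    1: (486.01572, 145.94098),
    2: (4.5637, 0.86565),
    3: (27.3327, 50.09027),
    4: (674.77396, 612.92737),
    6: (2.20641, 0.85497),
    7: (204.31557, 257.22829),
    8: (698.18182, 298.25243),
    9: (42.75208, 100.35258),
    10: (0.117925, 0.116375),
    11: (0.42673, 0.07391),
    12: (264.84151, 266.78044),
    14: (56.38486, 386.70305),
}

DRAW_BAND = 0.05  # assumption: ratios within 5% of 1.0 count as a draw


def summarize(results, draw_band=DRAW_BAND):
    """Split tests into ZFS wins, UFS wins and draws, with the winning factor."""
    zfs_wins, ufs_wins, draws = {}, {}, []
    for test, (zfs, ufs) in sorted(results.items()):
        ratio = zfs / ufs
        if abs(ratio - 1.0) <= draw_band:
            draws.append(test)
        elif ratio > 1.0:
            zfs_wins[test] = ratio
        else:
            ufs_wins[test] = 1.0 / ratio
    return zfs_wins, ufs_wins, draws


zfs_wins, ufs_wins, draws = summarize(results)
for name, wins in (("ZFS", zfs_wins), ("UFS", ufs_wins)):
    mean = sum(wins.values()) / len(wins)
    print(f"{name} ahead in {len(wins)} tests, mean factor {mean:.2f}")
print("draws:", draws)
```

With this threshold the split is 6 tests for ZFS, 4 for UFS and 2
draws, with a mean factor close to 3.4 for ZFS and just over 3.0 for
UFS.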