
Wednesday, May 14, 2008
The new ZFS write throttle
A very significant improvement is coming soon to ZFS: a change that
will increase the general quality of service delivered by ZFS.
Interestingly, it's a change that might also slow down your
microbenchmark, but it's nevertheless a change you should be eager for.
Write throttling
For a filesystem, write throttling designates the act of blocking
an application for some amount of time, as short as possible, while waiting for
the proper conditions to allow the write system calls to succeed.
Write throttling is normally required because applications can write
to memory (dirty memory pages) at a rate significantly faster than the
kernel can flush the data to disk. Many workloads dirty memory pages
by writing to the filesystem page cache at near memory copy speed,
possibly using multiple threads issuing high rates of filesystem
writes. Concurrently, the filesystem is doing its best to drain all
that data to the disk subsystem.
Given the constraints, the time to empty the filesystem cache to disk
can be longer than the time required for applications to dirty the
cache. Even if one considers storage with fast NVRAM, under sustained
load, that NVRAM will fill up to a point where it needs to wait for a
slow disk I/O to make room for more data to get in.
When committing data to a filesystem in bursts, it can be quite
desirable to push the data at memory speed and then drain the cache to
disk during the lapses between bursts. But when data is generated at
a sustained high rate, lack of throttling leads to total memory
depletion. We thus need at some point to try and match the application
data rate with that of the I/O subsystem. This is the primary goal of
write throttling.
A secondary goal of write throttling is to prevent massive data loss.
When applications do not manage I/O synchronization (i.e., don't use
O_DSYNC or fsync), data ends up cached in the filesystem and the
contract is that there is no guarantee that the data will still be
there if a system crash were to occur. So even if the filesystem
cannot be blamed for such data loss, it is still a nice feature to
help prevent such massive losses.
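For readers who have never managed I/O synchronization by hand, here is a minimal Python sketch of the two mechanisms named above. The helper names and the temporary file are illustrative assumptions; only os.open, os.write, os.fsync and the O_DSYNC flag come from the standard POSIX-style interface.

```python
import os
import tempfile


def _write_all(fd, data):
    """os.write may write fewer bytes than asked, so loop until done."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def write_then_fsync(path, data):
    """Cached write followed by an explicit fsync()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        _write_all(fd, data)
        os.fsync(fd)  # returns once the filesystem reports the data as stable
    finally:
        os.close(fd)


def write_with_dsync(path, data):
    """Open with O_DSYNC so that each write() itself waits for stable storage."""
    # O_DSYNC is not defined on every platform; fall back to fsync() there.
    dsync = getattr(os, "O_DSYNC", 0)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | dsync, 0o644)
    try:
        _write_all(fd, data)
        if not dsync:
            os.fsync(fd)
    finally:
        os.close(fd)


def demo():
    """Write a small record durably into a throwaway directory and read it back."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "record.dat")
        write_then_fsync(path, b"committed")
        with open(path, "rb") as f:
            return f.read()
```

An application that skips both calls gets the faster, cached behavior and accepts the data-loss contract described above.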
Case in point : UFS Write throttling
For instance UFS would use the fsflush daemon to try to keep data
exposed for no more than 30 seconds (default value of autoup). Also,
UFS would keep track of the amount of I/O outstanding for each
file. Once too much I/O was pending, UFS would throttle writers for
that file. This was controlled through ufs_HW and ufs_LW, and their
values were commonly tuned (a bad sign). Eventually the old default
values were updated and seem to work nicely today. UFS write
throttling thus operates on a per-file basis. While there are some
merits to this approach, it can be defeated as it does not manage the
imbalance between memory and disks at a system level.
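The per-file scheme can be pictured with a small Python sketch. It assumes the usual high/low watermark behavior (writers stall once outstanding bytes exceed the high mark and resume once they drain below the low mark); the class name and the byte values are hypothetical, not the UFS code or its defaults.

```python
class PerFileThrottle:
    """Toy model of a per-file write throttle with high/low watermarks."""

    def __init__(self, high_water, low_water):
        self.high_water = high_water  # plays the role of ufs_HW (bytes)
        self.low_water = low_water    # plays the role of ufs_LW (bytes)
        self.outstanding = 0          # bytes of I/O pending for this file
        self.throttled = False

    def may_write(self):
        """True when writers to this file are allowed to proceed."""
        return not self.throttled

    def issue(self, nbytes):
        """Account for newly queued I/O; start throttling above the high mark."""
        self.outstanding += nbytes
        if self.outstanding > self.high_water:
            self.throttled = True

    def complete(self, nbytes):
        """Account for finished I/O; release writers below the low mark."""
        self.outstanding -= nbytes
        if self.outstanding < self.low_water:
            self.throttled = False
```

Note that nothing in this model looks at total system memory, which is the weakness pointed out above.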
ZFS Previous write throttling
ZFS is designed around the concept of transaction groups (txg).
Normally, every 5 seconds an _open_ txg goes to the quiesced
state. From that state the quiesced txg will go to the syncing state
which sends dirty data to the I/O subsystem. For each pool, there is
at most 1 txg in each of the 3 states: open, quiescing, syncing. Write
throttling used to occur when the 5 second txg clock would fire while
the syncing txg had not yet completed. The open group would wait on
the quiesced one which waits on the syncing one. Application writers
(write system call) would block, possibly a few seconds, waiting for a
txg to open. In other words, if a txg took more than 5 seconds to
sync to disk, we would globally block writers thus matching their
speed with that of the I/O. But if a workload had a bursty write
behavior that could be synced during the allotted 5 seconds, the
application would never be throttled.
The Issue
But ZFS did not sufficiently control the amount of data that could
get into an open txg. As long as the ARC cache was no more than half
dirty, ZFS would accept data. For a large memory machine or one with
weak storage, this was likely to cause long txg sync times. The
downsides were many:
- if we did end up throttled, long sync times meant the system
behavior would be sluggish for seconds at a time.
- long txg sync times also meant that the granularity at which
we could generate snapshots would be impacted.
- we ended up with lots of pending data in the cache, all of
which could be lost in the event of a crash.
- the ZFS I/O scheduler, which prioritizes operations, was also
negatively impacted.
- by not throttling, we had the possibility that
sequential writes on large files could displace from the ARC
a very large number of smaller objects. Refilling
that data meant a very large number of disk I/Os.
Not throttling can paradoxically end up being very
costly for performance.
- the previous code could also, at times, not be issuing I/Os
to disk for seconds even though the workload was
critically dependent on storage speed.
- and foremost, lack of throttling depleted memory and prevented
ZFS from reacting to memory pressure.
That ZFS is considered a memory hog is most likely the result of
the previous throttling code. Once a proper solution is in place, it will
be interesting to see if we behave better on that front.
The Solutions
The new code keeps track of the amount of data accepted in a TXG and
the time it takes to sync. It dynamically adjusts that amount so that
each TXG sync takes about 5 seconds (txg_time variable). It also
clamps the limit to no more than 1/8th of physical memory.
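As a rough illustration of the adjustment just described, the Python sketch below scales the per-TXG limit so that the next sync should take about the target time, then clamps it to 1/8 of physical memory. The proportional rule and the function name are assumptions made for illustration; the text only states the goal (about 5 seconds per sync) and the clamp.

```python
TXG_TIME_S = 5.0  # target sync time, the txg_time variable mentioned above


def next_txg_limit(bytes_synced, sync_seconds, phys_mem_bytes,
                   target_seconds=TXG_TIME_S):
    """Estimate how many bytes the next TXG may accept.

    Assumption: the storage keeps draining at the rate observed during the
    last sync, so the limit is scaled proportionally to hit the target time.
    """
    if sync_seconds <= 0:
        raise ValueError("sync_seconds must be positive")
    wanted = bytes_synced * target_seconds / sync_seconds
    # Never accept more than 1/8 of physical memory in one TXG.
    return int(min(wanted, phys_mem_bytes // 8))
```

For example, a TXG of 1 GiB that took 10 seconds to sync on a 64 GiB machine would bring the limit down to 512 MiB under this rule.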
And to avoid the system-wide, seconds-long throttle effect, the new
code will detect when we are dangerously close to that situation
(7/8th of the limit) and will insert 1-tick delays for applications
issuing writes. This prevents a write-intensive thread from hogging
the available space and starving out other threads. This delay should
also generally prevent the system-wide throttle.
So the new steady state behavior of write intensive workloads is that,
starting with an empty TXG, all threads will be allowed to dirty
memory at full speed until a first threshold of bytes in the TXG is
reached. At that time, every write system call will be delayed by 1
tick, thus significantly slowing down the pace of writes. If the
previous TXG completes its I/Os, the current TXG will be
allowed to resume at full speed. But in the unlikely event that a
workload, despite the per-write 1-tick delay, manages to fill the
TXG up to the full threshold, we will be forced to throttle all writes
in order to allow the storage to catch up.
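The steady-state behavior can be summarized as a small decision function. This Python sketch is one reading of the description, not the ZFS code: the names are hypothetical, and the hand-off that happens once the previous TXG finishes syncing is not modeled beyond "writers run at full speed again".

```python
from enum import Enum


class Action(Enum):
    FULL_SPEED = "full speed"
    DELAY_ONE_TICK = "delay each write by 1 tick"
    BLOCK = "block writers until storage catches up"


def throttle_action(dirty_bytes, txg_limit, previous_txg_syncing):
    """Pick what happens to a write given how full the open TXG is."""
    if not previous_txg_syncing:
        # The previous TXG completed its I/Os: resume at full speed.
        return Action.FULL_SPEED
    if dirty_bytes >= txg_limit:
        # Full threshold reached despite the delays: throttle all writes.
        return Action.BLOCK
    if dirty_bytes * 8 >= txg_limit * 7:
        # First threshold (7/8 of the limit): slow every write a little.
        return Action.DELAY_ONE_TICK
    return Action.FULL_SPEED
```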
It should make the system much better behaved and generally more
performant under sustained write stress.
If you are the owner of an unlucky workload that ends up slowed by more
throttling, do consider the other benefits that you get from the new
code. If that does not compensate for the loss, get in touch and tell
us what your needs are on that front.

Monday, January 08, 2007
NFS and ZFS, a fine combination
No doubt there is still a lot to learn about ZFS as an NFS server, and
this entry will not delve deeply into that topic. What I'd like to dispel
here is the notion that ZFS can cause some NFS workloads to exhibit
pathological performance characteristics.
The Sightings
Since there have been a few perceived 'sightings' of such slowdowns, a
little clarification is in order. Large slowdowns are
typically reported when looking at a single-threaded load, probably
doing small file creation such as 'tar xf many_small_files.tar'.
For instance, I've run one such small test over a 72G SAS drive.
tmpfs : 0.077 sec
ufs : 0.25 sec
zfs : 0.12 sec
nfs/zfs : 12 sec
There are a few things to observe here. Local filesystem services
have a huge advantage for this type of load: in the absence of a
specific request by the application (e.g. tar), local filesystems can
lose your data and no one will complain. This is data loss, not data
corruption, and this generally accepted data loss will occur in the
event of a system crash. The argument is that if you need a higher
level of integrity, you need to program it into applications
using O_DSYNC, fsync, etc. Many applications are not that critical and
avoid such a burden.
NFS and COMMIT
On the other hand, the nature of the NFS protocol is such that the
client _must_ at some specific point ask the server to place
previously sent data onto stable storage. This is done through an
NFSv3 or NFSv4 COMMIT operation. The COMMIT operation is a contract
between clients and servers that allows the client to forget about
its previous interaction with the file. In the event of a
server crash/reboot, the client is guaranteed that previously committed
data will be returned by the server. Operations since the last COMMIT
can be replayed after a server crash in a way that ensures a coherent
view between everybody involved.
But this all topples over if the COMMIT contract is not honored. If a
local filesystem does not properly commit data when requested to do
so, there is no more guarantee that the client's view of files will be
what it would otherwise normally expect. Despite the fact that the
client has completed the 'tar x' with no errors, it can happen
that some of the files are missing in full or in part.
With local filesystems, a system crash is plainly obvious to users and
requires applications to be restarted. With NFS, a server crash is
not obvious to users of the service (the only sign being a lengthy
pause), and applications are not notified. The fact that files or
parts of files may go missing in the absence of errors can be
considered plain corruption of the client's side view.
When the underlying filesystem serving NFS ignores COMMIT requests, or
when the storage subsystem acknowledges I/Os before they reach stable
storage, what is potential data loss on the server becomes
corruption from the client's point of view.
It turns out that in NFSv3/NFSv4 the client will request a COMMIT on
close. Moreover, the NFS server itself is required to commit on
metadata operations; for NFSv3 that is on:
SETATTR, CREATE, MKDIR, SYMLINK, MKNOD,
REMOVE, RMDIR, RENAME, LINK
and a COMMIT may be required on the containing directory.
Expected Performance
Let's imagine we find a way to run our load at 1 COMMIT (on close) per
extracted file. The COMMIT means the client must wait for at least a
full I/O latency, and since 'tar x' processes the tar file from a
single thread, that implies that we can run our workload at a
maximum rate (assuming infinitely fast networking) of one extracted file
per I/O latency, or about 200 extracted files per second (on modern
disks). If the files to be extracted are 1K in average size, the tar
x will proceed at a pace of 200K/sec. If we are required to issue 2
COMMIT operations per extracted file (for instance due to a
server-side COMMIT on file create), that would halve that
throughput number again.
However, if we had lots of threads extracting individual files
concurrently the performance would scale up nicely with the number of
threads.
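The arithmetic behind these numbers is simple enough to sketch in Python. The 5 ms I/O latency is an assumption chosen to match the "about 200 files per second" figure; the function names are illustrative, and the thread scaling is an idealized upper bound that ignores any server-side contention.

```python
LATENCY_S = 0.005  # assumed 5 ms per synchronous I/O, i.e. about 200 per second


def files_per_second(io_latency_s, commits_per_file=1, threads=1):
    """Upper bound on extraction rate when each COMMIT costs one I/O latency."""
    return threads / (io_latency_s * commits_per_file)


def kb_per_second(avg_file_kb, io_latency_s, commits_per_file=1, threads=1):
    """Data rate implied by the file rate and the average file size."""
    return avg_file_kb * files_per_second(io_latency_s, commits_per_file, threads)


if __name__ == "__main__":
    print(files_per_second(LATENCY_S))              # one COMMIT per file
    print(kb_per_second(1, LATENCY_S))              # 1K files, one COMMIT each
    print(kb_per_second(1, LATENCY_S, 2))           # two COMMITs per file
    print(files_per_second(LATENCY_S, threads=8))   # idealized 8 extractors
```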
But tar is single threaded, so what is actually going on here? The
need to COMMIT frequently means that our thread must frequently pause
for a full server-side I/O latency. Because our single-threaded tar is
blocked, nothing is able to process the rest of our workload. If we
allow the server to ignore COMMIT operations, then NFS responses will
be sent earlier, allowing the single thread to proceed down the tar
file at greater speed. One must realise that the extra performance is
obtained at the risk of causing corruption from the client's point of
view in the event of a crash.
Whether or not the client or the server needs to COMMIT as often as it
does is a separate issue. The existence of other clients that would be
accessing the files needs to be considered in that discussion. The
point being made here is that this issue is not particular to ZFS, nor
does ZFS necessarily exacerbate the problem. The performance of single
threaded writes to NFS will be throttled as a result of the
NFS-imposed COMMIT semantics.
ZFS Relevant Controls
ZFS has two controls that come into this picture: the disk write
caches and the zil_disable tunable. ZFS is designed to work correctly
whether or not the disk write caches are enabled. This is achieved
through explicit cache flush requests, which are generated (for
example) in response to an NFS COMMIT. Enabling the write caches is
then a performance consideration, and can offer performance gains for
some workloads. This is not the same with UFS which is not aware of
the existence of a disk write cache and is not designed to operate
with such cache enabled. Running UFS on a disk with write cache
enabled can lead to corruption of the client's view in the event of a
system crash.
ZFS also has the zil_disable control. ZFS is not designed to operate
with zil_disable set to 1. Setting this variable (before mounting a
ZFS filesystem) means that O_DSYNC writes, fsync as well as NFS COMMIT
operations are all ignored! We note that, even without a ZIL, ZFS will
always maintain a coherent local view of the on-disk state. But by
ignoring NFS COMMIT operations, it will cause the client's view to
become corrupted (as defined above).
Comparison with UFS
In the original complaint, there was no comparison between a
semantically correct NFS service delivered by ZFS and a
similar NFS service delivered by another filesystem. Let's gather
some more data:
Local and memory based filesystems :
tmpfs : 0.077 sec
ufs : 0.25 sec
zfs : 0.12 sec
NFS service with risk of corruption of client's side view :
nfs/ufs : 7 sec (write cache enable)
nfs/zfs : 4.2 sec (write cache enable,zil_disable=1)
nfs/zfs : 4.7 sec (write cache disable,zil_disable=1)
Semantically correct NFS service :
nfs/ufs : 17 sec (write cache disable)
nfs/zfs : 12 sec (write cache disable,zil_disable=0)
nfs/zfs : 7 sec (write cache enable,zil_disable=0)
We note that with most filesystems we can easily produce an
improper NFS service by enabling the disk write caches. In this
case, a server-side filesystem may think it has committed data to
stable storage, but the presence of an enabled disk write cache causes this
assumption to be false. With ZFS, enabling the write caches is not
sufficient to produce an improper service.
Disabling the ZIL (setting zil_disable to 1 using mdb and then
mounting the filesystem) is one way to generate an improper NFS
service. With the ZIL disabled, commit requests are ignored, with
potential corruption of the client's view.
Intelligent Storage
A different topic is running ZFS on intelligent storage arrays.
One known pathology is that some arrays will _honor_ the ZFS
request to flush the write caches despite the fact that their caches
are qualified as stable storage. In this case, NFS performance will
be much, much worse than otherwise expected. On this topic, and for ways to
work around this specific issue, see Jason's .Plan: Shenanigans with ZFS.
Conclusion
In many common circumstances, ZFS offers a fine NFS service that
complies with all NFS semantics even with write caches enabled.
If another filesystem appears much faster, I suggest first making sure
that this other filesystem complies in the same way.
This is not to say that ZFS performance cannot be improved, as clearly
it can. The performance of ZFS is still evolving quite rapidly. In
many situations, ZFS provides the highest throughput of any
filesystem. In others, ZFS performance is highly competitive with
other filesystems. In some cases, ZFS can be slower than other
filesystems -- while in all cases providing end-to-end data integrity,
ease of use and integrated services such as compression, snapshots
etc.
See also Eric's fine entry on zil_disable.

Friday, September 22, 2006
ZFS and OLTP
ZFS and Databases
Given that we have started to gain enough understanding of the internal
dynamics of ZFS,
I figured it was time to tackle the next hurdle: running a database
management system (DBMS). Now I know very little myself about DBMS,
so I teamed up with people that have tons of experience with it, my
colleagues from Performance Engineering (PAE), Neelakanth (Neel) Nagdir and
Sriram Gummuluru, getting occasional words of wisdom from Jim Mauro as
well.
Note that UFS (with DIO) has been heavily tuned over the years to
provide very good support for DBMS. We are just beginning to explore
the tweaks and tunings necessary to achieve comparable performance
from ZFS in this specialized domain.
We knew that running a DBMS would be a challenge, since a database
tickles filesystems in ways that are quite different from other types
of loads. We had 2 goals. Primarily, we needed to understand how
ZFS performs in a DB environment and in what specific areas it needs to
improve. Secondly, we figured that whatever came out of the
work could be used as blog material, as well as best practice
recommendations. You're reading the blog material now; also watch this
space for Best Practice updates.
Note that it was not a goal of this exercise to generate data for a world record press-release. (There is always a metric where this can be achieved.)
Workload
The workload we use in PAE to characterize DBMSes is called OLTP/Net.
This benchmark was developed inside Sun for the purpose of
engineering performance into DBMS. Modeled on common transaction
processing benchmarks, it is OLTP-like but with a higher network-to-disk ratio. This makes it more representative of real-world applications. Quoting from Neel's prose:
"OLTP/Net, the New-Order transaction involves multi-hops as it
performs Item validation, and inserts a single item per hop as
opposed to block updates "
I hope that means something to you; Neel will be blogging on his own,
if you need more info.
Reference Point
The reference performance point for this work would be UFS (with VxFS
being also an interesting data point, but I'm not tasked with
improving that metric). For DB loads we know that UFS directio (DIO)
provides a significant performance boost and that would be our target
as well.
Platform & Configuration
Our platform was a Niagara T2000 (8 cores @ 1.2GHz, 4 HW threads or
strands per core) with 130 36GB disks attached in JBOD
fashion. Each disk was partitioned into 2 equal slices, with half of
the surface given to a Solaris Volume Manager (SVM) volume onto which UFS
would be built, and the other half given to a ZFS pool.
The benchmark was designed to not fully saturate either the CPU or the disks. While we know that performance varies between the inner and outer disk surfaces, we don't expect the effect to be large enough to require attention here.
Write Cache Enabled (WCE)
ZFS is designed to work safely whether or not a disk write cache is enabled (WCE). This stays true when ZFS operates on a disk slice. However,
ZFS only turns _ON_ the write cache, as part of the import sequence,
when given a full disk; it won't enable the write cache when given only a
slice. So, to be fair to ZFS capabilities, we manually turned ON WCE when
running our tests over ZFS.
UFS is not designed to work with WCE and will put data at risk if WCE
is set, so we needed to turn it off for the UFS runs. We had to toggle
WCE by hand like this because we did not have enough disks to
give each filesystem its own full set. The performance we measured is therefore what
would be expected when giving full disks to either filesystem. We note
that, for the FC devices we used, WCE does not provide ZFS a
significant performance boost on this setup.
No Redundancy
For this initial effort we also did not configure any form of
redundancy for either filesystem. ZFS RAID-Z does not really have an
equivalent feature in UFS, so we settled on a simple stripe. We could
eventually configure software mirroring on both filesystems, but we
don't expect that to change our conclusions. Still, this will be
interesting in follow-up work.
DBMS logging
Another thing we know already is that a DBMS's log writer latency is
critical to OLTP performance. So in order to improve on that metric,
it's good practice to set aside a number of disks for the DBMS's
logs. With this in hand, we managed to run our benchmark and get our
target performance number (in relative terms, higher is better):
UFS/DIO/SVM : 42.5
Separate Data/log volumes
Recordsize
OK, so now we're ready. We load up Solaris 10 Update 2 (S10U2/ZFS),
build a log pool and a data pool and get going. Note that log writers actually generate a pattern
of sequential I/O of varying sizes. That should map quite well to
ZFS out of the box. But for the DBMS's data pool, we expect a very
random pattern of reads and writes to DB records. A commonly known ZFS
best practice when servicing fixed-record access is to match the ZFS
recordsize property to that of the application. We note that UFS, by
chance or by design, also works (at least on SPARC) using 8K records.
2nd run ZFS/S10U2
So for a fair comparison, we set the recordsize to 8K for the data
pool and run our OLTP/Net and....gasp!:
ZFS/S10U2 : 11.0
Data pool (8K record on FS)
Log pool (no tuning)
So that's no good and we have our work cut out for us.
The role of Prefetch in this result
To some extent we already knew of a subsystem that commonly misbehaves (and is being fixed as we speak): the vdev level prefetch code (that I also
refer to as the software track buffer). In this code, whenever ZFS
issues a small read I/O to a device, it will, by default, go and fetch
quite a sizable chunk of data (64K) located at the physical location
being read. In itself, this should not increase the I/O latency, which
is dominated by the head seek, and since the data is stored in a small
fixed-size buffer we don't expect this to eat up too much memory
either. However, in a heavy-duty environment like we have here, every
extra byte that moves up or down the data channel occupies valuable
space. Moreover, for a large DB, we really don't expect the
speculatively read data to be used very much. So for our next attempt
we'll tune down the prefetch buffer to 8K.
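To put a number on the channel cost, here is a small Python sketch. It assumes that every small read is rounded up to the prefetch size and that none of the extra data is ever used, which is the pessimistic case argued above for a large DB; the function names are illustrative.

```python
KB = 1024


def bytes_moved_per_read(read_bytes, prefetch_bytes):
    """Bytes transferred for one read when it is rounded up to the prefetch size."""
    return max(read_bytes, prefetch_bytes)


def channel_inflation(read_bytes, prefetch_bytes):
    """How many times more data crosses the channel than the DB asked for."""
    return bytes_moved_per_read(read_bytes, prefetch_bytes) / read_bytes
```

With 8K DB reads, the default 64K buffer gives channel_inflation(8 * KB, 64 * KB) == 8.0, while an 8K buffer brings it back to 1.0.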
And the role of the vq_max_pending parameter
But we don't expect this to be quite sufficient here. My DBMS-savvy
friends told me that the I/O latency of reads was quite large in
our runs. Now, ZFS prioritizes reads over writes, so we thought we
should be ok. However, during a pool transaction group sync, ZFS will
issue quite a number of concurrent writes to each device. That number is
governed by the vq_max_pending parameter, which defaults to 35. Clearly, during this
phase a read, even if prioritized, will take somewhat longer
to complete.
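A deliberately crude Python model gives a feel for the effect. It assumes that writes already handed to the device are serviced one at a time, ahead of a read that arrives later, and it assumes a 5 ms service time; real devices reorder requests, so treat the result as a pessimistic bound rather than a measurement.

```python
def worst_case_read_latency_ms(pending_writes, service_ms):
    """Read waits behind every write already queued, then gets serviced itself."""
    return (pending_writes + 1) * service_ms


if __name__ == "__main__":
    SERVICE_MS = 5.0  # assumed per-I/O service time
    for pending in (35, 10):
        print(pending, worst_case_read_latency_ms(pending, SERVICE_MS))
```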
3rd run, ZFS/S10U2 - tuned
So I wrote up a script to tune those 2 ZFS knobs. We could then run
with a vdev prefetch buffer of 8K and a vq_max_pending of 10. This
boosted our performance almost 2X:
ZFS/S10U2 : 22.0
Data pool (8K record on FS)
Log pool (no tuning)
vq_max_pending : 10
vdev prefetch : 8K
But not quite satisfying yet.
ZFS/S10U2 known bug
We know of something else about ZFS. In the last few builds before
S10U2, a little bug made its way into the code base. The effect of
this bug was that, for a full record rewrite, ZFS would actually read in
the old block even though that data is not needed at all.
Shouldn't be too bad; perfectly aligned block rewrites of uncached
data are not that common... except for databases, bummer.
So S10U2 is plagued with this issue, which affects DB performance and has no
workaround. Our next step was therefore to move on to the latest ZFS bits.
4th run ZFS/Build 44
Build 44 of our next Solaris version has long had this particular
issue fixed. There we topped our past performance with:
ZFS/B44 : 33.0
Data pool (8K record on FS)
Log pool (no tuning)
vq_max_pending : 10
vdev prefetch : 8K
As we compare to umpty-years of super tuned UFS:
UFS/DIO/SVM : 42.5
Separate Data/log volumes
Summary
I think at this stage of ZFS, the results are neither great nor
bad. We have achieved:
UFS/DIO : 100 %
UFS : xx no directio (to be updated)
ZFS Best : 75% best tuned config with latest bits.
ZFS S10U2 : 50% best tuned config.
ZFS S10U2 : 25% simple tuning.
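For reference, the percentages can be recomputed from the raw scores reported in the runs above. This Python sketch uses shorthand labels of my own for each run; the summary rounds the results to the nearest quarter.

```python
# Scores as reported in the individual runs (relative units, higher is better).
SCORES = {
    "UFS/DIO/SVM": 42.5,
    "ZFS/B44, tuned": 33.0,
    "ZFS/S10U2, tuned": 22.0,
    "ZFS/S10U2, recordsize only": 11.0,
}


def relative_percent(score, reference=SCORES["UFS/DIO/SVM"]):
    """Express a score as a percentage of the UFS/DIO reference point."""
    return 100.0 * score / reference


if __name__ == "__main__":
    for name, score in SCORES.items():
        print(f"{name:28s} {relative_percent(score):5.1f}%")
```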
To achieve acceptable performance levels, you need:
- The latest ZFS code base. ZFS improves fast these days. We
will need to keep tracking releases for a little while. The
current OpenSolaris release, as well as the upcoming Solaris 10
Update 3 (this fall), should perform in these tests as well
as the Build 44 results shown here.
- 1 data pool and 1 log pool: it is common practice to partition HW
resources when we want proper isolation. Going forward I think
that we will eventually get to the point where this will not be
necessary, but it seems an acceptable constraint for now.
- Tuned vdev prefetch: the code is being worked on. I expect
that in the near future this will not be necessary.
- Tuned vq_max_pending: that may take a little longer. In a DB
workload, latency is key and throughput secondary. There are
a number of ideas that need to be tested which will help ZFS
improve on both average latency and latency
fluctuations. This will help both the Intent log (O_DSYNC
write) latency and reads.
Parting Words
As these improvements come out, they may well allow ZFS to catch up with or
surpass our best UFS numbers. When you match that kind of performance
with all the usability and data integrity features of ZFS, that's a
proposition that becomes hard to pass up.

Tuesday, September 19, 2006
Tuning the knobs
A script is provided to tune some ZFS knobs

Wednesday, July 12, 2006
ZFS and Directio
In view of the great performance gains that UFS gets out of the
'Directio' (DIO) feature, it is interesting to ask ourselves where exactly
those gains come from and whether ZFS can be tweaked to benefit from
them in the same way.
UFS Directio
UFS Directio is actually a set of things bundled together that
improve the performance of very specific workloads, most notably that of
databases. Directio is actually a performance hint to the filesystem
and, apart from relaxing POSIX requirements, does not carry any change
in filesystem semantics. The users of directio assert the
condition at the full filesystem or individual file level, and the
filesystem code is given extra freedom to run, or not, the tuned DIO
code path.
What does that tuned code path get us? A few things:
- output goes directly from the application buffer to disk,
bypassing the filesystem core memory cache.
- the FS is no longer constrained to strictly obey the POSIX
write ordering. The FS is thus able to allow multiple threads
to concurrently issue I/Os to a single file.
- on input, UFS DIO refrains from doing any form of readahead.
In a sense, by taking out the middleman (the filesystem cache),
UFS/DIO causes files to behave a lot like a raw device. Application
reads and writes map one to one onto individual I/Os.
People often consider that the great gains that DIO provides come
from avoiding the CPU cost of the copy into system caches and from
avoiding the double buffering, once in the DB, once in the FS, that one
gets in the non-directio case.
I would argue that while the CPU cost associated with a copy certainly
does exist, the copy will run very quickly compared to the time
the ensuing I/O takes. So the impact of the copy would only appear on
systems that have their CPU quite saturated, notably for industry
standard benchmarks. However, real systems, which are more likely to
be I/O constrained than CPU constrained, should not pay a huge toll to
this effect.
As for double buffering, I note that databases (or applications in
general) are normally set up to consume a given amount of memory, and
the FS operates using the remaining portion. Filesystems cache data
in memory for lack of a better use of that memory, and they give up their
hold whenever necessary. So the data is not double buffered but
rather 'free' memory keeps a hold on recently issued I/O. Buffering
data in 2 locations does not look like a performance issue to me.
Anything for ZFS ?
So what does that leave us with? Why is DIO so good?
This tells me that we gain a lot from those 2 mantras:
- don't do any more I/O than requested
- allow multiple concurrent I/Os to a file.
I note that UFS readahead is particularly bad for certain usage: when
UFS sees access to 2 consecutive pages, it will read a full cluster,
and those are typically 1MB in size today. So avoiding UFS readahead
has probably contributed greatly to the success of DIO. As for ZFS,
there are 2 levels of readahead (a.k.a. prefetching): one that is
file based and one that is device based. Both are being reworked at this stage.
I note that the file based readahead code has not behaved and will not behave like
UFS. On the other hand, device level prefetching is probably
overly aggressive for DB type loads and should be avoided. While I
have not given up hope that this can be managed automatically, watch
this space for tuning scripts to control the device prefetching
behavior.
DIO for input does not otherwise appear to be an interesting proposition,
since if the data is cached, I don't really see the gains in bypassing
the cache (apart from slowing down the reads).
As for writes, ZFS, out of the box, does not suffer from the single
writer lock that UFS needs to implement the POSIX ordering rules. The
transaction groups (TXG) are sufficient for that purpose (see
The Dynamics of ZFS).
This leaves us with the amount of I/O needed by the 2 filesystems when
running many concurrent O_DSYNC writers issuing small writes to random
file offsets.
UFS actually handles this load by overwriting the data in its
preallocated disk locations. Every 8K page is associated with a set
place on the storage, and a write to that location means a disk head
movement and an 8K output I/O. This load should scale well with the
number of disks in the storage and the 'random' IOPS capability of
each drive. If a drive handles 150 random IOPS, then we can handle
about 1MB/s/drive of output.
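The per-drive figure follows directly from the IOPS number, as this short Python sketch shows. The function names are illustrative, and the linear scaling across drives is the idealized behavior described above.

```python
def random_write_kb_per_s(iops, io_kb=8):
    """Output rate of one drive doing one 8K write per random I/O."""
    return iops * io_kb


def pool_random_write_mb_per_s(drives, iops, io_kb=8):
    """Idealized aggregate rate, assuming the load spreads evenly over drives."""
    return drives * random_write_kb_per_s(iops, io_kb) / 1024
```

At 150 IOPS this gives 1200 KB/s for one drive, which is the "about 1MB/s/drive" quoted above.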
Now ZFS will behave quite differently. ZFS does not have preallocation
of file blocks and will not, ever, overwrite live data. The handling
of the O_DSYNC writes in ZFS will occur in 2 stages.
The 2 stages of ZFS
First, at the ZFS Intent Log (ZIL) level, we need to write the data out
in order to release the application blocked in a write call. Here the
ZIL has the ability to aggregate data from multiple writes and issue
fewer/larger I/Os than UFS would. Given the ZFS strategy of block
allocation, we also expect those I/Os to be able to stream to the disk
at high speed. We don't expect to be restrained by the random IOPS
capabilities of the disks but more by their streaming performance.
Next, at the TXG level, we clean up the state of the filesystem, and
here again the block allocation should allow a high rate of data
transfer. At this stage there are 2 things we have to care about.
With the current state of things, we probably will see the data
sent to disk twice, once to the ZIL and once to the pool. While this
appears suboptimal at first, the aggregation and streaming
characteristics of ZFS make the current situation already probably
better than what UFS can achieve. We're also looking to see if we can
make this even better by avoiding the 2 copies while preserving the
full streaming performance characteristics.
For pool level I/O, we must take care not to inflate the amount
of data sent to disk, which could eventually cause early storage
saturation. ZFS works out of the box with 128K records for large
files. However, for DB workloads, we expect this will be tuned such
that the ZFS recordsize matches the DB block size. We also expect the
DB block size to be at least 8K. Matching the ZFS recordsize to
the DB block size is a recommendation that is in line with what UFS DIO
has taught us: don't do any more I/O than necessary.
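A quick Python sketch shows why the recordsize matters. It assumes aligned DB blocks, no compression, and that modifying any part of a record causes the whole record to be written out again; metadata updates are ignored. The function names are illustrative.

```python
import math

KB = 1024


def bytes_rewritten(db_block, recordsize):
    """Data bytes sent to the pool when one aligned DB block is modified."""
    records_touched = math.ceil(db_block / recordsize)
    return records_touched * recordsize


def write_inflation(db_block, recordsize):
    """Ratio of bytes written to the pool over bytes the DB actually changed."""
    return bytes_rewritten(db_block, recordsize) / db_block
```

With the default 128K records, write_inflation(8 * KB, 128 * KB) is 16.0; once the recordsize is set to 8K it drops to 1.0.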
Note also that with ZFS, because we don't overwrite live data, every
block output needs to bubble up into metadata block updates, etc. So
there is some extra I/O that ZFS has to do. Depending on the exact
test conditions, the gains of ZFS can be offset by the extra metadata
I/Os.
ZFS Performance and DB
Despite all the advantages of ZFS, the reason that performance data has
been hard to come by is that we have to clear the road and bypass
the few side issues that currently affect performance on large DB
loads. At this stage, we do have to spend some time and apply magic
recipes to get ZFS performance on databases to behave the way it's
intended to.
But when the dust settles, we should be right up there in terms of
performance compared to UFS/DIO, and ideas for improvement are still
plentiful; if you have some more, I'm interested.

Wednesday June 21, 2006
The Dynamics of ZFS
ZFS has a number of identified components that govern its
performance. We review the major ones here.
Introducing ZFS
A volume manager is a layer of software that groups a set of block
devices in order to implement some form of data protection
and/or aggregation of devices, exporting the collection as a
storage volume that behaves as a simple block device.
A filesystem is a layer that will manage such a block device using a
subset of system memory in order to provide Filesystem operations
(including POSIX semantics) to applications and provide a
hierarchical namespace for storage - files. Applications issue
reads and writes to the Filesystem and the Filesystem issues Input
and Output (I/O) operations to the storage/block device.
ZFS implements those 2 functions at once. It thus typically manages
sets of block devices (leaf vdevs), possibly grouping them into
protected devices (RAID-Z or N-way mirror) and aggregating those
top level vdevs into a pool. Top level vdevs can be added to a pool
at any time. Objects that are stored onto a pool will be
dynamically striped onto the available vdevs.
Associated with pools, ZFS manages a number of very lightweight
filesystem objects. A ZFS filesystem is basically just a set of properties
associated with a given mount point. Properties of a filesystem
include the quota (maximum size) and reservation
(guaranteed size) as well as, for example, whether or not to
compress file data when storing blocks. The filesystem is
characterized as lightweight because it does not statically associate
with any physical disk blocks and any of its settable properties can
be simply changed dynamically.
Recordsize
The recordsize is one of those properties of a given ZFS filesystem
instance. ZFS files smaller than the recordsize are stored using
a single filesystem block (FSB) of variable length, in multiples of a
disk sector (512 Bytes). Larger files are stored using multiple FSBs,
each of recordsize bytes, with a default value of 128K.
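
The storage rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this article, not ZFS code; the function name fsb_layout is made up for the example.

```python
SECTOR = 512  # bytes; disk sector size quoted in the text


def fsb_layout(file_size, recordsize=128 * 1024):
    """Return (block_count, block_size_in_bytes) for a file of file_size bytes.

    Hypothetical helper: it only restates the rule described in the text.
    """
    if file_size <= 0:
        return (0, 0)
    if file_size <= recordsize:
        # One variable-length block, rounded up to a whole number of sectors.
        sectors = -(-file_size // SECTOR)  # ceiling division
        return (1, sectors * SECTOR)
    # Larger files: every block is exactly recordsize bytes.
    blocks = -(-file_size // recordsize)
    return (blocks, recordsize)


print(fsb_layout(1000))         # a small file: a single two-sector block
print(fsb_layout(1024 * 1024))  # a 1 MB file: eight 128K blocks
```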
The FSB is the basic file unit managed by ZFS and to which a checksum
is applied. After a file grows to be larger than the recordsize (and
gets to be stored with multiple FSBs) changing the Filesystem's
recordsize property will not impact the file in question. A copy
of the file will inherit the tuned recordsize value. An FSB can
be mirrored onto a vdev or spread to a RAID-Z device.
The recordsize is currently the only performance tunable of ZFS. The
default recordsize may lead to early storage saturation: for many
small updates (much smaller than 128K) to large files (bigger than 128K) the
default value can cause an extra strain on the physical storage or on
the data channel (such as a fiber channel) linking it to the host.
For those loads, if one notices a saturated I/O channel then tuning
the recordsize to smaller values should be investigated.
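
To put a number on that strain, here is a rough sketch that assumes each small update is aligned and lands in a different record of a large file, so that a whole record is output for every update. The function name io_inflation is hypothetical.

```python
def io_inflation(update_size, recordsize=128 * 1024):
    """Bytes of record output per byte of application update.

    Assumes aligned updates that each touch different records of a large
    file (an assumption made for this sketch, not a measured behavior).
    """
    records_touched = -(-update_size // recordsize)  # ceiling division
    return records_touched * recordsize / update_size


print(io_inflation(8 * 1024))            # 8K updates with the default 128K records
print(io_inflation(8 * 1024, 8 * 1024))  # recordsize tuned to the update size
```

With the default recordsize, an 8K update pushes 16 times its own size through the channel; with a matching recordsize the factor drops to 1.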
Transaction Groups
The basic mode of operation for write operations that do not require
synchronous semantics (no O_DSYNC, fsync(), etc.) is that ZFS will
absorb the operation in a per host system cache called the Adaptive
Replacement Cache (ARC). Since there is only one host system memory
but potentially multiple ZFS pools,
cached data from all pools is handled by a unique ARC.
Each file modification (e.g. a write) is associated with a certain
transaction group (TXG). At regular intervals (default of txg_time =
5 seconds) each TXG will shut down and the pool will issue a sync
operation for that group. A TXG may also be shut down when the ARC
indicates that there is too much dirty memory currently being
cached. As a TXG closes, a new one immediately opens and file
modifications then associate with the new active TXG.
If the active TXG shuts down while a previous one is still in the
process of syncing data to the storage, then applications will be
throttled until the running sync completes. In this situation we
are syncing a TXG, while TXG + 1 is closed due to memory
limitations or the 5 second clock and is waiting to sync itself;
applications are throttled waiting to write to TXG + 2. We need
sustained saturation of the storage or a memory constraint in order
to throttle applications.
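
The three-TXG situation just described can be pictured with a toy state machine. This is a simplified model written for this article (one open TXG, at most one waiting, at most one syncing); the class and method names are invented and do not correspond to ZFS internals.

```python
class TxgPipeline:
    """Toy model: one open TXG, one closed TXG waiting, one TXG syncing."""

    def __init__(self):
        self.open_txg = 1    # accepts new writes
        self.waiting = None  # closed, waiting for its turn to sync
        self.syncing = None  # currently being written to the pool

    @property
    def writers_throttled(self):
        # A closed TXG is stuck behind a running sync: new writes must wait.
        return self.waiting is not None

    def close_open(self):
        """Close the open TXG (5 second clock or dirty-memory limit)."""
        if self.waiting is not None:
            return  # already throttled; nothing more can be closed
        self.waiting = self.open_txg
        self.open_txg += 1
        self._start_sync()

    def sync_done(self):
        """The running sync has completed."""
        self.syncing = None
        self._start_sync()

    def _start_sync(self):
        if self.syncing is None and self.waiting is not None:
            self.syncing, self.waiting = self.waiting, None


p = TxgPipeline()
p.close_open()  # TXG 1 starts syncing, TXG 2 is open
p.close_open()  # TXG 2 closes behind the running sync of TXG 1
print(p.syncing, p.waiting, p.open_txg, p.writers_throttled)
p.sync_done()   # TXG 1 is on disk; TXG 2 starts syncing
print(p.syncing, p.waiting, p.open_txg, p.writers_throttled)
```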
A sync of the Storage Pool will involve sending all level 0 data
blocks to disk, then, when done, all level 1 indirect blocks, etc. until
eventually all blocks representing the new state of the filesystem
have been committed. At that point we update the uberblock to point
to the new consistent state of the storage pool.
ZFS Intent Log (ZIL)
For file modifications that come with some immediate data integrity
constraint (O_DSYNC, fsync etc.) ZFS manages a per-filesystem
intent log or ZIL. The ZIL marks each FS operation (say a
write) with a log sequence number. When a synchronous command is
requested for the operation (such as an fsync), the ZIL will output
blocks up to the sequence number. When the ZIL is in the process of
committing data, further commit operations will wait for the
previous ones to complete. This allows the ZIL to aggregate
multiple small transactions into larger ones, thus performing
commits using fewer larger I/Os.
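
The sequence-number mechanism can be sketched as follows. This is a minimal illustration of the idea only: the class IntentLog and its methods are invented for the example and say nothing about the real ZIL data structures.

```python
class IntentLog:
    """Toy intent log: records carry a sequence number and are committed in batches."""

    def __init__(self):
        self.next_seq = 1
        self.pending = []  # (seq, payload) records not yet on stable storage
        self.ios = []      # each entry stands for one aggregated log I/O

    def log(self, payload):
        """Record an operation and return its log sequence number."""
        seq = self.next_seq
        self.next_seq += 1
        self.pending.append((seq, payload))
        return seq

    def commit(self, up_to_seq):
        """Output every pending record up to up_to_seq as a single I/O."""
        batch = [rec for rec in self.pending if rec[0] <= up_to_seq]
        if batch:
            self.ios.append(batch)
            self.pending = [rec for rec in self.pending if rec[0] > up_to_seq]
        return len(batch)


zil = IntentLog()
zil.log("write A")
zil.log("write B")
last = zil.log("write C")
zil.log("write D")
print(zil.commit(last))  # three small writes go out together
print(len(zil.ios))      # as one log I/O
```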
The ZIL works by issuing all the required I/Os and then flushing
the write caches if those are enabled. This use of the disk write
cache does not artificially improve a disk's commit latency because
ZFS ensures that data is physically committed to storage before returning. However
the write cache allows a disk to hold multiple concurrent I/O
transactions and this acts as a good substitute for drives that do
not implement tag queues.
CAVEAT:
The current state of the ZIL is such that if there is a lot of
pending data in a Filesystem (written to the FS, not yet output to
disk) and a process issues an fsync() for one of its files, then all
pending operations will have to be sent to disk before the
synchronous command can complete. This can lead to unexpected
performance characteristics. Code is under review.
I/O Scheduler and Priorities
ZFS keeps track of pending I/Os but only issues to disk
controllers a certain number (35 by default). This allows the
controllers to operate efficiently while never overflowing their
queues. By limiting the I/O queue size, service times of individual
disks are kept to reasonable values. When one I/O completes, the
I/O scheduler then decides the next most important one to issue.
The priority scheme is time based; so for instance an Input I/O to
service a read call will be prioritized over any regular Output
I/O issued in the last ~ 0.5 seconds.
The fact that ZFS will limit each leaf device's I/O queue to 35 is
one of the reasons that suggest that a zpool should be built using
vdevs that are individual disks or at least volumes that map to
a small number of disks. Otherwise this self imposed limit
could become an artificial performance throttle.
Read Syscalls
If a read cannot be serviced from the ARC cache, ZFS will issue a
'prioritized' I/O for the data. So even if the storage is handling
a heavy output load, there are only 35 I/Os outstanding, all
with reasonable service times. As soon as one of the 35 I/Os completes
the I/O scheduler will issue the read I/O to the controller. This
ensures good service times for read operations in general.
However, to avoid starvation, when there is a long-standing backlog of
Output I/Os then eventually those regain priority over the Input
I/O. ZIL synchronous I/Os are of the same priority as synchronous
reads.
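
One way to picture a time based priority scheme with a bounded queue is a deadline heap. The sketch below is an assumption about how such a scheme could be modeled, not the actual ZFS scheduler: regular writes get a deadline 0.5 seconds after submission, reads get an immediate deadline, and at most 35 I/Os are in flight. All names are invented for the example.

```python
import heapq

MAX_INFLIGHT = 35    # per-device limit quoted in the text
WRITE_PENALTY = 0.5  # seconds; assumed from the "~ 0.5 seconds" figure


class DeviceQueue:
    def __init__(self):
        self.pending = []  # heap of (deadline, arrival_order, kind)
        self.inflight = 0
        self.order = 0

    def submit(self, kind, now):
        # Reads get deadline = now; regular writes are pushed into the
        # future, so a read beats any write submitted in the last 0.5 s
        # while older writes still win (no starvation).
        deadline = now + (WRITE_PENALTY if kind == "write" else 0.0)
        heapq.heappush(self.pending, (deadline, self.order, kind))
        self.order += 1

    def issue(self):
        """Issue pending I/Os, most urgent first, up to the in-flight limit."""
        issued = []
        while self.pending and self.inflight < MAX_INFLIGHT:
            _, _, kind = heapq.heappop(self.pending)
            self.inflight += 1
            issued.append(kind)
        return issued

    def complete(self):
        """One in-flight I/O has finished."""
        self.inflight -= 1


q = DeviceQueue()
for _ in range(40):
    q.submit("write", now=0.0)
print(len(q.issue()))   # only 35 writes reach the device
q.submit("read", now=0.1)
q.complete()            # one slot frees up
print(q.issue())        # the read jumps ahead of the 5 queued writes
```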
Prefetch
The prefetch code allowing ZFS to detect sequential or strided access to
a file and issue I/O ahead of phase is currently under review. To
quote the developer: "ZFS prefetching needs some love".
Write Syscalls
ZFS never overwrites live data on-disk and will always output full records
validated by a checksum. So in order to partially overwrite a
file record, ZFS first has to have the corresponding data in
memory. If the data is not yet cached, ZFS will issue an input
I/O before allowing the write(2) to partially modify the file
record. With the data now in cache, more writes can target the
blocks. On output ZFS will checksum data before sending to disk.
For full record overwrite the input phase is not necessary.
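
The rule on partial versus full record overwrites can be expressed as a small helper. This sketch assumes the write lands on existing data of a multi-record file; the function records_to_read and its parameters are made up for illustration.

```python
def records_to_read(offset, length, recordsize=128 * 1024, cached=()):
    """Return the indices of records that must be input before the write.

    A record needs an input I/O only when the write covers it partially
    and it is not already cached. Assumes an overwrite of existing data.
    """
    if length <= 0:
        return []
    end = offset + length
    first = offset // recordsize
    last = (end - 1) // recordsize
    needed = []
    for rec in range(first, last + 1):
        rec_start = rec * recordsize
        rec_end = rec_start + recordsize
        fully_covered = offset <= rec_start and end >= rec_end
        if not fully_covered and rec not in cached:
            needed.append(rec)
    return needed


K = 1024
print(records_to_read(0, 128 * K))               # full record: no input needed
print(records_to_read(4 * K, 8 * K))             # partial, uncached: read record 0
print(records_to_read(4 * K, 8 * K, cached={0})) # partial but cached: no input
```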
CAVEAT:
Simple write calls (not O_DSYNC) are normally absorbed by the ARC
cache and so proceed very quickly. Such a sustained dd(1)-like load can
quickly overrun a large amount of system memory and cause transaction
groups to eventually throttle all applications for large amounts
of time (10s of seconds). This is probably what underlies the
notion that ZFS needs more RAM (it does not). Write throttling code
is under review.
Soft Track Buffer
An input I/O is serious business. While a Filesystem can decide
where to write stuff out on disk, the Inputs are requested by
applications. This means a necessary head seek to the location of the
data. The time to issue a small read will be totally dominated by
this seek. So ZFS takes the stance that it might as well
amortize those operations and so, for uncached reads,
ZFS normally will issue a fairly large Input I/O (64K by
default). This will help loads that input data using
an access pattern similar to the output phase. The data goes into a
per device cache holding 20MB.
This cache can be invaluable in reducing the I/Os necessary to read in
data. But just like the recordsize, if the inflated I/O causes
storage channel saturation, the Soft Track Buffer can act as a
performance throttle.
The ARC Cache
The most interesting caching occurs at the ARC layer. The ARC manages
the memory used by blocks from all pools (each pool servicing
many filesystems). ARC stands for Adaptive Replacement Cache
and is inspired by a paper by Megiddo and Modha presented at the
USENIX FAST '03 conference.
The ARC manages its data keeping a notion of Most Frequently Used (MFU)
and Most Recently Used (MRU), balancing intelligently between the two.
One of its very interesting properties is that a large scan of a
file will not destroy most of the cached data.
On a system with Free Memory, the ARC will grow as it starts to cache
data. Under memory pressure the ARC will return some of its memory
to the kernel until low memory conditions are relieved.
We note that while ZFS has behaved rather well under 'normal' memory
pressure, it does not appear to behave satisfactorily under swap
shortage. The memory usage pattern of ZFS is very different from that of
other filesystems such as UFS and so exposes VM layer issues in a number of
corner cases. For instance, a number of kernel
operations fail with ENOMEM without even attempting a reclaim
operation. If they did, then ZFS would respond by releasing
some of its own buffers, allowing the initial operation to then
succeed.
The fact that ZFS caches data in the kernel address space does mean that
the kernel size will be bigger than when using traditional filesystems. For
heavy duty usage it is recommended to use a 64-bit kernel, i.e. any
SPARC system or an AMD system configured in 64-bit mode. Some systems
that have managed in the past to run without any swap configured
should probably start to configure some.
The behavior of the ARC in response to memory pressure is under review.
CPU Consumption
Recent enhancements to ZFS have improved its CPU efficiency by a large
factor. We don't expect to deviate from other filesystems much in
terms of cycles per operation. ZFS checksums all disk blocks but this
has not proven to be costly at all in terms of CPU consumption.
ZFS can be configured to compress on-disk blocks. We do expect to
see some extra CPU consumption from that compression. While it is
possible that compression could lead to some performance gain due to
reduced I/O load, the emphasis of compression should be to save
on-disk space, not performance.
What About Your Test?
This is what I know about the ZFS performance model today. My
performance comparison on different types of modelled workloads,
made last fall, already had ZFS ahead on many of them; we have
improved the biggest issues highlighted then and there are further
performance improvements in the pipeline (based on UFS, we know this
will never end). Best Practices are being spelled out.
You can contribute by comparing your actual usage and workload pattern
with the simulated workloads. But nothing will beat having
reports from real workloads at this stage; your results are
therefore of great interest to us.
And watch this space for updates...

Wednesday June 07, 2006
Tuning ZFS recordsize
One important performance parameter of ZFS is the recordsize, which
governs the size of filesystem blocks for large files. This is
the unit that ZFS validates through checksums.
Filesystem blocks are dynamically striped onto the pooled storage,
on a block to virtual device (vdev) basis.
It is expected that for some loads, tuning the recordsize will be
required. Note that in traditional filesystems such a tunable
would govern the behavior of all of the underlying storage. With
ZFS, tuning this parameter only affects the tuned Filesystem
instance; it will apply to newly created files. The tuning is
achieved using
zfs set recordsize=64k mypool/myfs
In ZFS all files are stored either as a single block of varying
size (up to the recordsize) or using multiple recordsize
blocks. Once a file grows to be multiple blocks, its blocksize is
definitively set to the FS recordsize at the time.
Some more experience will be required with the recordsize tuning.
Here are some elements to guide along the way.
If one considers the input of an FS block, typically in response to an
application read, the size of the I/O in question will basically
not impact the latency by much. So, as a first
approximation, the recordsize does not matter (I'll come back to
that) to read-type workloads.
FS block outputs, those that are governed by the recordsize,
actually occur mostly asynchronously with the application; and
since applications are not commonly held up by those outputs, the
delivered throughput is, as for read-type loads, not impacted by
the recordsize.
So the first approximation is that recordsize does not impact
performance much. To service loads that are transient in nature
with short I/O bursts (< 5 seconds) we do not expect recordsize
tuning to be necessary. The same can be said for sequential type
loads.
So what about the second approximation? A problem that can occur
with using an inflated recordsize (128K) compared to application
read/write sizes is early storage saturation. If an application
requests 64K of data, then providing a 128K record doesn't change the
latency that the application sees much. However if the extra data is
discarded from the cache before ever being read, we see that the
data channel was occupied for no good reason.
If a limiting factor to the storage is, for instance, a 100MB/sec
channel, I can handle roughly 700 128K records per second on that
channel. If I halve the recordsize, that should double the number
of small records I can input.
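
The channel arithmetic is easy to check. The sketch below computes an idealized upper bound, assuming 100MB/sec means 100 * 10^6 bytes per second and ignoring any protocol overhead (both are assumptions); the figure of about 700 quoted above is in the same ballpark.

```python
def records_per_second(channel_bytes_per_sec, recordsize):
    """Idealized number of records a channel can carry per second."""
    return channel_bytes_per_sec / recordsize


CHANNEL = 100 * 10**6  # assumed meaning of "100MB/sec"

full = records_per_second(CHANNEL, 128 * 1024)
half = records_per_second(CHANNEL, 64 * 1024)
print(f"128K records: about {full:.0f} per second")
print(f" 64K records: about {half:.0f} per second")
```

Halving the recordsize doubles the record rate the channel can sustain, which is the whole point of tuning it down for small random reads.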
On the small record output loads, the system memory creates a buffer
that defers the direct impact to applications. For output, if the
storage is saturated this way for tens of seconds, ZFS will
eventually throttle applications. This means that, in the end,
when the recordsize leads to sustained storage overload on
output, there will be an impact as well.
There is another aspect to the recordsize. A partial write to an uncached
FS block (a write syscall of size smaller than the recordsize) will
have to first input the corresponding data. Conversely, when
individual writes are such that they cover full filesystem
recordsize blocks, those writes can be handled without the need to
input the associated FS blocks. Other considerations (metadata
overhead, caching) dictate however that the recordsize not be
reduced below a certain point (16K to 64K; do send in your
experience).
So, one piece of advice is to keep an eye on the channel throughput and tune
recordsize for random access workloads that saturate the storage.
Sequential type workloads should work quite well with the current
default recordsize. If the applications' read/write sizes can
be increased, that should also be considered. For non-cached
workloads that overwrite file data in small aligned chunks,
matching the recordsize with the write access size may bring
some performance gains.

Tuesday June 06, 2006
DOES ZFS REALLY USE MORE RAM ?
I'll touch on 3 aspects of that
question here:
- reported freemem
- syscall writes to mmap pages
- application write throttling
Reported freemem will be lower when
running with ZFS than with, say, UFS. The UFS page cache is considered as
freemem. ZFS will return its 'cache' only when memory is needed.
So you will operate with lower freemem but won't normally suffer
from this.
It's been wrongly feared that this
mode of operation puts us back to the days of Solaris 2.6 and 7
where we saw a roller coaster effect on freemem leading to sub-par
application performance. We actually DO NOT have this problem with
ZFS. The old problem came because the memory reaper could not
distinguish between a useful application page and a UFS cached
page. That was bad. ZFS frees up its cache in a way that does not
cause this problem.
ZFS is designed to release some of
its memory when kernel modules exert back pressure onto the kmem
subsystem. Some kernel code that did not properly exert that pressure
was recently fixed (short description here: 4034947).
There is one peculiar
workload that does lead ZFS to consume more memory: writing (using
syscalls) to pages that are also mmaped. ZFS does not use the
regular paging system to manage data that passes through read and
write syscalls. However mmaped I/O, which is closely tied to the
Virtual Memory subsystem, still goes through the regular paging code.
So syscall writing to mmaped pages means we will keep 2 copies
of the associated data, at least until we manage to get the data
to disk. We don't expect that type of load to commonly use large
amounts of RAM.
Finally, one area where ZFS will
behave quite differently from UFS is in throttling writers. With
UFS, up to not long ago, we throttled a process trying to write to
a file as soon as that file had 0.5 MB of I/O pending associated
with it. This limit has recently been upped to 16 MB. The gain of
such throttling is that we prevent an application working on a
single file from consuming an inordinate amount of system memory. The
downside is that we throttle an application, possibly
unnecessarily, when memory is plentiful.
ZFS will not throttle individual
apps like this. The scheme is mutualized between all writers:
when the global load of application data overflows the I/O
subsystem for 5 to 10 seconds then we throttle the applications,
allowing the I/O to catch up. Applications thus have a
lot more RAM to play with before being throttled.
This is probably what's behind the
notion that ZFS likes more RAM. By and large, to cache some data,
ZFS just needs the equivalent amount of RAM as any other filesystem.
But currently, ZFS lets applications run a lot more decoupled from
the I/O subsystem. This can speed up some loads by a very large
factor but, at times, will appear as extra memory consumption.

Wednesday May 31, 2006
WHEN TO (AND NOT TO) USE RAID-Z
RAID-Z is the technology used by ZFS to implement a data-protection scheme
which is less costly than mirroring in terms of block
overhead.
Here, I'd like to go over, from a theoretical standpoint, the
performance implications of using RAID-Z. The goal of this technology
is to allow a storage subsystem to be able to deliver the stored data
in the face of one or more disk failures. This is accomplished by
joining multiple disks into an N-way RAID-Z group. Multiple RAID-Z
groups can be dynamically striped to form a larger storage pool.
To store file data onto a RAID-Z group, ZFS will spread a filesystem
(FS) block onto the N devices that make up the group. So for each FS
block, (N - 1) devices will hold file data and 1 device will hold
parity information. This information would eventually be used to
reconstruct (or resilver) data in the face of any device failure. We
thus have 1 / N of the available disk blocks that are used to store
the parity information. A 10-disk RAID-Z group has 9/10th of the
blocks effectively available to applications.
A common alternative for data protection is the use of mirroring. In
this technology, a filesystem block is stored onto 2 (or more) mirror
copies. Here again, the system will survive a single disk failure (or
more with N-way mirroring). So a 2-way mirror actually delivers similar
data-protection at the expense of providing applications access to
only one half of the disk blocks.
Now let's look at this from the performance angle, in particular that
of delivered filesystem blocks per second (FSBPS). An N-way RAID-Z
group achieves its protection by spreading a ZFS block onto the N
underlying devices. That means that a single ZFS block I/O must be
converted to N device I/Os. To be more precise, in order to access a
ZFS block, we need N device I/Os for Output and (N - 1) device I/Os for
input as the parity data need not generally be read in.
Now after a request for a ZFS block has been spread this way, the I/O
scheduling code will take control of all the device I/Os that need to
be issued. At this stage, the ZFS code is capable of aggregating
adjacent physical I/Os into fewer ones. Because of the ZFS
Copy-On-Write (COW) design, we actually do expect this reduction in
the number of device level I/Os to work extremely well for just about any
write intensive workload. We also expect it to help streaming input
loads significantly. The situation of random inputs is one that needs
special attention when considering RAID-Z.
Effectively, as a first approximation, an N-disk RAID-Z group will
behave as a single device in terms of delivered random input
IOPS. Thus a 10-disk group of devices each capable of 200-IOPS, will
globally act as a 200-IOPS capable RAID-Z group. This is the price to
pay to achieve proper data protection without the 2X block overhead
associated with mirroring.
With 2-way mirroring, each FS block output must be sent to 2 devices.
Half of the available IOPS are thus lost to mirroring. However, for
Inputs each side of a mirror can service read calls independently from
one another since each side holds the full information. Given a
proper software implementation that balances the inputs between sides
of a mirror, the number of FS blocks delivered by a mirrored group is
actually no less than what a simple non-protected RAID-0 stripe would give.
So looking at random access input load and the number of FS blocks per
second (FSBPS), given N devices to be grouped either in RAID-Z, 2-way
mirrored or simply striped (a.k.a. RAID-0, no data protection!), the
equation would be (where dev represents the capacity in terms of
blocks or IOPS of a single device):
                                  Random
         Blocks Available     FS Blocks / sec
         ----------------     ---------------
RAID-Z   (N - 1) * dev        1 * dev
Mirror   (N / 2) * dev        N * dev
Stripe   N * dev              N * dev
Now let's take 100 disks of 100 GB, each capable of 200 IOPS, and
look at different possible configurations. In the table below the
configuration labeled:
"Z 5 x (19+1)"
refers to a dynamic striping of 5 RAID-Z groups, each group made of 20
disks (19 data disks + 1 parity). M refers to a 2-way mirror and S to a
simple dynamic stripe.
                                      Random
Config          Blocks Available   FS Blocks / sec
------------    ----------------   ---------------
Z  1 x (99+1)        9900 GB               200
Z  2 x (49+1)        9800 GB               400
Z  5 x (19+1)        9500 GB              1000
Z 10 x (9+1)         9000 GB              2000
Z 20 x (4+1)         8000 GB              4000
Z 33 x (2+1)         6600 GB              6600
M  2 x (50)          5000 GB             20000
S  1 x (100)        10000 GB             20000
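
The table follows directly from the first-approximation equations given earlier. As a sketch (function names are invented for the example; the defaults of 100 GB and 200 IOPS per disk are the ones used in this article):

```python
def raidz_pool(groups, data_disks, disk_gb=100, disk_iops=200):
    """Usable GB and random-read FS blocks/sec for striped RAID-Z groups.

    Each group has data_disks data disks plus 1 parity disk and, as a
    first approximation, delivers the random IOPS of a single disk.
    """
    return groups * data_disks * disk_gb, groups * disk_iops


def mirror_pool(disks, disk_gb=100, disk_iops=200):
    """2-way mirroring: half the blocks, but every disk can serve reads."""
    return (disks // 2) * disk_gb, disks * disk_iops


def stripe_pool(disks, disk_gb=100, disk_iops=200):
    """Plain dynamic stripe: all blocks, all IOPS, no protection."""
    return disks * disk_gb, disks * disk_iops


for groups, data in [(1, 99), (2, 49), (5, 19), (10, 9), (20, 4), (33, 2)]:
    gb, fsbps = raidz_pool(groups, data)
    print(f"Z {groups:2d} x ({data}+1): {gb:5d} GB  {fsbps:5d} FS blocks/sec")
print("M  2 x (50):  %5d GB  %5d FS blocks/sec" % mirror_pool(100))
print("S  1 x (100): %5d GB  %5d FS blocks/sec" % stripe_pool(100))
```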
So RAID-Z gives you at most 2X the number of blocks that mirroring
provides but hits you with far fewer delivered IOPS. That means
that, as the number of devices in a group N increases, the expected
gain over mirroring (disk blocks) is bounded (to at most 2X) but the
expected cost in IOPS is not bounded (cost in the range of [N/2, N]
fewer IOPS).
Note that for wide RAID-Z configurations, ZFS takes into account the
sector size of devices (typically 512 Bytes) and dynamically adjusts
the effective number of columns in a stripe. So even if you request a
99+1 configuration, the actual data will probably be stored on far
fewer data columns than that. Hopefully this article will contribute
to steering deployments away from those types of configuration.
In conclusion, when preserving IOPS capacity is important, the size of
RAID-Z groups should be restrained to smaller sizes and one must
accept some level of disk block overhead.
When performance matters most, mirroring should be highly favored. If
mirroring is considered too costly but performance is nevertheless
required, one could proceed like this:
Given N devices each capable of X IOPS.
Given a target of delivered Y FS blocks per second
for the storage pool.
Each RAID-Z group delivers X FS blocks per second, so (Y / X)
groups are needed.
Build your storage using dynamically striped RAID-Z groups of
(N / (Y / X)) devices.
For instance:
Given 50 devices each capable of 200 IOPS.
Given a target of delivered 1000 FS blocks per second
for the storage pool.
Build your storage using dynamically striped RAID-Z groups of
(50 / (1000 / 200)) = 10 devices.
In that system we then would have 10% block overhead lost to maintain
RAID-Z level parity.
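
A candidate group size can be checked against the first-approximation model used throughout this article (each RAID-Z group delivers the random input IOPS of one device and gives up one device's worth of blocks to parity). The helper below is a hypothetical sketch of that check, not a sizing tool.

```python
def raidz_layout(total_devices, group_size, dev_iops):
    """Return (groups, delivered FS blocks/sec, parity overhead fraction)."""
    groups = total_devices // group_size
    fsbps = groups * dev_iops         # one device's worth of IOPS per group
    parity_overhead = 1 / group_size  # one parity device per group
    return groups, fsbps, parity_overhead


# 50 devices of 200 IOPS each, evaluated for two candidate group sizes.
for size in (5, 10):
    groups, fsbps, overhead = raidz_layout(50, size, 200)
    print(f"groups of {size:2d}: {groups:2d} groups, "
          f"{fsbps} FS blocks/sec, {overhead:.0%} parity overhead")
```

Smaller groups buy more delivered FS blocks per second at the price of more parity overhead; the target Y decides how far to go.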
RAID-Z is a great technology not only when disk blocks are your most
precious resources but also when your available IOPS far exceed your
expected needs. But beware that if you get your hands on fewer very
large disks, the IOPS capacity can easily become your most precious
resource. Under those conditions, mirroring should be strongly favored
or alternatively a dynamic stripe of RAID-Z groups each made up of a
small number of devices.

Tuesday May 16, 2006
128K Suffice
I argue that a 128K I/O size is sufficient to extract the most out of
a disk, given enough concurrent I/Os.

Thursday December 15, 2005
Beware of the Performance of RW Locks
In my naive little mind a rw lock would represent a
performant scalable construct inasmuch as WRITERS do not
hold the lock for a significant amount of time. One
figures that the lock would be held for short WRITERS
times followed by concurrent execution of RW_READERS.
What I recently found out is quite probably well known to
seasoned kernel engineers but this was new to me. So I
figured it could be of interest to others.
The SETUP
So Reader/Writer locks (RW) can be used in kernel and user
level code to allow multiple READERS of, for instance, a
data structure, to access the structure while allowing
only a single WRITER at a time within the bounds of the
rwlock().
An RW lock (rwlock(9F), rwlock(3THR))
is more complex than a simple mutex. So acquiring such locks will be
more expensive. This means that if the expected hold time of a lock
is quite small (say to update or read 1 or 2 fields of a structure)
then regular mutexes can usually do that job very well. A common
programming mistake is to expect faster execution of RW locks for
those cases.
However when READ hold times need to be fairly long; then
RW locks represent an alternative construct. With those
locks we expect to have multiple READERS executing
concurrently thus leading to performant code that scales
to large numbers of threads. As I said, if WRITERS are
just quick updates to the structure, we can naively believe
that our code will scale well.
Not So
Let's see how it goes. A WRITER cannot get into the
protected code while READERS are executing protected
code. The WRITER must then wait at the door until READERS
release their hold. If the implementation of RW locks
didn't pay attention, there would be cases in which at
least one READER is always present within the protected
code and WRITERS would get starved of access. To prevent
such starvation, an RW lock must block READERS as soon as a
WRITER has requested access. But no matter, our WRITERS
will quickly update the structure and we will get
concurrent execution most of the time. Won't we?
Well, not quite. As just stated, an RW lock will block
readers as soon as a WRITER has hit the door. This means
that the construct does not allow parallel execution at
that point. Moreover the WRITER will stay at the door while
READERS are executing. So the construct stays fully
serializing from the time a WRITER hits until all current READERS
are done, followed by the WRITER's hold time.
For Instance:
- a RW_READER gets in and holds the lock a long time.  ---|
- a RW_WRITER hits the lock; is put on hold.              |
- other RW_READERS now also block.                        |
  .... time passes                                        |
- the long RW_READER releases.                 <----------|
- the RW_WRITER gets the lock; works; releases.
- other RW_READERS now work concurrently.
Pretty obvious once you think about it. So to assess the
capacity of a RW lock to allow parallel execution, one
must consider the average hold time as a READER but also
the frequency of access as a WRITER. The construct becomes
efficient and scalable to N threads if and only if:
(avg interval between writers) >> (N * avg read hold times).
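
As a back-of-the-envelope check of that condition, here is a small sketch. The ">>" is turned into a numeric margin of 10, which is an arbitrary assumption made for the example, and the function name is invented.

```python
def rwlock_scales(avg_writer_interval, n_threads, avg_read_hold, margin=10.0):
    """True when writers are rare enough for the RW lock to scale.

    Implements (avg interval between writers) >> (N * avg read hold time),
    with ">>" approximated by an assumed factor `margin`.
    All times must use the same unit (seconds here).
    """
    return avg_writer_interval >= margin * n_threads * avg_read_hold


# 28 threads holding the lock as READERS for 1 ms on average.
print(rwlock_scales(1.0, 28, 0.001))   # a writer every second
print(rwlock_scales(0.01, 28, 0.001))  # a writer every 10 ms
```

With a writer every second the condition holds comfortably; with a writer every 10 ms the lock spends most of its time serialized.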
Roundup
In the end, from a performance point of view, RW locks
should be used only when the average hold time is
significant enough to justify the use of this more
complex type of lock: for instance, calling a function of
unknown latency or issuing an I/O while holding the lock
are good candidates. But the construct will be
scalable to N threads if and only if WRITERS are very
infrequent.

Tuesday December 06, 2005
Showcasing UltraSPARC T1 with Directory Server's searches
So my friend and Sun Directory Server (DS) developer Gilles Bellaton
recently got his hands on an early access Niagara (UltraSPARC T1)
system, something akin to a Sun Fire T2000.
The chip in the system only had 7 active cores and thus 28 hardware
threads (a.k.a. strands) but we wanted to check how well it would
perform on DS. The results here are a little anecdotal: we just ran a
few quick tests with the aim to showcase Niagara but nevertheless the
results were beyond expectations.
If you consider the Throughput Engine architecture that Niagara provides
(what the Inquirer says), we can expect it to perform well
in highly multithreaded loads such as a directory search test. Since
we had limited disk space on the system the slapd instance was created
on /tmp. We realize that this is not at all proper deployment
conditions; however the nature of the test is such that we would
expect the system to operate mostly from memory (Database fully
cached). The only data that would need to go to disk on a real
deployment would be the 'access log' and this is typically not a
throughput limiting subsystem.
So we can prudently expect that a real on-disk deployment of a
read-mostly workload in which the DB can be fully cached could perform
perhaps close to our findings. This showcase test is a base search
over a tiny 1000-entry Database using a 50-thread slapd. Slapd was
not tuned in any way before the test. For simplicity, the client was
run on the same system as the server. This means that, on the one
hand, the client is consuming some CPU away from the server, but on
the other it reduces the need to run the Network adapter driver code.
All in all, this was not designed as a realistic DS test but only to
see, in a few hours of access time to the system, if DS was running
acceptably well on this new cool Hardware.
The results were obtained with Gilles' workspace of the DS 6.0
optimized build of August 29th 2005. The number of CPUs was adjusted
by creating processor sets (psrset).
| Number of strands | Searches/sec | Ratio |
|---|---|---|
| 1 | 920 | 1 X |
| 3 (1 core; 3 str/core) | 2260 | 2.45 X |
| 4 (1 core; 4 str/core) | 2650 | 2.88 X |
| 4 (4 cores; 1 str/core) | 4100 | 4.45 X |
| 14 (7 cores; 2 str/core) | 12500 | 13.59 X |
| 21 (7 cores; 3 str/core) | 16100 | 17.5 X |
| 28 (7 cores; 4 str/core) | 18200 | 19.8 X |
Those are pretty good scaling numbers straight out of the box. While
other, more realistic investigations will be produced, this test at
least showed us early that Niagara-based systems were not suffering
from any flagrant deficiencies when running DS searches.
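For readers who want to redo the arithmetic behind the Ratio column, here is a minimal sketch in C. The function names are illustrative and not part of any tool used in the test; the "efficiency" figure (speedup divided by strand count) is an extra derived number, not something reported above.

```c
#include <stdio.h>

/* Speedup of a measured rate relative to the single-strand baseline. */
double speedup(double searches_per_sec, double baseline)
{
    return searches_per_sec / baseline;
}

/* Fraction of ideal linear scaling: 1.0 means every strand adds a full
 * baseline's worth of throughput. */
double per_strand_efficiency(double searches_per_sec, double baseline,
                             int strands)
{
    return speedup(searches_per_sec, baseline) / strands;
}

/* Print speedup and efficiency for the rows of the table above. */
void print_scaling(void)
{
    const int strands[] = { 1, 3, 4, 4, 14, 21, 28 };
    const double rate[] = { 920, 2260, 2650, 4100, 12500, 16100, 18200 };
    const int rows = sizeof(strands) / sizeof(strands[0]);

    for (int i = 0; i < rows; i++)
        printf("%2d strands: %6.2fx speedup, %3.0f%% of linear\n",
               strands[i], speedup(rate[i], rate[0]),
               100.0 * per_strand_efficiency(rate[i], rate[0], strands[i]));
}
```

With all 28 strands the speedup is about 19.8, which works out to roughly 70% of perfectly linear scaling.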
[T]:
NiagaraCMT
Solaris
Sun

Wednesday November 16, 2005
ZFS to UFS Performance Comparison on Day 1
With special thanks to Chaoyue Xiong for her help in this work.
In this paper I'd like to review the performance data we have gathered
comparing this initial release of ZFS (Nov 16 2005) with the Solaris
legacy, optimized beyond reason, UFS filesystem. The data we will be
reviewing is based on 14 unit tests that were designed to stress some
specific usage patterns of filesystem operations. Working with these
well-contained usage scenarios greatly facilitates subsequent
performance engineering analysis.
Our focus was to make a fair head-to-head comparison between UFS and
ZFS, not to try to produce the biggest, meanest marketing numbers.
Since ZFS is also a volume manager, we actually compared ZFS to a
UFS/SVM combination. In cases where ZFS underperforms UFS, we wanted
to figure out why and how to improve ZFS.
We are currently focusing on data-intensive operations.
Metadata-intensive tests are being developed and we will report on
those in a later study.
Looking ahead to our results, we find that of our 12 filesystem unit
tests that were successfully run:
- ZFS outpaces UFS in 6 tests by a mean factor of 3.4
- UFS outpaces ZFS in 4 tests by a mean factor of 3.0
- ZFS equals UFS in 2 tests.
In this paper, we will take a closer look at the tests where UFS
is ahead and try to make propositions toward improving those numbers.
THE SYSTEM UNDER TEST
Our testbed is a hefty V890 with 8 x 1200 MHz US-IV CPUs (16 cores). At
this point we are not yet monitoring the CPU utilization of the
different tests, although we plan to do so in the future. The storage
is an insanely large 300-disk array; the disks are rather old
technology, small and slow 9 GB disks. None of the tests currently
stresses the array very much; the idea was mostly to take
the storage configuration out of the equation. Since we are working
with old technology disks, the absolute throughput numbers are not
necessarily of interest; they are presented in an appendix.
Every disk in our configuration is partitioned into 2 slices and a
simple SVM or zpool striped volume is made across all spindles. We
then build a filesystem on top of the volume. All commands are run
with default parameters. Both filesystems are mounted and we can run
our test suite on either one.
Every test is rerun multiple times in succession; the tests are
defined and developed to avoid variability between instances. Some of
the current test definitions require that file data not be present in
the filesystem cache. Since we currently do not have a convenient way
to control this for ZFS, the results for those tests are omitted from
this report.
THE FILESYSTEM UNIT TESTS
Here are the definitions of the 14 data-intensive tests we have
currently identified. Note that we are very open to new test
definitions; if you know of a data-intensive application that uses a
filesystem in a very different pattern, and there must be tons of
them, we would dearly like to hear from you.
Test 1
This is the simplest way to create a file; we open/creat a file then
issue 1MB writes until the filesize reaches 128 MB; we then close the file.
Test 2
In this test, we also create a new file, although here we work with a
file opened with the O_DSYNC flag. We work with 128K write system
calls. This maps to some database file creation schemes.
Test 3
This test is also related to file creation but with writes that are
much smaller and of varying sizes. In this test, we create a 50MB file
using writes of size picked randomly in [1K,8K]. The file is opened
with default flags (no O_*SYNC) but every 10 MB of written data we
issue an fsync() call for the whole file. This form of access can be
used for log files that have data integrity requirements.
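A sketch of the Test 3 access pattern in C, under stated assumptions: the helper name, the seed parameter and the use of random() are illustrative choices, not details of the real test program.

```c
#define _XOPEN_SOURCE 700   /* ask for POSIX declarations: fsync, random */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Create `path` with writes whose sizes are drawn uniformly from
 * [min_write, max_write], calling fsync() each time another `sync_every`
 * bytes have been written. Requires 0 < min_write <= max_write <= sync_every.
 * Returns the number of fsync() calls issued, or -1 on error.
 */
int create_file_with_fsync(const char *path, size_t file_size,
                           size_t min_write, size_t max_write,
                           size_t sync_every, unsigned seed)
{
    if (min_write == 0 || min_write > max_write || max_write > sync_every)
        return -1;

    char *buf = malloc(max_write);
    if (buf == NULL)
        return -1;
    memset(buf, 'a', max_write);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    srandom(seed);
    int syncs = 0;
    size_t written = 0, next_sync = sync_every;
    while (written < file_size) {
        size_t len = min_write + (size_t)random() % (max_write - min_write + 1);
        if (len > file_size - written)
            len = file_size - written;      /* last write may be short */
        ssize_t n = write(fd, buf, len);
        if (n <= 0) {
            syncs = -1;
            break;
        }
        written += (size_t)n;
        if (written >= next_sync) {         /* another sync_every bytes done */
            if (fsync(fd) != 0) {
                syncs = -1;
                break;
            }
            next_sync += sync_every;
            syncs++;
        }
    }
    close(fd);
    free(buf);
    return syncs;
}
```

With the sizes given in the test definition (50 MB file, 1K to 8K writes, fsync() every 10 MB) this loop would issue 5 fsync() calls.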
Test 4
Moving now to a read test; we read a 1 GB file (assumed in cache) with
32K read system calls. This is a rather simple test to keep everybody
honest.
Test 5
This is the same test as Test 4 but the file is assumed not to be
present in the filesystem cache. We currently have no control over
this on ZFS and so we will not be reporting performance numbers for
this test. This is a basic streaming read sequence that should test
the readahead capacity of a filesystem.
Test 6
Our previous write tests were allocating writes. In this test we
verify the ability of a filesystem to rewrite over an existing file.
We look at 32K writes to a file opened with O_DSYNC.
Test 7
Here we also test the ability to rewrite existing files. The sizes are
randomly picked in the [1K,8K] range. No special control over data
integrity (no O_*SYNC, no fsync()).
Test 8
In this test we create a very large file (10 GB) with 1MB writes
followed by 2 full-pass sequential reads. This test is still evolving
but we want to verify the ability of the filesystem to work with files
whose size is close to or larger than available free memory.
Test 9
In this test, we issue 8K writes at random 8K aligned offsets in a 1 GB
file. When 128 MB of data is written we issue an fsync().
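A sketch of the Test 9 pattern in C, assuming the target file already exists. The names are illustrative and the use of random() for offset selection is an assumption; the real test may pick offsets differently.

```c
#define _XOPEN_SOURCE 700   /* ask for POSIX declarations: pwrite, fsync, random */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Pick a random offset in [0, file_size) that is a multiple of `block`.
 * Assumes file_size >= block > 0. */
off_t random_aligned_offset(off_t file_size, size_t block)
{
    off_t nblocks = file_size / (off_t)block;
    return (off_t)(random() % nblocks) * (off_t)block;
}

/*
 * Overwrite `total_bytes` worth of `block`-sized chunks at random aligned
 * offsets of an existing file, then fsync() once at the end.
 * Returns 0 on success, -1 on error.
 */
int random_aligned_writes(const char *path, off_t file_size, size_t block,
                          size_t total_bytes, unsigned seed)
{
    if (block == 0 || file_size < (off_t)block)
        return -1;

    char *buf = malloc(block);
    if (buf == NULL)
        return -1;
    memset(buf, 'b', block);

    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    srandom(seed);
    int rc = 0;
    for (size_t done = 0; done < total_bytes; done += block) {
        off_t off = random_aligned_offset(file_size, block);
        if (pwrite(fd, buf, block, off) != (ssize_t)block) {
            rc = -1;
            break;
        }
    }
    if (rc == 0 && fsync(fd) != 0)
        rc = -1;
    if (close(fd) != 0)
        rc = -1;
    free(buf);
    return rc;
}
```

Test 9 as described would use a 1 GB file, an 8K block and 128 MB of writes. Because every offset is block-aligned and inside the file, the file size never changes.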
Test 10
Here, we issue 2K writes at random (unaligned) offsets to a file
opened O_DSYNC.
Test 11
Same test as 10 but using 4 cooperating threads all working on a
single file.
Test 12
Here we attempt to simulate a mixed read/write pattern. Working with
an existing file, we loop through a pattern of 3 reads at 3 randomly
selected 8K aligned offsets followed by an 8K write to the last read
block.
Test 13
In this test we issue 2K pread() calls (to a random unaligned
offset). The file is asserted to not be in the cache. Since we
currently have no such control, we won't report data for this test.
Test 14
We have 4 cooperating threads (working on a single file) issuing 2K
pread() calls to random unaligned offsets. The file is present in the
cache.
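A sketch of the Test 14 pattern using POSIX threads, under stated assumptions: all names are illustrative, the per-thread seeds are arbitrary, and rand_r() limits offsets to RAND_MAX, which is assumed to be at least as large as the file. Build with the platform's threads option (for example -pthread) where required.

```c
#define _XOPEN_SOURCE 700   /* ask for POSIX declarations: pread, rand_r */
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define MAX_READERS 64

struct reader {
    int fd;               /* shared descriptor; pread() ignores its offset */
    off_t file_size;
    size_t read_size;     /* 2K in Test 14 */
    size_t bytes_wanted;  /* amount each thread reads before stopping */
    unsigned seed;        /* per-thread state for rand_r() */
    size_t bytes_read;    /* filled in by the thread */
};

static void *reader_main(void *arg)
{
    struct reader *r = arg;
    char *buf = malloc(r->read_size);
    off_t last_start = r->file_size - (off_t)r->read_size;

    while (buf != NULL && r->bytes_read < r->bytes_wanted) {
        /* Random, unaligned offset that keeps the read inside the file. */
        off_t off = (off_t)rand_r(&r->seed) % (last_start + 1);
        ssize_t n = pread(r->fd, buf, r->read_size, off);
        if (n <= 0)
            break;
        r->bytes_read += (size_t)n;
    }
    free(buf);
    return NULL;
}

/* Returns the total number of bytes read by all threads, or -1 on error. */
long shared_file_readers(const char *path, int nthreads,
                         size_t read_size, size_t bytes_per_thread)
{
    struct reader readers[MAX_READERS];
    pthread_t tids[MAX_READERS];
    struct stat st;

    if (nthreads < 1 || nthreads > MAX_READERS || read_size == 0)
        return -1;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size < (off_t)read_size) {
        close(fd);
        return -1;
    }

    int started = 0;
    for (; started < nthreads; started++) {
        readers[started] = (struct reader){ fd, st.st_size, read_size,
                                            bytes_per_thread,
                                            (unsigned)started + 1, 0 };
        if (pthread_create(&tids[started], NULL, reader_main,
                           &readers[started]) != 0)
            break;
    }

    long total = 0;
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        total += (long)readers[i].bytes_read;
    }
    close(fd);
    return started == nthreads ? total : -1;
}
```

All threads share one file descriptor, which is safe here because pread() takes an explicit offset and does not touch the descriptor's file position.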
THE RESULTS
We have a common testing framework to generate the performance data.
Each test is written as a simple C program and the framework is
responsible for creating threads, files, timing the runs and
reporting. We are currently discussing merging this test framework
with the Filebench suite. We regret that we cannot easily share the
test code; however, the above descriptions should be sufficiently
precise to allow someone to reproduce our data. In my mind a simple
10 to 20 disk array and any small server should be enough to generate
similar numbers. If anyone finds very different results, I would be
very interested in knowing about it.
Our framework reports all timing results as a throughput
measure. Absolute values of throughput are highly test case dependent.
A 2K O_DSYNC write will not have the same throughput as a 1MB cached
read. Some tests would be better described in terms of operations per
second. However, since our focus is a relative ZFS to UFS/SVM
comparison, we will focus here on the delta in throughput between the
2 filesystems (for the curious, the full throughput data is posted in
the appendix).
Drumroll....
| Task ID | Description | Winning FS / Performance Delta |
|---|---|---|
| 1 | open() and allocation of a 128.00 MB file with write(1024K) then close(). | ZFS / 3.4X |
| 2 | open(O_DSYNC) and allocation of a 5.00 MB file with write(128K) then close(). | ZFS / 5.3X |
| 3 | open() and allocation of a 50.00 MB file with write() of size picked uniformly in [1K,8K] issuing fsync() every 10.00 MB | UFS / 1.8X |
| 4 | Sequential read(32K) of a 1024.00 MB file, cached. | ZFS / 1.1X |
| 5 | Sequential read(32K) of a 1024.00 MB file, uncached. | no data |
| 6 | Sequential rewrite(32K) of a 10.00 MB file, O_DSYNC, uncached | ZFS / 2.6X |
| 7 | Sequential rewrite() of a 1000.00 MB cached file, size picked uniformly in the [1K,8K] range, then close(). | UFS / 1.3X |
| 8 | create a file of size 1/2 of freemem using write(1MB) followed by 2 full-pass sequential read(1MB). No special cache manipulation. | ZFS / 2.3X |
| 9 | 128.00 MB worth of random 8K-aligned write to a 1024.00 MB file; followed by fsync(); cached. | UFS / 2.3X |
| 10 | 1.00 MB worth of 2K write to 100.00 MB file, O_DSYNC, random offset, cached. | draw (UFS == ZFS) |
| 11 | 1.00 MB worth of 2K write to 100.00 MB file, O_DSYNC, random offset, uncached. 4 cooperating threads each writing 1 MB | ZFS / 5.8X |
| 12 | 128.00 MB worth of 8K aligned read&write to 1024.00 MB file, pattern of 3 X read, then write to last read page, random offset, cached. | draw (UFS == ZFS) |
| 13 | 5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, uncached | no data |
| 14 | 5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, cached. 4 threads. | UFS / 6.9X |
As stated in the abstract
- ZFS outpaces UFS in 6 tests by a mean factor of 3.4
- UFS outpaces ZFS in 4 tests by a mean factor of 3.0
- ZFS equals UFS in 2 tests.
The performance differences can be sizable; let's have a closer look
at some of them.
PERFORMANCE DEBRIEF
Let's look at each test to try and understand the cause of the
performance differences.
Test 1 (ZFS 3.4X)
open() and allocation of a
128.00 MB file with
write(1024K) then close().
This test is not fully analyzed. We note that in this situation UFS
will regularly kick off some I/O from the context of the write system
call. This would occur whenever a cluster of writes (typically of
size 128K or 1MB) has completed. The initiation of I/O by UFS slows
down the process. On the other hand ZFS can zoom through the test at
a rate much closer to a memcopy. The ZFS I/Os to disks are actually
generated internally by the ZFS transaction group mechanism: every few
seconds a transaction group will come and flush the dirty data to disk
and this occurs without throttling the test.
Test 2 (ZFS 5.3X)
open(O_DSYNC) and
allocation of a
5.00 MB file with
write(128K) then close().
Here ZFS shows an even bigger advantage. Because of its design and
complexity, UFS is actually somewhat limited in its capacity to
write-allocate files in O_DSYNC mode. Every new UFS write requires some
disk block allocation, which must occur one block at a time when
O_DSYNC is set. ZFS can easily outperform UFS for this test.
Test 3 (UFS 1.8X)
open() and allocation of a
50.00 MB file with write() of
size picked uniformly in
[1K,8K] issuing fsync()
every 10.00 MB
Here ZFS pays for the advantage it had in Test 1. In this test, we issue
very many writes to a file. Those are cached as the process is racing
along. When the fsync() hits (every 10 MB of outstanding data per the
test definition) the FS must now guarantee that all the data is sent to
stable storage. Since UFS kicks off I/O more regularly, when the
fsync() hits UFS has a smaller amount of data left to sync up. What
saves the day for ZFS is that, for that leftover data, UFS slows down
to a crawl. ZFS, on the other hand, has accumulated a large amount of
data in the cache by the time the fsync() hits. Fortunately ZFS is
able to issue much larger I/Os to disk and makes up some of the lag
that has built up. But the final result shows that UFS wins the horse
race (at least in this specific test); details of the test will
influence the final result here.
However the ZFS team is working on ways to make fsync() much
better. We actually have 2 possible avenues of improvement. We can
borrow from the UFS behavior and kick off some I/Os when too much
outstanding data is cached. UFS does this at a very regular interval,
which does not look right either. But clearly if a file has many MB
of outstanding dirty data, sending them off to disk might be
beneficial. On the other hand, keeping the data in cache is
interesting when the pattern of writing is such that the same file
offsets are written and re-written over and over again. Sending the
data to disk is wasteful if the data is rewritten shortly
after. Basically the FS must place a bet on whether a future fsync()
will occur before a new write to the block. We cannot win this bet
on all tests all the time.
Given that fsync() performance is important, I would like to see us
asynchronously kick off I/O when we reach many MB of outstanding
data to a file. This is nevertheless debatable.
Even if we don't do this, we have another area of improvement that the
ZFS team is looking into. When the fsync() finally hits the fan, even
with a lot of outstanding data, the current implementation does not
issue disk I/Os very efficiently. The proper way to do this is to
kick off all required I/Os and then wait for them all to complete.
Currently, in the intricacies of the code, some I/Os are issued and
waited upon one after the other. This is not yet optimal but we
certainly should see improvements coming in the future and I truly
expect ZFS fsync() performance to be ahead all the time.
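The difference between the two strategies can be illustrated with a toy latency model in C. This is a deliberate simplification, not ZFS code: it assumes the I/Os are independent and that the storage can service them concurrently, and the function names are made up for the illustration.

```c
#include <stddef.h>

/* Issue one I/O, wait for it, then issue the next: latencies add up. */
double one_at_a_time_ms(const double *latency_ms, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += latency_ms[i];
    return total;
}

/* Issue every I/O first, then wait for all of them: under the stated
 * assumptions the slowest I/O determines the total wait. */
double issue_all_then_wait_ms(const double *latency_ms, size_t n)
{
    double slowest = 0.0;
    for (size_t i = 0; i < n; i++)
        if (latency_ms[i] > slowest)
            slowest = latency_ms[i];
    return slowest;
}
```

For four I/Os taking 5, 8, 6 and 7 ms (made-up figures), the first approach waits 26 ms in total while the second waits 8 ms.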
Test 4 (ZFS 1.1X)
Sequential read(32K) of a 1024.00
MB file, cached.
Rather simple test, mostly close to memcopy speed between the
Filesystem cache and the user buffer. Contest is almost a wash with
ZFS slightly on top. Not yet analyzed.
Test 5 (N/A)
Sequential read(32K) of a 1024.00
MB file, uncached.
No results due to lack of control on the ZFS file level caching.
Test 6 (ZFS 2.6X)
Sequential rewrite(32K) of a
10.00 MB file, O_DSYNC,
uncached
Due to the WAFL-like (Write Anywhere File Layout) design of ZFS, a
rewrite is not very different from an initial write and ZFS seems to
perform very well on this test. Presumably UFS performance is hindered
by the need to synchronize the cached data. Result not yet analyzed.
Test 7 (UFS 1.3X)
Sequential rewrite() of a 1000.00
MB cached file, size picked
uniformly in the [1K,8K]
range, then close().
In this test we are not timing any of the disk I/O. This is merely a
test about unrolling the filesystem code for 1K to 8K cached writes.
The UFS codepath wins in simplicity and years of performance tuning.
The ZFS codepath here somewhat suffers from its youth. Understandably,
the current ZFS implementation is very well layered and we can easily
imagine that the locking strategies of the different layers are
independent of one another. We have found (thanks, dtrace) that a small
ZFS cached write uses about 3 times as many lock acquisitions as
an equivalent UFS call. Mutex rationalization within or between
layers certainly seems to be an area of potential improvement for ZFS
that would help this particular test. We also realised that the very
clean and layered code implementation causes the callstack to
take very many elevator rides up and down between layers. On a SPARC
CPU, going up and down 6 or 7 layers deep in the callstack causes a
spill/fill trap, plus one additional trap for every additional floor
travelled. Fortunately there are very many areas where ZFS will be
able to merge different functions into a single one or possibly exploit
the technique of tail calls to regain some of the lost performance.
All in all, we find that the performance difference is small enough
not to be worrisome at this point, especially in view of the possible
improvements we have already identified.
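As a generic illustration of what a tail call looks like in C (the function names are hypothetical and this is not ZFS code): C does not guarantee tail-call elimination, so whether the second form actually avoids a deeper call stack depends on the compiler and optimization level.

```c
/* Lowest layer: does the actual work (here, a stand-in computation). */
int bottom_layer(int x)
{
    return x + 1;
}

/* Not a tail call: the result is modified after the callee returns, so
 * this layer's frame must stay live across the call. */
int middle_layer_plain(int x)
{
    int r = bottom_layer(x);
    return r * 2;
}

/* Tail call: the callee's result is returned unchanged, so an optimizing
 * compiler is free to reuse this frame and jump to bottom_layer(). */
int middle_layer_tail(int x)
{
    return bottom_layer(x * 2);
}
```

The two middle layers compute different things on purpose; the point is only the position of the call. In the second form nothing remains to be done after the call, which is what makes frame reuse possible.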
Test 8 (ZFS 2.3X)
create a file of size 1/2 of
freemem using write(1MB)
followed by 2 full-pass
sequential read(1MB). No
special cache manipulation.
This test needs to be analyzed further. We note that UFS will
proactively freebehind read blocks. While this is a very responsible
use of memory (give it back after use), it potentially impacts the
UFS re-read performance. While we're happy to see ZFS performance on
top, some investigation is warranted to make sure that ZFS does not
overconsume memory in some situations.
Test 9 (UFS 2.3X)
128.00 MB worth of random 8
K-aligned write to a
1024.00 MB file; followed
by fsync(); cached.
In this test we expect a rationale similar to the one of Test 3 to take
effect. The same cure should also apply.
Test 10 (draw)
1.00 MB worth of 2K write to
100.00 MB file, O_DSYNC,
random offset, cached.
Both FS must issue and wait for a 2K I/O on each write. They both do
this as efficiently as possible.
Test 11 (ZFS 5.8X)
1.00 MB worth of 2K write to
100.00 MB file, O_DSYNC,
random offset, uncached. 4
cooperating threads each
writing 1 MB
This test is similar to the previous one except for the 4 cooperating
threads. ZFS being on top highlights a key feature of ZFS, the lack of a
single writer lock. UFS can only allow a single write thread working
per file. The only exception is when directio is enabled, and then
only under rather restrictive conditions. UFS with directio would allow
concurrent writers with the implied restriction that it does not honor
full POSIX semantics regarding write atomicity. ZFS, out of the box,
is able to allow concurrent writers without requiring any special
setup or giving up full POSIX semantics. All great news for
simplicity of deployment and great database performance.
Test 12 (draw)
128.00 MB worth of 8K aligned
read&write to 1024.00 MB
file, pattern of 3 X read,
then write to last read
page, random offset,
cached.
Both filesystems perform appropriately. Test still requires analysis.
Test 13 (N/A)
5.00 MB worth of pread(2K) per
thread within a shared
1024.00 MB file, random
offset, uncached
No results due to lack of control on the ZFS file level caching.
Test 14 (UFS 6.9X)
5.00 MB worth of pread(2K) per
thread within a shared
1024.00 MB file, random
offset, cached 4 threads.
This test inexplicably shows UFS on top. The UFS code can perform
rather well given that the FS cache is stored in the page cache.
Servicing reads from cache can be made very scalable. We are just
starting our analysis of the performance characteristics of ZFS for
this test. We have identified some serialization constructs in the
buffer management code, where we find that reclaiming the buffers into
which to put the cached data is acting as a serial throttle. This is
truly the only test where the ZFS performance disappoints, although
there is no doubt that we will find a cure for this
implementation issue.
THE TAKEAWAY
ZFS is on top in very many of our tests, often by a significant
factor. Where UFS is ahead, we have a clear view of how to improve the
ZFS implementation. The case of shared readers to a single file will
be the test that requires special attention.
Given the youth of the ZFS implementation, the performance outline
presented in this paper shows that the ZFS design decisions are totally
validated from a performance perspective.
FUTURE DIRECTIONS
Clearly, we should now expand the unit test coverage. We would like
to study more metadata-intensive workloads. We would also like to see
how ZFS features such as compression and RaidZ perform. Other
interesting studies could focus on CPU consumption and memory
efficiency. We also need to find a solution for running the existing
unit tests that require the files to not be cached in the filesystem.
APPENDIX/ THROUGHPUT MEASURE
Here are the raw throughput measures for each of the 14 unit tests.
| Task ID | Description | ZFS latest+nv25 (MB/s) | UFS+nv25 (MB/s) | Delta |
|---|---|---|---|---|
| 1 | open() and allocation of a 128.00 MB file with write(1024K) then close(). | 486.01572 | 145.94098 | ZFS 3.4X |
| 2 | open(O_DSYNC) and allocation of a 5.00 MB file with write(128K) then close(). | 4.5637 | 0.86565 | ZFS 5.3X |
| 3 | open() and allocation of a 50.00 MB file with write() of size picked uniformly in [1K,8K] issuing fsync() every 10.00 MB | 27.3327 | 50.09027 | UFS 1.8X |
| 4 | Sequential read(32K) of a 1024.00 MB file, cached. | 674.77396 | 612.92737 | ZFS 1.1X |
| 5 | Sequential read(32K) of a 1024.00 MB file, uncached. | 1756.57637 | 17.53705 | N/A |
| 6 | Sequential rewrite(32K) of a 10.00 MB file, O_DSYNC, uncached | 2.20641 | 0.85497 | ZFS 2.6X |
| 7 | Sequential rewrite() of a 1000.00 MB cached file, size picked uniformly in the [1K,8K] range, then close(). | 204.31557 | 257.22829 | UFS 1.3X |
| 8 | create a file of size 1/2 of freemem using write(1MB) followed by 2 full-pass sequential read(1MB). No special cache manipulation. | 698.18182 | 298.25243 | ZFS 2.3X |
| 9 | 128.00 MB worth of random 8K-aligned write to a 1024.00 MB file; followed by fsync(); cached. | 42.75208 | 100.35258 | UFS 2.3X |
| 10 | 1.00 MB worth of 2K write to 100.00 MB file, O_DSYNC, random offset, cached. | 0.117925 | 0.116375 | draw |
| 11 | 1.00 MB worth of 2K write to 100.00 MB file, O_DSYNC, random offset, uncached. 4 cooperating threads each writing 1 MB | 0.42673 | 0.07391 | ZFS 5.8X |
| 12 | 128.00 MB worth of 8K aligned read&write to 1024.00 MB file, pattern of 3 X read, then write to last read page, random offset, cached. | 264.84151 | 266.78044 | draw |
| 13 | 5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, uncached | 75.98432 | 0.11684 | N/A |
| 14 | 5.00 MB worth of pread(2K) per thread within a shared 1024.00 MB file, random offset, cached. 4 threads. | 56.38486 | 386.70305 | UFS 6.9X |
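The winner and delta quoted for each task can be recomputed from the two raw throughput columns above. Here is a minimal sketch in C; the names are illustrative, and the 5% threshold used to call a result a draw is an assumption of this sketch, since the study does not state how draws were decided.

```c
#include <stdio.h>

struct delta {
    const char *winner;   /* "ZFS", "UFS" or "draw" */
    double factor;        /* faster throughput divided by slower */
};

/*
 * Compare two positive throughput figures (MB/s). Results within
 * `draw_tolerance` of each other (0.05 = 5%) count as a draw; that
 * threshold is an assumption, not a rule taken from the study.
 */
struct delta compare_throughput(double zfs_mbps, double ufs_mbps,
                                double draw_tolerance)
{
    struct delta d;
    if (zfs_mbps >= ufs_mbps) {
        d.winner = "ZFS";
        d.factor = zfs_mbps / ufs_mbps;
    } else {
        d.winner = "UFS";
        d.factor = ufs_mbps / zfs_mbps;
    }
    if (d.factor <= 1.0 + draw_tolerance)
        d.winner = "draw";
    return d;
}

/* Print one line in the style of the results table. */
void print_delta(int task_id, double zfs_mbps, double ufs_mbps)
{
    struct delta d = compare_throughput(zfs_mbps, ufs_mbps, 0.05);
    printf("Task %2d: %s (%.1fX)\n", task_id, d.winner, d.factor);
}
```

For example, the Task 2 figures (4.5637 and 0.86565 MB/s) give a factor of about 5.3 in favor of ZFS, and the Task 10 figures differ by less than 2%, which this sketch reports as a draw.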
OpenSolaris,
ZFS

Monday June 13, 2005
Bonjour Monde
That's "Hello World" in French but one wouldn't say it that way anyway.
Maybe one would say "Bonjour tout le monde" meaning "Hello All", which you
might say, for example, on entering a room full of people (especially if, like me,
you don't care much about greeting everyone individually). So that's your first
hint. I'm a geeky sociopath more likely to communicate through a weblog than in real life.
The next hint is that I master the French language, as you might expect
from someone who lives in France. However, I've lived in France for only about
15 years and that should allow you to guess that I was not born in France (French
laws prohibit child labor). And I've been working for Sun since 1997.
The reason I master the French language is probably because both my
parents spoke no other language. That was in Quebec, a part of Canada filled
with people who speak English with a funny semi-French accent. So bear with me: my
writing will also have this accent.
In summary: Canadian, lives in France, has worked for Sun for 8 years.
I do performance engineering, which to me means: I take a performance number, I explain why it is
what it is, and propose what needs to be done to improve it.
Welcome to my blog, and your name is ?