ZFS and Databases
Now that we have a reasonable understanding of the internal
dynamics of ZFS,
I figured it was time to tackle the next hurdle: running a database
management system (DBMS). I know very little about DBMS myself,
so I teamed up with people who have tons of experience with them: my
colleagues from Performance Engineering (PAE), Neelakanth (Neel) Nagdir and
Sriram Gummuluru, with occasional words of wisdom from Jim Mauro as
well.
Note that UFS (with DIO) has been heavily tuned over the years to
provide very good support for DBMS. We are just beginning to explore
the tweaks and tunings necessary to achieve comparable performance
from ZFS in this specialized domain.
We knew that running a DBMS would be a challenge, since a database
tickles filesystems in ways that are quite different from other types
of loads. We had two goals. Primarily, we needed to understand how
ZFS performs in a DB environment and in which specific areas it needs to
improve. Secondly, we figured that whatever came out of the
work could be used as blog material as well as best-practice
recommendations. You're reading the blog material now; also watch this
space for Best Practice updates.
Note that it was not a goal of this exercise to generate data for a world record press-release. (There is always a metric where this can be achieved.)
Workload
The workload we use in PAE to characterize DBMSes is called OLTP/Net.
This benchmark was developed inside Sun for the purpose of
engineering performance into DBMS. Modeled on common transaction
processing benchmarks, it is OLTP-like but with a higher network-to-disk ratio. This makes it more representative of real-world applications. Quoting from Neel's prose:
"OLTP/Net, the New-Order transaction involves multi-hops as it
performs Item validation, and inserts a single item per hop as
opposed to block updates "
I hope that means something to you; Neel will be blogging on his own,
if you need more info.
Reference Point
The reference performance point for this work would be UFS (with VxFS
being also an interesting data point, but I'm not tasked with
improving that metric). For DB loads we know that UFS directio (DIO)
provides a significant performance boost and that would be our target
as well.
Platform & Configuration
Our platform was a Niagara T2000 (8 cores @ 1.2GHz, 4 HW threads or
strands per core) with 130 36GB disks attached in JBOD
fashion. Each disk was partitioned into 2 equal slices, with one half of
the surface given to a Solaris Volume Manager (SVM) volume on which UFS
was built and the other half given to a ZFS pool.
The benchmark was designed to not fully saturate either the CPU or the disks. While we know that performance varies between inner & outer disk surface, we don't expect the effect to be large enough to require attention here.
Write Cache Enabled (WCE)
ZFS is designed to work safely, whether or not a disk write-cache is enabled (WCE). This stays true if ZFS is operating on a disk slice. However,
when given a full disk, ZFS will turn _ON_ the write cache as part of
the import sequence. That is, it won't enable write cache when given only a
slice. So, to be fair to ZFS capabilities we manually turned ON WCE when
running our test over ZFS.
UFS is not designed to work with WCE and will put data at risk if WCE
is set, so we needed to turn it off for the UFS runs. We had to toggle
the setting by hand because we did not have enough disks to
give each filesystem its own full disks. The performance we measured is
therefore what would be expected when giving full disks to either
filesystem. We note
that, for the FC devices we used, WCE does not provide ZFS a
significant performance boost on this setup.
No Redundancy
For this initial effort we also did not configure any form of
redundancy for either filesystem. ZFS RAID-Z does not really have an
equivalent feature in UFS, so we settled on a simple stripe. We could
eventually configure software mirroring on both filesystems, but we
don't expect that to change our conclusions. It will still be
interesting follow-up work.
DBMS logging
Another thing we already know is that a DBMS's log writer latency is
critical to OLTP performance. To improve on that metric,
it's good practice to set aside a number of disks for the DBMS
logs. With this in hand, we managed to run our benchmark and get our
target performance number (in relative terms, higher is better):
UFS/DIO/SVM : 42.5
Separate Data/log volumes
Recordsize
OK, so now we're ready. We load up Solaris 10 Update 2 (S10U2/ZFS),
build a log pool and a
data pool and get going. Note that log writers actually generate a pattern
of sequential I/O of varying sizes. That should map quite well to
ZFS out of the box. But for the DBMS data pool, we expect a very
random pattern of reads and writes to DB records. A commonly known ZFS
best practice when servicing fixed-record access is to match the ZFS
recordsize property to the application's record size. We note that UFS, by
chance or by design, also works (at least on SPARC) using 8K records.
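To see why matching the recordsize matters, here is a minimal back-of-the-envelope sketch. It assumes the stock ZFS default recordsize of 128K (a figure not quoted in this post) and an 8K database block, and it only counts whole records touched by one aligned I/O. The function names are made up for illustration; this is arithmetic, not ZFS code.

```python
# Illustrative arithmetic only: how much data a record-based filesystem
# has to handle when the application reads or updates one database block.

DEFAULT_RECORDSIZE = 128 * 1024  # assumption: ZFS default recordsize (128K)
DB_BLOCK = 8 * 1024              # 8K database block, as used in this post


def bytes_touched(io_size: int, recordsize: int) -> int:
    """Bytes of whole records covered by one aligned I/O of io_size bytes."""
    records = -(-io_size // recordsize)  # ceiling division
    return records * recordsize


def amplification(io_size: int, recordsize: int) -> float:
    """Ratio of bytes the filesystem handles to bytes the application asked for."""
    return bytes_touched(io_size, recordsize) / io_size


if __name__ == "__main__":
    for rs in (DEFAULT_RECORDSIZE, DB_BLOCK):
        print(f"recordsize={rs // 1024}K -> {amplification(DB_BLOCK, rs):.0f}x")
```

Under these assumptions an 8K access against a 128K record handles 16 times more data than the database asked for, while an 8K recordsize brings the ratio down to 1.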
2nd run ZFS/S10U2
So for a fair comparison, we set the recordsize to 8K for the data
pool and run our OLTP/Net and....gasp!:
ZFS/S10U2 : 11.0
Data pool (8K record on FS)
Log pool (no tuning)
So that's no good and we have our work cut out for us.
The role of Prefetch in this result
To some extent we already knew of a subsystem that commonly misbehaves (and which is being fixed as we speak): the vdev-level prefetch code (which I also
refer to as the software track buffer). In this code, whenever ZFS
issues a small read I/O to a device, it will, by default, go and fetch
quite a sizable chunk of data (64K) located at the physical location
being read. In itself, this should not increase the I/O latency, which
is dominated by the head seek, and since the data is stored in a small
fixed-size buffer we don't expect this to eat up too much memory
either. However, in a heavy-duty environment like we have here, every
extra byte that moves up or down the data channel occupies valuable
space. Moreover, for a large DB, we really don't expect the
speculatively read data to be used very much. So for our next attempt
we'll tune down the prefetch buffer to 8K.
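The channel cost of the 64K prefetch can be put into rough numbers. The sketch below assumes a simplified model in which every small read is inflated to at least the prefetch size; the rate of 10,000 reads per second is a hypothetical figure chosen for illustration, not something we measured.

```python
# Rough model: a small device read is inflated to the prefetch size,
# so the channel carries max(request, prefetch) bytes per read.

KIB = 1024
MIB = 1024 * 1024


def bytes_moved(request_size: int, prefetch_size: int) -> int:
    """Bytes transferred from the device for one read under this model."""
    return max(request_size, prefetch_size)


def channel_mib_per_sec(reads_per_sec: int, request_size: int, prefetch_size: int) -> float:
    """Aggregate channel traffic in MiB/s for a given read rate."""
    return reads_per_sec * bytes_moved(request_size, prefetch_size) / MIB


if __name__ == "__main__":
    rate = 10_000  # hypothetical aggregate read rate, for illustration only
    for prefetch in (64 * KIB, 8 * KIB):
        traffic = channel_mib_per_sec(rate, 8 * KIB, prefetch)
        print(f"prefetch={prefetch // KIB}K -> {traffic:.1f} MiB/s")
```

With 8K reads, the 64K setting moves 8 times as many bytes as the 8K setting for the same number of I/Os, which is the overhead we want off the data channel.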
And the role of the vq_max_pending parameter
But we don't expect this to be quite sufficient here. My DBMS-savvy
friends told me that the I/O latency of reads was quite large in
our runs. Now, ZFS prioritizes reads over writes, so we thought we
should be OK. However, during a pool transaction group sync, ZFS will
issue quite a number of concurrent writes to each device. This number is
controlled by the vq_max_pending parameter, which defaults to 35. Clearly,
during this phase a read, even if prioritized, will take somewhat longer
to complete.
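A crude way to picture the effect: once the writes have been handed to the device, a newly arrived read can end up waiting behind all of them. The sketch below assumes the worst case, with the device servicing its outstanding I/Os one at a time and the read served last. The 5 ms service time is a hypothetical figure, not a measurement from our runs.

```python
# Worst-case toy model: a read arrives while `pending` writes are already
# outstanding at the device and is serviced only after all of them.

def worst_case_read_wait_ms(pending: int, service_time_ms: float) -> float:
    """Extra time a read may wait behind already-issued I/Os, in milliseconds."""
    if pending < 0 or service_time_ms < 0:
        raise ValueError("pending and service_time_ms must be non-negative")
    return pending * service_time_ms


if __name__ == "__main__":
    service_ms = 5.0  # hypothetical per-I/O service time
    for pending in (35, 10):
        wait = worst_case_read_wait_ms(pending, service_ms)
        print(f"vq_max_pending={pending} -> up to {wait:.0f} ms extra read wait")
```

Real devices reorder and overlap requests, so actual numbers will differ, but the model shows why lowering the number of outstanding I/Os bounds the read latency more tightly.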
3rd run, ZFS/S10U2 - tuned
So I wrote up a script to tune those 2 ZFS knobs. We could then run
with a vdev prefetch buffer of 8K and a vq_max_pending of 10. This
boosted our performance almost 2X:
ZFS/S10U2 : 22.0
Data pool (8K record on FS)
Log pool (no tuning)
vq_max_pending : 10
vdev prefetch : 8K
But not quite satisfying yet.
ZFS/S10U2 known bug
We know of something else about ZFS. In the last few builds before
S10U2, a little bug made its way into the code base. The effect of
this bug was that, for a full record rewrite, ZFS would actually read in
the old block even though the data is not needed at all.
That shouldn't be too bad: perfectly aligned block rewrites of uncached
data are not that common... except for databases. Bummer.
So S10U2 is plagued with this issue, which affects DB performance and has no
workaround. Our next step was therefore to move on to the latest ZFS bits.
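The distinction at stake can be sketched in a few lines. This is only an illustration of the rule "a write that covers whole records does not need the old data"; the function name is made up, and it ignores caching, end-of-file handling and everything else the real code has to deal with.

```python
# Illustration only: does a write need the old record contents?
# A write that starts and ends on record boundaries replaces whole records,
# so nothing from the old blocks has to be read in first.

def needs_old_data(offset: int, length: int, recordsize: int) -> bool:
    """True if the write covers only part of at least one record."""
    if length <= 0 or recordsize <= 0 or offset < 0:
        raise ValueError("offset must be >= 0; length and recordsize must be > 0")
    starts_aligned = offset % recordsize == 0
    ends_aligned = (offset + length) % recordsize == 0
    return not (starts_aligned and ends_aligned)


if __name__ == "__main__":
    rs = 8 * 1024
    print(needs_old_data(3 * rs, rs, rs))       # aligned full-record rewrite
    print(needs_old_data(rs // 2, rs, rs))      # straddles two records
```

A database rewriting aligned 8K blocks on an 8K recordsize always lands in the first case, which is why a needless read on that path hurts this workload in particular.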
4th run ZFS/Build 44
Build 44 of our next Solaris version has long had this particular
issue fixed. There we topped our past performance with:
ZFS/B44 : 33.0
Data pool (8K record on FS)
Log pool (no tuning)
vq_max_pending : 10
vdev prefetch : 8K
As we compare to umpty-years of super tuned UFS:
UFS/DIO/SVM : 42.5
Separate Data/log volumes
Summary
I think at this stage of ZFS, the results are neither great nor
bad. We have achieved:
UFS/DIO : 100 %
UFS : xx no directio (to be updated)
ZFS Best : 75% best tuned config with latest bits.
ZFS S10U2 : 50% best tuned config.
ZFS S10U2 : 25% simple tuning.
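The percentages above are round figures. As a quick check, the scores reported earlier in this post can be expressed relative to the UFS/DIO reference like this (the labels in the dictionary are mine):

```python
# Relative performance of each run against the UFS/DIO/SVM reference score.

UFS_DIO_SCORE = 42.5

RUNS = {
    "ZFS/S10U2, 8K recordsize only": 11.0,
    "ZFS/S10U2, prefetch 8K + vq_max_pending 10": 22.0,
    "ZFS/B44, same tuning": 33.0,
}


def relative_percent(score: float, reference: float = UFS_DIO_SCORE) -> float:
    """Score as a percentage of the reference score."""
    return 100.0 * score / reference


if __name__ == "__main__":
    for label, score in RUNS.items():
        print(f"{label}: {relative_percent(score):.1f}% of UFS/DIO")
```

The exact ratios come out a little above the rounded 25%, 50% and 75% shown in the summary.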
To achieve acceptable performance levels, we need:
The latest ZFS code base: ZFS improves fast these days. We
will need to keep tracking releases for a little while. The
current OpenSolaris release, as well as the upcoming Solaris 10
Update 3 (this fall), should perform on these tests as well
as the Build 44 results shown here.
1 data pool and 1 log pool: it is common practice to partition HW
resources when we want proper isolation. Going forward, I think
that we will eventually get to the point where this will not be
necessary, but it seems an acceptable constraint for now.
Tuned vdev prefetch: the code is being worked on. I expect
that in the near future this will not be necessary.
Tuned vq_max_pending: that may take a little longer. In a DB
workload, latency is key and throughput secondary. There are
a number of ideas that need to be tested which will help ZFS
improve on both average latency and latency
fluctuations. This will help both the Intent log (O_DSYNC
write) latency and read latency.
Parting Words
As those improvements come out, they may well allow ZFS to catch up with or
surpass our best UFS numbers. When you combine that kind of performance
with all the usability and data integrity features of ZFS, that's a
proposition that becomes hard to pass up.