The standard answer to any computer performance question is
almost always "it depends", which is semantically
equivalent to "I don't know". The better answer is to state
the dependencies.
I would certainly like to see every performance issue studied with a
scientific approach. OpenSolaris and DTrace are just incredible
enablers when trying to reach root cause, and finding those causes is
really the best way to work toward delivering improved performance.
More generally, though, people use common wisdom or possibly faulty
assumptions to match their symptoms with those of other, similar
reported problems. And, as human nature has it, we easily blame the
component we're least familiar with. So we often end up
with a lot of reports of ZFS performance problems that, once drilled
down, turn out to be either totally unrelated to ZFS (say, HW
problems) or the result of misconfiguration, departure from Best
Practices or, at times, unrealistic expectations.
That does not mean there are no issues. But it's important
that users can more easily identify known issues and plan
for fixes, workarounds, etc. So anyone deploying ZFS should
really be familiar with these two sites:
ZFS Best Practices and
Evil Tuning Guide
That said, what are the real, commonly encountered performance
problems I've seen, and where do we stand?
Writes overrunning memory
This is a real problem that was fixed last March, and the fix is
integrated in the Solaris 10 Update 6 release. Running out of memory
causes many different types of complaints and erratic system behavior.
This can happen anytime a lot of data is created and streamed at a
rate greater than that at which it can be committed to the pool.
Solaris 10 Update 6 will be an important shift for customers running
into this issue. ZFS will still try to use memory to cache your data
(a good thing), but the competition this creates for memory resources
will be much reduced. The way ZFS is designed to deal with this
contention (ARC shrinking) will need a new evaluation from the
community. The lack of throttling was a great impediment to the
ability of the ARC to give back memory under pressure. In the
meantime, lots of people are capping their ARC size with success, as
per the Evil Tuning Guide.
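To picture why the missing throttle hurt, here is a toy model in Python.
It is only a sketch under stated assumptions, not the actual ZFS write
throttle: the function name, the rates and the 1000 MB cap are all made
up for illustration. It tracks how much not-yet-written data sits in
memory when an application creates data faster than the pool can absorb
it, with and without a limit on dirty data.

```python
def dirty_memory_over_time(ingest_mb_s, drain_mb_s, seconds, cap_mb=None):
    """Return the backlog of unwritten data (MB) at the end of each second.

    ingest_mb_s: rate at which the application creates data (hypothetical)
    drain_mb_s:  rate at which the pool can commit data (hypothetical)
    cap_mb:      optional limit on dirty data; a stand-in for a throttle
    """
    dirty = 0.0
    history = []
    for _ in range(seconds):
        dirty += ingest_mb_s               # new writes land in memory
        dirty -= min(dirty, drain_mb_s)    # the pool commits what it can
        if cap_mb is not None:
            # Throttled case: writes beyond the cap are delayed, not buffered.
            dirty = min(dirty, cap_mb)
        history.append(dirty)
    return history


# Illustrative numbers only: 500 MB/s created, 200 MB/s committed, 60 seconds.
unthrottled = dirty_memory_over_time(500, 200, 60)
throttled = dirty_memory_over_time(500, 200, 60, cap_mb=1000)

print("backlog without a cap:", unthrottled[-1], "MB")
print("backlog with a cap:   ", throttled[-1], "MB")
```

Without a cap, the backlog grows every second by the difference between
the two rates; with a cap, it levels off and the writer is effectively
slowed to the speed of the pool.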
For more on this topic, check out:
The new ZFS write throttle
Cache flushes on SAN storage
This is a common issue we hit in the enterprise. Although it will
cause ZFS to be totally underwhelming in terms of performance, it is,
interestingly, not a sign of any defect in ZFS. Sadly, it touches the
customers that are the most performance minded. The issue is somewhat
related to ZFS and somewhat to the storage. As is well documented
elsewhere, ZFS will, at critical times, issue "cache flush" requests
to the storage elements on which it is layered. This is to take into
account the fact that storage can be layered on top of _volatile_
caches whose contents do need to be committed to stable storage for
ZFS to reach its consistency points. Enterprise storage arrays do not
use _volatile_ caches to store data and so should ignore the request
from ZFS to "flush caches". The problem is that some arrays don't.
This misunderstanding between ZFS and storage arrays leads to
underwhelming performance. Fortunately, we have an easy workaround
that can be used to quickly identify whether this is indeed the
problem: setting zfs_nocacheflush (see the Evil Tuning Guide). The
best workaround, though, is to configure the storage itself to ignore
"cache flush" requests. We also have the option of tuning sd.conf on
a per-array basis. Refer again to the Evil Tuning Guide for more
detailed information.
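A rough way to see the size of the effect is to picture a single thread
issuing synchronous writes back to back. The sketch below is a
simplification with hypothetical latencies (0.5 ms for a write
acknowledged from the array's non-volatile cache, 9.5 ms more when the
array honors the flush and goes to disk); real numbers depend entirely
on the array, and the function name is mine.

```python
def sync_writes_per_second(write_latency_ms, flush_latency_ms=0.0):
    """Upper bound for one thread doing synchronous writes one after another.

    Each write must wait for the write itself plus any cache flush that the
    storage chooses to honor before the next write can be issued.
    """
    return 1000.0 / (write_latency_ms + flush_latency_ms)


# Hypothetical latencies, for illustration only.
flush_ignored = sync_writes_per_second(0.5)
flush_honored = sync_writes_per_second(0.5, flush_latency_ms=9.5)

print("array ignores cache flush:", flush_ignored, "writes/s")
print("array honors cache flush: ", flush_honored, "writes/s")
```

With these made-up figures the same array looks twenty times slower,
which is the kind of gap that gets reported as a ZFS problem.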
NFS slow over ZFS (Not True)
This is just not generally true and is often a side effect of the
previous cache flush problem. People have used storage arrays to
accelerate NFS for a long time but failed to see the expected gains
with ZFS. Many sightings of NFS problems are traced to this.
Other sightings involve common disks with volatile
caches. Here the performance deltas observed are rooted in
the stronger semantics that ZFS offers in this operational
model. See
NFS and ZFS for a more detailed description of the
issue.
While I don't consider ZFS generally slow at serving NFS, we did
identify in recent months a condition that affects workloads with a
high thread count of synchronous writes (such as a DB). This issue is
fixed in Solaris 10 Update 6 (CR 6683293).
I would encourage you to be familiar with where we stand regarding ZFS
and NFS because I know of no big, gaping NFS over ZFS problems (if
there were one, I think I would know). People just need to be aware
that NFS is a protocol that needs some type of acceleration (such as
NVRAM) in order to deliver a user experience close to what a
direct-attached filesystem provides.
ZIL is a problem (Not True)
There is a wide perception that the ZIL is the source of performance
problems. This is just a naive interpretation of the facts. The ZIL
is a very fundamental component of the filesystem and does its job
admirably well. Disabling the synchronous semantics of a filesystem
will necessarily lead to higher performance, in a way that is totally
misleading to the outside observer. So while we are looking at further
ZIL improvements for large-scale problems, the ZIL is just not today
the source of common problems. So please don't disable it unless you
know what you're getting into.
Random read from Raid-Z
Raid-Z is a great technology that allows blocks to be stored on top of
common JBOD storage without being subject to RAID-5 write hole
corruption (see: http://blogs.sun.com/bonwick/entry/raid_z). However,
the performance characteristics of Raid-Z depart significantly enough
from those of RAID-5 to surprise first-time users. Raid-Z as currently
implemented spreads blocks across the full width of the raid group and
creates extra IOPS during random reading. At lower loads, the latency
of operations is not impacted, but sustained random read loads can
suffer. However, workloads that end up with frequent cache hits will
not be subject to the same penalty as workloads that access vast
amounts of data more uniformly. This is where one truly needs to say,
"it depends".
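As a back-of-the-envelope illustration of the "extra IOPS" point, the
sketch below compares two simplified layouts. It assumes, as a model and
not a measurement, that a layout which spreads every block across the
full group keeps all disks of that group busy for each random read, so a
group delivers roughly one disk's worth of random read IOPS, while a
layout that keeps a small block on a single disk lets every disk serve a
different read. The function name and the 150 IOPS per disk figure are
hypothetical, and caching is ignored.

```python
def random_read_iops(groups, disks_per_group, disk_iops, full_width_blocks):
    """Estimate uncached random read IOPS for a pool of identical raid groups.

    full_width_blocks=True:  each block is spread over the whole group, so
                             one read occupies every disk in that group.
    full_width_blocks=False: each small block lives on one disk, so all
                             disks can serve different reads in parallel.
    """
    if full_width_blocks:
        return groups * disk_iops
    return groups * disks_per_group * disk_iops


DISK_IOPS = 150  # hypothetical per-disk random read rate

one_wide_group = random_read_iops(1, 10, DISK_IOPS, full_width_blocks=True)
two_narrow_groups = random_read_iops(2, 5, DISK_IOPS, full_width_blocks=True)
block_per_disk = random_read_iops(1, 10, DISK_IOPS, full_width_blocks=False)

print("one 10-disk full-width group:", one_wide_group)
print("two 5-disk full-width groups:", two_narrow_groups)
print("10 disks, one block per disk:", block_per_disk)
```

Under these assumptions, the same ten disks give more random read IOPS
as several narrow groups than as one wide group.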
Interestingly, the same problem does not affect Raid-Z streaming
performance and won't affect workloads that commonly benefit from
caching. That said, both random and streaming performance can be
improved, and we are looking at a number of different ways to improve
on this situation. To better understand Raid-Z, see one of my very
first ZFS entries on this topic:
Raid-Z
CPU consumption, scalability and benchmarking
This is an area where we will need to do more study. With today's very
capable multicore systems, there are many workloads that won't suffer
from the CPU consumption of ZFS. Most systems are not 100% CPU
bound (being more generally constrained by disk, network or
application scalability), and the user-visible latency of operations
is not strongly impacted by extra cycles spent in, say, ZFS
checksumming.
However, this view breaks down when it comes to system benchmarking.
Many benchmarks I encounter (the most crafted ones to boot) end up as
host CPU efficiency benchmarks: how many operations can I do on this
system, given a large amount of disk and network resources, while
preserving some level X of response time? The answer to this question
is purely the inverse of the cycles spent per operation.
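Put as arithmetic, and assuming the disks and network never become the
bottleneck, the ceiling is the number of cycles available per second
divided by the cycles spent per operation. The sketch below uses
hypothetical numbers (8 cores at 3 GHz, 100,000 cycles per operation)
purely for illustration; the function name is mine.

```python
def max_ops_per_second(cores, clock_hz, cycles_per_op):
    """CPU-bound ceiling on throughput: cycles available / cycles per op."""
    return cores * clock_hz / cycles_per_op


CORES = 8                   # hypothetical host
CLOCK_HZ = 3_000_000_000    # 3 GHz

baseline = max_ops_per_second(CORES, CLOCK_HZ, 100_000)
# Same host, but each operation now costs 20% more cycles.
heavier = max_ops_per_second(CORES, CLOCK_HZ, 120_000)

print("baseline ceiling:", baseline, "ops/s")
print("with 20% more cycles per op:", heavier, "ops/s")
```

In such a benchmark, every extra cycle spent in the filesystem shows up
directly in the final score, even though the same cost would go
unnoticed on a system that is waiting on disks.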
This concern is more relevant when the CPU cycles spent in managing
direct-attached storage and the filesystem are in direct competition
with cycles spent in the application. This is also why database
benchmarking is often associated with using raw devices, a practice
much less often encountered in common deployments.
Root causing scalability limits and efficiency problems is just part of the never ending performance optimisation
of filesystems.
Direct I/O
Direct I/O has been a great enabler of database performance in other
filesystems. The problem for me is that Direct I/O is a group of
improvements, each with its own contribution to the end result. Some
want the concurrent writes, some want to avoid a copy, some want to
avoid double caching, and some don't know but see performance gains
when it is turned on (some also see a degradation). I note that
concurrent writes have never been a problem in ZFS and that the extra
copy used when managing a cache is generally cheap considering common
DB rates of access. Achieving greater CPU efficiency is certainly a
valid goal, and we need to look into what is impacting this in common
DB workloads. In the meantime, ZFS in OpenSolaris got a new feature to
manage the cacheability of data in the ZFS ARC. The per-filesystem
"primarycache" property will allow users to decide whether blocks
should actually linger in the ARC cache or just be transient. This
will allow DBs deployed on ZFS to avoid any form of double caching
that might have occurred in the past.
ZFS performance is and will be a moving target for some time in the
future. Solaris 10 Update 6, with a new write throttle, will be a
significant change, and OpenSolaris offers additional
advantages. But generally, just be skeptical of any performance issue that is
not root caused: the problem might not be where you expect it.
It'd be interesting to see how PostgreSQL fares in tandem with ZFS.
As for Oracle, there need not be any discussion.
If one wants maximum I/O performance, the only logical choice is Oracle ASM, which is interestingly enough similar to ZFS in some respects, like disk pools.
It'll be tough, if not next to impossible, to beat Oracle at their own game as far as DB I/O workloads are concerned. People are afraid of ASM, but the truth is that it's a wonderful system to use for Oracle databases, and quite similar to ZFS.
Posted by UX-admin on November 04, 2008 at 06:47 PM MET
Yes this probably makes good sense. I think people drawn to ZFS for DB are probably on a different agenda than pure performance. Price / performance and consolidation might be on their mind also.
Posted by Roch on November 05, 2008 at 09:54 PM MET
Please, if you scrape some time, write an article about PostgreSQL I/O performance in tandem with ZFS; also, an article on how to get the most I/O performance from PostgreSQL on ZFS would be a killer and a thriller!
You might need to team up with your colleagues from the PostgreSQL dept., but such an article would be really useful for those of us who also plan to run PostgreSQL side by side with Oracle... who better to write such an article, than the experts on the subject?
Posted by UX-admin on November 07, 2008 at 08:44 AM MET
PostgreSQL can't even get close to using full disk I/O on moderately speedy arrays on Linux - it is too CPU inefficient. A single 'select count(1) from table' query on a very large table will be pinned by CPU at about 200 to 350 MB/sec off disk with a 3 GHz CPU on Linux -- and a decent direct-attached storage array can easily do 3x to 4x that if tuned well. ZFS + OpenSolaris may do better for sequential reads or it may not. For random I/O loads with a flash L2ARC I'm sure ZFS will do much better.
Because PostgreSQL has such a CPU limit per (single threaded) query, the real world question on large systems is about I/O behavior with many concurrent queries. How the file system and OS handle concurrent I/O and the memory pressure that comes with it from both the I/O side (buffers) and the DB side (sort and aggregate space) is the most important here.
The I/O scheduler and read-ahead algorithms will have the largest effect here -- and have to be optimized for concurrent throttled access not single threaded loads. Any benchmark that doesn't have enough I/O concurrency won't get to the bottom of the question for real world use on fast I/O subsystems for databases -- and especially Postgres. ZFS could have higher CPU overhead and show slower single threaded results, but handle concurrency much better than the competition, or it could be worse all around.
Since every read is the same size in Postgres, and small (8k), a lot of the CPU overhead associated with high I/O load on PostgreSQL is due to large numbers of small read() calls -- and much more of it is internal to Postgres. If it has to scan 8GB of sequential data, it will call read() 1 million times - not the most efficient for sure, so file systems and OS's that have the most optimized small paths will probably shine.
Posted by Scott on November 12, 2008 at 06:59 AM MET