So far as the Solaris-centric aspect of 'direct I/O' goes, if the direct I/O article on solarisinternals.com fails to inform, let me know, and I'll change it. Here, I'll try to stay on course with my stated mission.
Much of the confusion around the topic of 'direct I/O' derives from the fact that the term is overloaded. Even in plain English, the word "direct" has many different meanings. In the context of any particular filesystem - such as UFS, VxFS, NFS, and so on - the term can have differing implications in its performance-related details. In the context of Oracle per se, the term DIRECT PATH refers to a different topic altogether. With filesystems, the primary shared meaning of the term 'direct I/O' is "I/O which is not buffered by the operating system" - but there are other implementation-dependent features that tend to get confused with that shared primary meaning. To help clarify, consider these four ways in which various filesystem options can vary:
- One: Buffering
Two: Concurrency
Red: AIO code path
Blue: I/O API's
One: Buffering
By default, the out-of-box behaviour of filesystems is to buffer at the OS level. RAW disk and various direct I/O options do not buffer at the OS level, and are hence called "unbuffered" solutions. At some level of scaling, the extra copy associated with schleping data through OS-level buffers will impact scaling. Perhaps more importantly, the locking activity associated with managing OS memory is often at the root of application scaling issues. This locking activity is invisible to most users, as the whole topic of OS-level memory management is fundamentally magic to the masses - including myself, to a large degree.The notion that a database engine can (or should) know better than a filesystem what to keep in memory is essentially true, so OS-level buffering always has that argument against it. However, there's a lot more that can be said on the subject. Strategically, OS-level buffering provides a convenient way to share memory between different applications, so it's sometimes appropriate in development and functional test environments to use a buffered filesystem rather than fuss with explicitly carving up memory between various databases. Options for using OS memory in conjunction with application-level buffering can add more color to the topic. For example, some I/O solutions such as VxFS Cached Quick I/O (CQIO) allow for selectively buffering certain files at the OS level as an esoteric tuning technique. (This particular option seems to be in decline recently, since it was mainly used with Oracle, and VxFS Oracle Disk Manager (ODM) is on the rise at the favoured database performance solution for VxFS in the Oracle space.)
Two: Concurrency
With RAW disk, multiple writes to the same target can be in-flight at the same time, no problem. Due to the POSIX requirements for I/O ordering to files, that is not true with the default settings of any POSIX-compliant filesystem which are obliged to implement a single-writer lock to enforce serialization. Of course, Oracle does not require this serialization to implement its ACID properties. This is the main reason that RAW is considered the gold standard for database I/O. The Solaris UFS direct I/O implementation delivers concurrency, but most other direct I/O implementations rely on some other setting or feature to achieve RAW-like I/O concurrency.With databases, write concurrency is usually key to log writer and checkpoint performance, so it tends to be a key requirement for database scaling in general.
Red: AIO code path
The manner in which different software stacks do Asynchronous I/O (AIO) is widely varied. In the context of Solaris, most AIO is managed through the libaio.so library. (Two exceptions to this rule are the QFS samaio implementation and the VxFS ODM implementation; each beyond our scope here.) libaio has two execution paths, and the path taken depends on the nature of the I/O target. This mechanism is well-explained in "Configuring and Tuning Databases on the Solaris Platform". These differences become significant at high throughput rates, but continuous refinement in the Solaris libaio implementation continues to shrink the performance advantage of one path over the other.While much is written on the greatness of KAIO, it's impact is rarely actually characterized, and benefits attributed to it are often largely due to the effects of factors one and two.
Blue: I/O API's
Oracle uses a variety of APIs to do I/O, and its choice can vary by circumstances including the platform, the Oracle release level, the setting of certain tunables, and the specific SQL used. Primarily, in the Solaris environment, Oracle uses pread() to fetch data and aio_write() to write data. One is a syscall, the other isn't. One can observe which APIs Oracle is exercising using truss or DTrace -- but due to numerous implementation details, it is not necessarily a straightforward matter. It would require a separate article to explain. (These details result in a steady stream of support calls to Sun, often erroneously claiming that "AIO is broken". Helping with some of those calls takes X% of my time, year after year.)Perhaps the most unusual case in the Oracle-on-Solaris world is the case of the VxFS ODM implementation. With ODM, truss will show only calls to ioctl() for ODM-related operations, and the absence of any recognizable I/O API calls may perplex the new ODM user who goes peeking under the hood.
Here again, the performance of different APIs is rarely actually characterized, but the benefits are largely secondary to the preeminent impact of factors one and two.
The Proof of the Pudding is in the Eating
A very common error in evaluating direct I/O options over buffered options with Oracle is the failure to make well-considered concomitant changes to certain Oracle parameters during the evaluation. When switching from buffered to unbuffered filesystem options, one abandons the features that the OS buffering provided, and this can have multiple impacts, such as:- Pre-fetching - Large reads, such as commonly used for database sustained sequential reads, can significantly benefit from filesystem prefetching. Indeed, there's a lot of SQL code in the world which has either purposefully or unwittingly been tuned to depend on this.
Therefore, a common consequence of switching to an unbuffered I/O mode will be that some things will run slower.
The better way to achieve high sequential read rates is generally to use some application-driven I/O tuning features like pipelined reads (such as with Oracle Parallel Query) and/or large reads - in conjunction with thoughtful configuration of supply-side factors.
- Read hits, in general - As widely discussed, Oracle can (or should) make better decisions about what to keep in memory than the simple LRU algorithms commonly used by filesystems. However, even if filesystems manage their caches naively, taking that cache away without giving Oracle direct access to that same memory can have dramatic impacts!
Absent a filesystem cache, all Oracle physical reads will become real platform-level physical reads, and all those lovely sub-millesecond reads from the page cache which apps were previoulsy enjoying will instantly become multi-millisecond operations being serviced by
DAS,
SAN, or
NAS storage.
- Deferred writes - One of the principle motivations of filesystem caches is their ability
to rapidly cache writes in memory until an fsync() operation is issued, or until fsflushd
shuffles the data to permanent storage.
Certain Oracle DIRECT operations (usually associated with LOB updating) are designed to achieve performance by issuing multiple writes which they expect to be deferred until they issue an fsync() operation to flush them.
Oracle's archiver also depends on deferred writes.
Predictably, some things will go slower when that cache is removed.
- Aggregate demand - Switching from a non-concurrent option to a concurrent option can be like opening the flood gates to the layers beneath the filesystem. While this is largely required for scaling, it can also have unintended side-effects. For example, floods of non-time-critical checkpoint writes from Oracle might seriously compete against time-critical I/O's from in-flight transactions, and some consideration might be required to address such things.
If you are an Oracle user hanging out on a buffered and/or non-concurrent storage stack, and all this seems scary and complicated to you - consider the alternate proposition offered by Kevin Costner's character in "Tin Cup"; "Grip it and rip it!" -- Switch to an unbuffered concurrent I/O stack, crank your SGA, and start testing empirically. Don't start with your production database though - and don't blame me if you miss any footnotes specific to your circumstances!
Here's the FAQs
Q1: "What filesystem options should I use with Oracle?"
A1: Using an unbuffered concurrent I/O solution with Oracle is usually essential to scaling Oracle applications - variously by optimizing CPU efficiency, maximizing throughput, and minimizing variance. There are some tradeoffs and footnotes involved with each technical option - and with various workloads - so there's no one-size-fits-all "correct" answer to the question. You can quote me on that.
Q2: "Should I use direct I/O with Oracle?"
A2: Well, now that's a deep subject. The word 'direct' is essentially a homonym, and often used ambiguously. Of the several I/O options that provide "unbuffered concurrent I/O", some are named "direct" and others are not. While "direct" is an appropriate adjective for all of these options, not all options called "direct" are also concurrent. So, "it depends" - and please - see also: A1.
Eieeeee I/O!
Creeping ever-so-slightly from my main mission, I'll make one more point on this subject. If you ever intend to move a workload to RAC, tuning it to an unbuffered concurrent storage stack can be a crucial first step! Since there are no RAC storage options that use OS-level filesystem buffering, moving a workload directly from buffered filesystems to RAC will usually amount to digital seppuku, creating a unwieldy burden on the interconnect, and predictably leading to lots of scary statistics and high end-user complaints. I call it the "Fast Track to RAC" AntiPattern. We see a lot of it. More on that some other day.
Posted by mike ramchand on March 23, 2007 at 08:55 PM EDT #
Posted by Chris Page on April 12, 2007 at 10:41 PM EDT #
Thank you very much - this should be helpful to many people.
By the way -
Wikipedia no longer includes YMMV. It can now be found at http://en.wiktionary.org/wiki/YMMV
Posted by Mike Shon on February 24, 2009 at 12:16 PM EST #
Oops! Two years later, I'm just figuring out how to manage comments! Sorry about that, folks!
Posted by Robert Sneed on April 30, 2009 at 06:17 PM EDT #