Wednesday June 29, 2005
Like many other people, no doubt, I have finally decided to become another blogger at Sun. Here, believe it or not, is my first weblog post.
I'm John Brady, part of Sun Microsystems in the United Kingdom. I'm a Technical Consultant specialising in Oracle database software and in all things related to the performance of computer systems. A quick summary of my history:
Since starting as an application developer many years ago, I have always worked with relational database systems on UNIX in one way or another. For the past 15 years I have worked with some of the largest and most powerful UNIX systems, especially multi-processor (SMP) systems, for different system manufacturers. Throughout that time I have focused on the performance of large databases running on UNIX multi-processor systems. As a result I have experience of the issues involved in designing, building and deploying large, complex server-based solutions.
Given this, I will mainly be making posts about the Oracle database software, and performance management of systems, including good design of large, scalable systems.
( Jun 29 2005, 03:46:37 PM BST )
John:
You have a very interesting post at http://forums.oracle.com/forums/thread.jspa?messageID=2385147 shown below. Your comment:
"This could cause ZFS to have to do many physical disk writes, just for writing one redo log block." I find most intriguing.
Any chance we could have a quick telcon to discuss?
Thanks.
=================================================
Re: Oracle 10, Solaris and ZFS
Posted: Mar 4, 2008 9:28 AM in response to: user623557
In principle all should be fine. ZFS obeys all filesystem semantics, and Oracle will access it through the normal filesystem APIs. I'm not sure if Oracle need to officially state that they are compatible with ZFS. I would have thought it was the other way around - ZFS needs to state it is a fully compatible file system, and so any application will work on it.
ZFS has many neat design features in it. But be aware - it is a write only file system! It never updates an existing block on disk. Instead it writes out a new block in a new location with the updated data in it, and also writes out new parent inode blocks that point to this block, and so on. This has some benefits around snapshotting a file system, and providing fallback recovery or quick recovery in the event of a system crash. However, one update in one data block can cause a cascaded series of writes of many blocks to the disk.
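To make the cascade concrete, here is a rough sketch of a copy-on-write update in a toy block tree. It is only an illustration, not ZFS code: the Block class, the tree shape and the helper names are all made up for this example. Updating one leaf forces a new copy of every parent block on the path up to the root, while the old tree stays intact (which is what makes snapshots cheap).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical immutable block: a leaf holds data, a parent holds pointers."""
    data: Optional[bytes] = None
    children: Tuple["Block", ...] = ()


def build_tree(depth: int, fanout: int) -> Block:
    """Build a full tree with `depth` levels; the lowest level holds data."""
    if depth == 1:
        return Block(data=b"old")
    return Block(children=tuple(build_tree(depth - 1, fanout) for _ in range(fanout)))


def read_leaf(root: Block, path: Tuple[int, ...]) -> Optional[bytes]:
    """Follow child indexes from the root down to a leaf and return its data."""
    node = root
    for idx in path:
        node = node.children[idx]
    return node.data


def cow_update(root: Block, path: Tuple[int, ...], new_data: bytes) -> Tuple[Block, int]:
    """Copy-on-write update: return (new_root, number_of_blocks_written).

    Nothing is modified in place. The leaf is written to a new block, and each
    parent on the path is rewritten so that it points at the new child.
    """
    if not path:
        return Block(data=new_data), 1
    idx = path[0]
    new_child, written = cow_update(root.children[idx], path[1:], new_data)
    children = root.children[:idx] + (new_child,) + root.children[idx + 1:]
    return Block(children=children), written + 1


# A 4-level tree: changing one data block writes 4 blocks, not 1.
old_root = build_tree(depth=4, fanout=2)
leaf_path = (0, 1, 0)
new_root, blocks_written = cow_update(old_root, leaf_path, b"new")
print(blocks_written)
```

In this toy model the number of blocks written equals the depth of the tree, and untouched subtrees are shared between the old and new roots rather than copied.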
This can have a major impact if you put your redo logs on ZFS. You need to consider this, and if possible do some comparison tests between ZFS and UFS with logging and direct I/O. Redo log writes on COMMIT are synchronous and must go all the way to the disk device itself. This could cause ZFS to have to do many physical disk writes, just for writing one redo log block.
Oracle needs its SGA memory up front, permanently allocated. Solaris should handle this properly, and release as much filesystem cache memory as needed when the Oracle shared memory is allocated. If it doesn't then Sun have messed up big time. But I cannot imagine this, so I am sure your Oracle SGA will be created fine.
I like the design of ZFS a lot. It has similarities with Oracle's ASM - a built in volume manager that abstracts underlying raw disks to a pool of directly useful storage. ASM abstracts to pools for database storage objects, ZFS abstracts to pools for filesystems. Much better than simple volume managers that abstract raw disks to just logical disks. You still end up with disks, and other management issues. I'm still undecided as to whether it makes sense to store an OLTP database on it that needs to process a high transaction rate, given the extra writes incurred by ZFS.
I also assume you are going to match the database block size and the filesystem block size, for example an 8 KB database block on an 8 KB ZFS record size? You don't want small database writes leading to bigger ZFS writes, and vice versa.
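The block size point is simple arithmetic, sketched below. The function names are invented for this example, and the model assumes aligned blocks and that the filesystem always writes whole records; the 128 KB figure is used only as an example of a larger record size.

```python
import math


def records_touched(db_block_bytes: int, fs_record_bytes: int) -> int:
    """Filesystem records needed to hold one aligned database block."""
    return math.ceil(db_block_bytes / fs_record_bytes)


def write_amplification(db_block_bytes: int, fs_record_bytes: int) -> float:
    """Bytes written by the filesystem per byte of database block written,
    assuming whole-record writes (a simplifying assumption)."""
    fs_bytes = records_touched(db_block_bytes, fs_record_bytes) * fs_record_bytes
    return fs_bytes / db_block_bytes


KB = 1024
for db_block, fs_record in [(8 * KB, 8 * KB), (8 * KB, 128 * KB), (16 * KB, 8 * KB)]:
    print(
        db_block // KB,
        fs_record // KB,
        records_touched(db_block, fs_record),
        write_amplification(db_block, fs_record),
    )
```

With matching 8 KB sizes each database write maps to exactly one filesystem record. An 8 KB database block on a 128 KB record means 16 times as many bytes written, and a 16 KB database block on 8 KB records turns one database write into two filesystem writes.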
John
Posted by Scott Myers on December 19, 2008 at 09:15 PM GMT #