Wednesday Nov 16, 2005
I am not a ZFS developer. However, I am interested in ZFS performance
and am intrigued by ZFS logging. I figure a good way to learn about something is to blog about it.
What follows are my notes as I made my way through the ZIL.
Introduction
Most modern file systems include a logging feature to speed up
writes and to shorten crash recovery (avoiding a lengthy fsck). UFS has supported
logging since Solaris 7 and uses logging as the default on Solaris
10.
Our internal tests have shown that logging file systems perform
as well as, and sometimes better than, non-logging file systems.
Logging in ZFS is implemented by the ZFS Intent Log (ZIL) module,
which lives in the zil.c file. Here is a
brief walkthrough of the logging implementation in ZFS. All of
this information can be found in the zil.[c|h] files in the ZFS
source
code. I also recommend you check out Neil's blog. He
is one of the ZFS developers who works on the ZIL.
All file system related system calls are logged as transaction
records
by the ZIL. These transaction records contain sufficient information
to replay them back in the event of a system crash.
ZFS operations are always part of a DMU (Data Management Unit)
transaction. When a DMU transaction is opened, a ZIL
transaction is opened as well. This ZIL transaction is associated with
the DMU transaction, and in most cases it is discarded when the DMU
transaction commits. ZIL transactions accumulate in memory until
an fsync or O_DSYNC
write happens, at which point they are committed to
stable storage. Once a DMU transaction has committed, its ZIL transactions
are discarded (from memory or from stable storage).
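The lifecycle described above can be sketched as a toy model. This is not the code in zil.c: the demo_* names, the fixed-size array and the "discard everything up to the synced transaction group" rule are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ITX 16

/* Hypothetical model of one in-memory ZIL transaction (not the real structure). */
typedef struct {
    uint64_t txg;       /* DMU transaction group this change belongs to */
    int      on_stable; /* 1 once the record has been written to the log */
    int      live;      /* 0 once the record has been discarded */
} demo_itx_t;

typedef struct {
    demo_itx_t itx[DEMO_MAX_ITX];
    size_t     used;    /* slots handed out so far (never reused in this toy) */
} demo_zil_t;

/* A file system call opened a DMU transaction: remember it in memory. */
int demo_zil_log(demo_zil_t *z, uint64_t txg)
{
    assert(z != NULL);
    if (z->used == DEMO_MAX_ITX)
        return -1;
    z->itx[z->used].txg = txg;
    z->itx[z->used].on_stable = 0;
    z->itx[z->used].live = 1;
    z->used++;
    return 0;
}

/* fsync or O_DSYNC write: push every pending record to stable storage.
 * Returns how many records were written by this call. */
size_t demo_zil_commit(demo_zil_t *z)
{
    size_t written = 0;
    assert(z != NULL);
    for (size_t i = 0; i < z->used; i++) {
        if (z->itx[i].live && !z->itx[i].on_stable) {
            z->itx[i].on_stable = 1;
            written++;
        }
    }
    return written;
}

/* The DMU committed everything up to synced_txg: those records are no
 * longer needed, whether they were still in memory or already on disk. */
size_t demo_zil_txg_synced(demo_zil_t *z, uint64_t synced_txg)
{
    size_t dropped = 0;
    assert(z != NULL);
    for (size_t i = 0; i < z->used; i++) {
        if (z->itx[i].live && z->itx[i].txg <= synced_txg) {
            z->itx[i].live = 0;
            dropped++;
        }
    }
    return dropped;
}

/* Number of records that have not been discarded yet. */
size_t demo_zil_live(const demo_zil_t *z)
{
    size_t n = 0;
    assert(z != NULL);
    for (size_t i = 0; i < z->used; i++)
        n += z->itx[i].live ? 1 : 0;
    return n;
}
```

In this model an fsync only touches records that are still pending, and a DMU commit makes records disposable regardless of where they currently sit.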
The ZIL consists of a ZIL header, ZIL blocks and a ZIL trailer. The ZIL header points to a list of records. Each log record is a variable-sized structure whose format depends on the transaction type: a common structure of type lr_t followed by structures/fields that are specific to that transaction. Log records can reside either in memory or on disk. The on-disk format is described in zil.h.
ZIL records are written to disk in variable-sized blocks. The minimum block size is defined as ZIL_MIN_BLKSZ and is currently 4096 (4KB) bytes. The maximum block size is defined as ZIL_MAX_BLKSZ, which is equal to SPA_MAXBLOCKSIZE (128KB). The block size written to disk is chosen to be either the size of all outstanding ZIL blocks (with a maximum of ZIL_MAX_BLKSZ) or, if there are no outstanding ZIL transactions, the size of the last ZIL block that was committed.
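The block size rule can be written down as a small function. This is a sketch under assumptions, not the logic in zil.c: demo_zil_pick_blksz is a hypothetical name, the two constants simply restate the values quoted above, and clamping the result up to the minimum is my reading of what "minimum block size" implies.

```c
#include <assert.h>
#include <stddef.h>

/* Values quoted in the text; the real definitions live in the ZFS headers. */
#define ZIL_MIN_BLKSZ 4096u
#define ZIL_MAX_BLKSZ (128u * 1024u)

/*
 * Hypothetical helper: pick the size of the next log block.
 *   outstanding_bytes - total size of the outstanding (unwritten) records
 *   last_blksz        - size of the last log block that was committed
 */
size_t demo_zil_pick_blksz(size_t outstanding_bytes, size_t last_blksz)
{
    /* With nothing outstanding, fall back to the previous block size. */
    size_t sz = (outstanding_bytes != 0) ? outstanding_bytes : last_blksz;

    /* Assumption: never go below the minimum block size. */
    if (sz < ZIL_MIN_BLKSZ)
        sz = ZIL_MIN_BLKSZ;
    /* The text caps the block at ZIL_MAX_BLKSZ. */
    if (sz > ZIL_MAX_BLKSZ)
        sz = ZIL_MAX_BLKSZ;

    assert(sz >= ZIL_MIN_BLKSZ && sz <= ZIL_MAX_BLKSZ);
    return sz;
}
```

For example, 20000 outstanding bytes give a 20000-byte block in this sketch, while 1 MB of outstanding records is capped at 128KB.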
ZIL and write(2)
The ZIL behaves differently for different write sizes.
For small writes, the data is stored as part of
the log record. For writes greater than zfs_immediate_write_sz
(64KB), the ZIL does not store a copy of the write, but rather syncs
the write to disk and stores only a pointer to the synced data
in the log record.
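A minimal sketch of that decision follows. The threshold restates the 64KB value quoted above; the demo_* types and function are hypothetical and only illustrate the rule, they are not taken from the ZFS source.

```c
#include <assert.h>
#include <stddef.h>

/* Threshold as quoted in the text. */
#define DEMO_IMMEDIATE_WRITE_SZ (64u * 1024u)

typedef enum {
    DEMO_WR_EMBED_DATA,   /* copy the written data into the log record */
    DEMO_WR_POINTER_ONLY  /* sync the data itself, log only a pointer to it */
} demo_write_mode_t;

typedef struct {
    demo_write_mode_t mode;
    size_t            data_bytes_in_record; /* payload carried by the record */
} demo_write_plan_t;

/* Hypothetical helper: decide how a write of nbytes would be logged. */
void demo_plan_log_write(size_t nbytes, demo_write_plan_t *plan)
{
    assert(plan != NULL);
    if (nbytes > DEMO_IMMEDIATE_WRITE_SZ) {
        /* Large write: the record only points at data already on disk. */
        plan->mode = DEMO_WR_POINTER_ONLY;
        plan->data_bytes_in_record = 0;
    } else {
        /* Small write: the data travels inside the log record. */
        plan->mode = DEMO_WR_EMBED_DATA;
        plan->data_bytes_in_record = nbytes;
    }
}
```

The point of the split is that a large write is written once, in place, instead of once into the log and again at the next DMU commit.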
We can examine the write(2)
system call on ZFS using dtrace.
230 -> zfs_write 21684
230 -> zfs_prefault_write 28005
230 <- zfs_prefault_write 35446
230 -> zfs_time_stamper 69932
230 -> zfs_time_stamper_locked 72893
230 <- zfs_time_stamper_locked 74813
230 <- zfs_time_stamper 76893
230 -> zfs_log_write 81054
230 <- zfs_log_write 89855
230 <- zfs_write 96257
230 <= write
As you can see, there is a log entry (zfs_log_write) associated with every write(2) call. If the file was opened with the O_DSYNC flag, writes are supposed to be synchronous. For synchronous writes, the ZIL has to commit the ZIL transaction to stable storage before returning. For non-synchronous writes, the ZIL holds the transaction in memory until the DMU transaction commits or there is an fsync or an O_DSYNC write.
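From the application side, the two ways of forcing a log commit look roughly like this. It is a sketch using only standard POSIX calls; the helper names and the path passed in are placeholders, and nothing here is specific to ZFS.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: every write(2) on this descriptor is synchronous
 * because the file is opened with O_DSYNC.
 * Returns the number of bytes written, or -1 on error.
 */
ssize_t demo_dsync_write(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC, 0600);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, len);  /* returns after the data is stable */
    if (close(fd) != 0)
        return -1;
    return n;
}

/*
 * Hypothetical helper: an ordinary buffered write followed by an explicit
 * fsync(2), which is the other trigger mentioned in the text.
 */
ssize_t demo_write_then_fsync(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, len);  /* may return before data is stable */
    if (n >= 0 && fsync(fd) != 0)     /* force it to stable storage */
        n = -1;
    if (close(fd) != 0)
        return -1;
    return n;
}
```

On ZFS, the first helper is the case where the write cannot return until the log is on stable storage; the second lets several buffered writes accumulate and pays for a single commit at the fsync.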
zil.c walkthrough
There are several zil functions that operate on zil records.
What follows is a very brief description of their functionality.
zfs`dmu_objset_sync+0x6c
zfs`dsl_pool_sync+0x108
zfs`spa_sync+0xac
zfs`txg_sync_thread+0x130
unix`thread_start+0x4
ZFS Mount
At file system mount time, ZFS checks to see if there is an
intent log. If there is an intent log, this implies that the system
crashed (as the ZIL is deleted at umount(2) time). This intent log is
converted to a replay log and is replayed to bring the file system
to a stable state. If both the replay log and the intent log are
present, it implies that the system crashed while replaying the
replay log, in which case it is OK to ignore/delete the replay log
and replay the intent log.
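The mount-time cases above fit in a small decision function. The names are hypothetical and the sketch only encodes what the paragraph states; the case of a replay log without an intent log is not described there, so it is reported as "not covered" rather than guessed.

```c
#include <assert.h>

typedef enum {
    DEMO_MOUNT_NOTHING_TO_REPLAY,    /* no intent log: clean unmount */
    DEMO_MOUNT_REPLAY_INTENT_LOG,    /* convert intent log to replay log, replay it */
    DEMO_MOUNT_DROP_REPLAY_LOG_FIRST,/* crashed mid-replay: drop replay log, replay intent log */
    DEMO_MOUNT_NOT_COVERED           /* replay log only: not described in the text */
} demo_mount_action_t;

/* Hypothetical helper: what to do at mount time given which logs exist. */
demo_mount_action_t demo_mount_action(int has_intent_log, int has_replay_log)
{
    if (has_intent_log && has_replay_log)
        return DEMO_MOUNT_DROP_REPLAY_LOG_FIRST;
    if (has_intent_log)
        return DEMO_MOUNT_REPLAY_INTENT_LOG;
    if (has_replay_log)
        return DEMO_MOUNT_NOT_COVERED;
    return DEMO_MOUNT_NOTHING_TO_REPLAY;
}

/* Human-readable name for an action, for logging or debugging. */
const char *demo_mount_action_name(demo_mount_action_t a)
{
    switch (a) {
    case DEMO_MOUNT_NOTHING_TO_REPLAY:     return "nothing to replay";
    case DEMO_MOUNT_REPLAY_INTENT_LOG:     return "replay intent log";
    case DEMO_MOUNT_DROP_REPLAY_LOG_FIRST: return "drop replay log, replay intent log";
    case DEMO_MOUNT_NOT_COVERED:           return "not covered";
    }
    assert(!"unknown mount action");
    return "unknown";
}
```

Dropping the replay log in the "both present" case is safe in this description because the intent log is still intact and can simply be replayed again.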
ZIL Performance
As you must have figured out by now, ZIL performance is critical for
the performance of synchronous writes. A common application that issues
synchronous writes is a database. This means that all of these writes
run at the speed of the ZIL. The ZIL is already quite optimized, and I
am sure ongoing
efforts will optimize this code path even further.
As Neil
mentions, using NVRAM/solid state disks for
the log would make it scream! I also recommend that you check out Roch's
work on ZFS performance for details of other performance studies in
progress.
DTrace scripts for use with the ZIL
Posted by Marcelo Leal on July 31, 2007 at 06:25 AM PDT #
Hi. If I understand the behaviour of ZFS correctly (please correct me if I'm wrong!), when an O_DSYNC write or fsync() occurs the log is written to stable storage. In the case of a typical Oracle 8K write, the data is written to stable storage as part of the log record. This means that the data must first be read from disk and then written back again when the next DMU commit occurs. In this scenario, should the ZIL log reads be fulfilled from the ARC cache in most cases?
Posted by Duncan Rutland on May 21, 2008 at 01:22 AM PDT #
The zil reads are fulfilled from the ARC in all cases. If you are concerned about the "double" write, you should note that multiple O_DSYNC writes could be part of one log write, and that the sync later generates mostly sequential writes. The current pain point [that is being solved] is the contention around zil_commit when multiple threads are doing O_DSYNC writes.
"6699227 zil train mode" will help us there.
Posted by realneel on May 21, 2008 at 04:40 AM PDT #
Our Oracle database on ZFS is very slow from time to time. Oracle opens datafiles with O_DSYNC, which I don't think can be disabled, or at least I don't know how. Benchmark testing shows that O_DSYNC writes with an 8k block size are much slower than OS-buffered writes without O_DSYNC. Would you please advise on tuning parameters or methods?
Setting set zfs:zfs_nocacheflush = 1 in /etc/system helps a little, but not very much.
Posted by James on September 11, 2008 at 10:18 PM PDT #
Unfortunately, this and other articles (such as http://blogs.sun.com/JaySisodiya/feed/entries/atom?cat=%2FZFS_Oracle which incorrectly states 'Note, Oracle does not require separate certification for Operating System features such as ZFS, so don't worry about it and go looking on metalink.')
fail to mention that ZFS is NOT certified or supported with Oracle databases or Oracle Application Server.. it says so on Metalink (notes 403202.1 and 730691.1).
Posted by Steve on September 16, 2008 at 04:29 AM PDT #
Oracle is not mentioned anywhere in this blog post. What's your point?
Posted by neel on September 16, 2008 at 07:48 AM PDT #
Our issue is that O_DSYNC 8k block writes, which is what Oracle does, are very slow on ZFS. Neelakanth Nadgir's article explains the ZIL, which I think could be the cause of the slow writes. I was wondering whether using a separate log would speed things up.
Posted by James on September 16, 2008 at 05:32 PM PDT #
James: What version of Solaris are you using? There have been ongoing improvements in the O_DSYNC write latency. If you can use it, I would highly recommend using the separate intent log feature of ZFS. This will isolate the log writes and will give you the lowest latency. Be aware that you can currently only add an slog device; removing it is hard!
Posted by Neelakanth Nadgir on September 16, 2008 at 05:44 PM PDT #