Wednesday Nov 16, 2005
I am not a ZFS developer. However, I am interested in ZFS performance
and am intrigued by ZFS logging. I figure a good way to learn about something is to blog about it.
What follows are my notes as I made my way through the ZIL.
Introduction
Most modern file systems include a logging feature to ensure
faster write times and shorter crash recovery (fsck) times. UFS has supported
logging since Solaris 7 and uses logging as the default on Solaris
10.
Our internal tests have shown us that logging file systems perform
as well as (and sometimes even better than) non-logging file systems.
ZFS implements logging in the ZFS Intent Log (ZIL) module, which
lives in the zil.c file. Here is a
brief walkthrough of the logging implementation in ZFS. All of
this knowledge can be found in the zil.[c|h] files in the ZFS
source code. I also recommend you check out Neil's blog; he
is one of the ZFS developers who works on the ZIL.
All file system related system calls are logged as transaction
records by the ZIL. These transaction records contain sufficient information
to replay them in the event of a system crash.
ZFS operations are always a part of a DMU (Data Management Unit)
transaction. When a DMU transaction is opened, a ZIL
transaction is opened as well. This ZIL transaction is associated with
the DMU transaction, and in most cases is discarded when the DMU
transaction commits. These transactions accumulate in memory until
an fsync or O_DSYNC
write happens, in which case they are committed to
stable storage. For committed DMU transactions, the ZIL transactions
are discarded (from memory or stable storage).
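To convince myself I understood this lifecycle, I wrote a toy model of it in C. This is only a sketch of the behavior described above, not code from zil.c: every name here (sketch_zil_t, sketch_log, sketch_zil_commit, sketch_dmu_committed) is made up for illustration, and the fixed-size array is an assumption to keep it short.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ITX 64	/* arbitrary capacity for this toy model */

/* One logged ZIL transaction, tagged with the DMU txg it belongs to. */
typedef struct {
	uint64_t txg;		/* associated DMU transaction group */
	int	 on_stable;	/* 0 = in memory only, 1 = also on stable storage */
} sketch_itx_t;

typedef struct {
	sketch_itx_t itx[SKETCH_MAX_ITX];
	size_t	     count;
} sketch_zil_t;

/* Record a new ZIL transaction in memory. Returns -1 if the toy log is full. */
int
sketch_log(sketch_zil_t *z, uint64_t txg)
{
	if (z->count == SKETCH_MAX_ITX)
		return (-1);
	z->itx[z->count].txg = txg;
	z->itx[z->count].on_stable = 0;
	z->count++;
	return (0);
}

/* fsync or O_DSYNC write: push everything still in memory to stable storage. */
size_t
sketch_zil_commit(sketch_zil_t *z)
{
	size_t flushed = 0;
	for (size_t i = 0; i < z->count; i++) {
		if (!z->itx[i].on_stable) {
			z->itx[i].on_stable = 1;
			flushed++;
		}
	}
	return (flushed);
}

/* DMU txg committed: its ZIL transactions are no longer needed anywhere. */
size_t
sketch_dmu_committed(sketch_zil_t *z, uint64_t txg)
{
	size_t keep = 0;
	for (size_t i = 0; i < z->count; i++) {
		if (z->itx[i].txg > txg)
			z->itx[keep++] = z->itx[i];
	}
	size_t dropped = z->count - keep;
	z->count = keep;
	return (dropped);
}
```

The point of the model is that a ZIL transaction can leave the log in only one way, when its DMU transaction commits; an fsync merely changes where it lives in the meantime.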
The ZIL consists of a zil header, zil blocks and a zil trailer. The zil header points to a list of records. Each of these log records is a variable sized structure whose format depends on the transaction type. Each log record structure consists of a common structure of type lr_t followed by multiple structures/fields that are specific to each transaction. These log records can reside either in memory or on disk. The on-disk format is described in zil.h.
ZIL records are written to disk in variable sized blocks. The minimum block size is defined as ZIL_MIN_BLKSZ and is currently 4096 (4k) bytes. The maximum block size is defined as ZIL_MAX_BLKSZ, which is equal to SPA_MAXBLOCKSIZE (128KB). The zil block size written to disk is chosen to be either the size of all outstanding zil blocks (with a maximum of ZIL_MAX_BLKSZ) or, if there are no outstanding ZIL transactions, the size of the last zil block that was committed.
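The block size rule is easier for me to follow as code. The helper below is a hypothetical sketch of that rule, not the function zil.c actually uses. The two constants repeat the values quoted above, and I am assuming the result is simply clamped to the minimum and maximum with no further rounding.

```c
#include <stdint.h>

/* Values quoted in the text; the real definitions live in zil.h. */
#define ZIL_MIN_BLKSZ	4096u
#define ZIL_MAX_BLKSZ	(128u * 1024u)	/* SPA_MAXBLOCKSIZE per the text */

/*
 * Hypothetical helper: choose the size of the next log block from the
 * number of bytes outstanding and the size of the last block committed.
 */
uint64_t
sketch_next_blksz(uint64_t outstanding_bytes, uint64_t last_blksz)
{
	/* Nothing outstanding: reuse the size of the last committed block. */
	uint64_t sz = (outstanding_bytes != 0) ? outstanding_bytes : last_blksz;

	/* Assumption: clamp into [ZIL_MIN_BLKSZ, ZIL_MAX_BLKSZ]. */
	if (sz < ZIL_MIN_BLKSZ)
		sz = ZIL_MIN_BLKSZ;
	if (sz > ZIL_MAX_BLKSZ)
		sz = ZIL_MAX_BLKSZ;
	return (sz);
}
```

For example, with 1MB outstanding the sketch returns ZIL_MAX_BLKSZ, and with nothing outstanding and a previous 8KB block it returns 8KB again.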
ZIL and write(2)
The zil behaves differently depending on the size of the write.
For small writes, the data is stored as a part of
the log record. For writes greater than zfs_immediate_write_sz
(64KB), the ZIL does not store a copy of the write, but rather syncs
the write to disk and stores only a pointer to the synced data
in the log record.
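Here is how I picture that decision, as a small C sketch. The enum and function names are hypothetical, the threshold is the 64KB figure quoted above, and treating a write of exactly the threshold size as a "small" write is my assumption.

```c
#include <stdint.h>

/* Threshold quoted in the text for zfs_immediate_write_sz (64KB). */
#define SKETCH_IMMEDIATE_WRITE_SZ	(64u * 1024u)

typedef enum {
	SKETCH_WR_COPIED,	/* write data is embedded in the log record */
	SKETCH_WR_INDIRECT	/* data synced to disk, record keeps a pointer */
} sketch_wr_kind_t;

/* Hypothetical helper: decide how a write of nbytes would be logged. */
sketch_wr_kind_t
sketch_classify_write(uint64_t nbytes)
{
	return ((nbytes > SKETCH_IMMEDIATE_WRITE_SZ) ?
	    SKETCH_WR_INDIRECT : SKETCH_WR_COPIED);
}
```

The trade-off is that a copied record writes the data twice (once to the log, once to its final location), which is cheap for small writes but wasteful for large ones.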
We can examine the write(2)
system call on ZFS using dtrace.
  230  -> zfs_write                       21684
  230  -> zfs_prefault_write              28005
  230  <- zfs_prefault_write              35446
  230  -> zfs_time_stamper                69932
  230  -> zfs_time_stamper_locked         72893
  230  <- zfs_time_stamper_locked         74813
  230  <- zfs_time_stamper                76893
  230  -> zfs_log_write                   81054
  230  <- zfs_log_write                   89855
  230  <- zfs_write                       96257
  230  <= write
As you can see, there is a log entry associated with every write(2) call. If the file was opened with the O_DSYNC flag, writes are supposed to be synchronous. For synchronous writes, the ZIL has to commit the zil transaction to stable storage before returning. For non-synchronous writes, the ZIL holds on to the transaction in memory until the DMU transaction commits or there is an fsync or an O_DSYNC write.
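From the application side, the two ways of forcing a commit look like this. This is a minimal sketch using the standard open(2), write(2) and fsync(2) calls; the function names and the short-write handling are mine, and nothing here is ZFS specific.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Synchronous write: with O_DSYNC, write(2) returns only after the data
 * has reached stable storage.
 */
int
sketch_dsync_write(const char *path, const char *buf)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC, 0644);
	if (fd < 0)
		return (-1);

	size_t len = strlen(buf);
	ssize_t n = write(fd, buf, len);
	int rc = (n == (ssize_t)len) ? 0 : -1;

	if (close(fd) != 0)
		rc = -1;
	return (rc);
}

/*
 * Asynchronous write followed by fsync(2): the write may return before
 * the data is stable, and fsync forces it out afterwards.
 */
int
sketch_write_then_fsync(const char *path, const char *buf)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return (-1);

	size_t len = strlen(buf);
	ssize_t n = write(fd, buf, len);
	int rc = (n == (ssize_t)len) ? 0 : -1;

	if (fsync(fd) != 0)
		rc = -1;
	if (close(fd) != 0)
		rc = -1;
	return (rc);
}
```

Both functions return 0 on success and -1 on any failure. On ZFS, the first pays the log commit cost on every write, while the second pays it once at the fsync.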
zil.c walkthrough
There are several zil functions that operate on zil records.
What follows is a very brief description of their functionality.
zfs`dmu_objset_sync+0x6c
zfs`dsl_pool_sync+0x108
zfs`spa_sync+0xac
zfs`txg_sync_thread+0x130
unix`thread_start+0x4
ZFS Mount
At file system mount time, ZFS checks to see if there is an
intent log. If there is an intent log, this implies that the system
crashed (as the ZIL is deleted at umount(2)
time). This intent log is
converted to a replay log and is replayed to update the file system
to a stable state. If both the replay log and intent log are
present, it implies that the system crashed while replaying the
replay log, in which case it is OK to ignore/delete the replay log
and replay the intent log.
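The mount-time cases above boil down to a small decision table, which I sketched in C to check my reading. The enum and function names are hypothetical and this is not the ZFS mount code. The case of a replay log with no intent log is not covered by what I read, so the sketch just flags it.

```c
typedef enum {
	SKETCH_CLEAN,			/* no logs: clean unmount, nothing to do */
	SKETCH_REPLAY_INTENT,		/* convert intent log to replay log, replay it */
	SKETCH_DROP_REPLAY_THEN_REPLAY_INTENT, /* crashed mid-replay: start over */
	SKETCH_NOT_COVERED		/* replay log only: not described in the text */
} sketch_mount_action_t;

/* Hypothetical helper: pick the mount-time action from which logs exist. */
sketch_mount_action_t
sketch_mount_action(int have_intent_log, int have_replay_log)
{
	if (have_intent_log && have_replay_log)
		return (SKETCH_DROP_REPLAY_THEN_REPLAY_INTENT);
	if (have_intent_log)
		return (SKETCH_REPLAY_INTENT);
	if (have_replay_log)
		return (SKETCH_NOT_COVERED);
	return (SKETCH_CLEAN);
}
```

Restarting from the intent log is safe because the intent log is left untouched until the replay has finished.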
ZIL Performance
As you must have figured out by now, ZIL performance is critical for
the performance of synchronous writes. A common application that issues
synchronous writes is a database. This means that all of these writes
run at the speed of the ZIL. The ZIL is already quite optimized, and I
am sure ongoing
efforts will optimize this code path even further.
As Neil
mentions, using nvram/solid state disks for
the log would make it scream! I also recommend that you check out Roch's
work on ZFS performance for details of other performance studies in
progress.
Dtrace scripts for use with zil