Neelakanth Nadgir's blog
Wednesday Nov 16, 2005
The ZFS Intent Log

A quick guide to the ZFS Intent Log (ZIL)

I am not a ZFS developer. However, I am interested in ZFS performance and am intrigued by ZFS logging. I figure a good way to learn about something is to blog about it ;-). What follows are my notes as I made my way through the ZIL.

Introduction

Most modern file systems include a logging feature to speed up writes and to shorten crash recovery time (fsck). UFS has supported logging since Solaris 7 and uses logging as the default on Solaris 10. Our internal tests have shown that logging file systems perform as well as (and sometimes better than) non-logging file systems.

In ZFS, logging is implemented by the ZFS Intent Log (ZIL) module, which lives in the zil.c file. Here is a brief walkthrough of the logging implementation in ZFS. All of this information can be found in the zil.[c|h] files in the ZFS source code. I also recommend you check out Neil's blog. He is one of the ZFS developers who works on the ZIL.

All file-system-related system calls are logged as transaction records by the ZIL. These transaction records contain enough information to replay the operations in the event of a system crash.

ZFS operations are always part of a DMU (Data Management Unit) transaction. When a DMU transaction is opened, a ZIL transaction is opened along with it. The ZIL transaction is associated with the DMU transaction and, in most cases, is discarded when the DMU transaction commits. ZIL transactions accumulate in memory until an fsync or an O_DSYNC write happens, at which point they are committed to stable storage. Once a DMU transaction has committed, its ZIL transactions are discarded, whether they are in memory or on stable storage.
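The lifecycle described above can be modelled in a few lines of C. This is a toy sketch of the idea, not code from zil.c: the names toy_itx_t, toy_zil_t, toy_zil_log, toy_zil_commit and toy_zil_clean are hypothetical, and "stable storage" is only a flag on each record.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical in-memory log entry, tagged with its DMU transaction group. */
typedef struct toy_itx {
    uint64_t txg;          /* DMU transaction group this record belongs to */
    int on_disk;           /* 1 once toy_zil_commit() has "written" it */
    struct toy_itx *next;
} toy_itx_t;

/* Hypothetical per-file-system log: a singly linked list of records. */
typedef struct toy_zil {
    toy_itx_t *head;
    toy_itx_t *tail;
} toy_zil_t;

/* Record one operation against transaction group txg. Returns 0 on success. */
int toy_zil_log(toy_zil_t *zil, uint64_t txg)
{
    toy_itx_t *itx = malloc(sizeof (*itx));
    if (itx == NULL)
        return -1;
    itx->txg = txg;
    itx->on_disk = 0;
    itx->next = NULL;
    if (zil->tail != NULL)
        zil->tail->next = itx;
    else
        zil->head = itx;
    zil->tail = itx;
    return 0;
}

/* fsync or O_DSYNC write: flush every record that is still only in memory.
 * Returns the number of records flushed. */
size_t toy_zil_commit(toy_zil_t *zil)
{
    size_t written = 0;
    for (toy_itx_t *itx = zil->head; itx != NULL; itx = itx->next) {
        if (!itx->on_disk) {
            itx->on_disk = 1;
            written++;
        }
    }
    return written;
}

/* The DMU has committed everything up to synced_txg: those records are no
 * longer needed, in memory or on disk. Returns the number discarded. */
size_t toy_zil_clean(toy_zil_t *zil, uint64_t synced_txg)
{
    size_t discarded = 0;
    toy_itx_t **link = &zil->head;
    toy_itx_t *last = NULL;
    while (*link != NULL) {
        toy_itx_t *itx = *link;
        if (itx->txg <= synced_txg) {
            *link = itx->next;   /* unlink and free the record */
            free(itx);
            discarded++;
        } else {
            last = itx;          /* keep the record, move on */
            link = &itx->next;
        }
    }
    zil->tail = last;
    return discarded;
}
```

In this model a record that is never followed by an fsync simply disappears in toy_zil_clean without ever touching the disk, which is the common case the paragraph above describes.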

The ZIL consists of a zil header, zil blocks and a zil trailer. The zil header points to a list of records. Each log record is a variable-sized structure whose format depends on the transaction type: a common structure of type lr_t, followed by structures/fields that are specific to that transaction. Log records can reside either in memory or on disk. The on-disk format is described in zil.h. ZIL records are written to disk in variable-sized blocks. The minimum block size is defined as ZIL_MIN_BLKSZ and is currently 4096 (4KB) bytes. The maximum block size is defined as ZIL_MAX_BLKSZ, which is equal to SPA_MAXBLOCKSIZE (128KB). The size of the zil block written to disk is either the combined size of all outstanding ZIL transactions (capped at ZIL_MAX_BLKSZ) or, if there are no outstanding ZIL transactions, the size of the last zil block that was committed.
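Here is a rough sketch of the two ideas in that paragraph: a record that starts with a common header, and the block size choice. Everything prefixed with toy_ or TOY_ is a made-up name for illustration and is not the layout from zil.h. Clamping up to the minimum block size is my assumption from the text, and the sketch does not attempt any rounding the real code may do.

```c
#include <stdint.h>

#define TOY_ZIL_MIN_BLKSZ 4096u            /* ZIL_MIN_BLKSZ, per the text */
#define TOY_ZIL_MAX_BLKSZ (128u * 1024u)   /* SPA_MAXBLOCKSIZE, per the text */

/* Hypothetical transaction types. */
enum { TOY_TX_CREATE = 1, TOY_TX_WRITE = 2 };

/* Hypothetical common header: every record starts with one of these. */
typedef struct toy_lr {
    uint64_t txtype;   /* which operation this record describes */
    uint64_t reclen;   /* total record length, header included */
} toy_lr_t;

/* Hypothetical write record: common header first, then specific fields. */
typedef struct toy_lr_write {
    toy_lr_t common;   /* always first, so any record can be read as toy_lr_t */
    uint64_t offset;   /* file offset of the write */
    uint64_t length;   /* number of bytes written */
} toy_lr_write_t;

/* Pick the size of the next log block: the outstanding bytes if there are
 * any, otherwise the size of the last committed block, clamped to the
 * minimum and maximum block sizes. */
uint64_t toy_zil_next_blksz(uint64_t outstanding_bytes, uint64_t last_blksz)
{
    uint64_t sz = (outstanding_bytes != 0) ? outstanding_bytes : last_blksz;
    if (sz < TOY_ZIL_MIN_BLKSZ)
        sz = TOY_ZIL_MIN_BLKSZ;
    if (sz > TOY_ZIL_MAX_BLKSZ)
        sz = TOY_ZIL_MAX_BLKSZ;
    return sz;
}
```

Because the common header always comes first, code walking the log can read any record as a toy_lr_t, look at txtype and reclen, and then decide how to interpret the rest.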

ZIL and write(2)
The ZIL behaves differently depending on the size of the write. For small writes, the data is stored as part of the log record. For writes greater than zfs_immediate_write_sz (64KB), the ZIL does not store a copy of the data; instead it syncs the write to disk and stores only a pointer to the synced data in the log record. We can examine the write(2) system call on ZFS using DTrace.


230  -> zfs_write                                         21684
230    -> zfs_prefault_write                              28005
230    <- zfs_prefault_write                              35446
230    -> zfs_time_stamper                                69932
230      -> zfs_time_stamper_locked                       72893
230      <- zfs_time_stamper_locked                       74813
230    <- zfs_time_stamper                                76893
230    -> zfs_log_write                                   81054
230    <- zfs_log_write                                   89855
230  <- zfs_write                                         96257
230  <= write


As you can see, there is a log entry associated with every write(2) call. If the file was opened with the O_DSYNC flag, writes are supposed to be synchronous. For synchronous writes, the ZIL has to commit the zil transaction to stable storage before returning. For non-synchronous writes, the ZIL holds the transaction in memory until the DMU transaction commits or until there is an fsync or an O_DSYNC write.

zil.c walkthrough

There are several zil functions that operate on zil records. What follows is a very brief description of their functionality.

 

             zfs`dmu_objset_sync+0x6c
             zfs`dsl_pool_sync+0x108
             zfs`spa_sync+0xac
             zfs`txg_sync_thread+0x130
             unix`thread_start+0x4

ZFS Mount

At file system mount time, ZFS checks to see if there is an intent log. If there is one, this implies that the system crashed (as the ZIL is deleted at umount(2) time). The intent log is converted to a replay log and is replayed to update the file system to a stable state. If both the replay log and the intent log are present, it implies that the system crashed while replaying the replay log, in which case it is OK to ignore/delete the replay log and replay the intent log.
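The mount-time cases above fit in a small decision function. Again, this is a sketch with hypothetical names (toy_mount_action_t, toy_mount_decide), not the real mount code. The text does not say what happens when only a replay log is found, so the sketch reports that case separately instead of guessing.

```c
#include <stdbool.h>

/* Hypothetical outcomes of the mount-time log check. */
typedef enum {
    TOY_MOUNT_CLEAN,                  /* no logs: last unmount was clean */
    TOY_MOUNT_REPLAY_INTENT,          /* crashed: convert intent log, replay */
    TOY_MOUNT_DROP_REPLAY_THEN_REPLAY,/* crashed during replay: drop the
                                         replay log, replay the intent log */
    TOY_MOUNT_UNSPECIFIED             /* replay log only: not covered by text */
} toy_mount_action_t;

/* Decide what to do at mount time from which logs are present. */
toy_mount_action_t toy_mount_decide(bool have_intent_log, bool have_replay_log)
{
    if (have_intent_log && have_replay_log)
        return TOY_MOUNT_DROP_REPLAY_THEN_REPLAY;
    if (have_intent_log)
        return TOY_MOUNT_REPLAY_INTENT;
    if (have_replay_log)
        return TOY_MOUNT_UNSPECIFIED;   /* assumption: left undefined here */
    return TOY_MOUNT_CLEAN;
}
```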

ZIL Tunables
I am almost tempted to mention some tunables here, but the truth is that ZFS is intended not to require any tuning. ZFS should (and will) perform optimally "Out of the Box". You might find some switches in the code, but they are only for internal development and will be yanked out soon!

ZIL Performance
As you must have figured out by now, ZIL performance is critical for the performance of synchronous writes. A common application that issues synchronous writes is a database, which means that all of its synchronous writes run at the speed of the ZIL. The ZIL is already quite optimized, and I am sure ongoing efforts will optimize this code path even further. As Neil mentions, using NVRAM/solid state disks for the log would make it scream! I also recommend that you check out Roch's work on ZFS performance for details of other performance studies in progress.


Dtrace scripts for use with zil

Finally

Congratulations to the ZFS team for delivering such a world-class product. You folks rock!


Posted at 11:59AM Nov 16, 2005 by Neelakanth Nadgir in ZFS