Tuesday Sep 18, 2007
Tuesday Sep 18, 2007
I wrote a simple perl script to print out the ARC statistics. Sample output is shown below. You can either print the set of default fields, or specify what fields you want printed (see arcstat -v for details)
Download arcstat.pl
In later blogs, I will describe how to make sense of these numbers. Sample output below
Cached random read
thumper1[bash]$ ~/arcstat.pl
Time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
18:15:21 985M 100M 10 47M 14 53M 8 1M 3 11G 11G
18:15:22 137K 0 0 0 0 0 0 0 0 11G 11G
18:15:23 138K 0 0 0 0 0 0 0 0 11G 11G
18:15:24 138K 0 0 0 0 0 0 0 0 11G 11G
18:15:25 138K 0 0 0 0 0 0 0 0 11G 11G
18:15:26 138K 0 0 0 0 0 0 0 0 11G 11G
18:15:27 139K 0 0 0 0 0 0 0 0 11G 11G
18:15:28 140K 0 0 0 0 0 0 0 0 11G 11G
18:15:29 139K 0 0 0 0 0 0 0 0 11G 11G
18:15:30 140K 0 0 0 0 0 0 0 0 11G 11G
18:15:31 139K 0 0 0 0 0 0 0 0 11G 11G
18:15:32 33K 0 0 0 0 0 0 0 0 11G 11G
^C
Uncached sequential write
Time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
18:17:48 0 0 0 0 0 0 0 0 0 11G 11G
18:17:49 0 0 0 0 0 0 0 0 0 11G 11G
18:17:50 664 372 56 7 2 365 100 5 2 11G 11G
18:17:51 1K 534 36 5 0 529 100 5 0 11G 11G
18:17:52 2K 774 33 6 0 768 100 6 0 10G 10G
18:17:53 2K 645 31 5 0 640 100 5 0 10G 10G
18:17:54 1K 645 35 5 0 640 100 5 0 10G 10G
18:17:55 2K 645 31 5 0 640 100 5 0 10G 10G
18:17:56 1K 646 35 6 0 640 100 6 0 10G 10G
18:17:57 2K 645 31 5 0 640 100 5 0 10G 10G
18:17:58 2K 774 33 6 0 768 100 6 0 10G 10G
Monday Sep 25, 2006
We (PAE - Performance Engineering) recently completed a study to
understand database performance with ZFS. Read on more details and
recommendations. You can also read Roch's blog on
the same study
Databases stress the filesystem in unique ways. Depending on the
workload and configuration, you can have thousands of IO operations per
second. The size of these IO is usually small (database block size).
All the writes are synchronized writes. Reads can be random or
sequential. Some writes are also more critical than others. Depending
on the configuration, Reads are cached by the database program or the
filesystem (if supported/requested). In many cases where filesystems
are used, the IO is spread over a few files. This causes the single
writer lock to be very hot under certain configurations like Buffered
UFS.
Since IO is so important for databases, not surprisingly, there are a
lot heavy weight players in this arena. UFS, QFS, VxFS, are quite
popular with customers as the underlying filesystem. So how does the
new kid on the block (ZFS) do?
We used an internally developed benchmark called OLTP/Net to study
database performance with ZFS. OLTP/Net (O-L-T-P slash Net) is a OLTP
benchmark that simulates an online store. The major feature of the
benchmark is that it has a bunch of tuning knobs that control the ratio
of network IO to disk IO, and/or read/write nature of the transactions,
and/or number of new connects/disconnects to the database etc.. This
makes it quite easy to simulate customer situations in our labs. We use
it quite extensively inside Sun to model real-world database
performance, and have found/fixed quite a few performance issues using
this workload.
For our ZFS study, we used the default settings for OLTP/Net. In this
scenario, we have a read/write ratio of 2:1 and a network/disk IO ratio
of 10:1. Since our goal is to run like most customers, we controlled
the number of users (load generators) such that the box was 60%
utilized.
The hardware configuration consisted of a T2000 with 32x1200Mhz CPUs,
32GB RAM connected to 140 Fibre channel JBODs. We used both Solaris 10
Update 2 as well as Solaris Nevada build 43 to do the analysis We
created one big dynamically stripped pool with all the disks. We set
the recordsize of
this pool to 8k. Each disk was divided into 2 slices. These
slices were allocated to UFS and ZFS in round robin fashion to ensure
that each filesystem got equal number of inner and outer slices.
Normally for OLTP benchmark situations, we try to use the smallest
database blocksize for best performance. When we started out with our
study, we used a block size of 2048 as that gives us the best
performance for other filesystems. But since we are trying to do what
most customers might do, we switched over to a block size of 8192.
We did two kinds of tests, a cached database as well as a large (not
cached) database. Details follow in following sections.
Monday Nov 28, 2005
As I mentioned in my previous
blog entry, the ZIL (ZFS Intent Log) operates with block sizes
between ZIL_MIN_BLKSZ(4K)
and ZIL_MAX_BLKSZ(128k).
Let us take a closer look at this.
The ZIL has to allocate a new zil block before it commits the current zil
block. This is because the zil block being committed has to have
a link to the next zil block. If you do not preallocate, you will have
to update the next pointer in
the previous block whenever you write a new zil block. This means that
you will have read in the previous block, update the next pointer, and rewrite it out.
Obviously this is quite expensive (and quite complicated).
The current block selection strategy is to chose either
the sum of all outstanding ZIL blocks or if no outstanding zil blocks
are present, the size of the last zil block that was committed. If the
size of the outstanding zil blocks is greater than 128k, it is rounded
up to 128k.
The above strategy works in most cases, but behaves badly for certain edge cases.
Let us examine the zil block size for the set of
actions described below
(dtrace -n
::zil_lwb_commit:entry'{@[1]=quantize((lwb_t*)args[2]->lwb_sz);}')
Wednesday Nov 16, 2005
I am not a ZFS developer. However I am interested in ZFS performance,
and am intrigued by ZFS Logging. I figure a good way to learn about something is to blog about it
. What follows is my notes as I made my way through the ZIL
Introduction
Most modern file systems include a logging feature to ensure
faster
write times and crash recovery time (fsck). UFS has supported
logging since Solaris 2.7 and uses logging as the default on Solaris
10.
Our tests internally have shown us that logging file systems perform
as good as (sometimes even better) non-logging file systems.
Logging is implemented via the ZFS Intent Log module in ZFS. ZFS
Intent Log or ZIL is implemented in the
zil.c file. Here is a
brief walk through of the logging implementation in ZFS. All of
this knowledge can be found in the zil.[c|h] files in the ZFS
source
code. I also recommend you check out Neil's blog -- He
is one of the ZFS developers who works on the ZIL.
All file system related system calls are logged as transaction
records
by the ZIL. These transaction records contain sufficient information
to replay them back in the event of a system crash.
ZFS operations are always a part of a DMU (Data Management Unit)
transaction. When a DMU transaction is opened, there is also a ZIL
transaction that is opened. This ZIL transaction is associated with
the DMU transaction, and in most cases discarded when the DMU
transaction commits. These transactions accumulate in memory until
an fsync or O_DSYNC
write happens in which case they are committed to
stable storage. For committed DMU transactions, the ZIL transactions
are discarded (from memory or stable storage).
The ZIL consists of a zil header, zil blocks and zil trailer. The zil header points to a list of records. Each of these log records are variable sized structures whose format depends on the transaction type. Each log record structure consists of a common structure of type lr_t followed by multiple structures/fields that are specific to each transaction. These Log records can reside either in memory or on disk. The on disk format described in zil.h. ZIL records are written to disk in variable sized blocks. The minimum block size is defined as ZIL_MIN_BLKSZ and is currently 4096 (4k) bytes. The maximum block size is defined as ZIL_MAX_BLKSZ which is equal to SPA_MAXBLOCKSIZE (128KB). The zil block size written to disk is chosen to be either the size of all outstanding zil blocks (with a maximum of ZIL_MAX_BLKSZ) or if there are no outstanding ZIL transactions, the size of the last zil block that was committed.
ZIL and write(2)
The zil behaves differently for different size of writes that
happens. For small writes, the data is stored as a part of
the log record. For writes greater than zfs_immediate_write_sz
(64KB), the ZIL does not store a copy of the write, but rather syncs
the write to disk and only a pointer to the sync-ed data is stored
in the log record.
We can examine the write(2)
system call on ZFS using dtrace.
230 -> zfs_write 21684 230 -> zfs_prefault_write 28005 230 <- zfs_prefault_write 35446 230 -> zfs_time_stamper 69932 230 -> zfs_time_stamper_locked 72893 230 <- zfs_time_stamper_locked 74813 230 <- zfs_time_stamper 76893 230 -> zfs_log_write 81054 230 <- zfs_log_write 89855 230 <- zfs_write 96257 230 <= writeAs you can see there is a log entry associated with every write(2) call. If the file was opened with the O_DSYNC flag, writes are supposed to be synchronous. For synchronous writes, the ZIL has to commit the zil transaction to stable storage before returning. For non-synchronous writes the ZIL holds on to the transaction in memory where it is held until the DMU transaction commits or there is an fsync or an O_DSYNC write.
zil.c walk thorough
There are several zil functions that operate on zil records.
What follows is a very brief description of their functionality.
zfs`dmu_objset_sync+0x6c
zfs`dsl_pool_sync+0x108
zfs`spa_sync+0xac
zfs`txg_sync_thread+0x130
unix`thread_start+0x4
ZFS Mount
During file system mount time, ZFS checks to see if there is an
intent log. If there is an intent log, this implies that the system
crashed (as the ZIL is deleted at umount(2)
time). This intent log is
converted to a replay log and is replayed to updated the file system
to a stable state. If both the replay log and intent log are
present, it implies that the system crashed while replaying the
replay log in which case it is OK to ignore/delete the replay log
and replay the intent log.
ZIL Performance
As you must have figured out by now, ZIL performance is critical for
performance of synchronous writes. A common application that issues
synchronous writes is a database. This means that all of these writes
run at the speed of the ZIL. The ZIL is already quite optimized, and I
am sure ongoing
efforts will optimize this code path even further.
As Neil
mentions, using nvram/solid state disks for
the log would make it scream!. I also recommend that you checkout Roch's
work on ZFS performance for details of other performance studies in
progress.
Dtrace scripts for use with zil
Technorati Tag: OpenSolaris
Technorati Tag: Solaris
Technorati Tag: ZFS