vendredi juin 19, 2009 ZFS AND DIRECTIO
UFS Directio
- output goes directly from application buffer to disk bypassing the filesystem core memory cache. - the FS is not constrained anymore to strictly obey the POSIX write ordering. The FS is thus able to allow multiple thread concurrently issuing some I/Os to a single file. - On input UFS DIO refrains from doing any form of readahead.In a sense, by taking out the middleman (the filesystem cache), UFS/DIO causes files to behave a lot like a raw device. Application reads and writes map one to one onto individual I/Os.
Anything for ZFS ?
don't do any more I/O that requested allow multiple concurrent I/O to a file.I note that UFS readahead is particularly bad for certain usage; when UFS sees access to 2 consecutive pages, it will read a full cluster and those are typically 1MB in sizes today. So avoiding UFS readahead has probably contributed greatly to the success of DIO. As for ZFS there are 2 levels of readahead (a.k.a prefetching). One that is filebased and one device based. Both are being reworked at this stage. I note that filebased readahead code has not and will not behave like UFS. On the other hand device level prefetching probably is being over agressive for DB type loads and it should be avoided. While I have not given hope of that this can be managed automatically, watch this space for tuning scripts to control the device prefetching behavior.
The 2 stages of ZFS
ZFS Performance and DB
The Dynamics of ZFS
Introducing ZFS
A volume manager is a layer of software that groups a set of block devices in order to implement some form of data protection and/or aggregation of devices exporting the collection as a storage volumes that behaves as a simple block device.
A filesystem is a layer that will manage such a block device using a subset of system memory in order to provide Filesystem operations (including Posix semantics) to applications and provide a hierarchical namespace for storage - files. Applications issue reads and writes to the Filesystem and the Filesystem issues Input and Output (I/O) operations to the storage/block device.
ZFS implements those 2 functions at once. It thus typically manages sets of block devices (leaf vdev), possibly grouping them into protected devices (RAID-Z or N-way mirror) and aggregating those top level vdevs into pool. Top level vdevs can be added to a pool at any time. Objects that are stored onto a pool will be dynamically striped onto the available vdevs.
Associated with pools, ZFS manages a number of very lightweight filesystem objects. A ZFS filesystem is basically just a set of properties associated with a given mount point. Properties of a filesystem includes the quota (maximum size) and reservation (guaranteed size) as well as, for example, whether or not to compress file data when storing blocks. The filesystem is characterized as lightweight because it does not statically associate with any physical disk blocks and any of its settable properties can be simply changed dynamically.
Recordsize
The recordsize is one of those properties of a given ZFS filesystem instance. ZFS files smaller than the recordsize are stored using a single filesystem block (FSB) of variable length in multiple of a disk sector (512 Bytes). Larger files are stored using multiple FSB, each of recordsize bytes, with default value of 128K.
The FSB is the basic file unit managed by ZFS and to which a checksum is applied. After a file grows to be larger than the recordsize (and gets to be stored with multiple FSB) changing the Filesystem's recordsize property will not impact the file in question. A copy of the file will inherit the tuned recordsize value. A FSB can be mirrored onto a vdev or spread to a RAID-Z device.
The recordsize is currently the only performance tunable of ZFS. The default recordsize may lead to early storage saturation: For many small updates (much smaller than 128K) to large files (bigger than 128K) the default value can cause an extra strain on the physical storage or on the data channel (such as a fiber channel) linking it to the host. For those loads, If one notices a saturated I/O channel then tuning the recordsize to smaller values should be investigated.
Transaction Groups
The basic mode of operation for writes operations that do not require synchronous semantics (no O_DSYNC, fsync(), etc), is that ZFS will absorb the operation in a per host system cache called Adaptive Replacement Cache (ARC). Since there is only one host system memory but potentially multiple ZFS pools, cached data from all pools is handled by a unique ARC.
Each file modification (e.g. a write) is associated with a certain transaction group (TXG). At regular interval (default of txg_time = 5 seconds) each TXG will shut down and the pool will issue a sync operation for that group. A TXG may also be shut down when the ARC indicates that there is too much dirty memory currently being cached. As a TXG closes, a new one immediately opens and file modifications then associate with the new active TXG.
If the active TXG shuts down while a previous one is still in the process of syncing data to the storage, then applications will be throttled until the running sync completes. In this situation where are sinking a TXG, while TXG + 1 is closed due to memory limitations or the 5 second clock and is waiting to sync itself; applications are throttled waiting to write to TXG + 2. We need sustained saturation of the storage or a memory constraint in order to throttle applications.
A sync of the Storage Pool will involve sending all level 0 data blocks to disk, when done, all level 1 indirect blocks, etc. until eventually all blocks representing the new state of the filesystem have been committed. At that point we update the ueberblock to point to the new consistent state of the storage pool.
ZFS Intent Log (ZIL)
For file modification that come with some immediate data integrity constraint (O_DSYNC, fsync etc.) ZFS manages a per-filesystem intent log or ZIL. The ZIL marks each FS operation (say a write) with a log sequence number. When a synchronous command is requested for the operation (such as an fsync), the ZIL will output blocks up to the sequence number. When the ZIL is in process of committing data, further commit operations will wait for the previous ones to complete. This allows the ZIL to aggregate multiple small transactions into larger ones thus performing commits using fewer larger I/Os.
The ZIL works by issuing all the required I/Os and then flushing the write caches if those are enabled. This use of disk write cache does not artificially improve a disk's commit latency because ZFS insures that data is physically committed to storage before returning. However the write cache allows a disk to hold multiple concurrent I/O transactions and this acts as a good substitute for drives that do not implement tag queues.
CAVEAT: The current state of the ZIL is such that if there is a lot of pending data in a Filesystem (written to the FS, not yet output to disk) and a process issues an fsync() for one of it's files, then all pending operations will have to be sent to disk before the synchronous command can complete. This can lead to unexpected performance characteristics. Code is under review.
I/O Scheduler and Priorities
ZFS keeps track of pending I/Os but only issues to disk controllers a certain number (35 by default). This allows the controllers to operate efficiently while never overflowing their queues. By limiting the I/O queue size, service times of individual disks are kept to reasonable values. When one I/O completes, the I/O scheduler then decides the next most important one to issue. The priority scheme is timed based; so for instance an Input I/O to service a read calls will be prioritize over any regular Output I/O issued in the last ~ 0.5 seconds.
The fact that ZFS will limit each leaf devices I/O queue to 35, is one of the reasons that suggests that zpool should be built using vdevs that are individual disks or at least volumes that map to small number of disks. Otherwise this self imposed limits could become an artificial performance throttle.
Read Syscalls
If a read cannot be serviced from the ARC cache, ZFS will issue a 'prioritized' I/O for the data. So even if the storage is handling a heavy output load, there are only 35 I/Os outstanding, all with reasonable service times. As soon as one of the 35 I/Os completes the I/O scheduler will issue the read I/O to the controller. This insures good service times for read operations in general.
However to avoid starvation, when there is a long-standing backlog of Output I/Os then eventually those regain priority over the Input I/O. ZIL synchronous I/Os are of the same priority to synchronous reads.
Prefetch
The prefetch code allowing ZFS to detect sequential or strided access to a file and issue I/O ahead of phase is currently under review. To quote the developer "ZFS prefetching needs some love".
Write Syscalls
ZFS never overwrites live data on-disk and will always output full records validated by a checksum. So in order to partially overwrite a file record, ZFS first has to have the corresponding data in memory. If the data is not yet cached, ZFS will issue an input I/O before allowing the write(2) to partially modify the file record. With the data now in cache, more writes can target the blocks. On output ZFS will checksum data before sending to disk. For full record overwrite the input phase is not necessary.
CAVEAT: Simple write calls (not O_DSYNC) are normally absorbed by the ARC cache and so proceed very quickly. Such a sustained dd(1)-like load can quickly overrun a large amount of system memory and cause transaction groups to eventually throttle all applications for large amount of time (10s of seconds). This is probably what underwrites the notion that ZFS needs more RAM (it does not). Write throttling code is under review.
Soft Track Buffer
An input I/O is serious business. While a Filesystem can decide where to write stuff out on disk, the Inputs are requested by applications. This means a necessary head seek to the location of the data. The time to issue a small read will be totally dominated by this seek. So ZFS takes the stance that it might as well amortize those operations and so, for uncached reads, ZFS normally will issue a fairly large Input I/O (64K by default). This will help loads that input data using similar access pattern to the output phase. The data goes into a per device cache holding 20MB.
This cache can be invaluable in reducing the I/Os necessary to read-in data. But just like the recordsize, if the inflated I/O cause a storage channel saturation the Soft Track Buffer can act as a performance throttle.
The ARC Cache
The most interesting caching occurs at the ARC layer. The ARC manages the memory used by blocks from all pools (each pool servicing many filesystems). ARC stands for Adaptive Replacement Cache and is inspired by a paper of Megiddo/Modha presented at FAST'03 Usenix conference.
That ARC manages it's data keeping a notion of Most Frequently Used (MFU) and Most Recently Use (MRU) balancing intelligently between the two. One of it's very interesting properties is that a large scan of a file will not destroy most of the cached data.
On a system with Free Memory, the ARC will grow as it starts to cache data. Under memory pressure the ARC will return some of it's memory to the kernel until low memory conditions are relieved.
We note that while ZFS has behaved rather well under 'normal' memory pressure, it does not appear to behave satisfactorily under swap shortage. The memory usage pattern of ZFS is very different to other filesystems such as UFS and so exposes VM layer issues in a number of corner cases. For instance, a number of kernel operations fails with ENOMEM not even attempting a reclaim operation. If they did, then ZFS would be responding by releasing some of it's own buffers allowing the initial operation to then succeed.
The fact that ZFS caches data in the kernel address space does mean that the kernel size will be bigger than when using traditional filesystems. For heavy duty usage it is recommended to use a 64-bit kernel i.e. any Sparc system or an AMD configured in 64-bit mode. Some systems that have managed in the past to run without any swap configured should probably start to configure some.
The behavior of the ARC in response to memory pressure is under review.
CPU Consumption
Recent enhancement to ZFS has improved it's CPU efficiency by a large factor. We don't expect to deviate from other filesystems much in terms of cycles per operations. ZFS checksums all disk blocks but this has not proven to be costly at all in terms of CPU consumption.
ZFS can be configured to compress on-disk blocks. We do expect to see some extra CPU consumption from that compression. While it is possible that compression could lead to some performance gain due to reduced I/O load, the emphasis of compression should be to save on-disk space not performance.
What About Your Test ?
This is what I know about the ZFS performance model today. My
performance
comparison on different types of modelled workloads made last fall already had ZFS
ahead on many of them; we have improved the
biggest issues highlighted then and there are further performance
improvements in the pipeline (based on UFS, we know this will never
end). Best Practices are being spelled out.
You can contribute by comparing your actual usage and workload pattern
with the simulated workloads. But nothing will beat having
reports from real workloads at this stage; Your results are
therefore of great interest to us.
And watch this space for updates...
One important performance parameter of ZFS is the recordsize which govern the size of filesystem blocks for large files. This is the unit that ZFS validates through checksums. Filesystem blocks are dynamically striped onto the pooled storage, on a block to virtual device (vdev) basis.
It is expected that for some loads, tuning the recordsize will be required. Note that, in traditional Filesytems such a tunable would govern the behavior of all of the underlying storage. With ZFS, tuning this parameter only affects the tuned Filesystem instance; it will apply to newly created files. The tuning is achieved using
zfs set recordsize=64k mypool/myfs
In ZFS all files are stored either as a single block of varying sizes (up to the recordsize) or using multiple recordsize blocks. Once a file grows to be multiple blocks, it's blocksize if definitively set to the FS recordsize at the time.
Some more experience will be required with the recordsize tuning. Here are some elements to guide along the way.
If one considers the input of a FS block typically in response to an application read, the size of the I/O in question will not basically impact the latency by much. So, as a first approximation, the recordsize does not matter (I'll come back to that) to read-type workloads.
For FS block outputs, those that are governed by the recordsize, actually occur mostly asynchronously with the application; and since applications are not commonly held up by those outputs, the delivered throughput is, as for read-type loads, not impacted by the recordsize.
So the first approximation is that recordsize does not impact performance much. To service loads that are transient in nature with short I/O bursts (< 5 seconds) we do not expect records tuning to be necessary. The same can be said for sequential type loads.
So what about the second approximation ? A problem that can occur with using an inflated recordsize (128K) compared to application read/write sizes, is early storage saturation. If an application requests 64K of data, then providing a 128K record doesn't change the latency that the application sees much. However if the extra data is discarded from the cache before ever being read, we see that the extra occupation of the data channel was occupied for no good reason. If a limiting factor to the storage is, for instance, a 100MB/sec channel, I can handle 700 times 128K records per second onto that channel. If I halves the recordsize that should double the number of small records I can input.
On the small record output loads, the system memory creates a buffer that defer the direct impact to applications. For output, if the storage is saturated this way for tens of seconds, ZFS will eventually throttle applications. This means that, in the end, when the recordsize leads to sustained storage overload on output, there will be an impact as well.
There is another aspect to the recordsize. A partial write to an uncached FS block (a write syscall of size smaller than the recordsize) will have to first input the corresponding data. Conversely, when individual writes are such that they cover full filesystem recordsize blocks, those writes can be handled without the need to input the associated FS blocks. Other consideration (metadata overhead, caching) dictates however that the recordsize not be reduced below a certain point (16K to 64K; do send-in your experience).
So, one advice is to keep an eye on the channel throughput and tune recordsize for random access workloads that saturate to storage. Sequential type workloads should work quite well with the current default recordsize. If the applications' read/write sizes can be increased, that should also be considered. For non-cached workloads that overwrites file data in small aligned chunks , then matching the recordsize with the write access size may bring some performance gains.
DOES ZFS REALLY USE MORE RAM ?
I'll touch 3 aspects of that question here :
- reported freemem
- syscall writes to mmap pages
- application write throttling
Reported freemem will be lower when running with ZFS than say UFS. The UFS page cache is considered as freemem. ZFS will return it's 'cache' only when memory is needed. So you will operate with lower freemem but won't normally suffer from this.
It's been wrongly feared that this mode of operation puts us back to the days of Solaris 2.6 and 7 where we saw a roaller coaster effect on freemem leading to sub-par application performance. We actually DO NOT have this problem with ZFS. The old problem came because the memory reaper could not distinguish between a useful application page and an UFS cached page. That was bad. ZFS frees up it's cache in a way that does not cause this problem.
ZFS is designed to release some of it's memory when kernel modules exert back pressure onto the kmem subsystem. Some kernel code that did not properly exert that pressure was recently fixed (short description here: 4034947).
There is one peculiar workload that does lead ZFS to consume more memory: writing (using syscalls) to pages that are also mmaped. ZFS does not use the regular paging system to manage data that passes through reads and writes syscalls. However mmaped I/O which is closely tied to the Virtual Memory subsystem still goes through the regular paging code . So syscall writting to mmaped pages, means we will keep 2 copies of the associated data at least until we manage to get the data to disk. We don't expect that type of load to commonly use large amount of ram.
Finally, one area where ZFS will behave quite differently from UFS is in throttling writters. With UFS, up to not long ago, we throttled a process trying to write to a file, as soon as that file had 0.5 M B of I/O pending associated with it. This limit has been recently upped to 16 MB. The gain of such throttling is that we prevent an application working on a single file or consuming inordinate amount of system memory. The downside is that we throttle an application possibly unnecessarely when memory is plenty.
ZFS will not throttle individual apps like this. The scheme is mutualized between all writers: when the global load of applications data overflows the I/O subsystem for 5 to 10 seconds then we throttle the applications allowing the I/O to catch up. Applications thus have a lot more ram to play with before being throttled.
This is probably what's behind the notion that ZFS likes more RAM. By and large, to cache some data, ZFS just needs the equivalent amount of RAM as any other filesystem. But currently, ZFS lets applications run a lot more decoupled from the I/O subsystem. This can speed up some loads by very large factor, but at times, will appear as extra memory consumption.
For Instance: - a RW_READER gets in and will keep a long time. ---| - a RW_WRITER hits the lock; is put on hold. | - other RW_READERS now also block. | .... time passes | - the long RW_READER releases <----------------| - the RW_WRITER gets the lock; work; releases - other RW_READER now work concurrently.
[T]: NiagaraCMT Solaris Sun
Numbers of Strands Search/sec Ratio 1 920 1 X 3 (1 core; 3 str/core) 2260 2.45 X 4 (1 core; 4 str/core) 2650 2.88 X 4 (4 core; 1 str/core) 4100 4.45 X 14 (7 cores, 2 str/core) 12500 13.59 X 21 (7 cores, 3 str/core) 16100 17.5 X 28 (7 cores; 4 str/core) 18200 19.8 X
[T]: NiagaraCMT Solaris Sun