Thursday Apr 07, 2005



So... recently the UFS team for Solaris 10 (of which I am a member) went through a very big exercise to create UFS technical documentation. This exercise proved to be immensely fruitful, for both the current Solaris UFS team and any future Solaris UFS teams. I learned a lot personally from this exercise, and we amassed so much great data that I think it can only be useful to the broader community. Thus my desire to share. My hope is that when OpenSolaris is ready for primetime, all of this data will be available to anyone who is interested in working on UFS.

I love working in the UFS source. I want to share that love with you all as well, because although UFS is very mature, it has many interesting features and quirks that are just fun to learn. UFS is a complicated beast and requires a lot of deep thought and many nights of grinding through code.

UFS is the keeper of the data for many folks running Solaris. As such, it matters a lot. It is one of those things that nobody notices while it works, but everybody notices when it goes bad. The challenge in keeping UFS running smoothly and being the best keeper of the data is what makes all of the hard work worthwhile. Working in UFS provides a good overall understanding of many parts of the Solaris kernel, since it interacts with so many subsystems: page cache, buffer cache, I/O subsystem, and more. It isn't the most glamorous code on the planet, but it is well worth the effort for learning fundamental Solaris kernel technology.

Since OpenSolaris is going to happen soon, I thought I would start blogging about UFS technical data, as evidenced by my first blog. I hope that this blogging might prove useful to those of you out there interested in Solaris UFS filesystem technology. I cannot share code just yet, but can share technical concepts and later give pointers to the code.

Locking in UFS
This blog post gives an introduction to locking in UFS. In the future this will be important data for developers to have and understand, as any feature addition will require an understanding of the locking. Many bug fixes require this understanding as well.

Today's topic will cover the following things:
  • Basics about some Solaris kernel locks
  • UFS inode locks
  • UFS inode queue locks
  • Generic VNODE layer locks
  • Generic VFS layer locks
  • General Lock ordering
  • A Directory lookup locking pseudo code example

There is an implicit assumption on my part that you have some basic knowledge of the VNODE and VFS layers, of basic locking principles such as reader/writer locks and mutual exclusion locks in Solaris, and of UFS inodes. If not, it would be good to do some pre-reading on these topics. A good place to start is the "Solaris Internals Core Kernel Architecture" book, by Jim Mauro and Richard McDougall.

Solaris Kernel Locks
In this section I will cover some very basic technical details of Solaris kernel locks, specifically those types used in UFS.
  • krwlock_t reader/writer lock, allows multiple readers or 1 writer at a time.
  • kmutex_t Mutual exclusion lock, allows one operator at a time.
    Solaris implements an adaptive mutex lock:
    • If holder is running, spin.
    • If holder is sleeping, sleep.

UFS Inode locks
There are four locks associated with UFS inodes:
  • i_rwlock(krwlock_t)
    • Serializes write requests. Allows reads to proceed in parallel. Serializes directory reads and updates.
    • Does not protect inode fields.
    • Indirectly protects block lists, since it serializes allocations/deallocations in UFS.
    • Must be taken prior to starting UFS logging transactions if operating on a file, otherwise taken after starting logging transaction.
  • i_contents(krwlock_t)
    • Protects most fields in the inode.
    • When held as a writer protects all the fields protected by the i_tlock as well.
  • i_tlock
    • When held with the i_contents reader lock it protects the following inode fields:
      • i_utime, i_ctime, i_mtime, i_flag, i_delayoff, i_delaylen, i_nextrio, i_writes, i_writer, i_mapcnt
    • Also used as mutex for write throttling in UFS
    • i_contents and i_tlock held together allows parallelism in updates.
  • i_hlock
    • UFS inode hash lock

UFS Inode Queue locks
  • ufs_scan_lock(kmutex_t)
    • Synchronizes ufs_scan_inodes threads.
    • Needed because UFS has global inode lists.
  • ufs_q->uq_mutex(kmutex_t)
    • Used to protect idle queues
      • ufs_junk_iq, ufs_useful_iq These are the two inode idle queues, and as you can guess from their names, one holds still potentially useful inodes, the other holds inodes known to not contain valid data.
  • ufs_hlock
    • Used by the hlock thread. For more information see man lockfs(1M), hardlock section.
  • ih_lock
    • Protects the inode hash. The inode hash is global, per system, not per filesystem.

VNODE locks
  • v_lock(kmutex_t)
    • Protects VNODE fields.
    • VN_HOLD/VN_RELE:
      • Uses v_lock
      • Increments/decrements reference count on VNODE by 1.
      • Used to ensure that the VNODE/inode does not go away while in use.

VFS locks
  • vfs_lock(kmutex_t)
    • Locks contents of filesystem and cylinder groups.
    • Also protects the vfs_dio (delayed I/O) field.
      • vfs_dio is the delayed I/O bit.
      • It is set via the _FIOSDIO ioctl.
  • vfs_dqrwlock
    • Manages quota subsystem quiescence.
    • Writer held means that the UFS quota subsystem can have major changes going on:
      • Disabling quotas, enabling quotas, setting new quota limits.
    • Protects the d_quot structure as well. This structure is used to keep track of all the enabled quotas per filesystem.
    • It is important to note that UFS shadow inodes, which are used to hold ACL data, and extended attribute directories are not counted against user quotas. Thus this lock is not held for updates to these.
    • Reader held for this lock indicates to the quota subsystem that major changes should not be occurring during that time.
    • Held when the i_contents writer lock is held, as described above, indicating that changes affecting user quotas are occurring.
    • Since UFS quotas can be enabled/disabled on the fly, this lock must be taken in all appropriate situations. It is not sufficient to check whether the UFS quota subsystem is enabled prior to taking the lock.
  • ufsvfs_mutex(kmutex_t)
    • Protects access to the list that links together all UFS filesystem instances.
    • This list is updated as a part of the mount operation.
    • Also used to allow syncing of all UFS filesystems.

UFS Inode Updates Lock Ordering
This pictorial representation of the ordering/weighting of UFS locks is intended to show 1) what each of the locks protects and 2) the order in which the locks must be taken if you need to protect the fields relevant to the specific lock. This does not mean that you must always take every lock shown, simply that you must take these in the order shown in the picture based on the fields you are trying to protect.

Example of how locks are used:

Directory Lookups and locking
Doing a directory lookup....
dp is the current directory inode we are searching for an entry in.
    rw_enter(&dp->i_rwlock, RW_READER); <---taken to avoid races with a dirremove in the dnlc directory cache. Not needed for the standard dnlc cache.
    RW_READER is taken in this case as we want to prevent a write from coming in and changing data out from underneath us.

    If found in dnlc directory cache && "." or ".." then we have to do the following:

      /*
      * release the lock on the dir we are searching
      * to avoid a deadlock when grabbing the
      * i_contents lock in getting the allocated inode.
      */
      rw_exit(&dp->i_rwlock); <--drop this lock on the directory in which we are searching. Can deadlock with i_contents (on the directory above) in the call to the function getting the allocated inode.
      rw_enter(&dp->i_ufsvfs->vfs_dqrwlock, RW_READER);
      get allocated inode;
      rw_exit(&dp->i_ufsvfs->vfs_dqrwlock);
      /*
      * must recheck as we dropped dp->i_rwlock
      */
      rw_enter(&dp->i_rwlock, RW_READER);
      ....
      Now do rechecks here to ensure that data has not changed on the dp(directory inode) during the time we dropped the lock.

    else
      Otherwise if not "." or ".." then proceed as normal for directory lookups
      rw_enter(&dp->i_ufsvfs->vfs_dqrwlock, RW_READER);
      get allocated inode; <-i_contents taken in here, no possible contention with the i_rwlock taken above...
      rw_exit(&dp->i_ufsvfs->vfs_dqrwlock);

      No need to recheck anything since we did not drop the i_rwlock.

The important takeaway from this example is that there are times when you must release and reacquire locks in UFS. When this is necessary, it is important to recheck your assumptions about the data you are working on, since it could have changed between the time the lock was released and the time it was reacquired.

There are many more locks used in UFS. This blog covers only the portion of them that I felt made a good introduction to UFS locking. Perhaps I will expand on this topic in a future blog.

Tuesday Apr 05, 2005




UFS in Solaris 10
I spent the better part of the last year getting to know UFS. I think we are on a first name basis now :-). Thus, I begin my blog debut with some interesting UFS bugs and how they were fixed.

UFS had many improvements integrated into Solaris 10 and Solaris 9 9/04: bug fixes, logging on by default and general robustness improvements. In this post I will talk about three specific bug fixes which affect the UFS tunable maxcontig and therefore aspects of UFS performance.

4639871 Logging ufs fails to boot from ATA drive on Ultra-10 if maxphys is too large
4638166 Ultra 5/10 panics with simba and pci errors if logging enabled and maxphys > 1MB
4349828 Inconsiderate tuning of maxcontig causes scsi bus to hang

As a result of these bugs, UFS in Solaris 10 and Solaris 9 9/04 was modified to change the values that could be used to set maxcontig and subsequently the value used for the maximum transfer size when I/O was issued.

Previously, an inconsiderate value set for either maxcontig or maxphys (in /etc/system) could result in a hung system. This was because the filesystem I/O request size was calculated using the value set for maxcontig. The maximum transfer size of the underlying device was never considered when calculating the size of the I/O transfer in UFS.

In UFS, the filesystem cluster size, for both reads and writes, is set to the value set for maxcontig. The filesystem cluster size is used to determine:

  • The maximum number of logical blocks contiguously laid out on disk for a UFS filesystem before inserting a rotational delay.
  • When to read ahead and/or write behind, and by how much, if the sequential I/O case is found. The algorithm that determines sequential read ahead in UFS is broken, so system administrators use the maxcontig value to tune their filesystems to achieve better random I/O performance.
  • The UFS filesystem cluster size also indicates how many pages to attempt to push out to disk at a time. It also determines the frequency of pushing pages because in UFS pages are clustered for writes, based on the filesystem cluster size.


How These Bugs Were Fixed:
1) The UFS filesystem cluster size (maxcontig) and the I/O transfer size were separated, removing the dependency that was causing systems to hang. UFS will no longer allow a setting of maxcontig to interrupt or hang any I/O requests to the device. UFS will always issue I/O requests that are <= the maximum transfer size of the device hosting the filesystem.

The UFS filesystem cluster size is still set using the value indicated for maxcontig. The I/O transfer size will be set in UFS as shown below.

2) The value for rotational delay (gap in mkfs(1M), -d in tunefs(1M)) no longer makes sense. The devices today are very sophisticated and do not need a delay artificially built in via software. As noted above, the value of maxcontig determines the length of contiguous blocks placed on disk before space is inserted to account for rotational delay. The value for rotational delay has been obsoleted in Solaris 10 and Solaris 9 9/04 and now defaults to 0, ensuring contiguous allocation.

Transfer size of I/O requests in UFS:
The device that hosts the filesystem will be queried for the maximum transfer size it can handle, and the UFS I/O transfer size will default to this value if it is obtainable. If the device does not support reporting its maximum transfer size, the maximum transfer will be set using:

  • min(maxphys, ufs_maxmaxphys).

  • ufs_maxmaxphys is currently set to 1MB.


If, however, the user sets the value of maxcontig to be less than the maximum device transfer size, UFS will honor the value of maxcontig as the maximum value for data transfers on this device.
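
The selection rules above can be written down as a small function. This is a sketch under assumptions, not the UFS source: the function name, the constant name, the convention that 0 means "not reported" or "not set", and the choice to express maxcontig in bytes are all mine. Only the 1MB value and the min(maxphys, ufs_maxmaxphys) fallback come from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* The post gives ufs_maxmaxphys as 1MB; the macro name is hypothetical. */
#define TOY_UFS_MAXMAXPHYS ((size_t)1024 * 1024)

static size_t toy_min(size_t a, size_t b)
{
    return a < b ? a : b;
}

/*
 * Pick the maximum I/O transfer size, in bytes.
 *   dev_max_xfer     device's maximum transfer size, 0 if it cannot report one
 *   maxphys          system-wide limit (must be non-zero)
 *   maxcontig_bytes  cluster size implied by maxcontig, 0 if not set by the user
 */
size_t toy_ufs_iosize(size_t dev_max_xfer, size_t maxphys, size_t maxcontig_bytes)
{
    assert(maxphys > 0);

    /* Prefer what the device reports; otherwise fall back to the capped maxphys. */
    size_t limit = (dev_max_xfer != 0)
        ? dev_max_xfer
        : toy_min(maxphys, TOY_UFS_MAXMAXPHYS);

    /* A smaller user-chosen maxcontig is honored; a larger one never raises the limit. */
    if (maxcontig_bytes != 0 && maxcontig_bytes < limit)
        limit = maxcontig_bytes;

    return limit;
}
```

For example, with a device that reports 256KB and a maxcontig worth 64KB, the sketch yields 64KB; with a maxcontig worth 512KB it stays at 256KB, which is the separation between cluster size and transfer size described above.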

maxcontig:
The default value is determined from the disk drive's maximum transfer size as noted above. Any positive integer value is acceptable when setting this parameter, via tunefs(1M) or mkfs(1M).