Thursday Apr 07, 2005



So... recently the UFS team for Solaris 10 (of which I am a member) went through a very big exercise to create UFS technical documentation. The exercise proved immensely fruitful, both for the current Solaris UFS team and for any future ones. I learned a lot from it personally, and we amassed so much great data that I think it can only be useful to the broader community. Thus my desire to share. My hope is that when OpenSolaris is ready for prime time, all of this data will be available to anyone who is interested in working on UFS.

I love working in the UFS source. I want to share that love with you all as well, because although UFS is very mature, it has many interesting features and quirks that are just fun to learn. UFS is a complicated beast and requires a lot of deep thought and many nights of grinding through code.

UFS is the keeper of the data for many folks running Solaris. As such, it matters a lot. It is one of those things that nobody notices until it goes bad. The challenge of keeping UFS running smoothly and being the best keeper of the data is what makes all of the hard work worthwhile. Working in UFS provides a good overall understanding of many parts of the Solaris kernel, since it interacts with so many subsystems: page cache, buffer cache, I/O subsystem, and more. It isn't the most glamorous code on the planet, but it is well worth the effort for learning fundamental Solaris kernel technology.

Since OpenSolaris is going to happen soon, I thought I would start blogging about UFS technical data, as evidenced by my first blog. I hope that this blogging might prove useful to those of you out there interested in Solaris UFS filesystem technology. I cannot share code just yet, but can share technical concepts and later give pointers to the code.

Locking in UFS
This blog post gives an introduction to locking in UFS. In the future this will be important data for developers to have and understand, since any feature addition will require an understanding of the locking. Many bug fixes require this understanding as well.

Today's topic will cover the following things:
  • Basics about some Solaris kernel locks
  • UFS inode locks
  • UFS inode queue locks
  • Generic VNODE layer locks
  • Generic VFS layer locks
  • General Lock ordering
  • A Directory lookup locking pseudo code example

I am assuming that you have some basic knowledge of the VNODE and VFS layers, of basic locking principles such as reader/writer locks and mutual exclusion locks in Solaris, and of UFS inodes. If not, it would be good to do some pre-reading on these topics. A good place to start is the "Solaris Internals: Core Kernel Architecture" book, by Jim Mauro and Richard McDougall.

Solaris Kernel Locks
In this section I will cover some very basic technical details of Solaris kernel locks, specifically those types used in UFS.
  • krwlock_t: reader/writer lock. Allows multiple readers or one writer at a time.
  • kmutex_t: mutual exclusion lock. Allows one holder at a time.
    Solaris implements an adaptive mutex lock:
    • If the holder is running, spin.
    • If the holder is sleeping, sleep.
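
As a rough userland analogy (POSIX threads, not the Solaris kernel krwlock_t/kmutex_t interfaces), the sketch below shows the difference in semantics between the two lock types. The function names are made up for illustration, and the adaptive spin-or-sleep behavior is not modeled at all.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/*
 * Userland analogy only: pthread_rwlock_t stands in for krwlock_t.
 * Returns 1 if two read holds succeed and a write attempt is then refused.
 */
int
demo_rwlock_semantics(void)
{
    pthread_rwlock_t rw;
    int rc, r1, r2, w;

    rc = pthread_rwlock_init(&rw, NULL);
    assert(rc == 0);
    (void) rc;

    r1 = pthread_rwlock_tryrdlock(&rw);   /* first reader */
    r2 = pthread_rwlock_tryrdlock(&rw);   /* second reader, held at the same time */
    w = pthread_rwlock_trywrlock(&rw);    /* writer attempt while readers hold it */

    /* Release exactly what was acquired. */
    if (w == 0)
        (void) pthread_rwlock_unlock(&rw);
    if (r2 == 0)
        (void) pthread_rwlock_unlock(&rw);
    if (r1 == 0)
        (void) pthread_rwlock_unlock(&rw);
    (void) pthread_rwlock_destroy(&rw);

    return (r1 == 0 && r2 == 0 && w == EBUSY);
}

/*
 * Userland analogy only: pthread_mutex_t stands in for kmutex_t.
 * Returns 1 if the first hold succeeds and a second attempt is refused.
 */
int
demo_mutex_semantics(void)
{
    pthread_mutex_t m;
    int rc, first, second;

    rc = pthread_mutex_init(&m, NULL);    /* default, non-recursive mutex */
    assert(rc == 0);
    (void) rc;

    first = pthread_mutex_trylock(&m);
    second = pthread_mutex_trylock(&m);   /* already held: one holder at a time */

    if (second == 0)
        (void) pthread_mutex_unlock(&m);
    if (first == 0)
        (void) pthread_mutex_unlock(&m);
    (void) pthread_mutex_destroy(&m);

    return (first == 0 && second == EBUSY);
}
```

Both functions use the non-blocking "try" variants so that the refusal shows up as a return value instead of a hang.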

UFS Inode locks
There are four locks associated with UFS inodes:
  • i_rwlock(krwlock_t)
    • Serializes write requests. Allows reads to proceed in parallel. Serializes directory reads and updates.
    • Does not protect inode fields.
    • Indirectly protects block lists, since it serializes allocations/deallocations in UFS.
    • Must be taken before starting a UFS logging transaction if operating on a file; otherwise it is taken after starting the logging transaction.
  • i_contents(krwlock_t)
    • Protects most fields in the inode.
    • When held as a writer, also protects all the fields protected by i_tlock.
  • i_tlock
    • When held together with the i_contents reader lock, protects the following inode fields:
      • i_utime, i_ctime, i_mtime, i_flag, i_delayoff, i_delaylen, i_nextrio, i_writes, i_writer, i_mapcnt
    • Also used as the mutex for write throttling in UFS.
    • Holding i_contents (as reader) and i_tlock together allows parallelism in updates.
  • i_hlock
    • UFS inode hash lock
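
To make the i_contents/i_tlock split a little more concrete, here is a minimal userland sketch using POSIX threads. The struct demo_inode and every function in it are hypothetical stand-ins, not the real UFS inode or kernel lock API; only i_contents and i_tlock are modeled, and i_writes is used as a plain counter purely for demonstration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

#define DEMO_MAX_THREADS        16
#define DEMO_UPDATES_PER_THREAD 1000

/* Hypothetical userland stand-in for a UFS inode. */
struct demo_inode {
    pthread_rwlock_t i_contents;  /* protects most fields */
    pthread_mutex_t  i_tlock;     /* protects the small, frequently updated fields */
    long             i_writes;    /* covered by i_contents (reader) + i_tlock */
    long             i_size;      /* covered by i_contents (writer) */
};

/* Hot-field update: many threads may hold i_contents as reader at once. */
void
demo_note_write(struct demo_inode *ip)
{
    pthread_rwlock_rdlock(&ip->i_contents);
    pthread_mutex_lock(&ip->i_tlock);
    ip->i_writes++;
    pthread_mutex_unlock(&ip->i_tlock);
    pthread_rwlock_unlock(&ip->i_contents);
}

/* Ordinary field update: i_contents as writer excludes everyone else. */
void
demo_set_size(struct demo_inode *ip, long size)
{
    pthread_rwlock_wrlock(&ip->i_contents);
    ip->i_size = size;
    pthread_rwlock_unlock(&ip->i_contents);
}

static void *
demo_worker(void *arg)
{
    struct demo_inode *ip = arg;

    for (int i = 0; i < DEMO_UPDATES_PER_THREAD; i++)
        demo_note_write(ip);
    return (NULL);
}

/* Run up to DEMO_MAX_THREADS workers and return the final i_writes value. */
long
demo_run(int nthreads)
{
    struct demo_inode ino;
    pthread_t tids[DEMO_MAX_THREADS];
    int rc, started = 0;
    long total;

    if (nthreads > DEMO_MAX_THREADS)
        nthreads = DEMO_MAX_THREADS;

    rc = pthread_rwlock_init(&ino.i_contents, NULL);
    assert(rc == 0);
    rc = pthread_mutex_init(&ino.i_tlock, NULL);
    assert(rc == 0);
    (void) rc;
    ino.i_writes = 0;
    ino.i_size = 0;

    while (started < nthreads &&
        pthread_create(&tids[started], NULL, demo_worker, &ino) == 0)
        started++;
    for (int i = 0; i < started; i++)
        (void) pthread_join(tids[i], NULL);

    /* Writer hold on i_contents also covers the i_tlock fields. */
    pthread_rwlock_wrlock(&ino.i_contents);
    total = ino.i_writes;
    pthread_rwlock_unlock(&ino.i_contents);

    (void) pthread_mutex_destroy(&ino.i_tlock);
    (void) pthread_rwlock_destroy(&ino.i_contents);
    return (total);
}
```

The point of the design is that the short i_tlock critical section is the only place the updaters serialize; none of them has to take i_contents as a writer just to bump a counter or a timestamp.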

UFS Inode Queue locks
  • ufs_scan_lock(kmutex_t)
    • Synchronizes ufs_scan_inodes threads.
    • Needed because UFS has global inode lists.
  • ufs_q->uq_mutex(kmutex_t)
    • Used to protect the idle queues:
      • ufs_junk_iq, ufs_useful_iq: these are the two inode idle queues. As you can guess from their names, one holds inodes that are still potentially useful, and the other holds inodes known not to contain valid data.
  • ufs_hlock
    • Used by the hlock thread. For more information see the lockfs(1M) man page, hardlock section.
  • ih_lock
    • Protects the inode hash. The inode hash is global, per system, not per filesystem.
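
The idea of two idle queues, each guarded by its own queue mutex, can be sketched in userland C as follows. Everything here is hypothetical (struct demo_idle_inode, struct demo_q, the has_valid_data flag, the function names); it only illustrates the "pick a queue, take that queue's mutex, link the inode in" pattern and makes no claim about how the real ufs_q structure is laid out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical idle inode: just a link and a "still useful" flag. */
struct demo_idle_inode {
    struct demo_idle_inode *next;
    int                     has_valid_data;
};

/* Hypothetical idle queue, with its own mutex in the spirit of uq_mutex. */
struct demo_q {
    pthread_mutex_t         uq_mutex;
    struct demo_idle_inode *head;
    int                     count;
};

static struct demo_q demo_junk_iq   = { PTHREAD_MUTEX_INITIALIZER, NULL, 0 };
static struct demo_q demo_useful_iq = { PTHREAD_MUTEX_INITIALIZER, NULL, 0 };

/* Put an inode on the queue that matches its state. */
void
demo_idle(struct demo_idle_inode *ip)
{
    struct demo_q *q;

    assert(ip != NULL);
    q = ip->has_valid_data ? &demo_useful_iq : &demo_junk_iq;

    pthread_mutex_lock(&q->uq_mutex);
    ip->next = q->head;
    q->head = ip;
    q->count++;
    pthread_mutex_unlock(&q->uq_mutex);
}

/* Read the queue length under the queue's mutex. */
int
demo_q_count(struct demo_q *q)
{
    int n;

    pthread_mutex_lock(&q->uq_mutex);
    n = q->count;
    pthread_mutex_unlock(&q->uq_mutex);
    return (n);
}
```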

VNODE locks
  • v_lock(kmutex_t)
    • Protects VNODE fields.
    • VN_HOLD/VN_RELE:
      • Uses v_lock
      • Increments/decrements reference count on VNODE by 1.
      • Used to ensure that the VNODE/inode does not go away while in use.
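
The hold/release idea can be sketched in userland C like this. It is a simplified illustration under my own assumptions: struct demo_vnode and the demo_vn_* functions are hypothetical, and the real VN_RELE does more work when the last reference goes away than simply reporting it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a vnode: a lock and a reference count. */
struct demo_vnode {
    pthread_mutex_t v_lock;   /* protects v_count */
    unsigned int    v_count;  /* number of active references */
};

/* A new vnode starts with one reference, owned by its creator. */
void
demo_vn_init(struct demo_vnode *vp)
{
    int rc = pthread_mutex_init(&vp->v_lock, NULL);

    assert(rc == 0);
    (void) rc;
    vp->v_count = 1;
}

/* Analogy for VN_HOLD: bump the reference count by 1 under v_lock. */
void
demo_vn_hold(struct demo_vnode *vp)
{
    pthread_mutex_lock(&vp->v_lock);
    vp->v_count++;
    pthread_mutex_unlock(&vp->v_lock);
}

/*
 * Analogy for VN_RELE: drop the reference count by 1 under v_lock.
 * Returns 1 if this was the last reference, 0 otherwise.
 */
int
demo_vn_rele(struct demo_vnode *vp)
{
    int last;

    pthread_mutex_lock(&vp->v_lock);
    assert(vp->v_count > 0);
    vp->v_count--;
    last = (vp->v_count == 0);
    pthread_mutex_unlock(&vp->v_lock);
    return (last);
}
```

As long as a caller sits between its own hold and release, the count cannot reach zero, which is what keeps the VNODE/inode from going away underneath it.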

VFS locks
  • vfs_lock(kmutex_t)
    • Locks contents of filesystem and cylinder groups.
    • Also protects the vfs_dio (delayed I/O) field.
      • vfs_dio is the delayed I/O bit.
      • It is set via the _FIOSDIO ioctl.
  • vfs_dqrwlock(krwlock_t)
    • Manages quota subsystem quiescence.
    • When held as writer, the UFS quota subsystem can have major changes going on:
      • Disabling quotas, enabling quotas, setting new quota limits.
    • Also protects the d_quot structure, which is used to keep track of all the enabled quotas per filesystem.
    • It is important to note that UFS shadow inodes, which are used to hold ACL data, and extended attribute directories are not counted against user quotas. Thus this lock is not held for updates to them.
    • When held as reader, it tells the quota subsystem that major changes should not occur during that time.
    • Held while the i_contents writer lock is held, as described above, indicating that changes affecting user quotas are occurring.
    • Since UFS quotas can be enabled/disabled on the fly, this lock must be taken in all appropriate situations. It is not sufficient to check whether the UFS quota subsystem is enabled before taking the lock.
  • ufsvfs_mutex(kmutex_t)
    • Protects access to the list that links together all UFS filesystem instances.
    • The list is updated as part of the mount operation.
    • Also used to allow syncing of all UFS filesystems.
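
The "take the lock first, then look at whether quotas are enabled" rule for vfs_dqrwlock can be sketched in userland C as below. This is an analogy under stated assumptions: struct demo_ufsvfs, the quotas_enabled flag, the dq_lock mutex and blocks_charged are all hypothetical names, not fields of the real UFS structures.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for per-filesystem quota state. */
struct demo_ufsvfs {
    pthread_rwlock_t vfs_dqrwlock;   /* quiesces the quota subsystem */
    pthread_mutex_t  dq_lock;        /* hypothetical: protects blocks_charged */
    int              quotas_enabled; /* changes only under the writer lock */
    long             blocks_charged;
};

void
demo_fs_init(struct demo_ufsvfs *fs)
{
    int rc;

    rc = pthread_rwlock_init(&fs->vfs_dqrwlock, NULL);
    assert(rc == 0);
    rc = pthread_mutex_init(&fs->dq_lock, NULL);
    assert(rc == 0);
    (void) rc;
    fs->quotas_enabled = 0;
    fs->blocks_charged = 0;
}

/* Major change (enable/disable): writer lock keeps all chargers out. */
void
demo_set_quotas(struct demo_ufsvfs *fs, int on)
{
    pthread_rwlock_wrlock(&fs->vfs_dqrwlock);
    fs->quotas_enabled = on;
    pthread_rwlock_unlock(&fs->vfs_dqrwlock);
}

/*
 * Charge blocks against the quota.
 * Returns 1 if the blocks were charged, 0 if quotas were off.
 */
int
demo_charge(struct demo_ufsvfs *fs, long nblocks)
{
    int charged = 0;

    /*
     * Take the reader lock unconditionally. Checking quotas_enabled
     * before taking it would race with demo_set_quotas().
     */
    pthread_rwlock_rdlock(&fs->vfs_dqrwlock);
    if (fs->quotas_enabled) {
        pthread_mutex_lock(&fs->dq_lock);
        fs->blocks_charged += nblocks;
        pthread_mutex_unlock(&fs->dq_lock);
        charged = 1;
    }
    pthread_rwlock_unlock(&fs->vfs_dqrwlock);
    return (charged);
}
```

With the flag read only while the reader lock is held, an enable or disable cannot slip in between the check and the update.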

UFS Inode Updates Lock Ordering
This pictorial representation of the ordering/weighting of UFS locks is intended to show 1) what each of the locks protects and 2) the order in which the locks must be taken if you need to protect the fields relevant to the specific lock. This does not mean that you must always take every lock shown, simply that you must take the ones you need in the order shown in the picture, based on the fields you are trying to protect.
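
As a hedged illustration of what "taking locks in order" means in practice, here is a small userland C sketch of a per-thread lock-order checker. The ranking used (i_rwlock, then vfs_dqrwlock, then i_contents, then i_tlock) is my inference from the lock descriptions above and from the directory lookup example in this post; treat it as an assumption, not as the authoritative UFS ordering. The enum and the demo_order_* functions are hypothetical and do not exist in UFS.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed ordering for illustration only: lower rank is taken first. */
enum demo_rank {
    DEMO_I_RWLOCK = 1,
    DEMO_VFS_DQRWLOCK = 2,
    DEMO_I_CONTENTS = 3,
    DEMO_I_TLOCK = 4
};

#define DEMO_MAX_HELD 8

/* Per-thread stack of the ranks currently held. */
static _Thread_local int    demo_held[DEMO_MAX_HELD];
static _Thread_local size_t demo_nheld;

/*
 * Record that the caller is about to take a lock of rank r.
 * Returns 0 if the order is respected, -1 if a lock of equal or
 * higher rank is already held (or the stack is full).
 */
int
demo_order_enter(enum demo_rank r)
{
    if (demo_nheld == DEMO_MAX_HELD)
        return (-1);
    if (demo_nheld > 0 && demo_held[demo_nheld - 1] >= (int)r)
        return (-1);
    demo_held[demo_nheld++] = (int)r;
    return (0);
}

/*
 * Record that the caller released a lock of rank r.
 * Returns 0 if it was the most recently taken lock, -1 otherwise.
 */
int
demo_order_exit(enum demo_rank r)
{
    if (demo_nheld == 0 || demo_held[demo_nheld - 1] != (int)r)
        return (-1);
    demo_nheld--;
    return (0);
}
```

Note that the checker allows skipping ranks (for example i_rwlock followed directly by i_contents), which matches the point that you do not have to take every lock, only respect the order among the ones you do take.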

Example of how locks are used:

Directory Lookups and locking
Doing a directory lookup...
dp is the inode of the directory in which we are searching for an entry.
    rw_enter(&dp->i_rwlock, RW_READER); <---taken to avoid races with a dirremove in the dnlc directory cache. Not needed for the standard dnlc cache.
    RW_READER is taken in this case because we want to prevent a write from coming in and changing data out from underneath us.

    If found in dnlc directory cache && "." or ".." then we have to do the following:

      /*
      * release the lock on the dir we are searching
      * to avoid a deadlock when grabbing the
      * i_contents lock in getting the allocated inode.
      */
      rw_exit(&dp->i_rwlock);<--drop this lock on the directory in which we are searching. Holding it can deadlock with i_contents (on the directory above) in the call to the function that gets the allocated inode.
      rw_enter(&dp->i_ufsvfs->vfs_dqrwlock, RW_READER);
      get allocated inode;
      rw_exit(&dp->i_ufsvfs->vfs_dqrwlock);
      /*
      * must recheck as we dropped dp->i_rwlock
      */
      rw_enter(&dp->i_rwlock, RW_READER);
      ....
      Now do rechecks here to ensure that data has not changed on the dp(directory inode) during the time we dropped the lock.

    else
      Otherwise if not "." or ".." then proceed as normal for directory lookups
      rw_enter(&dp->i_ufsvfs->vfs_dqrwlock, RW_READER);
      get allocated inode; <-i_contents taken in here, no possible contention with the i_rwlock taken above...
      rw_exit(&dp->i_ufsvfs->vfs_dqrwlock);

      No need to recheck anything since we did not drop the i_rwlock.

The important takeaway from this example is that there are times when you must release and reacquire locks in UFS. When that is necessary, it is important to recheck your assumptions about the data you are working on, since it could have changed between the time the lock was released and the time it was reacquired.
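
The release, reacquire, and recheck pattern can be sketched in userland C as follows. This is a minimal illustration under my own assumptions: struct demo_dir, the d_gen generation counter, and the while_unlocked callback are hypothetical devices for showing the pattern, not how the UFS lookup code actually detects changes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical directory: a lock and a counter bumped on every update. */
struct demo_dir {
    pthread_rwlock_t i_rwlock;
    unsigned long    d_gen;
};

void
demo_dir_init(struct demo_dir *dp)
{
    int rc = pthread_rwlock_init(&dp->i_rwlock, NULL);

    assert(rc == 0);
    (void) rc;
    dp->d_gen = 0;
}

/* A directory update: writer lock, then bump the generation. */
void
demo_dir_update(struct demo_dir *dp)
{
    pthread_rwlock_wrlock(&dp->i_rwlock);
    dp->d_gen++;
    pthread_rwlock_unlock(&dp->i_rwlock);
}

/*
 * Look something up, drop the lock for a while, then recheck.
 * while_unlocked (may be NULL) runs during the unlocked window and
 * simulates whatever else happens in the system at that point.
 * Returns 1 if what we saw is still valid, 0 if the caller must redo
 * its work because the directory changed.
 */
int
demo_lookup_with_drop(struct demo_dir *dp,
    void (*while_unlocked)(struct demo_dir *))
{
    unsigned long seen;
    int still_valid;

    pthread_rwlock_rdlock(&dp->i_rwlock);
    seen = dp->d_gen;                     /* state our decisions are based on */
    pthread_rwlock_unlock(&dp->i_rwlock); /* dropped to avoid a deadlock */

    if (while_unlocked != NULL)
        while_unlocked(dp);

    pthread_rwlock_rdlock(&dp->i_rwlock);
    still_valid = (dp->d_gen == seen);    /* recheck after reacquiring */
    pthread_rwlock_unlock(&dp->i_rwlock);

    return (still_valid);
}
```

Calling demo_lookup_with_drop(&d, NULL) models the quiet case, while passing demo_dir_update as the callback models an update sneaking in during the unlocked window.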

There are many more locks used in UFS. This blog covers only the portion that I felt made a good introduction to UFS locking. Perhaps I will expand on this topic in a future blog.
Comments:

Excellent explanation of the UFS lock mechanism. If there are some examples of the bugs caused by the lock errors, it'll be more self-evident and informative. Cheers. Joey

Posted by skyhorse on April 26, 2005 at 07:32 AM MST #

