Solaris UFS lockfs protocol
On the occasion of today's full release of OpenSolaris (the open source version of Solaris, the best operating system on the planet), I would like to explain a small piece of Solaris UFS and an interesting bug I recently fixed in it.
Solaris UFS lockfs Protocol :
Purpose :
* Provides a facility to quiesce a file system
* Provides a facility for file system locking
* Provides a facility for forced unmount
UFS VNODE Operations :
The VNODE operations that implement UFS are defined by ufs_vnodeops.
To list the UFS VNODE operation functions on a running system, type the following as root :
"echo 'ufs_vnodeops::print vnodeops_t' | mdb -k"
Interface for user land :
Interface for ufs_vnodeops :
ufs_lockfs_begin{_getpage}()
ufs_lockfs_end()
ufs_quiesce()
The Protocol :
Various lock states of a UFS :
* WRITE LOCK
Suspends writes that would modify the file system. Access times are not kept while a file system is write locked.
* NAME LOCK
Suspends accesses that could change or remove existing directory entries.
* DELETE LOCK
Suspends accesses that could remove directory entries.
* HARD LOCK
Returns an error on every access to the locked file system, and cannot be unlocked. Hard locked file systems can be unmounted. The hard lock exists to support forcible unmount.
* ERROR LOCK
Blocks all local access to the file system and returns EWOULDBLOCK on all remote access. File systems are error locked by UFS on detection of an internal inconsistency. They may only be unlocked after successful repair by fsck, which is usually done automatically. Error locked file systems can be unmounted. Once the file system becomes clean, it may be upgraded to a hard lock.
* SOFT LOCK
Quiesces a file system.
* UNLOCK
Awakens suspended accesses, releases existing locks, and flushes the file system.
A ufs_vnodeops function that conflicts with one of the above file system lock types is either suspended or fails with EAGAIN; it fails with EIO if the file system is hard locked, and it blocks if the file system is error locked.
A per-UFS counter is incremented by 1 when a ufs_vnodeops function is entered and decremented by 1 when it exits.
A file system is in a quiescent state if the counter is zero.
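The counter rule can be sketched in a few lines of user-space C. This is only a model of the idea, not kernel code: the names fs_counter_t, counter_enter(), counter_exit() and counter_is_quiescent() are hypothetical, and the sketch is single-threaded, so it leaves out all synchronization.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for one mounted UFS; only the counter is modeled. */
typedef struct fs_counter {
    unsigned long vnops_cnt;   /* operations currently inside this file system */
} fs_counter_t;

/* Called when an operation enters the file system. */
void counter_enter(fs_counter_t *fs)
{
    fs->vnops_cnt++;
}

/* Called when the same operation leaves the file system. */
void counter_exit(fs_counter_t *fs)
{
    assert(fs->vnops_cnt > 0);   /* every exit must pair with an enter */
    fs->vnops_cnt--;
}

/* The file system is quiescent when no operation is inside it. */
bool counter_is_quiescent(const fs_counter_t *fs)
{
    return fs->vnops_cnt == 0;
}
```

In this model, anything that needs a quiet file system (a lock or a forced unmount) would wait until counter_is_quiescent() returns true.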
While a ufs_vnodeops function is executing on a UFS, it can call another ufs_vnodeops function on the same UFS or on a different UFS. This is called a recursive ufs_vnodeops. The per-UFS counter is not incremented or decremented during the recursive ufs_vnodeops.
As exceptions, the following ufs_vnodeops do not obey the locking protocol :
ufs_open, ufs_close, ufs_inactive, ufs_rwlock, ufs_rwunlock, ufs_putpage, ufs_addmap, ufs_delmap, ufs_poll
Implementation of The Protocol :
The ulockfs structure that implements the UFS lockfs protocol is embedded in ufsvfs_t, which is created when a UFS is mounted and stored in vfs_data of vfs_t.
The ul_vnops_cnt member of ulockfs acts as the per-UFS counter. It is incremented by 1 (in ufs_lockfs_begin{_getpage}()) when a ufs_vnodeops function is entered and decremented by 1 (in ufs_lockfs_end()) when it exits.
A file system is considered to be in a quiescent state (ufs_quiesce()) if ul_vnops_cnt is zero.
The function ufs_check_lockfs() checks whether a ufs_vnodeops function conflicts with the current file system lock type. If it does, the function is suspended or fails with EAGAIN, EIO or EWOULDBLOCK, depending on the current lock state of the file system.
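The decision described above can be sketched as a small table-like function in user-space C. This is not the kernel's ufs_check_lockfs(): every name below (lock_state_t, the OP_ bits, outcome_t, check_lock_model()) is hypothetical. The article does not say when a conflict gives EAGAIN rather than a suspension, so the can_sleep parameter is an assumption of this sketch, and the soft lock is left out because its conflicts are not described.

```c
#include <stdbool.h>

/* Hypothetical lock states, following the list in this article. */
typedef enum {
    LK_UNLOCKED, LK_WRITE, LK_NAME, LK_DELETE, LK_HARD, LK_ERROR
} lock_state_t;

/* Hypothetical description of what an operation may do. */
enum {
    OP_MODIFIES       = 1 << 0,  /* writes that modify the file system */
    OP_CHANGES_DIRENT = 1 << 1,  /* changes an existing directory entry */
    OP_REMOVES_DIRENT = 1 << 2   /* removes a directory entry */
};

typedef enum {
    OUT_PROCEED, OUT_SLEEP, OUT_EAGAIN, OUT_EIO, OUT_EWOULDBLOCK
} outcome_t;

outcome_t check_lock_model(lock_state_t st, unsigned op,
                           bool remote, bool can_sleep)
{
    bool conflict;

    switch (st) {
    case LK_HARD:               /* every access fails */
        return OUT_EIO;
    case LK_ERROR:              /* local access blocks, remote access fails */
        return remote ? OUT_EWOULDBLOCK : OUT_SLEEP;
    case LK_WRITE:
        conflict = (op & OP_MODIFIES) != 0;
        break;
    case LK_NAME:
        conflict = (op & (OP_CHANGES_DIRENT | OP_REMOVES_DIRENT)) != 0;
        break;
    case LK_DELETE:
        conflict = (op & OP_REMOVES_DIRENT) != 0;
        break;
    default:                    /* unlocked: nothing conflicts */
        conflict = false;
        break;
    }
    if (!conflict)
        return OUT_PROCEED;
    /* Assumption: a caller that may not sleep gets EAGAIN instead. */
    return can_sleep ? OUT_SLEEP : OUT_EAGAIN;
}
```

For example, a plain read under a write lock proceeds, while a write under the same lock is suspended (or gets EAGAIN in this model if the caller cannot sleep).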
To detect recursive ufs_vnodeops, a ulockfs_info_t is attached to every thread that calls one of the ufs_vnodeops functions.
Real World Advantages :
* HA service failover with the help of forced unmount
* Stops file system activity for taking backups
* Provides a reliable method for synchronously flushing all file system data to disk
BUG 4648917 :
I fixed the long-standing BUG 4648917 in Solaris 10 to correct the broken UFS lockfs protocol. The broken piece: the UFS lockfs implementation failed to detect recursive ufs_vnodeops on different file systems. As a result, the system sometimes panicked when one of the file systems was forcibly unmounted.
A system hit by BUG 4648917 will have a stack trace (SPARC version) like this in the crash file :
ufs_getpage ---> (A)
segvn_fault
as_fault
pagefault
trap
ktl0
hat_memload
uiomove
wrip
ufs_write ---> (B)
vn_rdwr
core_write
core_seg
do_core
psig
post_syscall
syscall_trap
The system panicked while a process was dumping core after an unexpected signal delivery. At (A) and (B), a thread of that process is executing recursive ufs_vnodeops on different file systems. The test case we came up with was simply a while (1) loop program with a big data segment (a static array of several megabytes).
To reproduce the problem, use a system with two UFS file systems (root and one other). Store the test program on the other file system and start it from root. Then send SIGFPE to the test program and forcibly unmount the other file system in parallel.
When the program receives SIGFPE, the default signal action initiates the core dump. While creating the core file (B), the kernel writes out the entire address space of the process and finds that part of the process' data segment is not present in the address space. That part has to be brought into the process' address space from the file system (A).
In the meantime, the other file system, where the executable of the process is stored, was forcibly unmounted. Because of BUG 4648917, the forcible unmount was allowed while a ufs_vnodeops was still in progress on that file system.
Since the file system was unmounted during an ongoing ufs_vnodeops, the system panicked with an unexpected data fault.
The root cause of the problem: the earlier UFS lockfs implementation used the flag T_DONTBLOCK in t_flag of kthread_t to indicate that a thread had entered one of the ufs_vnodeops functions. A single flag cannot tell file systems apart, so it fails to detect recursive ufs_vnodeops on different file systems.
I fixed this bug by attaching a ulockfs_info_t to every thread that calls one of the ufs_vnodeops functions on a UFS for the first time. A recursive ufs_vnodeops is detected by checking the thread for a ulockfs_info_t.
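The difference between the two schemes can be illustrated with a user-space model in C. It is a sketch under stated assumptions, not the kernel change itself: ufs_model_t, thread_model_t, lockfs_rec_t and the flag_*/rec_* functions are hypothetical names, the fixed-size record array stands in for whatever is really attached to the thread, and the model assumes that a thread's first entry into each file system is the one that is counted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NESTED_FS 4            /* hypothetical limit, for the sketch only */

typedef struct ufs_model { int vnops_cnt; } ufs_model_t;

/* Hypothetical per-thread record: which file system, and how deeply nested. */
typedef struct lockfs_rec { ufs_model_t *fs; int depth; } lockfs_rec_t;

typedef struct thread_model {
    bool         in_vnop_flag;          /* old scheme: one flag per thread */
    lockfs_rec_t recs[MAX_NESTED_FS];   /* new scheme: one record per fs */
} thread_model_t;

/* Old scheme. Returns true if this call was counted (outermost call). */
bool flag_enter(thread_model_t *t, ufs_model_t *fs)
{
    if (t->in_vnop_flag)
        return false;        /* looks recursive even when fs is different */
    t->in_vnop_flag = true;
    fs->vnops_cnt++;
    return true;
}

void flag_exit(thread_model_t *t, ufs_model_t *fs, bool counted)
{
    if (!counted)
        return;
    fs->vnops_cnt--;
    t->in_vnop_flag = false;
}

/* Find the thread's record for fs; fs == NULL finds a free slot. */
static lockfs_rec_t *find_rec(thread_model_t *t, ufs_model_t *fs)
{
    for (size_t i = 0; i < MAX_NESTED_FS; i++)
        if (t->recs[i].fs == fs)
            return &t->recs[i];
    return NULL;
}

/* New scheme: recursion is detected per file system. */
void rec_enter(thread_model_t *t, ufs_model_t *fs)
{
    lockfs_rec_t *r = find_rec(t, fs);
    if (r == NULL) {                 /* first entry into this file system */
        r = find_rec(t, NULL);
        assert(r != NULL);           /* sketch: no more free slots */
        r->fs = fs;
        r->depth = 0;
        fs->vnops_cnt++;
    }
    r->depth++;
}

void rec_exit(thread_model_t *t, ufs_model_t *fs)
{
    lockfs_rec_t *r = find_rec(t, fs);
    assert(r != NULL && r->depth > 0);
    if (--r->depth == 0) {           /* outermost exit for this file system */
        fs->vnops_cnt--;
        r->fs = NULL;
    }
}
```

Replaying the panic scenario in this model: with flag_enter(), a write on root followed by a getpage on the other file system leaves the other file system's counter at zero, so it looks quiescent and a forced unmount could go ahead. With rec_enter(), the same sequence leaves both counters at 1.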
Posted at 04:20PM Jun 14, 2005 by prabahar in General | Comments[1]