Tuesday Jun 14, 2005

Solaris UFS lockfs protocol



On the occasion of today's full release of OpenSolaris (the open source version of Solaris, the best operating system on the planet), I would like to explain a small piece of Solaris UFS and an interesting BUG I recently fixed in it.

Solaris UFS lockfs Protocol :

Purpose :
* Provides a facility to quiesce a file system
* Provides a facility for file system locking
* Provides a facility for forced unmount

UFS VNODE Operations :
The VNODE operations that implement UFS are defined by ufs_vnodeops. To list the UFS VNODE operation functions on a running system, type as root : "echo 'ufs_vnodeops::print vnodeops_t' | mdb -k"

Interface for user land :

IOCTL _FIOLFS on a UFS

Interface for ufs_vnodeops :

ufs_lockfs_begin{_getpage}()
ufs_lockfs_end()
ufs_quiesce()

The Protocol :

Lock states of a UFS :

* WRITE LOCK
Suspends writes that would modify the file system. Access times are not kept while a file system is write locked.
* NAME LOCK
Suspends accesses that could change or remove existing directory entries.
* DELETE LOCK
Suspends accesses that could remove directory entries.
* HARD LOCK
Returns an error on every access to the locked file system, and cannot be unlocked. Hard locked file systems can be unmounted. Hard lock exists to support forcible unmount.
* ERROR LOCK
Blocks all local access to the file system and returns EWOULDBLOCK on all remote access. File systems are error locked by UFS on detection of an internal inconsistency. They may only be unlocked after successful repair by fsck, which is usually done automatically. Error locked file systems can be unmounted. Once the file system becomes clean, it may be upgraded to a hard lock.
* SOFT LOCK
Quiesces a file system.
* UNLOCK
Awakens suspended accesses, releases existing locks, and flushes the file system.

ufs_vnodeops functions that conflict with the current file system lock type are suspended, get an EAGAIN error, get an EIO error if the file system is hard locked, or block if the file system is error locked.

A per-UFS counter is incremented by 1 when a ufs_vnodeops function is entered and decremented by 1 when that function exits.

A file system is in a quiescent state if the counter is zero.

While a ufs_vnodeops function is executing on a UFS, it can lead to a call to another ufs_vnodeops function on the same UFS or on a different UFS. This is called recursive ufs_vnodeops. The per-UFS counter is not incremented or decremented during recursive ufs_vnodeops.
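The counting rule above can be sketched as a small user-space model in C. This is only an illustration, not the kernel code: fs_model, thread_fs_state, model_begin(), model_end() and model_is_quiescent() are hypothetical names, and the sketch assumes a single thread on a single file system, tracking recursion with a simple depth value.

```c
#include <stdbool.h>

/* Hypothetical stand-in for one mounted UFS; only the counter is modeled. */
struct fs_model {
    int vnops_cnt;      /* outermost operations currently inside this file system */
};

/* Hypothetical per-thread state for one file system. */
struct thread_fs_state {
    int depth;          /* how deeply this thread is nested in operations */
};

/* Enter an operation: only the outermost entry touches the counter. */
static void model_begin(struct fs_model *fs, struct thread_fs_state *st)
{
    if (st->depth++ == 0)
        fs->vnops_cnt++;
}

/* Leave an operation: only the outermost exit touches the counter. */
static void model_end(struct fs_model *fs, struct thread_fs_state *st)
{
    if (--st->depth == 0)
        fs->vnops_cnt--;
}

/* The file system is quiescent when no operation is inside it. */
static bool model_is_quiescent(const struct fs_model *fs)
{
    return fs->vnops_cnt == 0;
}
```

In this model a nested model_begin()/model_end() pair leaves the counter at 1, and the file system only reads as quiescent again after the outermost model_end().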

As exceptions, the following ufs_vnodeops functions do not obey the locking protocol :
ufs_open, ufs_close, ufs_inactive, ufs_rwlock, ufs_rwunlock, ufs_putpage, ufs_addmap, ufs_delmap, ufs_poll

Implementation of The Protocol :

The structure ulockfs used to implement the UFS lockfs protocol is embedded in ufsvfs_t. The ufsvfs_t is created while mounting a UFS and stored in vfs_data of vfs_t.

The member ul_vnops_cnt of structure ulockfs acts as the per-UFS counter. It is incremented by 1 (in ufs_lockfs_begin{_getpage}()) when a ufs_vnodeops function is entered, and decremented by 1 (in ufs_lockfs_end()) when that function exits.

A file system is considered to be in a quiescent state (ufs_quiesce()) if ul_vnops_cnt is zero.

The function ufs_check_lockfs() checks whether a ufs_vnodeops function conflicts with the current file system lock type. If it does, the function is either suspended or gets EAGAIN, EIO, or EWOULDBLOCK, depending on the current file system lock state.
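As a rough illustration of that decision, here is a user-space sketch in C. It is not the real ufs_check_lockfs(): the enum, the MODEL_SUSPEND value, and the conflicts, remote, and can_wait parameters are all assumptions made for the example. In particular, the rule that a caller which cannot wait gets EAGAIN instead of being suspended is an assumption of this sketch, and the soft lock is left out because the text above does not say how it affects individual operations.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical lock states, following the list in this article. */
enum model_lock_state {
    MODEL_UNLOCKED,
    MODEL_WRITE_LOCK,
    MODEL_NAME_LOCK,
    MODEL_DELETE_LOCK,
    MODEL_HARD_LOCK,
    MODEL_ERROR_LOCK
};

/* Hypothetical marker: the caller would sleep until the lock goes away. */
#define MODEL_SUSPEND (-1)

/*
 * Returns 0 to proceed, MODEL_SUSPEND to wait, or an errno value.
 * Note: EAGAIN and EWOULDBLOCK share one value on many platforms.
 */
static int model_check_lockfs(enum model_lock_state state, bool conflicts,
    bool remote, bool can_wait)
{
    if (state == MODEL_HARD_LOCK)
        return EIO;                 /* every access fails */
    if (state == MODEL_ERROR_LOCK)  /* local access blocks, remote fails */
        return remote ? EWOULDBLOCK : MODEL_SUSPEND;
    if (state == MODEL_UNLOCKED || !conflicts)
        return 0;                   /* nothing stands in the way */
    /* Assumption: callers that cannot wait get EAGAIN instead. */
    return can_wait ? MODEL_SUSPEND : EAGAIN;
}
```

For example, a conflicting write under MODEL_WRITE_LOCK with can_wait set yields MODEL_SUSPEND, while any access under MODEL_HARD_LOCK yields EIO.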

To detect recursive ufs_vnodeops, a ulockfs_info_t is attached to every thread that calls one of the ufs_vnodeops functions.

Real World Advantages :
* HA service failover with the help of forced unmount
* Stops file system activity while backups are taken
* Provides a reliable method for synchronously flushing all file system data to disk
 

BUG 4648917 :
I fixed a long-standing BUG, 4648917, in Solaris 10 to correct the broken UFS lockfs protocol. The broken piece: the UFS lockfs implementation failed to detect recursive ufs_vnodeops on different file systems. As a result, the system sometimes panicked if one of the file systems was forcibly unmounted.

A system hit by BUG 4648917 will have a stack trace (SPARC version) like this in the crash dump :

ufs_getpage                ---> (A)
segvn_fault
as_fault
pagefault
trap
ktl0
hat_memload
uiomove
wrip
ufs_write                       ---> (B)
vn_rdwr
core_write
core_seg
do_core
psig
post_syscall
syscall_trap

The system panicked while a process was trying to dump core after an unexpected signal delivery. At (A) and (B), a thread from the process is doing recursive ufs_vnodeops on different file systems. The test case we came up with was just a while (1) loop program with a big data segment (a static array of several megabytes).

To reproduce the problem, use a system with two UFS file systems (root and another file system). Store the test program on the other file system and start it from root. Then send SIGFPE to the test program and forcibly unmount the other file system in parallel.

When the program receives SIGFPE, the default signal action initiates the core dump. While creating the core file (B) (the kernel writes out the entire address space of the process), the kernel detects that some part of the process' data segment is not present in the process' address space, so it has to be brought in from the file system (A).

In the meantime, the other file system, where the executable of the process is stored, was forcibly unmounted. Due to BUG 4648917, the forcible unmount was allowed during an ongoing ufs_vnodeops on the other file system.

Since the file system was unmounted during an ongoing ufs_vnodeops, the system panicked with an unexpected data fault.

The root cause of the problem: the earlier UFS lockfs protocol implementation used the flag T_DONTBLOCK in t_flag of kthread_t to indicate that a thread had entered one of the ufs_vnodeops functions. A single flag per thread cannot tell which file system the thread has entered, so it fails to detect recursive ufs_vnodeops on different file systems.

I fixed this BUG by attaching a ulockfs_info_t to every thread that calls one of the ufs_vnodeops functions on a UFS for the first time. Recursive ufs_vnodeops is detected by checking the thread for a ulockfs_info_t.
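The difference between the two schemes can be illustrated with a small user-space model in C. This is a sketch under stated assumptions, not the kernel implementation: fs_model, thread_model, flag_begin() and list_begin() are hypothetical names, the fixed-size array stands in for whatever per-thread structure the kernel really uses, and only the effect on the per-file-system counter is modeled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one mounted UFS. */
struct fs_model {
    int vnops_cnt;                  /* operations counted as inside this fs */
};

#define MODEL_MAX_FS 4

/* Hypothetical per-thread state for both schemes. */
struct thread_model {
    bool in_vnop_flag;              /* old scheme: one flag per thread */
    const struct fs_model *entered[MODEL_MAX_FS]; /* new scheme: per-fs records */
    size_t n_entered;
};

/* Old scheme. Returns 1 if the entry was counted, 0 if treated as recursive. */
static int flag_begin(struct thread_model *t, struct fs_model *fs)
{
    if (t->in_vnop_flag)
        return 0;       /* looks recursive, even when fs is a different fs */
    t->in_vnop_flag = true;
    fs->vnops_cnt++;
    return 1;
}

/* New scheme. Returns 1 if counted, 0 if recursive on fs, -1 if out of slots. */
static int list_begin(struct thread_model *t, struct fs_model *fs)
{
    for (size_t i = 0; i < t->n_entered; i++)
        if (t->entered[i] == fs)
            return 0;   /* this thread is already inside this fs */
    if (t->n_entered == MODEL_MAX_FS)
        return -1;
    t->entered[t->n_entered++] = fs;    /* first entry into this fs */
    fs->vnops_cnt++;
    return 1;
}
```

With flag_begin(), entering root for the write (B) and then the other file system for the page fault (A) leaves the other file system's counter at zero, so it looks quiescent and a forced unmount can proceed. With list_begin(), the second entry is counted against the other file system as well.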



Comments:

Yep, I agree that Memory Management System of Solaris is especially Superior. Yeah, there are no memory leaks. It's better than Linux. I am interested in the superiority of Solaris memory management architecture, "Why?". Exactly Solaris is one of the stable & safety Operating System, as expected on Business.

Posted by cobra on June 14, 2005 at 08:14 PM PDT #
