More UFS technical tidbits in anticipation of OpenSolaris. Today's talk is about UFS I/O. It is a complicated beast and has many different parts and paths it can take.
Overview of file system I/O in Solaris:

The interaction of UFS and the VM subsystem has been the cause of numerous bugs and hard-to-find problems. Today's blog is an overview of UFS I/O, with particular attention paid to the VM subsystem interaction. The paths taken when a read() system call is initiated are described in detail to show how UFS and the VM subsystem interact. I am assuming that readers of this blog have some basic Solaris file system knowledge, or at a minimum understand some of the basic Solaris file system terminology.
Basic Solaris VM facts
Solaris virtual memory is demand paged and globally managed. File caching is integrated into it, and it is layered to allow VM to describe multiple memory types. The paging vnode cache is the unification of file and memory management by use of a vnode object: each page of memory is identified by a <vnode, offset> tuple. The UFS file system uses this relationship to implement caching for vnodes. The paging vnode cache provides a set of functions for cache management and I/O for vnodes.
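To make the <vnode, offset> naming concrete, here is a minimal user-space sketch of a page cache keyed by that tuple. Everything in it (struct model_page, model_page_enter(), model_page_lookup(), the 4096-byte page size and the bucket count) is hypothetical and chosen for illustration; it is not the kernel's page_t or its page hash, only a model of the idea.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed page size for this model */
#define MODEL_NBUCKETS  64u     /* arbitrary; must be a power of two */

/* Hypothetical stand-in for a cached page; not the kernel's page_t. */
struct model_page {
    const void        *vnode;   /* identity of the file object */
    uint64_t           offset;  /* page-aligned byte offset in the file */
    struct model_page *next;    /* hash chain link */
};

static struct model_page *model_buckets[MODEL_NBUCKETS];

/* Round a byte offset down to the start of its page. */
static uint64_t model_page_align(uint64_t offset)
{
    return offset & ~(uint64_t)(MODEL_PAGE_SIZE - 1);
}

/* Hash a <vnode, page-aligned offset> pair to a bucket index. */
static size_t model_hash(const void *vnode, uint64_t aligned)
{
    uintptr_t v = (uintptr_t)vnode;
    return (size_t)(((v >> 4) ^ (aligned / MODEL_PAGE_SIZE)) &
        (MODEL_NBUCKETS - 1));
}

/* Give a page its <vnode, offset> name and make it findable. */
static void model_page_enter(struct model_page *pp, const void *vnode,
    uint64_t offset)
{
    size_t h;

    pp->vnode = vnode;
    pp->offset = model_page_align(offset);
    h = model_hash(vnode, pp->offset);
    pp->next = model_buckets[h];
    model_buckets[h] = pp;
}

/* Find the page caching the byte at `offset` of `vnode`; NULL on a miss. */
static struct model_page *model_page_lookup(const void *vnode, uint64_t offset)
{
    uint64_t aligned = model_page_align(offset);
    struct model_page *pp;

    for (pp = model_buckets[model_hash(vnode, aligned)]; pp != NULL;
        pp = pp->next) {
        if (pp->vnode == vnode && pp->offset == aligned)
            return pp;
    }
    return NULL;
}
```

In this model, a lookup for any byte inside a cached page returns that page, while the same offset under a different vnode is a miss, which is the property a file system relies on when it uses the vnode as the cache key.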
The paging vnode cache functions are named with a pvn_<xxx> prefix. The source code for this is located at: xxxx. Some of the more important paging vnode functions are listed below, with basic function descriptions. Also shown are pointers to the code so you can get more detailed data about each of these.
Some important paging vnode cache functions:
pvn_read_kluster():
Finds a range of contiguous pages within the supplied address/length that fit within the <vnode, offset> values and do not already exist in the cache.
Caller should call pagezero() on any part of last page that is not read from disk.
pvn_write_kluster():
Finds dirty pages within the offset and length. Returns a list of locked pages ready to be written.
Caller then sets up write call with pageio_setup().
Write is initiated via a call to bdev_strategy().
Synchronous writes require the caller to call pvn_write_done(). Otherwise io_done() will call this when write is complete.
pvn_vplist_dirty():
Finds all pages in page cache >= offset and pushes these pages.
Will cluster pages with adjacent pages if it can.
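The klustering idea behind these functions can be sketched in user space. The block below is a simplified model, not the pvn_read_kluster() implementation: model_read_kluster() and its bitmap-style `present` array are hypothetical, and locking, page allocation and I/O setup are all left out. It only shows how a run of adjacent, not-yet-cached pages around a target page can be found within given bounds.

```c
#include <stddef.h>

/*
 * Hypothetical model of read klustering.
 *
 * present[i] is nonzero if page i of the file is already cached.
 * Starting at page `target`, grow a run of missing pages in both
 * directions without leaving the window [lo, hi).
 *
 * Returns the number of pages in the run and stores the first page
 * index in *startp. Returns 0 if the target is outside the window or
 * is already cached (nothing to read).
 */
static size_t model_read_kluster(const unsigned char *present,
    size_t lo, size_t hi, size_t target, size_t *startp)
{
    size_t start, end;

    if (target < lo || target >= hi || present[target])
        return 0;

    /* Extend backwards over missing pages. */
    start = target;
    while (start > lo && !present[start - 1])
        start--;

    /* Extend forwards over missing pages. */
    end = target + 1;
    while (end < hi && !present[end])
        end++;

    *startp = start;
    return end - start;
}
```

For example, with cached pages marked as {1,0,0,0,1,0} and a window covering all six pages, asking for page 2 gives a run of three pages starting at page 1, so one disk read can fill the whole gap instead of three separate ones.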
What is a seg_map and why do you care?
The seg_map segment maintains mappings of pieces of files into kernel address space. It is only used by file systems, and it allows copying of data between user and kernel address space. At any given time, the seg_map segment has some portion of the total file system cache mapped into the kernel address space. The seg_map segment driver divides the segment into file system block sized slots.
Some important seg_map functions:
segmap_getmap() && segmap_getmapflt():
Retrieves or creates mapping
getmapflt allows for creation of segment if not found, calls ufs_getpage()
segmap_release():
Releases the mapping for a file segment
segmap_pagecreate():
Creates new pages of memory and slots in the seg_map for a given file
Used for extending files or writing holes to a file
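Because the segment is divided into fixed-size slots, a read or write has to be broken into chunks that each fit inside one slot. The sketch below models only that arithmetic in user space. The names (model_next_chunk(), model_count_chunks(), struct slot_chunk) are hypothetical, and the 8192-byte slot size is an assumption made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_SLOTSIZE   8192u                       /* assumed slot size */
#define MODEL_SLOTOFFSET (MODEL_SLOTSIZE - 1)        /* offset-in-slot mask */
#define MODEL_SLOTMASK   (~(uint64_t)MODEL_SLOTOFFSET) /* slot base mask */

/* One piece of a transfer that lies entirely inside a single slot. */
struct slot_chunk {
    uint64_t slot_base;  /* file offset at which the slot starts */
    size_t   mapon;      /* where the transfer starts inside the slot */
    size_t   n;          /* bytes to move through this slot */
};

/* Compute the chunk for a transfer at file offset uoff with resid bytes left. */
static struct slot_chunk model_next_chunk(uint64_t uoff, size_t resid)
{
    struct slot_chunk c;
    size_t room;

    c.slot_base = uoff & MODEL_SLOTMASK;
    c.mapon = (size_t)(uoff & MODEL_SLOTOFFSET);
    room = MODEL_SLOTSIZE - c.mapon;     /* bytes left in this slot */
    c.n = resid < room ? resid : room;
    return c;
}

/* Count how many slot mappings a whole transfer would need. */
static size_t model_count_chunks(uint64_t uoff, size_t resid)
{
    size_t chunks = 0;

    while (resid > 0) {
        struct slot_chunk c = model_next_chunk(uoff, resid);

        /* A real caller would map the slot, copy c.n bytes, and release it. */
        uoff += c.n;
        resid -= c.n;
        chunks++;
    }
    return chunks;
}
```

Under these assumptions, a 20000-byte transfer starting at offset 10000 touches three slots: the tail of the slot at 8192, the full slot at 16384, and the first part of the slot at 24576.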
The fbuf structure is important for mapping and getting data from the segmap driver. It is defined as follows:
struct fbuf {
	caddr_t fb_addr;	/* address of the mapped data */
	uint_t fb_count;	/* size of the mapping in bytes */
};
This structure is used to get a mapping to part of a file via the segkmap interfaces. It is also used by the pseudo bio functions (shown below) for reading and writing of data. UFS directory reading uses an fbuf to get at the on-disk directory contents via a call to blkatoff().
seg_vn and UFS and memory mapped I/O:
Memory mapping allows a file to be mapped into a process's address space. This mapping is done via the VOP_MAP call and the seg_vn memory driver. File pages are read when a fault occurs in the address space. The seg_vn driver enables I/O without process-initiated system calls. I/O is performed in units of pages, upon reference to the pages mapped into the address space. Reads are initiated by a memory access; writes are initiated as the VM subsystem finds dirty pages in the mapped address space.
So, why not use the seg_vn driver for non-mmap'd I/O as well? It could be used for mapping the file into the kernel's address space, but seg_vn is a complex segment driver that manages the mapping of protections, copy-on-write fault handling, shared memory, and so on. This is too heavyweight for what is needed for read and write system calls, so the seg_map driver was developed. Read and write system calls only require a few basic mapping functions, since they do not map files into a process's address space. seg_map reduces locking complexity and gives better performance.
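The memory-mapped path can be seen from user space with the standard POSIX mmap() interface. The sketch below uses a hypothetical helper, sum_file_mmap(), which maps a file read-only and touches every byte. No read() call is made after the mapping is set up; the data arrives as the loop references pages that are not yet resident.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Sum the bytes of a file through a private, read-only mapping.
 * Returns 0 and stores the sum in *sump on success, -1 on error
 * (including an empty file, which cannot be mapped).
 */
static int sum_file_mmap(const char *path, unsigned long *sump)
{
    struct stat st;
    unsigned long sum = 0;
    unsigned char *p;
    off_t i;
    int fd;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    p = (unsigned char *)mmap(NULL, (size_t)st.st_size, PROT_READ,
        MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping remains valid after close */
    if ((void *)p == MAP_FAILED)
        return -1;

    /* The first touch of each page may fault and pull it in from the file. */
    for (i = 0; i < st.st_size; i++)
        sum += p[i];

    munmap(p, (size_t)st.st_size);
    *sump = sum;
    return 0;
}
```

Calling sum_file_mmap() on a file containing the three bytes "abc" stores 294 in *sump. The equivalent read()-based loop would instead go through the seg_map path discussed in this section.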
Pseudo bio functions:
Solaris has a set of interfaces which are considered buffered I/O interfaces, but which are used only to read and write buffers containing directory entries. These interfaces all use the seg_map driver to map and address file data. The functions are fbread(), fbwrite(), fbrelse(), fbdwrite(), fbiwrite(), and fbzero(). Although these are not directly shown in the picture above, they are important enough to be worth mentioning.
A UFS/VM example, read() system call - non mmap'd:
Note: In general UFS caches the pages for write, but will also cache pages for reads if they are frequently reusable.
read()->ufs_read()->rdip():
Checks whether directio [1] is enabled; if so, tries to bypass the page cache
If cache_read_ahead is set, sets the appropriate flags for placement of pages on the cache list (used in freebehind [2])
Calculates whether we need to free pages behind our read (freebehind); this will come in later
If i_contents (reader) [3] is held, drops it to avoid deadlock in ufs_getpage().
Calls segmap_getmapflt() which transitions to ufs_getpage() since we are forcing a fault via S_READ
ufs_getpage():
If the calling thread already owns the current i_contents lock, there is no need to acquire the lock. Also checks to see if the vfs_dqrwlock is required.
Checks to see if the file has holes via bmap_has_holes(), this will be important later
For a read in ufs_getpage() loop through all the pages in the range off, off + len:
Call ufs_getpage_ra() to initiate an asynchronous read ahead of the current page. This helps us in page_lookup() process later.
Check if we should initiate a read ahead of the next cluster of bytes; the cluster size is determined from the UFS maxcontig [4] value. Read ahead is true if:
seqmode [5] && pgoff + cluster size >= i_nextrio (start of next cluster) && pgoff <= i_nextrio && i_nextrio < current file size
Call page_lookup() to see if page is in page cache
if yes, update appropriate pointers, continue
If no, call ufs_getpage_miss():
The page is either read from disk or created. It is created without a disk read if we call it with S_CREATE or, in the case of read(), if there is a hole in the file at this offset (not backed by a real disk block)
Calls uiomove() to move data into pages
We start freeing pages behind the current read if i_nextr (the next byte offset, which was set after reading in the pages) > smallfile offset (32k); because we are reading in sequential mode, we know we won't need them
Calls segmap_release() regardless; if cachemode is set to freebehind (SM_FREE|SM_DONTNEED|SM_ASYNC), this will put the pages at the head of the page cache
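The decisions made along this read path can be collected into a small user-space model. This is a sketch built from the conditions as described in this post, not the UFS source: struct model_inode and the model_* functions are hypothetical, seqmode is treated as a boolean gate on the read-ahead test (an assumption about how the terms combine), and the freebehind test uses only the i_nextr > smallfile comparison given in the walkthrough.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_SMALLFILE (32u * 1024u)   /* the 32k smallfile offset */

/* Hypothetical subset of the inode fields used by the read path. */
struct model_inode {
    uint64_t i_nextr;    /* next byte offset expected for a read */
    uint64_t i_nextrio;  /* start of the next cluster to read ahead */
    uint64_t i_size;     /* current file size in bytes */
};

/* Sequential mode: the read starts where the last one ended, no create. */
static bool model_seqmode(const struct model_inode *ip, uint64_t off,
    bool creating)
{
    return ip->i_nextr == off && !creating;
}

/* Read-ahead test for the page at pgoff, with cluster size clustsz. */
static bool model_should_readahead(const struct model_inode *ip,
    bool seqmode, uint64_t pgoff, uint64_t clustsz)
{
    return seqmode &&
        pgoff + clustsz >= ip->i_nextrio &&   /* close to the next cluster */
        pgoff <= ip->i_nextrio &&             /* but not already past it */
        ip->i_nextrio < ip->i_size;           /* and it lies inside the file */
}

/* Free-behind test: only once the read has moved past the small-file limit. */
static bool model_should_freebehind(const struct model_inode *ip,
    bool freebehind)
{
    return freebehind && ip->i_nextr > MODEL_SMALLFILE;
}
```

With i_nextr at 64k, i_nextrio at 128k, a 1 MB file and a 64k cluster, a read at offset 64k is sequential, triggers read ahead of the next cluster, and is far enough into the file for the pages behind it to be freed. A read at offset 0 on the same inode is none of these.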
[1] UFS directio will be saved for a later post.
[2] freebehind is always set to 1.
[3] The i_contents lock is a krwlock_t which is part of the UFS inode data structure. It protects most of the inode's contents. See my previous blog posting on UFS locking for more details.
[4] See my previous blog post regarding the use of maxcontig in UFS.
[5] seqmode is determined from the i_nextr field in the current working inode. i_nextr represents the next byte offset for reads. If i_nextr == current offset and we are not creating a page, then we set seqmode == 1.