Page Fault Handling in Solaris
Welcome to OpenSolaris! In
this entry, I'll walk through the page fault handling code,
which is ground zero of the Solaris virtual memory subsystem. Due to
the nature of this level in the system, part of the code (the lowest
level that interfaces to hardware registers) is machine dependent,
while the rest is common code written in C. Hence, I will present this
topic in three parts: x64 machine dependent code, which has the most
hardware handling for TLB misses, followed by the more
complex SPARC machine dependent code, which relies on assembly code to
handle TLB misses from trap context; I'll wrap up by covering the
common code which is executed from kernel context.
Part 1: x64 Machine Dependent Layer
Since all x86-class machines handle TLB misses using a hardware page
table walk mechanism, the Hardware Address Translation, or HAT, layer
for x64 systems is the less complex of the two system architectures
Solaris currently supports. Both x86 and AMD64 systems use a
page directory scheme to map per-address-space virtual memory
addresses to physical memory addresses. When a TLB miss occurs, the
MMU (memory management unit) hardware searches the page table for the
page table entry (PTE) associated with the virtual address of the
memory access, if one exists. In the page directory model, the virtual
address is divided up into several parts; each successive part of the
virtual address forms an index into each successive level in the
directory, while each higher-level directory entry points to the
address in memory of the next lower-level directory. Each directory table
is 4K in size, which corresponds to the base page size of the
processor. The pointer to the top-level page directory is programmed
into the cr3 hardware register on context switch.
![[Directory based page table]](http://blogs.sun.com/roller/resources/elowe/pagedir.gif)
Directory-based page tables
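To make the index arithmetic concrete, here is a minimal C sketch of how a virtual address splits into per-level indices. It assumes the four-level, 48-bit layout of AMD64 long mode with 4K pages and 8-byte entries (so 512 entries per 4K table); the names pt_index() and page_offset() are made up for illustration and do not come from the Solaris source.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12  /* 4K base page: the low 12 bits are the byte offset */
#define LEVEL_BITS  9   /* 4K table / 8-byte entries = 512 entries = 2^9 */
#define NUM_LEVELS  4   /* assumption: four-level walk, 48-bit virtual address */

/*
 * Return the index used at the given directory level.
 * Level 0 is the lowest level (the one holding PTEs); level 3 is the
 * top-level table whose address is loaded into cr3.
 */
unsigned
pt_index(uint64_t va, unsigned level)
{
    assert(level < NUM_LEVELS);
    return (unsigned)((va >> (PAGE_SHIFT + LEVEL_BITS * level)) &
        ((1u << LEVEL_BITS) - 1));
}

/* Return the byte offset within the 4K page. */
unsigned
page_offset(uint64_t va)
{
    return (unsigned)(va & ((1u << PAGE_SHIFT) - 1));
}
```

The hardware walk uses pt_index(va, 3) to pick an entry in the top-level table, follows that entry to the next table, and repeats down to level 0. A page fault is raised when an entry along the way is not valid.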
Since we're discussing the page fault path in this blog entry, we are
interested in the case where the processor fails to find a valid PTE
in the lowest level of the directory. This results in a page fault
exception (#pf), which passes control synchronously to a page
fault handler in trap context. This low-level handler is
pftrap(), located in exception.s. The
handler jumps to cmntrap() over in locore.s
which pushes the machine state onto the stack, switches to kernel
context, and invokes the C kernel-side trap handler trap() in
trap.c
with a trap type of T_PGFLT. The trap() routine
determines that this is a user fault because the faulting address lies below
KERNELBASE, and calls pagefault() in vm_machdep.c. The
pagefault() routine collects the necessary arguments for the
common as_fault() routine, and passes control to it.
For more information regarding the x64 HAT layer, refer to Joe Bonasera's blog, where
he has started writing about this subsystem, which he and Nils
Nieuwejaar redesigned from the ground up for the AMD64 port in Solaris
10.
Part 2: SPARC Machine Dependent Layer
The UltraSPARC architecture -- the only SPARC architecture currently
supported by Solaris -- relies entirely on software to handle TLB
misses1. Hence, the HAT layer for SPARC
is a bit more complex than the x64 one. To speed up handling of TLB
miss traps, the processor provides a hardware-assisted lookup
mechanism2 called the Translation
Storage Buffer (TSB). The TSB is a virtually indexed,
direct-mapped, physically contiguous, and size-aligned region of
physical memory which is used to cache recently used Translation Table
Entries (TTEs) after retrieval from the page tables. When a TLB miss
occurs, the hardware uses the virtual address of the miss combined
with the contents of a TSB base address register (which is
pre-programmed on context switch) to calculate the pointer into the
TSB of the entry corresponding to the virtual address. If the TSB
entry tag matches the virtual address of the miss, the TTE is loaded
into the TLB by the TLB miss handler, and the trapped instruction is
retried. See DTLB_MISS() in trap_table.s
and sfmmu_udtlb_slowpath in sfmmu_asm.s. If
no match is found, the trap handler branches to a slow path routine
called the TSB miss handler3.
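The TSB lookup described above behaves like a direct-mapped software cache. Here is a minimal C sketch of that idea, not the real sfmmu code: the entry layout, the table size, and the function names are all hypothetical, and the real TSB tag also carries context information that this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TSB_PAGE_SHIFT 13   /* 8K base page */
#define TSB_ENTRIES    512  /* hypothetical size; must be a power of two */

struct tsb_entry {          /* hypothetical layout, not the hardware format */
    uint64_t tag;           /* virtual page number that owns this slot */
    uint64_t tte;           /* cached translation; 0 means "empty" here */
};

/* Direct-mapped: every virtual page maps to exactly one slot. */
size_t
tsb_index(uint64_t va)
{
    return (size_t)((va >> TSB_PAGE_SHIFT) & (TSB_ENTRIES - 1));
}

/* Return the cached TTE on a hit, or 0 on a miss (take the slow path). */
uint64_t
tsb_lookup(const struct tsb_entry *tsb, uint64_t va)
{
    const struct tsb_entry *e = &tsb[tsb_index(va)];

    return (e->tte != 0 && e->tag == (va >> TSB_PAGE_SHIFT)) ? e->tte : 0;
}

/* Cache a translation, evicting whatever occupied the slot before. */
void
tsb_insert(struct tsb_entry *tsb, uint64_t va, uint64_t tte)
{
    struct tsb_entry *e = &tsb[tsb_index(va)];

    assert(tte != 0);       /* 0 is reserved as the "empty" marker */
    e->tag = va >> TSB_PAGE_SHIFT;
    e->tte = tte;
}
```

Because the table is direct-mapped, two virtual pages that share an index evict each other, which is why a tag compare is needed before the cached TTE can be trusted.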
The SPARC HAT layer (named sfmmu after the codename
spitfire MMU, the first UltraSPARC MMU supported) uses an open
hashing technique to implement the page tables in software. The hash
lookup is performed using the struct hat pointer for the
currently running process and the virtual address of the TLB miss. On
a TSB miss, the function sfmmu_tsb_miss_tt in sfmmu_asm.s
searches the hash for successive page sizes using the
GET_TTE() assembly macro. If a match is found, the TTE is
inserted into the TSB, loaded into the TLB, and the trapped
instruction is re-issued. If a match is not found, or the access type
does not match the permitted access for this mapping (e.g. a write is
attempted to a read-only mapping) control is transferred to the
sys_trap() routine in mach_locore.s
after setting up the appropriate fault type. The sys_trap()
routine (which is very involved due to SPARC's register windows) saves
the machine state to the stack, switches from trap context to kernel
context, and invokes the kernel-side trap handler in C,
trap() over in trap.c. The
trap() routine recognizes the T_DATA_MMU_MISS trap
code and branches to pagefault() in vm_dep.c. As
its x64 counterpart does, pagefault() collects the
appropriate arguments and invokes the common handler
as_fault().
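The hash search on a TSB miss can be pictured with a small C sketch. This is a simplification under stated assumptions, not the sfmmu implementation: the struct hment layout, the hash function, and the function names are invented for illustration, one hash entry holds a single translation, and the page sizes tried are 8K, 64K, 512K, and 4M.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HASH_BUCKETS 64     /* hypothetical table size */

/* Page-size shifts tried in order: 8K, 64K, 512K, 4M. */
static const unsigned page_shifts[] = { 13, 16, 19, 22 };

struct hment {              /* hypothetical hash entry */
    const void   *hat;      /* address space that owns the mapping */
    uint64_t      vpn;      /* virtual address >> shift */
    unsigned      shift;    /* page size of this mapping */
    uint64_t      tte;      /* translation; 0 is reserved for "not found" */
    struct hment *next;     /* bucket chain (open hashing) */
};

size_t
hash_bucket(const void *hat, uint64_t vpn, unsigned shift)
{
    return (size_t)((((uintptr_t)hat >> 4) ^ vpn ^ shift) % HASH_BUCKETS);
}

void
hash_insert(struct hment **table, struct hment *e)
{
    size_t b = hash_bucket(e->hat, e->vpn, e->shift);

    assert(e->tte != 0);
    e->next = table[b];
    table[b] = e;
}

/* Probe once per page size; return the TTE, or 0 if nothing maps va. */
uint64_t
hash_lookup(struct hment *const *table, const void *hat, uint64_t va)
{
    for (size_t i = 0; i < sizeof (page_shifts) / sizeof (page_shifts[0]); i++) {
        unsigned shift = page_shifts[i];
        uint64_t vpn = va >> shift;
        const struct hment *e;

        for (e = table[hash_bucket(hat, vpn, shift)]; e != NULL; e = e->next) {
            if (e->hat == hat && e->vpn == vpn && e->shift == shift)
                return e->tte;
        }
    }
    return 0;   /* no mapping: the real code would go on to sys_trap() */
}
```

The key combines the address space and the virtual page number, so the same virtual address in two processes lands on different entries, and each page size needs its own probe because the page number depends on the shift.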
For more information about the sfmmu HAT layer, keep coming back --
this subsystem warrants a more in-depth tour in future blogs.
Part 3: Common Code Layer
The Solaris virtual memory (VM) subsystem uses a segmented
model to map each process' address space, as well as the kernel
itself. Each segment object maps a contiguous range of virtual memory
with common attributes. The backing store for each segment
may be device memory, a file, physical memory, etc. Each backing store
type is handled by a different segment driver. The most
commonly used segment driver is seg_vn, so-named because it
maps vnodes associated with files. Perhaps more
interestingly, the seg_vn segment driver is also responsible
for implementing anonymous memory, which is so called because
it is private to a process and is backed by swap space rather than by
a file object. Since seg_vn maps the majority of a process'
address space, including all text, heap, and stack, I'll use it to
illustrate the most common page fault path encountered by a process4.
Returning to the page fault path, assume that the page fault being
examined has occurred in a virtual address range that corresponds to a
process heap -- for instance, the first touch of new memory allocated
by a brk() system call performed by the C library's
malloc() routine. Such a fault will allocate process private,
anonymous memory which is pre-filled with zeros, known to VM geeks as
a ZFOD fault -- short for zero fill on demand. In such a
situation, the as_fault() routine (vm_as.c)
will search the process' segment tree looking for the segment that
maps the virtual address range corresponding to the fault. If
as_fault() discovers that no such segment exists, a fatal
segmentation violation is signalled to the process causing it to
terminate. In our example, a segment is found whose seg_ops
corresponds to segvn_ops (seg_vn.c). The
SEGOP_FAULT() macro is called, which invokes the
segvn_fault() routine in seg_vn.c. In
our example, the backing store is swap, so segvn_faultpage()
will find there is no vnode backing this range, but rather an anon
object, and will allocate a page to back this virtual address through
anon_zero() in vm_anon.c.
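The shape of this path (find the segment, dispatch through its ops vector, zero-fill on first touch) can be sketched in a few lines of C. Everything here is a stand-in: a sorted array with a binary search replaces the segment tree, calloc() replaces the kernel page allocator, and the names handle_fault(), find_seg(), and anon_fault() are hypothetical rather than taken from vm_as.c or seg_vn.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_PAGESIZE 4096u
#define SEG_MAX_PAGES 8

enum fault_result { FAULT_OK = 0, FAULT_NOMAP = 1, FAULT_NOMEM = 2 };

struct seg;
struct seg_ops {                        /* per-driver operations vector */
    int (*fault)(struct seg *seg, uintptr_t addr);
};

struct seg {
    uintptr_t base;                     /* start of the mapped range */
    size_t size;                        /* bytes, at most SEG_MAX_PAGES pages */
    const struct seg_ops *ops;
    unsigned char *pages[SEG_MAX_PAGES]; /* NULL until first touch */
};

/* Zero fill on demand: back the page with zeroed memory on first fault. */
int
anon_fault(struct seg *seg, uintptr_t addr)
{
    size_t idx = (addr - seg->base) / DEMO_PAGESIZE;

    assert(idx < SEG_MAX_PAGES);
    if (seg->pages[idx] == NULL) {
        seg->pages[idx] = calloc(1, DEMO_PAGESIZE);
        if (seg->pages[idx] == NULL)
            return FAULT_NOMEM;
    }
    return FAULT_OK;
}

const struct seg_ops anon_ops = { anon_fault };

/* Binary search over segments sorted by base address. */
struct seg *
find_seg(struct seg *segs, size_t nsegs, uintptr_t addr)
{
    size_t lo = 0, hi = nsegs;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (addr < segs[mid].base)
            hi = mid;
        else if (addr - segs[mid].base >= segs[mid].size)
            lo = mid + 1;
        else
            return &segs[mid];
    }
    return NULL;
}

int
handle_fault(struct seg *segs, size_t nsegs, uintptr_t addr)
{
    struct seg *seg = find_seg(segs, nsegs, addr);

    if (seg == NULL)
        return FAULT_NOMAP;             /* caller would signal a SIGSEGV */
    return seg->ops->fault(seg, addr);  /* analogous to SEGOP_FAULT() */
}
```

The ops vector is what lets the common fault code stay ignorant of the backing store: a different segment driver would simply supply a different fault function.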
Here is a sample callstack into anon_zero() as viewed from DTrace on
my workstation (which is a dual-CPU Opteron running the 64-bit Solaris kernel):
# dtrace -qn '::anon_zero:entry { stack(8); exit(0); }'
genunix`segvn_faultpage+0x16c
genunix`segvn_fault+0x647
genunix`as_fault+0x3c8
unix`pagefault+0x7e
unix`trap+0x792
unix`_cmntrap+0x83
Here is another sample callstack into anon_zero(), this time from
an Ultra-Enterprise 10000:
# dtrace -qn '::anon_zero:entry { stack(8); exit(0); }'
genunix`segvn_faultpage+0x238
genunix`segvn_fault+0x920
genunix`as_fault+0x4b8
unix`pagefault+0xac
unix`trap+0xc5c
unix`utl0+0x4c
Note that, in both cases, we can only see back as far as where we switched
to kernel context, since trap context uses only registers or scratch space
for its work and does not save traceable stack frames for us.
For those of you who wish to trace the path of a process' zero fill
page faults from beginning to end, you may do so quite easily by
running this DTrace script, as root. The script
takes one argument, which is the exec name of the binary to trace. I
recommend a simple one like "ls" since it is relatively small and
short lived.
1 SPARC TLB handling is a topic that will take many blog entries to
cover from beginning to end, so I'm skipping over many of the details
here for now. For the impatient, pick up a copy of Solaris Internals by
Jim Mauro and Richard McDougall
(ISBN 0-13-022496-0); though much of the material is dated now, many
of the details are still accurate.
2 With a little extra effort, the hardware could perform the TSB
lookup itself. No current sun4u
systems do so, but some future systems may support the TSB lookup in
hardware.
3 I'm skipping a step here for the sake
of brevity -- there are actually two TSB searches in the case of a
process which is using large pages, since a separate 4M-indexed TSB is
kept for large pages. If the process is using 4M or larger pages, the
second TSB must be searched also prior to a TSB miss. This second
search is performed using a software generated TSB index, since the
hardware assist only generates an 8K-indexed TSB pointer into the first
TSB. See sfmmu_udtlb_slowpath() in
the source if you care to see what really happens... Go on, you
really have the source now, so no excuses :)
4 In some ways, this is unfortunate
because the seg_vn segment driver is the most complicated of
all the segment drivers in the Solaris VM subsystem, and as such has a
very steep learning curve. Within Sun, we often joke that nobody
understands how it all works, as it has evolved over a period of many,
many years, and all of the original implementors have since moved on
or are now part of Sun's upper management. While the spirit of the
code hasn't changed significantly from the original SVR4 code, much of
the complexity added over the years has evolved to support modern
features like superpages that were not anticipated in the
original design. This can make for a few twists and turns in the
source even for following the path of a simple example like our ZFOD
fault.
(2005-06-14 09:04:04.0/2005-06-14 09:00:00.0)