The secret sauce myth -- exposed
Dangers of having the source (and the "secret sauce" myth exposed)
The other day, there was a good internal thread going on one of the bigger internal Sun mailing lists, and it reminded me of an important point that I think is worth sharing with the outside world -- particularly now that OpenSolaris has landed. Having access to the source can be dangerous to developers if you're trying to develop stable, forward-compatible software for your users. The particular thread was centered around the use of private members of the vnode in a third party filesystem -- but it really could apply to development of any software on any system where folks have unrestricted access to the source code.
There is a myth that so-called "private" interfaces in Solaris are private simply because Solaris is a proprietary operating system, and Sun does not want external software developers to have access to our mythical secret sauce. The myth goes on: Sun intentionally presents these interfaces cryptically in the header files, and does not document them, simply because they provide a better way of doing something than the documented ones. Sun supposedly does this so that our software will somehow be better than anything anyone else can write, giving us a competitive edge, so we can keep selling our proprietary system running on our proprietary hardware.
BZZT! WRONG!!!
Spin again!
Solaris provides an interface stability taxonomy which defines certain levels of stability for the programming interfaces which are presented to developers (this taxonomy is documented in the attributes(5) man page). The taxonomy's classifications make clear how firmly Sun commits to maintaining backward compatibility for any programs that use a given interface in the future.
- Standard means the interface is defined by one of the various standards Solaris supports (e.g. POSIX) and hence cannot change as long as Solaris claims to support that standard.
- Stable interfaces are guaranteed not to change incompatibly until the next major release. To put things into perspective, the last major release of SunOS was from 4.x to 5.0, which as you all know was a LONG time ago!
- Evolving and Unstable interfaces may change incompatibly only in a minor release. Solaris 10 was a minor release, as was Solaris 9; Solaris 10 quarterly updates and 2.5.1 are examples of micro releases, which are not allowed to change these interfaces incompatibly.
A good rule to know is that if the interface has a man page, and there is no release taxonomy info at the bottom of the man page, the interface stability is Stable. If there isn't a man page, then the interface may be a private interface, or it may be an implementation artifact (I don't believe there is any way to tell which from outside of Sun yet, but I wouldn't be surprised if there is one on the OpenSolaris website soon). Private interfaces and implementation details of the system may change in micro releases and patches -- wherein lies the hidden danger!
When developing software, if you want your software to run on future releases of Solaris, you must be careful to use only interfaces which have a suitable commitment level. Interfaces you run across while looking through the source code or headers, if not documented, aren't guaranteed to work in the future -- and are very likely NOT to work in the future. This means your program will stop working if Sun or the OpenSolaris community decides to change that interface for any reason, at any time, even in a patch. That, needless to say, will cause headaches for the users of your software!
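To make the danger concrete, here is a minimal sketch in C. Everything in it is hypothetical: the `obj_rel_a` and `obj_rel_b` structures stand in for two releases of some private kernel structure (they are not the real vnode), and the accessor functions stand in for a documented interface. It only illustrates how code that reads a private field at a remembered offset silently goes wrong when the layout changes, while code using the accessor keeps working.

```c
#include <stddef.h>
#include <string.h>

/*
 * Two releases of a hypothetical object. In "release B" a patch
 * inserted a new private field ahead of `size`, moving its offset.
 */
struct obj_rel_a { int flags; long size; };
struct obj_rel_b { int flags; void *priv_cache; long size; };

/* The documented accessor is updated together with the layout. */
static long obj_size_rel_a(const struct obj_rel_a *o) { return o->size; }
static long obj_size_rel_b(const struct obj_rel_b *o) { return o->size; }

/*
 * A third-party module built against release A that reads the private
 * field at its old offset instead of calling the accessor.
 */
static long
peek_size_at_rel_a_offset(const void *obj)
{
	long v;

	memcpy(&v, (const char *)obj + offsetof(struct obj_rel_a, size),
	    sizeof (v));
	return (v);
}
```

Against a release A object, both the accessor and the raw peek return the same size. Against a release B object, the accessor still returns the right answer, but the peek reads whatever now lives at the old offset (here, the bytes of `priv_cache`).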
Getting back to the thread, Jim Carlson summed things up quite well, when he stated thus:
...we[1] reserve the right to change these interfaces in whatever way we
choose, at any time at all (including in patches), without notice of
any kind, and without any attempt to preserve any sort of
compatibility.
Thus, if you depend on private interfaces by writing your own software
that uses them, then you'll end up hurting yourself.
This is roughly the equivalent of a "no user serviceable parts inside"
warning on an appliance. It doesn't mean that you can't open it up
and poke around inside if you really know what you're doing. But if
you do, and you end up electrocuting yourself or starting a fire, you
have nobody else to blame.
[1] "We" in the above sentence means "everyone working on the code."
If that turns out to be a community-based effort, then it's the
members of that community who own and direct it.
Don't believe the myth, and don't electrocute your customers -- it's not worth it. Now that Solaris is opened up, if there isn't a way to do what you want to do, you can create one (or ask the community to help)!
(2005-06-14 19:39:25.0/2005-06-14 21:17:44.0)
Trackback: http://blogs.sun.com/elowe/entry/dangers_of_having_the_source
Page Fault Handling in Solaris
Welcome to OpenSolaris! In
this entry, I'll walk through the page fault handling code,
which is ground zero of the Solaris virtual memory subsystem. Due to
the nature of this level in the system, part of the code (the lowest
level that interfaces to hardware registers) is machine dependent,
while the rest is common code written in C. Hence, I will present this
topic in three parts: x64 machine dependent code, which has the most
hardware handling for TLB misses, followed by the more
complex SPARC machine dependent code, which relies on assembly code to
handle TLB misses from trap context; I'll wrap up by covering the
common code which is executed from kernel context.
Part 1: x64 Machine Dependent Layer
Since all x86-class machines handle TLB misses using a hardware page
table walk mechanism, the Hardware Address Translation, or HAT, layer
for x64 systems is the less complex of the two system architectures
Solaris currently supports. Both the x86 and AMD64 systems use a
page directory scheme to map per-address-space virtual memory
addresses to physical memory addresses. When a TLB miss occurs, the
MMU (memory management unit) hardware searches the page table for the
page table entry (PTE) associated with the virtual address of the
memory access, if one exists. In the page directory model, the virtual
address is divided up into several parts; each successive part of the
virtual address forms an index into each successive level in the
directory, while the higher level directory entries point to the
address in memory of the next lower directory. Each directory table
is 4K in size, which corresponds to the base page size of the
processor. The pointer to the top-level page directory is programmed
into the cr3 hardware register on context switch.
![Directory based page table](http://blogs.sun.com/roller/resources/elowe/pagedir.gif)
Directory-based page tables
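As a rough illustration of how a virtual address is carved into directory indices, here is a minimal sketch in C. It assumes the AMD64 long-mode layout with 4K pages (four levels of 512-entry tables, so 9 index bits per level plus a 12-bit page offset); the helper names `pt_index()` and `pt_page_offset()` are made up for this example and are not Solaris HAT functions.

```c
#include <stdint.h>

/*
 * Assumed AMD64 long-mode layout with 4K pages: four levels of
 * 512-entry tables (9 index bits each) plus a 12-bit page offset.
 */
#define	PT_INDEX_BITS	9
#define	PT_PAGE_SHIFT	12

/*
 * Hypothetical helper: index into the table at `level`, where level 3
 * is the top-level table that cr3 points to and level 0 holds the PTEs.
 */
static unsigned
pt_index(uint64_t va, int level)
{
	unsigned shift = PT_PAGE_SHIFT + (unsigned)level * PT_INDEX_BITS;

	return ((unsigned)((va >> shift) & ((1u << PT_INDEX_BITS) - 1)));
}

/* Byte offset of the access within the final 4K page. */
static unsigned
pt_page_offset(uint64_t va)
{
	return ((unsigned)(va & ((1u << PT_PAGE_SHIFT) - 1)));
}
```

A hardware walk would use `pt_index(va, 3)` to pick an entry in the top-level table, follow that entry to the next table, and repeat down to level 0. With 8-byte entries, 512 of them fill exactly one 4K table.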
Since we're discussing the page fault path in this blog entry, we are
interested in the case where the processor fails to find a valid PTE
in the lowest level of the directory. This results in a page fault
exception (#pf), which passes control synchronously to a page
fault handler in trap context. This low-level handler is
pftrap(), located in exception.s. The
handler jumps to cmntrap() over in locore.s,
which pushes the machine state onto the stack, switches to kernel
context, and invokes the C kernel-side trap handler trap() in
trap.c with a trap type of T_PGFLT. The trap() routine
figures out that this is a user fault since the faulting address lies
below KERNELBASE, and calls pagefault() in vm_machdep.c. The
pagefault() routine collects the necessary arguments for the
common as_fault() routine, and passes control to it.
For more information regarding the x64 HAT layer, refer to Joe Bonasera's blog, where
he has started blogging about this subsystem which he and Nils
Nieuwejaar redesigned from the ground up for the AMD64 port in Solaris
10.
Part 2: SPARC Machine Dependent Layer
The UltraSPARC architecture -- the only SPARC architecture currently
supported by Solaris -- relies entirely on software to handle TLB
misses[1]. Hence, the HAT layer for SPARC
is a bit more complex than the x64 one. To speed up handling of TLB
miss traps, the processor provides a hardware-assisted lookup
mechanism[2] called the Translation
Storage Buffer (TSB). The TSB is a virtually indexed,
direct-mapped, physically contiguous, and size-aligned region of
physical memory which is used to cache recently used Translation Table
Entries (TTEs) after retrieval from the page tables. When a TLB miss
occurs, the hardware uses the virtual address of the miss combined
with the contents of a TSB base address register (which is
pre-programmed on context switch) to calculate the pointer into the
TSB of the entry corresponding to the virtual address. If the TSB
entry tag matches the virtual address of the miss, the TTE is loaded
into the TLB by the TLB miss handler, and the trapped instruction is
retried. See DTLB_MISS() in trap_table.s
and sfmmu_udtlb_slowpath in sfmmu_asm.s. If
no match is found, the trap handler branches to a slow path routine
called the TSB miss handler[3].
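The direct-mapped lookup described above can be sketched in a few lines of C. This is a simplified model, not the sfmmu code: the `tsb_entry` layout, the function names, and the convention that a zero TTE means "invalid" are all assumptions made for the example, and the real work is done in trap-level assembly with hardware help in forming the TSB pointer.

```c
#include <stddef.h>
#include <stdint.h>

#define	SKETCH_PAGESHIFT	13	/* 8K base pages, as on UltraSPARC */

/* Hypothetical, simplified TSB entry; the real layout differs. */
struct tsb_entry {
	uint64_t tag;	/* virtual page number this entry caches */
	uint64_t tte;	/* cached translation; 0 means "invalid" here */
};

/* nentries must be a power of two so that masking yields the slot. */
static struct tsb_entry *
tsb_slot(struct tsb_entry *tsb, size_t nentries, uint64_t va)
{
	uint64_t vpn = va >> SKETCH_PAGESHIFT;

	return (&tsb[vpn & (nentries - 1)]);
}

/* Returns the cached TTE, or 0 when the TSB miss path must be taken. */
static uint64_t
tsb_lookup(struct tsb_entry *tsb, size_t nentries, uint64_t va)
{
	struct tsb_entry *e = tsb_slot(tsb, nentries, va);
	uint64_t vpn = va >> SKETCH_PAGESHIFT;

	return ((e->tte != 0 && e->tag == vpn) ? e->tte : 0);
}

/* Cache a translation, evicting whatever shared the slot. */
static void
tsb_insert(struct tsb_entry *tsb, size_t nentries, uint64_t va, uint64_t tte)
{
	struct tsb_entry *e = tsb_slot(tsb, nentries, va);

	e->tag = va >> SKETCH_PAGESHIFT;
	e->tte = tte;
}
```

Because the table is direct-mapped, two virtual pages that share a slot evict each other. The tag comparison in `tsb_lookup()` is what keeps a stale entry from being loaded for the wrong address.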
The SPARC HAT layer (named sfmmu after the codename
spitfire MMU, the first UltraSPARC MMU supported) uses an open
hashing technique to implement the page tables in software. The hash
lookup is performed using the struct hat pointer for the
currently running process and the virtual address of the TLB miss. On
a TSB miss, the function sfmmu_tsb_miss_tt in sfmmu_asm.s
searches the hash for successive page sizes using the
GET_TTE() assembly macro. If a match is found, the TTE is
inserted into the TSB, loaded into the TLB, and the trapped
instruction is re-issued. If a match is not found, or the access type
does not match the permitted access for this mapping (e.g. a write is
attempted to a read-only mapping), control is transferred to the
sys_trap() routine in mach_locore.s
after setting up the appropriate fault type. The sys_trap()
routine (which is very involved due to SPARC's register windows) saves
the machine state to the stack, switches from trap context to kernel
context, and invokes the kernel-side trap handler in C,
trap() over in trap.c. The
trap() routine recognizes the T_DATA_MMU_MISS trap
code and branches to pagefault() in vm_dep.c. As
its x64 counterpart does, pagefault() collects the
appropriate arguments and invokes the common handler
as_fault().
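The hash search across successive page sizes can be modeled roughly as follows. This is a sketch under stated assumptions, not the sfmmu implementation: the `hment` structure, the hash function, and the one-entry-per-page granularity are invented for illustration (the real code is assembly and organizes its hash entries differently). The page sizes are the 8K, 64K, 512K and 4M sizes of the early UltraSPARC MMUs.

```c
#include <stddef.h>
#include <stdint.h>

#define	HASH_BUCKETS	64

/* Page-size shifts searched smallest first: 8K, 64K, 512K, 4M. */
static const unsigned page_shifts[] = { 13, 16, 19, 22 };
#define	NPAGESIZES	(sizeof (page_shifts) / sizeof (page_shifts[0]))

/* Hypothetical hash entry: one mapping of one page in one address space. */
struct hment {
	const void *hat;	/* owning address space */
	uint64_t vpn;		/* virtual page number at this page size */
	unsigned shift;		/* page size, expressed as a shift */
	uint64_t tte;		/* translation; nonzero when valid */
	struct hment *next;	/* chain of entries in the same bucket */
};

static size_t
hme_hash(const void *hat, uint64_t vpn, unsigned shift)
{
	return ((size_t)(((uintptr_t)hat >> 4) ^ vpn ^ shift) % HASH_BUCKETS);
}

static void
hme_insert(struct hment **table, struct hment *e)
{
	size_t b = hme_hash(e->hat, e->vpn, e->shift);

	e->next = table[b];
	table[b] = e;
}

/* Try each page size in turn; return 0 if no mapping covers va. */
static uint64_t
hme_lookup(struct hment **table, const void *hat, uint64_t va)
{
	for (size_t i = 0; i < NPAGESIZES; i++) {
		unsigned shift = page_shifts[i];
		uint64_t vpn = va >> shift;
		struct hment *e = table[hme_hash(hat, vpn, shift)];

		for (; e != NULL; e = e->next) {
			if (e->hat == hat && e->vpn == vpn &&
			    e->shift == shift)
				return (e->tte);
		}
	}
	return (0);
}
```

A return of 0 corresponds to the case in the text where no match is found and control must pass up to the C-level trap handler. The sketch leaves out the access-permission check.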
For more information about the sfmmu HAT layer, keep coming back --
this subsystem warrants a more in-depth tour in future blogs.
Part 3: Common Code Layer
The Solaris virtual memory (VM) subsystem uses a segmented
model to map each process' address space, as well as the kernel
itself. Each segment object maps a contiguous range of virtual memory
with common attributes. The backing store for each segment
may be device memory, a file, physical memory, etc. Each backing store
type is handled by a different segment driver. The most
commonly used segment driver is seg_vn, so-named because it
maps vnodes associated with files. Perhaps more
interestingly, the seg_vn segment driver is also responsible
for implementing anonymous memory, so called because
it is private to a process and is backed by swap space rather than by
a file object. Since seg_vn maps the majority of a process'
address space, including all text, heap, and stack, I'll use it to
illustrate the most common page fault path encountered by a process[4].
Returning to the page fault path, assume that the page fault being
examined has occurred in a virtual address range that corresponds to a
process heap -- for instance, the first touch of new memory allocated
by a brk() system call performed by the C library's
malloc() routine. Such a fault will allocate process private,
anonymous memory which is pre-filled with zeros, known to VM geeks as
a ZFOD fault -- short for zero fill on demand. In such a
situation, the as_fault() routine (vm_as.c)
will search the process' segment tree looking for the segment that
maps the virtual address range corresponding to the fault. If
as_fault() discovers that no such segment exists, a fatal
segmentation violation is signalled to the process causing it to
terminate. In our example, a segment is found whose seg_ops
corresponds to segvn_ops (seg_vn.c). The
SEGOP_FAULT() macro is called, which invokes the
segvn_fault() routine in seg_vn.c. In
our example, the backing store is swap, so segvn_faultpage()
will find that there is no vnode backing this range, but rather an anon
object, and will allocate a page to back this virtual address through
anon_zero() in vm_anon.c.
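The shape of this dispatch (find the segment, then call through its ops vector) can be sketched in C as below. This is a toy model: the structure and function names prefixed with `sketch_` are hypothetical, the segments live in a plain array rather than the kernel's per-address-space lookup structure, and the "fault handler" merely counts zero-fill faults instead of allocating pages.

```c
#include <stddef.h>
#include <stdint.h>

enum fault_result { FAULT_OK, FAULT_SEGV };

struct seg;

/* Per-driver operations vector, in the spirit of seg_ops. */
struct seg_ops {
	enum fault_result (*fault)(struct seg *seg, uintptr_t addr);
};

struct seg {
	uintptr_t base;			/* first address mapped */
	size_t size;			/* length of the mapping in bytes */
	const struct seg_ops *ops;	/* driver that handles this segment */
	unsigned zfod_count;		/* zero-fill faults taken so far */
};

/* Stand-in for a seg_vn-style driver resolving an anonymous ZFOD fault. */
static enum fault_result
sketch_anon_fault(struct seg *seg, uintptr_t addr)
{
	(void) addr;
	seg->zfod_count++;	/* a real driver would allocate a zeroed page */
	return (FAULT_OK);
}

static const struct seg_ops sketch_anon_ops = { sketch_anon_fault };

/* Find the segment covering addr and dispatch through its ops vector. */
static enum fault_result
sketch_as_fault(struct seg *segs, size_t nsegs, uintptr_t addr)
{
	for (size_t i = 0; i < nsegs; i++) {
		struct seg *s = &segs[i];

		if (addr >= s->base && addr - s->base < s->size)
			return (s->ops->fault(s, addr));
	}
	return (FAULT_SEGV);	/* no segment maps this address */
}
```

The indirection through `ops` is what lets one common fault routine serve every backing store type. The common code never needs to know whether the segment is backed by a file, swap, or device memory.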
Here is a sample callstack into anon_zero() as viewed from DTrace on
my workstation (which is a dual-CPU Opteron running the 64-bit Solaris kernel):
    # dtrace -qn '::anon_zero:entry { stack(8); exit(0); }'
    genunix`segvn_faultpage+0x16c
    genunix`segvn_fault+0x647
    genunix`as_fault+0x3c8
    unix`pagefault+0x7e
    unix`trap+0x792
    unix`_cmntrap+0x83
Here is another sample callstack into anon_zero(), this time from
an Ultra-Enterprise 10000:
    # dtrace -qn '::anon_zero:entry { stack(8); exit(0); }'
    genunix`segvn_faultpage+0x238
    genunix`segvn_fault+0x920
    genunix`as_fault+0x4b8
    unix`pagefault+0xac
    unix`trap+0xc5c
    unix`utl0+0x4c
Note that, in both cases, we can only see back as far as where we switched
to kernel context, since trap context uses only registers or scratch space
for its work and does not save traceable stack frames for us.
For those of you who wish to trace the path of a process' zero fill
page faults from beginning to end, you may do so quite easily by
running this DTrace script as root. The script
takes one argument, which is the exec name of the binary to trace. I
recommend a simple one like "ls" since it is relatively small and
short lived.
[1] The topic of the details of SPARC TLB
handling is one that will take many blog entries to cover from
beginning to end, so I'm skipping over many of the details here for
now. For the impatient, pick up a copy of Solaris Internals by
Jim Mauro and Richard McDougall
(ISBN 0-13-022496-0); though much of the material is dated now, many
of the details are still accurate.
[2] This TSB mechanism could be employed by
the hardware for a little extra effort. No current sun4u
systems do so, but some future systems may support the TSB lookup in
hardware.
[3] I'm skipping a step here for the sake
of brevity -- there are actually two TSB searches in the case of a
process which is using large pages, since a separate 4M-indexed TSB is
kept for large pages. If the process is using 4M or larger pages, the
second TSB must be searched also prior to a TSB miss. This second
search is performed using a software generated TSB index, since the
hardware assist only generates an 8K-indexed TSB pointer into the first
TSB. See sfmmu_udtlb_slowpath() in
the source if you care to see what really happens... Go on, you
really have the source now, so no excuses :)
[4] In some ways, this is unfortunate
because the seg_vn segment driver is the most complicated of
all the segment drivers in the Solaris VM subsystem, and as such has a
very steep learning curve. Within Sun, we often joke that nobody
understands how it all works, as it has evolved over a period of many,
many years, and all of the original implementors have since moved on
or are now part of Sun's upper management. While the spirit of the
code hasn't changed significantly from the original SVR4 code, much of
the complexity added over the years has evolved to support modern
features like superpages that were not anticipated in the
original design. This can make for a few twists and turns in the
source even for following the path of a simple example like our ZFOD
fault.
(2005-06-14 09:04:04.0/2005-06-14 09:00:00.0)
Trackback: http://blogs.sun.com/elowe/entry/page_fault_handling_in_solaris