/var/crash/elowe

Core Dumps of a Kernel Hacker's Brain - Eric Lowe's Blog


Tuesday June 14, 2005

 The secret sauce myth -- exposed

Dangers of having the source (and the "secret sauce" myth exposed)

The other day, there was a good internal thread going on one of the bigger internal Sun mailing lists, and it reminded me of an important point that I think is worth sharing with the outside world -- particularly now that OpenSolaris has landed. Having access to the source can be dangerous to developers if you're trying to develop stable, forward-compatible software for your users. The particular thread was centered around the use of private members of the vnode in a third party filesystem -- but it really could apply to development of any software on any system where folks have unrestricted access to the source code.

There is a myth that so-called "private" interfaces in Solaris are private simply because Solaris is a proprietary operating system, and Sun does not want some external software developer having access to our mythical secret sauce. It goes on -- furthermore, Sun intentionally presents these interfaces cryptically in the header files, and does not document them, simply because they provide a better way of doing something than the documented ones. Sun does this so that our software will somehow be better than anything anyone else can write, giving us a competitive edge, so we can keep selling our proprietary system running on our proprietary hardware.

BZZT! WRONG!!!

Spin again!

Solaris provides an interface stability taxonomy which defines certain levels of stability for the programming interfaces which are presented to developers (this taxonomy is documented in the attributes(5) man page). The classifications in this taxonomy make clear how committed Sun is to maintaining backward compatibility for any programs that use a given interface in the future.

  • Standard means the interface is defined by one of the various standards Solaris supports (e.g. POSIX) and hence cannot change as long as Solaris claims to support that standard.
  • Stable interfaces are guaranteed not to change incompatibly until the next major release. To put things into perspective, the last major release of SunOS was from 4.x to 5.0, which as you all know was a LONG time ago!
  • Evolving and Unstable interfaces may not change incompatibly except in a minor release. Solaris 10 was a minor release, as was Solaris 9; Solaris 10 quarterly updates and 2.5.1 are examples of micro releases, which are not allowed to change these interfaces incompatibly.
A good rule to know is that if the interface has a man page, and there is no release taxonomy info at the bottom of the man page, the interface stability is Stable. If there isn't a man page, then the interface may be a private interface, or it may be an implementation artifact (I don't believe there is any way to tell which from outside of Sun yet, but I wouldn't be surprised if there is one on the OpenSolaris website soon). Private interfaces and implementation details of the system may change in micro releases and patches -- wherein lies the hidden danger!
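
As a rough illustration of the rules above, here is a small sketch that encodes which kind of release is allowed to break each class of interface. It is based only on the simplified description in this entry, not on the full attributes(5) text; the table, the release ordering, and the function name `may_break` are all made up for the example.

```python
# Release kinds, ordered from the smallest to the largest scope of change.
RELEASES = ["patch", "micro", "minor", "major"]

# Smallest kind of release in which an incompatible change is permitted,
# following the simplified description in the text. None means "never, as
# long as the standard is still supported". (Illustrative table only.)
MAY_BREAK_IN = {
    "Standard": None,
    "Stable": "major",
    "Evolving": "minor",
    "Unstable": "minor",
    "Private": "patch",
}


def may_break(stability, release):
    """Return True if an interface with this stability level may change
    incompatibly in the given kind of release."""
    threshold = MAY_BREAK_IN[stability]
    if threshold is None:
        return False
    return RELEASES.index(release) >= RELEASES.index(threshold)


if __name__ == "__main__":
    for level in MAY_BREAK_IN:
        allowed = [r for r in RELEASES if may_break(level, r)]
        print(level, "may break in:", allowed or "nothing")
```

The point of the table is the last row: anything private can break in the smallest possible delivery vehicle, a patch.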

When developing software, if you want your software to run on future releases of Solaris, you must be careful to use only interfaces which have a suitable commitment level. Interfaces you run across while looking through the source code or headers, if not documented, aren't guaranteed to work in the future -- and are very likely NOT to work in the future. This means your program will stop working if Sun or the OpenSolaris community decides to change that interface for any reason, at any time, even in a patch. Which, needless to say, will cause headaches for the users of your software!

Getting back to the thread, Jim Carlson summed things up quite well, when he stated thus:

...we[1] reserve the right to change these interfaces in whatever way we                      
choose, at any time at all (including in patches), without notice of                 
any kind, and without any attempt to preserve any sort of                            
compatibility.                                                                       
                                                                                     
Thus, if you depend on private interfaces by writing your own software               
that uses them, then you'll end up hurting yourself.                                 
                                                                                     
This is roughly the equivalent of a "no user serviceable parts inside"               
warning on an appliance.  It doesn't mean that you can't open it up                  
and poke around inside if you really know what you're doing.  But if                 
you do, and you end up electrocuting yourself or starting a fire, you                
have nobody else to blame.                                                           
                                                                                     
[1] "We" in the above sentence means "everyone working on the code."                 
    If that turns out to be a community-based effort, then it's the                  
    members of that community who own and direct it.                                 

Don't believe the myth, and don't electrocute your customers -- it's not worth it. Now that Solaris is opened up, if there isn't a way to do what you want to do, you can create one (or ask the community to help)!



(2005-06-14 19:39:25.0/2005-06-14 21:17:44.0) Permalink
Trackback: http://blogs.sun.com/elowe/entry/dangers_of_having_the_source

 Page Fault Handling in Solaris

Welcome to OpenSolaris! In this entry, I'll walk through the page fault handling code, which is ground zero of the Solaris virtual memory subsystem. Due to the nature of this level in the system, part of the code (the lowest level that interfaces to hardware registers) is machine dependent, while the rest is common code written in C. Hence, I will present this topic in three parts: x64 machine dependent code, which has the most hardware handling for TLB misses, followed by the more complex SPARC machine dependent code, which relies on assembly code to handle TLB misses from trap context; I'll wrap up by covering the common code which is executed from kernel context.

Part 1: x64 Machine Dependent Layer

Since all x86-class machines handle TLB misses using a hardware page table walk mechanism, the Hardware Address Translation, or HAT, layer for x64 systems is the less complex of the two system architectures Solaris currently supports. Both the x86 and AMD64 systems use a page directory scheme to map per-address-space virtual memory addresses to physical memory addresses. When a TLB miss occurs, the MMU (memory management unit) hardware searches the page table for the page table entry (PTE) associated with the virtual address of the memory access, if one exists. In the page directory model, the virtual address is divided up into several parts; each successive part of the virtual address forms an index into each successive level in the directory, while each higher-level directory entry points to the address in memory of the next lower-level directory. Each directory table is 4K in size, which corresponds to the base page size of the processor. The pointer to the top-level page directory is programmed into the cr3 hardware register on context switch.

  [Directory based page table]
   Directory-based page tables
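
To make the directory walk concrete, here is a minimal sketch of how a virtual address is carved into per-level indices. It assumes 64-bit long mode with 4K pages and four directory levels of 512 entries each (a 4K table of 8-byte entries); nested Python dicts stand in for the in-memory tables, and the names `split_va` and `walk` are invented for the example. It models the idea only, not the MMU or the Solaris HAT code.

```python
PAGE_SHIFT = 12   # 4K base page: the low 12 bits are the offset in the page
INDEX_BITS = 9    # assumption: 4K table / 8-byte entries = 512 entries
LEVELS = 4        # assumption: four-level paging as used in 64-bit long mode


def split_va(va):
    """Split a virtual address into ([top-level index, ..., lowest-level
    index], page offset)."""
    offset = va & ((1 << PAGE_SHIFT) - 1)
    indices = []
    for level in range(LEVELS - 1, -1, -1):
        shift = PAGE_SHIFT + level * INDEX_BITS
        indices.append((va >> shift) & ((1 << INDEX_BITS) - 1))
    return indices, offset


def walk(root, va):
    """Follow the indices through nested dicts that stand in for directory
    tables. Returns the PTE, or None where the hardware would raise a
    page fault."""
    indices, _ = split_va(va)
    table = root
    for idx in indices:
        table = table.get(idx)
        if table is None:
            return None
    return table


if __name__ == "__main__":
    va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x5
    root = {1: {2: {3: {4: 0xABC}}}}   # one mapped page, made-up PTE value
    print(split_va(va))
    print(walk(root, va), walk(root, va + 0x1000))
```

In the sketch, `root` plays the role of the table that cr3 points to, and a `None` result corresponds to the case discussed next: no valid PTE at the lowest level.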

Since we're discussing the page fault path in this blog entry, we are interested in the case where the processor fails to find a valid PTE in the lowest level of the directory. This results in a page fault exception (#pf), which passes control synchronously to a page fault handler in trap context. This low-level handler is pftrap(), located in exception.s. The handler jumps to cmntrap() over in locore.s which pushes the machine state onto the stack, switches to kernel context, and invokes the C kernel-side trap handler trap() in trap.c with a trap type of T_PGFLT. The trap() routine figures out that this is a user fault since it lies below KERNELBASE, and calls pagefault() in vm_machdep.c. The pagefault() routine collects the necessary arguments for the common as_fault() routine, and passes control to it.

For more information regarding the x64 HAT layer, refer to Joe Bonasera's blog where he has started blogging about this subsystem which he and Nils Nieuwejaar redesigned from the ground up for the AMD64 port in Solaris 10.

Part 2: SPARC Machine Dependent Layer

The UltraSPARC architecture -- the only SPARC architecture currently supported by Solaris -- relies entirely on software to handle TLB misses¹. Hence, the HAT layer for SPARC is a bit more complex than the x64 one. To speed up handling of TLB miss traps, the processor provides a hardware-assisted lookup mechanism² called the Translation Storage Buffer (TSB). The TSB is a virtually indexed, direct-mapped, physically contiguous, and size-aligned region of physical memory which is used to cache recently used Translation Table Entries (TTEs) after retrieval from the page tables. When a TLB miss occurs, the hardware uses the virtual address of the miss combined with the contents of a TSB base address register (which is pre-programmed on context switch) to calculate the pointer into the TSB of the entry corresponding to the virtual address. If the TSB entry tag matches the virtual address of the miss, the TTE is loaded into the TLB by the TLB miss handler, and the trapped instruction is retried. See DTLB_MISS() in trap_table.s and sfmmu_udtlb_slowpath in sfmmu_asm.s. If no match is found, the trap handler branches to a slow path routine called the TSB miss handler³.
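
The TSB lookup described above is essentially a direct-mapped cache probe, which can be sketched as follows. This is a toy model under stated assumptions: an 8K base page, a power-of-two number of entries, and the whole virtual page number used as the tag (the real tag format also carries context information). The class and method names are invented for the example and are not taken from the sfmmu source.

```python
PAGE_SHIFT = 13   # 8K base page


class TSB:
    """Toy direct-mapped translation cache indexed by virtual page number."""

    def __init__(self, nentries):
        # A power-of-two size lets the index be computed with a simple mask.
        assert nentries > 0 and nentries & (nentries - 1) == 0
        self.nentries = nentries
        self.entries = [None] * nentries   # each slot: (tag, tte) or None

    def index(self, va):
        # Low bits of the virtual page number select exactly one slot.
        return (va >> PAGE_SHIFT) & (self.nentries - 1)

    def lookup(self, va):
        """Return the cached TTE, or None on a TSB miss."""
        slot = self.entries[self.index(va)]
        if slot is not None and slot[0] == (va >> PAGE_SHIFT):
            return slot[1]
        return None

    def insert(self, va, tte):
        # Direct-mapped: a new entry simply overwrites whatever was there.
        self.entries[self.index(va)] = (va >> PAGE_SHIFT, tte)


if __name__ == "__main__":
    tsb = TSB(512)
    tsb.insert(0x2000, "tte-a")
    print(tsb.lookup(0x2ABC))   # same 8K page: hit
    print(tsb.lookup(0x4000))   # different page, empty slot: miss
```

Because each address maps to exactly one slot, two pages whose page numbers differ by a multiple of the TSB size evict each other, which is why a miss here falls back to the page tables rather than being treated as a fault.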

The SPARC HAT layer (named sfmmu after the codename spitfire MMU, the first UltraSPARC MMU supported) uses an open hashing technique to implement the page tables in software. The hash lookup is performed using the struct hat pointer for the currently running process and the virtual address of the TLB miss. On a TSB miss, the function sfmmu_tsb_miss_tt in sfmmu_asm.s searches the hash for successive page sizes using the GET_TTE() assembly macro. If a match is found, the TTE is inserted into the TSB, loaded into the TLB, and the trapped instruction is re-issued. If a match is not found, or the access type does not match the permitted access for this mapping (e.g. a write is attempted to a read-only mapping), control is transferred to the sys_trap() routine in mach_locore.s after setting up the appropriate fault type. The sys_trap() routine (which is very involved due to SPARC's register windows) saves the machine state to the stack, switches from trap context to kernel context, and invokes the kernel-side trap handler in C, trap() over in trap.c. The trap() routine recognizes the T_DATA_MMU_MISS trap code and branches to pagefault() in vm_dep.c. As its x64 counterpart does, pagefault() collects the appropriate arguments and invokes the common handler as_fault().
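
The shape of that TSB miss path can be sketched as a chained hash table probed once per page size. Everything here is a simplification for illustration: the bucket count, the key layout, the dict used as a TTE, and the names `HashPageTable` and `tsb_miss` are hypothetical, and the page sizes assumed are 8K, 64K, 512K and 4M. The real code does this in assembly with a different hash and data layout.

```python
PAGE_SHIFTS = [13, 16, 19, 22]   # assumed page sizes: 8K, 64K, 512K, 4M
NBUCKETS = 64                    # hypothetical bucket count


class HashPageTable:
    """Toy open-hashing (chained) page table keyed by (hat, page, size)."""

    def __init__(self):
        self.buckets = [[] for _ in range(NBUCKETS)]

    def _bucket(self, key):
        return self.buckets[hash(key) % NBUCKETS]

    def insert(self, hat_id, va, shift, tte):
        key = (hat_id, va >> shift, shift)
        self._bucket(key).append((key, tte))

    def lookup(self, hat_id, va):
        # Probe once per page size, smallest first, walking each chain.
        for shift in PAGE_SHIFTS:
            key = (hat_id, va >> shift, shift)
            for candidate, tte in self._bucket(key):
                if candidate == key:
                    return tte
        return None


def tsb_miss(table, hat_id, va, is_write):
    """Decide what the miss handler does next: reload the TLB, or hand
    the fault to the C trap path."""
    tte = table.lookup(hat_id, va)
    if tte is None or (is_write and not tte["writable"]):
        return "sys_trap"
    return "load_tlb"


if __name__ == "__main__":
    table = HashPageTable()
    table.insert(1, 0x400000, 22, {"pfn": 7, "writable": False})
    print(tsb_miss(table, 1, 0x412345, is_write=False))
    print(tsb_miss(table, 1, 0x412345, is_write=True))
```

Note how a read of an address inside the 4M mapping only succeeds on the last probe, and how a write to the same read-only mapping takes the same exit as a missing translation.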

For more information about the sfmmu HAT layer, keep coming back -- this subsystem warrants a more in-depth tour in future blogs.

Part 3: Common Code Layer

The Solaris virtual memory (VM) subsystem uses a segmented model to map each process' address space, as well as the kernel itself. Each segment object maps a contiguous range of virtual memory with common attributes. The backing store for each segment may be device memory, a file, physical memory, etc. Each backing store type is handled by a different segment driver. The most commonly used segment driver is seg_vn, so named because it maps vnodes associated with files. Perhaps more interestingly, the seg_vn segment driver is also responsible for implementing anonymous memory, so called because it is private to a process and is backed by swap space rather than by a file object. Since seg_vn maps the majority of a process' address space, including all text, heap, and stack, I'll use it to illustrate the most common page fault path encountered by a process⁴.

Returning to the page fault path, assume that the page fault being examined has occurred in a virtual address range that corresponds to a process heap -- for instance, the first touch of new memory allocated by a brk() system call performed by the C library's malloc() routine. Such a fault will allocate process private, anonymous memory which is pre-filled with zeros, known to VM geeks as a ZFOD fault -- short for zero fill on demand. In such a situation, the as_fault() routine (vm_as.c) will search the process' segment tree looking for the segment that maps the virtual address range corresponding to the fault. If as_fault() discovers that no such segment exists, a fatal segmentation violation is signalled to the process, causing it to terminate. In our example, a segment is found whose seg_ops corresponds to segvn_ops (seg_vn.c). The SEGOP_FAULT() macro is called, which invokes the segvn_fault() routine in seg_vn.c. In our example, the backing store is swap, so segvn_faultpage() will find there is no vnode backing this range, but rather an anon object, and will allocate a page to back this virtual address through anon_zero() in vm_anon.c.
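
The control flow just described (find the segment, fail if there is none, otherwise dispatch to the segment driver, which zero-fills on first touch) can be modelled in a few lines. This is a toy under stated assumptions: a sorted list stands in for the segment tree, a plain function stands in for the driver's fault operation, the 4K page size is arbitrary, and every name in the block is invented for the sketch rather than taken from the kernel.

```python
import bisect

PAGE_SIZE = 4096   # hypothetical page size for the sketch


class Segment:
    """A contiguous virtual range plus the fault routine of its driver."""

    def __init__(self, base, size, fault_op):
        self.base = base
        self.size = size
        self.fault_op = fault_op


class AddressSpace:
    def __init__(self):
        self.segs = []   # kept sorted by base address

    def map(self, seg):
        bases = [s.base for s in self.segs]
        self.segs.insert(bisect.bisect_left(bases, seg.base), seg)

    def find_seg(self, addr):
        """Return the segment containing addr, or None."""
        bases = [s.base for s in self.segs]
        i = bisect.bisect_right(bases, addr) - 1
        if i >= 0 and addr < self.segs[i].base + self.segs[i].size:
            return self.segs[i]
        return None


def as_fault_sketch(aspace, addr):
    seg = aspace.find_seg(addr)
    if seg is None:
        return "SIGSEGV"               # no segment maps this address
    return seg.fault_op(seg, addr)     # dispatch to the segment driver


def make_anon_fault(pages):
    """Build a fault routine that zero-fills pages on first touch."""
    def anon_fault(seg, addr):
        page_base = addr & ~(PAGE_SIZE - 1)
        if page_base not in pages:
            pages[page_base] = bytearray(PAGE_SIZE)   # zero fill on demand
        return pages[page_base]
    return anon_fault


if __name__ == "__main__":
    pages = {}
    aspace = AddressSpace()
    aspace.map(Segment(0x10000, 0x4000, make_anon_fault(pages)))
    print(len(as_fault_sketch(aspace, 0x10010)), sorted(pages))
    print(as_fault_sketch(aspace, 0x14000))
```

The first touch of an address inside the segment creates a zeroed page, later faults on the same page find it already present, and an address outside every segment gets the fatal signal.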

Here is a sample callstack into anon_zero() as viewed from DTrace on my workstation (which is a dual-CPU Opteron running the 64-bit Solaris kernel):
# dtrace -qn '::anon_zero:entry { stack(8); exit(0); }'

              genunix`segvn_faultpage+0x16c
              genunix`segvn_fault+0x647
              genunix`as_fault+0x3c8
              unix`pagefault+0x7e
              unix`trap+0x792
              unix`_cmntrap+0x83
Here is another sample callstack into anon_zero(), this time from an Ultra-Enterprise 10000:
# dtrace -qn '::anon_zero:entry { stack(8); exit(0); }'

              genunix`segvn_faultpage+0x238
              genunix`segvn_fault+0x920
              genunix`as_fault+0x4b8
              unix`pagefault+0xac
              unix`trap+0xc5c
              unix`utl0+0x4c
Note that, in both cases, we can only see back as far as where we switched to kernel context, since trap context uses only registers or scratch space for its work and does not save traceable stack frames for us.

For those of you who wish to trace the path of a process' zero fill page faults from beginning to end, you may do so quite easily by running this DTrace script, as root. The script takes one argument, which is the exec name of the binary to trace. I recommend a simple one like "ls" since it is relatively small and short lived.

1 The topic of the details of SPARC TLB handling is one that will take many blog entries to cover from beginning to end, so I'm skipping over many of the details here for now. For the impatient, pick up a copy of Solaris Internals by Jim Mauro and Richard McDougall (ISBN 0-13-022496-0); though much of the material is dated now, many of the details are still accurate.

2 This TSB mechanism could be employed by the hardware for a little extra effort. No current sun4u systems do so, but some future systems may support the TSB lookup in hardware.

3 I'm skipping a step here for the sake of brevity -- there are actually two TSB searches in the case of a process which is using large pages, since a separate 4M-indexed TSB is kept for large pages. If the process is using 4M or larger pages, the second TSB must also be searched prior to a TSB miss. This second search is performed using a software generated TSB index, since the hardware assist only generates an 8K-indexed TSB pointer into the first TSB. See sfmmu_udtlb_slowpath() in the source if you care to see what really happens... Go on, you really have the source now, so no excuses :)

4 In some ways, this is unfortunate because the seg_vn segment driver is the most complicated of all the segment drivers in the Solaris VM subsystem, and as such has a very steep learning curve. Within Sun, we often joke that nobody understands how it all works, as it has evolved over a period of many, many years, and all of the original implementors have since moved on or are now part of Sun's upper management. While the spirit of the code hasn't changed significantly from the original SVR4 code, much of the complexity added over the years has evolved to support modern features like superpages that were not anticipated in the original design. This can make for a few twists and turns in the source even for following the path of a simple example like our ZFOD fault.



(2005-06-14 09:04:04.0/2005-06-14 09:00:00.0) Permalink
Trackback: http://blogs.sun.com/elowe/entry/page_fault_handling_in_solaris

