Andrew Rutz's blog
Solaris boot-time and time-complexity
The only problem with algorithms reified as software programs is that they have to run on a real computer. Wall Street and Main Street are not interested in the theoretical world of Turing machines . As experienced in the real world, an algorithm must have a "running time" that is "practical". As taught in computer science classes, "practicality" is represented as time-complexity, which is represented using "O notation" , as in kO(n) (for an algorithm whose execution time is linearly proportional to a number of inputs, n). It's that darn little k, though, that makes this blog more interesting.
Until Solaris supported platforms with very many resources (CPUs, main memory, IO devices), its boot algorithm, in general, was "conveniently agnostic" towards resource-dense systems. Until the last half of the 1990's, user and engineering focus had been given to measurements that flattered the system's steady-state capabilities (e.g., Dhrystones, Whetstones, MIPs , and bandwidth). However, a fairly recent industry focus on RAS (Reliability, Availability, and Serviceability) accordingly swayed the system's internal and external customers to the degree that the system's non-steady-state lifecycle events (boot time, crash time, low-power-cycle time) were equally important. Mission-critical systems are anything if not practical, and their return-on-investment and effectiveness were begrudgingly admitted to also be a function of the "punctuations" within a system's lifetime.
So, along comes Solaris 10 and its internal attack on significantly reducing boot time. Optimization of one function within the boot algorithm reduced boot time about 30% (on systems with a sparse IO device tree). cached_va_to_pa() is the optimized form of va_to_pa(). cached_va_to_pa() caches the last virtual- to physical-address translation so as to significantly decrease the translation cost.
va_to_pa() translates a kernel virtual address ("VA") into a physical address ("PA"). After Solaris has booted, the kernel uses its virtual memory (VM) metadata to quickly translate a VA to a PA. Early during boot, however, "the firmware" (OBP) is (still) responsible for this translation.
As any optimization in good-standing would be expected to do, cached_va_to_pa() leverages empirically-derived edge-cases to specialize a general purpose algorithm. Even though va_to_pa() is still one of the most popular dynamic call sites in Solaris, its generality during steady-state execution is its liability during boot-time.
cached_va_to_pa()'s lone static call site has this call-graph:
_start
main
startup
startup_modules
sfmmu_init_nucleus_hblks
cached_va_to_pa
However, this lone static call-site is (dynamically) called millions of times during boot. It initializes one field within a key data structure used within Solaris's virtual memory implementation. At this point during boot, cached_va_to_pa() is able to leverage these facts:
- the kernel is single-threaded
- a physical page has a statically-known and fixed size (8KB)
- the caller is initializing a static array of small data structures whose virtual addresses are contiguous
These facts allow cached_va_to_pa() to quickly translate a VA to a PA. cached_va_to_pa() caches the last (starting) PA that was computed. If the next translation maps to that same physical page, then the PA is reused. If not, a call to va_to_pa() is made (and the computed (starting) PA is saved). Trivially, since the kernel is single-threaded when cached_va_to_pa() is called, one is guaranteed that the previous and current calls are from the same kernel thread. As a result, the hit ratio of this one-element lookup cache is optimal: only one in twenty five calls (per 8KB physical page) requires a call to the "heavyweight" va_to_pa() implementation.
So, even though the time-complexity of this part of Solaris's boot algorithm is still linear (k * O(n)), the constant number of instructions used per physical page, k, is significantly smaller.
The large reduction is due to not having to call OBP. OBP's binary interface requires its client (e.g., Solaris) to implement a call that is consistent with OBP's stack-based machine. As a result, a call occurs by having Solaris translate its "C" language implementation values into FORTH language tokens, pushing the token(s), executing the tokens, popping the tokens, and translating the tokens back into "C" from FORTH.
Posted at 09:16PM Aug 23, 2005 by Andrew Rutz in Solaris | Comments[0]