This is a quick spin around one of the lower regions of the Solaris Virtual Memory subsystem, which you can now look at for yourself thanks to today's launch of OpenSolaris. The famous words from Dante's Inferno, “Abandon hope, all ye who enter here”, may be appropriate in this context. I will endeavour to play the part of Virgil and guide you faithfully through the darkness and the flames. Take my hand...
The hme_blk structures are used on SPARC systems to keep track of active mappings, and perform a function analogous to page tables on other architectures. Each hme_blk represents a contiguous area of mapped virtual memory for a particular address space, defined by a base page address and a span. A single hme_blk maps at most either eight virtually contiguous 8KB pages or one large page, which on current SPARC processors is 64KB, 512KB or 4MB. These structures are arranged in a series of hash buckets, with a hashing function based on an address space identifier, the base virtual address mapped by the hme_blk, and the page size. To locate a mapping, the system computes the hash for each page size in turn, using the base virtual address appropriate to that page size, and searches the resulting hash chain until a match is found. An 8KB hme_blk and a 64KB hme_blk both cover the same 64KB span, so a single search serves for these two sizes.
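To make the lookup concrete, here is a minimal sketch in C. It is not the code from the Solaris source: the structure layout, the hash function and every name in it (hme_blk_sketch, hme_hash, hme_lookup, NBUCKETS) are invented for illustration. The only things taken from the description above are the three spans searched (64KB, 512KB and 4MB) and the idea of trying each in turn.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 1024u  /* hypothetical bucket count, a power of two */
#define NLEVELS  3

/* log2 of the span covered by one hme_blk at each search level:
 * 64KB (8KB and 64KB pages share this level), 512KB, 4MB. */
static const unsigned span_shift[NLEVELS] = { 16, 19, 22 };

/* Illustrative stand-in for an hme_blk, not the real layout. */
struct hme_blk_sketch {
    uint32_t asid;                /* address space identifier */
    uint64_t base;                /* base virtual address, span aligned */
    unsigned level;               /* index into span_shift */
    struct hme_blk_sketch *next;  /* hash chain link */
};

static struct hme_blk_sketch *buckets[NBUCKETS];

/* Hypothetical hash mixing address space, base address and size. */
unsigned
hme_hash(uint32_t asid, uint64_t base, unsigned level)
{
    uint64_t h = asid ^ (base >> span_shift[level]) ^ ((uint64_t)level << 7);
    return (unsigned)(h & (NBUCKETS - 1));
}

/* Round a virtual address down to the start of its span. */
uint64_t
span_base(uint64_t va, unsigned level)
{
    return va & ~((UINT64_C(1) << span_shift[level]) - 1);
}

void
hme_insert(struct hme_blk_sketch *blk)
{
    unsigned b;

    assert(blk != NULL && blk->level < NLEVELS);
    b = hme_hash(blk->asid, blk->base, blk->level);
    blk->next = buckets[b];
    buckets[b] = blk;
}

/* Try each span in turn, smallest first, until a chain holds a match. */
struct hme_blk_sketch *
hme_lookup(uint32_t asid, uint64_t va)
{
    for (unsigned level = 0; level < NLEVELS; level++) {
        uint64_t base = span_base(va, level);
        struct hme_blk_sketch *p = buckets[hme_hash(asid, base, level)];

        for (; p != NULL; p = p->next) {
            if (p->asid == asid && p->base == base && p->level == level)
                return p;
        }
    }
    return NULL;  /* no mapping at any size */
}
```

In this sketch an address inside a 4MB mapping costs three chain searches before it is found, which is the repeated-search cost that the rest of this post keeps coming back to.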
There is, however, a problem when unmapping an entire user address space. The obvious way to do this would be to search the hash chains for each hme_blk associated with the process address space, by iterating through the entire process virtual address range in steps equal to the span of one hme_blk. With a large user address space, and given that with 8KB pages an hme_blk can only map 64KB, this would be time consuming and inefficient. To speed up this process, the shadow hme_blk was introduced. Each hme_blk allocated for a process has an additional, larger, shadow hme_blk. These are dummy hme_blk structures which have the shadow flag set and, for 8KB pages, map 4MB rather than 64KB. The strategy for unmapping an address space is to walk through it, searching the hash chains for 4MB mappings. In an address space sparsely populated with mappings, the usual case is that no mapping exists for a given 4MB range, so no shadow hme_blk is found and that address range can be skipped. If, however, a 4MB shadow hme_blk is found, then there must be at least one smaller, real mapping, so we now step through the 4MB address range looking for the real mappings.
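The saving is easiest to see by counting searches. The following is a toy model in C, assuming only 8KB-page hme_blks (64KB span) with 4MB shadows; a flat array stands in for the hash chains, and all of the names (blk, find, map_8k_blk, unmap_range, probes) are made up for this sketch rather than taken from the Solaris source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SPAN_8K_BLK (UINT64_C(64) << 10)  /* 64KB: eight 8KB pages */
#define SPAN_SHADOW (UINT64_C(4) << 20)   /* 4MB shadow span */
#define MAX_BLKS    64

/* Toy stand-in for an hme_blk. */
struct blk {
    uint64_t base;
    uint64_t span;
    bool shadow;
    bool live;
};

static struct blk table[MAX_BLKS];
static size_t nblks;
static unsigned long probes;  /* number of simulated hash chain searches */

/* One simulated hash chain search. */
struct blk *
find(uint64_t base, uint64_t span, bool shadow)
{
    probes++;
    for (size_t i = 0; i < nblks; i++) {
        if (table[i].live && table[i].base == base &&
            table[i].span == span && table[i].shadow == shadow)
            return &table[i];
    }
    return NULL;
}

void
add_blk(uint64_t base, uint64_t span, bool shadow)
{
    assert(nblks < MAX_BLKS);
    table[nblks++] = (struct blk){ base, span, shadow, true };
}

/* Create the real 64KB block for va and its 4MB shadow if missing. */
void
map_8k_blk(uint64_t va)
{
    uint64_t base = va & ~(SPAN_8K_BLK - 1);
    uint64_t sbase = va & ~(SPAN_SHADOW - 1);

    if (find(base, SPAN_8K_BLK, false) == NULL)
        add_blk(base, SPAN_8K_BLK, false);
    if (find(sbase, SPAN_SHADOW, true) == NULL)
        add_blk(sbase, SPAN_SHADOW, true);
}

/* Walk [start, end) in 4MB steps; descend only where a shadow exists.
 * start is assumed to be 4MB aligned. Returns real blocks removed. */
unsigned
unmap_range(uint64_t start, uint64_t end)
{
    unsigned removed = 0;

    for (uint64_t va = start; va < end; va += SPAN_SHADOW) {
        struct blk *sh = find(va, SPAN_SHADOW, true);

        if (sh == NULL)
            continue;  /* nothing mapped in this 4MB range: skip it */
        for (uint64_t sub = va; sub < va + SPAN_SHADOW; sub += SPAN_8K_BLK) {
            struct blk *b = find(sub, SPAN_8K_BLK, false);

            if (b != NULL) {
                b->live = false;
                removed++;
            }
        }
        sh->live = false;
    }
    return removed;
}
```

With three 64KB blocks spread over two 4MB regions of a 1GB range, this walk makes 256 shadow searches plus 64 searches in each of the two populated regions, 384 in total. Stepping through the same range 64KB at a time would take 16384 searches.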
A number of possible improvements spring to mind when considering this design. The multiple scans of hash chains when locating a mapping may be inefficient, particularly if the hash chains are long, and future SPARC processors will have larger page sizes, increasing the number of scans required. It would also be more cache friendly if large mappings were grouped, so that an hme_blk held more than a single large mapping. My initial proposal was to redesign the hme_blk structure so as to integrate all the different page size mappings into a single larger structure which always maps a fixed span of 32MB. This would avoid the current behaviour of searching for each page size in turn. Instead, we would search the hash chains once for the hme_blk mapping a 32MB span and then locate smaller mappings within the span. This proposal was implemented, but unfortunately it proved difficult to show measurable performance improvements on standard benchmarks, and so it was decided that further investigation was required. There are several complex issues involved in measuring performance in this area.
One of the new features introduced in Solaris 10 is the Dynamic TSB. The TSB (Translation Storage Buffer) is a software cache of translations. When a TLB miss occurs, software checks the TSB to see if the mapping is present. If it is, the TLB is reloaded and processing continues; otherwise the hme hash chains are searched. If a mapping is located in the hash chains, then both the TSB and the TLB are updated; otherwise this is treated as a page fault. In earlier versions of Solaris, the TSB for each process was statically allocated at system startup and potentially shared. The Dynamic TSB project provides support for up to two per-process TSBs, one per page size, which can grow and shrink as necessary, significantly reducing TSB misses. As a result, most searches of the hme_blk hash chains now happen during process startup, shutdown and the fork system call. These events occur relatively infrequently in most benchmarks, and so improved hme_blk performance can be hard to measure. Also, in order to see any performance improvement the hme_blk hash chains must have a non-zero length, otherwise multiple hash chain searches have effectively no cost. However, determining a sensible hash chain length to use for performance testing is difficult without data from real user workloads. The actual chain length is the result of three factors: the number of hme_blk hash buckets, the amount of virtual memory allocated on the system, and the effectiveness of the hashing algorithm.
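The miss-handling order described above can be sketched in a few lines of C. This is only a model: the real handlers are written in SPARC assembly, and everything here (the direct-mapped tsb array, TSB_ENTRIES, the fixed mappings table standing in for the hme hash chains, and the function names) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TSB_ENTRIES 512u  /* hypothetical size, a power of two */
#define PAGE_SHIFT  13    /* 8KB pages */

enum miss_result { HIT_TSB, HIT_HASH, PAGE_FAULT };

struct tsb_entry {
    bool valid;
    uint64_t vpn;  /* virtual page number, used as the tag */
    uint64_t pfn;  /* physical page number */
};

static struct tsb_entry tsb[TSB_ENTRIES];

/* Stand-in for the hme hash chains: a small fixed set of mappings.
 * Both pages index the same TSB slot, so they evict each other. */
struct mapping {
    uint64_t vpn;
    uint64_t pfn;
};

static const struct mapping mappings[] = {
    { 0x100, 0x9000 },
    { 0x300, 0x9001 },
};

/* Stand-in for the slow path: search the (simulated) hash chains. */
bool
hme_hash_search(uint64_t vpn, uint64_t *pfn)
{
    for (size_t i = 0; i < sizeof (mappings) / sizeof (mappings[0]); i++) {
        if (mappings[i].vpn == vpn) {
            *pfn = mappings[i].pfn;
            return true;
        }
    }
    return false;
}

/* TLB miss: try the TSB, then the hash chains, else page fault. */
enum miss_result
handle_tlb_miss(uint64_t va, uint64_t *pfn)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    struct tsb_entry *e = &tsb[vpn & (TSB_ENTRIES - 1)];

    assert(pfn != NULL);
    if (e->valid && e->vpn == vpn) {
        *pfn = e->pfn;  /* fast path: reload the TLB from the TSB */
        return HIT_TSB;
    }
    if (hme_hash_search(vpn, pfn)) {
        e->valid = true;  /* refill the TSB so the next miss is fast */
        e->vpn = vpn;
        e->pfn = *pfn;
        return HIT_HASH;
    }
    return PAGE_FAULT;
}
```

Touching page 0x100 twice gives a hash chain search followed by a TSB hit. Touching page 0x300 in between evicts the first entry, because both pages fall on the same slot of this small TSB, and forces another hash chain search. A TSB that can grow makes that kind of conflict rarer.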
As a result of these issues, this is now an area for further investigation, and only some fairly minor changes have been integrated into the Solaris source. One of the main changes concerns the number of hme_blk hash buckets. This is calculated at system startup from the amount of memory attached to the system, and it used to have an upper limit based on a system with 16GB of memory. That limit was out of date with respect to the much larger memory systems now being produced, and so it was removed. The number of hash buckets is calculated in the function ndata_alloc_hat using the formula:
uhmehash_num = (npages * HMEHASH_FACTOR) / (HMENT_HASHAVELEN * (HMEBLK_SPAN(TTE8K) >> MMU_PAGESHIFT));
Here HMEHASH_FACTOR (= 16) is the factor used to estimate the maximum virtual memory likely to be allocated on a system from the amount of physical memory attached (npages), and HMENT_HASHAVELEN (= 4) is the average hash chain length which the system is designed to handle. The HMEBLK_SPAN macro returns the span of an 8KB hme_blk, which is 64KB, and shifting by MMU_PAGESHIFT converts this into a count of 8KB pages (eight). The uhmehash_num value is then rounded up to the next power of two so that it can be used efficiently in the hash calculation. To guard against over-allocating hash buckets on larger memory systems (>64GB), the calculated number of hash buckets is instead rounded down to the next smaller power of two.
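As a worked example, the calculation can be reproduced in a few lines of standalone C. The constants are the values quoted above; HMEBLK_SPAN_8K stands in for HMEBLK_SPAN(TTE8K), and the helper names and the exact handling of the 64GB threshold are assumptions of this sketch, not a copy of ndata_alloc_hat. In particular, this sketch leaves a value that is already a power of two unchanged.

```c
#include <assert.h>
#include <stdint.h>

#define MMU_PAGESHIFT    13  /* 8KB base pages */
#define HMEHASH_FACTOR   16
#define HMENT_HASHAVELEN 4
#define HMEBLK_SPAN_8K   ((uint64_t)64 * 1024)  /* HMEBLK_SPAN(TTE8K) */

/* The formula from the text, before any rounding. */
uint64_t
raw_uhmehash_num(uint64_t npages)
{
    return (npages * HMEHASH_FACTOR) /
        (HMENT_HASHAVELEN * (HMEBLK_SPAN_8K >> MMU_PAGESHIFT));
}

/* Smallest power of two >= n. */
uint64_t
round_up_pow2(uint64_t n)
{
    uint64_t p = 1;

    assert(n > 0 && n <= (UINT64_C(1) << 63));
    while (p < n)
        p <<= 1;
    return p;
}

/* Largest power of two <= n. */
uint64_t
round_down_pow2(uint64_t n)
{
    uint64_t p = 1;

    assert(n > 0);
    while (p <= n / 2)
        p <<= 1;
    return p;
}

/* Round up normally, but down on systems with more than 64GB. */
uint64_t
uhmehash_num_sketch(uint64_t npages)
{
    uint64_t raw = raw_uhmehash_num(npages);
    uint64_t pages_64gb = (UINT64_C(64) << 30) >> MMU_PAGESHIFT;

    assert(raw > 0);
    return (npages > pages_64gb) ? round_down_pow2(raw) : round_up_pow2(raw);
}
```

The divisor works out to 4 * 8 = 32, so the raw bucket count is simply npages / 2. A 16GB system (2097152 pages) gets 1048576 buckets, a 24GB system (3145728 pages) is rounded up from 1572864 to 2097152, and a 96GB system (12582912 pages) is rounded down from 6291456 to 4194304.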
Brave souls who have come this far may want to browse the files that implement this functionality: hat_sfmmu.c, hat_sfmmu.h and sfmmu_asm.s.