Paul Sandhu's Weblog


Tuesday October 09, 2007

Shared Context and other UltraSPARC T2 processor MMU Features

This is a brief tour of some of the new memory management features in the UltraSPARC T2 processor, an area with a number of interesting changes. Here is a brief summary:


Hardware Table Walk

Hardware table walk enables the Memory Management Unit (MMU) hardware to search the Translation Storage Buffers (TSBs), the software-maintained address translation caches, in the event of a miss in the Translation Lookaside Buffer (TLB), the hardware address translation cache.


Shared Context

This is an innovative feature which allows the TLB to be used more efficiently by processes using shared memory. We had to make substantial software changes to take advantage of this.


Improved TSB support

There is also better hardware support for the TSBs: up to 4 TSBs can be configured for each of user and kernel (that is, 4 user and 4 kernel). Larger TSBs, up to 256MB, were already supported by the UltraSPARC T1 hardware, and software support for them has now been added. Previously 2 TSBs were used and mappings were stored according to page size: mappings with a page size less than 4MB went in the first TSB and those with a page size greater than or equal to 4MB went in the second. The 4 TSB support allows the 2 additional TSBs to be used to support shared contexts.
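To make the TSB split concrete, here is a minimal C sketch of how a mapping might be steered to one of the four TSBs. It is not the Solaris code: the enum names and tsb_for_mapping are made up for illustration, and splitting the two shared TSBs by page size in the same way as the private ones is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical TSB slots for one process; the names are illustrative only. */
enum tsb_slot {
    TSB_PRIVATE_SMALL,  /* private mappings, page size < 4MB  */
    TSB_PRIVATE_LARGE,  /* private mappings, page size >= 4MB */
    TSB_SHARED_SMALL,   /* shared context mappings, page size < 4MB (assumed split)  */
    TSB_SHARED_LARGE    /* shared context mappings, page size >= 4MB (assumed split) */
};

#define FOUR_MB ((size_t)4 << 20)

/* Pick the TSB for a mapping from its page size and whether it is shared. */
static enum tsb_slot tsb_for_mapping(size_t page_size, int is_shared)
{
    /* Page sizes are non-zero powers of two. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    int large = page_size >= FOUR_MB;
    if (is_shared)
        return large ? TSB_SHARED_LARGE : TSB_SHARED_SMALL;
    return large ? TSB_PRIVATE_LARGE : TSB_PRIVATE_SMALL;
}
```

With this split, an 8KB private mapping lands in the first TSB and a 4MB private mapping in the second, matching the old two-TSB behaviour for processes that do not use a shared context.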


Hashed Cache Index

The hashed cache index feature eliminates one of the disadvantages of using 4MB and 256MB pages: a reduction in the number of page colours available, which in turn leads to increased contention in the L2 cache.


I've been working mainly as part of the team implementing shared contexts software, so I'll describe this in a little more detail.


Shared Context – more detailed description

Well, why do we need contexts to begin with? Imagine multiple processes executing the same program. Each will map the same program segments to the same virtual addresses, so the question is how we find a match for these addresses in the TLB. Obviously, a straight lookup on the virtual address doesn't work, because the same virtual address in different processes may have to map to different physical addresses. So, the MMU designers came up with the idea of a context: each process has a unique context number, and when we look up an address in the TLB we search for a match on the <va, ctx> pair. So, that's fine, surely we can put our feet up and watch daytime TV now. Unfortunately this doesn't work well when multiple processes actually share memory which is mapped at the same virtual address within each process. This scenario is pretty common with Oracle and other databases. In this case the underlying physical addresses are the same, but the TLB cannot take advantage of that: each process has a different context, so each process has to load its own copy of the shared mapping into the TLB. Bad news - this TLB real estate is expensive!


The MMU designers went back to the drawing board and came up with the idea of having more than one context. How is this going to help anybody? Well, suppose the MMU hardware searches the TLB for a match on either <va, ctx0> or <va, ctx1>, and all the processes which share memory segments attached at the same virtual address load these shared mappings into the TLB using ctx1. Then we always get a hit on <va, ctx1>, and the TLB entries are shared rather than being per-process.
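The two-context match can be modelled in a few lines of C. This is only a software model of what the hardware does, assuming a tiny fully associative TLB; struct tlb_entry and tlb_lookup are hypothetical names, and a real TTE also carries page size, permissions and so on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified TLB entry for illustration. */
struct tlb_entry {
    bool     valid;
    uint64_t va;   /* virtual page address */
    uint16_t ctx;  /* context number the entry was loaded with */
    uint64_t pa;   /* physical page address */
};

/*
 * Search for a match on <va, ctx0> or <va, ctx1>.
 * ctx0 is the process's private context, ctx1 the shared context.
 * Returns the matching entry, or NULL on a TLB miss.
 */
static const struct tlb_entry *
tlb_lookup(const struct tlb_entry *tlb, size_t n,
           uint64_t va, uint16_t ctx0, uint16_t ctx1)
{
    assert(tlb != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        const struct tlb_entry *e = &tlb[i];
        if (e->valid && e->va == va && (e->ctx == ctx0 || e->ctx == ctx1))
            return e;
    }
    return NULL;
}
```

Two processes with private contexts 5 and 6 that both name context 100 as ctx1 will hit the same entry for a shared page loaded with context 100, while a page loaded with context 5 stays invisible to the process using context 6.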


All that is left for the software team to do now is manage this new context register, which turns out to be pretty tricky. How do we identify processes which can use shared contexts? There isn't an existing mechanism in Solaris to identify sharing at this level. We found an interesting white paper which considered this problem: "Improving the Address Translation Performance of Widely Shared Pages" by Yousef A. Khalidi and Madhusudhan Talluri. We defined a Region as a segment attached to an address space with a fixed set of attributes, which included the virtual address it was attached at, its size, its access permissions and the underlying object it represented. The idea was to represent each of these regions by an integer, the region identifier, which could be used as an index into a bitmap. Basically, we end up with two bitmaps: one per process, to represent the regions attached to it, and one for the Shared Context Domain, which represents the regions that are common to a number of processes and can be loaded with the shared context. In order to limit the size of the bitmaps required, we restricted the scope of the region ids to processes which share the same executable file. Looking at common applications, most sharing took place in this situation. This gave rise to the Shared Region Domain, which is the data structure used to manage the regions associated with a particular executable file.
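The two bitmaps can be sketched in C as follows. This is an illustration rather than the Solaris implementation: region_map_t, MAX_REGIONS and the helper names are invented here, and building the domain bitmap as the intersection of two process bitmaps is an assumption about how "common to a number of processes" might be computed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REGIONS 256                  /* assumed limit on region ids per executable */
#define WORD_BITS   64
#define MAP_WORDS   (MAX_REGIONS / WORD_BITS)

/* One bit per region id. */
typedef struct { uint64_t w[MAP_WORDS]; } region_map_t;

static void region_set(region_map_t *m, unsigned rid)
{
    assert(rid < MAX_REGIONS);
    m->w[rid / WORD_BITS] |= (uint64_t)1 << (rid % WORD_BITS);
}

static bool region_test(const region_map_t *m, unsigned rid)
{
    assert(rid < MAX_REGIONS);
    return (m->w[rid / WORD_BITS] >> (rid % WORD_BITS)) & 1;
}

/* Regions attached to both processes: candidates for the shared context. */
static region_map_t region_common(const region_map_t *a, const region_map_t *b)
{
    region_map_t out;
    for (size_t i = 0; i < MAP_WORDS; i++)
        out.w[i] = a->w[i] & b->w[i];
    return out;
}

/* A mapping is loaded with the shared context only if its region is
 * attached to the process and is part of the domain. */
static bool use_shared_ctx(const region_map_t *proc,
                           const region_map_t *domain, unsigned rid)
{
    return region_test(proc, rid) && region_test(domain, rid);
}
```

Restricting region ids to one executable file is what keeps MAX_REGIONS, and so the bitmaps, small enough for these word-at-a-time operations to stay cheap.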


One of the ideas in the white paper which proved particularly useful was shared hme entries. Intimate Shared Memory (ISM) already supports shared hme entries but has its own specific mechanism to do this; other methods of sharing memory duplicate hme entries so that each process has its own copy. The implementation of Regions and Shared Region Domains allows the sharing of hme entries to be used more widely. We also implemented shared mappings in TSBs for mappings which are part of a Shared Context Domain. So, a process which is part of a Shared Context Domain will have 2 private TSBs and 2 TSBs which are shared with all the other processes in the Domain.

( Oct 09 2007, 01:10:06 PM PDT )

Tuesday June 14, 2005

Changes to Hme Block Handling



This is a quick spin around one of the lower regions of the Solaris Virtual Memory subsystem, which you can now have a look at thanks to today's launch of OpenSolaris. The famous words from Dante's Inferno, “Abandon Hope all ye that Enter”, may be appropriate in this context. I will endeavour to play the part of Virgil and guide you faithfully through the darkness and the flames. Take my hand...

The hme_blk structures are used, on SPARC systems, to keep track of active mappings and perform a function analogous to page tables on other architectures. Each hme_blk represents a contiguous area of mapped virtual memory for a particular address space, defined by a base page address and a span. A single hme_blk maps a maximum of either eight 8KB virtually contiguous pages or one large page, which on current SPARC processors is 64KB, 512KB or 4MB. These structures are arranged in a series of hash buckets with a hashing function based on an address space identifier, the base virtual address which is mapped by the hme_blk and the page size. In order to locate a mapping, the system hashes using the appropriate base virtual address and page size for each page size in turn (8KB and 64KB hme_blks map the same virtual address range and so only a single search is required for these) until a match is found.
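The lookup described above, one hash search per hme_blk span, might look like this in C. It is a simplified sketch, not the code in hat_sfmmu.c: the structure fields, NBUCKETS and in particular the hash function are assumptions chosen for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 64   /* illustrative; must be a power of two */

struct hme_blk {
    struct hme_blk *next;        /* next block on the same hash chain */
    uint64_t        as_id;       /* address space identifier */
    uint64_t        base_va;     /* base virtual address, aligned to the span */
    unsigned        span_shift;  /* log2 of the span in bytes */
};

/* Spans searched in turn: 64KB (8KB and 64KB pages), 512KB, 4MB. */
static const unsigned span_shifts[] = { 16, 19, 22 };

/* Toy hash over the three inputs named in the text. */
static size_t hme_hash(uint64_t as_id, uint64_t base_va, unsigned span_shift)
{
    uint64_t h = as_id ^ (base_va >> span_shift) ^ span_shift;
    return (size_t)(h & (NBUCKETS - 1));
}

static void hme_insert(struct hme_blk **table, struct hme_blk *blk)
{
    assert((blk->base_va & (((uint64_t)1 << blk->span_shift) - 1)) == 0);
    size_t b = hme_hash(blk->as_id, blk->base_va, blk->span_shift);
    blk->next = table[b];
    table[b] = blk;
}

/* Try each span in turn until a block covering va is found. */
static struct hme_blk *
hme_lookup(struct hme_blk *const *table, uint64_t as_id, uint64_t va)
{
    for (size_t i = 0; i < sizeof span_shifts / sizeof span_shifts[0]; i++) {
        unsigned s = span_shifts[i];
        uint64_t base = va & ~(((uint64_t)1 << s) - 1);
        for (struct hme_blk *p = table[hme_hash(as_id, base, s)]; p != NULL; p = p->next)
            if (p->as_id == as_id && p->base_va == base && p->span_shift == s)
                return p;
    }
    return NULL;
}
```

A 4MB mapping is only found on the third pass, after the 64KB and 512KB searches have failed, which is the cost of the repeated searches for each span.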

There is, however, a problem when unmapping an entire user address space. The obvious way to do this would be to search the hash chains for each hme_blk associated with the process address space, iterating through the entire process virtual address range in steps equal to the span of one hme_blk. With a large user address space, and given that with 8KB pages an hme_blk can only map 64KB, this would be time consuming and inefficient. In order to speed up this process the shadow hme_blk was introduced. Each hme_blk allocated for a process has an additional, larger, shadow hme_blk. These are dummy hme_blk structures which have the shadow flag set and map 4MB rather than 64KB for 8KB pages. The strategy for unmapping an address space is to walk through it, searching the hash chains for 4MB mappings. In an address space sparsely populated with mappings, the usual case is that no mapping exists for a given 4MB range, so no 4MB shadow hme_blk will be found and that address range can be skipped. If, however, a 4MB shadow hme_blk is found, then there must be at least one smaller, real mapping, so we now step through the 4MB address range looking for the real mappings.
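Here is a rough C sketch of the shadow walk. It is not the Solaris code: a linear scan stands in for the hash chain search, only 8KB-page blocks and their 4MB shadows are modelled, and struct blk, find_blk and unmap_range are invented names. The function counts lookups so that the saving is visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLK_SPAN    ((uint64_t)64 << 10)  /* eight 8KB pages per hme_blk */
#define SHADOW_SPAN ((uint64_t)4 << 20)   /* span covered by a shadow hme_blk */

struct blk {
    uint64_t base;    /* base virtual address, aligned to its span */
    bool     shadow;  /* true for a 4MB shadow block */
    bool     live;    /* cleared once the block has been unmapped */
};

/* Stand-in for a hash chain search. */
static struct blk *find_blk(struct blk *blks, size_t n, uint64_t base, bool shadow)
{
    for (size_t i = 0; i < n; i++)
        if (blks[i].live && blks[i].shadow == shadow && blks[i].base == base)
            return &blks[i];
    return NULL;
}

/* Unmap [start, end). Returns the number of lookups made and stores the
 * number of real blocks unmapped in *unmapped. */
static size_t unmap_range(struct blk *blks, size_t n,
                          uint64_t start, uint64_t end, size_t *unmapped)
{
    assert(start % SHADOW_SPAN == 0 && end % SHADOW_SPAN == 0);

    size_t probes = 0;
    *unmapped = 0;
    for (uint64_t big = start; big < end; big += SHADOW_SPAN) {
        struct blk *sh = find_blk(blks, n, big, true);
        probes++;
        if (sh == NULL)
            continue;               /* no shadow: skip the whole 4MB range */

        /* A shadow exists, so look for the real 64KB blocks inside it. */
        for (uint64_t va = big; va < big + SHADOW_SPAN; va += BLK_SPAN) {
            struct blk *b = find_blk(blks, n, va, false);
            probes++;
            if (b != NULL) {
                b->live = false;
                (*unmapped)++;
            }
        }
        sh->live = false;           /* the shadow is no longer needed */
    }
    return probes;
}
```

For a 64MB range holding a single real block, this makes 16 shadow lookups plus 64 lookups inside the one populated 4MB range, 80 in total, where a plain walk in 64KB steps would need 1024.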

There are a number of possible improvements that spring to mind when considering this design. The multiple scans of hash chains when locating a mapping may be inefficient, particularly if the hash chains are long, and future SPARC processors will have larger page sizes, increasing the number of scans required. Also, it would be more cache friendly if large mappings were grouped, so that there was more than a single mapping per hme_blk. My initial proposal was to redesign the hme_blk structure to integrate all the different page size mappings into a single larger structure which always maps a fixed span of 32MB. This would avoid the current behaviour of multiple searches, one for each page size in turn; instead we would search the hash chains once for the hme_blk mapping a 32MB span and then locate smaller mappings within the span. This proposal was implemented but, unfortunately, it proved difficult to show measurable performance improvements on standard benchmarks, and so it was decided that further investigation was required. There are several complex issues involved in measuring performance in this area.

One of the new features introduced in Solaris 10 is the Dynamic TSB. The TSB is a software cache: when a TLB miss occurs, software checks the TSB to see if the mapping is present. If it is, the TLB is reloaded and processing continues; otherwise the hme hash chains are searched. If a mapping is located in the hash chains, then both the TSB and the TLB are updated; otherwise this is treated as a page fault. In earlier versions of Solaris the TSB for each process was statically allocated at system startup and potentially shared. The Dynamic TSB project provides support for up to two per-process TSBs, one per page size, which can grow and shrink as necessary, significantly reducing TSB misses. As a result, most searches of the hme_blk hash chains happen during process startup, shutdown and the fork system call. These events occur relatively infrequently on most benchmarks, and so improved hme_blk performance can be hard to measure. Also, in order to see any performance improvement the hme_blk hash chains must have a non-zero length, otherwise multiple hash chain searches have effectively no cost. However, determining a sensible hash chain length to use for performance testing is difficult without data from real user workloads. The actual chain length is the result of three factors: the number of hme_blk hash buckets, the amount of virtual memory allocated on the system and the effectiveness of the hashing algorithm.
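The miss path described above (TSB first, then the hash chains, then a page fault) can be sketched in C like this. The real handlers are written in assembler and are considerably more involved; struct tte, TSB_ENTRIES, handle_tlb_miss and the demo_hash_lookup stand-in are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TSB_ENTRIES 512   /* illustrative size; must be a power of two */

struct tte {
    bool     valid;
    uint64_t vpn;   /* virtual page number */
    uint64_t pfn;   /* physical frame number */
};

enum miss_result { MISS_TSB_HIT, MISS_HASH_HIT, MISS_PAGE_FAULT };

/* Stand-in for the hme_blk hash chain search. */
typedef bool (*hash_lookup_fn)(uint64_t vpn, uint64_t *pfn);

/* Demo lookup for experimenting: only page 42 is mapped, to frame 7. */
static bool demo_hash_lookup(uint64_t vpn, uint64_t *pfn)
{
    if (vpn == 42) {
        *pfn = 7;
        return true;
    }
    return false;
}

static enum miss_result
handle_tlb_miss(struct tte *tsb, uint64_t vpn,
                hash_lookup_fn lookup, struct tte *tlb_slot)
{
    assert(tsb != NULL && lookup != NULL && tlb_slot != NULL);

    /* The TSB is directly mapped: one candidate slot per page number. */
    struct tte *e = &tsb[vpn & (TSB_ENTRIES - 1)];
    if (e->valid && e->vpn == vpn) {
        *tlb_slot = *e;             /* reload the TLB from the TSB */
        return MISS_TSB_HIT;
    }

    uint64_t pfn;
    if (lookup(vpn, &pfn)) {
        e->valid = true;            /* update the TSB ... */
        e->vpn = vpn;
        e->pfn = pfn;
        *tlb_slot = *e;             /* ... and the TLB */
        return MISS_HASH_HIT;
    }
    return MISS_PAGE_FAULT;         /* no mapping anywhere */
}
```

Calling it twice for page 42 with demo_hash_lookup shows the effect of the cache: the first miss goes to the hash chains, the second is satisfied from the TSB.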

As a result of these issues, this is now an area for further investigation and only some fairly minor changes have been integrated into the Solaris source. One of the main changes was that, at system startup, the number of hme_blk hash buckets was calculated based on the amount of memory attached to the system, with an upper limit based on a system with 16GB of memory. This was out of date with respect to the much larger memory systems now being produced and so the limit was removed. The number of hash buckets is calculated in the function ndata_alloc_hat using the formula

    uhmehash_num = (npages * HMEHASH_FACTOR)/ 
       (HMENT_HASHAVELEN * (HMEBLK_SPAN(TTE8K) >> MMU_PAGESHIFT)); 

Where HMEHASH_FACTOR (= 16) is the factor used to obtain an estimate for the maximum virtual memory likely to be allocated on a system based on the amount of physical memory attached (npages), and HMENT_HASHAVELEN (= 4) is the average hash chain length which the system is designed to handle. The HMEBLK_SPAN macro returns the span of an 8KB hme_blk which is 64KB and shifting by MMU_PAGESHIFT, converts this into 8KB pages. The uhmehash_num value is then rounded up to the next largest power of two so that it can be used efficiently in the hash calculation. In order to guard against over-allocating hash buckets, on larger memory systems (>64GB) the calculated number of hash buckets is scaled down to the next smaller power of two rather than up.
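As a worked example of the arithmetic, here is a small standalone C function that applies the formula and the rounding rule. The constants are the values quoted above; hme_buckets and LARGE_MEM_PAGES are names invented for this sketch, and leaving a value that is already a power of two unchanged is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define MMU_PAGESHIFT    13                     /* 8KB base pages */
#define HMEHASH_FACTOR   16
#define HMENT_HASHAVELEN 4
#define HMEBLK_SPAN_8K   ((uint64_t)64 << 10)   /* span of an 8KB hme_blk */
#define LARGE_MEM_PAGES  (((uint64_t)64 << 30) >> MMU_PAGESHIFT)  /* 64GB in 8KB pages */

/* Number of user hme_blk hash buckets for npages 8KB pages of memory. */
static uint64_t hme_buckets(uint64_t npages)
{
    assert(npages > 0);

    uint64_t n = (npages * HMEHASH_FACTOR) /
        (HMENT_HASHAVELEN * (HMEBLK_SPAN_8K >> MMU_PAGESHIFT));

    /* Smallest power of two that is >= n. */
    uint64_t pow2 = 1;
    while (pow2 < n)
        pow2 <<= 1;

    /* Above 64GB, round down instead of up to avoid over-allocating. */
    if (npages > LARGE_MEM_PAGES && pow2 > n)
        pow2 >>= 1;

    return pow2;
}
```

With these constants the formula reduces to npages / 2. A 24GB system has 3145728 pages, giving 1572864, which rounds up to 2097152 buckets; a 96GB system has 12582912 pages, giving 6291456, which rounds down to 4194304.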

Brave souls that have come this far may want to browse the files that implement this functionality: hat_sfmmu.c, hat_sfmmu.h and sfmmu_asm.s.

( Jun 14 2005, 08:33:51 AM PDT )

Friday May 13, 2005

Introduction

Here's a brief introduction as a prelude to presenting more technical content in the future. I've worked as a software engineer with Sun for the past two years, mostly on SPARC-specific Virtual Memory projects, in particular the HAT (Hardware Address Translation) layer, which Joe describes in The Mad Hatter. I'm based in the UK on the Guillemont Park campus, having previously worked with Solaris support for five years. Prior to that I worked for ICL/Fujitsu for more years than I care to count, mostly on kernel development on DRS6000, their multi-processor SPARC platform.

The HAT Layer

The HAT layer lends itself to a number of doubtful puns which I will carefully avoid. This layer acts as an interface between the generic VM code above it and the architecture-specific memory management hardware beneath. I have mostly been involved with the HAT implementation that interfaces with the Sun4u platform, known as the sfmmu. If you browse the source code you will find a large number of VM routines prefixed with sfmmu, to indicate that they belong to this layer. One of the problems with making sense of this code is that a general understanding of the memory management hardware is required in order to get a good idea of what is going on.

SPARC MMU Hardware

The SPARC MMU hardware consists of (at least) two Translation Lookaside Buffers or TLBs together with a number of support registers. The TLBs vary in size and structure depending upon the particular processor type. Older processors have 64 entry fully associative TLBs; more recent processors have increased the number of entries but reduced the associativity. The common factors are that there are separate instruction and data TLBs and that each entry in the TLB is a TTE, which may be thought of as the equivalent of a page table entry or PTE on other architectures. A TTE is made up of two components, the tag and the translation data, each 64 bits long. The TTE tag contains the encoded virtual address and context number, and the TTE data contains the corresponding physical address together with various properties associated with the translation. The context number is a 13 bit quantity which is used to distinguish between different user address spaces, so that the same virtual addresses in different address spaces can coexist in the TLB. One of the most significant properties of the mapping is its size. Each TTE maps a contiguous area of memory, which can be 8KB, 64KB, 512KB or 4MB in size on current processors. Larger page sizes will be supported on future processors.

A TLB hit occurs if both the virtual address and context supplied to the TLB correspond to those of a particular TTE. In the event of a TLB miss trap, an in-memory array of translations, called the Translation Storage Buffer, provides a software managed, directly mapped cache which is used to reload the TLB. When a translation is not present in the TSB, a software lookup mechanism is used to obtain the TTE. This is based on the hme_blk and its associated data structures, which are used to keep track of active mappings. These structures perform a similar function to page tables in other architectures.
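The hit test can be illustrated with a small C model of a TTE. The struct below is a readable stand-in, not the real 64 bit tag and data layout, and tte_translate is a hypothetical helper; the size codes rely on each page size being eight times the previous one, starting from 8KB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CTX_BITS 13
#define CTX_MASK ((1u << CTX_BITS) - 1)   /* largest 13 bit context number */

/* Page size codes: 8KB, 64KB, 512KB, 4MB. */
enum tte_size { TTE8K, TTE64K, TTE512K, TTE4M };

/* Illustrative TTE: fields are unpacked rather than bit-encoded. */
struct tte {
    uint64_t      va;    /* tag: virtual address of the page */
    uint16_t      ctx;   /* tag: context number */
    enum tte_size size;  /* data: page size */
    uint64_t      pa;    /* data: physical address of the page */
};

/* log2 of the page size: 13, 16, 19 or 22. */
static unsigned tte_page_shift(enum tte_size sz)
{
    return 13 + 3 * (unsigned)sz;
}

/* On a hit (same context, same virtual page) store the translated
 * physical address in *pa and return true. */
static bool tte_translate(const struct tte *t, uint64_t va,
                          uint16_t ctx, uint64_t *pa)
{
    assert(ctx <= CTX_MASK);

    uint64_t off_mask = ((uint64_t)1 << tte_page_shift(t->size)) - 1;
    if (t->ctx != ctx || (t->va & ~off_mask) != (va & ~off_mask))
        return false;

    *pa = (t->pa & ~off_mask) | (va & off_mask);
    return true;
}
```

The larger the page, the more low-order address bits are ignored by the comparison, which is why a single 4MB TTE can stand in for 512 separate 8KB entries.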

hme_blk structures

The hme_blk structures each define virtual to physical mappings for a particular address space and virtual address range. They are organised into a series of hash buckets based on an address space identifier, the virtual address and the page size used. In the event of a TSB miss a hash of these elements is used to obtain the correct hash bucket and then a linear search of the list is made to find the corresponding hme_blk for the mapping.

Something Completely Different

My main interests outside work are focussed on music and books. I am passionate about classical music, and one of the advantages of living in the UK (apart from the weather!) is that we have an excellent classical radio station in BBC Radio 3. One of my favourite writers is Saul Bellow, who died recently; I have admired his work for many years. More recently, I have just finished a book by Halldór Laxness, a Nobel prize winning Icelandic writer, called "Under the Glacier", which is both very amusing and unlike anything I have ever read before.

( May 13 2005, 07:02:12 AM PDT )

