Paul Sandhu's Weblog

Tuesday October 09, 2007
Shared Context and other UltraSPARC T2 processor MMU Features
This is a brief tour
of some of the new memory management features in the UltraSPARC T2 processor. There
are a number of interesting changes in this area. Here is a
brief summary:
Hardware Table Walk
Enables the Memory Management Unit (MMU) hardware
to search the Translation Storage Buffers (TSBs), the software-maintained address
translation caches, in the event of a miss in the Translation Lookaside Buffer (TLB), the
hardware address translation cache.
Shared Context
This is an innovative
feature which allows the TLB to be used more efficiently by processes
using shared memory. We had to make substantial software changes to
take advantage of this.
Improved TSB support
There is also better
hardware support for the TSBs, in that up to 4 TSBs (that is 4 user
and 4 kernel) can be configured. Larger TSBs, up to 256MB, were already supported by
the UltraSPARC T1 hardware, but software support for them has only now been added. Previously 2 TSBs
were used and mappings were stored in the TSBs according to page size,
so mappings with a page size less than 4MB were stored in the first
TSB and those with a page size greater than or equal to 4MB were stored in the
second TSB. The 4 TSB support allows the 2 additional TSBs to be
used to support shared contexts.
Hashed Cache Index
The hashed cache feature
eliminates one of the disadvantages of using 4MB and 256MB pages,
which is a reduction in the number of page colours available that in turn
leads to increased contention in the L2 cache.
I've been working mainly
as part of the team implementing shared contexts software, so I'll
describe this in a little more detail.
Shared Context – more detailed description
Well, why do we need
contexts to begin with? Imagine multiple processes executing the same
program. These will map the same program segments to the same
virtual addresses within each process, so the question is how we
find a match for these addresses in the TLB. Obviously, a straight
lookup doesn't work because the same virtual addresses within each
process must map to different physical addresses. So, the MMU
designers came up with the idea of a context: each process has a
unique context number and when we look up an address in the TLB we
search for a match on the <va, ctx> pair. So, that's fine; surely we
can put our feet up and watch daytime TV now. Unfortunately this
doesn't work well in the case in which multiple processes actually
share memory which is mapped at the same virtual address within each
process. This scenario is pretty common with Oracle and other
databases. In this case we have all these <va, ctx> mappings in
the TLB and the underlying physical addresses are the same.
However, the TLB will not notice because each process has a
different context, and so each process has to load its own
copy of the shared mapping into the TLB. Bad news: this TLB real
estate is expensive!
The MMU designers went
back to the drawing board and came up with the idea of having more
than one context. How is this going to help anybody? Well, suppose the
MMU hardware searches the TLB for a match on either <va, ctx0>
or <va, ctx1>, and all the processes which share memory segments
that are attached at the same virtual address load these shared
mappings into the TLB using ctx1. Then we always get a hit on <va,
ctx1> and the TLB entries are shared rather than being
per-process.
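The two-context match can be sketched in a few lines of C. This is only an
illustration of the idea: the tlb_entry layout and the tlb_lookup function are
hypothetical names, and a real TLB does this comparison in hardware rather than
with a loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified TLB entry: real TTEs carry much more state. */
struct tlb_entry {
    bool     valid;
    uint64_t va;   /* virtual page address (tag) */
    uint16_t ctx;  /* context number (tag) */
    uint64_t pa;   /* physical page address (data) */
};

/*
 * Return the matching entry, or NULL on a miss. An entry hits if its
 * virtual address matches and its context equals either the private
 * context (ctx0) or the shared context (ctx1).
 */
const struct tlb_entry *
tlb_lookup(const struct tlb_entry *tlb, size_t n, uint64_t va,
    uint16_t ctx0, uint16_t ctx1)
{
    assert(tlb != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (tlb[i].valid && tlb[i].va == va &&
            (tlb[i].ctx == ctx0 || tlb[i].ctx == ctx1))
            return &tlb[i];
    }
    return NULL;
}
```

Two processes with different private contexts but the same shared context both
hit on a single entry loaded with ctx1, while a private entry still only
matches its owner.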
All that is left for the
software team to do now is manage this new context register, which
turns out to be pretty tricky. How do we identify processes which can
use shared contexts? There isn't an existing mechanism in Solaris to
identify sharing at this level. We found an interesting white paper
which considered this problem:
"Improving the Address Translation Performance of Widely Shared Pages" by Yousef A. Khalidi and
Madhusudhan Talluri. We defined a Region as a segment attached to an
address space with a fixed set of attributes which included the
virtual address it was attached at, its size, access permissions and
the underlying object it represented. The idea was to represent each
of these regions by an integer, the region identifier, which could be
used in a bitmap. Basically, we end up with two bitmaps: one per
process to represent the regions attached to it, and one for the
Shared Context Domain which represents the regions that are common to a
number of processes and can be loaded with the shared context. In
order to limit the size of the bitmaps required we restricted the
scope of the region ids to processes which share the same executable
file. Looking at common applications, most sharing took place in this
situation. This gave rise to the Shared Region Domain, which is the
data structure used to manage the regions associated with a
particular executable file.
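A rough sketch of the two bitmaps, in C, might look like the following. The
names (region_map, region_set, scd_is_subset) and the limit of 256 region ids
are made up for illustration. The subset test is my assumption of how a
process could be checked against a Shared Context Domain, not a description of
the actual Solaris code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_REGIONS 256                 /* hypothetical limit on region ids */
#define MAP_WORDS   (MAX_REGIONS / 64)

/* One bit per region id. */
struct region_map { uint64_t bits[MAP_WORDS]; };

void region_set(struct region_map *m, unsigned rid)
{
    assert(rid < MAX_REGIONS);
    m->bits[rid / 64] |= (uint64_t)1 << (rid % 64);
}

bool region_test(const struct region_map *m, unsigned rid)
{
    assert(rid < MAX_REGIONS);
    return (m->bits[rid / 64] >> (rid % 64)) & 1;
}

/* True if every region in the domain's map is also attached to the process. */
bool scd_is_subset(const struct region_map *scd, const struct region_map *proc)
{
    for (int i = 0; i < MAP_WORDS; i++)
        if (scd->bits[i] & ~proc->bits[i])
            return false;
    return true;
}
```

Restricting region ids to processes sharing one executable is what keeps
MAX_REGIONS, and therefore each bitmap, small.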
One of the ideas in the
white paper which proved particularly useful was shared hme entries.
Intimate Shared Memory (ISM) already supports shared hme entries but has its own specific
mechanism to do this. Other methods of sharing memory duplicate hme
entries so each process has its own copy. The implementation of
Regions and Shared Region Domains allows the sharing of hme
entries to be used more widely. We also implemented shared mappings
in TSBs for mappings which are part of a Shared Context Domain. So, a
process which is part of a Shared Context
Domain will have 2 private TSBs and 2 TSBs which are shared with all
the other processes in the Domain.
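As a small illustration of the four user TSBs, here is a C sketch that picks a
TSB for a mapping. The slot names and numbering are hypothetical, and I am
assuming the shared pair is split by page size at 4MB in the same way as the
private pair.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FOUR_MB (UINT64_C(4) << 20)

/* Hypothetical slot numbering for the four user TSBs. */
enum tsb_slot {
    TSB_PRIVATE_SMALL,  /* private mappings, page size < 4MB */
    TSB_PRIVATE_LARGE,  /* private mappings, page size >= 4MB */
    TSB_SHARED_SMALL,   /* Shared Context Domain mappings, < 4MB */
    TSB_SHARED_LARGE    /* Shared Context Domain mappings, >= 4MB */
};

/* Choose the TSB that would hold a mapping of the given page size. */
enum tsb_slot tsb_select(uint64_t page_bytes, bool shared)
{
    bool large = page_bytes >= FOUR_MB;

    /* Page sizes are powers of two. */
    assert(page_bytes != 0 && (page_bytes & (page_bytes - 1)) == 0);
    if (shared)
        return large ? TSB_SHARED_LARGE : TSB_SHARED_SMALL;
    return large ? TSB_PRIVATE_LARGE : TSB_PRIVATE_SMALL;
}
```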
( Oct 09 2007, 01:10:06 PM PDT )

Tuesday June 14, 2005
Changes to Hme Block Handling
This is a quick spin around one of the lower regions of the
Solaris Virtual Memory subsystem which you can have a look at now,
due to today's launch of OpenSolaris.
The famous words from Dante's Inferno “Abandon Hope all ye that
Enter” may be appropriate in this context. I will endeavour to
play the part of Virgil, and guide you faithfully, through the
darkness and the flames. Take my hand...
The hme_blk structures are used, on SPARC systems,
to keep track of active mappings and perform a function analogous to
page tables on other architectures. Each hme_blk represents a
contiguous area of mapped virtual memory for a particular address
space, defined by a base page address and a span. A single hme_blk
maps a maximum of either eight 8KB virtually contiguous pages or one
large page, which on current SPARC processors is 64KB, 512KB or 4MB.
These structures are arranged in a series of hash buckets with a
hashing function based on an address space identifier, the base
virtual address which is mapped by the hme_blk and the page
size. In order to locate a mapping, the system hashes using the
appropriate base virtual address and page size for each page size in
turn (8KB and 64KB hme_blks map the same virtual address range
and so only a single search is required for these) until a match is
found.
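The lookup just described can be sketched as follows. This is a toy model, not
the kernel code: the hme_blk structure here is heavily cut down, NBUCKETS and
the hash mixing are invented for the example, and the real hash function in
hat_sfmmu.h is different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 64   /* hypothetical; must be a power of two */

/* Spans searched in turn: 64KB (covers 8KB and 64KB pages), 512KB, 4MB. */
static const uint64_t hblk_span[] = { 64u << 10, 512u << 10, 4u << 20 };
#define NSPANS (sizeof (hblk_span) / sizeof (hblk_span[0]))

struct hme_blk {              /* simplified stand-in for the real structure */
    struct hme_blk *next;     /* hash chain link */
    uint64_t        as_id;    /* address space identifier */
    uint64_t        base;     /* base virtual address, span aligned */
    unsigned        span_idx; /* index into hblk_span[] */
};

/* Illustrative hash only; the kernel's hash function differs. */
static size_t hme_hash(uint64_t as_id, uint64_t base, unsigned span_idx)
{
    uint64_t h = as_id ^ (base / hblk_span[span_idx]) ^
        ((uint64_t)span_idx << 20);
    return (size_t)(h & (NBUCKETS - 1));
}

void hme_insert(struct hme_blk **table, struct hme_blk *b)
{
    size_t i;

    assert((b->base & (hblk_span[b->span_idx] - 1)) == 0);
    i = hme_hash(b->as_id, b->base, b->span_idx);
    b->next = table[i];
    table[i] = b;
}

/* Probe once per span size until a block covering va is found. */
struct hme_blk *hme_lookup(struct hme_blk **table, uint64_t as_id, uint64_t va)
{
    for (unsigned s = 0; s < NSPANS; s++) {
        uint64_t base = va & ~(hblk_span[s] - 1);
        struct hme_blk *b = table[hme_hash(as_id, base, s)];

        for (; b != NULL; b = b->next)
            if (b->as_id == as_id && b->base == base && b->span_idx == s)
                return b;
    }
    return NULL;
}
```

Note that a miss costs one chain search per span size, which is the repeated
scanning discussed further on.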
There is however a problem when unmapping an entire user address
space. The obvious way to do this would be to search the hash chains
for each hme_blk associated with the process address space, by
iterating through the entire process virtual address range in steps
equal to the span of one hme_blk. With
a large user address space and given that with 8KB pages an hme_blk
can only map 64KB, this would be time consuming and inefficient. In
order to speed up this process the shadow hme_blk was
introduced. Each hme_blk allocated for a process has an
additional, larger, shadow hme_blk. These are dummy
hme_blk structures which have the shadow flag set and map 4MB
rather than 64KB for 8KB pages. The strategy for unmapping an address
space is to walk through it, searching the hash chains for
4MB mappings. In an address space sparsely populated with mappings
the usual case is that no mapping exists for this range, so no 4MB
shadow hme_blk will be found and that address range can be
skipped. If, however, a 4MB shadow hme_blk is found then there
must be at least one smaller, real mapping, so we now step through
the 4MB address range looking for the real mappings.
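To see why the shadow blocks help, here is a small C model of the walk. It is
a sketch under simplifying assumptions: the hash chains are replaced by a flat
array, only 8KB-page blocks (64KB span) and 4MB shadows are modelled, and the
names blk, probe and unmap_range are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REAL_SPAN   (64u << 10)  /* eight 8KB pages */
#define SHADOW_SPAN (4u << 20)   /* span of a shadow hme_blk */

/* Stand-in for the hash chains: a flat list of (base, is_shadow) blocks. */
struct blk { uint64_t base; bool shadow; };

struct walk_stats { size_t probes; size_t unmapped; };

/* One simulated hash chain search; counts itself in st->probes. */
static bool probe(const struct blk *b, size_t n, uint64_t base, bool shadow,
    struct walk_stats *st)
{
    st->probes++;
    for (size_t i = 0; i < n; i++)
        if (b[i].base == base && b[i].shadow == shadow)
            return true;
    return false;
}

/* Walk [start, end) in 4MB steps; descend only where a shadow block exists. */
struct walk_stats unmap_range(const struct blk *b, size_t n,
    uint64_t start, uint64_t end)
{
    struct walk_stats st = { 0, 0 };

    assert(start % SHADOW_SPAN == 0 && end % SHADOW_SPAN == 0);
    for (uint64_t va = start; va < end; va += SHADOW_SPAN) {
        if (!probe(b, n, va, true, &st))
            continue;                       /* nothing mapped in this 4MB */
        for (uint64_t r = va; r < va + SHADOW_SPAN; r += REAL_SPAN)
            if (probe(b, n, r, false, &st))
                st.unmapped++;              /* a real block: unmap it here */
    }
    return st;
}
```

For a 64MB range containing a single real block, this walk makes 16 shadow
probes plus 64 probes inside the one populated 4MB range, 80 in total, against
1024 probes for a walk in 64KB steps.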
There are a number of possible improvements that spring to mind
when considering this design. The multiple scans of hash chains when
locating a mapping may be inefficient, particularly if the hash
chains are long, and future SPARC processors will have larger page
sizes, therefore increasing the number of scans required. Also, it
would be more cache friendly if large mappings were grouped, so that
there was more than a single mapping per hme_blk. My initial
proposal was to redesign the hme_blk structure so as to
integrate all the different page size mappings into a single larger
structure, which always maps a fixed span size of 32MB. This would
avoid the current behaviour of multiple searches for each page size
in turn. Instead we would search the hash chains once for the hme_blk
mapping a 32MB span and then locate smaller mappings within the span.
This proposal was implemented, but unfortunately it proved difficult
to show measurable performance improvements on standard benchmarks
and so it was decided that further investigation was required. There
are several complex issues involved in measuring performance in this
area.
One of the new features introduced in Solaris 10 is the Dynamic
TSB. The TSB is a software cache. When a TLB miss occurs, software
checks the TSB to see if the mapping is present. If it is, the TLB is
reloaded and processing continues; otherwise the hme hash chains are
searched. If a mapping is located in the hash chains, then both the
TSB and the TLB are updated; otherwise this is treated as a page
fault. In earlier versions of Solaris the TSB for each process was
statically allocated at system startup and potentially shared. The
Dynamic TSB project provides support for up to two per-process TSBs,
one per page size, which can grow and shrink as necessary, therefore
significantly reducing TSB misses. As a result of this, most searches
of the hme_blk hash chains are during process startup,
shutdown and the fork system call. These events occur relatively
infrequently on most benchmarks and so improved hme_blk
performance can be hard to measure. Also, in order to see any
performance improvement the hme_blk hash chains must have a
non-zero length, otherwise multiple hash chain searches have
effectively no cost. However, determining a sensible hash chain
length to use for performance testing is difficult without data from
real user workloads. The actual chain length itself is a result of
three factors: the number of hme_blk hash buckets, the amount
of virtual memory allocated on the system and the effectiveness of
the hashing algorithm.
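The TLB miss path described at the start of this section (TSB first, then the
hash chains, then a page fault) can be sketched in C as below. Everything here
is a simplified assumption: the tte structure, the 512 entry TSB and the flat
array standing in for the hme hash chains are all invented for illustration,
and the real handler is hand-written assembler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT  13         /* 8KB pages */
#define TSB_ENTRIES 512        /* hypothetical size, power of two */

struct tte { bool valid; uint64_t vpn; uint64_t pfn; };

enum miss_result { HIT_TSB, HIT_HASH, PAGE_FAULT };

/* Stand-in for the hme hash chain search: a flat array of mappings. */
static const struct tte *hash_search(const struct tte *maps, size_t n,
    uint64_t vpn)
{
    for (size_t i = 0; i < n; i++)
        if (maps[i].valid && maps[i].vpn == vpn)
            return &maps[i];
    return NULL;
}

/* Handle a TLB miss for va; *out receives the TTE to load into the TLB. */
enum miss_result tlb_miss(struct tte *tsb, const struct tte *maps, size_t n,
    uint64_t va, struct tte *out)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    struct tte *slot;
    const struct tte *m;

    assert(tsb != NULL && out != NULL);
    slot = &tsb[vpn & (TSB_ENTRIES - 1)];   /* direct mapped */
    if (slot->valid && slot->vpn == vpn) {
        *out = *slot;
        return HIT_TSB;
    }
    m = hash_search(maps, n, vpn);
    if (m == NULL)
        return PAGE_FAULT;
    *slot = *m;      /* refill the TSB ... */
    *out = *m;       /* ... and the TLB */
    return HIT_HASH;
}
```

Once the TSB has been refilled, a second miss on the same page is satisfied
without touching the hash chains, which is why a well-sized TSB makes hme_blk
search costs hard to observe.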
As a result of these issues, this is now an area for further
investigation and only some fairly minor changes have been integrated
into the Solaris source. One of the main changes was that, at system
startup, the number of hme_blk hash buckets was calculated
based on the amount of memory attached to the system, with an upper
limit based on a system with 16GB of memory. This was out of date
with respect to the much larger memory systems now being produced and
so the limit was removed. The number of hash buckets is calculated in
the function ndata_alloc_hat
using the formula
uhmehash_num = (npages * HMEHASH_FACTOR)/
(HMENT_HASHAVELEN * (HMEBLK_SPAN(TTE8K) >> MMU_PAGESHIFT));
Where HMEHASH_FACTOR (= 16) is the factor used to
obtain an estimate for the maximum virtual memory likely to be
allocated on a system based on the amount of physical memory attached
(npages), and HMENT_HASHAVELEN (= 4) is the average hash chain length
which the system is designed to handle. The HMEBLK_SPAN macro returns
the span of an 8KB hme_blk which is 64KB and shifting by
MMU_PAGESHIFT, converts this into 8KB pages. The uhmehash_num value
is then rounded up to the next largest power of two so that it can be
used efficiently in the hash calculation. In order to guard against
over-allocating hash buckets, on larger memory systems (>64GB) the
calculated number of hash buckets is scaled down to the next smaller
power of two rather than up.
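As a worked example, the calculation can be reproduced outside the kernel.
This is a sketch rather than the code from ndata_alloc_hat: HMEBLK_SPAN_8K is
a stand-in for HMEBLK_SPAN(TTE8K), the rounding helpers are my own, and I am
assuming that a value which is already a power of two is left unchanged and
that the 64GB test is made on npages converted to bytes.

```c
#include <assert.h>
#include <stdint.h>

#define MMU_PAGESHIFT     13                   /* 8KB base pages */
#define HMEHASH_FACTOR    16
#define HMENT_HASHAVELEN  4
#define HMEBLK_SPAN_8K    (UINT64_C(64) << 10) /* stand-in for HMEBLK_SPAN(TTE8K) */

/* Raw bucket count from the formula, before power-of-two rounding. */
uint64_t uhmehash_raw(uint64_t npages)
{
    return (npages * HMEHASH_FACTOR) /
        (HMENT_HASHAVELEN * (HMEBLK_SPAN_8K >> MMU_PAGESHIFT));
}

/* Smallest power of two >= x. */
static uint64_t round_up_pow2(uint64_t x)
{
    uint64_t p = 1;

    while (p < x)
        p <<= 1;
    return p;
}

/* Largest power of two <= x. */
static uint64_t round_down_pow2(uint64_t x)
{
    uint64_t p = 1;

    assert(x > 0);
    while ((p << 1) <= x)
        p <<= 1;
    return p;
}

/* Round up normally, but down on systems with more than 64GB of memory. */
uint64_t uhmehash_num(uint64_t npages)
{
    uint64_t raw = uhmehash_raw(npages);
    uint64_t mem_bytes = npages << MMU_PAGESHIFT;

    if (mem_bytes > (UINT64_C(64) << 30))
        return round_down_pow2(raw);
    return round_up_pow2(raw);
}
```

With these constants the formula reduces to npages / 2. A 12GB system has
1572864 pages, giving 786432, which rounds up to 1048576 buckets. A 96GB
system has 12582912 pages, giving 6291456, which rounds down to 4194304.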
Brave souls that have come this far may want to browse the files
that implement this functionality: hat_sfmmu.c,
hat_sfmmu.h
and sfmmu_asm.s.
( Jun 14 2005, 08:33:51 AM PDT )

Friday May 13, 2005
Introduction
Here's a brief introduction as a prelude to presenting more technical content in the future.
I've worked as a software engineer with Sun for the past two years, working mostly on SPARC
specific Virtual Memory related projects. In particular, I have worked on the HAT (Hardware Address Translation)
layer, which Joe describes in
The Mad Hatter.
I'm based in the UK on the Guillemont Park campus, having previously worked with Solaris support
for five years. Prior to that I worked for ICL/Fujitsu for more years than I care to count, mostly
on kernel development on DRS6000, their multi-processor SPARC platform.
The HAT Layer
The HAT layer
lends itself to a number of doubtful puns which I will carefully avoid. This layer acts as an
interface between the generic VM code above it and the architecture-specific memory management
hardware beneath. I have mostly been involved with the HAT software, which interfaces with the
Sun4u platform, which is known as the
sfmmu. If you browse the source code you will find
a large number of VM routines prefixed with sfmmu, to indicate that they belong to this layer.
One of the problems with making sense of this code is that a general understanding of the
memory management hardware is required in order to get a good idea of what is going on.
SPARC MMU Hardware
The SPARC MMU hardware consists of (at least) two
Translation Lookaside Buffers or
TLBs together with a number of support registers. The TLBs vary in size and structure
depending upon the particular processor type. Older processors have 64 entry fully associative TLBs,
more recent processors have increased the number of entries but reduced the associativity.
The common factors are that there are separate
instruction and data TLBs and that each entry in the TLB is a
TTE which may be thought
of as a page table entry or PTE on other architectures. A TTE is made up of two components
- the tag and the translation data, each of length 64 bits. The TTE tag contains the encoded
virtual address and context number, and the TTE data contains the corresponding physical address
together with various properties associated with the translation. The context number is a 13 bit
quantity which is used to distinguish between different user address spaces, so that the
same virtual addresses in different address spaces can coexist in the TLB. One of the
most significant properties of the mapping is its size. Each TTE maps a contiguous area
of memory, which can be 8KB, 64KB, 512KB or 4MB in size, on current processors. Larger page
sizes will be supported on future processors. A TLB hit occurs if both the virtual address
and context supplied to the TLB correspond to those of a particular TTE.
In the event of a TLB miss trap, an in-memory array of translations, called the
Translation Storage Buffer, provides a software managed, directly mapped cache,
which is used to reload the TLB. When a translation is not present in the TSB, a software
lookup mechanism is used to obtain the TTE. This is based on the
hme_blk
and its associated data structures which are used to keep track of active mappings.
These structures perform a similar function to page tables in other architectures.
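To make the TTE hit test concrete, here is a C sketch of a translation through
a single TTE. The structure is an unpacked illustration, not the real 64-bit
tag and data layout, and the tte_translate function is a hypothetical name.
The size encoding (8KB shifted left by three bits per step) is just a compact
way of producing the four sizes listed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CTX_BITS 13
#define MAX_CTX  ((1u << CTX_BITS) - 1)   /* largest 13-bit context: 8191 */

enum tte_size { TTE8K, TTE64K, TTE512K, TTE4M };

/* Page size in bytes: 8KB << (3 * size) gives 8KB, 64KB, 512KB, 4MB. */
static uint64_t tte_bytes(enum tte_size sz)
{
    return (uint64_t)8192 << (3 * (unsigned)sz);
}

struct tte {
    uint64_t      va_base;  /* tag: page-aligned virtual address */
    unsigned      ctx;      /* tag: 13-bit context number */
    uint64_t      pa_base;  /* data: page-aligned physical address */
    enum tte_size size;     /* data: page size */
};

/* On a hit, store the physical address in *pa and return true. */
bool tte_translate(const struct tte *t, uint64_t va, unsigned ctx, uint64_t *pa)
{
    uint64_t mask = tte_bytes(t->size) - 1;

    assert(ctx <= MAX_CTX);
    if (t->ctx != ctx || (va & ~mask) != t->va_base)
        return false;           /* wrong context or outside this page */
    *pa = t->pa_base | (va & mask);
    return true;
}
```

The same virtual address presented with a different context number misses,
which is what lets identical addresses from different address spaces coexist
in the TLB.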
hme_blk structures
The hme_blk structures each define virtual to physical mappings for a particular address
space and virtual address range. They are organised into a series of hash buckets based
on an address space identifier, the virtual address and the page size used. In the event
of a TSB miss a hash of these elements is used to obtain the correct hash bucket and then
a linear search of the list is made to find the corresponding hme_blk for the mapping.
Something Completely Different
My main interests outside work are focussed on music and books. I am passionate about classical
music, and one of the advantages of living in the UK (apart from the weather!) is that we have an
excellent classical radio station in
BBC Radio 3. One of my favourite
writers is Saul Bellow, who died recently; I have admired his work for many years.
More recently, I have just finished a book by Halldór Laxness,
a Nobel prize winning Icelandic writer, called "Under the Glacier", which is both very amusing
and unlike anything I have ever read before.
( May 13 2005, 07:02:12 AM PDT )