Jon Masamitsu's Weblog

pageicon Wednesday Aug 30, 2006

More of a Good Thing

The throughput collector is getting more parallel.

In our J2SE 6 release there is an option in the throughput collector to collect the tenured generation using multiple GC threads. Yes, the UseParallelGC collector wasn't all that we wanted it to be in terms of parallelism. Without this recent work only the young generation was being collected in parallel.

The option to turn on the parallel collection of the tenured generation is -XX:+UseParallelOldGC. That's the punch line for this blog. Oh, also it's off by default.

Below are a few details on the new collector. Actually, first I'll describe the serial collector used for the tenured generation collection and some of the things we learned from it. Then I'll describe the new collector and how we used what we had learned.

The J2SE 5 (and earlier) collector for the tenured generation was a serial mark-sweep-compact collector. There were four phases of that collector.

1. Mark all the live objects

2. Calculate the new locations of the objects

3. Adjust all object references to point to the new locations

4. Move the objects to their new locations

Phase 1 found and marked the live objects.

In phase 2 the collector kept a pointer to the first location in the tenured generation to which we could move a live object during the collection. Let's call that location T. T started at the bottom of the generation. The collector scanned forward from the bottom of the generation looking for marked (i.e., live) objects. When a live object A was found, the collector saved T in A (i.e., saved the new location of A in A). It then calculated the next location T as T + sizeof(A) and restarted the scan looking for the next live object to put at the new T.

Phase 3 started at the roots, references to objects known to the application (for the most part), and changed all the references in the roots and all the reference found from the roots to point to the new location of the objects (to the T's saved in the A's).

Phase 4 again started at the bottom of the tenured generation and scanned forward to find live objects. When a live object was found, it was moved to it's new location (the T's).

The end result was that all the live objects were moved to the lower end of the tenured generation and all the free space was in one contiguous region at the higher end of the tenured generation. It simplifies allocations from the tenured generation to have all the free space in a single region. Also there are no issues of fragmentation as would be the case if the collection resulted in multiple regions of free space separated by live objects. This single region of free space seemed an important property to keep when thinking about how to do a collection of the tenured generation using multiple GC threads.

Another feature of our serial mark-sweep-compact collector that we liked is the flexibility to leave some deadwood at the lower end of the tenured generation. We use the term "deadwood" to refer to garbage at the lower end of the generation that we're not going to collect. We don't collect it to reduce the number of objects that we have to move. If we have objects

A B C D

and if B is garbage and we collect it, then we have to move C and D. If we treat B as deadwood, then C and D can stay where they are. We make a judgment call to waste a certain amount of space (that occupied by the deadwood) in order to move fewer objects.

You may have noticed that each of the four phases of our mark-sweep-compaction walked over at least all the live objects. In some cases it actually walked over all the objects (live and dead). When we're scanning the heap (such as in phases 2), we look at a dead object just to find its size so that we can step over it. Any part of the collection that walks over the objects is costly so with the new parallel collector for the tenured generation we wanted to have fewer walks over the objects.

Keeping all this in mind resulted in a parallel collector that

* Marked the live objects using multiple GC threads and maintained some summary information (explained below).

* Used the summary information to determine where each live object would go. Also determined the amount of deadwood to keep.

* Moved live objects so there was one contiguous block of free space in the generation and updated references to live objects using multiple GC threads.

I'll refer to these three phases as the marking phase, summary phase and move-and-update phase. In some of the other documentation on this new collector the last phase is referred to as the compaction phase. I prefer move-and-update because that's the type of name used in the code. Just in case you're ever looking at the code.

Marking phase

Doing the marking in parallel was straight forward. The parallel young generation collector does this and a similar technique was used. The additional twist to the marking was creating the summary information. The heap is logically divided into chunks. Each object starts within a chunk. As we're doing the marking, when a live object was found, its size was added to the chunk containing the start of the object. At the end this gives us the size the of data for objects that begin in each chunk.

Summary phase

The summary phase first walks over the chunks looking for the extent of the heap that contains a good amount of deadwood. We call that part of the heap the dense prefix. Live objects that are moved will be placed in the heap starting immediately after the dense prefix. Given that we know the amount of live data for each chunk, we can calculate the new location of any live object using the summary data. Basically adding up the sizes of the live data in all the chunks before chunk A tells you where the live data in chunk A is going to land. The summary phase is currently done by one thread but can be done in parallel if it turns out to be a scaling bottleneck.

Update-and-move phase

Given the summary information the move-and-update phase identifies chunks that are ready to be filled and then moves the appropriate objects into those chunks. A chunk is ready to be filled if it is empty or if its objects are going to be compacted into that same chunk. There is always at least one of those. The move-and-update phase is done with multiple GC threads. One thing to note is that it is possible that there is only 1 ready chunk at any given time. If the chunks are A, B, C, D ... and the only dead object is in A and fits entirely in A, then A is the only ready chunk. The move-and-update phase will first fill any remaining space in A. Then B will be the only ready chunk. Obviously an extreme case but it makes the point. There is a technique to widen this "gap" of ready chunks. We expect to implement it, but it's not in this release.

That's basically it. We're working on some scaling issues with this new collector, but it working pretty well as is.

For more on this new collector see the slides for the GC presentation done at this year's JavaOne. They can be download from

http://java.sun.com/javaone/

The last time I looked, there was a "Click here" link for the technical sessions in the first paragraph. Following that link, download the Java SE bundle and in there our GC presentation is TS-1168.

pageicon Thursday Aug 24, 2006

The GC whitepaper

There's a very nice whitepaper on our GC at

http://java.sun.com/j2se/reference/whitepapers/memorymanagement_whitepaper.pdf

It talks about basic concepts, some specifics about our GC, suggestions on tuning, and some of the latest improvements such as the parallel old collector. If you've looked at the GC tuning guides, some of it will look familiar, but it nicely pulls together information from several sources.

pageicon Tuesday Aug 15, 2006

The Real Thing

Here's a real treat dished up by Ross K. who hangs out 3 offices down the hall. A while back I wrote a blog about thread local allocation buffers (TLAB's). The short comings of my blog have probably been annoying Ross so much that he's written this one for us. If there are any discrepancies between my earlier blog and this one, believe this one. Thanks, Ross.


A Thread Local Allocation Buffer (TLAB) is a region of Eden
that is used for allocation by a single thread.  It enables
a thread to do object allocation using thread local top and
limit pointers, which is faster than doing an atomic operation
on a top pointer that is shared across threads.


A thread acquires a TLAB at it's first object allocation
after a GC scavenge.  The size of the TLAB is computed via
a somewhat complex process discribed below.  The TLAB is
released when it is full (or nearly so), or the next GC scavenge
occurs. TLABs are allocated only in Eden, never from From-Space
or the OldGen.

Flags        	       default 	description

	    		       	
 UseTLAB     		true   	Use thread-local object allocation
 ResizeTLAB  		false  	Dynamically resize tlab size for threads
 TLABSize    	       (below) 	Default (or starting) size of a TLAB (in bytes)
 TLABWasteTargetPercent   1    	Percentage of Eden that can be wasted
 PrintTLAB              false  	Print various TLAB related information

AggressiveHeap settings:

  TLABSize   	       256Kb
  ResizeTLAB 	       true     (Corrected 2007/05/09)

Minor flags

 MinTLABSize             2*K   	Minimum allowed TLAB size (in bytes)
 TLABAllocationWeight     35   	Weight for exponential averaging of allocation
 TLABRefillWasteFraction  64   	Max TLAB waste at a refill (internal fragmentation)
 TLABWasteIncrement        4   	Increment allowed waste at slow allocation
 ZeroTLAB              false   	Zero out the newly created TLAB

These flags are for tuning the current implementation of
TLABs and maybe disappear or change their initial value in a
future release of the jvm.

If it is not specified on the command-line (or specified as zero)
via the -XX:TLABSize flag, the initial size of a TLAB is computed as:

  init_size = size_of_eden / (allocating_thread_count * target_refills_per_epoch)

where:

  a) Allocating_thread_count is the expected number of threads
    which will be actively allocating during the next epoch
    (an epoch is the mutator time between GC scavenges.)
    At jvm startup this is defined to be one.  It is then
    recomputed at each GC scavenge from the number of threads
    that did at least one allocation of a tlab during the
    latest epoch. It's then exponentially averaged over the
    past epochs.
  
  b) Target_refills_per_epoch is the desired number of tlab
    allocations per thread during an epoch.  It is computed from
    the value of TLABWasteTargetPercent which is the percentage of
    Eden allowed to be wasted due to TLAB fragmentation.
  
    From a mutator thread's perspective a GC scavenge can occur
    unexpectedly at any time.  So, on average, only half of a
    thread's current TLAB will be allocated when a GC scavenge
    occurs.
  
      TLABWasteTargetPercent = 0.5 * (1/target_refills_per_epoch) * 100
  
    Solving for target_refills_per_epoch:
  
      target_refills_per_epoch = ( 0.5 * 100) / TLABWasteTargetPercent
  
    With the default value of 1 for TLABWasteTargetPercent
  
      target_refills_per_epoch =  50


When TLABResize is true (which it is by default) the tlab size
  is recomputed for each thread that did an allocation in the latest
  epoch.  Threads that did not allocate in the latest epoch do not
  have their TLABs resized.  The resize goal is to get the number
  of refills closer to the ideal: target_refills_per_epoch (default
  value is 50).  For each thread, the number of refills in the latest
  epoch is exponentially averaged with values from previous
  epochs.  If this average refill number is greater than
  target refills_per_epoch, then the tlab size is increased.  If
  the average is less, the tlab size is decreased.
  
  The computation is (approximately):
  
     new_size = (old_size * avg_refills_per_epoch) / target_refills_per_epoch
  
  It's actually computed from the fraction of the latest epoch's
  eden size used by this thread, because the next epoch may use a
  resized eden.

To experiment with a specific TLAB size, two -XX flags need
  to be set, one to define the initial size, and one to disable
  the resizing:

     -XX:TLABSize= -XX:-ResizeTLAB

  The minimum size of a tlab is set with -XX:MinTLABSize which
  defaults to 2K bytes.  The maximum size is the maximum size
  of an integer Java array, which is used to fill the unallocated
  portion of a TLAB when a GC scavenge occurs.


Diagnostic Printing Options

-XX:+PrintTLAB

  Prints at each scavenge one line for each thread (starts with "TLAB: gc thread: "
  without the "'s) and one summary line.

Thread example:

TLAB: gc thread: 0x0004ac00 [id:  2] size: 61KB slow allocs: 5  refill waste: 980B
         alloc: 0.99996     3072KB refills: 50 waste  0.1% gc: 0B slow: 4144B fast: 0B

  The tag "gc" indicates that this information was printed at a GC scavenge,
  after the tlabs have been filled. The "gc" tag doesn't mean a thread is a gc
  thread.
  
  Fields:
  
  thread:
  
     The address of the gc thread structure and it's system thread id.
  
  size:
  
     The size of the tlab in kilobytes.
  
  slow allocs:
  
     The number of allocations too large for remaining space in the TLAB.
     The allocation was done directly in eden space.
  
  refill waste: (in HeapWord units)
  
     The name is truncated in the dump, and should be: refill_waste_limit
     and is used to limit the amount of wasted space from internal
     fragmentation.  If the remaining space in the TLAB is larger than
     this amount, and an allocation is requested that is too large to
     be allocated in the TLAB, then the allocation is done directly in
     Eden and the TLAB is not retired.  If the remaining space is less
     than refill_waste_limit then the TLAB is retired, a new TLAB is
     allocated, and the object allocation is attempted in the new TLAB.

     After each allocation outside of the TLAB, the refill_waste_limit
     is incremented by TLABWasteIncrement to prevent an allocation of
     a size slightly less than refill_waste_limit from continually
     being allocated outside of the TLAB.

  alloc: [fraction]  [sizeInBytes]

     Expected amount of eden allocated by this thread computed as
     a fraction of eden and number of heap words.

  refills:

     Number of tlab refills.     

  waste [percent] gc: [bytes] slow: [bytes] fast: [bytes]

     Percentage of eden allocated to this thread that was wasted.
     Waste is the sum of three components:
        gc:   unused space in the current TLAB when stopped for a scavenge.
        slow: sum of unused space in TLABs when they're retired to allocate a new one.
        fast: the client system can allocate a TLAB with a fast allocator.
              This is the amount of waste via that method.

Summary example:

TLAB totals: thrds: 1  refills: 50 max: 50 slow allocs: 5 max 5 waste:  0.1%
             gc: 0B max: 0B slow: 4144B max: 4144B fast: 0B max: 0B

  thrds:

     Number of threads that did an allocation.
              
  refills: [tt] max: [mm]

     Total number of TLAB refills by all threads, and
     maximun number of TLAB refills by a single thread.

  slow allocs: [ss] max [mm]

     Total number of allocations done outside of a TLAB, and
     maximum number by a single thread.

  waste [percent] gc: [bytes] slow: [bytes] max: [mm] fast: [bytes] max: [mm]

     Percentage of eden that was wasted across all threads.
     Waste is the sum of three components:
        gc:   unused space in the current TLABs when scavenge starts.
        slow: sum of unused space in TLABs when they're retired to allocate a new ones.
        fast: the client system can allocate a TLAB with a fast allocator.
              This is the amount of waste via that method.

     For "slow" and "fast", the maximum value by a single thread is printed.


More detail with addition of Verbose flag.

-XX:+PrintTLAB -XX:+Verbose

  Using both: -XX:+PrintTLAB -XX:+Verbose will print the
  new tlab sizes for each thread when it is resized.  Resizing is
  only done at GC scavenges.

  Example:

TLAB new size: thread: 0x001eac00 [id: 19] refills 50  alloc: 0.402570 size: 19684 -> 18996

  New size 18996.  Previous size was: 19684.

  refills:

     Number of tlab refills for this thread.

  alloc:

     The expected fraction of eden this thread will use.
pageicon Monday Aug 14, 2006

Location, Location, Location

Allocating a bunch of objects together and then using them together seems to be a good thing for cache utilization. Not always true, of course, but, if you're willing to take that as a quasi-truth, then you might be interested in how garbage collections can stir up that good locality.

I'm going to talk first about our serial garbage collector. It is a generational garbage collector with a copying young generation collector and a mark-sweep-compact tenured generation collector. My comments generally apply to all our collectors but allow me to do the simplest one first. After that I'll comment on how doing the collection in parallel makes it even more interesting. By the way the techniques I'll be describing for the serial collector are pretty generic and are used by other garbage collectors.

When a application thread T allocates objects, the allocations are generally done out of the young generation. Those allocations done by T together (i.e., at about the same time) are allocated in close proximity in the heap. This is mostly true even though allocations are being done in parallel by different application threads (different T's). Each T allocates from its own thread local allocation buffer (TLAB). I described recent improvement to TLAB's in an earlier blog with the title "A Little Thread Privacy, Please". If you read the first part of that one, you'll get the idea. At this point after the allocations are done, we have the good locality we've come to expect from the allocate-together, use-together practice. Then a garbage collection happens.

When the young generation is collected, the surviving objects can be copied to either the survivor spaces or the tenured generation. The survivor spaces are areas in the young generation where young generation objects are kept if they survive a few collections. After surviving a few collections objects in the survivor spaces are copied (promoted) to the tenured generation where long lived objects are kept. Let's consider the very first collection and the case where all the objects that survive that first collection are copied to a survivor space.

The young generation collection starts by finding objects being used directly by the application. By "directly" I mean the application has a reference to those objects A. That's as opposed to having a reference to an object (A) that contains a reference to a second object (B). The garbage collector scans the stacks of each application thread and, when it finds a reference to an object A, it copies A to a survivor space. Incidentally, that's also when the reference to A in the thread's stack is updated to point to the new location of A. Only A is copied at this point. If A contains references to another object B, the object B is not touched. The garbage collector continues to scan the stacks of the threads until it's found and copied all the objects directly referenced by the application. At that point the garbage collector starts looking inside the objects A that it has copied. It's then that the garbage collector would find B in A and would copy B to a survivor space. And as before if B contains a reference to object C, nothing is yet done about C.

As an example, suppose an application thread T

* allocates object C

* allocates object B and puts a reference to C into B

* allocates object A and puts a reference to B into A

repeatedly. The layout of objects in the heap would look something like

... C1 B1 A1 C2 B2 A2 C3 B3 A3 ...

If thread T only kept a reference to the objects A (i.e., there are no reference directly to the B's or the C's on the thread stacks), then after the young generation collection the layout in the survivor space would look like

... A1 A2 A3 ... B1 B2 B3 ... C1 C2 C3 ...

where the (...) represent other objects X found directly from T's stack or objects found from those X's. Here I'm being just a little generous in grouping the A's, B's, and C's together. That grouping assumes that A1 is found first, then A2 is found, and then A3 is found.

Before the collection loading A1 into a cache line could reasonably also load B1 and C1 into the same cache line. After the collection B1 and C1 are in a galaxy far, far away. The A's, B's and C's have been separated so that their allocation order is no longer reflected in their location in the heap. The basic damage has been done.

In the case where all the objects A, B, and C are promoted to the tenured generation, the layout in the tenured generation would be similar (just in the tenured generation instead of in a survivor space). In the case where some of the objects are copied to a survivor space and some are promoted, some of each of the A's, B's, and C's can end up in either a survivor space or the tenured generation. You might think that it would tend to look something like

(survivor space) ... A1 A2 A3 ...

(tenured generation) ... B1 B2 B3 ... C1 C2 C3 ...

It might. Or it might be a bit messier. The above example would result if there was room in the survivor spaces for the A's but not for the B's and C's. If the B's were particularly large (so that they did not fit into the space remaining in the survivor space) and the C's were particularly small, then you might see

(survivor space) ... A1 A2 A3 ... C1 C2 C3 ...

(tenured generation) ... B1 B2 B3 ...

If there are references in the application thread stack to the B's and C's as well as to the A's, and the collector found objects in the order

C2 B3 A1 A2 A3

then after the collection the object layout would look like

... C2 B3 A1 A2 A3 ... B1 B2 ... C1 C3 ...

The mixtures are as endless as the ways in which the application references its objects.

How does collecting in parallel affect things? Recall that the serial collector first looks at all the references found in the application thread stacks (in the simpliest case the A's) before it looks at references inside the A's (i.e., the B's). The parallel collector starts by looking at some of the A's, but then may look at some of the B's before all the A's have been found. A parallel GC thread copies an A and puts its new location on a stack. I'll refer to this stack as its grey stack so as not to confuse it with the stacks of the application threads. Each GC thread has its own grey stack. When its grey stack starts to get full, a GC thread will start taking the locations of the A's off its grey stack and looking inside them for B's to copy. It starts looking inside the A's so that its grey stack does not grow too large. In general we prefer to do some copying instead of having to malloc more space for the grey stack. Very quickly there are A's, B's and C's on the grey stacks and the order in which they are copied looks largely random from outside the garbage collector.

When a parallel GC thread promotes an object, it promotes it into its own distinct area (let me call them PLAB's) in the tenured generation. A GC thread that has run out of objects to copy (no more application thread stacks to scan and no more objects on its own grey stack) can steal objects from another GC thread's grey stack. So A1 may be copied by one GC thread into its PLAB while B1 (if it were stolen) may be copied by another GC thread into a different PLAB.

So collections can disrupt good object locality in a variety of different ways. We don't like that. We're thinking about ways to improve it.

Hope this was understandable. Please ask if it was not.


« August 2006 »
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
16
17
18
19
20
21
22
23
25
26
27
28
29
31
  
       
Today

Feeds

Search this blog

Links

Weblog menu

Today's referrers

Today's Page Hits: 32