Jon Masamitsu's Weblog
The Real Thing
Here's a real treat dished up by Ross K. who hangs out 3 offices down the hall. A while back I wrote a blog about thread local allocation buffers (TLAB's). The short comings of my blog have probably been annoying Ross so much that he's written this one for us. If there are any discrepancies between my earlier blog and this one, believe this one. Thanks, Ross.
A Thread Local Allocation Buffer (TLAB) is a region of Eden
that is used for allocation by a single thread. It enables
a thread to do object allocation using thread local top and
limit pointers, which is faster than doing an atomic operation
on a top pointer that is shared across threads.
A thread acquires a TLAB at it's first object allocation
after a GC scavenge. The size of the TLAB is computed via
a somewhat complex process discribed below. The TLAB is
released when it is full (or nearly so), or the next GC scavenge
occurs. TLABs are allocated only in Eden, never from From-Space
or the OldGen.
Flags default description
UseTLAB true Use thread-local object allocation
ResizeTLAB false Dynamically resize tlab size for threads
TLABSize (below) Default (or starting) size of a TLAB (in bytes)
TLABWasteTargetPercent 1 Percentage of Eden that can be wasted
PrintTLAB false Print various TLAB related information
AggressiveHeap settings:
TLABSize 256Kb
ResizeTLAB true (Corrected 2007/05/09)
Minor flags
MinTLABSize 2*K Minimum allowed TLAB size (in bytes)
TLABAllocationWeight 35 Weight for exponential averaging of allocation
TLABRefillWasteFraction 64 Max TLAB waste at a refill (internal fragmentation)
TLABWasteIncrement 4 Increment allowed waste at slow allocation
ZeroTLAB false Zero out the newly created TLAB
These flags are for tuning the current implementation of
TLABs and maybe disappear or change their initial value in a
future release of the jvm.
If it is not specified on the command-line (or specified as zero)
via the -XX:TLABSize flag, the initial size of a TLAB is computed as:
init_size = size_of_eden / (allocating_thread_count * target_refills_per_epoch)
where:
a) Allocating_thread_count is the expected number of threads
which will be actively allocating during the next epoch
(an epoch is the mutator time between GC scavenges.)
At jvm startup this is defined to be one. It is then
recomputed at each GC scavenge from the number of threads
that did at least one allocation of a tlab during the
latest epoch. It's then exponentially averaged over the
past epochs.
b) Target_refills_per_epoch is the desired number of tlab
allocations per thread during an epoch. It is computed from
the value of TLABWasteTargetPercent which is the percentage of
Eden allowed to be wasted due to TLAB fragmentation.
From a mutator thread's perspective a GC scavenge can occur
unexpectedly at any time. So, on average, only half of a
thread's current TLAB will be allocated when a GC scavenge
occurs.
TLABWasteTargetPercent = 0.5 * (1/target_refills_per_epoch) * 100
Solving for target_refills_per_epoch:
target_refills_per_epoch = ( 0.5 * 100) / TLABWasteTargetPercent
With the default value of 1 for TLABWasteTargetPercent
target_refills_per_epoch = 50
When TLABResize is true (which it is by default) the tlab size
is recomputed for each thread that did an allocation in the latest
epoch. Threads that did not allocate in the latest epoch do not
have their TLABs resized. The resize goal is to get the number
of refills closer to the ideal: target_refills_per_epoch (default
value is 50). For each thread, the number of refills in the latest
epoch is exponentially averaged with values from previous
epochs. If this average refill number is greater than
target refills_per_epoch, then the tlab size is increased. If
the average is less, the tlab size is decreased.
The computation is (approximately):
new_size = (old_size * avg_refills_per_epoch) / target_refills_per_epoch
It's actually computed from the fraction of the latest epoch's
eden size used by this thread, because the next epoch may use a
resized eden.
To experiment with a specific TLAB size, two -XX flags need
to be set, one to define the initial size, and one to disable
the resizing:
-XX:TLABSize= -XX:-ResizeTLAB
The minimum size of a tlab is set with -XX:MinTLABSize which
defaults to 2K bytes. The maximum size is the maximum size
of an integer Java array, which is used to fill the unallocated
portion of a TLAB when a GC scavenge occurs.
Diagnostic Printing Options
-XX:+PrintTLAB
Prints at each scavenge one line for each thread (starts with "TLAB: gc thread: "
without the "'s) and one summary line.
Thread example:
TLAB: gc thread: 0x0004ac00 [id: 2] size: 61KB slow allocs: 5 refill waste: 980B
alloc: 0.99996 3072KB refills: 50 waste 0.1% gc: 0B slow: 4144B fast: 0B
The tag "gc" indicates that this information was printed at a GC scavenge,
after the tlabs have been filled. The "gc" tag doesn't mean a thread is a gc
thread.
Fields:
thread:
The address of the gc thread structure and it's system thread id.
size:
The size of the tlab in kilobytes.
slow allocs:
The number of allocations too large for remaining space in the TLAB.
The allocation was done directly in eden space.
refill waste: (in HeapWord units)
The name is truncated in the dump, and should be: refill_waste_limit
and is used to limit the amount of wasted space from internal
fragmentation. If the remaining space in the TLAB is larger than
this amount, and an allocation is requested that is too large to
be allocated in the TLAB, then the allocation is done directly in
Eden and the TLAB is not retired. If the remaining space is less
than refill_waste_limit then the TLAB is retired, a new TLAB is
allocated, and the object allocation is attempted in the new TLAB.
After each allocation outside of the TLAB, the refill_waste_limit
is incremented by TLABWasteIncrement to prevent an allocation of
a size slightly less than refill_waste_limit from continually
being allocated outside of the TLAB.
alloc: [fraction] [sizeInBytes]
Expected amount of eden allocated by this thread computed as
a fraction of eden and number of heap words.
refills:
Number of tlab refills.
waste [percent] gc: [bytes] slow: [bytes] fast: [bytes]
Percentage of eden allocated to this thread that was wasted.
Waste is the sum of three components:
gc: unused space in the current TLAB when stopped for a scavenge.
slow: sum of unused space in TLABs when they're retired to allocate a new one.
fast: the client system can allocate a TLAB with a fast allocator.
This is the amount of waste via that method.
Summary example:
TLAB totals: thrds: 1 refills: 50 max: 50 slow allocs: 5 max 5 waste: 0.1%
gc: 0B max: 0B slow: 4144B max: 4144B fast: 0B max: 0B
thrds:
Number of threads that did an allocation.
refills: [tt] max: [mm]
Total number of TLAB refills by all threads, and
maximun number of TLAB refills by a single thread.
slow allocs: [ss] max [mm]
Total number of allocations done outside of a TLAB, and
maximum number by a single thread.
waste [percent] gc: [bytes] slow: [bytes] max: [mm] fast: [bytes] max: [mm]
Percentage of eden that was wasted across all threads.
Waste is the sum of three components:
gc: unused space in the current TLABs when scavenge starts.
slow: sum of unused space in TLABs when they're retired to allocate a new ones.
fast: the client system can allocate a TLAB with a fast allocator.
This is the amount of waste via that method.
For "slow" and "fast", the maximum value by a single thread is printed.
More detail with addition of Verbose flag.
-XX:+PrintTLAB -XX:+Verbose
Using both: -XX:+PrintTLAB -XX:+Verbose will print the
new tlab sizes for each thread when it is resized. Resizing is
only done at GC scavenges.
Example:
TLAB new size: thread: 0x001eac00 [id: 19] refills 50 alloc: 0.402570 size: 19684 -> 18996
New size 18996. Previous size was: 19684.
refills:
Number of tlab refills for this thread.
alloc:
The expected fraction of eden this thread will use.
Posted at 08:18AM Aug 15, 2006 by jonthecollector in Java | Comments[2]
Location, Location, Location
Allocating a bunch of objects together and then using them together seems to be a good thing for cache utilization. Not always true, of course, but, if you're willing to take that as a quasi-truth, then you might be interested in how garbage collections can stir up that good locality.I'm going to talk first about our serial garbage collector. It is a generational garbage collector with a copying young generation collector and a mark-sweep-compact tenured generation collector. My comments generally apply to all our collectors but allow me to do the simplest one first. After that I'll comment on how doing the collection in parallel makes it even more interesting. By the way the techniques I'll be describing for the serial collector are pretty generic and are used by other garbage collectors.
When a application thread T allocates objects, the allocations are generally done out of the young generation. Those allocations done by T together (i.e., at about the same time) are allocated in close proximity in the heap. This is mostly true even though allocations are being done in parallel by different application threads (different T's). Each T allocates from its own thread local allocation buffer (TLAB). I described recent improvement to TLAB's in an earlier blog with the title "A Little Thread Privacy, Please". If you read the first part of that one, you'll get the idea. At this point after the allocations are done, we have the good locality we've come to expect from the allocate-together, use-together practice. Then a garbage collection happens.
When the young generation is collected, the surviving objects can be copied to either the survivor spaces or the tenured generation. The survivor spaces are areas in the young generation where young generation objects are kept if they survive a few collections. After surviving a few collections objects in the survivor spaces are copied (promoted) to the tenured generation where long lived objects are kept. Let's consider the very first collection and the case where all the objects that survive that first collection are copied to a survivor space.
The young generation collection starts by finding objects being used directly by the application. By "directly" I mean the application has a reference to those objects A. That's as opposed to having a reference to an object (A) that contains a reference to a second object (B). The garbage collector scans the stacks of each application thread and, when it finds a reference to an object A, it copies A to a survivor space. Incidentally, that's also when the reference to A in the thread's stack is updated to point to the new location of A. Only A is copied at this point. If A contains references to another object B, the object B is not touched. The garbage collector continues to scan the stacks of the threads until it's found and copied all the objects directly referenced by the application. At that point the garbage collector starts looking inside the objects A that it has copied. It's then that the garbage collector would find B in A and would copy B to a survivor space. And as before if B contains a reference to object C, nothing is yet done about C.
As an example, suppose an application thread T
* allocates object C
* allocates object B and puts a reference to C into B
* allocates object A and puts a reference to B into A
repeatedly. The layout of objects in the heap would look something like
... C1 B1 A1 C2 B2 A2 C3 B3 A3 ...
If thread T only kept a reference to the objects A (i.e., there are no reference directly to the B's or the C's on the thread stacks), then after the young generation collection the layout in the survivor space would look like
... A1 A2 A3 ... B1 B2 B3 ... C1 C2 C3 ...
where the (...) represent other objects X found directly from T's stack or objects found from those X's. Here I'm being just a little generous in grouping the A's, B's, and C's together. That grouping assumes that A1 is found first, then A2 is found, and then A3 is found.
Before the collection loading A1 into a cache line could reasonably also load B1 and C1 into the same cache line. After the collection B1 and C1 are in a galaxy far, far away. The A's, B's and C's have been separated so that their allocation order is no longer reflected in their location in the heap. The basic damage has been done.
In the case where all the objects A, B, and C are promoted to the tenured generation, the layout in the tenured generation would be similar (just in the tenured generation instead of in a survivor space). In the case where some of the objects are copied to a survivor space and some are promoted, some of each of the A's, B's, and C's can end up in either a survivor space or the tenured generation. You might think that it would tend to look something like
(survivor space) ... A1 A2 A3 ...
(tenured generation) ... B1 B2 B3 ... C1 C2 C3 ...
It might. Or it might be a bit messier. The above example would result if there was room in the survivor spaces for the A's but not for the B's and C's. If the B's were particularly large (so that they did not fit into the space remaining in the survivor space) and the C's were particularly small, then you might see
(survivor space) ... A1 A2 A3 ... C1 C2 C3 ...
(tenured generation) ... B1 B2 B3 ...
If there are references in the application thread stack to the B's and C's as well as to the A's, and the collector found objects in the order
C2 B3 A1 A2 A3
then after the collection the object layout would look like
... C2 B3 A1 A2 A3 ... B1 B2 ... C1 C3 ...
The mixtures are as endless as the ways in which the application references its objects.
How does collecting in parallel affect things? Recall that the serial collector first looks at all the references found in the application thread stacks (in the simpliest case the A's) before it looks at references inside the A's (i.e., the B's). The parallel collector starts by looking at some of the A's, but then may look at some of the B's before all the A's have been found. A parallel GC thread copies an A and puts its new location on a stack. I'll refer to this stack as its grey stack so as not to confuse it with the stacks of the application threads. Each GC thread has its own grey stack. When its grey stack starts to get full, a GC thread will start taking the locations of the A's off its grey stack and looking inside them for B's to copy. It starts looking inside the A's so that its grey stack does not grow too large. In general we prefer to do some copying instead of having to malloc more space for the grey stack. Very quickly there are A's, B's and C's on the grey stacks and the order in which they are copied looks largely random from outside the garbage collector.
When a parallel GC thread promotes an object, it promotes it into its own distinct area (let me call them PLAB's) in the tenured generation. A GC thread that has run out of objects to copy (no more application thread stacks to scan and no more objects on its own grey stack) can steal objects from another GC thread's grey stack. So A1 may be copied by one GC thread into its PLAB while B1 (if it were stolen) may be copied by another GC thread into a different PLAB.
So collections can disrupt good object locality in a variety of different ways. We don't like that. We're thinking about ways to improve it.
Hope this was understandable. Please ask if it was not.
Posted at 10:46PM Aug 14, 2006 by jonthecollector in Java | Comments[4]
Top 10 GC Reasons
to upgrade to J2SE 5.010) We fixed some bugs.
9) Thread local allocation buffer dynamic tuning. (see "A Little Thread Privacy Please")
8) Promotion failure handling was implemented for the low pause collector.
Promotion failure handling is the ability to start a minor collection and then backout of it if there is not enough space in the tenured generation to promote all the objects that need to be promoted.
In 5.0 the "young generation guarantee" need not apply for the low pause collector because we implemented promotion failure handling for that collector. The net effect of having promotion failure handling is that more minor collections can be done before a major collection is needed. Since minor collections are typically more efficient than major collections, it's better. The "young generation guarantee" is discussed in the document below.
http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html
It was particularly harsh on the low pause collector because the "young generation guarantee" required one contiguous chunk of space in the tenured generation. For the low pause collector that requirement could be difficult to satisfy because the low pause collector does not compact the tenured generation and so is subject to fragmentation.
7) Parallel remark in the low pause collector.
For the low pause collector there are two stop-the-world pauses associated with each collection of the tenured generation: the initial marking pause and the remarking pause. The latter is typically much longer than the former. In 5.0 the remarking is done with multiple threads in order to shorten it.
6) Parallel reference processing in the low pause collector.
For an application that uses Reference objects (see)
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/ref/Reference.html
extensively, the GC work to process the Reference objects can be noticeable. It's not necessarily worse in the low pause collector than in the other collects, but it hurts more (because we're trying to keep the pauses low). Parallel reference processing is available for the low pause collector but is not on by default. Unless there are tons of Reference Objects, doing the reference processing serially is usually faster. Turn it on with the flag -XX:+ParallelRefProcEnabled if you make extensive use of Reference Objects (most applications don't).
5) Server class machine defaults.
On server class machines (machines with 2 or more cpus and 2G or more of physical memory) the default compiler, garbage collector and maximum heap size are different. Before 5.0 it was
* -client compiler
* serial garbage collector
* 64M maximum heap (on most 32 bit platforms)
With 5.0 the defaults for server class machines are
* -server compiler
* throughput garbage collector
* maximum heap size the lesser of 1/4 of physical memory or 1G
For the smaller machines the defaults did not change.
The intent of this change was to provide better default performance on large machines that are typically dedicated to running 1 or a few applications and for which throughput is important. Desktops for which interactivity was expected to be more important were excluded from being server class machines.
See
http://java.sun.com/docs/hotspot/gc5.0/ergo5.html
for more details.
4) Low pause collector schedules remarking between minor collections.
Recall that the pause due to object remarking is the longer of the two stop-the-world pauses. Before 5.0 it could happen that a remark pause would occur immediately after a minor pause. When this back-to-back minor pause and remark pause occurred it looked like one big fat pause. With 5.0 the remarking is scheduled so as to be about mid way between two minor pauses.
3) Better scheduling of the start of a concurrent low pause collection.
Prior to 5.0 a low pause collector started a collection based on the rate of allocation and the amount of free space in the tenured generation. It did the calculation and started a concurrent collection so as to finish before the tenured generation ran dry. This was good, but there were some end cases that needed to be recognized. For example, if the tenured generation had to be expanded in order to support promotions for a minor collection that just finished, a concurrent collection was started right away. Or if the next minor collection might not succeeded because of lack of free space in the tenured generation, then a concurrent collection was started. That latter example wouldn't happen if we could perfectly predict the rate at which the tenured generation is filling up. We're not perfect. Also a bump in the allocation rate might mess us up.
2) GC ergonomics (See "It's Not Magic" at the start of my blogs)
And the number 1 GC reason for upgrading to 5.0 is
1) ???
You can tell me what you've found to be a good reason to upgrade to 5.0 in a comment to this blog. Don't be shy. And thanks for any responses.
Posted at 03:57PM May 03, 2006 by jonthecollector in Java | Comments[1]
What the Heck's a Concurrent Mode?
and why does it fail?If you use the low pause collector, have you ever seen a message that contained the phase "concurrent mode failure" such as this?
This is from a 6.0 (still under development) JDK but the same type of message can come out of a 5.0 JDK.
Recall that the low pause collector is a mostly concurrent collector: parts of the collection are done while the application is still running. The message "concurrent mode failure" signifies that the concurrent collection of the tenured generation did not finish before the tenured generation became full. Recall also that a concurrent collection attempts to start just-in-time to finish before the tenured generations becomes full. The low pause collector measures the rate at which the the tenured generation is filling and the expected amount of time until the next collection and starts a concurrent collection so that it finished just-in-time (JIT). Three things to note in that last sentence. The "rate" at which the tenured generation is filling is based on historical data as is the "expected amount of time" until the next collection. Either of those might incorrectly predict the future. Also the JIT is really JIT plus some amount of padding so as to get it right most of the time. When a concurrent mode failure happens, the low pause collector does a stop-the-world (STW) collection. All the application threads are stopped, a different algorithm is used to collect the tenured generation (our particular flavor of a mark-sweep-compact), the applications threads are started again, and life goes on. Except that the STW collection is not very low pause and there's the rub.
At this point if you're asking "why does this happen", you've come to the right blog. There are several possibilites.
The amount of live data that is in the tenured generation is too large. Specifically, there is not enough free space in the tenured generation to support the rate of allocation into the tenured generation. For an example in the extreme if there are only 2 words of free space in the tenured generation after a collection of the tenured generation, chances are those 2 words will be exhausted before another concurrent collection of the tenured generation can be done. If you are seeing lots of concurrent mode failures, chances are your heap is too small.
Your application can change behaviors dramatically such that past behavior does not adequately predict future performance. If this is the problem you'll see the concurrent mode failures only near the change in behavior. After a few more collections the low pause collector adjusts its expectations to make better decisions. But to deal with the concurrent mode failures in the mean time, you'll usually be trading off better performance. You can tell the low pause collector to start a collection sooner. The flag -XX:CMSInitiatingOccupancyFraction=NN will cause a concurrent collection to start when NN percent of the tenured generation is full. If you use this option to deal with the concurrent mode failures that result from a change in the behavior of your application, much of the time (when the applications behavior is more steady state) you'll be starting collection too early and so doing more collections than necessary. If you set NN to 0, it will cause one concurrent collection to be followed as soon as possible by another. The next collection may not start immediately after the last because the check on when to start a collection is done only at particular points in the code, but the collection will start at the next opportunity.
Your application may not have dramatic changes in behavior, but if it has a large variance in allocation rates, that can cause the JIT GC to not be JIT. You can add some more padding to the time at which a concurrent collection kicks off by using the flag -XX:CMSIncrementalSafetyFactor=NN. The default value for NN is 10 (i.e., a 10% padding on the start of the concurrent collection). Increasing NN to 100 starts a concurrent collection at the next opportunity.
Posted at 09:15AM Apr 13, 2006 by jonthecollector in Java | Comments[1]
The Fault with Defaults
During early work on the low pause collector some of the applications we used for benchmarking the collector had the characteristic that objects had very short lifetimes or relatively long lifetimes with not much in between. All our current collectors use a copying collection for the young generation. This copying collection will copy the younger objects back into the young generation and copy the older objects into the tenured generation. The expectation is that more of the younger objects will die in the young generation if they are give a little more time. A consequence of this is that long lived objects are copied repeatedly into the young generation before finally being promoted into the tenured generation. Depending on the mix of lifetimes of your objects, this can be good (more objects die in the young generation) or this can be bad (extra copying into the young generation). Additionally at about the same time we were discussing the policy of never doing the copying but rather always copying any objects that survived a young generation collection immediately into the tenured generation. There are some collectors that do that and there was some anecdotal evidence that it was a good strategy. Given some example applications where never copying back into the young generations seemed to be the right strategy and the general discussion about what was the right policy for copying, we decided to make the default behavior for the low pause collector to always promote objects surviving a young generation collection into the tenured generation.This default policy still seems reasonable for some applications but the old adage about one size not fitting all certainly applies here. So here's what you should consider and what you can do.
The young generation is acting like a filter to dispose of the shorter lived objects. Any objects that get through the filter are moved to the tenured generation and are henceforth handled by the mostly concurrent collections (which I'll just call the concurrent collections from here on out). If the concurrent collections are starting too often to suit your tastes, one possibility is to filter out more of the short lived objects by doing some copying during the young generation collections. This may slow down the growth in the number of objects in the tenured generation and lessen the need for too frequent concurrent collections. To do this try the command line flags
-XX:MaxTenuringThreshold=15 -XX:SurvivorRatio=8
MaxTenuringThreshold is the number of collections that an object must survive before being promoted into the tenured generation. When objects are copied back into the young generation during a young generation collection, the objects are copied into an part of the young generation that is referred to as a survivor space. SurvivorRatio set to 8 will mean that the survivor spaces are about 10% of the size of the young generation. The survivor spaces and SurvivorRatio are described in the 5.0 tuning guide that can be found under
http://java.sun.com/docs/hotspot
The FAQ attached to the 5.0 tuning guide has an entry about experimenting with MaxTenuringThreshold.
By the way having concurrent collections starting "too often" can also be a sign that the tenured generation is smaller than is needed for good performance. You can always try increasing the size of the heap to reduce the frequency of the concurrent collections. The suggestions above may be able to get you the less frequent concurrent collections without the longer concurrent collections that typically come when the heap is increased.
The default setting to always promote objects that survive a young generation collection is only the default for the low-pause collector. The other collectors by default will do some copying before promoting an object to the tenured generation.
Posted at 07:47AM Apr 04, 2006 by jonthecollector in Java | Comments[2]
When the Sum of the Parts
doesn't equal a big enough hole.Did I mention that the low pause collector maintains free lists for the space available in the tenured generation and that fragmentation can become a problem? If you're using the low pause collector and things are going just peachy for days and days and then there is a huge (relatively speaking) pause, the cause may be fragmentation in the tenured generation.
In 1.4.2 and older releases in order to do a young generation collection there was a requirement that there be a contiguous chunk of free space in the tenured generation that was big enough to hold all the the young generation. In the GC tuning documents at
http://java.sun.com/docs/hotspot/
this is referred to as the young generation guarantee. Basically during a young generation collection, any data that survives may have to be promoted into the tenured generation and we just don't know how much is going to survive. Being our usual conservative selves we assumed all of it would survive and so there needed to be room in the tenured generation for all of it. How does this cause a big pause? If the young generation is full and needs to be collected but there is not enough room in the tenured generation, then a full collection of both the young generation and the tenured generations are done. And this collection is a stop-the-world collection not a concurrent collection so you generally see a pause much longer than you want to. By the way this full collection is also a compacting collection so there is no fragmentation at the end of the full collection.
In 5.0 we added the ability in the low pause collector to start a young generation collection and then to back out of it if there was not enough space in the tenured generation. Being able to backout of a young generation collection allowed us to make a couple of changes. We now keep an average of the amount of space that is used for promotions and use that (with some appropriate padding to be on the safe side) as the requirement for the space needed in the tenured generation. Additionally we no longer need a single contiguous chunk of space for the promotions so we look at the total amount of free space in the tenured generation in deciding if we can do a young generation collection. Not having to have a single contiguous chunk of space to support promotions is where fragmentation comes in (or rather where it doesn't come in as often). Yes, sometimes using the averages for the amount promoted and the total amount of free in the tenured generation tells us to go ahead and do a young generation collection and we get surprised (there really isn't enough space in tenured generation). In that situation we have to back out of the young generation collection. It's expensive to back out of a collection, but it's doable. That's a very long way of saying that fragmentation is less of a problem in 5.0. It still occurs, but we have better ways of dealling with it.
What should you do if you run into a fragmentation problem?
Try 5.0.
Or you could try a larger total heap and/or smaller young generation. If your application is on the edge, it might give you just enough extra space to fit all your live data. But often it just delays the problem.
Or you can try to make you application do a full, compacting collection at a time which will not disturb your users. If your application can go for a day without hitting a fragmentation problem, try a System.gc() in the middle of the night. That will compact the heap and you can hopefully go another day without hitting the fragmentation problem. Clearly no help for an application that does not have a logical "middle of the night".
Or if by chance most of the data in the tenured generation is read in when your application first starts up and you can do a System.gc() after you complete initialization, that might help by compacting all data into a single chunk leaving the rest of the tenured generation available for promotions. Depending on the allocation pattern of the application, that might be adequate.
Or you might want to start the
concurrent collections earlier. The low pause collector tries to start a concurrent
collection just in time (with some safety factor) to collect the
tenured generation before it is full. If you are doing concurrent
collections and freeing enough space, you can try starting a concurrent collection sooner so that
it finishes before the fragmentation becomes a problem.
The concurrent collections don't do a compaction, but they do
coalese adjacent free blocks so larger chunks of free space
can result from a concurrent collection.
One of the triggers for starting a concurrent collection is the amount
of free space in the tenured generation.
You can cause a concurrent collection to
occur early by setting the
option -XX:CMSInitiatingOccupancyFraction=
By the way, I've increased the comment period for my blogs. I hadn't realized it was so short.
Posted at
09:47AM Mar 06, 2006
by jonthecollector in Java |
Comments[4]
So What's It Worth to You?
Have you ever gone to your favorite restaurant, had a great meal and sat back with your second cup of coffee and asked "Why can't those guys do a pausless garbage collector? How hard can it be?"Well, you're not the only one. That question came up around here recently.
There are techniques for doing pauseless garbage collection. The problem is that it costs in terms of application performance or additional Java heap size. Threads in an application read and write objects. The garbage collector reads and writes (i.e., frees or moves) objects. All those reads and writes have to have some coordination. "Pauseless" GC very often means that the application as a whole is not paused, but individual threads are paused one at a time. Imagine two application threads that are furiously playing hot potato with an object. As soon as thread 1 passes the object A to thread 2, thread 1 forgets about A and thread 2 realizes it has A. Then thread 2 immediately passes it back to thread 1 with the analogous result. A pauseless collector pauses thread 1 to look at what objects it is holding, then restarts thread 1 (so only one thread is paused at a time) and pauses thread 2 to examine it. Did object A get passed from thread 2 to thread 1 in that window during which both were executing? If it did and that is the only reference to object A, then the collector will not know that object A is live without doing more work.
I'm going to skip the details here. The point of this note is not to delve into the particular difficulties of pauseless GC, but rather to give you an idea of why it can be more costly.
Also I'm going to consider collectors that compact objects (i.e., move them). Mostly because it's an easier example to talk about. Some collectors don't move objects during a collections. Not moving objects simplifies some aspects of a collection but has its own complications (e.g., maintaining free lists of objects, splitting and coalescing blocks of free space, and fragmentation).
If the collector does compact the objects, it has to be sure that all references to live objects point to their new locations. A common way to catch object A when it has slipped through the window is to use what is fondly referred to as a read barrier. Any read of an object by the application goes through code (the read barrier) that looks at the object and decides whether something special has to be done. The "something special" depends on the particulars of the collector. Without getting into those particulars just having to do extra stuff on every read has got to hurt. Yes, I'm really over simplifying this example, but what can you expect from my really simple mind.
So what's pauseless GC worth to you (i.e., in terms of the extra space and performance costs, the extra complexity, maybe special hardware)? It's definitely worth a bunch to some. But for many it isn't really necessary to have truly pauseless GC. Shorter is good enough.
Why can't those guys do a pauseless garbage collector?
Would "pauseless garbage collection" give us the biggest bang for the buck (i.e., most benefits to most user)? When we were deciding what to do for the next release, we asked ourselves that question. And we decided there were things we should do before "pauseless GC". Did I mention that I recently worked on the parallel collection of the tenured generation for the throughput collector? I'll tell you about it some time.
Posted at 02:08PM Feb 15, 2006 by jonthecollector in Java | Comments[4]
Tell Us Where It Hurts
Here's a URL for asking GC questions.http://java.sun.com/docs/forms/gc-sendusmail.html
Engineers in the GC group answer the questions so that's probably as good as it gets.
Updated 2008/2/20
The link above no longer takes you to the form for asking GC related questions. I'm leaving this information above just to save a little bit of confusion (i.e., you won't find yourself saying "Didn't this page used to ..."). The new channel for asking questions is explained below.
Questions for the Hotspot[tm] JVM[tm] garbage collector group should be sent to the hotspot-gc-use mailing list. This is a mailing list that is open to the public. Please see the link below for information on subscribing to that mailing list.
http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
If you have questions which you would prefer not be answered in public, please work through you Sun support contact.
Information on Sun support can be found at
http://developers.sun.com/services/
Posted at 09:35AM Feb 01, 2006 by jonthecollector in Java |
Why not a Grand Unified Garbage Collector?
At last count we have three garbage collectors.
Why is there more than one? Actually, the usual answer applies. Specialization often results in better performance. If you're interested in more particulars about our garbage collectors, read on.
All three collectors are generational (young generation and tenured generation).
Let's do the easy comparison first, why a parallel collector and a serial collector. Parallelism has overhead. Nuf said? Yeah, I used to read comic books when I was a kid. If you don't understand that reference, ignore it. I'm just older.
As you might infer from the names, the serial collector uses 1 thread to do the GC work and the parallel collector uses multiple threads to do the same. As usual multiple threads doing the same tasks have to synchronize. That's pretty much it. On a single cpu machine the additional cost of the synchronization means that the parallel collector is slower than the serial collector. On a two cpu machine and a VM that has a small heap the parallel collector is as about fast as the serial collector. With two cpu's and large heaps the parallel collector will usually do better. We keep asking ourselves if we can get rid of the serial collector and use the parallel collector in its place. The answer so far keeps coming back no.
More interesting is the case of the low pause collector versus the parallel collector. Above I made the remark about specialization and better performance. This is actually a case of more complexity and lesser performance. These two collectors do the collection of the young generation using almost the exact same techniques The differences in the collectors have to do with the collections of the tenured generation. The low pause collector does parts of that collection while the application continues to run. One way to do that is to not move the live objects when collecting the dead objects. The application tends to get confused if the objects it is using move around while the application is running. The other two collectors compact the heap during a collection of the tenured generation (i.e., live objects are moved so as to occupy one contiguous region of the heap). The low pause collector collects the dead objects and coalesces their space into blocks that are kept in free lists. Maintaining free lists and doing allocations from them takes effort so it's slower than having a heap that is compacted. Having applications run while a collection is happening means that new objects can be allocated during a collection. That leads to so more complexity. Also the collection of the tenured generation can be interrupted for a collection of the young generation. More complexity still. The bottom line is that the low pause collector has shorter GC pauses but it costs performance. That performance difference is not huge but it's large enough to keep us from ditching the parallel collector and always using the low pause collector.
And last but not least, can we replace the serial collector with the low pause collector? Very tempting. The serial collector is used by default on desktop machines. We expect those to have 1 or 2 cpus and to be running applications that need 10's of megabytes of Java heap as opposed to 100's of megabytes. With small heaps the differences in collection times tend to make less difference. Even if the low pause collector was 10% slower than the serial collector, the difference between, for example, 70 ms and 77 ms often isn't large enough to matter. It would probably be a done deal except that the low pause collector has a larger memory footprint. It has additional data structures that it uses (for example to keep track of what references are being changed by the the running application while a collection is on going). It also usually needs a larger Java heap to run an application. Recall that the low pause collector uses free lists to keep track of the available space in the heap. Fragmentation can become a problem. The best bet is that we'll replace the serial collector with the low pause collector some day but not just yet.
Posted at 05:23PM Jan 26, 2006 by jonthecollector in Java |
A Little Thread Privacy Please
This is not really about garbage collection but hopefully you'll find it interesting. And there is no punch line (i.e., cool command line flag that I can suggest to make your application run better). It's just a story about letting the VM do it.If you're using multiple threads (or one of the Java libraries is using multiple threads in your behalf) then threads in your application are doing allocations concurrently. All the threads are allocating from the same heap so some allowances have to be made for simultaneous allocations by 2 or more threads. In the simplest case (other than the case where no allowances are made and which is just plain wrong) each thread would grab a lock, do the allocation and release the lock. That gives the right answer but is just too slow. The slightly more complicated means would be to use some atomic operation (such as compare-and-swap) plus a little logic to safely do the allocation. Faster but still too slow. What is commonly done is to give each thread a buffer that is used exclusively by that thread to do allocations. You have to use some synchronization to allocate the buffer from the heap, but after that the thread can allocate from the buffer without synchronization. In the hotspot JVM we refer to these as thread local allocation buffers (TLAB's). They work well. But how large should the TLAB's be?
Prior to 5.0 that was a difficult question to answer. TLAB's that were too large wasted space. If you had tons of threads and large TLAB's, you could conceivably fill up the heap with buffers that were mostly unused. Creating more threads might force a collection which would be unfortunate because the heap was mostly empty. TLAB's that were too small would fill quickly and would mean having to get a new TLAB which would require some form of synchronization. There was not a general recommendation that we could make on how large TLAB's should be. Yup, we were reduced to trial-and-error.
Starting with 5.0 living large with TLAB's got much simpler - except for the guy down the hall that did the implementation. Here's what the VM does for you.
Each thread starts with a small TLAB. Between the end of the last young generation collection and the start of the next (let me call that period an epoch), we keep track of the number of TLAB's a thread has used. We also know the size of the TLAB's for each thread. Averages for each thread are maintained for these two numbers (number and size of TLAB's). These averages are weighted toward the most recent epoch. Based on these averages the sizes of the TLAB's are adjusted so that a thread gets 50 TLAB's during a epoch. Why 50? All things being equal we figured that a thread would have used half its TLAB by the end of the epoch. Per thread that gave us 1/2 a TLAB not used (out of the magic 50) for a wastage of 1%. That seemed acceptable. Also if the young generation was not large enough to provide the desired TLAB's, the size of the young generation would be increased in order to make it so (within the usual heap size constraints, of course).
The initial TLAB size is calculated from the number of threads doing allocation and the size of the heap. More threads pushes down the initial size of the TLAB's and larger heaps push up the initial size.
An allocation that can not be made from a TLAB does not always mean that the thread has to get a new TLAB. Depending on the size of the allocation and the unused space remaining in the TLAB, the VM could decide to just do the allocation from the heap. That allocation from the heap would require synchronization but so would getting a new TLAB. If the allocation was considered large (some significant fraction of the current TLAB size), the allocation would always be done out of the heap. This cut down on wastage and gracefully handled the much-larger-than-average allocation.
Posted at 02:14PM Jan 13, 2006 by jonthecollector in Java |
Since It's Not Magic
So by now you've tried GC ergonomics and things are working pretty well. If you didn't have any GC related command line flags, your reactions could run the gamut from the "big yawn" to an eye-popping "that's cool". The "big yawn" means that GC probably doesn't matter for your application. The "that's cool" means that GC matters but nobody ever told you. If you did have a bunch of GC related command line flags and you threw them out in order to try GC ergonomics, then 80% of you are happy and just glad to forget those command line flags. But 20% of you are thinking "Hmmm, not quite as good". So what else do you need to know about GC ergonomics? By the way I invoke the 80/20 rules quite often, especially when I have no idea what the real numbers are.
By default ergonomics gives you a larger maximum heap size. "Larger" is relative to the default maximum heap size you got prior to ergonomics in 5.0. This larger maximum heap size depends on the amount of physical memory on the system (the more physical memory you have, the larger the maximum heap size you get), but it is capped at 1G. If you had a command line flag that sets a larger maximum heap size, put it back. Ergonomics just doesn't know you're in the 20% that needs a larger heap. If you set a larger maximum heap size, ergonomics will still grow and shrink the generations for better performance but now will have more space to play with. By the way if you set a smaller maximum heap size, ergonomics still does its stuff but just within a smaller heap.
Ergomonics has a default for the relative sizes of the young generation and the tenured generation in the heap. If you have command line flags to specify the sizes of the generations (probably something like -XX:NewRatio=ratio or -XX:MaxNewSize=bytes), use them. GC ergonomics normally does not change the maximum size of a generation so if you know something about how your application runs such that you know it benefits from a larger or smaller young generation, pass that information along on the command line. Again GC ergonomics just works within the limits of the heap it's given.
Posted at 01:29PM Jan 02, 2006 by jonthecollector in Java |
When Are You Out of Memory?
You know why you get an out-of-memory exception, right? Your live data exceeds the space available in the Java heap. Well, that's very nearly, always right. Very, very nearly.If the Java heap is barely large enough to hold all the live data, the JVM could be doing almost continual garbage collections. For example if 98% of the data in the heap is live, then there is only 2% that is available for new objects. If the application is using that 2% for temporary objects, it can seem to be humming along quite nicely, but not getting much work done. How can that be? Well the application runs until it has allocated that 2% and then a garbage collection happens and recovers that 2%. The application runs along happily allocating and the garbage collector runs along respectfully collecting. Over and over and over. The application will be making forward progress but maybe oh so slowly. Are you out of memory?
Back in the 1.4.1 days a customer noticed this type of behavior and asked for help in detecting that bad situation. In 1.4.2 the throughput collector started throwing an out-of-memory exception if the VM was spending the vast majority of its time doing garbage collection and not recovering very much space in the Java heap. In 5.0 the implementation was changed some, but the idea was the same. If you are spending way too much time doing garbage collections, you're going to get an out-of-memory. Interestingly enough this identified at least one case in our own usage of Java applications where we were spending most of our time doing garbage collection. We were happy to find it.
Why do I bring this up? Well, mostly because it was brought up in our GC meeting this morning. If you're in this situation of spending most of your time in garbage collection, I think you are out of memory and you need a bigger heap. If you don't think that, you can turn off this behavior with the command line flag -XX:-UseGCTimeLimit. May you never need it.
Posted at 03:00PM Dec 14, 2005 by jonthecollector in Java | Comments[1]
What Are We Thinking?
What's next for GC ergonomics?Just a friendly warning. This one verges on GC stream-of-consciousness ramblings.
GC ergonomics has been implemented so far in the throughput collector only. We've been thinking about how to extend it to the low pause collector. The low pause collector currently is implemented as a collector that does some of it's work while the application continues to run. It's described in
http://java.sun.com/docs/hotspot
Some of the policies we used in the throughput collector will also be useful for the low pause collector, but because the low pause collector can be running at the same time as the application, there are some intriguing differences. By the way the low pause collector does completely stop the application in order to do some parts of the collection so some of our experience with the throughput collector is directly applicable. On the other hand having this mix of behaviors can be interesting in and of itself.
When we were developing the low pause collector we decided that any parts of the collection that we could do while the application continued to run was good. It was free. If there are spare cycles on the machine, that's almost true. If there aren't spare cycles, then it can get fuzzy. If the collection steals cycles that the application could use, then there is a cost. Especially if there is only one processor on the machine. If there are more than one processor on the machine and I'm doing GC, am I stealing cycles from the application? If I steal cycles from another process on the machine, does it become free again? We've been thinking about how to assess the load on a machine and what we should do in different load situations. That type of information may turn out to be input for GC ergonomics.
Another aspect that we have to deal with is the connection between the young generation size and the tenured generation pause times (pauses in collecting the tenured generation, that is). When collecting the tenured generation, we need to be aware of objects in the young generation that can be referencing (and thus keeping alive) objects in the tenured generation. In fact we have to find those objects in the young generation. And the larger the young generation is the longer it takes to find those objects. With the throughput collector the times to collect the tenured generation is only distantly related to the size of the young generation. With the low pause collector the connection is stronger. If we're trying to meet a pause time goal for a pause that is part of the tenured generation collection, then maybe we should reduce the size of the young generation as well as reduce the size of the tenured generation. But maybe not.
With the throughput collector a collection is started when the application attempts to allocate an object and there is no room left in the Java heap. With the low pause collector we want the collection to finish before we run out of room in the Java heap. So when does the low pause collector start a collection of the tenured generation? Just In Time, hopefully. Starting too early means that some of the capacity of the tenured generation is not used. Starting too late makes the low pause collector not a low pause collector. In the 5.0 release we did some good work to measure how quickly the tenured generation was being filled and used that to decide when to start a collection. It's a nice self contained problem as long as we can start a collection early enough. But if we cannot start a collection in time then we probably need a larger tenured generation. So a failure to JIT/GC needs to feed into GC ergonomics decisions. Well, really we don't actually want to fail to JIT/GC before we expand the tenured generation so there's more to think about. But not right now.
Posted at 12:00PM Dec 06, 2005 by jonthecollector in Java |
What Were We Thinking?
There were some decisions made during the development of GC ergonomics that perhaps deserve some explanation. For example,
Why is the pause time goal satisfied first?
GC ergonomics tries to satisfy a pause time goal before considering any throughput goal. Why not the throughput goal first? I tried both ways with a variety of applications. As one might expect it was not black and white. In the end we chose to consider the goals in this order.
The pause time goal definitely has the potential for being the hardest goal to meet. It's dependence on heap size is complicated and trying to meet the pause time goal without the encumberances of either of the other goals was easier to think about. If we could meet the pause time goal, then increasing the heap to try and meet a throughput goal felt safer (i.e., the relationship between throughput and heap size is more linear so it was easier to understand how undoing an increase would get us back to where we started).
In retrospect it also seems more natural to have the pause time goal (which pushes heap size down) competing with the throughput goal (which pushes heap size up). And only then to have the throughput goal (which again pushes the heap size up) competing with the footprint goal (which, of course, pushes the heap size down).
A pause is a pause ...
We talked quite a bit about whether the pause time goal should apply to both the major and minor pause times. The issue was whether it would be effective to shrink the size of the old generation to reduce the major pause times. With a young generation collection you can shrink the heap more easily because there is always some place to put any live objects in the young generation (namely into the old generation). It was clear that reducing the young generation size would reduce the minor collection times (after you've paid the cost of getting the collection started and shutting it down). Well, that's true if you can ignore the fact that more frequent collections give objects less time to die. With the old generation it was much less obvious what would happen. The old generation can only be shunk down to a size big enough to hold all the live data in the old generation. Also the amount of free space in the old generation has an effect on the young generation collection in that young generation collection may need to copy objects into the old generation. In the end we decided that trying to limit both the major pauses and minor pauses with the pause time goal, while harder was more meaningful. Would you have accepted the excuse "Yes, we missed the goal but it was a major collection pause not a minor collection pause".
System.gc()'s. Just ignore them.
During development I initially tried to include the costs of System.gc()'s in the calculation of the averages used by GC ergonomics. In calculating the cost of collections the frequency of collections matters. If you are having collections more often then the cost of GC is higher. The strategy to reduce that cost is to increase the size of the heap so that collections are less frequent (i.e., since the heap is larger you can do more allocations before having to do another collection). The difficulty with System.gc()'s is that increasing the size of the heap does not in general increase the time between System.gc()'s. I tried to finesse the cost of a System.gc() by considering how full the heap was when the System.gc() happened and extrapolating to how long the interval between collections would have been. After some experimentation I found that picking how to do the extrapolation was basically picking the answer (i.e., what the GC cost would have been). I could tailor an extrapolation to fit one application, but invariably it did not fit some other applications. Basically it was too hard. So GC ergonomics ignores System.gc()'s.
Posted at 10:09AM Oct 24, 2005 by jonthecollector in Java |
Where Are the Sharp Edges?
In general GC ergonomics works best for an application that has reached a steady state behavior in terms of its allocation pattern. Or at least it is not changing its allocation pattern quickly. GC ergonomics measures the pause times and throughput of the application and changes the size of the heap based on those measurements.The measurements of pause time and throughput are kept in terms of a weighted average where (as one would expect) the most recent measurements are weighted more heavily. By using a weighted average GC ergonomics is not going to turn on a dime in response to a change in behavior by the the application, but it is also not going to go flying off in a wrong direction because of normal variations in behavior.
If past behavior is not a good indicator of future performance, then GC ergonomics can lag behind in its decision making. If a change is just an occasional bump in the road, GC ergonomics will catch up. If behavior is all over the map, well, what can I say.
The easiest way to get into trouble with GC ergonomics is to specify a pause time goal that is not reachable. Typically what happens is that GC ergonomics reduces the pause times by reducing the size of the heap. As the heap is shrunk the frequency of collections goes up and throughput goes down. GC ergonomics is willing to drive throughput to nearly zero (by doing collections nearly all the time) in order to reach the pause time goal. I tell people to run without a pause time goal initially and see how large the collection pauses get. That gives a baseline for experimenting with a pause time goal. Then have a little fun and try some pause times.
Another thing you should be aware of is that GC ergonomics is going to run at every collection. If your applications has settled into a stable steady state, GC ergonomics is still looking to see if anything is changing so it can adjust. It does cost you some cycles, but I don't think it's significant. Let me put it this way. I've never seen GC ergonomics code show up in a significant way on performance profiles. This is probably less a sharp edge than a mild poke in the ribs. If you think that your application is really not going to be changing its behavior after it has settled in and want those last few cycles, run with GC ergonomics until your application has reached its steady state and look to see how the heap is sized. You'll have to pay attention to how the generations are sized also. Then select those sizes on the command line and turn GC ergonomics off. At least for most of you that should be plenty good. If performances is not quit as high and you don't already know about survivor spaces, you may have to learn about them. The document "Tuning Garbage Collection with the 5.0 Java Virtual Machine" should help. It can be found under the URL below (same one as in "Magic"). If performances is actually better, rejoice and let me know how we can be doing better.
http://java.sun.com/docs/hotspot
Posted at 06:37AM Oct 05, 2005 by jonthecollector in Java |
Tuesday Aug 15, 2006