Jon Masamitsu's Weblog
Help for the NUMA Weary
When you think about running on a machine with a non-uniform memory architecture (NUMA), do you think, "Cool, some memory is really close"? Or do you think, "Why are my memory latency choices bad, worse and worst"? Or are you like me and try not to think about it at all?Well for all of the above, this option rocks.
Executive summary: There is a option which attempts to improve the performance of an application on a NUMA machine by increasing the application's use of lower latency memory. This is done by creating per cpu pools of memory in the young generation. A thread AA that runs on a cpu XX will get objects allocated out of the pool for XX. When cpu XX first touches a new page of memory, Solaris tries to assign to XX memory that is closer to XX. Additionally, if a thread AA has run on XX, Solaris tries to keep AA running on XX. When this all works as we hope, AA is accessing the memory that is closer to it more of the time. This option does not improve GC performance but improves the performance of application threads. I'll say something about support for non-Solaris platforms at the end.
So if you're asking yourself if you should care about NUMA, two examples of Sun servers with NUMA architectures are the Sun Sparc E6900 and the AMD Opteron X4600. Sun's early chip multithreading (CMT) boxes (T1000, T2000, T5120 and T5220) are not NUMA, but the later CMT T5140 and T5240 are.
In the above summary I used the term "per cpu pools" for brevity when I should really have used the term "per node pools" to be more precise. Nodes in this context have 1 or more cpu's, local memory and interconnect as a minimum. I'll try to be more precise below, but if you see node and think cpu, it's close enough.
On a NUMA system there is some memory that is closer (local) to the cpu's in a node and some memory that is farther away (remote). Local and remote is relative to a node. On Solaris 10 the OS tries to improve the performance of a thread AA executing on a node RR by increasing AA's use of memory local to RR. This is achieved by a "first touch" policy (my words, not a technical term). AA can make a call to get memory but physical memory is not committed to AA until the memory is used (first touched). When AA executing on RR first touches memory, Solaris tries to assign it memory that is local to RR. If there is no available local memory, AA will get memory remote from RR.
The UseNUMA feature takes advantage of this policy to get better locality for an application thread that allocates objects and then uses them (as opposed to an architecture where one thread allocates objects and some other thread uses them).
In JDK6 update 2 we added -XX:+UseNUMA in the throughput collector (UseParallelGC). When you turn this feature on the, JVM then divides the young generation into separate pools, 1 pool for each node. When AA allocates an object, the JVM looks to see what node AA is on and then allocates the object from the pool for that node. In the diagram AA is running on RR and RR has its pool in the young generation.
Combine this with a first touch policy and the memory in the pool for RR is first touched by a thread running on RR and so is likely to be local to RR. And as I mentioned above if AA has run on RR. Solaris will try to keep AA executing on RR. So best case is that you have AA accessing local memory most of the time. It may sound a bit like wishful thinking, but we've seen very nice performance improvements on some applications.
Contrast this with the allocation without per node pools. As a thread does allocations, it marches deeper and deeper into the young generation. A thread actually does allocations out of thread local buffers (TLAB's) but even so, the TLAB's for a thread are generally scattered throughout the young generation and it is even more wishful thinking to expect those TLAB's to all be mapped to local memory.
This part is extra credit and you don't really need to know about it to use UseNUMA. Solaris has the concept of locality groups or lgroups. You can read more about lgroups at
A node has an lgroup and within that lgroup are the resources that are closer to the node. There is actually a hierarchy of lgroups, but lets talk as if a node has an lgroup that has its closest resources (local resources) and the resources farther away are just someplace else (remote resources). Thread AA running on RR can ask what lgroup MM it is in and can ask if a page of memory is in MM. This type of information is used by the page scanner that I describe below.
There are a couple of caveats.
The young generation is divided into per node pools. If any of these pools are exhausted, a minor collection is done. That's potentially wasteful of memory so to ameliorate that, the sizes of the pools are adjusted dynamically so that threads that do more allocation get larger pools.
In situations where memory is tight and there are several processes running on the system, the per node pools can be a mixture of local and remote memory. That simply comes about when RR first touches a page in its pool and there is not local memory available. It just gets remote memory. To try to increase the amount of local memory in the pool, there is a scanner that looks to see if a page in a pool for RR is in the lgroup MM for RR. If the page is not, the scanner releases that page back to the OS in the hopes that, when AA on RR again first touches that page, the OS will allocate memory in MM for that page. Recall that eden in the young generation is usually empty after a minor collection so these pages can be released. The scanner also looks for small pages in the pool. On Solaris you can have a mixture of pages of different sizes in a pool and performance can be improved by using more large pages and fewer small pages. So the scanner also releases small pages in the hope that it will be allocated a large page the next time it uses the memory. This scanning is done after a collection and only scans a certain number of pages (NUMAPageScanRate) per collection so as to bound the amount of scanning done per collection.
To review, if you have
- Thread AA running on node RR and the JVM allocating objects for AA in the pool for RR.
- Solaris mapping memory for the pool for RR in the lgroup MM (i.e., local to RR) based on first touch.
- Solaris keeping thread AA running on node RR.
then your application will run faster. And all you have to do is turn on -XX:+UseNUMA.
An implementation on linux is in the works and will be in an upcoming update of jdk6. The API's are different for binding the per node pools to local memory (e.g., the JVM requests that pages be bound rather than relying on first touch) but you really don't need to know about any differences. Just turn it on. We've looked at an implementation for windows platforms and have not figured out how to do it yet.
If you would like to know a little more about dealing with NUMA machines, you might find this useful.
Increasing Application Performance on NUMA Architectures
Posted at 09:22AM May 19, 2008 by jonthecollector in General | Comments[2]
Our Collectors
I drew this diagram on a white board for some customers recently. They seemed to like it (or were just being very polite) so I thought I redraw it for your amusement.
Each blue box represents a collector that is used to collect a generation. The young generation is collected by the blue boxes in the yellow region and the tenured generation is collected by the blue boxes in the gray region.
Using the -XX flags for our collectors for jdk6,
FAQ
1) UseParNew and UseParallelGC both collect the young generation using
multiple GC threads. Which is faster?
There's no one correct answer for
this questions. Mostly they perform equally well, but I've seen one
do better than the other in different situations. If you want to use
GC ergonomics, it is only supported by UseParallelGC (and UseParallelOldGC)
so that's what you'll have to use.
2) Why doesn't "ParNew" and "Parallel Old" work together?
"ParNew" is written in
a style where each generation being collected offers certain interfaces for its
collection. For example, "ParNew" (and "Serial") implements
space_iterate() which will apply an operation to every object
in the young generation. When collecting the tenured generation with
either "CMS" or "Serial Old", the GC can use space_iterate() to
do some work on the objects in the young generation.
This makes the mix-and-match of collectors work but adds some burden
to the maintenance of the collectors and to the addition of new
collectors. And the burden seems to be quadratic in the number
of collectors.
Alternatively, "Parallel Scavenge"
(at least with its initial implementation before "Parallel Old")
always knew how the tenured generation was being collected and
could call directly into the code in the "Serial Old" collector.
"Parallel Old" is not written in the "ParNew" style so matching it with
"ParNew" doesn't just happen without significant work.
By the way, we would like to match "Parallel Scavenge" only with
"Parallel Old" eventually and clean up any of the ad hoc code needed
for "Parallel Scavenge" to work with both.
Please don't think too much about the examples I used above. They
are admittedly contrived and not worth your time.
3) How do I use "CMS" with "Serial"?
-XX:+UseConcMarkSweepGC -XX:-UseParNewGC.
Don't use -XX:+UseConcMarkSweepGC and -XX:+UseSerialGC. Although that's seems like
a logical combination, it will result in a message saying something about
conflicting collector combinations and the JVM won't start. Sorry about that.
Our bad.
4) Is the blue box with the "?" a typo?
That box represents the new garbage collector that we're currently developing called
Garbage First or G1 for short. G1 will provide
G1 straddles the young generation - tenured generation boundary because it is
a generational collector only in the logical sense. G1 divides the
heap into regions and during a GC can collect a subset of the regions.
It is logically generational because it dynamically selects a set of
regions to act as a young generation which will then be collected at
the next GC (as the young generation would be).
The user can specify a goal for the pauses and G1
will do an estimate (based on past collections) of how many
regions can be collected in that time (the pause goal).
That set of regions is called a collection set and G1 will
collect it during the next GC.
G1 can choose the regions with the most garbage to collect first (Garbage First, get it?)
so gets the biggest bang for the collection buck.
G1 compacts so fragmentation is much less a problem. Why is it a problem at all?
There can be internal fragmentation due to partially filled regions.
The heap is not statically divided into
a young generation and a tenured generation so the problem of
an imbalance in their sizes is not there.
Along with a pause time goal the user can specify a goal on the fraction of
time that can be spent on GC during some period (e.g., during the next 100 seconds
don't spend more than 10 seconds collecting). For such goals (10 seconds of
GC in a 100 second period) G1 can choose a collection set that it expects it can collect in 10 seconds and schedules the collection 90 seconds (or more) from the previous collection. You can see how an evil user could specify 0 collection
time in the next century so again, this is just a goal,
not a promise.
If G1 works out as we expect, it will become our low-pause collector in place of
"ParNew" + "CMS". And if you're about to ask when will it be ready, please don't
be offended by my dead silence. It's the highest priority project for our team,
but it is software development so there are the usual unknowns. It will be out
by JDK7. The sooner the better as far as we're concerned.
Updated February 4. Yes, I can edit an already posted blog. Here's
a reference to the G1 paper if you have ACM portal access.
http://portal.acm.org/citation.cfm?id=1029879
Posted at
03:51PM Feb 01, 2008
by jonthecollector in Java |
Comments[34]
GC Errata 2
I know that none of you are still making the transition from jdk1.4.2 to jdk5 (aka jdk1.5.0, yah, we changed the numbering) but if you were, I'd want you to know that you might need a larger permanent generation. In jdk5 class data sharing was implemented (the -Xshare option, see the "java -X" output) when the client VM (-client) is being used. The implementation of class data sharing split some of the information about classes into two parts (the part that can be shared and the part that is specific to an execution of the JVM). The space in the permanent generation for classes that can be shared increased by 10-20%. That doesn't mean that you need a permanent generation that is 10-20% larger! Only some of the classes are affected and only some of the data structures for those classes are affected. But if you're running a tight ship in terms of the size of permanent generation, you might be affected by this. Of course, none of you are still making the transition to jdk5.Some users have reported occasional long pauses which occur at the time of a GC but cannot be accounted for by GC times. Such long pauses may be a symptom of a known latency in the use of (Linux's) mprotect() which is used by the JVM while steering application threads to so-called safepoints. A known workaround at the JVM level to step around the Linux mprotect() issue is the use of JVM option -XX:+UseMembar.
The real fix, unfortunately, lies in the Linux kernel; see http://bugzilla.kernel.org/show_bug.cgi?id=5493. You might wish to pursue this with your Linux support engineer and see if a patch is available for the problem.
To avoid running into this bug, make sure there is plenty of physical memory free in your system so that the Linux pager does not start evicting pages for any process from memory to swap.
If you're seeing large variations in your minor GC pauses and are using UseParNewGC (which is on by default with the low pause collector), you might be running into 6433335. A large variation is a factor of a 100. This is a bug that has to do with large objects (or large free blocks) in the heap. If you turn off UseParNewGC and see larger pauses (by a multiplier that is on the order of the number of cpu's) but more regular pauses, then 6433335 is a likely candidate. It's been fixed in jdk 1.4.2 update 14, jdk 5 update 10 and jdk 6 update 1.
I've heard of more than one report recently of crashes with the low-pause collector that were caused by bug
6558100 - CMS crash following parallel work queue overflow
This bug is fixed in 1.4.2u18, 5.0u14 and 6u4. You can workaround the bug with the flag -XX:-ParallelRemarkEnabled. You could also run into this bug if you explicitly enable ParallelRefProcEnabled, so if you include -XX:+ParallelRefProcEnabled, remove it. ParallelRefProcEnabled is off by default so if you don't explicitly turn it on, don't worry about it.
Starting in jdk6 the biased locking optimization is "on" by default (command line option UseBiasedLocking). This optimization reduces the cost of uncontended locks by treating the thread that owns the lock preferentially. It's a nice optimization but could increase the memory usage of the JVM.
If you move from jdk5 to jdk6 and start seeing messages such as
java.lang.OutOfMemoryError: requested
you could try turning off biased locking and see if it helps. Improvements in
biased locking were made in 6u1 to make this much less of a problem.
Posted at
10:28AM Jan 10, 2008
by jonthecollector in Java |
Did You Know ...
These are a few esoteric factoids that I never expected users to need, but which have actually come up recently. Most of the text is just background information. If you already recognize the command line flags that I've bold'ed, you probably already know more than is good for you.ParallelCMSThreads
The low-pause collector (UseConcMarkSweepGC) does parts of the collection of the tenured generation concurrently with the execution of the application (i.e., not during a stop-the-world). There are principally two concurrent phases of the collection: the concurrent marking phase and the concurrent sweeping phase. In JDK 6 the concurrent marking phase can use more than 1 GC threads (uses parallelism as well as concurrency). This use of the parallelism is controlled by the command line flag CMSConcurrentMTEnabled. The number of threads used during a concurrent marking phase is ParallelCMSThreads. If it is not set on the command line it is calculated as
(ParallelGCThreads + 3)/4)
where ParallelGCThreads is the command line flag for setting the number of GC threads to be used in a stop-the-world parallel collection.
Where did this number come from? We added parallelism to the concurrent marking phase because we observed that a single GC thread doing concurrent marking could be overwhelmed by the allocations of many applications threads (i.e., while the concurrent marking was happening, lots of applications threads doing allocations could exhaust the heap before the concurrent marking finished). We could see this with a fewer application threads allocating at a furious rate or many application threads allocating at a more modest rate, but whatever the application we would often seen the concurrent marking thread overwhelmed on platforms with 8 or more processors.
The above policy provides a second concurrent marking threads at ParallelGCThreads=5 and approaches a fourth of ParallelGCThread at the higher processor numbers. Because we still do have the added overheard of parallelism 2 concurrent marking threads provide only a small boost in concurrent marking over a single concurrent marking thread. We expect that to still be adequate up to ParallelGCThreads=8. At ParallelGCThreads=9 we get a third concurrent marking thread and that's when we expect to need it.
CMSMaxAbortablePrecleanTime
Our low-pause collector (UseConcMarkSweepGC) which we are usually careful to call our mostly concurrent collector has several phases, two of which are stop-the-world (STW) phases.
The first STW pause is used to find all the references to objects in the application (i.e., object references on thread stacks and in registers). After this first STW pause is the concurrent marking phase during which the application threads runs while GC is doing additional marking to determine the liveness of objects. After the concurrent marking phase there is a concurrent preclean phase (described more below) and then the second STW pause which is called the remark phase. The remark phase is a catch-up phase in which the GC figures out all the changes that the application threads have made during the previous concurrent phases. The remark phase is the longer of these two pauses. It is also typically the longest of any of the STW pauses (including the minor collection pauses). Because it is typically the longest pause we like to use parallelism where ever we can in the remark phase.
Part of the work in the remark phase involves rescanning objects that have been changed by an application thread (i.e., looking at the object A to see if A has been changed by the application thread so that A now references another object B and B was not previously marked as live). This includes objects in the young generation and here we come to the point of these ramblings. Rescanning the young generation in parallel requires that we divide the young generation into chunks so that we can give chunks out to the parallel GC threads doing the rescanning. A chunk needs to begin on the start of an object and in general we don't have a fast way to find the starts of objects in the young generation.
Given an arbitrary location in the young generation we are likely in the middle of an object, don't know what kind of object it is, and don't know how far we are from the start of the object. We know that the first object in the young generation starts at the beginning of the young generation and so we could start at the beginning and walk from object to object to do the chunking but that would be expensive. Instead we piggy-back the chunking of the young generation on another concurrent phase, the precleaning phase.
During the concurrent marking phase the applications threads are running and changing objects so that we don't have an exact picture of what's alive and what's not. We ultimately fix this up in the remark phase as described above (the object-A-gets-changed-to-point-to-object-B example). But we would like to do as much of the collection as we can concurrently so we have the concurrent precleaning phase. The precleaning phase does work similar to parts of the remark phase but does it concurrently. The details are not needed for this story so let me just say that there is a concurrent precleaning phase. During the latter part of the concurrent precleaning phase the the young generation "top" (the next location to be allocated in the young generation and so at an object start) is sampled at likely intervals and is saved as the start of a chunk. "Likely intervals" just means that we want to create chunks that are not too small and not too large so as to get good load balancing during the parallel remark.
Ok, so here's the punch line for all this. When we're doing the precleaning we do the sampling of the young generation top for a fixed amount of time before starting the remark. That fixed amount of time is CMSMaxAbortablePrecleanTime and its default value is 5 seconds. The best situation is to have a minor collection happen during the sampling. When that happens the sampling is done over the entire region in the young generation from its start to its final top. If a minor collection is not done during that 5 seconds then the region below the first sample is 1 chunk and it might be the majority of the young generation. Such a chunking doesn't spread the work out evenly to the GC threads so reduces the effective parallelism.
If the time between your minor collections is greater than 5 seconds and you're using parallel remark with the low-pause collector (which you are by default), you might not be getting parallel remarking after all. A symptom of this problem is significant variations in your remark pauses. This is not the only cause of variation in remark pauses but take a look at the times between your minor collections and if they are, say, greater than 3-4 seconds, you might need to up CMSMaxAbortablePrecleanTime so that you get a minor collection during the sampling.
And finally, why not just have the remark phase wait for a minor collection so that we get effective chunking? Waiting is often a bad thing to do. While waiting the application is running and changing objects and allocating new objects. The former makes more work for the remark phase when it happens and the latter could cause an out-of-memory before the GC can finish the collection. There is an option CMSScavengeBeforeRemark which is off by default. If turned on, it will cause a minor collection to occur just before the remark. That's good because it will reduce the remark pause. That's bad because there is a minor collection pause followed immediately by the remark pause which looks like 1 big fat pause.l
DontCompileHugeMethods
We got a complaint recently from a user who said that all his GC pauses were too long. I, of course, take such a statement with a grain of salt, but I still try to go forward with an open mind. And this time the user was right, his GC pauses were way too long. So we started asking the usual questions about anything unusual about the application's allocation pattern. Mostly that boils down to asking about very large objects or large arrays of objects. I'm talking GB size objects here. But, no, there were nothing like that. The user was very helpful in terms of trying experiments with his application, but we weren't getting anywhere until the user came back and said that he had commented out part of his code and the GC's got much smaller. Hmmm. Curiouser and curiouser. Not only that, but the code that was commented out was not being executed. At this point the strain on my brain began to be too much and I lost consciousness. Fortunately, another guy in the group persevered and with some further experiments determined that the code that was being commented out was in a method that was not always being JIT'ed.
Methods larger than a certain size will not be JIT'ed in hotspot. Commenting out some code would bring the size of the method below the JIT size limit and the method would get compiled. How did that affect GC you might ask? When a method is compiled, the compilers generate and save information on where object references live (e.g., where on the stack or in which registers). We refer to these as oop maps and oop maps are generated to speed up GC. If the method has not been JIT'ed, the GC has to generate the oop maps itself during the GC. We do that by a very laborious means that we call abstract interpretation. Basically, we simulate the execution of the method with regard to where reference are stored. Large methods mean large abstract interpretation times to generate the oop maps. We do save the oop maps for the next GC, but oop maps are different at different locations in the method. If we generate an oop map for PC=200 this time but stop for a GC at PC=300 next time, we have to generate the oop map for PC=300. Anyway, the method in which code was being commented in and out, was too large to be JIT'ed with the code commented in and that led to the long GC's.
If you have some huge methods and GC's are taking a very long time, you could try -XX:-DontCompileHugeMethods. This will tell the JIT to ignore its size limit on compilation. I'm told by the compiler guy in my carpool that it's not a good idea to use that flag in general. Refactor your methods down to a less than huge size instead. By the way, the huge method was something like 2500 lines so it was what I would call huge.
Posted at 11:45AM Oct 26, 2007 by jonthecollector in Java | Comments[11]
Size Matters
I recently had an interesting exchange with a customer about concurrent mode failures with the low-pause collector. My initial response was with my litany of whys and wherefores of concurrent mode failures. But it got more interesting, because it turned out to be an example of where object sizes and the young generation size made an unexpected difference.Before I get to that let's talk a little bit about the JVM's behavior with regard to allocating large objects. Large, of course, is a relative term and what I mean here is large relative to the size of the young generation. My comments here apply to JDK 5 update 10 and later and JDK 1.4.2 update 13 and later. Earlier versions of those JVM's had a bug in them (6369448) that made their behavior different. All versions of JDK 6 are also free from this bug. By the way the customer was on JDK 5 update 6 so was subject to 6369448. That made it a little more interesting for me but may not be of interest to you if you are running on the later releases. And also the policy I describe below does not apply to the throughput collector (UseParallelGC) which has its own variation on this policy.
Objects are normally allocated out of the young generation and get moved to the tenured generation as they age. But what happens when a object is larger than the young generation (actually larger than the eden space in the young generation)? If the application tries to allocate object AAA and there is not enough free space in eden for the allocation, a garbage collection usually occurs. The exception to this is if AAA is larger than eden. An outline of the policy for allocation and collections follow.
- Allocate out of the young generation
- Possibly allocate out of the tenured generation
- Collect the young generation
- Possibly collect the tenured generation
- Heroics
If the allocation in 1. fails, before starting a GC the JVM considers allocating AAA out of the tenured generation at 2. The JVM compares the size of AAA to the current capacity of eden. By capacity I mean the total size available in an empty eden. If the capacity of eden is too small to hold AAA, then collecting the young generation will not help with the allocation of AAA. So the JVM tries to allocate AAA directly out of the tenured generation. If the allocation out of the tenured generation succeeds, no young generation GC is done and the JVM just continues. If AAA is smaller than eden, then the JVM proceeds to the collection of the young generation at 3. After the the young generation collection, a check is done at 4. to see if enough space has been freed to allocate AAA in eden. If AAA still cannot be allocated out of the young generation, then the tenured generation is collected at 5. and an allocation attempt is made out of the tenured generation. If AAA cannot be allocated in the tenured generation, some additional heroics are attempted (e.g., clearing all SoftReferences). Failing those heroics an out-of-memory is thrown.
A few things to note. The initial allocation at 1. is what we refer to as fast-path allocation. Fast-path allocation does not call into the JVM to allocate an object. The JIT compilers know how to allocate out of the young generation and code for an allocation is generated in-line for object allocation. The interpreter also knows how to do the allocation without making a call to the VM. And yes, both know about thread local allocation buffers (TLAB's) and use them. The allocation out of the tenured generation at 2. could be attempted unconditionally for any sized object. That could avoid a collection of the young generation, but would also defeat the purpose of having the young generation. We want new objects allocated in the young generation because that's where they will probably die. We only want the longed-lived objects to make it into the tenured generation. Also an allocation at 2. is not a fast-path allocation so the execution of the application could be significantly affected by too much allocation at 2. At 4. a check is made to see if enough space has been freed in the young generation for AAA and if not the tenured generation is collected. Why not attempt the allocation of AAA out of the tenured generation before doing the collection? Attempting the allocation first would probably work lots of the time, but there are pathological cases where allocations would repeatedly be done at 4. If we called allocation at 2. slow-path allocation, we'd have to call allocation at 4. slow-slow-path allocation. At 4. we've already stopped the world and done a young generation collection. The pathological case would execute very slowly. We know. We've tried variations that did the slow-slow-path allocation first and the reports that we usually got was the the JVM was hung! The JVM was not hung. Going slow-slow-path just made it seem so. Doing the collection of the tenured generation actually also collects the young generation and doing that avoids the pathological case in practice.
The final thing to note is that you don't want to be stuck using slow-path allocation (allocation at 2.) It's really slow compared to fast-path allocation. So if you have large objects (and only you would know), try to make your young generation large enough to hold plenty of them (10's would be good, 100's would be better). Or that you have very few of them.
The policy for the throughput collector differs from the above at 2. and 5. At 2. the throughput collector will allocate an object out of the tenured generation if it is larger than half of eden. That variation tries to not fill up the young generation with large objects. Is this better? I just don't know. The heroics at 5. are also different.
The interesting exchange with the customer that got me thinking about this blog started with this snippet of a GC log. Again this was with JDK 5 update 6 so the bug 6369448 was in play.
963236.282: [GC 963236.282: [ParNew: 15343K->0K(21184K), 0.1025429 secs] 8289743K->8274557K(12287936K), 0.1029416 secs] 963236.479: [GC 963236.479: [ParNew: 10666K->0K(21184K), 0.1072986 secs] 963236.587: [CMS (concurrent mode failure): 8285141K->7092744K(12266752K), 91.3603721 secs] 8285224K->7092744K(12287936K), 91.4763098 secs] 963328.194: [GC 963328.194: [ParNew: 21120K->0K(21184K), 0.1455909 secs] 7135030K->7114033K(12287936K), 0.1459930 secs] 963328.434: [GC 963328.435: [ParNew: 7442K->0K(21184K), 0.0745429 secs] 963328.509: [CMS (concurrent mode failure): 7114084K->7073781K(12266752K), 78.1535121 secs] 7121475K->7073781K(12287936K), 78.2286852 secs] 963408.503: [GC 963408.503: [ParNew : 21120K->0K(21184K), 0.0651745 secs] 7116067K->7097487K(12287936K), 0.0656080 secs]
When I first looked at this log I jumped right to the first concurrent mode failure. The first thing to notice was that the available free space in the tenured generation was large (about 4g). The total space in the tenured generation was 12266752K and the occupancy (amount of space currently being used) of the tenured generation was only 8285141K (see the part of the log immediately following the first "concurrent mode failure" message). That's about 4g of space. With that much free space the only thing I could think of was a really bad case of fragmentation. Now a concurrent mode failure is bad for the low-pause collector. If it is due to fragmentation at least the resulting collection is a compacting collection and you should not have to worry about fragmentation for a while. So why is there another concurrent mode failure after just one minor collection? And there is even more free space in the tenured generation when this second one occurred (about 5g). So now I'm sweating. I fortunately started noticing some other peculiarities in the log. For example the first minor collection starts when the young generation had only allocated 15343K out of a total of 21184K. That's not that unusual because the survivor space can be sizable and mostly empty. But the third collection is also a minor collection and the young generation has allocated 21120K by the time the minor collection starts (as does the fifth collection). Hmmm. Looking at the occupancies of the young generation at the point of the two concurrent mode failures shows the pattern: the young generation has a good chunk of free space but is still being collected. So I asked the customer if very large objects are being allocated, objects that are approaching the size of the young generation. To my relief the answer was yes. What follows now is my interpretation of the logs. I didn't have the application to run and really debug the behavior, but I think the pieces fit together.
The first collection (a minor collection) starts seemingly early because a large object is being allocated. The object is larger than the avalable space in eden but is not larger than all of eden so the minor collection does free up enough space to allocate that large object. The next collection is started when an even larger object allocation is attempted. This is where bug 6369448 comes into play. If not for 6369448 the larger object would have been allocated out of the tenured generation at 2. (the corrected case that I discussed above). There was an inconsistency in the test at 2. and the check at 4. before the collection of the tenured generation. A young generation is divided into an eden and two survivor spaces. The test at 2. compared the object size to the total size of the young generation. The check at 4. only checked the size (as is correct) against eden. So the short circuit at 2. was not taken, the collection at 3. doesn't free up enough space, and a tenured generation collection (concurrent mode failure) is the result. The third collection is an uninteresting minor collection. The fourth collection again is a collection prompted by the allocation of an object too big to fit into eden so another concurrent mode failure happens. The fifth collection is also uninteresting as are the many minor collections that follow.
My story for the two concurrent mode failures does depend on the large objects fitting in that gap in size between eden and the total young generation, but I'm willing to assume that there was a mix of object sizes and occasionally there was something in the gap. The two concurrent mode failure pattern above didn't happen often in the entire log and it was a long log. And the customer increased the size of the young generation and these types of concurrent mode failures did not reoccur.
I've always expected the occurrence of slow-path allocation to be a rare event. I think it is rare, but that's not the same as never so I'm thinking about adding a counter to be printed into the GC log to indicate how many slow-path allocations have occurred since the last GC.
Posted at 10:58AM Jun 22, 2007 by jonthecollector in Java |
Get What You Need
As we all know by now GC ergonomics exists to automatically adjust the sizes of the generations to achieve a pause time goal and/or a throughput goal while using a minimum heap size. But sometimes that's not what you really want. I was talking to a user recently who was using java.lang.Runtime.totalMemory() and java.lang.Runtime.freeMemory() to monitor how full the heap was. The intent was that the user's application be able to anticipate when a GC was coming and to throttle back on the number of requests being serviced so as to avoid dropping requests due to the reduced throughput when the GC pause did occur. The problem was that the capacity of the heap (the amount returned by totalMemory()) was varying so much that anticipating when a GC was coming was not very reliable. First the pretty picture.
Looking at the memory used plot first, I inferred from the graph that there was some processing during initialization that required a lesser amount of memory. After that there was activity that used and released memory until about 6k seconds. At that point the activity died down until about 7.5k seconds when there was an up tick in activity and then a quiet period again until about 11k seconds. After that there was pretty regular allocations being done.
Looking at the total memory I see what I'm expecting to see. The initial total heap size at zero time is larger than is needed during the first phase and the total memory drops. Then as allocations picked up the size of the heap grew and, on average, stayed at a higher level until the first quiet period. During the first quiet period the size total heap size decayed to a lower value that corresponds to the lesser demand for allocations. Then there is the short burst of activity with an corresponding growth in the heap. The toal heap again drops during the second quiet period and then grows again as the activity picks up.
I like the behavior I see, but it wasn't what the user wanted.
The amount of variation in the total heap size during the active periods is high. This is varying too quickly for this user. The reason for the variations is that the GC ergonomics is trying to react to variations in the behavior of the applications and so is making changes to the heap as soon as a variation in the application's behavior is recognized. There are two ways to look at this:
- GC ergonomics was designed for the case where a steady state behavior has been reached, and as exhibited by the variations in the amount of memory used, this application doesn't reach a steady state (on the time scale of interest) and GC ergonomics is doing the best that it can under the circumstances, and
- the user's application is what it is and what should he do?
The first bullet is the excuse I use for almost everything. Let's consider the second.
First a word about what GC ergonomics does by default that provides some damping of variations.
GC ergonomics calculates the desired size of a generation (in terms of meeting the goals specified by the user) and then steps toward that size. There is a miminum size for a step and if the distance to the desired size is less than that minimum step, then the step is not made. The minimum step size is the page size on the platform. This eliminates changes as we get close to the goal and removes some of the jitter.
When deciding whether a goal is being met, GC ergonomics uses weighted averages for quantities. For example, a weighted average for each pause time is used. The weighted averages typically change more slowly than the instantaneous values of the quantities so their use also tends to damp out variations.
But if you really want a more stable heap size, here's what you can try. These suggestions limit the range of changes that can be made by GC ergonomics. The more minor limitations are given first followed by the more serious ones, finally ending with turning off GC ergonomics. You can just try the suggestions (in bold) and skip over the explanations as you wish.
Reduce the jitter caused by the different sizes of the survivor spaces (and the flip-flopping of the roles of the survivor spaces) by reducing the size of the survivor spaces. Try NNN = 8 as a starter.
-XX:MinSurvivorRatio=NNN
-XX:InitialSurvivorRatio=NNN
The total size of the young generation is the size of eden plus the size of from-space. Only from-space is counted because the space in the young generation that contain objects allocated by the application is only eden and from-space. The other survivor space (to-space) can be thought of as scratch space for the copying collection (the type of collection that is used for the young generation). Each survivor space alternately plays the role of from-space (i.e. during a collection survivor space A is from-space and in the next collection survivor space B is from-space). Since the two survivor spaces can be of different sizes, just the swapping can change the total size of the young generation. In a steady state situation the survivor spaces tend to be the same size but in situations where the application is changing behavior and GC ergonomics is trying to adjust the sizes of the survivor spaces to get better performance, the sizes of the survivor spaces are often different temporarily.
The default value for MinSurvivorRatio is 3 and the default value for InitialSurvivorRatio is 8. Pick something in between. A smaller value puts more space into the survivor spaces. That space might go unused. A larger value limits the the size of the survivor spaces and could result in objects being promoted to the tenured generation prematurely. Note that the survivor space sizes are still adjusted by GC ergonomics. This change only puts a limit on how large they can get.
Reduce the variation in the young generation size by setting a minimum and a maximum with -XX:NewSize=NNN and -XX:MaxNewSize=MMM.
GC ergonomics will continue to adjust the size of the young generation within the range specified. If you want to make the minimum and maximum limits of the young generation the same you can use -XX:NewSize=NNN and -XX:MaxNewSize=NNN or the flag -XmnNNN will also do that.
Reduce the range over which the generations can change by specifying the minimum and maximum heap size with -Xms and -Xmx, respectively.
If you've already explicitly specified the size of the young generation, then specifying the limit on the entire heap will in effect specify the limits of the tenured generation. You don't have to make the minimum and maximum the same but making them the same will have the maximum effect in terms of reducing the variations.
Turn off GC ergonomics with the flag -XX:-UseAdaptiveSizePolicy
The size of the generations, the sizes of the survivor spaces in the young generation and the tenuring threshold stay at their starting value throughout the execution of the VM. You can exercise this level of control but the you're back to tuning the GC yourself. But sometimes that really is the best solution. The user I talked to liked the more predictable heap size better than the automatic tuning and so turned off ergonomics.
Posted at 02:09PM Apr 23, 2007 by jonthecollector in Java |
GC Errata 1
Just some tid bits of information to fill some of the holes in the documentation.GC ergonomics is used by default on server class machines. Please see
http://java.sun.com/docs/hotspot/gc5.0/ergo5.html
for a description of GC ergonomics and server class machine. The flag "-server" causes the server VM to be used. "-server" does not cause the platform on which the VM is running to be considered a server class machine.
Prior to JDK6 the UseParallelGC collector (with GC ergonomics turned on) used only the flags InitialSurvivorRatio and MinSurvivorRatio to set the initial survivor ratio and the minimum survivor ratio, respectively. GC ergonomics dynamically resizes the survivor spaces, but these flags allow the user to specify initial and minimum sizes for the survivor spaces. The flag SurvivorRatio is used by the other collectors (that don't dynamically resize the survivor spaces) to set the size of the survivor spaces. SurvivorRatio was ignored by the UseParallelGC collector. Beginning with JDK6 if the SurvivorRatio is set on the command line for the UseParallelGC collector and InitialSurvivorRatio and MinSurvivorRatio are not explicitly set on the command line, then InitialSurvivorRatio and MinSurvivorRatio are set to SurvivorRatio + 2.
If UseAdaptiveSizePolicy is turned off when using the UseParallelGC collector, the tenured generation and the young generation stay at their initial sizes throughout the execution of the VM.
Starting with JDK5 update 06 the maximum tenuring threshold is limited to 15. The GC logging output does not correctly reflect this change. This is a bug that will be fixed under 6521376.
If a minimum heap size is set with -Xms and NewSize is not explicitly set, the minimum young generation size is calculated using NewRatio. The UseParallelGC collector is not correctly using the minimum value calculated from -Xms and NewRatio but is using the default value of NewSize. Explicitly setting NewSize does correctly set the minimum size for the generation. This is being fixed under bug 6524727.
Posted at 11:55AM Mar 05, 2007 by jonthecollector in Java | Comments[3]
How Does Your Garden Grow?
Actually, the title should be "How Does Your Generation Grow" but then the literary reference would have been totally lost.I've written about how the generations grow if you are using GC ergonomics.
http://blogs.sun.com/jonthecollector/entry/it_s_not_magic
What if you're not using GC ergonomics? There is a different policy for growing and shrinking the heap that is used by the low pause collector and the serial collector. If you're interested in that policy, this is you're lucky blog.
It's actually pretty straightforward. At the end of the collection of a generation, we check the amount of free space in the generation. If the amount of free space is below a certain percentage (specified by MinHeapFreeRatio) of the total size of the generation, then the generation is expanded. Basically, we ask if we have enough free space in the generation so that the VM can run a while before we have to collect again. At the other end we don't want to have an excessive amount of free space so if the amount of free space is above a certain percentage (specified by MaxHeapFreeRatio) the generation is shrunk. That's it.
If you decide to play with the free ratio parameters, leave enough distance between MinHeapFreeRatio and MaxHeapFreeRatio so that the generations are not constantly adjusting by small amounts to get to the "perfect" ratio. Also our experience is that even if a generation does not need the extra free space right now, it will shortly, so don't be too aggressive with MaxHeapFreeRatio.
Will this policy eventually be replaced by GC ergonomics? Actually, I think not. I recently talked to a customer who told me that he was more concerned with bounds on the heap footprint than with achieving some specific throughput. The way he put it was something like "I don't want to be doing GC's all the time so there should be a good amount of free space in the heap but not too much". This policy is one approximation of what he wants.
Posted at 02:40PM Feb 12, 2007 by jonthecollector in Java | Comments[3]
JDK6 GC Release notes
Here's a couple of pointers to the GC section of the release notes for JDK6.Concurrent Mark Sweep Collector Enhancements
Edit done May 16, 2007.
Ryan and Tobais, sorry but I just saw your comments. I could not figure out how to reply with a comment (comment period being closed and I didn't want to mess with the settings too much) so I'm editting the blog to respond.
Ryan,
The line "CMS-concurrent-abortable-preclean" does not indicate a problem. The "abortable" only indicates that the current phase (precleaning) can be interrupted in order to start the next phase.
Tobais,
Yes, the UseParallelOldGC does imply UseParallelGC.
Posted at 03:17PM Feb 06, 2007 by jonthecollector in Java | Comments[2]
When You're at Your Limit
It's often the case that we advise users of large applications to use large young generations. The rational is that large applications (very often multithreaded) have large allocation rates and a large young generations is good because it allows more time between minor collections. One case where that is not a good idea is when you're using the largest heap you can (e.g. close to 4G on a 32 bit system) and the live data is bigger than will fit in the tenured generation.When the young generation fills up, we want to do a minor collection. Minor collections are in general more productive (the young generation tends to have more garbage) and faster (the time to collect depends largely on the amount of live data in the young generation and not on the about of garbage). If, however, when we're contemplating a minor collection and the tenured generation is full, the garbage collector will think that it's time to do a major collection and collect both the tenured generation and the young generation. If the tenured generation is actually full (or very nearly full) of live data, then the tenured generation may still be full at the end of the major collection. The collection worked, but the next time the young generation fills up, a major collection will again be done.
If you find yourself in this situation, try reducing the size of the young generation and increasing the size of the tenured generation accordingly. The goal is to have enough free space available in the tenured generation after a major collection so that the next collection can be a minor collection. Even if you can only manage enough free space in the tenured generation to do 1 minor collection, it's a win. Sizing the generations so you can get a few minor collections between major collections would be sweeter, but that all depends on how close you are to your limit.
Posted at 08:55AM Jan 11, 2007 by jonthecollector in Java | Comments[5]
Presenting the Permanent Generation
Have you ever wondered how the permanent generation fits into our generational system? Ever been curious about what's in the permanent generation. Are objects ever promoted into it? Ever promoted out? We'll you're not alone. Here are some of the answers.Java objects are instantiations of Java classes. Our JVM has an internal representation of those Java objects and those internal representations are stored in the heap (in the young generation or the tenured generation). Our JVM also has an internal representation of the Java classes and those are stored in the permanent generation. That relationship is shown in the figure below.
The internal representation of a Java object and an internal representation of a Java class are very similar. From this point on let me just call them Java objects and Java classes and you'll understand that I'm referring to their internal representation. The Java objects and Java classes are similar to the extent that during a garbage collection both are viewed just as objects and are collected in exactly the same way. So why store the Java objects in a separate permanent generation? Why not just store the Java classes in the heap along with the Java objects?
Well, there is a philosophical reason and a technical reason. The philosophical reason is that the classes are part of our JVM implementation and we should not fill up the Java heap with our data structures. The application writer has a hard enough time understanding the amount of live data the application needs and we shouldn't confuse the issue with the JVM's needs.
The technical reason comes in parts. Firstly the origins of the permanent generation predate my joining the team so I had to do some code archaeology to get the story straight (thanks Steffen for the history lesson).
Originally there was no permanent generation. Objects and classes were just stored together.
Back in those days classes were mostly static. Custom class loaders were not widely used and so it was observed that not much class unloading occurred. As a performance optimization the permanent generation was created and classes were put into it. The performance improvement was significant back then. With the amount of class unloading that occur with some applications, it's not clear that it's always a win today.
It might be a nice simplification to not have a permanent generation, but the recent implementation of the parallel collector for the tenured generation (aka parallel old collector) has made a separate permanent generation again desirable. The issue with the parallel old collector has to do with the order in which objects and classes are moved. If you're interested, I describe this at the end.
So the Java classes are stored in the permanent generation. What all does that entail? Besides the basic fields of a Java class there are
- Methods of a class (including the bytecodes)
- Names of the classes (in the form of an object that points to a string also in the permanent generation)
- Constant pool information (data read from the class file, see chapter 4 of the JVM specification for all the details).
- Object arrays and type arrays associated with a class (e.g., an object array containing references to methods).
- Internal objects created by the JVM (java/lang/Object or java/lang/exception for instance)
- Information used for optimization by the compilers (JITs)
That's it for the most part. There are a few other bits of information that end up in the permanent generation but nothing of consequence in terms of size. All these are allocated in the permanent generation and stay in the permanent generation. So now you know.
This last part is really, really extra credit. During a collection the garbage collector needs to have a description of a Java object (i.e., how big is it and what does it contain). Say I have an object X and X has a class K. I get to X in the collection and I need K to tell me what X looks like. Where's K? Has it been moved already? With a permanent generation during a collection we move the permanent generation first so we know that all the K's are in their new location by the time we're looking at any X's.
How do the classes in the permanent generation get collected while the classes are moving? Classes also have classes that describe their content. To distinguish these classes from those classes we spell the former klasses. The classes of klasses we spell klassKlasses. Yes, conversations around the office can be confusing. Klasses are instantiation of klassKlasses so the klassKlass KZ of klass Z has already been allocated before Z can be allocated. Garbage collections in the permanent generation visit objects in allocation order and that allocation order is always maintained during the collection. That is, if A is allocated before B then A always comes before B in the generation. Therefore if a Z is being moved it's always the case that KZ has already been moved.
And why not use the same knowledge about allocation order to eliminate the permanent generations even in the parallel old collector case? The parallel old collector does maintain allocation order of objects, but objects are moved in parallel. When the collection gets to X, we no longer know if K has been moved. It might be in its new location (which is known) or it might be in its old location (which is also known) or part of it might have been moved (but not all of it). It is possible to keep track of where K is exactly, but it would complicate the collector and the extra work of keeping track of K might make it a performance loser. So we take advantage of the fact that classes are kept in the permanent generation by collecting the permanent generation before collecting the tenured generation. And the permanent generation is currently collected serially.
Posted at 12:01PM Nov 28, 2006 by jonthecollector in Java | Comments[11]
Why Now?
Are you ever surprised by the occurrence of a full collection (aka major collection)? Generally a full collection only happens when an allocation is attempted and there is not enough space available in any of the generations. But there are exceptions to that rule and here are some of them. Actually, these are all of them that I could think of. I thought I would have a longer list so to fill out this blog I've included some bugs that relate to full collections that might interest you. And we're thinking about a policy change for the low pause collector that would affect full collections.The simplest reason for an unexpected full collection is that you asked for it. System.gc() can be called from most anywhere in the program. Try using -XX:+DisableExplicitGC to have the garbage collector ignore such requests.
The permanent generation is collected only by a full collection. It's very seldom the cause for a full collection but don't overlook the possibility. If you turn on PrintGCDetails, you can see information about the permanent generation collection.
0.167: [Full GC [PSYoungGen: 504K->0K(10368K)] [ParOldGen: 27780K->28136K(41408K)] 28284K->28136K(51776K) [PSPermGen: 1615K->1615K(16384K)], 0.4558222 secs]
This output is from the throughput collector. The "PSPermGen" denotes the permanent generation. The size of the permanent generation currently is 16384K. It's occupancy before the collection is only 1615K and is probably the same after the collection. If the permanent generation needed collecting, the occupancy before the collection would have been closer to the 16384K.
The garbage collector tracks the average of the rate of promotion of live objects from the young generation into the tenured generation. If the average exceeds the free space in the tenured generation, the next time the young generation fills up, a full collection is done instead of a minor collection. There is a pathological situation where, after the full collection, the free space in the tenured generation is still not large enough to take all the live objects expected to be promoted from the next minor collection. This will result in another full collection the next time the young generation fills up. The bad part about this is that the average value for the amount promoted only changes when a minor collection is done and if no minor collections are done, then the average does not change. Increasing the size of the heap usually avoids this problem.
Sometimes it's not really a full collection. There was a bug affecting JDK5 (6432427) which print "Full" when the collection was not a full collection. When JNI critical sections are in use, GC can be locked out. When the JNI critical sections are exited, if a GC had been requested during the lock out, a GC is done. That GC most likely would be a minor collection but was mistakenly labeled a full collection. A second place where "Full" was also printed erroneously was when -XX:+CMSScavengeBeforeRemark is used. The same bug report explains and fixes that error.
An attempt to allocate an object A larger than the young generation can cause a minor collection. That minor collection will fail to free enough space in the young generation to satisfy the allocation of A. A full collection is then attempted and A is then allocated out of the tenured generation. This behavior was a bug (6369448). It has been fixed in JDK's 1.4.2_13, 5.0_10, and 6. The correct behavior is for the GC to recognize when an object is too large to be allocated out of the young generation and to attempt the allocation out of the tenured generation before doing a collection.
When an allocation cannot be satisfied by the tenured generation a full collection is done. That is the correct behavior but that behavior has an adverse affect on CMS (the low pause collector). Early in a run before the CMS (tenured) generation has expanded to a working size, concurrent mode failures (resulting in full collections) happen. These concurrent mode failures perhaps can be avoided. We are thinking about a change to the policy such that CMS expands the generation to satisfy the allocation and then starts a concurrent collection. We're cautious about this approach because similar policies in the past has led to inappropriate expansion of the CMS generation to its maximum size. To avoid these full collections try specifying a larger starting size for the CMS generation.
So just because these are all the unexpected "Full" collections that I could think of, that doesn't mean that these are all there are. If you have a mysterious "Full" collection happening, submit a description to
http://java.sun.com/docs/forms/gc-sendusmail.html
I'll post any additional cases of unexpected "Full" collections that I get.
Late addition (I'm allowed to edit my blog even after they've been posted). If you ask about unexpected "Full" collections and send in a gc log, please add the command line flags
-XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps
The additional output (of which there will be plenty) may help me see what's happening.
Posted at 09:42AM Oct 26, 2006 by jonthecollector in Java | Comments[3]
The Second Most Important GC Tuning Knob
[Read More]Posted at 03:54PM Sep 11, 2006 by jonthecollector in Java | Comments[5]
More of a Good Thing
The throughput collector is getting more parallel.In our J2SE 6 release there is an option in the throughput collector to collect the tenured generation using multiple GC threads. Yes, the UseParallelGC collector wasn't all that we wanted it to be in terms of parallelism. Without this recent work only the young generation was being collected in parallel.
The option to turn on the parallel collection of the tenured generation is -XX:+UseParallelOldGC. That's the punch line for this blog. Oh, also it's off by default.
Below are a few details on the new collector. Actually, first I'll describe the serial collector used for the tenured generation collection and some of the things we learned from it. Then I'll describe the new collector and how we used what we had learned.
The J2SE 5 (and earlier) collector for the tenured generation was a serial mark-sweep-compact collector. There were four phases of that collector.
1. Mark all the live objects
2. Calculate the new locations of the objects
3. Adjust all object references to point to the new locations
4. Move the objects to their new locations
Phase 1 found and marked the live objects.
In phase 2 the collector kept a pointer to the first location in the tenured generation to which we could move a live object during the collection. Let's call that location T. T started at the bottom of the generation. The collector scanned forward from the bottom of the generation looking for marked (i.e., live) objects. When a live object A was found, the collector saved T in A (i.e., saved the new location of A in A). It then calculated the next location T as T + sizeof(A) and restarted the scan looking for the next live object to put at the new T.
Phase 3 started at the roots, references to objects known to the application (for the most part), and changed all the references in the roots and all the reference found from the roots to point to the new location of the objects (to the T's saved in the A's).
Phase 4 again started at the bottom of the tenured generation and scanned forward to find live objects. When a live object was found, it was moved to it's new location (the T's).
The end result was that all the live objects were moved to the lower end of the tenured generation and all the free space was in one contiguous region at the higher end of the tenured generation. It simplifies allocations from the tenured generation to have all the free space in a single region. Also there are no issues of fragmentation as would be the case if the collection resulted in multiple regions of free space separated by live objects. This single region of free space seemed an important property to keep when thinking about how to do a collection of the tenured generation using multiple GC threads.
Another feature of our serial mark-sweep-compact collector that we liked is the flexibility to leave some deadwood at the lower end of the tenured generation. We use the term "deadwood" to refer to garbage at the lower end of the generation that we're not going to collect. We don't collect it to reduce the number of objects that we have to move. If we have objects
A B C D
and if B is garbage and we collect it, then we have to move C and D. If we treat B as deadwood, then C and D can stay where they are. We make a judgment call to waste a certain amount of space (that occupied by the deadwood) in order to move fewer objects.
You may have noticed that each of the four phases of our mark-sweep-compaction walked over at least all the live objects. In some cases it actually walked over all the objects (live and dead). When we're scanning the heap (such as in phases 2), we look at a dead object just to find its size so that we can step over it. Any part of the collection that walks over the objects is costly so with the new parallel collector for the tenured generation we wanted to have fewer walks over the objects.
Keeping all this in mind resulted in a parallel collector that
* Marked the live objects using multiple GC threads and maintained some summary information (explained below).
* Used the summary information to determine where each live object would go. Also determined the amount of deadwood to keep.
* Moved live objects so there was one contiguous block of free space in the generation and updated references to live objects using multiple GC threads.
I'll refer to these three phases as the marking phase, summary phase and move-and-update phase. In some of the other documentation on this new collector the last phase is referred to as the compaction phase. I prefer move-and-update because that's the type of name used in the code. Just in case you're ever looking at the code.
Marking phase
Doing the marking in parallel was straight forward. The parallel young generation collector does this and a similar technique was used. The additional twist to the marking was creating the summary information. The heap is logically divided into chunks. Each object starts within a chunk. As we're doing the marking, when a live object was found, its size was added to the chunk containing the start of the object. At the end this gives us the size the of data for objects that begin in each chunk.
Summary phase
The summary phase first walks over the chunks looking for the extent of the heap that contains a good amount of deadwood. We call that part of the heap the dense prefix. Live objects that are moved will be placed in the heap starting immediately after the dense prefix. Given that we know the amount of live data for each chunk, we can calculate the new location of any live object using the summary data. Basically adding up the sizes of the live data in all the chunks before chunk A tells you where the live data in chunk A is going to land. The summary phase is currently done by one thread but can be done in parallel if it turns out to be a scaling bottleneck.
Update-and-move phase
Given the summary information the move-and-update phase identifies chunks that are ready to be filled and then moves the appropriate objects into those chunks. A chunk is ready to be filled if it is empty or if its objects are going to be compacted into that same chunk. There is always at least one of those. The move-and-update phase is done with multiple GC threads. One thing to note is that it is possible that there is only 1 ready chunk at any given time. If the chunks are A, B, C, D ... and the only dead object is in A and fits entirely in A, then A is the only ready chunk. The move-and-update phase will first fill any remaining space in A. Then B will be the only ready chunk. Obviously an extreme case but it makes the point. There is a technique to widen this "gap" of ready chunks. We expect to implement it, but it's not in this release.
That's basically it. We're working on some scaling issues with this new collector, but it working pretty well as is.
For more on this new collector see the slides for the GC presentation done at this year's JavaOne. They can be download from
The last time I looked, there was a "Click here" link for the technical sessions in the first paragraph. Following that link, download the Java SE bundle and in there our GC presentation is TS-1168.
Posted at 12:17PM Aug 30, 2006 by jonthecollector in Java | Comments[2]
The GC whitepaper
There's a very nice whitepaper on our GC athttp://java.sun.com/j2se/reference/whitepapers/memorymanagement_whitepaper.pdf
It talks about basic concepts, some specifics about our GC, suggestions on tuning, and some of the latest improvements such as the parallel old collector. If you've looked at the GC tuning guides, some of it will look familiar, but it nicely pulls together information from several sources.
Posted at 11:53AM Aug 24, 2006 by jonthecollector in Java | Comments[4]
Monday May 19, 2008