Jon Masamitsu's Weblog
Why not a Grand Unified Garbage Collector?
At last count we have three garbage collectors.
Why is there more than one? Actually, the usual answer applies. Specialization often results in better performance. If you're interested in more particulars about our garbage collectors, read on.
All three collectors are generational (young generation and tenured generation).
Let's do the easy comparison first, why a parallel collector and a serial collector. Parallelism has overhead. Nuf said? Yeah, I used to read comic books when I was a kid. If you don't understand that reference, ignore it. I'm just older.
As you might infer from the names, the serial collector uses 1 thread to do the GC work and the parallel collector uses multiple threads to do the same. As usual multiple threads doing the same tasks have to synchronize. That's pretty much it. On a single cpu machine the additional cost of the synchronization means that the parallel collector is slower than the serial collector. On a two cpu machine and a VM that has a small heap the parallel collector is as about fast as the serial collector. With two cpu's and large heaps the parallel collector will usually do better. We keep asking ourselves if we can get rid of the serial collector and use the parallel collector in its place. The answer so far keeps coming back no.
More interesting is the case of the low pause collector versus the parallel collector. Above I made the remark about specialization and better performance. This is actually a case of more complexity and lesser performance. These two collectors do the collection of the young generation using almost the exact same techniques The differences in the collectors have to do with the collections of the tenured generation. The low pause collector does parts of that collection while the application continues to run. One way to do that is to not move the live objects when collecting the dead objects. The application tends to get confused if the objects it is using move around while the application is running. The other two collectors compact the heap during a collection of the tenured generation (i.e., live objects are moved so as to occupy one contiguous region of the heap). The low pause collector collects the dead objects and coalesces their space into blocks that are kept in free lists. Maintaining free lists and doing allocations from them takes effort so it's slower than having a heap that is compacted. Having applications run while a collection is happening means that new objects can be allocated during a collection. That leads to so more complexity. Also the collection of the tenured generation can be interrupted for a collection of the young generation. More complexity still. The bottom line is that the low pause collector has shorter GC pauses but it costs performance. That performance difference is not huge but it's large enough to keep us from ditching the parallel collector and always using the low pause collector.
And last but not least, can we replace the serial collector with the low pause collector? Very tempting. The serial collector is used by default on desktop machines. We expect those to have 1 or 2 cpus and to be running applications that need 10's of megabytes of Java heap as opposed to 100's of megabytes. With small heaps the differences in collection times tend to make less difference. Even if the low pause collector was 10% slower than the serial collector, the difference between, for example, 70 ms and 77 ms often isn't large enough to matter. It would probably be a done deal except that the low pause collector has a larger memory footprint. It has additional data structures that it uses (for example to keep track of what references are being changed by the the running application while a collection is on going). It also usually needs a larger Java heap to run an application. Recall that the low pause collector uses free lists to keep track of the available space in the heap. Fragmentation can become a problem. The best bet is that we'll replace the serial collector with the low pause collector some day but not just yet.
Posted at 05:23PM Jan 26, 2006 by jonthecollector in Java |
A Little Thread Privacy Please
This is not really about garbage collection but hopefully you'll find it interesting. And there is no punch line (i.e., cool command line flag that I can suggest to make your application run better). It's just a story about letting the VM do it.If you're using multiple threads (or one of the Java libraries is using multiple threads in your behalf) then threads in your application are doing allocations concurrently. All the threads are allocating from the same heap so some allowances have to be made for simultaneous allocations by 2 or more threads. In the simplest case (other than the case where no allowances are made and which is just plain wrong) each thread would grab a lock, do the allocation and release the lock. That gives the right answer but is just too slow. The slightly more complicated means would be to use some atomic operation (such as compare-and-swap) plus a little logic to safely do the allocation. Faster but still too slow. What is commonly done is to give each thread a buffer that is used exclusively by that thread to do allocations. You have to use some synchronization to allocate the buffer from the heap, but after that the thread can allocate from the buffer without synchronization. In the hotspot JVM we refer to these as thread local allocation buffers (TLAB's). They work well. But how large should the TLAB's be?
Prior to 5.0 that was a difficult question to answer. TLAB's that were too large wasted space. If you had tons of threads and large TLAB's, you could conceivably fill up the heap with buffers that were mostly unused. Creating more threads might force a collection which would be unfortunate because the heap was mostly empty. TLAB's that were too small would fill quickly and would mean having to get a new TLAB which would require some form of synchronization. There was not a general recommendation that we could make on how large TLAB's should be. Yup, we were reduced to trial-and-error.
Starting with 5.0 living large with TLAB's got much simpler - except for the guy down the hall that did the implementation. Here's what the VM does for you.
Each thread starts with a small TLAB. Between the end of the last young generation collection and the start of the next (let me call that period an epoch), we keep track of the number of TLAB's a thread has used. We also know the size of the TLAB's for each thread. Averages for each thread are maintained for these two numbers (number and size of TLAB's). These averages are weighted toward the most recent epoch. Based on these averages the sizes of the TLAB's are adjusted so that a thread gets 50 TLAB's during a epoch. Why 50? All things being equal we figured that a thread would have used half its TLAB by the end of the epoch. Per thread that gave us 1/2 a TLAB not used (out of the magic 50) for a wastage of 1%. That seemed acceptable. Also if the young generation was not large enough to provide the desired TLAB's, the size of the young generation would be increased in order to make it so (within the usual heap size constraints, of course).
The initial TLAB size is calculated from the number of threads doing allocation and the size of the heap. More threads pushes down the initial size of the TLAB's and larger heaps push up the initial size.
An allocation that can not be made from a TLAB does not always mean that the thread has to get a new TLAB. Depending on the size of the allocation and the unused space remaining in the TLAB, the VM could decide to just do the allocation from the heap. That allocation from the heap would require synchronization but so would getting a new TLAB. If the allocation was considered large (some significant fraction of the current TLAB size), the allocation would always be done out of the heap. This cut down on wastage and gracefully handled the much-larger-than-average allocation.
Posted at 02:14PM Jan 13, 2006 by jonthecollector in Java |
Since It's Not Magic
So by now you've tried GC ergonomics and things are working pretty well. If you didn't have any GC related command line flags, your reactions could run the gamut from the "big yawn" to an eye-popping "that's cool". The "big yawn" means that GC probably doesn't matter for your application. The "that's cool" means that GC matters but nobody ever told you. If you did have a bunch of GC related command line flags and you threw them out in order to try GC ergonomics, then 80% of you are happy and just glad to forget those command line flags. But 20% of you are thinking "Hmmm, not quite as good". So what else do you need to know about GC ergonomics? By the way I invoke the 80/20 rules quite often, especially when I have no idea what the real numbers are.
By default ergonomics gives you a larger maximum heap size. "Larger" is relative to the default maximum heap size you got prior to ergonomics in 5.0. This larger maximum heap size depends on the amount of physical memory on the system (the more physical memory you have, the larger the maximum heap size you get), but it is capped at 1G. If you had a command line flag that sets a larger maximum heap size, put it back. Ergonomics just doesn't know you're in the 20% that needs a larger heap. If you set a larger maximum heap size, ergonomics will still grow and shrink the generations for better performance but now will have more space to play with. By the way if you set a smaller maximum heap size, ergonomics still does its stuff but just within a smaller heap.
Ergomonics has a default for the relative sizes of the young generation and the tenured generation in the heap. If you have command line flags to specify the sizes of the generations (probably something like -XX:NewRatio=ratio or -XX:MaxNewSize=bytes), use them. GC ergonomics normally does not change the maximum size of a generation so if you know something about how your application runs such that you know it benefits from a larger or smaller young generation, pass that information along on the command line. Again GC ergonomics just works within the limits of the heap it's given.
Posted at 01:29PM Jan 02, 2006 by jonthecollector in Java |
Thursday Jan 26, 2006