GetJava Download Button XML Feed
All | About | Flying | General | Java | Solaris 10
20060822 Tuesday August 22, 2006

Microbenchmark Adventure

So I've been working on a new optimization for Hotspot that improves the speed we can call from Java (byte codes) to a Java Native method. There is unfortunately a lot more overhead on that path than we'd like and we take a fair amount of heat for it both internally and externally. At the moment the optimization is only available to the server compiler while I see how worthwhile the change is.

As part of investigating the performance of this I came this web page and the accompanying benchmark. So I downloaded it to see what my optimization did for this test. I'm always quite dubious of microbenchmarks because they often don't measure what the author expects because the compiler or OS performs in unexpected ways. Quite often the compiler is able to remove work that could never be removed in a real app and so the benchmark predicts more performance than you actually get. In this case the benchmark turns out to be flawed, but in opposite direction, predicting less Java performance that you might expect in a real app.

Well the first thing I had to do was to fix the SunOS.jmk file since as received it didn't really work on Solaris. So then I ran it and I could see that I got a boost but that the results were kind of noisy. So the benchmark has an option to run it multiple times ( -n <n> ) so I ran it five times to try and average out the noise.

So I observed a very strange thing. Here is a snippet of the results (without my JNI optimization):

Throughput:
        JavaRowConsumer 515583
        FineGrainedJNIRowConsumer       51986
        CoarseGrainedJNIRowConsumer     79021
        BytePackedJNIRowConsumer        106115
        SocketRowConsumer       66822
end throughput.
Throughput:
        JavaRowConsumer 203963
        FineGrainedJNIRowConsumer       32793
        CoarseGrainedJNIRowConsumer     72642
        BytePackedJNIRowConsumer        117230
        SocketRowConsumer       72205
end throughput.
...


So that's kind of strange. Why did JavaRowConsumer and FineGrainedJNIRowConsumer slow down so dramatically on the second iteration? I wondered if maybe something was broken in the benchmark so that the SocketRowConsumer was still doing work and competing against the next iteration. So I reversed the order of the controlling loop. Then I got:

Throughput:
        SocketRowConsumer       86094
        BytePackedJNIRowConsumer        151618
        CoarseGrainedJNIRowConsumer     77717
        FineGrainedJNIRowConsumer       31748
        JavaRowConsumer 195528
end throughput.
Throughput:
        SocketRowConsumer       73018
        BytePackedJNIRowConsumer        116724
        CoarseGrainedJNIRowConsumer     73709
        FineGrainedJNIRowConsumer       32610
        JavaRowConsumer 198806

So now JavaRowConsumer and FineGrainedJNIRowConsumer are consistent but SocketRowConsumer and BytePackedJNIRowConsumer seem to have the disease. So then I really read the code to the simulator. Then the light went on. In order to make the Simulator simple there is an Interface for a RowConsumer and there are five implementors of that interface. The driver for producer/consumer is passed different objects depending on the subtest that is happening.

So what happens is this. While the benchmark is heating up all the calls for the interfaces in RowConsumer are for a single type (whichever is first in the output above). So when we finally compile the code the call site uses that prediction and compiles for that single type (it can even inline it). Then the usage pattern changes and we go from monomorphic to bi-morphic and have to recompile. At the point I suspect we're prepared or a bimorphic call site. Later still that gets proven incorrect and we go mega-morphic and have to do a straight v-table call. At that point the results stabilize at a much lower value.

What's interesting is how dramatically the results change. The microbenchmark does so little work in the calls that how we do the call dramatically changes the results. This is probably not realistic. In fact the comparison to C is not particularly fair since the C code doesn't have to virtual dispatch since it isn't object oriented and there is in essence only a single implementor. So the results on the web page would seem to be skewed by that. To be fairer the C code should be C++ and have multiple implementors. But there is still more.

So in order to get something stabler I modified the benchmark. What I did was really lame. Essentially I change the main driver (Pumper::pump) to have a switch statement where there were 6 case alternatives, one for each type and a default. Then I had each type provide an identifier that I could switch on. This made the code for the method much larger but it ensured that each call site would be monomorphic throughout the run. It's ugly but sure enough the results were pretty similar for each iteration and always faster throughput for each subtest compared to the original benchmark. Showing that the benchmark was not very accurate at measuring what it was intending to measure.

Well I wanted to see if I could get the noise level down further by running the subtests longer. So I ran each subtest for 12000 milliseconds. Another surprise. The final report showed throghput noticably slower than the same five runs with a shorter subtest interval. WTF? Back to the source code where I found a bug. The report code when it computes the final throughput number throws out the 1st result and the last result and then computes the average row per second. There isn't anything wrong with throwing out the samples but in computing the averages integer arithmetic was used and depending on the interval you pick truncation could inflate your results. That's just what happened with the short interval. I modified the code to use floats and then sanity returned. So I was almost happy with the results.

There was just one thing that bothered me. On the original web page there was the claim that the "Coarse" JNI subtest ran slower than the "FineGrained" JNI subtest. Even my results showed the same thing (with my JNI optimization)

        JavaRowConsumer 503992
        FineGrainedJNIRowConsumer       92880
        CoarseGrainedJNIRowConsumer     66331
        BytePackedJNIRowConsumer        57715
        SocketRowConsumer       73069

That is kinda weird. I know how expensive that Java->Native transition and it is hard to understand how a benchmark that presumably does less transitions would be slower than one that does more. So I read the code for the two subtests. The FineGrain benchmark is certainly simpler in that for each transfer it does the transtion and the C code stores the transmitted value. The "Coarse" benchmark on the otherhand transfers every item from one Java array to another one (essentially duplicating JavaRowConsumer) and then when it is complete calls to native. The native code then does JNI calls to extract the data across the barrier. In the case of the two dimensional byte array it does two JNI calls per row. Now those JNI calls are very similar in cost to Java->Native transitions. So in effect the "Coarse" benchmark does even more JNI calls than the "FineGrained" version. No wonder it suffers.

So just for the heck of it I wrote a new subtest, RealCoarseGrainedJNIRowConsumer. On each transfer of a row of the 2-d byte array I packed it into a Java 1-D array that matches what the C side would use. The like the original "Coarse" benchmark at the end of the loop it make a single call to JNI. Now however on the native side I do one JNI call to transfer all the byte data. Now the results look like:

        JavaRowConsumer 510581
        FineGrainedJNIRowConsumer       94501
        CoarseGrainedJNIRowConsumer     67844
        RealCoarseGrainedJNIRowConsumer 167878
        BytePackedJNIRowConsumer        57726
        SocketRowConsumer       73522


That makes a lot more sense. Now there is just one more thing to check. How do we compare to C? Note: in all of the benchmark results above the results are not in rows per seconds. They are number of rows moved in the timing period which is what each sub-run reports. In this section the results are in rows per second so they are comparable with the C results. So here is row per second for the Java version:

Throughput in rows per second (bigger is better)
JavaRowConsumer 426398
FineGrainedJNIRowConsumer       78782
CoarseGrainedJNIRowConsumer     56141
RealCoarseGrainedJNIRowConsumer 139440
BytePackedJNIRowConsumer        48019
SocketRowConsumer       59574

and here is the CSimulator result (-O4)

gretch(pts/52)[1466] CSimulator
Throughput in rows per second: 195000

How about that, JavaRowConsumer clobbered the C version! My RealCoarse version  came in a 70% of the speed of C and the FineGrain version came in at 40% of the speed. So while using JNI for this test is slower than C it isn't as bad as the benchmark originally predicted. Why does the Java only version outperform C? Well by the time we compiler Pumper::pump the profiling the interpreter has done enables us to predict that on one of the case alternative we always See RowProducer and JavaRowConsumer and so the compiler inlines the code and all the call overhead is removed. We're left with just the data movement. This is an advantage of Java. Because the compiler in the JVM is able to optimize the program based on the way it is actually running it can do better optimization than a static compiler can do. Java will take a performance hit during startup but if we run long enough (and long isn't really that long) we often can outperform C or C++ code in equivalent apps.

Now what about the JNI performance. Obviously this benchmarks shows that there is an overhead in calling from Java->native and that it more expensive than a C->C call. We'd like that to be more competitive (the optimization I'm working on saves ~15%) but is this benchmark realistic of real apps. Probably not. Look at how much the results changed when the Java call went from monomorphic to megamorphic. A huge change. This shows that the work being done on the native side was pratically nil. In real apps I'd hope that your typical JNI call does a little more work than a single store and return. When that is true the cost of the transition is not nearly the limiting factor that this benchmark makes it appear to be.

So I believe this shows another example of why microbenchmarks should always be viewed with distrust. It is very easy to be fooled unless you are very familiar with the underlying compiler and runtime system. I've fooled myself plenty and I'm very familiar with the tricks the compiler and JVM are capable of.

My favorite microbenchmark story comes from my days as a compiler developer at Encore Computer. In thise days (late '80s early '90s) ther was a benchmark from AIM Technology called Aim-2 and Aim-3. The Encore C compiler would remove huge amounts of work from the benchmark. I always loved it when the benchmark would report that our machine was capable of performing more than 100 floating point operations per cpu clock cycle! Too bad the customers never saw that kind of performance. Now in the example today instead of predicting more performance than you can expect it actually predicts the opposite. When it comes to benchmarks, especially microbenchmarks YMMV.

Now even though I think this benchmark is misleading I saved a copy of the modified benchmark here. This is so you can inspect the difference between my version and the original. Don't buy a computer language based on it. :-)


Aug 22 2006, 10:19:38 AM EDT Permalink

Comments:

Can you provide some additional information about this very interesting passage: "So I've been working on a new optimization for Hotspot that improves the speed we can call from Java (byte codes) to a Java Native method. There is unfortunately a lot more overhead on that path than we'd like and we take a fair amount of heat for it both internally and externally. At the moment the optimization is only available to the server compiler while I see how worthwhile the change is." BTW: I would love any improvements here! :-)

Posted by Morten Christensen on August 22, 2006 at 12:18 PM EDT #

Post a Comment:

Comments are closed for this entry.