Tuesday August 22, 2006
Microbenchmark Adventure
So I've been working on a new optimization for Hotspot that improves
the speed we can call from Java (byte codes) to a Java Native method.
There is unfortunately a lot more overhead on that path than we'd like
and we take a fair amount of heat for it both internally and
externally. At the moment the optimization is only available to the
server compiler while I see how worthwhile the change is.
As part of investigating the performance of this I came this web page and the accompanying benchmark.
So I downloaded it to see what my optimization did for this test. I'm
always quite dubious of microbenchmarks because they often don't
measure what the author expects because the compiler or OS performs in
unexpected ways. Quite often the compiler is able to remove work that
could never be removed in a real app and so the benchmark predicts more
performance than you actually get. In this case the benchmark turns out
to be flawed, but in opposite direction, predicting less Java performance
that you might expect in a real app.
Well
the first thing I had to do was to fix the SunOS.jmk file since as
received it didn't really work on Solaris. So then I ran it and I could
see that I got a boost but that the results were kind of noisy. So the
benchmark has an option to run it multiple times ( -n <n> ) so I
ran it five times to try and average out the noise.
So I observed a very strange thing. Here is a snippet of the results (without my JNI optimization):
Throughput:
JavaRowConsumer 515583
FineGrainedJNIRowConsumer 51986
CoarseGrainedJNIRowConsumer 79021
BytePackedJNIRowConsumer 106115
SocketRowConsumer 66822
end throughput.
Throughput:
JavaRowConsumer 203963
FineGrainedJNIRowConsumer 32793
CoarseGrainedJNIRowConsumer 72642
BytePackedJNIRowConsumer 117230
SocketRowConsumer 72205
end throughput.
...
So that's kind of strange. Why did JavaRowConsumer and
FineGrainedJNIRowConsumer slow down so dramatically on the second
iteration? I wondered if maybe something was broken in the benchmark so
that the SocketRowConsumer was still doing work and competing against
the next iteration. So I reversed the order of the controlling loop.
Then I got:
Throughput:
SocketRowConsumer 86094
BytePackedJNIRowConsumer 151618
CoarseGrainedJNIRowConsumer 77717
FineGrainedJNIRowConsumer 31748
JavaRowConsumer 195528
end throughput.
Throughput:
SocketRowConsumer 73018
BytePackedJNIRowConsumer 116724
CoarseGrainedJNIRowConsumer 73709
FineGrainedJNIRowConsumer 32610
JavaRowConsumer 198806
So now JavaRowConsumer and FineGrainedJNIRowConsumer are consistent but
SocketRowConsumer and BytePackedJNIRowConsumer seem to have the
disease. So then I really read the code to the simulator. Then the
light went on. In order to make the Simulator simple there is an
Interface for a RowConsumer and there are five implementors of that
interface. The driver for producer/consumer is passed different objects
depending on the subtest that is happening.
So what happens is this. While the benchmark is heating up all the
calls for the interfaces in RowConsumer are for a single type
(whichever is first in the output above). So when we finally compile
the code the call site uses that prediction and compiles for that
single type (it can even inline it). Then the usage pattern changes and
we go from monomorphic to bi-morphic and have to recompile. At the
point I suspect we're prepared or a bimorphic call site. Later still
that gets proven incorrect and we go mega-morphic and have to do a
straight v-table call. At that point the results stabilize at a much
lower value.
What's interesting is how dramatically the results change. The
microbenchmark does so little work in the calls that how we do the call
dramatically changes the results. This is probably not realistic. In
fact the comparison to C is not particularly fair since the C code
doesn't have to virtual dispatch since it isn't object oriented and
there is in essence only a single implementor. So the results on the
web page would seem to be skewed by that. To be fairer the C code
should be C++ and have multiple implementors. But there is still more.
So in order to get something stabler I modified the benchmark. What I
did was really lame. Essentially I change the main driver
(Pumper::pump) to have a switch statement where there were 6 case
alternatives, one for each type and a default. Then I had each type
provide an identifier that I could switch on. This made the code for
the method much larger but it ensured that each call site would be
monomorphic throughout the run. It's ugly but sure enough the results
were pretty similar for each iteration and always faster throughput for
each subtest compared to the original benchmark. Showing that the
benchmark was not very accurate at measuring what it was intending to
measure.
Well I wanted to see if I could get the noise level down further by
running the subtests longer. So I ran each subtest for 12000
milliseconds. Another surprise. The final report showed throghput
noticably slower than the same five runs with a shorter subtest
interval. WTF? Back to the source code where I found a bug. The report
code when it computes the final throughput number throws out the 1st
result and the last result and then computes the average row per
second. There isn't anything wrong with throwing out the samples but in
computing the averages integer arithmetic was used and depending on the
interval you pick truncation could inflate your results. That's just
what happened with the short interval. I modified the code to use
floats and then sanity returned. So I was almost happy with the
results.
There was just one thing that bothered me. On the original web page
there was the claim that the "Coarse" JNI subtest ran slower than the
"FineGrained" JNI subtest. Even my results showed the same thing (with
my JNI optimization)
JavaRowConsumer 503992
FineGrainedJNIRowConsumer 92880
CoarseGrainedJNIRowConsumer 66331
BytePackedJNIRowConsumer 57715
SocketRowConsumer 73069
That is kinda weird. I know how expensive that Java->Native
transition and it is hard to understand how a benchmark that presumably
does less transitions would be slower than one that does more. So I
read the code for the two subtests. The FineGrain benchmark is
certainly simpler in that for each transfer it does the transtion and
the C code stores the transmitted value. The "Coarse" benchmark on the
otherhand transfers every item from one Java array to another one
(essentially duplicating JavaRowConsumer) and then when it is complete
calls to native. The native code then does JNI calls to extract the
data across the barrier. In the case of the two dimensional byte array
it does two JNI calls per row. Now those JNI calls are very similar in
cost to Java->Native transitions. So in effect the "Coarse"
benchmark does even more JNI calls than the "FineGrained" version. No
wonder it suffers.
So just for the heck of it I wrote a new subtest,
RealCoarseGrainedJNIRowConsumer. On each transfer of a row of the 2-d
byte array I packed it into a Java 1-D array that matches what the C
side would use. The like the original "Coarse" benchmark at the end of
the loop it make a single call to JNI. Now however on the native side I
do one JNI call to transfer all the byte data. Now the results look
like:
JavaRowConsumer 510581
FineGrainedJNIRowConsumer 94501
CoarseGrainedJNIRowConsumer 67844
RealCoarseGrainedJNIRowConsumer 167878
BytePackedJNIRowConsumer 57726
SocketRowConsumer 73522
That makes a lot more sense. Now there is just one more thing to check. How do we compare to C? Note:
in all of the benchmark results above the results are not in rows per
seconds. They are number of rows moved in the timing period which is
what each sub-run reports. In this section the results are in
rows per second so they are comparable with the C results. So here is
row per second for the Java version:
Throughput in rows per second (bigger is better)
JavaRowConsumer 426398
FineGrainedJNIRowConsumer 78782
CoarseGrainedJNIRowConsumer 56141
RealCoarseGrainedJNIRowConsumer 139440
BytePackedJNIRowConsumer 48019
SocketRowConsumer 59574
and here is the CSimulator result (-O4)
gretch(pts/52)[1466] CSimulator
Throughput in rows per second: 195000
How about that, JavaRowConsumer clobbered the C version! My
RealCoarse
version came in a 70% of the speed of C and the FineGrain version
came in at 40% of the speed. So while using JNI for this test is slower
than C it isn't as bad as the benchmark originally predicted. Why does
the Java only version outperform C? Well by the time we compiler
Pumper::pump the profiling the interpreter has done enables us to
predict that on one of the case alternative we always See RowProducer
and JavaRowConsumer and so the compiler inlines the code and all the
call overhead is removed. We're left with just the data movement. This
is an advantage of Java. Because the compiler in the JVM is able to
optimize the program based on the way it is actually running it can do
better optimization than a static compiler can do. Java will take a
performance hit during startup but if we run long enough (and long
isn't really that long) we often can outperform C or C++ code in
equivalent apps.
Now what about the JNI performance. Obviously this benchmarks shows
that there is an overhead in calling from Java->native and that it
more expensive than a C->C call. We'd like that to be more
competitive (the optimization I'm working on saves ~15%) but is this
benchmark realistic of real apps. Probably not. Look at how much the
results changed when the Java call went from monomorphic to
megamorphic. A huge change. This shows that the work being done on the
native side was pratically nil. In real apps I'd hope that your typical
JNI call does a little more work than a single store and return. When
that is true the cost of the transition is not nearly the limiting
factor that this benchmark makes it appear to be.
So I believe this shows another example of why microbenchmarks should
always be viewed with distrust. It is very easy to be fooled unless you
are very familiar with the underlying compiler and runtime system. I've
fooled myself plenty and I'm very familiar with the tricks the compiler
and JVM are capable of.
My favorite microbenchmark story comes from my days as a compiler
developer at Encore Computer. In thise days (late '80s early '90s) ther
was a benchmark from AIM Technology called Aim-2 and Aim-3. The Encore
C compiler would remove huge amounts of work from the benchmark. I
always loved it when the benchmark would report that our machine was
capable of performing more than 100 floating point operations per cpu
clock cycle! Too bad the customers never saw that kind of performance.
Now in the example today instead of predicting more performance than
you can expect it actually predicts the opposite. When it comes to
benchmarks, especially microbenchmarks YMMV.
Now even though I think this benchmark is misleading I saved a copy of
the modified benchmark here. This is so you can inspect the difference
between my version and the original. Don't buy a computer language
based on it. :-)
Aug 22 2006, 10:19:38 AM EDT
Permalink
Posted by Morten Christensen on August 22, 2006 at 12:18 PM EDT #
Comments are closed for this entry.