David Dice's Weblog
Friday May 29, 2009
Instruction selection for volatile fences : MFENCE vs LOCK:ADD
In the past the JVM has used MFENCE, but because of latency issues on AMD processors and potential pipelining issues on modern Intel processors, it appears that a LOCK:ADD of 0 to the top of the stack is preferable (Gory Details).
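For readers wondering where this fence shows up in Java code, here is a minimal sketch of the classic store-then-load (Dekker-style) pattern. Each thread writes its own volatile flag and then reads the other's; the Java memory model forbids both threads reading 0, and on x86 that guarantee is what requires a full fence after the volatile store. The class, field, and method names are hypothetical, and the code does not show which instruction a particular JVM actually emits.

```java
// Illustrative sketch only: all names here are made up for this example.
final class VolatileFenceSketch {
    // Two volatile flags; the store-then-load on them needs StoreLoad ordering.
    static volatile int flagA = 0;
    static volatile int flagB = 0;

    // Runs one round of the handshake and reports whether both threads
    // read 0, an outcome that volatile semantics rule out.
    static boolean bothSawZero() {
        flagA = 0;
        flagB = 0;
        final int[] seen = new int[2];
        Thread t1 = new Thread(() -> { flagA = 1; seen[0] = flagB; });
        Thread t2 = new Thread(() -> { flagB = 1; seen[1] = flagA; });
        t1.start();
        t2.start();
        try {
            // join() makes each thread's write to seen[] visible here.
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0] == 0 && seen[1] == 0;
    }

    // Counts forbidden outcomes over several rounds.
    static int countViolations(int rounds) {
        int violations = 0;
        for (int i = 0; i < rounds; i++) {
            if (bothSawZero()) {
                violations++;
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        System.out.println("violations: " + countViolations(1000));
    }
}
```

If the two fields were plain (non-volatile) ints, the both-zero outcome would be permitted, since x86 allows a store to be reordered after a later load from a different location. The fence following the volatile store is what prevents that, whether it is implemented as MFENCE or as a LOCK:ADD of 0 to the top of the stack.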
Posted at 12:58PM May 29, 2009 by David Dice in General | Comments[6]
Hi,
But you still make the choice at run-time based on the exact CPU flavour underneath you, yes? It's not going to be always LOCK:ADD or always MFENCE across all x86/x64 CPUs for a given JVM revision, is it?
Rgds
Damon
Posted by Damon Hart-Davis on May 29, 2009 at 02:18 PM EDT #
Hello Damon -- correct, but it's fairly important to get the defaults correct so that the JVMs we implement today work well on the platforms of tomorrow. There are still platforms where MFENCE is the clear winner (older AMD processors, for instance). We'd very much like to avoid impairing performance on such systems. At least from what we can see today, however, LOCK:ADD looks like the instruction of choice for the near future. Regards, -Dave
Posted by David Dice on May 29, 2009 at 03:17 PM EDT #
It's worth something like 2.5% on SPECjbb2005 to use LOCK:ADD and 14% on Derby on SPECjvm2008. Oh and you're welcome ;)
Posted by Azeem Jiva on May 29, 2009 at 04:20 PM EDT #
Hi Azeem, that's on AMD, correct? I believe the difference was much less on modern Intel Core* processors. Regards, -Dave
Posted by David Dice on May 29, 2009 at 04:34 PM EDT #
Yeah that was on the latest quad core Opterons.
Posted by Azeem Jiva on May 29, 2009 at 05:39 PM EDT #
Thanks. I confess I was surprised when we encountered the MFENCE latency behavior on AMD. I had subsequent conversations with folks at AMD who confirmed our observations. Broadly, we expected that MFENCE/MEMBAR would have either the same latency as or better latency than LOCK:ADD/CAS instructions. While not always strictly true, it was a reasonable rule of thumb. Both instructions have bidirectional fence semantics, and the ADD or CMPXCHG is, in a sense, doing more work, leading me to suspect an implementation artifact rather than something fundamental. Interestingly -- and this is the point I was trying to make in the posting -- while on Nehalem we find MFENCE has good "simple" latency, it doesn't appear to pipeline as well as LOCK:ADD. Regards, -Dave
Posted by David Dice on May 29, 2009 at 05:58 PM EDT #