Over the years the JVM has used both MFENCE and LOCK:ADD of zero to the top-of-stack as "fence" primitives on x86 platforms. Typically, the JIT emits these after stores to volatile fields. (We emit "MEMBAR #storeload" on SPARC.) Unfortunately we've had to change back and forth between MFENCE and LOCK:ed instructions over the years as the relative latency shifts with evolving platforms. On the very latest AMD processors, for instance, we've found that MFENCE has nearly twice the latency of the LOCK:ed instructions. (As an aside, it seems odd that MFENCE would ever be *worse* than a LOCK:ed instruction. This appears to be an implementation artifact and not fundamental.)

For the purposes of this discussion I'm counting XCHG as a LOCK:ed instruction. XCHG has bidirectional fence semantics, just as the LOCK:ed read-modify-write instructions do, and can thus serve as a volatile fence. While we don't do so currently, it may be profitable, depending on register pressure, to replace a store;fence sequence with a simple XCHG.

Briefly, javac and the JIT compiler control compile-time reordering of accesses, while the platform fence instructions constrain architectural reordering. Refer to Doug Lea's Java Memory Model cookbook for more details: http://g.oswego.edu/dl/jmm/cookbook.html. Suffice it to say that the JVM is responsible for reconciling the platform memory model with the official Java Memory Model (JMM).

I decided to take a quick look at the performance of MFENCE vs LOCK:ed instructions on a Core-i7 "Nehalem". The results were interesting: MFENCE appears to have about the same simple latency as a LOCK:ed instruction, but it doesn't pipeline as well when mixed with normal code.

First, I'll offer apologies in advance, as the nomenclature below is rather cryptic. I set up a simple single-threaded 64-bit C experiment that times loops. I've made sure the compiler can't unroll the loops and can't optimize at all within the loop body.
We run 100K iterations and report the duration in msecs. Two types of computation can appear in a loop body: "B" is a non-atomic computation and "A" is either an MFENCE or a LOCK:ADD [TopOfStack],0 instruction. Note that the dummy LOCK:ADD instruction has bidirectional fence semantics and can be used as a drop-in replacement for MFENCE (with the exception that it kills the integer condition codes). Instead of a LOCK:ADD we could also use a CAS (LOCK:CMPXCHG). Both the LOCK:ADD and LOCK:CMPXCHG forms yield similar results, so I'll only report results from the LOCK:ADD form.

The following experiments vary A between MFENCE and LOCK:ADD and vary B between the forms described below. The "RNG" operator is a simple Marsaglia Shift-XOR pseudo-random number generator. It doesn't access any memory locations. A[] is an array of 32-bit integers.

B forms:

* RNG : Simply generates a PRNG value; requires about 7 ALU instructions and no loads or stores.
* RNG; A[0]++ : Same as the above, but adds a non-atomic read-modify-write increment of a known and fixed location.
* A[RNG & 1F]++ : Generates a PRNG value and then uses it as an index into the array, which it then increments. The address can't easily be precomputed or predicted, but the access will hit in the L1 data cache (D$).
* A[RNG & FFFF]++ : Same as the above, but the index span is wider, although still within the D$.
* A[RNG & FFFFF]++ : Same as the above, but the index span is larger than the D$, so the execution will incur D$ misses.

Loop forms:

* [8B]+[4A] indicates that we have two distinct loops, where the 1st loop body has 8 "B" operations and the 2nd loop body has 4 "A" operations.
* [8B;4A] is a single loop that executes 8 B operations in order and then 4 A operations.
* [4(BBA)] is a single loop that executes BBABBABBABBA.

So [8B]+[4A], [8B;4A] and [4(BBA)] all perform the same amount of work and vary only in the order of execution.
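
As a rough illustration of the three loop forms just described, here is a minimal C sketch. It is not the harness used for the measurements: every name in it is hypothetical, the (13, 17, 5) shift triple for the shift-xor PRNG is an assumption, and the "A" operation is stood in for by a C11 sequentially consistent fence rather than an explicit MFENCE or LOCK:ADD. It also makes no attempt to stop the compiler from unrolling or optimizing the loop bodies, which a real measurement would need.

```c
#include <stdatomic.h>
#include <stdint.h>

/* All names here are hypothetical; this is a sketch, not the original harness. */

static uint32_t rng_state = 1;   /* any nonzero seed works for a shift-xor PRNG */
static uint32_t arr[1u << 20];   /* stands in for A[]; big enough for mask 0xFFFFF */

/* Marsaglia shift-xor step. The (13, 17, 5) shift triple is an assumption. */
static inline uint32_t rng(void) {
    uint32_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    rng_state = x;
    return x;
}

/* One "B" operation of the form A[RNG & mask]++ (plain, non-atomic increment). */
static inline void op_b(uint32_t mask) {
    arr[rng() & mask]++;
}

/* One "A" operation. Stand-in only: a C11 sequentially consistent fence,
   where the experiment used an explicit MFENCE or LOCK:ADD. */
static inline void op_a(void) {
    atomic_thread_fence(memory_order_seq_cst);
}

/* [8B]+[4A]: one loop of 8 B operations, then a second loop of 4 A operations. */
static void run_split(long iters, uint32_t mask) {
    for (long i = 0; i < iters; i++) {
        op_b(mask); op_b(mask); op_b(mask); op_b(mask);
        op_b(mask); op_b(mask); op_b(mask); op_b(mask);
    }
    for (long i = 0; i < iters; i++) {
        op_a(); op_a(); op_a(); op_a();
    }
}

/* [8B;4A]: a single loop running 8 B operations and then 4 A operations. */
static void run_sequential(long iters, uint32_t mask) {
    for (long i = 0; i < iters; i++) {
        op_b(mask); op_b(mask); op_b(mask); op_b(mask);
        op_b(mask); op_b(mask); op_b(mask); op_b(mask);
        op_a(); op_a(); op_a(); op_a();
    }
}

/* [4(BBA)]: a single loop running BBA four times, i.e. BBABBABBABBA. */
static void run_interleaved(long iters, uint32_t mask) {
    for (long i = 0; i < iters; i++) {
        op_b(mask); op_b(mask); op_a();
        op_b(mask); op_b(mask); op_a();
        op_b(mask); op_b(mask); op_a();
        op_b(mask); op_b(mask); op_a();
    }
}
```

Each function performs 8 B operations and 4 A operations per iteration, so calling, say, run_interleaved(100000, 0x1F) does the same work as run_split(100000, 0x1F) and differs only in how finely the two kinds of operation are mixed.
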
The results will tell us if atomics or MFENCE "commute" in performance. [8B]+[4A] has very coarse mixing of the B and A operations, while [8B;4A] and [4(BBA)] have progressively finer mixing of B and A operations.

All times are in msecs.

| Line | A        | B               | [8B]+[4A]      | [8B;4A] | [4(BBA)] |
|------|----------|-----------------|----------------|---------|----------|
| 1    | MFENCE   | RNG             | 2635+3259=5893 | 5804    | 5781     |
| 2    | MFENCE   | RNG;A[0]++      | 2865+3312=6048 | 6204    | 7024     |
| 3    | MFENCE   | A[RNG & 1F]++   | 2930+3297=6306 | 6643    | 8028     |
| 4    | MFENCE   | A[RNG & FFFF]++ | 2870+3291=6298 | 7284    | 9590     |
| 5    | MFENCE   | A[RNG & FFFFF]++| 4201+3255=7569 | 9616    | 15774    |
| 6    | LOCK:ADD | RNG             | 2704+2845=5548 | 3552    | 2984     |
| 7    | LOCK:ADD | RNG;A[0]++      | 2812+2791=5599 | 3859    | 3892     |
| 8    | LOCK:ADD | A[RNG & 1F]++   | 2930+2770=5700 | 4108    | 4003     |
| 9    | LOCK:ADD | A[RNG & FFFF]++ | 2944+2847=5741 | 4008    | 4091     |
| 10   | LOCK:ADD | A[RNG & FFFFF]++| 4213+2791=7003 | 5781    | 4949     |

Comments:

* Loop overhead is tiny, about 80 msecs, so we can ignore it.
* LOCK:ADD is slightly faster than MFENCE even absent any (presumed) pipeline issues.
* It's interesting to compare line 1 to line 6. In line 1 the three results are almost identical (5893 v 5804 v 5781), so the execution time is the same regardless of order. In line 6, however, as we have "finer" mixing or interspersing of the B and A operations, we see that performance *improves*, suggesting pipeline or out-of-order effects where the execution of the A and B operations overlaps in time.
* With LOCK:ADD, in no case is [4(BBA)] or [8B;4A] ever worse than the [8B]+[4A] form. For MFENCE the opposite holds on lines 2 through 5.
* Missing the D$ seems to incur some multiplier effect with MFENCE. That is, the outcome is worse than the "sum" of a D$ miss and an MFENCE. See line 5. LOCK:ADD appears immune to this effect.
* At the risk of overgeneralizing, an instruction can be considered to have a simple latency as well as a throughput or pipelining quotient, which determines how much the execution of that instruction impedes the execution of instructions nearby in execution order.
* Given all that, it appears LOCK:ADD to the top-of-stack is preferable as a fence operator on Nehalem.
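
For readers who want to see the fence forms discussed here side by side, the following is a hedged C sketch using GCC/Clang-style inline assembly on x86-64. The function names are hypothetical, and this is not the code the JIT emits; it only shows a plain store followed by MFENCE, the same store followed by a LOCK:ADD of zero to the top-of-stack, and a store folded into XCHG. On other compilers or architectures the sketch falls back to a C11 sequentially consistent fence, which is a stand-in rather than an equivalent instruction sequence.

```c
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)

/* MFENCE as a standalone fence. */
static inline void fence_mfence(void) {
    __asm__ __volatile__("mfence" ::: "memory");
}

/* LOCK:ADD of zero to the top-of-stack. The stored value is unchanged,
   but the instruction clobbers the integer condition codes ("cc"). */
static inline void fence_lock_add(void) {
    __asm__ __volatile__("lock; addl $0, (%%rsp)" ::: "memory", "cc");
}

/* Store via XCHG. XCHG with a memory operand is implicitly LOCK:ed,
   so the store and the fence are a single instruction. */
static inline void store_via_xchg(volatile int32_t *p, int32_t v) {
    __asm__ __volatile__("xchgl %0, %1" : "+r"(v), "+m"(*p) : : "memory");
}

#else  /* Fallback stand-ins for non-x86-64 or non-GNU toolchains. */

#include <stdatomic.h>

static inline void fence_mfence(void) {
    atomic_thread_fence(memory_order_seq_cst);
}

static inline void fence_lock_add(void) {
    atomic_thread_fence(memory_order_seq_cst);
}

static inline void store_via_xchg(volatile int32_t *p, int32_t v) {
    *p = v;
    atomic_thread_fence(memory_order_seq_cst);
}

#endif

/* store;fence using MFENCE. */
static inline void store_then_mfence(volatile int32_t *p, int32_t v) {
    *p = v;
    fence_mfence();
}

/* store;fence using LOCK:ADD to the top-of-stack. */
static inline void store_then_lock_add(volatile int32_t *p, int32_t v) {
    *p = v;
    fence_lock_add();
}
```

All three store variants leave the same value in memory; they differ only in which instruction supplies the fence, which is exactly the variable the timings above explore.
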
Dave Dice