David Dice's Weblog
Musing about Lock-free and wait-free algorithms for real-time environments
Intuitively, programmers tend to think that lock-free algorithms are a good match for real-time environments. On a multiprocessor system, however, a low-priority thread might repeatedly perform a short lock-free update that causes a higher priority thread trying a longer update to fail and retry repeatedly. In theory this could lead to indefinite postponement and lack of progress for the higher priority thread. (Because the operation attempted by the higher priority thread is longer, it's also more vulnerable to interference). Wait-free algorithms provide a solution, of course. Wait-freedom guarantees that each individual thread will make progress after a fixed amount of (virtual) processing time. Unfortunately we don't know of as many wait-free algorithms as we do lock-free. But if you care about real-time, RTJ (real-time Java) provides locks with all the semantics you'd want.
The same argument, by the way, will apply to many hardware and software transactional memory implementations. A lower priority thread may be able to cause a transaction in a higher priority thread to repeatedly abort.
Posted at 06:11PM Sep 01, 2006 by David Dice in General | Comments[0]
java.util.concurrent ReentrantLock vs synchronized() - which should you use?
In J2SE 6 there's little difference - either should suffice in most circumstances. ReentrantLock might be more convenient if you need to implement a hand-over-hand locking protocol. A classic example of hand-over-hand locking would be a thread that traverses a linked list, locking the next node and then unlocking the current node. That's hard to express with Java's lexically balanced synchronized construct. Synchronized, however, is a mature and first-class language feature.
With respect to performance, both mechanisms are up to the task. (It's worth noting that the synchronization primitives in modern JVMs provide latency and throughput performance that is typically better than that of the native pthreads_mutex constructs). The builtin synchronized construct currently has a few advantages, such as lock coarsening, biased locking (see below) and the potential for lock elision via escape analysis. Those optimizations aren't currently implemented for ReentrantLock and friends. There's no fundamental reason, however, why a JVM couldn't eventually apply such optimizations to ReentrantLock.
The Synchronized implementation also provides adaptive spinning, whereas ReentantLock currently does not. Adaptive spinning employs a two-phase spin-then-block strategy. Briefly, on multiprocessor systems a contended synchronized enter attempt will spin briefly before blocking in order to avoid context switching. Context switching is wasted work -- it doesn't contribute toward forward progress of the application. Worse, it causes TLBs and caches to be repopulated when the blocked thread eventually resumes. (This is the so-called "cache reload transient"). The spin duration varies as a function of the success/failure ratio of recent spin attempts on that same monitor, so the mechanism adapts automatically to parallelism, current system load, application modality, critical section length, etc. In addition, we avoid spinning for a lock where the current lock owner is itself blocked and unlikely to release the lock in a timely fashion. On solaris our checks can be more refined, determining if the target thread is ONPROC (running), for instance, via the contract private thr_schedctl interface. And it should go without saying that we spin "politely", using a backoff to avoid generating excessive and wasteful traffic on the coherency bus, as well as using PAUSE on IA32 and AMD64 platforms. We'll likely add spinning support to ReentrantLock in a future release.
If you're curious about the inner workings or ReentrantLock, see Doug Lea's The java.util.concurrent Synchronizer Framework. If you're curious about adaptive spinning see synchronizer.cpp in the J2SE 6 source kit.
Finally, ReentrantLock and synchronized are equivalent with respect to the clarified Java Memory Model (JMM, JSR133).
Posted at 05:58PM Aug 31, 2006 by David Dice in General | Comments[12]
Java Thread Priorities demystified
The mapping between Java thread priorities and native thread priorities in J2SE has a long and tortured history. If you're interested, I'd recommend reading the following, which I wrote as part of the release notes for J2SE 5.0. That description applies to 6.0 as well.
In the future I expect to add support on Solaris for mapping thread priorities in the RT and FX scheduling classes as well as for the IA and TS classes. I'm also considering adding flags that would (a) allow low priority Java threads to be placed in the Solaris FX class or the Linux SCHED_BATCH/SCHED_IDLE class, and, (b), where the process has appropriate permissions, to force threads running at Java priorities above a certain threshold into the the RT scheduling class. Alternately, a native interface encapsulated in a .so or .dll that allowed "pluggable" priority mapping policies would be more general and perhaps of more use. Drop me a comment if you feel strongly either way.
Finally, it's worth noting that thread priorities are not currently used in notify(), notifyAll(), or when picking a potential successor to wake up in monitorexit. That's likely to change, of course.
As an aside, under Solaris, threads with higher effective priorities will have shorter quanta, at least in the IA or TS scheduling classes. (Read the man pages for dispadmin and priocntl for details). If your program is compute-bound, you may be able to reduce the voluntary preemption rate by forcing the program to run at lower effective priority, yielding long slices, which in turn results in less preemption, and amortizing the penalty associated with the cache reload transient. You might also consider the FX scheduling class. Beware, of course, that if your process is competing with other, higher priority processes, it'll be at a disadvantage and receive fewer relative cycles from the scheduler. But if the system is otherwise idle, FX or lower priorities can be of benefit.
Posted at 05:31PM Aug 27, 2006 by David Dice in General | Comments[1]
Software Transactional Memory -- Transactional Locking -- TL and TL2
If you're interested in STMs, you might like the following papers. The original TL paper was in Transact'06 (part of PLDI). The TL2 paper will appear in DISC'06. TL was joint work with Nir Shavit. TL2 was joint work with Nir Shavit and Ori Shalev.
Posted at 08:28AM Aug 19, 2006 by David Dice in General | Comments[1]
Lets say you're interested in using HotSpot as a vehicle for synchronization research ...
What follows is a brief overview of HotSpot's synchronization subsystem.
First, we'll assume you're starting with the 1.6 sources. That's a good idea as I overhauled the bulk of the synchronization subsystem between 1.5 and 1.6. As a caveat, I should also point out that the HotSpot C++ sources contain Monitor and Mutex classes. Monitors and Mutexes are used for internal native synchronization only - there's no relationship between Monitors and Mutexes and Java-level monitors. Java-level synchronization uses an entirely different subsystem.
I'll try to give a brief overview of how java-level synchronization works in hotspot. The following description relates to 1.6, but the general concepts still apply to 1.5 and 1.4. First, most synchronization is handled through what we call "fast-path" code. We have two just-in-time compilers (JITs) and an interpreter, all of which will emit fast-path code. The two JITs are "C1", which is the -client compiler, and "C2", which is the -server compiler. I assume the reader is more interested in C2 as it's higher performance. C1 and C2 both emit fast-path code directly at the synchronization site. In the normal case when there's no contention, the synchronization operation will be completed entirely in the fast-path. If, however, we need to block or wake a thread (in monitorenter or monitorexit, respectively), the fast-path code will call into the slow-path. The slow-path implementation is in native C++ code while the fast-path is emitted by the JITs.
The fast-path is emitted by cpu/i486/vm/i486.ad "FastLock" and cpu/sparc/vm/assembler_sparc compiler_lock_object() for IA32 and SPARC, respectively. The slow-path is implemented in the native C++ objectSynchronizer and objectMonitor classes. See synchronizer.cpp. In 1.6 the synchronization system is mostly platform-independent (synchronizer.cpp), with a very narrow platform-specific park()-unpark() abstraction used to block and wake threads. The platform-independent code also relies on compare_and_swap and various platform-specific memory barrier primitives. The park-unpark operators are implemented in the ParkEvent native class in os_solaris.cpp, os_linux.cpp and os_win32.cpp.
Per-object synchronization state is encoded as follows. Each object instance has a two-word header followed by the constituent non-static member fields. The header consists of a so-called mark word followed by a class pointer. The class pointer is the same for all instances of a type and points to a type-specific structure containing class static fields, vtable, runtime type information (RTTI), etc., The mark work is multiplexed, containing the identity hashcode value (if assigned), garbage collector "age" bits, and synchronization information. See markOop.hpp for details.
The synchronization state is encoded in the low-order bits of the mark:
- Neutral: Unlocked
- Biased: Locked/Unlocked + Unshared
- Stack-Locked: Locked + Shared but uncontended
- Inflated: Locked/Unlocked + Shared and contended
Tantamount to deferring unlock until contention
Avoids CAS atomic latency in common case
2nd thread must revoke bias from 1st
The mark points to displaced mark word on the owner thread's stack.
(The actual type is called a BasicLock). Stack-locking is clever in that
he address in the mark conveys ownership, and the target of the address
provides for access to the displaced mark work.
We only use stack-locking when the analysis phase of the JIT
can prove that lock operations are balanced. Note that
javac only emits balanced lock operations, so it's extremely
rare to encounter unbalanced locking. If we find that
a method has unbalanced locking we run it in the interpreter
in perpetuity.
Threads are blocked in monitorenter or wait().
The mark points to heavy-weight "objectmonitor" structure.
Note that HotSpot inflates on-demand, only as necessary.
Deflation occurs at stop-the-world garbage collection time.
Conceptually, the objectMonitor serves as an optional extension
to the object. An objectMonitor is associated with an object on-demand.
Note that the biased state is new in 1.6 and some 1.5 update releases. It serves solely to avoid local CAS latency (100s-1000s of instructions on some processors, particularly some Intel CPUs). In 1.6 we also have lock coarsening (again to deal with CAS latency) as well as lock elision through rather simplistic escape analysis. If we can prove an object is thread or method local we can elide the synchronization operations on the object, as well possibly allocating the object on-stack instead of in the heap. Coarsening reduces (amortizes) CAS latency, but at the possible risk of increasing critical section length and decreasing potential parallelism. See Amdahl's law.
For synchronization, all the mark word encodings in HotSpot are pointer-based. The low order bits of the mark serve as a tag, and the upper bits constitute a pointer to either a BasicLock (for stack-locked objects) or a objectMonitor (for inflated objects). Some sources says that HotSpot uses thin-locks, but that's not accurate and has never been the case. Thin-locks were devised by David Bacon at IBM research. Briefly, a thin lock is value-based and encodes an owner thread-id, recursion count, etc., into a single object header word. Thin locks are typically revert to inflation when the object is contended.
In the text above I mentioned balanced locking. Balanced locking deserves additional explanation as it enables certain optimizations. Balanced locking simply means that the JIT can prove that enter and exit operations in the bytecode are applied in a stack-like manner. For example, {enter A; enter B; ....; exit B; exit A} is properly balanced, whereas {enter A; enter B; ... exit A; exit B;} is not. If the JIT can prove balance, it will be able to associate each enter and exit bytecode with a so-called "BasicLock". The JIT allocates BasicLocks in the frame containing the synchronization site. Javac will never generate unbalanced locking operations so it's rare. Note too, that javac will always wrap a synchronized region in an implicit try-finally block, just to make sure that control doesn't escape the block without releasing the lock. Interestingly, the verifier -- which is really just an abstract interpreter -- doesn't require balanced locking. If we can't prove balance then the code in question will run in the interpreter forever; we'll never bother compiling it. The synchronization code in the interpreter can tolerate unbalanced lock operations. Once the JIT has established that all synchronization operations in a method are balanced, that bytecode is eligible for stack-locking and biased locking.
I'll now describe the operation of stack-locking. The inlined enter code emitted by the JIT will first fetch & examine the mark word. We call this step mark word "triage". (That code first checks to see if the object is biased, but we'll ignore biased locking for a moment and remain focused on stack-locking). If the mark word is neutral as determined by the low-order bits of the mark, we'll try to use stack locking. First, in anticipation of a successful compare-and-swap (CAS), the code will store the just-fetched mark value into the on-frame word that was allocated by the JIT and is associated with that particular bytecode offset. Next, the inline enter code will attempt to CAS the address of that on-frame location over the mark word. If successful, we've locked the object. The mark word now points to a word on the stack and the original mark word value (which may contain a hashcode value) is said to be displaced. In a sense the mark word identifies the current owner (simply by virtue of pointing into the owner's stack) as well as providing access to the original displaced mark value. The inlined exit path will reverse the operation, using CAS to try to restore the displaced mark value from the on-frame location back into the actual mark word in the object.
The JIT also emits fast-path operations for the inflated case, where the object's mark work points to a native objectMonitor structure. The objectMonitor is, in a sense, and optional extension to an object, containing synchronization metadata. Briefly, we need to inflate (associate or bind an object to an otherwise unused objectmonitor) whenever we need to maintain lists of threads that are blocked on entry or are a wait()ing on the object. Inflation occurs on-demand. Deflation occurs at all stop-the-world safepoints, where the JVM will attempt to scavenge and deflate otherwise idle objectMonitors. The objectMonitor has an "owner" field. The object is locked if and only if the owner field is non-null. The fast-path enter code uses CAS to try to swing the owner field from null to non-null.
The fast path code emitted by the JITs currently provides for biased locking (where an enter-exit pair requires 0 atomics), stack-locking (where an enter-exit pair requires 2 atomic CAS operations) and inflated operations where there's no need to block or wake threads (which requires 1 atomic for an enter-exit pair). Rather oddly, the inflated fast-path is currently faster than the stack-locking fast-path. The stack-locked encoding is also redundant with the inflated+locked encoding. Given that, I'd like to eventually eliminate stack locks and switch to single lock encoding such as that found in relaxed locks or ExactVM's Meta-Lock. Relaxed locks, for instance, inflates at lock-time and deflates aggressively at unlock-time, removing the need to scavenge.
Obviously, this was a rather abridged description of hotspot's synchronization subsystem. It's rather complicated, but it's also by far the fastest (in 1.6) of all known JVMs, as well as being considerably better (lower latency and better scalability) than native pthreads synchronization implementations.
For additional information the following presentations might be of help: (a) Synchronization in Java SE 6 and (b) a rather generic and long presentation on Java synchronization.
Assuming you're interested in transactional memory, the most basic approach in a JVM would be to start a transaction (usually abridged "txn") in monitorenter and try to commit in monitorexit. And of course you'll need to fetch and examine the mark, making it part of the txn's read-set in order to monitor that the object is unlocked (in the traditional mutual exclusion sense) at the start of the txn and that the object remains unlocked over the course of the txn. (We call this "commuting" a synchronized block to a hardware txn). And of course you need to deal with interesting issues such as aborts, infeasible txns (too long, traps, wait()), etc., in which case you'd probably want to revert to normal mutual exclusion locking. Given that, I suspect that the inlined fast-path code would be of the most interest.
Posted at 11:43PM Aug 18, 2006 by David Dice in General | Comments[6]
Biased Locking in HotSpot
The biased locking scheme in HotSpot arose from the following paper authored by myself, Mark Moir, and Bill Scherer. Briefly, the compare-and-swap (CAS) operations normally used to acquire a Java monitor incurs considerable local latency in the executing CPU. Since most objects are locked by at most one thread during their lifetime, we allow that thread to bias an object toward itself. Once biased, that thread can subsequently lock and unlock the object without resorting to expensive atomic instructions. Obviously, an object can be biased toward at most one thread at any given time. (We refer to that thread as the bias holding thread). If another thread tries to acquire a biased object, however, we need to revoke the bias from the original thread. (At this juncture we can either rebias the object or simply revert to normal locking for the remainder of the object's lifetime). The key challenge in revocation is to coordinate the revoker and the revokee (the bias holding thread) -- we must ensure that the revokee doesn't lock or unlock the object during revocation. As noted in the paper, revocation can be implemented in various ways - signals, suspension, and safepoints to name a few. Critically, for biased locking to prove profitable, the benefit conferred from avoiding the atomic instructions must exceed the revocation costs. Another equivalent way to conceptualize biased locking is that the original owner simply defers unlocking the object until it encounters contention. Biased locking bears some semblance to the concept of lock reservation, which is well-described in Kawachiya's dissertation. It's also similar in spirit to "opportunistic locks" (oplocks) found in various file systems; the scheme describe in Mike Burrows's "How to implement unnecessary mutexes" (In Computer Systems: Theory, Technology and Applications, Dec. 2003); or Christian Siebert's one-sided mutual exclusion primitives.
Biased locking is strictly a response to the latency of CAS. It's important to note that CAS incurs local latency, but does not impact scalability on modern processors. A common myth is that each CAS operation "goes on the bus", and, given that the interconnect is a fixed a contended resource, use of CAS can impair scalability. That's false. CAS can be accomplished locally -- that is, with no bus transactions -- if the line is already in M-state. CAS is usually implemented on top of the existing MESI snoop-based cache coherence protocol, but in terms of the bus, CAS is no different than a store. For example lets say you have a true 16-way system. You launch a thread that CASes 1 billion times to a thread-private location, measuring the elapsed time. If you then launch 16 threads, all CASing to thread-private locations, the elapsed time will be the same. The threads don't interfere with or impede each other in any way. If you launch 16 threads all CASing to the same location you'll typically see a massive slow-down because of interconnect traffic. (The sole exception to that claim is Sun's Niagara, which can gracefully tolerate sharing on a massive scale as the L2$ serves as the interconnect). If you then change that CAS to a normal store you'll also see a similar slow-down; as noted before, in terms of coherency bus traffic, CAS isn't appreciably different than a normal store. Some of the misinformation regarding CAS probably arises from the original implementation of lock:cmpxchg (CAS) on Intel processors. The lock: prefix caused the LOCK# signal to be asserted, acquiring exclusive access to the bus. This didn't scale of course. Subsequent implementations of lock:cmpxchg leverage cache coherency protocol -- typically snoop-based MESI -- and don't assert LOCK#. Note that lock:cmpxchg will still drive LOCK# in one extremely exotic case -- when the memory address is misaligned and spans 2 cache lines. Finally, we can safely use cmpxchg on uniprocessors but must use lock:cmpxchg on multiprocessor systems. Lock:cmpxchg incurs more latency, but then again it's a fundamentally different instruction that cmpxchg. Lock:cmpxchg is serializing, providing bidirectional mfence-equivalent semantics. (Fence or barrier instructions are never needed for uniprocessors) This fact might also have contributed to the myth that CAS is more expensive on MP systems. But of course lock:cmpxchg incurs no more latency on a 2x system than on an 8x system.
While digressing on the topic of bus operations, lets say that a load is followed closely in program order by a store or CAS to the same cache line. If the cache line is not present in the issuing processor the load will generate a request-to-share transaction to get the line in S-state and the store or CAS will result in a subsequent request-to-own transaction to force the line into M-state. This 2nd transaction can be avoided on some platforms by using a prefetch-for-write instruction before the load, which will force the line directly into M-state. It's also worth mentioning that on typical classic SMP systems, pure read-sharing is very efficient. All the requesting processors can have the cache line(s) replicated in their caches. But if even one processor is writing to a shared cache line, those writes will generate considerable cache coherence traffic; assuming a write-invalidate cache coherence policy (as opposed to write-update) the readers will continually re-load the cache line just to have it subsequently invalidated by the writer(s). Put differently, loads to a cache line are cheap if other processors are loading from but not storing to that same line. Stores are cheap only if no other processors are concurrently storing to or loading from that same line. (We can draw an imprecise analogy between cache coherency protocols and read-write locks in that for a given cache line there can only be one writer at any given time. That's the processor with the line in M-state. Multiple readers of the line allowed and of course the lifetime of a reader can't overlap a write. Unlike traditional read-write locks, however, the cache coherency protocol allows writers to invalidate readers, so we can't push the analogy too far. In a twisted sense, the coherency protocol is obstruction-free). Coherency bandwidth is a fixed and contended global resource, so in addition to local latency, excessive sharing traffic will impact overall scalability and impede the progress of threads running on other processors. A so-called coherency miss -- for example a load on processor P1 where processor P2 has the cache line in M-state -- is typically much slower than a normal miss (except on Niagara). Recall too, that acquiring a lock involves a store (CAS, really) to the lock metadata, so if you have threads on processors P1 and P2 iterating, acquiring the same, the lock acquisition itself will generate coherency traffic and result in the cache "sloshing" of the line(s) holding the metadata. Generally, excessive coherency traffic is to be avoided on classic SMP systems. But as usual, there's an exception to any rule, and in this case that exception is Sun's Niagara, which can tolerate sharing gracefully. We should now return back to the topic of CAS latency, as that issue motivates the need for biased locking.
CAS and the fence (AKA barrier or MEMBAR) instructions are said to be serializing. On multiprocessor systems serializing instructions control the order in which which stores are loads are visible with respect to memory. Such instructions are often implemented in a crude fashion on most current processors. Typically, a serializing instruction will bring the CPU to a near halt, kill and inhibit any out-of-order (OoO) instructions, and wait for the local store buffer to drain. That's OK for simple processors, but it results in considerable loss of performance on more sophisticated OoO processors. There's no fundamental reason why CAS or fence should be slow. Until recently, the trend was toward worse CAS and fence performance, but it appears that this trend may be reversing. IBM made considerable improvements in fence latency between the Power4 and Power5. Likewise, Intel made remarkable improvements for both lock:cmpxchg and mfence latency in the recent Core2 processors.
In the case of biased locking -- since we know that the object is unshared -- it's reasonable to assume that the actions of the bias holding thread would keep cache lines underlying the object's lock word and data in M- or E-state
Interestingly, if a a memory barrier, or "fence" instruction is sufficiently faster than CAS we can implement biased locking revocation with a Dekker-like mechanism. If you're interested in delving deeper you might find the following of interest: Asymmetric Dekker Synchronization. (That same document also describes an optimization used in HotSpot where we were able to eliminate a MEMBAR from the state transition code path on multiprocessor systems).
Kenneth Russell and David Detlefs (formerly of SunLabs) have a paper appearing in OOPSLA'06 -- Eliminating Synchronization-Related Atomic Operations with Biased Locking and Bulk Rebiasing -- where they describe a scheme to reduce the cost of revocation.
It's important to note that biased locking is permitted under the the Java Memory Model as clarified by JSR133. Refer to the Java Language Specification, 3rd edition, Chapter 17, Section 4. Doug Lea's JSR-133 Cookbook is an excellent resource for JVM developers or curious Java programmers.
Finally, there's not currently space in the mark word to support both an identity hashCode() value as well as the thread ID needed for the biased locking encoding. Given that, you can avoid biased locking on a per-object basis by calling System.identityHashCode(o). If the object is already biased, assigning an identity hashCode will result in revocation, otherwise, the assignment of a hashCode() will make the object ineligible for subsequent biased locking. This property is an artifact of our current implementation, of course.
As an aside, the latency of atomic read-modify-write operations such as CAS was in part what motivated myself and Alex Garthwaite to develop our Mostly Lock-Free Malloc (ISMM 2002) mechanism, which uses processor-local allocation without requiring atomics or the binding of threads to processors.
Posted at 10:40PM Aug 18, 2006 by David Dice in General | Comments[4]
For a glimpse at the possible future of hardware synchronization
For a glimpse at the possible future of hardware synchronization schemes have a look at Marc Tremblay's talk given in March, 2006 at PARC.
Posted at 03:38PM Aug 18, 2006 by David Dice in General | Comments[2]
Java Synchronization
I'm currently part-time in the Scalable Synchronization Group in SunLabs and full-time in JVM Core Engineering where, among other duties, I take care of Java synchronization and threading.
I've posted a number of synchronization-related papers on my external site.
Posted at 08:46AM Aug 16, 2006 by David Dice in General | Comments[0]