David Dice's Weblog

Wednesday Feb 03, 2010

QPI Quiescence

It's not uncommon to find Dekker-like idioms in modern concurrent programs. On platforms with weaker memory models -- say where a store followed by a load in program order can be reordered by the architecture to appear as a load and then a store in the effective memory order (sometimes called the "visibility order") -- programs must use barrier instructions to enforce memory ordering to implement Dekker correctly. For the purposes of discussion and assuming a relatively common system model we'll define memory order as the order of operations as they appear at the interface between the processor and the first-level coherent cache. Examples of barriers are MFENCE on x86 and MEMBAR #storeload on SPARC. In addition, x86 and SPARC TSO memory models allow only one variety of architectural reordering, the store-then-load form noted above. (For simplicity we'll restrict the discussion to TSO-like memory consistency models). On some platforms barriers introduce significant local latency. Perversely, we sometimes find that atomic instructions which have barrier semantics (are barrier-equivalent) are faster than the purpose-defined barrier instructions. A simplistic barrier implementation might simply quiesce the pipeline and wait for the store buffer to drain. To allay a common misconception it's worth pointing out that barriers -- sometimes called fences -- are typically implemented as processor-local operations and don't cause any distinguished action on the bus or interconnect and instead simply instruct the processor to ensure that prior stores become visible before subsequent loads (subsequent and prior refer to the barrier in program order). That is, they don't force anything to happen -- such as coherence messages on the bus -- that were not already destined to occur. Instead, they simply enforce an order, momentarily reconciling program and memory order. Crucially, at least with current x86 and SPARC implementations, barriers don't force anything to occur off-processor. That also means they don't impede or impair scalability. There's no fundamental reason, however, why barriers should be so slow. The processor implementation is free to speculate over the barrier, for instance, as long as stores in the speculative episode are not made visible and loads in the episode are tracked for coherence. And in fact on at least one processor, barrier instructions effectively have 0 latency.

Returning to the Dekker idiom, threads T1 and T2 might coordinate as follows: T1 might execute (ST A; barrier; LD B) and T2 executes (ST B; barrier; LD A), and in particular we refer to this pattern as the Dekker duality. As a concrete example, we coordinate thread state transitions in the HotSpot JVM via a similar protocol, where T1 is a Java thread (mutator) executing the reentry path from a JNI call, T2 has the role of the VM thread coordinating a stop-the-world safepoint, A is a variable that indicates T1's thread state (executing outside the JVM on a JNI call, or executing inside the managed runtime), and B indicates if a stop-the-world safepoint is pending. Critically, if T1 is running on a JNI call and attempts to return back into the managed environment while a safepoint is executing, we need to stall T1 at the point of ingress, as the VM expects that Java threads will not access the heap during a safepoint. (Among other uses, Safepoints are employed for certain types of garbage collection operations, for instance where we don't want the collector and the Java threads accessing the heap simultaneously). For the purposes of illustration I'm showing just a single mutator thread T1 and a single VM thread T2, but in practice the mechanism is much more general. T1's path, above, is likely to execute much more often than T2's, as JNI calls could be expected to occur more frequently than safepoints. As such, to improve performance we'd like to elide the barrier instruction from T1's path. Asymmetric Dekker Synchronization is a family of related mechanisms that allow us to safely remove the barrier from T1 while shifting the responsibility of dealing with T1's potential reorderings to T2. We call it asymmetric because to be profitable T1's path needs to run much more frequently than T2's. We then call T1's path the fast-path and T2's the slow-path. (This mechanism can enabled and disabled by way of the -XX:+/-UseMembar switch).

The Asymmetric Dekker Synchronization document mentioned above enumerates a number of ways in which we might allow T1 and T2 to coordinate while still removing the barrier from T1's hot path, including signals, cross-calls (inter-processor interrupts), page-protection mechanisms, etc. On windows T2 might simply invoke the FlushProcessWriteBuffers facility, which seems to precisely match our needs. (Some time ago I filed an RFE -- request-for-enhancement -- for Solaris to provide a similar facility). Still, we're always looking for better ways to implement our asymmetric protocol, which almost brings us to QPI quiescence, but first we need some historical background.

Long ago Intel implemented atomics with a global bus lock. It was called the #LOCK signal and driven by the LOCK: prefix on instructions, thus the names we have today. Bus locking was conceptually simple as most multiprocess Intel systems used a common front-side bus (FSB) between the processor and memory. Unrelated atomic operations, however, could impair overall performance as #LOCK had to quiesce the bus. The old FSB was a split-transaction request-response bus, and allowed multiple requests in-flight at a given time, so to assert #LOCK too often could rob the system of performance. Bus locking also supported atomics that spanned 2 cache lines. Intel subsequently switched to so-called cache-locking, where atomic read-modify-write instructions were implemented directly in the local cache of the processor executing the atomic, avoiding the need to lock the shared bus. From the perspective of the bus such atomic operations are no different than a store. (All SPARC systems that I know of use cache-locking). Cache-locking was a good step forward as atomics now scale ideally if there's no sharing of the underlying cache lines. Despite that, Intel preserved bus locking to handle the exotic legacy case of atomics that span cache lines (split atomics), which, by definition, are misaligned accesses. For this odd case the best solution was to simply resort to bus locking so the two lines underlying the operand could be accessed atomically. Note that Intel and AMD appear to frown up such behavior in their reference manuals, but the processors still support it for legacy reasons, at least as of today.

With the advent of QuickPath Interconnect (QPI) Intel eliminated the common FSB and switched to a topology more akin to AMD's hypertransport. Nehalem is the first
processor to use QPI. But even with QPI the architects needed a way to support those legacy split atomics. To that end, QPI has the ability of quiesce the whole system to allow the split atomic to execute. It appears that QPI quiescence also drains the pipes and forces at least of the equivalent of barrier semantics over the whole system. That is, split atomics may serve as a way to force a system wide "remote" barrier.

Additional remarks



  • Beware, it's not clear that QPI quiescence is actually safe and provides true quiescence. Empirical tests with a program designed to stress the -UseMembar facility and inspired by a simple hunch about the QPI implementation suggest so, but absence of evidence isn't evidence of absence --we might yet find Karl Popper's black swan.

  • At best, QPI quiescence should be considered an academic curiosity and should never be used in production code. I'm sure processor vendors would be loath to endorse such stunts because it could ultimately limit their latitude in future bus designs. QPI quiescence is simply an implementation artifact and not an architecturally defined or guaranteed trait. Even if the facility were somehow blessed I doubt vendors would want to expose it by means of split atomics so perhaps a new instruction might be called for. (Put another way, there are two issues, should the facility be provided, and if so how to expose it to programmers). So for the moment QPI quiescence is only good for prototyping what *might* be accomplished with a hypothetical instruction.

  • It's possible such mechanism might be applicable to certain forms of RCU (read-copy-update).

  • Dmitriy V'jukov recently posted some timing results for FlushProcessWriteBuffers. It's pretty efficient on his system. I hope to be able to run similar benchmarks to measure the cost of QPI quiescence in the near future, at which point I'll post the results here. (Dmitriy is also the author of the relacy race detector which I recommend). It's worth noting that implementations of facilities such as FlushProcessWriteBuffers can be made very efficient. For example the implementation might be able to avoid a cross-call to a processor if it's known that no thread in the process is executing on that processor, using the knowledge that context switches are serializing events.


    Some extremely preliminary data shows that QPI quiescence by way of a split atomic incurs a local penalty of about 4800 cycles on an i7-920, and the degree of impact on the progress of other processors is very much a function of the miss rate of those processors.

  • I mentioned the idea of QPI quiescence to Dmitriy, who in turn pointed point out a relevant article on QPI in Dr. Dobbs. The "Locks" section is particularly interesting. As Dmitry noted, if it's possible to use quiescence for hot-plugging then it's not unreasonable to think that both the bus and processors are completely quiesced with no pending stores languishing in store buffers, which is precisely the behavior in which we're interested.

  • Is there a working analog to QPI quiescence on AMD's coherent hypertransport? This is left as an exercise for the curious reader.

  • Can QPI quiescence lead to a break-down of performance isolation in virtual machines running on the same physical system? (That's pretty easy to test even in the absence of virtual machines, with a simple multithreaded program or a few single-threaded processes).

Related reading



  • For another example of the Dekker duality in the HotSpot JVM and further discussion about the challenges of weak memory models
    refer to a previous blog entry about a long-standing bug in the park-unpark subsystem.

  • Biased locking, which is used in the HotSpot JVM, is a mechanism that attempts to the reduce the impact of high-latency atomic instructions. Interestingly, as processor vendors make improvements in the latency of such instructions there may come a time in the near future when biased locking is no longer profitable, at least on some platforms.

  • US07644409

Tuesday Jan 26, 2010

A scheduling and dispatching oddity on Linux

While benchmarking a concurrent application on Linux I ran into an odd problem worth relating. Specifically, I'm using ubuntu 9.10 with linux kernel 2.6.31-1 running on a 1x4x2 Core2 i7-920 "Nehalem" (1 package; 4 cores/package; 2 logical processors/core via hyperthreading). I'd noticed that our scaling numbers were a bit odd, with more than the usual fall off past 4 threads (it's a 1x4x2 system so we expect some fade past 4 threads, even for ideally scalable benchmarks) and more variance than I expected. The benchmark harness runs for fixed time and reports aggregate progress over the measurement interval, and of course the system is otherwise idle. As a sanity check the benchmark also reports the total user CPU time consumed over the measurement interval. Interestingly, the CPU times values were unexpectedly low, considerably less than min(#threads,#cpus) * MeasurementInterval. All the threads should stay running, or least ready, for the duration of the interval. Note too that the effect is independent of and still occurs with long intervals, so it's not a simple issue of allowing the scheduler time to steal and rebalance or otherwise converge on a state where the runnable threads are well-dispersed over the processors.

It appeared that the scheduler wasn't aggressively dispatching ready threads onto idle CPUs. Put another way there were prolonged periods where we had both idle CPUs and ready threads at the same time -- the kernel was failing to saturate the available processors.

To avoid the problem I initially tried binding threads to processors via sched_setaffinity, which provided complete relief. Still, I'm cautious about binding because it requires knowledge of the platform topology. On SPARC/CMT/Solaris, for instance, logical CPUIDs map to physical resources geographically in the following manner: the bits in the logical CPUID select, from most significant bits to least significant, chip then core then pipeline ("thread group" in Sun terminology) then strand. So if you just bind threads by "natural" order (thread N to CPUID N) then you'll end up with many threads sharing some cores and other cores completely idle, which is likely undesirable and may yield skewed scaling results. This, btw, is a common benchmarking pitfall. On Solaris/SPARC you're better off letting the kernel disperse threads onto processors as it'll balance 1st over chips, then cores, then pipelines, which is optimal for independent threads to make headway. (That policy is clearly not optimal if there's sharing -- particular if there are writers -- in which case you might win by packing the threads less "distant" from each other, for some interconnect distance metric, and assuming the increased cache pressure and replication doesn't do too much harm). Unlike SPARC/Solaris, the logical operating system-level CPUID to physical resource mappings on my ubuntu/x64 system are well-dispersed if you use natural CPU assignment, but there's no hard guarantee of that property although I vaguely recall that Intel advises a certain canonical mapping. In more detail, the logical CPUID to physical mapping on my system -- as discovered by iterating over the logical CPUID values and using the CPUID instruction to query physical resources -- is : 0 to C0S0, 1 to C1S0, 2 to C2S0, 3 to C3S0, 4 to C0S1, 5 to C1S1, 6 to C2S1, 7 to C3S1, where C is the core# and S is the relative strand# on the core.

I'm guessing that the linux kernel I'm using institutes polices that attempt to balance power with performance whereas Solaris currently optimizes for performance. After further poking through the Linux kernel sources I realized we could adjust the scheduling policy more to our liking via tunables exposed via the /proc file system. At that point I came upon the tune-sched-domains script that makes it easy to quickly adjust scheduler tunables. (Note that the script assumes bash). First, run tune-sched-domains with no arguments and examine the SD_WAKE_AFFINITY and SD_WAKE_IDLE settings. We want SD_WAKE_AFFINITY clear and SD_WAKE_IDLE set. (If I'm interpreting the comments in the kernel code correctly, WAKE_AFFINITY appears to try to place the wakee on the same CPU as the waker, presuming they communicate through memory that's already in the local cache, while WAKE_IDLE instructs the kernel to aggressively wake idle CPUs when making threads ready). If necessary, compute a new SD_ mask value and run the script again, passing the value (in decimal) as an argument to the script. These settings provided relief for the under-utilization problem.

In addition I noticed that the HotSpot JVM performed much better on multi-threaded workloads under the settings mentioned above.

While I didn't have time for the experiments, it may be the case that adjusting the LB_BIAS flag may also provide relief.

Monday Nov 30, 2009

A race in LockSupport park() arising from weak memory models

I recently diagnosed the root cause of a concurrency bug, CR6822370,
and thought it sufficiently interesting to share the details. (CR 6822370 actually represents a
cluster of bugs that are now thought to be related by a common underlying issue).
Briefly, we have a lost wakeup bug in the native C++ Parker::park() platform-specific
infrastructure code that implements java.util.concurrent.LockSupport.park().
The lost wakeup arises from a race that itself arises because of architectural
reordering that in turn occurs because of missing memory barrier instructions.
The lost wakeup may manifest as various 'hangs' or instances of progress failure.

For background, a Parker is a native HotSpot C++ type that implements
a construct that's somewhat like a restricted-range binary semaphore, except that park() is allowed to
return spuriously. See LockSupport for details. If a thread is blocked in park()
we're guaranteed that a subsequent unpark() will make it ready. (A perfectly legal but low-quality
implementation of park() and unpark() would be empty methods, in which the program degenerates to
simple spinning. An in fact that's the litmus test for correct park()-unpark() usage). A Parker instance
is associated at most one thread at any one time and only that designed thread call invoke park() on that
particular Parker. Any thread, however, may call unpark() on a given Parker. Furthermore,
Parker instances also type-stable and immortal, at least in Sun's HotSpot implementation.
On Solaris and Linux a Parker contains a pthread condvar (named _condvar),
mutex (named _mutex), and a volatile integer, _counter, that represents the state
of the semaphore. Despite its name, _counter only takes on the values 0
(neutral) and 1 (signaled). Each thread has a Parker instance dedicated to use
by LockSupport.

Parker:: park() and unpark()

// Redacted and annotated for clarity
void Parker::unpark() {
pthread_mutex_lock (_mutex) ;
int s = _counter;
_counter = 1;
pthread_mutex_unlock (_mutex) ;
if (s < 1) {
// We signal after having dropped the lock to minimize the hold time and
// in case the underlying implementation doesn't provide wait morphing.
pthread_cond_signal (_cond) ;
}
}

void Parker::park() {
if (_counter > 0) {
// Optional optimization to avoid taking the lock
_counter = 0 ;
return ;
}
if (pthread_mutex_trylock(_mutex) != 0) {
// Another optional optimization
// Must be a concurrent unpark() - just return
return;
}
if (_counter > 0) { // no wait needed
_counter = 0;
pthread_mutex_unlock(_mutex);
return;
}
pthread_cond_wait (_cond, _mutex) ;
_counter = 0 ;
pthread_mutex_unlock(_mutex);
}

Failure scenario


  1. Lets suppose we have a thread T1 whose LockSupport/Parker instance has _counter=1.

  2. T1 calls LockSupport.park() which then invokes Parker::park(). Park() fetches _counter and observes 1, and then sets _counter=0 and returns. There are no atomic or
    fence instructions in this path. Crucially, the store of 0 into _counter languishes in the processor's local store buffer and is not yet visible to other processors.

  3. T1 returns from park(), checks for the condition of interest and observes that it's not yet signaled, and again calls park(). The store to _counter from step (1) is still languishing in the store buffer and is not yet globally visible. Control reaches the very top of park().

  4. Thread T2 causes the condition of interest to enter signaled state via some stores. These are typically stores to Java volatile variables, so the JIT will emit the necessary memory barriers after the stores.

  5. Thread T2 then calls unpark(T1). The unpark() operator stores 1 into _counter. We'll assume that this store becomes globally visible immediately. T2 returns from unpark().

  6. Thread T1's old store of 0 into _counter from step (1) finally drains from its local store buffer and becomes globally visible , overwriting the value 1 just written by T2 in step (6). With due lack of formal precision, _counter is now "globally" 0.

  7. Thread T1 in park() fetches _counter, observes 0, and then blocks on the condvar. We have a lost wakeup — T1 stalls.

Remarks and analysis


  • The underlying issue is a classic optimization in unpark() which tries to avoid taking a lock. The _counter instance variable is often but not always accessed under _mutex. If the underlying platform provided sequential consistency then the code as shown above would be correct, but x86 and SPARC provide a weaker memory consistency models, allowing memory or visibility order to differ from program order. (Refer to the excellent papers by Peter Sewell et al. for background). Given those architectural reorderings, the program admits a race which can in turn result in missed wakeups.

  • The problem would only manifest when we were using the -UseMembar optimization that lets us remove fences from certain hot thread state transitions paths that need to coordinate safepoints between mutator threads and the JVM. This feature is enabled by default, but we can turn it off with the -XX:+UseMembar switch, which causes the JVM to emit normal fence instructions in the state transitions paths. (That particular optimization is an example of asymmetric Dekker synchronization). Crucially, the park() path contains such a state transition. In reality the fence emitted by the +UseMembar switch was simply covering up the otherwise latent Parker:: bug. +UseMembar constitutes a work-around. Sensitivity to UseMembar was initially confounding but ultimately a valuable clue.

    After thinking about various timing scenarios I settled on the one given above as the most likely culprit. To support that hypothesis I wrote a simple C model of the pathology and verified that it would "fail" in a similar fashion. Having collected data with the C model on various platforms I suspect that processors where stores can languish in the store buffer for longer periods are more exposed to the bug.

  • Inserting appropriate barrier instructions after both stores of 0 into _counter in park() provides a solution. Furthermore, we're not formally guaranteed that pthread_mutex_unlock() has barrier semantics, so to be conservative we need a barrier in that location as well. For our purposes a fence instruction prevents subsequent loads (subsequent in program order) from executing before prior stores become globally visible. We typically use volatile to control for compile-time reordering and fences to control for architectural reordering.

  • The bug will not manifest on uniprocessors or environments where threads are otherwise constrained to just a single processor.

  • The bug is a "day-one" bug and present in all versions of HotSpot.

  • Parker::park() and unpark() reside in os_linux.cpp, os_solaris.cpp and os_windows.cpp for Linux, Solaris and Windows, respectively.

  • The built-in synchronized implementation uses a different park mechanism (PlatformPark::) whereas the java.util.concurrent infrastructure uses Parker::. Only Parker:: is vulnerable.

  • Additional reading:

Appendix - elaborated scenario

We have a volatile Java variable C, initially 0. Thread T1's Parker::_counter value is 1.


  • Thread T1 executes: while (C == 0) park() ;
  • Thread T2 executes: C=1; unpark(T1)

Timing:


  1. Thread T1 loads C, observes 0, and calls park()
  2. Thread T1 in 1st park() invocation:

    • Load _counter (1)
    • Store _counter (0) — languishes in store buffer
    • return from park()

  3. Thread T1: Load C (0)
  4. Thread T1: call park() a 2nd time
  5. Thread T2: Store C=1; MFENCE (by virtue of java volatile).
    MFENCE is a completely local operation, influencing only T2.
  6. Thread T2: call unpark(T1)

    • lock _mutex with atomic instruction such as CAS or LOCK:CMPXCHG
    • Load _counter (0) — observes value from memory, completely legal
    • Store _counter (1)
    • unlock _mutex

  7. Thread T1's store of 0 into _counter finally drains from the store buffer and becomes
    globally visible, overwriting the value 1 just stored by T2
  8. Thread T1 continues in 2nd invocation of park()

    • Load _counter (0)
    • lock _mutex
    • Load _counter (0)
    • block on _condvar

Another way to think of the problem is via the Dekker-duality. Observe that T1 executes { Store _counter; Load C; } while T2 executes { Store C; MFENCE; Load _counter;}. Note the missing MFENCE from T1's path. The duality is slightly obscured because the store into _counter occurs in the 1st invocation of park() which returns immediately and the load of C occurs in the application code. An in fact that's what distinguishes this bug - that the failure idiom is distributed over two invocations of park().

Thursday Oct 29, 2009

ROCK Hardware Transactional memory - Sun technical report

We recent published an expanded version of our earlier ASPLOS paper as a Sun labs technical report.

Wednesday Sep 23, 2009

The perils of negative scalability

I've seen the following issue confound customers and colleagues of late, so thought it worth a blog entry.

Lets say you have an application that exhibits negative scalability. That is, if you were to plot throughput on the Y-axis and concurrency on the X-axis the shape of the curve would
be convex downward -- performance climbs up to an apex and then falls off. (How can this happen? A common reason is that the communication overheads start to dominate and the
benefit of concurrency is overcome by communication and synchronization costs). Under such circumstances it's common to introduce some type of admission control -- say, simple
back-off or more elaborate mechanisms -- to restrict concurrency. Ideally, this yields an asymptotic curve where performance remains constant after reaching the peak, avoiding any
fall-off.

If you tune the performance of such an application using the usual measure-analyze-modify cycle but pay attention only to the throughput values at high concurrency levels then you
might be badly misled. The usual development feedback loop can fail because poor "optimizations" that slow down the code may actually serve as inadvertent implicit back-off (contention
management) that will attenuate the negative scalability at higher concurrency levels but also needlessly impair performance at lower concurrency levels. Ideally, back-off should
be applied only as needed, in response to contention.

A related effect is that inserting diagnostic probes might yield better performance in the region of negative scalability because of probe overhead -- a performance "Heisenbug" where
performance improves when execution is more closely observed.

The take-away is that we should be careful to measure the performance of any proposed change over a wide range of concurrency values, and not just at the extremes.

Monday Jun 01, 2009

X86 platform memory model - clarifications in recent Intel manuals

Intel has taken steps toward addressing some of the concerns about potential platform memory model relaxations that I identified in a
previous blog entry. Specifically see section 7.2.3.7 of their
latest Software Developer's Manual which now guarantees global store ordering.

Friday May 29, 2009

Instruction selection for volatile fences : MFENCE vs LOCK:ADD

In the past the JVM has used MFENCE but, because of latency issues on AMD processors and potential pipeline issues on modern Intel processors it appears that a LOCK:ADD of 0 to the top of stack is preferable (Gory Details).

Sunday May 24, 2009

memcpy() concurrency curiosities

I was recently asked to diagnose a problem a customer was encountering that involved Java and the JNI getIntArrayElements() and releaseIntArrayElements() primitives. The outcome of the exploration was sufficiently interesting that I thought I'd mention it here in order that other JNI users might avoid the same pitfall. Briefly, the customer was running a highly threaded application on a machine with lots of available concurrency. The Java threads would repeatedly read from an array of int[]s that was effectively immutable. References to the array were also passed to native C code via JNI calls. Those native methods would access the array via the getIntArrayElements() and releaseIntArrayElements() primitives, but in no case did the native code ever modify the array contents. The problem was that the Java readers would occasionally observe scenarios that suggested that the contents of the array were transiently "flickering" to unexpected values -- ephemeral corruption -- and then returning to the expected values. (Recall that the array was effectively immutable, so this shouldn't have been happening). I carefully checked the Java code and native code and quickly convinced myself the source of the bug was elsewhere.

I next switched my attention to the implementation of getIntArrayElements() and releaseIntArrayElements(). Briefly GetIntArrayElements() will, as used by this particular application, (a) switch the caller's thread state back to "InVM". If there's a stop-the-world garbage collection event going on the state transition operator should block the caller until the GC completes. At that point the thread can safely access the heap to extract the array data. The "InVM" thread state is such that GC should be inhibited while we're copying the data out. (b) malloc a properly sized array to hold the data; (c) memcpy() the data from the heap data into that just allocated array; (d) restore the thread state; and (e) return a pointer to the just allocated array to the caller. ReleaseIntArrayElements() operates in a similar fashion but reverses the memcpy() direction (copying into the heap) and then releases the allocated array. Note that in our case the source and destination memcpy() regions are completely disjoint. My first thought was that something might have been wrong with the state transition operations, allowing what should have been stop-the-world GC to inadvertently run concurrently with the memcpy() operations. Further investigation ruled that out as there were no GC events when the apparent corruption was observed.

Upon reflection, however, I recalled that some optimized memcpy() implementations may execute explicitly coded benign speculative stores into the destination region. Another possibility is the block-initializing-store (BIS) used by memcpy(). A concurrent observer might fetch from a memcpy() target cache line in the narrow timing window between when the line was zeroed and when the contents were copied back into the line. In either case, the memcpy() in releaseIntArrayCritical could transiently mutate the contents of the java array in the heap to some unexpected value. By the time the memcpy() finishes, of course, the destination buffer will again contain the expected contents. I quickly confirmed this scenario might occur by writing a quick throw-away multi-threaded C test case that modeled the JNI behavior and demonstrated the effect. The underlying mechanism at work deserves a bit more explanation. Lets say we're writing C code and have an effectively immutable array A. A contains a known set of values. Concurrently, we have a 1st thread reading from A and 2nd thread memcpy()ing precisely the same set of values known to be in A back into A. At first glance it's not unreasonable to assume that the reader (the 1st thread) will always see the expected values in A. That is, we're assuming that memcpy()ing the same value back into an array is idempotent. The memcpy() operation is certainly idempotent from the point of view of the caller of memcpy(), but interestingly it's not idempotent from the point of view concurrent observers. With optimized memcpy() implementations concurrent observers can see "odd" intermediate states. Naively, you'd expect that concurrent observers would see the array as unchanging, but that isn't the case on all platforms. The case at hand the memcpy() in releaseIntArrayElements() is the source of the problem, as array in heap might "flicker". To further confirm this diagnosis I provided the customer with a simplistic byte-by-byte unoptimized memcpy() implementation in an LD_PRELOAD-able mempcy.so shared object. And indeed, the problem wasn't evident when running with this simple memcpy() instead of the optimized system form.

While technicall legaly, such behavior in memcpy() -- and the JNI operators that call memcpy() -- is, at best, surprising and violates the principle of least astonishment. Strictly speaking, memcpy() provides no particular guarantees about concurrently observable intermediate states, but rather only specifies the state of the destination buffer at the end of the invocation. That is, while a memcpy() is in flight the destination buffer might contain transient or ephemeral values which could be observed by other threads. Overwriting an array with an exact copy of the array will be idempotent from the perspective of the caller after memcpy() completes, but concurrent observers of the array are permitted to see transient ephemeral "garbage" values. My first thought when we ran into this behavior some years ago was that it was a bug in the memcpy() implementation, but upon deeper reflection I concluded the bug was in my errant assumptions about how memcpy() worked. Having said that, while such behavior is technically permissible I'm also sympathetic that such effects are, at best, unexpected. I's not unreasonable to assume the destination buffer would remain unmolested. One could imagine customers easily falling into this trap, in both C code and Java.

Thursday Feb 12, 2009

ROCK Hardware Transactional Memory - Early experiences - ASPLOS 2009

Early Experiences with a Commercial Hardware Transactional Memory Implementation by Dave Dice, Yossi Lev, Mark Moir and Dan Nussbaum will appear in ASPLOS 2009.

TLRW-ByteLock : return of the read-write lock -- Transact 2009

TLRW-ByteLock: return of the read-write lock by myself and Nir Shavit will appear in Transact 2009. (Slides).

Wednesday Mar 19, 2008

Carmina Burana Reinterpreted

Among other things, musicologists recontextualize and reinterpret older works for the current generation. If you love Carmina Burana but struggle with medieval high German or ecclesiastic "church" Latin, musicologists have an easy-to-digest answer (SFW).

Friday Mar 14, 2008

CAS and Cache Trivia - Invalidate or Update In-Place

In a previous blog entry on biased locking I noted that CAS (the atomic Compare-And-Swap instruction) usually has the same semantics as store with respect to caches and the interconnect. It's worth calling out an interesting exception, however. For background, the level-1 data cache -- often shortened to L1D$ or just D$ -- of Niagara-1 and Niagara-2 processors uses a write-around policy. All stores go over the cross-bar, but if the line is also present in the D$, the store instruction (ST) will also update the line in-place. This is where CAS differs from ST. If the line targeted by the CAS is also in the the D$ of processor executing the CAS, the line will be invalidated instead of being updated in-place. Of course it's not uncommon for the line to be present in the cache given the prevalence of the LD...CAS usage idiom. More importantly, it's extremely common for a thread to access the same line in short order after a CAS. This is where the CAS-Invalidates policy can impact performance. A good example is a lock that's acquired via CAS where the lock metadata is collocated with the data covered by the lock. Obviously, after having acquire a lock a thread is quite likely to access data protected that same lock. The first data access to the just-CASed-to-line will miss in the D$. Another example would be a thread that repeatedly locks and unlocks the same lock. Assuming the lock and unlock operators both use a LD...CAS idiom, even if the line containing the lock metadata wasn't accessed within the critical section, the CAS in lock operation will cause the LD in the subsequent unlock to miss, and the CAS in unlock will cause the LD in the subsequent lock operation to miss. Thankfully on Niagara-family processors a miss in the D$ that hits in the level-2 cache can be satisfied fairly quickly. Still, we'd like to avoid that D$ miss if at all possible. I'd argue that to the extent feasible, processor designers should avoid a CAS-Invalidates policy, which I believe to be an implementation artifact and not strictly fundamental to CAS. CAS-Invalidates is cache-antagonistic to common synchronization usage.

Wednesday Feb 13, 2008

Rock-style Transactional Memory - Lock Elision for Java, etc.

In Applications of the Adaptive Transactional Memory Test Platform to appear in Transact 2008 we should how Rock-like transactional memory can be applied to various components of the Solaris software stack, including transactional lock elision for "synchronized" blocks in Java.

Monday Jan 28, 2008

National Wear Red Day

February 1st is National Wear Red Day. This is particularly dear to my heart because of my Wife's experience with cardiac arrest last May 15th.

Friday Jan 18, 2008

Questions about Java-SE on Solaris?

Feel free to post questions to our Ask the Experts forum.

Wednesday Jan 16, 2008

Java Memory Model concerns on Intel and AMD systems

The Java Memory Model (JMM) was recently clarified by JSR-133 with the corresponding changes incorporated into chapter 17 of the Java Language Specification, 3rd edition. Doug Lea's excellent JSR-133 Cookbook reinterprets JSR-133 from the perspective of and for the benefit of JVM implementers. A JVM must reconcile the JMM and the memory consistency model of the underlying platform. Intel/AMD (x86) and SPARC Total Store Order(TSO) define relatively strong memory consistency models; the only architectural reordering of concern is that a store followed by a load in program order can be reordered by the platform such that the store becomes visible before the load executes. If we require that store to become visible before the load executes then a serializing instruction -- typically an atomic instruction, such as CAS or a fence (MFENCE, MEMBAR #StoreLoad) -- must execute between the store and load in question.

The JMM defines a strong memory model akin to sequential consistency (SC) for volatile accesses. On Intel, AMD and SPARC processors, it's sufficient for the JVM to execute a fence instruction after all volatile stores. In practice this means that while translating Java bytecode to native code the just-in-time-compiler, or JIT, emits a fence after all volatile stores. In addition to avoiding architectural reordering through the use of fence instructions, the JIT will also avoid compile-time ordering of volatile accesses. To be somewhat more precise, a volatile load has acquire semantics and a volatile store has release semantics.

Of late, however, both Intel in their IntelĀ® 64 Architecture Memory Ordering White Paper and AMD (in section 7.2 "Multiprocessor Memory Access Ordering" of their recently updated systems programming guide) have relaxed the definition of their platform memory models. Under their previously defined memory models, for instance, if MFENCE instructions appeared between all store-load pairs you'd effectively have sequential consistency. That no longer holds. Instead of sequential consistency we'll instead have slightly weaker causal consistency. (As an aside, I wonder if these specification changes apply to existing processors already in the field -- that is, they clarify the behavior of existing processors -- or if they reflect future or planned processors? I'd hope the latter). Intel claims to have analyzed a large body of existing code in the field and believes that no programs will observe the change or be adversely affected. Strictly speaking, however, existing JVMs that emit MFENCE instructions after volatile stores would be in violation of the JMM when running on processors that actually implemented causal consistency instead of the previous TSO-like model. Collectively, we could clarify the JMM yet again to admit causal consistency for volatiles. Another option would be to change the code emission in the JIT to use locked instructions or XCHG instead of MFENCE. By my reading of the new Intel and AMD documents that'd be sufficient to put the JVM into compliance with the JMM on processors with the relaxed memory model. That's likely slower, however.

Readers interested in this topic would also likely enjoy Hans Boehm's presentation Getting C++ Threads Right which touches on the analogous problem for the new C++0x memory model (Youtube video).

Update 2008-3-7: See Rick Hudson's IA Memory Ordering talk.

Update 2008-10-27: See http://www.cl.cam.ac.uk/~pes20/weakmemory/ and in particular their POPL 2009 submission The Semantics of X86 Multiprocessor Machine Code.

Thursday Jan 10, 2008

Java Thread Priority Revisited in JDK7

In an earlier blog entry I described how Java thread priorities relate to native thread priority. We recently ran into bug 6518490 that forced me to change the default mapping of priorities on Solaris™ as previously described.

Briefly, in the Solaris IA(interactive) or TS(timeshare) scheduling classes, the effective priority of a thread is function of the thread's assigned priority and a thread-specific system-managed behavioral factor. (That's a gross oversimplification. If you're curious, I recommend Solaris Internals, 2nd Edition Section 3.7 followed by a skim of ts.c. In the case of Java the JVM maps the logical Java priority set via Thread.setPriority() to an TS- or IA-specific priority, and then calls priocntl() to set the assigned priority for the thread. The scheduler avoids indefinite starvation of low priority threads by periodically applying a boost (this is the system-managed behavioral factor) for threads that have languished too long on the ready queue without having been run. In addition, threads that block voluntarily -- as opposed to exhausting their time slice and requiring involuntary preemption -- may receive a boost to their effective priority by way of changes to the previously mentioned behavioral factor.

Lets say we have a 2x multiprocessor system, otherwise idle, with threads T1 and T2 at high assigned priority and T3 at low assigned priority. T1 and T2 loop, yielding, or alternatively they might pass a "token" back and forth via a pipe. Critically, they block often, so they receive the aforementioned boost to their effective priorities. Thread T3 simply loops, computing the digits of π. T3 will languish on the ready queue. Over time the scheduler will boost T3's effective priority, but unfortunately that boost isn't sufficient to push T3's priority beyond that of the boosted effective priorities of T1 and T2, so T3 can starve indefinitely.

To avoid the problem I've inverted the default for the -XX:[+/-]UseThreadPriorities flag from True to False on Solaris. The defaults setting for the flag remains unchanged on other platforms. Running all Java threads at the same priority sidesteps the problem and avoid starvation. I considered just posting a work-around and leaving the flag polarity unchanged, but the bug is difficult to diagnose properly. A work-around, albeit simple, is useless if a customer is forced to diagnose and identify a rather opaque and exotic bug. Put another way, a work-around is no better than the diagnostic procedure used to identify the associated malady. The change to UseThreadPriorities is in JDK7 and is propagating back through JDK6 and JDK5. If you desperately want Java threads priorities to map to underlying native priority you can use -XX:+UseThreadPriorities, but in a sense you'll be accepting the risk of the hang. You have to explicitly opt-in to enable priority mapping. Priorities are a quality-of-implementation concern, not a correctness concern. Recall that thread priorities are strictly advisory, and the JVM has license and latitude to implement or change them as needed. They should never be depended upon to ensure progress (relative to other threads) or lack of progress (i.e., synchronization).

Finally, this changed isn't as significant as it might seem. Under IA and TS, the upper half of the logical priority range was previously mapped to native "high" and the lower half was mapped over the range of native priorities. The only observable difference between the old behavior and running with thread priorities disabled is that Java threads assigned priorities in the lower half of the logical range will now run at the same effective (native) priority as threads in the upper half. When competing for CPU cycles with other threads, these low priority threads will receive relatively more cycles from the scheduler than they did in the past.

Tuesday Apr 24, 2007

More musical parody - interpret this as you may

Alanis Morissette parodies "My Humps". Deep social commentary, shallow parody, both, or something else altogether? It's your call.

Sunday Mar 11, 2007

kill-dash-nine song

Would you consider it good or bad if you understand all or most of the references in kill-dash-nine song? Consider it a geek litmus test. (NSFW, NPC2007)

Monday Feb 26, 2007

Understanding Tradeoffs in Software Transactional Memory

Understanding Tradeoffs in Software Transactional Memory by Nir Shavit and myself was accepted in CGO-2007. The paper describes the TL1 (Transactional Locking) mechanism. The slides presented at the conference are available as well. (We first described TL1 at Transact'06 in Ottawa). TL1's successor, TL2, authored by mself, Nir Shavit and Ori Shalev, was published in DISC'06.

Briefly, STMs such as TL1 periodically revalidate the read-set during a transaction. This is necessary as a transaction might have read a set of values that are inconsistent and then, based on that inconsistent data, enter an infinite loop, generate traps, or otherwise misbehave. We refer to such a transaction as a zombie: it's observed inconsistent state and could never successfully commit, but it's still running. We could avoid zombie transactions by revalidating the entire read-set after each transactional load. That's sufficient to avoid zombies, but the cost is quadratic with the size of the read-set. Instead, TL1 admits zombies (transiently) and validates periodically. The period of zombie execution is bounded. But the TL1 runtime environment must also provide zombie containment to gracefully tolerate the misbehavior of zombie transactions.

Another way to avoid zombies is to use so-called visible readers. Conceptually each transactional variable maps to a read-write lock. The first time a transaction reads from a location it acquires the read-lock. And of course transactional writes acquire the lock for writing. Visible readers avoid zombie execution but at the cost of considerable read-write coherency traffic on the read-write locks.

TL2 does not admit zombies execution, potentially making it more suitable for C or C++ unmanaged runtime environments. And it manages to avoid zombies with validation costs only proportional to the read-set size and without visible readers.

Sunday Feb 25, 2007

Geocaching - In the median strip

Warning: This video clip could be spoiler for the final of a rather interesting multicache.

Monday Nov 13, 2006

Open Source Java - Crossing the Rubicon

At this point I should probably employ a pithy turn of phrase or draw a clever analogy between Sun's announcement of open source Java and some interesting historical event. But if you've read this far I'll spare you that pain. There are considerable high-level implications to Sun's actions, but those are well-covered elsewhere so I'll stick to local prognostication. The HotSpot sources are now accessible through an easy-to-view opengrok instance, similar in spirit to that used by Open Solaris. And of course there's svn access or zipballs too. Bloggers will able to annotate their blogs with pointers into the source, providing a much richer narrative and enabling better engagement with the larger community. My personal hope is that the HotSpot JVM will be more appealing to academic and research environments. I look forward to those collaborations.
Alea Iacta Est.

Monday Oct 02, 2006

Hardware-assisted transactional read-set validation

The LoadLocked and StoreConditional primitives, commonly referred to as LL-SC, provide for optimistic concurrency control. To atomically increment a shared word in memory, we could execute LL on the variable, fetching it into a register, increment that register, and then try to execute SC to conditionally store the updated register value back into memory, overwriting the previously fetched value. The SC instruction will fail if another processor managed to update the variable between the LL and SC, in which case the program might typically retry the operation. In most LL-SC implementations, the processor executing LL tracks the location using the existing snoop-based cache-coherency mechanism, watching for remote updates. In a sense, LL-SC provides a degenerate form of hardware transactional memory (HTM) where the read-set (the set of shared variables read during the transaction) and the write-set (the set of shared variables speculative stored-into during the transaction) are collectively restricted to just one location. LL-SC is available on PowerPC, MIPS, Alpha and forthcoming multiprocessor ARM processors. A related instruction, compare-and-swap (CAS) is available on SPARC, IA32 and AMD64. CAS and LL-SC are key primitives for lock-free algorithms. LL-SC can be extended to allow multiple disjoint locations in the read-set and write-set, yielding a more general HTM.

First generation HTMs, even when available, are likely to place restrictions on the size of the read-set and write-set. As such Software Transactional Memories (STMs) are of interest. STMs can be implemented on today's stock hardware.

Both STMs and HTMs typically ensure that the data observed by a transaction (this is, its read-set) is mutually consistent. For instance lets say that we have shared variables A and B with the invariant that (A+B) is even. A and B are both initially 5. Transaction T1 reads A and observes 5. Transaction T2 updates A to 4 and B to 6. Transaction T2 then reads B and observes 6. We say that T2 has read inconsistent data. In this case T2 must abort and retry the transaction. Databases often provide the so-called ACID transactional properties - Atomicity, Consistency, Isolation and Durability - whereas transactional memory implementations usually provide just ACI.

HTMs provide consistency by snooping for remote invalidations to the current transaction's read-set. STMs typically provide consistency in one of two ways. First, the STM can implement so-called visible readers. Conceptually, each transacted upon location is covered by a read-write lock. A visible reader acquires the read-lock for each element of the read-set at the time the transactional read is performed. In a sense, this is pessimistic as the read-lock ensures that the values previously read in the transaction remain unchanged; the read-lock prevents an update transaction from acquiring the write-lock until the reader's transaction completes. In the example above transaction T1 would acquire the read-lock covering A, preventing transaction T2 from acquiring the write-lock needed to update A until T1 completes. Unfortunately visible readers require that readers write shared transactional metadata to acquire and release the read-write locks. See the following blog entry for a discussion of why we want to minimize writes to shared data and metadata.

STMs such as TL associate versioned write-locks each transactional location. The version number is advanced after each update. A versioned write-lock is similar in spirit to a linux Seqlock. The STM tracks the version numbers associated with the read-set and validates the read-set at commit-time. This provides invisible readers; no writes (stores) to shared metadata are necessary to track the read-set. The technique is optimistic in that it detects and recovers from inconsistent data whereas visible readers prevent inconsistent data.

STM read-set consistency management of either flavor - visible or invisible readers - can constitute a considerable fraction of the transactional overhead. In addition, for most transactions the size of the read-set typically dominates the size of the write-set. (This also holds normal code - loads greatly outnumber stores. Depending on which literature you prefer, the load:store ratio is usually given as 4:1 or 3:1). Nir Shavit and I propose a hybrid scheme where read-set consistency is managed by hardware but the write-set is managed in software through locking, as is done in TL(TL1) or TL2. A traditional HTM as described in the literature handles both the read-set and write-set in hardware whereas we propose decoupling read-set and write-set management. The goal is reduce transactional latency by making a hopefully modest change in the processor, leveraging existing cache coherency constructs. To be profitable, the hardware read-set tracking would need to be more efficient -- have lower latency -- than the explicit version number checks used by STMs such as TL(TL1) or TL2. (The version number checks ensure that the read-set is mutually consistent).

STMs such as TL1 periodically revalidate the read-set during a transaction. This is necessary as a transaction might have read a set of values that are inconsistent and then, based on that inconsistent data, enter an infinite loop, generate traps, or otherwise misbehave. We refer to such a transaction as a zombie: it's fated to eventually abort but it's still running. We can avoid zombie transactions by revalidating the entire read-set after each transactional load. That's sufficient to avoid zombies, but the cost is quadratic with the size of the read-set. Instead, TL1 admits zombies (transiently) and validates periodically. But the TL1 runtime environment must also provide zombie containment to gracefully tolerate the misbehavior of zombie transactions. Note that TL2 does not admit zombies execution, potentially making it more suitable for C or C++ unmanaged runtime environments. Relatedly, STMs that use visible readers avoid zombies too, but at the cost of considerable read-write coherency traffic. If we have hardware-assisted read-set validation and notification that a thread's read-set has been "spoiled" by a remote write is immediate (say by interrupt delivery) then we can use schemes such as TL1, but still avoid zombies and the complexity of zombie containment.

Hardware read-set management could be accomplished in various ways. First, we might modify the existing data cache (D$) to tag and track the lines containing read-set elements. Remote snoop invalidations to such tracked cache lines that occur before commit-time would result in in an forced abort. This leverages the existing cache-coherency protocol. One problem with this approach is that loads during a single transaction might cause earlier transactional loads from the same transaction to be displaced from the cache because of insufficient cache associativity. If that were to happen we'd lose the ability to track displaced line and thus the ability to guarantee consistency. The transaction would need to abort. Retrying is useless, of course, as the resource deficit is local and would simply recur. To avoid such problems we might use a high-associativity cache similar to a victim cache. Such as specialized cache wouldn't necessarily need to contain data, but rather just tags and tracking metadata. This idea is vaguely similar to the store-miss accelerator proposed by Chou et al., in IEEE Micro'05. Both of these approaches have capacity limits, however, and some transactions may not be locally feasible. To avoid capacity limits we might provide each processor with a local hardware Bloom filter keyed by physical address. The contents of the filter would be reset at the start of a transaction. As transactional reads were performed, those addresses would be added to the local Bloom filter. That is, the Bloom filter would track the read-set. Remote invalidation probes would test the Bloom filter and if there was a hit, cause the current transaction to abort. This mechanism admits false-positive aborts, but by proper sizing of the Bloom filter they could be made relatively rare. (Critically, a Bloom filter is conservative and does not admit false-negatives - it will not fail to miss any instances of read-set interference. In the absence of remote coherency traffic, all local transactions will be feasible).

For additional details see the following write up.

Most modern speculative Out-of-Order (OoO) processors already provide coherence tracking for speculative loaded operands. Transparently to the programmer, as an optimization, loads might be executed more aggressively than program order would seem to allow. In that case such speculative loaded addresses are tracked. If another process writes to a speculatively loaded-from location before the commit "wave front" passes the load, the process will discard the now-stale state and quietly rerun the load and discard any speculative state that depends on that load. Tracking state allows OoO execution but preserves the platform's memory model. I recommend "Memory Ordering: a Value-based Approach" and "Deconstructing Commit".

Finally, Chris Purcell and Tim Harris mention a clever idea in their paper Implementing Multi-word Atomic Snapshots on Current Hardware where they use existing performance counters available on most processors to detect read-set interference. Their scheme works for read-only transactions and requires two passes over the read-set.

Further reading: Bulk Disambiguation of Speculative Threads in Multiprocessors by Ceze, et. al.

Monday Sep 25, 2006

Parallelism Manifesto

Java provides threads to express parallelism and locks to restrict or prune that parallelism. This subtractive model seem somewhat awkward: threads give parallelism and locks take it away. (I'm intentionally ignoring lock-free programming and optimistic transactions for the purposes of this rant. Note too, that multithreaded programs and thus locks are still required on uniprocessors, but I'd argue that threads are used on uniprocessor systems only to achieve IO-compute overlap. Better asynchronous IO facilities provide a better solution than threads.) Think of taking one drug [parallelism] as a remedy for a disease [limited single-CPU performance] and then needing a 2nd drug [locks] to counter act the side-effects [races] of that 1st drug [parallelism]. Remember, true parallelism is not a feature -- it's just a remedy, and a problematic one at that. Just as Java removed explicit memory management from C and C++, I expect that the next generation of programming languages may remove threads and locking as we know it today. Explicit threading will still be available, of course, but ideally I'd like to see the world switch to a model where, by default, threads are managed by the runtime environment. Ultimately, we'd disappear threads (sadly, disappear has entered regular usage as a transitive verb). Programmers would describe potential parallelism with new language constructs but avoid explicitly prescribing threads. Such a descriptive model is better than our current prescriptive model as we trade two mechanisms (threads and locks) for one construct used to express potential parallelism. By default, execution would be serial unless the programmer explicitly marked it as potentially parallel. That's inverted from our current model. The managed runtime would automatically provision the process with real threads as needed.

Of course you can approximate a nearly thread-free programming model today with deft use of thread pools in the java.util.concurrent libraries. But instead of a library implementation I'm proposing first-class language integration. Relatedly, one might correctly argue that the descriptive model described above could be emulated today with very light-weight threads. In such a model we'd use threads and implicit cyclic barriers to express the parallel regions. The "body" of the thread's run() method would typically be short and not contain any synchronization operations. While workable, that approach isn't a complete solution. Not all concurrency problems map cleanly or easily onto the fork-join model. See also Cilk and Fork-Join parallelism.

The key economy we're grappling with is really human scalability (developers), not machine scalability. (Programmer scalability is just as important as program scalability). Parallelism has arrived on our desktops and the free lunch is over. Traditional concurrent programming has been error-prone and difficult, making it expensive and best left to a small priesthood of specialists. That situation can't stand. It's probably true that humans are naturally better at sequential reasoning and programming (c.f., dichotic listening). You can even draw a sketchy analogy between human multitasking (NY Times, 2007-3-26) and parallel programming. To the extent possible, we need to provide languages and constructs that allow programmers to "think sequentially" but "execute concurrently". There's no reason that the programmer who has domain expertise in international tax law should also be required in the future to become a concurrency expert.

Sun's Fortress language is an interesting step in the right direction. Fortress provides loop-level parallelism by default. (If you're interested in how concurrency is expressed in modern programming languages then IBM's X10 is also worth a look).

Note that we're not describing automatic parallelization, where the compiler or runtime analyzes the dependencies in a serial instruction stream, converting implicit instruction-level parallelism (ILP) to thread-level parallelism (TLP). That's the holly grail, but it remains an open research topic. It's also likely that for most codes, the benefit is bounded. This is the same reason we're seeing diminishing returns from out-of-order and speculative execution in modern processors.

Since locks are problematic, a useful interim step is to provide transactions in lieu of locks. Think synchronized but without the need to specify an object. Some researchers prefer atomic { ...code...} instead, but it's all syntactic sugar over the same concept. Briefly, transactions provide correctness as a first-order property. The programmer doesn't need to be concerned about which variables are protected by which locks or locking protocol. Deadlock is impossible as well - atomic sections are composable. As a secondary benefit, depending on the application and the implementation of the transactional subsystem, the program might enjoy more potential parallelism and thus improved throughput. There's often no need for the programmer to design complex and error-prone fine-grained locking protocols to extract additional parallelism. (Of course hand code algorithms will always have a performance edge, but we hope that STMs will often be able to archive competitive performance). This particular area is where STM (Software Transactional Memory) research intersects language design. As you'd expect, much of this remains an open -- and very active -- area of research. If you're curious, you might enjoy reading about the atomjava project as well as watching this video interview with Simon Peyton-Jones and Tim Harris of Microsoft research or the following Google TechTalk video on
transactional memory.

As an aside, some have suggested MPI or OpenMPI as a workable parallel programming model. Message passing has a certain aesthetic appeal, but I don't see it as the way forward. See Gosling's blog entry on the topic. Relatedly, a segment of the concurrency community believes that Erlang (or a programming model like Erlang) is the right path for the future of concurrency. Erlang doesn't allow traditional threads and shared memory, but instead provides efficient asynchronous message passing between processes. A process can neither address, read, or mutate the memory of another process. Many of the bugs we face today in a classic threads + shared-memory programming model aren't even expressible. Minimizing shared mutable data is generally a good practice; Erlang simply takes that the philosophy to the extreme and eliminates shared mutable data, providing communication through explicit message passing instead. (Of course you could approximate the Erlang model in Java by partitioning data as follows: Mutable unshared data would be thread-local -- one thread couldn't even reach another thread's unshared data. Shared immutable data is benign. Finally, you could ban shared mutable data and communicate through java.util.concurrent queues with serialized objects.) The Erlang programming model is seductive, but we don't know if it scales out or maps to large problems where state can end up dispersed. Collectively, we have more experience with the traditional threads memory model. Interestingly, David Patterson mentioned in his Feb 2007 Stanford EE380 talk that recent programmer productivity studies suggested message passing might be more difficult to master. (See also The Landscape of Parallel Computing Research: A View from Berkeley.) Distributed programming brings its own benefits as well as problems. I suspect in the future we'll use hybrid environments with message passing between nodes and threads with shared memory for intra-node cooperation.

It's also worth noting that concurrency isn't a panacea and must be used with care. It's entirely possible to encounter negative scalability. In the classic case if plot throughput on the Y-axis and concurrency on the X-axis you'll see either a concave graph or a monotonically descending line. If the communication costs (shared memory coherency traffic, synchronization, etc) don't overcome the benefits derived from compute parallelism then you may encounter this problem. In addition, we'd like to use fine-grain parallelism to maximize system utilization, but fine-grained parallelism typically incurs higher communications costs.

Finally, it's worth commenting on language closures. Normally you wouldn't think there's much connection between closures and parallelism, but in practice closures are an excellent way of expressing certain actions. Think of a closure being passed into a ParallelApply() method that operates on a collection and you'll quickly see the benefits of closures. BTW, the scala programming language provides a rather elegant closure model. I've only played with it briefly, but scala looks promising. It's a nice shiny new language but the compiler generates Java bytecode so you can take advantage of your highly optimized JVM and mature Java libraries. Closures are also a hot topic for Java, of course. For a balanced look at the design choices I recommend John Rose's blog entry.

Monday Sep 11, 2006

A tool to decode hs_err files

The hs_err6.pl perl script parses an existing hs_err file and annotates it with additional information. In particular the script will decode the the rather cryptic "Error ID" string as well as making an attempt to decorate the module-relative offsets (e.g., [jvm.dll+0x1234]) found in native stack listings with more useful symbolic names.

The script is provided as-is and is entirely unsupported. It works only on J2SE 6 reference platforms. To get started run with the --help option. Ideally, future versions of the JVM will integrate this feature and there'll be no need for the script.

Copyright © 2006-2010 Dave Dice, All Rights Reserved.

Calendar

Feeds

Search

Links

Navigation

Referers

XML Feed
Creative Commons License
This work is licensed under a Creative Commons Attribution-No Derivative Works 3.0 License.

The views expressed on this [blog; Web site] are my own and do not necessarily reflect the views of Oracle.