Hardware-Assisted Transactional Read-Set Validation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Dave Dice, Nir Shavit A set of concurrently executing threads can use either pessimistic or optimistic concurrency control to safely moderate access to shared data. Pessimistic concurrency control prevents undesirable or inopportune interleavings by means of mutual exclusion (locks) while optimistic concurrency control detects and recovers from inopportune interleavings. Optimistic concurrency control mechanisms can be expressed by way of transactions. A transaction typically has an operational phase, where the transaction reads shared variables (the read-set) and produces a tentative or contingent set of updates (the write-set) transiently stored in a private buffer, followed by a commit step, where the read-set is validated and updates contained within the write-set are made public or visible to other threads. If read-set validation fails the transaction aborts, discarding the accumulated write-set. Presumably the read-set values are used as inputs to produce the write-set. If the read-set changed during the midst of a transaction because of concurrent updates performed by other threads -- that is, the read set became inconsistent -- the resultant write-set is invalid. In this case the application code will typically retry the transaction. Note that commit is simply a specialized form of multi-word CAS (MWCAS). Transactional support can be implemented in hardware, software, or a hybrid of hardware and software. Consider a protocol that allows concurrent threads to update a variable without risk of interference or data races. A pessimistic form would acquire a lock that was, by convention, associated with the variable, load the variable into a register, increment the register, store the register contents back into the variable, and finally release the lock. An optimistic form could be implemented with the commonly available compare-and-swap (CAS) instruction by loading the variable into a register, incrementing the register (forming a tentative new value) and then using CAS to try to install the value over the original value. The CAS will fail if some other thread asynchronously changed the variable's value between the load and the CAS. In that case the operation must be retried. CAS provides a single-word transaction. The CAS-based form has the advantage of being lock-free; in the lock-based form if the owner is preempted or stalls other threads trying to increment the variable may be impeded. That risk does not appear in the CAS-based form. Forthcoming hardware transactional memory (HTM) implementations will implement the read-set and write-set and constructs in hardware. The read-set consists of the set of locations read during a transaction, while the write-set holds the set of contingent deferred stores. The processor tracks the consistency of the read-set during the transaction, aborting the transaction if the read-set happens to be modified by another processor before the transaction can commit. Critically, the commit will abort unless the read-set remains consistent (unmodified by external agents). The capacity of the hardware read-set and write-set will likely be restricted in first generation HTM implementations. Transactionals that exceed the those constraints are infeasible and will always abort. Transactional memory may also be implemented in software. For instance Transactional Locking or TL (Transactional Locking) employs versioned write-locks. (A versioned write-lock is similar to the SeqLock construct used in the linux kernel). Each shared variable is associated with one write-lock. In the preferred TL embodiment the versioned write-lock consists of a word where the low-order bit serves as a write-lock, and the remaining bits form a version number. A transactional write does not update the target shared variable, but instead the address and speculative value are held in the write-set. A TL transactional read fetches both the the lockword and the variable. The write-lock bit of the lockword must be inspected to ensure that the variable is currently unlocked. Assuming that is the case, the TL load operator then saves the address of the lockword and observed lockword version into the software-maintained read-set. At commit-time TL acquires the locks associated with the write-set and then checks the read-set to ensure that the previously observed lockword versions kept in the read-set still match the current lockword versions. If not, the transaction aborts, otherwise the transaction is deemed successful and the commit operator writes-back the contents of the write-set into their ultimate locations (the shared variables). Finally, the commit operator releases the write-set locks and increments the lockword version numbers. Read-sets and write-sets are thread-local software constructs. Whereas a typical HTM tracks the consistency of the read-set by leveraging the existing snoop-based MESI cache coherency mechanism, TL tracks consistency with explicit version numbers maintained in the versioned write-locks. We propose decoupling the transactional read-set and the write-set. A novel hardware assist mechanism, described below, will use the processor's existing snoop-based cache-coherency mechanism to track the read-set and ensure consistency. The write-set is managed entirely in software with locking, as described in the TL paper. Given that read-sets tend to be much larger than write-sets and that read-set maintenance costs tend to dominate the cost of transactions we believe that using hardware for read-sets but locks for write-sets is a viable approach. Note that Purcell and Harris [PUHA04] describes a mechanism that could be used to track and ensure read-set consistency, their scheme requires two passes over the read-set where our mechanism does not require a 2nd pass. Representative embodiment of the proposed mechanism ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We add two new registers to the processor: 1. TXSEQUENCE is readable and writable in user-space. For the purposes of explication we'll say the register is 8 bits wide, allowing for 256 possible unique values. 2. INTERFERENCE is a 64-bit register readable in user-space, but not writable. We add a new dc_txsequence field to the cache line layout. (A typical cache line already contains tag, valid, MESI/MOESI coherency state bits, and data). Each time a word is loaded from the cache line the processor will set the line's dc_txsequence field to value contained in the processor's TXSEQUENCE register. These facilities are used as follows. At the start of a transaction the thread increments the TXSEQUENCE number and then reads and saves the INTERFERENCE register as part of the transaction metadata. The TL transaction proceeds, executing reads from both the read-set variables and the versioned locks "covering" those read-set variables. Unlike the classic TL described above, TL augmented to use hardware-assisted read-set consistency checking will forgo saving the lock address and version in the read-set. (The read-set is neither constructed during the transaction nor consulted at commit-time). The modified processor will store the current TXSEQUENCE value into the dc_txsequence field on each read (load). If some other processor modifies (stores to) that line, the line will be invalidated and displaced from the cache. When a line is displaced from the cache the processor will check the line's dc_txsequence field against the TXSEQUENCE register. If equal, the processor increments the INTERFERENCE register. This is a case of remote eviction. Relatedly, if a line associated with the read set is displaced through a capacity or conflict miss due to processor-local operations, the CPU will again check the dc_txsequence field against TXSEQUENCE and conditionally increment INTERFERENCE. This is a case of self-eviction. (Since we use the cache to track coherency, if a a read-set line "leaves" the data cache the processor loses the ability to track coherency and detect remote updates. In that case the processor must conservatively increment INTERFERENCE). Put simply, we use the coherency protocol used by the data cache to detect remote interference against the read-set. To use the data cache in this manner the read-set must remain continuously present in the data cache from the 1st load until the commit. At commit-time we use the TL protocol to acquire write-locks (if any) in the normal fashion. We then check the current INTERFERENCE value against the value previously saved at the start of the transaction. If they remain the same then the read-set is known to be consistent at that instant in time and the transaction is successful -- the write-set will be spilled, the write-set locks released, and the write-set version numberis will be incremented in the normal fashion. If the INTERFERENCE value differs then the read-set *may* be inconsistent and we abort. Because of aliasing on the TXSEQUENCE values or local capacity and conflict eviction our mechanism admits false-positive aborts. In the case of an abort we can simply retry the transaction but revert to the software-only TL scheme to provide read-set validation. Using Our scheme allows us to eliminate read-set construction and provides for very low-latency read-set consistency validation in the majority of cases. In addition it's likely that we allow larger read-sets and transactions than will first generation HTMs. Discussion ~~~~~~~~~~ * As described, the hardware assisted read-set validation mechanism requires that we add a dc_txsequence field to the cache. As caches are often built from SRAM this extra field could be costly. Instead of adding dc_txsequence to each line of cache we might instead implement a dedicated coherency tracking cache akin to the Store Miss Accelerator Cache [SMAC] described in MICRO5. No data would be stored in this specialized cache, which might have higher associativity but less capacity. The coherency cache would participate in the usual cache coherency protocols used by the primary data cache. Relatedly, we might implement the coherency cache as a simple low-associativity cache but augment it with a high-associativity "victim cache", similar to those used to augument the direct-mapped UltraSPARC L2 caches. A small but high-associativity victim cache would have the benefit of providing reasonable worst-case guarantees for read-sets. (That is, if the victim cache was 32-way, we would be guaranteed that transactions having a a read-set of 32 of fewer locations would not suffer capacity or conflict displacement). * Alternately, we could implement hardware assisted read-set validation with a hardware Bloom filter. Note that the Bloom filter is probabilistic and admits false-positive interference detection outcomes. Each processor would have a local Bloom filter. As elements are added to the read-set, the corresponding bits in the Bloom filter are set by the hardware. The input to the hash function of the Bloom filter is a physical address. The Bloom filter passively monitors snoop invalidations sent by remote processors, and, if an address being invalidated or the subject of an RTO (m-state) coherency transactions matches in the local Bloom filter, inteference has been detected. The size of the Bloom filter is directly related to the false+ interference detection rate. * It is useful to augment INTERFERENCE with status bits that indicate _why the last increment was performed, indicating whether local capacity-conflict displacement or displacement caused by remote update. If INTERFERENCE was updated because of capacity or conflict displacement then retrying the same operation using the hardware assisted mode where software does not track the read-set is likely to be futile. In that case we should revert immediately to traditional TL where the TL transactional infrastructure logs and validates the read-set. If the operation failed because of remote interference, however, retrying again using the hardware assist is likely the best policy. * Wider TXSEQUENCE and dc_txsequence fields will reduce the rate of false-positive INTERFERENCE aborts. Since TXSEQUENCE can wrap-around, old, residual lines in the cache that are not part of the current read-set might have dc_txsequence values that inadvertently match the current TXSEQUENCE register. This condition does not adversely affect safety or correctness, but it might impact performance. * The processor must increment INTERFERENCE on all taken traps or exceptions. This permits TL to be used by the kernel- and user-mode or by different threads on the same processor. * Our mechanism is not specific to TL, and could be used with other software or software-hardware transactional memory implementations. * Unlike HTMs, TL and TL augmented with hardware-assisted read-set validation are linearizable. * If multiple processors share common cache, as may be the case in modern SMT/CMP/CMT systems, such as Sun's Niagara T2000, then the implementation would install and track both the logical_processor_id (also known as the "strand_id") and the TXSEQUENCE in the cache, instead of simply the TXSEQUENCE as would be done in a traditional system where caches were not shared amonst processors. * The read-set assist mechanism provides _strong_atomicity_ [BLUN05] whereas most STMs or hybrid HTM/STM implementations provide only _weak_atomicity_. That is, transactions using read-set assist will interoperate properly with non-transactional stores. * As described above, transactional code would re-read the INTEFERENCE counter at commit-time. Alternately, when the processor incremented INTERFERENCE it could also raise an exception and generate a trap into a software handler. Bibliography - Transactional memory hardware ~~~~~~~~~~~~ [VMT05] Virtualizing Transactional Memory Ravi Rajwar, Maurice Herlihy, and Konrad Lai 32nd Annual International Symposium on Computer Architecture (ISCA), June 2005. [SLE01] Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and James R. Goodman 34th International Symposium on Microarchitecture (MICRO), December 2001. ACM SIGMICRO 2001 [TLR02] Transactional Lock-Free Execution of Lock-Based Programs Ravi Rajwar and James R. Goodman ASPLOS-10, October 2002. Describes TLR - Transactional Lock Removal [TCC] Transactional Memory Coherence and Consistency Lance Hammond, Vicky Wong, Mike Chen, Ben Hertzberg, Brian D. Carlstrom, John D. Davis, Manohar K. Prabhu, Honggo Wijaya, Christos Kozyrakis, and Kunle Olukotun ISCA-31, 2004 [UTML] Hardware Support for Unbounded Transactional Memory. Masters thesis, Massachusetts Institute of Technology. Sean Lie (May 2004). [UTMA] Unbounded Transactional Memory. C. Scott Ananian and Krste Asanovic and Bradley C. Kuszmaul and Charles E. Leiserson and Sean Lie. HPCA-11, 2005. [TLTM] Thread-Level Transactional Memory. Kevin E. Moore and Mark D. Hill and David A. Wood Mar 2005 Technical Report: CS-TR-2005-1524, Dept. of Computer Sciences, University of Wisconsin, :1--11 [AZUL] Speculative Locking - Breaking the Scale Barrier http://developers.sun.com/learning/javaoneonline/2005/coreplatform/TS-7429.pdf Bibliography - Software Transactional Memory ~~~~~~~~~~~~ [WEL04] Transactional Monitors for Concurrent Objects Adam Welc, Suresh Jagannathan, Anthony L. Hosking ECOOP'04 [ANA05] Efficient Software Transactions for Object-Oriented Languages C. Ananian, M Rinard In Proc. of Synchronization and Concurrency (SCOOL05) [SHA97] Software Transactional Memory N. Shavit, D. Touitou Distributed Computing 10,2 (February 1997) [HAR03] Language Support for Lightweight Transactions Tim Harris, Keir Fraser ACM OOPSLA 2003 [HER93] Transactional memory: architectural support for lock-free data structures M. Herlihy, E. Moss ISCA 1993 [MAR04] Design Tradeoffs in Modern Software Transactional Memory Systems V. Marathe, J. Scherer, M. Scott LCR 2004 [MAR05] Adaptive Software Transactional Memory V. Marathe, J. Scherer, M. Scott DISC 2005 [ENNA05] Software Transactional Memory Should Not Be Obstruction Free Robert Ennals Submitted to SCOOL 2005 [ENNA05+] Cache-Sensitive Software Transactional Memory Robert Ennals Submitted to SPAA 2005 [ANAN05] Efficient Object-Based Software Transactions C. Scott Ananian, Martin Rinard SCOOL'05 [ROCH05] Hardware Acceleration of Software Transactional Memory Shriraman, A., Marathe, V., Dwarkadas, S., Scott, M.L., Eisenstat, D., Heriot, C., Scherer, W.N. III, Spear, M.F., TR 887, Computer Science Dept., U. Rochester, December 2005. [SAHA06] A High Performance Software Transactional Memory System for a Multi-Core Runtime Bratin Saha, Ali-Reza Adl-Tabatabai, Richard Hudson, Chi Cao Minh, Ben Hertzberg (To Appear) PPoPP 2006 [LIBLTX] http://libltx.sourceforge.net/ [LIBCMT] http://libcmt.sourceforge.net/ [SMAC] Store Memory-Level Parallelism Optimizations for Commercial Applications Yuan Chou, Lawrence Spracklen and Santosh G. Abraham IEEE MICRO 2005 [TL06] What Really Makes Transactions Faster? Dave Dice, Nir Shavit Submitted to TRANSACT'06. Describes the Transactional Locking (TL) mechanism. Sun Invention Disclosure # SUN060720 [HYTM] Hybrid Transactional Memory Mark Moir Submitted to ASPLOS 2006 Bibliography - General ~~~~~~~~~~~~ [TMBIB] http://www.cs.wisc.edu/trans-memory/biblio/hwtm.html [BLUN05] Deconstructing Transactions: the subtleties of atomicity. C. Blundell, E. Lewis, M. Martin Fourth annual workshop on Duplicating, Deconstructing, and Debunking, 2005. (WDDD'05). [Her91] Wait-free Synchronization Maurice Herlihy ACM Transactions on Programming Languages and Systems (TOPLAS) Volume 13, Issue 1. January 1991 http://doi.acm.org/10.1145/114005.102808 [Her93L] A Methodology for Implementing Highly Concurrent Data Structures Maurice Herlihy ACM TOPLAS November 1993 [VAL95] John Valois Lock-Free Data Structures Ph. D. Dissertation - Rensselaer Polytechnic Institute, 1995. [GRE99] Michael Greenwald Non-Blocking Synchronization and System Design. Ph. D. Dissertation - Stanford University, 1999. [FRA03] Keir Fraser Practical Lock-Freedom Ph. D. Dissertation - King's College, Univerity of Cambridge, 2003. (Final version, February 2004) http://www.cl.cam.ac.uk/~kaf24/papers/phd.pdf [FH04] Keir Fraser, Tim Harris Concurrent Programming without Locks 2004 http://www.cl.cam.ac.uk/Research/SRG/netos/papers/2004-cpwl-submission.pdf [HAR05] Revocable locks for non-blocking programming. Tim Harris and Keir Fraser ACM PPoPP June 2005. [DIGA2] Mostly Lock-Free Malloc Dave Dice, Alex Garthwaite ISMM02 [DG04] Supporting per Processor Local-allocation Buffers Using Multi-Processor Restartable Critical Sections Dave Dice, Alex Garthwaite, Derek White Sun Laboratories Technical Report TR-2004-126 [DGPA] US Patent # US06799236 Method and Apparatus for executing code while avoiding Interference Dave Dice, Alex Garthwaite Issued Sept 24, 2004 [JVM97] The Java Virtual Machine Specification T. Lindholm, F. Yellin Addison-Wesley 1997 [JLS3E] The Java Language Specification, 3rd Edition Gosling, Joy, Steele, Bracha. 2005. http://java.sun.com/docs/books/jls/download/langspec-3.0.pdf Chapter 17 describes the clarified Java Memory Model (JMM) defined by JSR166. [AGE99] An efficient meta-lock for implementing ubiquitous synchronization. O. Agesen et al. ACM SIGPLAN International Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA), 1999. [RLP01] Dave Dice Implementing Fast Java Monitors with Relaxed-Locks. USENIX JVM'01 http://www.usenix.org/events/jvm01/full_papers/dice/dice.pdf [RLP02] Relaxed Lock Protocol US Patent # 6,735,760 [BARN] A method for implementing lock-free shared data structures G Barnes SPAA 1993. [PUG02] Atomic Instructions in Java ECOOP 2002. David Hovemeyer, William Pugh, Jaime Spacco [CSPE] CriSPE: Critical Seciton Pre-Execution Min Xu, Vinod Ganapathy [PUHA04] Implementing Multi-word Atomic Snapshots on Current Hardware (short) Chris Purcell, Tim Harris. 2004 PODC: 23th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing