We have come a long way from what now seems a distant memory: systems whose CPU was built from a large number of integrated circuits. Enter the age of the 'system on a chip', heralded by Sun's innovations with the UltraSPARC T2 processor.

This new-age processor, with a second generation released in 2007 right on the heels of the original T1 processor that debuted in 2005, is a paradigm shift from traditional thinking. Having proved itself on the network and application tiers, it has now arrived to fire up database performance. Oracle's TPC-C benchmark has just proved that, using a dozen multi-socket UltraSPARC T2 systems coupled with other innovations from Sun. The blazing-fast 5100 storage is a key component, but we focus on the 'system' in this blog.

Chip Multi Threading (CMT)

The T2 processor has eight cores, each supporting eight threads across two execution units. While other processors are playing catch-up with this new design philosophy, running fewer threads at higher clock speeds, the T2 processor supports sixty-four threads in all. It also integrates networking, security and I/O on the chip: 10 Gb Ethernet, 8-lane PCI-Express, eight floating-point units and eight cryptographic processing units. The T2 Plus processor lends itself to multi-socket designs, using the networking function to provide coherency links.

In the last couple of decades, performance improvements have largely come from increasing clock speeds and enhancing instruction-level parallelism (ILP) by means such as multiple instruction issue, out-of-order execution and branch prediction. However, we have witnessed diminishing returns, forced by two constraints, namely:

- the level of instruction-level parallelism possible in today's commercial applications
Commercial applications tend to have low ILP due to large working sets and poor locality of reference on memory access, resulting in poor cache hit rates. Prediction also becomes tough on data-dependent branches, so the work discarded after a misprediction is very expensive on complex, power-hungry designs. Increasing power usage and heat dissipation arising from design complexity further compound the problem.

- memory latency
It is common knowledge that CPU speeds have grown by orders of magnitude more than the rest of the system, including memory. A cache miss, and the resulting slower access to memory, leaves the processor idle for many clock cycles each time.

Contrast this with the break-away approach of the Sun design in the T2 processor, characterized by a more rounded architecture. The T2 processor provides a dramatic increase in the number of threads and a high-bandwidth memory subsystem. Each of the eight processor cores supports two groups of four threads, each group using one of the two Integer Execution Units. On every clock, one instruction from each thread group is picked up by one Execution Unit (EXU): that of the Least Recently Fetched among the ready threads.
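
The pick policy described above can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not a description of the actual hardware logic: the names `ThreadGroup`, `pick` and `run` are hypothetical, and so is the rule that ties go to the lowest-numbered thread.

```python
class ThreadGroup:
    """Toy model of one thread group: four hardware threads sharing one execution unit."""

    def __init__(self, thread_ids):
        self.thread_ids = list(thread_ids)
        # Cycle at which each thread was last picked; -1 means "never picked yet".
        self.last_picked = {tid: -1 for tid in self.thread_ids}

    def pick(self, cycle, ready):
        """Return the ready thread that was picked least recently, or None if none is ready."""
        candidates = [tid for tid in self.thread_ids if tid in ready]
        if not candidates:
            return None  # every thread is stalled, so the execution unit idles this cycle
        # min() returns the first of several equal keys, so ties go to the
        # lowest-numbered thread (an assumption made for this sketch).
        chosen = min(candidates, key=lambda tid: self.last_picked[tid])
        self.last_picked[chosen] = cycle
        return chosen


def run(group, ready_per_cycle):
    """Drive the group through a sequence of per-cycle ready sets and record the picks."""
    return [group.pick(cycle, ready) for cycle, ready in enumerate(ready_per_cycle)]


if __name__ == "__main__":
    group = ThreadGroup([0, 1, 2, 3])
    # Thread 2 is stalled (say, on a cache miss) during cycles 2 to 4.
    ready_sets = [{0, 1, 2, 3}, {0, 1, 2, 3}, {0, 1, 3}, {0, 1, 3}, {0, 1, 3}, {0, 1, 2, 3}]
    print(run(group, ready_sets))
```

In this model a stalled thread is simply skipped, and because it has waited longest it is the first to be picked once it becomes ready again.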

The hardware hides memory access delays and pipeline stalls by scheduling other threads onto the execution pipe, with zero cycle penalty on the context switch. Rather than have the processor idle on a cache miss, the T2 processor takes up another thread on the next clock, aided by context switching in hardware. Each thread has its own program counter, one-line instruction buffer and register file. Each EXU contains state for four threads, and the Integer Register File (IRF) contains 8 register windows for every thread. The large memory bandwidth available ensures a steady supply of data for the many threads and cores.
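
To see why switching threads on a miss helps, here is a minimal sketch of one pipeline shared by several threads. It is a toy model, not a T2 simulator: the function name `pipeline_utilization` and all the numbers (5 instructions between misses, a 15-cycle miss latency) are assumptions chosen only to make the effect visible.

```python
def pipeline_utilization(num_threads, compute_cycles, miss_latency, total_cycles):
    """Fraction of cycles in which one pipeline issues an instruction.

    Toy model: each thread issues `compute_cycles` instructions, then waits
    `miss_latency` cycles on a cache miss, and repeats. Switching between
    threads costs zero cycles.
    """
    remaining = [compute_cycles] * num_threads  # instructions left before the next miss
    ready_at = [0] * num_threads                # first cycle at which each thread may issue
    last_issue = [-1] * num_threads             # used to find the least recently issued thread
    busy = 0
    for cycle in range(total_cycles):
        ready = [t for t in range(num_threads) if ready_at[t] <= cycle]
        if not ready:
            continue  # all threads are waiting on memory, so the pipeline idles
        t = min(ready, key=lambda i: last_issue[i])
        last_issue[t] = cycle
        busy += 1
        remaining[t] -= 1
        if remaining[t] == 0:
            # The thread misses in the cache and cannot issue until the data returns.
            remaining[t] = compute_cycles
            ready_at[t] = cycle + 1 + miss_latency
    return busy / total_cycles


if __name__ == "__main__":
    for threads in (1, 2, 4, 8):
        print(threads, pipeline_utilization(threads, 5, 15, 2000))
```

With these assumed numbers a single thread keeps the pipeline busy for 5 of every 20 cycles, and adding threads raises that fraction. How much it rises depends entirely on the assumed run length and miss latency.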

The contrarian design yields the following advantages:
- Reduced thermal envelope by simplifying processor design
- Increased overall system throughput with 64 threads
Together, the above two factors offer a higher 'Performance per Watt.' 

For the moment, the majority of commercial applications fall short of maximizing the throughput potential of this new design. There are also applications that depend heavily on single-thread performance, for which processor clock speed remains the single most important factor.

As with all new thinking, a new application development approach is needed. The following are the critical factors:
- The quantum of work done should be large enough to utilize the high thread count.
- The work needs to be broken into independent units, bringing in parallelism. Compilers with optimization options take care of the obvious opportunities, while the programmer needs to take care of the not-so-obvious possibilities for code parallelization.
- Load balancing across threads becomes key.
- Thread synchronization penalties and thread creation overheads need to be controlled.
- Linking to high-performance libraries supplied by the vendor can provide a substantial performance increase.
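
The factors above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not a recipe from Sun: the workload (`sum_of_squares`) and the helper names are hypothetical, and the worker count of 64 simply mirrors the thread count of one T2 processor.

```python
from concurrent.futures import ThreadPoolExecutor

HARDWARE_THREADS = 64  # thread count of one T2 processor, per the text above


def split_into_chunks(items, num_chunks):
    """Split `items` into `num_chunks` contiguous, independent pieces of near-equal size."""
    base, extra = divmod(len(items), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)  # spread the remainder for load balance
        chunks.append(items[start:start + size])
        start += size
    return chunks


def sum_of_squares(chunk):
    """Hypothetical unit of work: it touches only its own chunk, so no locking is needed."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, workers=HARDWARE_THREADS):
    """Compute the sum of squares by farming independent chunks out to a worker pool."""
    chunks = split_into_chunks(items, workers)
    # One pool serves all chunks, so thread creation is a single setup cost.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    # The only synchronization point is this final combine step.
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1_000_000))))
```

The structure is the point here: enough work to fill every thread, chunks that share nothing, even chunk sizes, and a single combine step. Note that in CPython, threads do not execute Python bytecode in parallel, so for CPU-bound work a process pool (`ProcessPoolExecutor`) would be the usual substitute.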

As with any paradigm shift, support for this approach will keep growing, helped along by the direction almost all mainstream processor technologies are taking. Sun is clearly in the driver's seat, with the big lead it has in the UltraSPARC T2 processor and the Solaris Operating System, the proven platform for multi-threaded workloads.

Comments:

Samson,
Excellent overview on the Ultra Sparc T2 Processor.
A perfect overview to go along with the Datasheet..
Regards,
Isaac

Posted by Isaac Devairakkam on November 18, 2009 at 04:34 PM IST

This blog copyright 2010 by samson