Hooray! We're announcing our new SunFire T2000 platform today. Being Sun's first server to incorporate the UltraSPARC T1 processor, it delivers remarkable performance in a rack-optimized 2U enclosure while drawing less than 400W of power.
With prices starting at just $2995, it is probably the most affordable solution in its class on the market today. Here's a brief overview of the SF T2K's main features:
- 32 simultaneous execution threads on a single chip
- 8x1.2GHz cores (with 4 threads on each) connected through a 134GB/sec crossbar switch to a
high-bandwidth, shared, 12-way associative on-chip 3MB L2 cache (read Hugo's blog entry if you think this cache is too small)
- On-chip public key encryption support
- Up to 32GB of DDR2 memory
- 5 PCI slots (3 PCI-E and 2 PCI-X)
- 4 integrated Gigabit Ethernet interfaces
- Up to 4 73GB Small Form Factor SAS (Serial Attached SCSI) disks
The SunFire T2000 is also Sun's first system with support for the Hypervisor. The Hypervisor
is a small firmware layer that runs below the operating system and provides a stable virtualized
machine architecture (sun4v). The sun4v architecture is closely based on the legacy
sun4u architecture of UltraSPARC-III systems, so all sun4u applications just work
on sun4v. The UltraSPARC T1 processor has a new hyper-privileged mode of execution (in addition
to user and privileged modes) so that the processor can distinguish between a legitimate access
to various state registers by the hypervisor and illegitimate accesses by a rogue operating
system. Memory is virtualized as well -- a new intermediate "real" address space is
introduced so that the hypervisor can rearrange physical memory resources without a guest
OS needing to know exactly what happened. The I/O subsystem now also supports virtual devices.
There's much more to say about the Hypervisor, but that's a topic for a separate blog entry.
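To make the "real" address idea a bit more concrete, here is a toy Python sketch of the two translation stages described above: the guest OS maps virtual pages to real pages, and the hypervisor maps real pages to physical pages. The class names, the `migrate` method and the page numbers are all made up for illustration; this is not how the sun4v firmware or Solaris is actually written.

```python
PAGE_SIZE = 8192  # illustrative page size only; the model works with any value


class Hypervisor:
    """Toy model: owns the real -> physical page mapping for one guest."""

    def __init__(self, real_to_phys):
        self.real_to_phys = dict(real_to_phys)

    def translate(self, real_addr):
        page, offset = divmod(real_addr, PAGE_SIZE)
        return self.real_to_phys[page] * PAGE_SIZE + offset

    def migrate(self, real_page, new_phys_page):
        # Hypothetical operation: move the backing physical page.
        # The guest's real address for this page does not change.
        self.real_to_phys[real_page] = new_phys_page


class GuestOS:
    """Toy model: owns the virtual -> real mapping and never sees physical."""

    def __init__(self, virt_to_real):
        self.virt_to_real = dict(virt_to_real)

    def translate(self, virt_addr):
        page, offset = divmod(virt_addr, PAGE_SIZE)
        return self.virt_to_real[page] * PAGE_SIZE + offset


def virt_to_phys(guest, hv, virt_addr):
    """Full translation: virtual -> real (guest), then real -> physical (hv)."""
    return hv.translate(guest.translate(virt_addr))


guest = GuestOS({0: 5})      # virtual page 0 -> real page 5
hv = Hypervisor({5: 100})    # real page 5 -> physical page 100

va = 0x10
real_before = guest.translate(va)
phys_before = virt_to_phys(guest, hv, va)

hv.migrate(5, 200)           # hypervisor rearranges physical memory

real_after = guest.translate(va)
phys_after = virt_to_phys(guest, hv, va)
```

In this model the physical address changes after `migrate`, but the real address the guest works with stays the same, which is the property the extra address space is meant to provide.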
What I would like to talk about today are the scheduling optimizations that
Eric Saxe and I have introduced for
the SunFire T2000 platform. There are two main optimizations:
- Halting of idle CPUs -- when the scheduler idle loop can't find any work to do
for a given hardware thread (of which there are 32), it
makes a special call to the hypervisor layer to stop executing further instructions and wait for an interrupt
to arrive. This is very similar to the way the HLT instruction works on x86, and there are huge benefits
to using it. When the idle loop is looking for work and tries to steal threads from other CPUs, it
actually uses a substantial amount of its core's execution pipeline bandwidth, which otherwise might have been
given to software threads doing some real work. In early development stages we saw a solid boost
in performance when CPU halting was added.
- Core level load balancing -- because the processor has 8 cores, and all of them share a single
L2 cache, one may think that it's not really important to choose which core, and which hardware thread
within that core, runs a given software thread. It turns out to be quite the opposite. Even though the per-core L1 caches
(16K I$ and 8K D$) and TLBs (64-entry fully-associative I & D TLBs) are small and don't usually play any significant role here, the throughput that can be achieved by running two software threads
simultaneously on two different cores vs. on two hardware threads of the same core can be very different.
On UltraSPARC T1, hardware threads within the same core can switch on every cycle if another thread is ready.
Threads also switch on all long latency instructions (loads, stores, fp ops, divisions, etc.) and on
various pipeline events (cache misses, traps, etc.), while fair scheduling of threads within each core is
done using an LRU scheme. For a more detailed description of VT (Vertical Threading) on UltraSPARC T1, go
read Jim Laudon's blog. Given all this, threads that don't have to compete with other software threads
running on the same core will get interrupted less often and as a result will get better throughput.
The way the kernel finds out which CPUs belong to which cores is by analyzing the Machine Description
DAG (directed acyclic graph) provided by the Hypervisor. This way, when we come up with an even more complex CPU topology in the future, we'll
just have to update that description and Solaris will take care of the rest.
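Both optimizations above come down to who gets the issue slots of a core's pipeline. Here is a deliberately crude Python model of that: one core issues one instruction per cycle from the least recently used hardware thread that is not halted. The function name, the "work"/"spin"/"halt" labels and the cycle count are my own inventions, and the model assumes threads never stall (think of the while(1) loops), so treat the numbers as an illustration of the effect rather than a measurement of real hardware.

```python
from collections import deque


def useful_slots(strands, cycles):
    """Count issue slots that go to hardware threads doing real work.

    strands: list of "work" (running a software thread),
             "spin" (idle loop polling for work) or
             "halt" (idle and halted, so it never issues).
    Each cycle the core issues one instruction from the least recently
    used strand that is not halted.
    """
    ready = deque(i for i, state in enumerate(strands) if state != "halt")
    useful = 0
    for _ in range(cycles):
        if not ready:
            break  # every strand is halted, nothing issues
        strand = ready.popleft()      # least recently used strand
        if strands[strand] == "work":
            useful += 1
        ready.append(strand)          # it is now the most recently used
    return useful


CYCLES = 1200

# One busy thread sharing a core with three spinning idle loops.
spinning = useful_slots(["work", "spin", "spin", "spin"], CYCLES)

# Same core, but the three idle hardware threads are halted.
halted = useful_slots(["work", "halt", "halt", "halt"], CYCLES)

# Two busy threads on one core: they split that core's slots.
same_core_total = useful_slots(["work", "work", "halt", "halt"], CYCLES)
per_thread_same_core = same_core_total // 2

# Two busy threads on two different cores: each has a core to itself.
per_thread_two_cores = halted
```

In this model the busy thread gets a quarter of the slots when its siblings spin and all of them when they halt, and two busy threads each get twice as many slots on separate cores as they do when packed onto one. Real workloads stall on memory, so the actual gap is workload dependent.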
As an example of what things look like when simple while(1) loops run on an otherwise idle SunFire T2000
system, here are two snapshots from David Powell's wonderful xlp
utility (download sparc and/or
x86 version) showing 8 and 16 threads getting evenly distributed across all 8 cores.
Monitoring CPU utilization with mpstat is so last century :-)
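If you'd like to play with the idea without a T2000 at hand, here is a small Python sketch that walks a toy description graph to learn which CPUs sit under which core, and then spreads threads across the least loaded cores. The graph layout, the node names (`core0`, `cpu0`, ...) and the `place` policy are assumptions made up for this example; the real Machine Description format and the real Solaris dispatcher are considerably more involved.

```python
from collections import defaultdict


def build_toy_md(cores=8, strands_per_core=4):
    """Build a toy description graph: root -> core nodes -> cpu nodes."""
    md = {"root": {"type": "root", "children": []}}
    for c in range(cores):
        core = f"core{c}"
        md[core] = {"type": "core", "children": []}
        md["root"]["children"].append(core)
        for s in range(strands_per_core):
            cpu = f"cpu{c * strands_per_core + s}"
            md[cpu] = {"type": "cpu", "children": []}
            md[core]["children"].append(cpu)
    return md


def cpus_by_core(md, node="root", core=None, out=None):
    """Depth-first walk recording which core each cpu node sits under.

    The toy graph is tree shaped, so no visited set is needed here.
    """
    if out is None:
        out = defaultdict(list)
    kind = md[node]["type"]
    if kind == "core":
        core = node
    elif kind == "cpu":
        out[core].append(node)
    for child in md[node]["children"]:
        cpus_by_core(md, child, core, out)
    return out


def place(topology, nthreads):
    """Put each thread on a free cpu of the least loaded core."""
    load = {core: 0 for core in topology}
    placement = []
    for _ in range(nthreads):
        free = [c for c in load if load[c] < len(topology[c])]
        if not free:
            raise ValueError("more threads than hardware threads")
        core = min(free, key=load.get)  # ties go to the first core
        placement.append(topology[core][load[core]])
        load[core] += 1
    return placement, load


topology = cpus_by_core(build_toy_md())
placement8, load8 = place(topology, 8)
placement16, load16 = place(topology, 16)
```

With 8 threads every core in this sketch ends up with one thread, and with 16 every core ends up with two, which is the same even spread the snapshots show. Because the placement only looks at the discovered topology, a different graph (more cores, more hardware threads per core) would need no change to `place`.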
We're now working on additional scheduling optimizations which should yield even better performance
on CMT hardware. This is a new and very exciting area for us to be in and we're spending quite a bit
of time with our colleagues from SunLabs trying to figure out how
to squeeze the last bit of performance out of these systems. The tricky part of doing scheduling optimizations
is that you don't want to "overdo" it as the benefits might get washed out by the extra complexity
introduced.