UltraSPARC T1 and T2 performance

Tuesday Dec 06, 2005

UltraSPARC T1 - low power and the SWAP metric

Low power was a key design point of the UltraSPARC T1 "Niagara" processor from the
ground up. Unlike other processor designs, we took a simpler approach to the pipeline.
Avoiding deep and complex out-of-order pipelines reduced power consumption to less
than 5 watts per core. We also made the L2 cache smaller and more efficient, which
is great for throughput and consumes a lot less power. The total power consumption
for the chip is typically less than 70 watts, about half that of comparable x86 processors.

Soon after first silicon we started to measure the power drawn by
the processor and DDR memory. Using our Fireball bringup system we measured the current
drawn by modifying the motherboard to attach test points to the memory and CPU voltage
regulators. The result was a very ugly setup involving an oscilloscope.

We ran Oracle, Java and SAP workloads on the box while taking the measurements.
None of these really stressed the memory bandwidth, however, which has a theoretical max
of 25GB/s. We switched to a memory exerciser that could push about 14GB/s to the
memory subsystem.

With the arrival of p0.0 Ontarios, power measurement began in earnest. We used a number
of different memory exercisers. We took measurements with 8GB, 16GB and 32GB of
memory, with and without the PCI-X and PCI-E slots occupied. We also got a breakdown of
the power costs of the major components of the system:


  • UltraSPARC T1 processor
  • The DDR2 memory
  • The 4 internal drives
  • The fans
  • The power supply efficiency
  • The PCI-E and PCI-X slots

All these components make up the total system power.

These tests proved that even under the heaviest load the T2000 did not require the
550 Watt power supplies we had initially specified. The most we measured was 340 Watts
fully loaded. A system running at 340 Watts is not in the efficient range of a
550 Watt supply. We decided to switch to a 450 Watt supply, which
is far more efficient in the 340 Watt range. This represented a significant power saving.
The transition is currently in progress.

These initial T2000 power measurements were done with a Voltech AC Power Analyzer and
a Tektronix 745 oscilloscope after modifying the power cord. This setup
was not practical for most benchmark environments and did not scale as the number
of systems increased.

The solution we found was a simple power analyzer and watt
meter called "Watts Up", which is available from Fry's for about $70.
The meter is simple to use: it provides a socket for the T2000 power cable, and
the meter itself is plugged into the wall socket. You need two for the dual power
supplies of the T2000.

We bought a bunch of these meters and put two on each of the benchmark configurations
we were testing. We also used these meters on comparable Xeon systems. When running
a workload on a T2000 we often run it in parallel on a Xeon system and take
performance and power measurements for both.

All our measurements on T2000 showed huge reductions in power consumption and
heat generated relative to other servers. At the same time the performance on
the T2000 was better than these systems.

We were also presenting our technology to many customers at the time and were hearing
again and again how power and cooling were becoming a critical factor in server deployment.
Datacenters are at the limits of how much power can be brought in and how much hot air
can be extracted. Racks can only be deployed half full or require a large amount
of space around them.

Customers were extremely interested in our low power technology. But we needed a
metric to combine performance with power consumption and the space taken up by
racks of systems in the datacenter.

To achieve this, Rick Hetherington, Distinguished Engineer and Chief Architect for the
UltraSPARC T1, and Mat Keep and his team developed the very elegant "Space, Watts and
Performance (SWaP)" metric, calculated using the following formula:



           Performance
    -------------------------
    Space x Power Consumption


  • Performance is any benchmark, such as an industry standard or customer-specific workload
  • Space is the height of the server in rack units (RU)
  • Power is the Watts consumed by the system, taken either from specs or, even better, by measuring the draw
    with a meter during the actual workload.

An example:

Suppose two systems deliver the same throughput performance, say a score of 500, but one
is 2RU and draws only 300 Watts while the other is 3RU and draws 800 Watts.

The SWaP rating of the first would be 500 / (2 x 300) = 0.83 and the second 500 / (3 x 800) = 0.21.

The first system has a 4X advantage over the second when deployed in the datacenter.
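
As a rough sketch, the formula can be written as a small Python function. The performance score of 500 is an assumption (it is the value that yields a rating of 0.83 for the 2RU, 300 Watt system), and the function and variable names are illustrative only.

```python
def swap_rating(performance, rack_units, watts):
    """Return performance / (space x power), the SWaP formula from the text."""
    if rack_units <= 0 or watts <= 0:
        raise ValueError("rack_units and watts must be positive")
    return performance / (rack_units * watts)


# Assumed score: both systems deliver the same throughput of 500 units.
PERFORMANCE = 500

first = swap_rating(PERFORMANCE, rack_units=2, watts=300)
second = swap_rating(PERFORMANCE, rack_units=3, watts=800)

print(f"first:  {first:.2f}")
print(f"second: {second:.2f}")
print(f"ratio:  {first / second:.1f}x")
```

Because both systems are assumed to deliver the same performance, the score cancels out of the ratio, which then depends only on space and power.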

For a full description see:

www.sun.com/servers/coolthreads/swap

We really believe SWaP is good for customers as it shows the true cost of deploying
a system in a datacenter. We hope it becomes an industry standard approach to
power and space management in the datacenter.



UltraSPARC T1 large page projects in Solaris

A Translation Lookaside Buffer (TLB) is a hardware cache that is used
to translate a process's virtual address to a physical memory address.
UltraSPARC T1 has a 64-entry Instruction TLB and a 64-entry Data TLB per
core. The unit of translation is the page. UltraSPARC T1 supports
4 page sizes: 8k (the default), 64k, 4MB and 256MB. When memory is
accessed and the mapping is not in the TLB, this is termed a TLB miss.
Excessive TLB misses are bad for performance.

The 64-entry TLBs are relatively small compared to those of other current SPARC processors.
They have the advantage, however, that you can mix and match all page
sizes in the TLB, i.e. the TLB does not have to be programmed to a particular page size.
One entry can be 8k, for instance, and the next one can be 256MB.
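
To get a feel for why page size matters with only 64 entries, the following sketch computes the "reach" of a TLB, i.e. how much address space its entries can map at once. The 60/4 split in the mixed case is a made-up example, and the names are illustrative.

```python
KB = 1024
MB = 1024 * KB

# Page sizes supported by UltraSPARC T1, as listed in the text.
PAGE_SIZES = {"8k": 8 * KB, "64k": 64 * KB, "4MB": 4 * MB, "256MB": 256 * MB}
TLB_ENTRIES = 64  # per ITLB and per DTLB, per core


def tlb_reach(entry_page_sizes):
    """Bytes of address space covered by a TLB holding the given entries."""
    if len(entry_page_sizes) > TLB_ENTRIES:
        raise ValueError("more entries than the TLB can hold")
    return sum(PAGE_SIZES[name] for name in entry_page_sizes)


# Reach when every entry uses the same page size.
uniform_reach = {name: tlb_reach([name] * TLB_ENTRIES) for name in PAGE_SIZES}

# A mixed TLB: hypothetical split of 60 8k entries and 4 256MB entries.
mixed_reach = tlb_reach(["8k"] * 60 + ["256MB"] * 4)

for name, reach in uniform_reach.items():
    print(f"64 x {name:>5}: {reach // KB} KB")
print(f"mixed:      {mixed_reach // KB} KB")
```

With 8k pages alone a full TLB covers only 512 KB, which is why the large page projects described below matter so much on this processor.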

We knew that TLB performance was going to be critical for UltraSPARC T1. Early on
we started a number of Solaris projects to provide optimal TLB performance
for the processor.

The first project was MPSS for Vnodes (VMPSS), also known as large
pages for text and libraries. Before this project, binary text and library
segments were always placed on 8k pages. For large binaries such as Oracle
or SAP, or applications with a large number of libraries, this resulted in a
high number of ITLB misses per second.

VMPSS provides in-kernel infrastructure and mechanisms so that large pages
can be used with file mappings that are text and initdata segments of binaries
and libraries. Text and libraries are placed on the largest page size possible.
For smaller binaries and libraries this is usually 64k pages, but for bigger
binaries such as Oracle it is 4MB pages.

Use pmap -xs to see the page size that has been allocated. The first entries in
the output are for the binary itself; libraries are usually towards the end of the listing.

14148:  /usr/sap/SSO/SYS/exe/run/saposcol
         Address   Kbytes      RSS     Anon   Locked Pgsz Mode     Mapped File
0000000100000000      320      320        -        -  64K r-x--    saposcol

The performance gains on UltraSPARC T1 were significant, up to 10% on some Oracle
workloads.

The second TLB-related project was Large Pages for Kernel Memory, which provides
large pages for the kernel heap. The kernel is a particularly bad TLB miss
offender. Code generally spends less time in the kernel, so
on entry to the kernel the TLB is usually cold. Prior to this project the kernel heap
was mapped on 8k pages. We saw moderate performance gains with this project.

The third project was Large Page OOB (out-of-the-box) Performance, or MPOOB.
The Multiple Page Size Support (MPSS) project in Solaris 9 added support
for page sizes other than 8k. With MPSS, environment variables had
to be set and the mpss.so.1 library preloaded before running an application.
The aim of the MPOOB project was to bring the benefits of large pages to a
broader range of applications out-of-the-box, without requiring
the MPSS variables.

MPOOB affects the allocation of heap, stack and anon pages.

Again, use pmap -xs to check whether large pages were obtained:

0000000104858000       32       32       32        -   8K rwx--    [ heap ]
0000000104860000     3712     3712     3712        -  64K rwx--    [ heap ]
0000000104C00000     8192     8192     8192        -   4M rwx--    [ heap ]

This is a huge win for our customers, freeing them from the need to set environment
variables to tune the TLBs.

If for some reason pages are not allocated correctly by default, the MPSS variables
can still be used to override. See the manual entry for mpss.so.1 for details.

The fourth large page project was support for 256MB pages, aka giant pages, on
UltraSPARC T1 systems. This support was actually added as part of the
UltraSPARC IV+ project; however, the TLB programming is different on Niagara.

For an allocation to be a candidate for 256MB pages it must have the following
characteristics:

- At least 256MB in size
- Be aligned on a 256MB address boundary

Giant pages can only be allocated on a 256MB address boundary. If an
allocation larger than 256MB does not start on such a boundary, Solaris will
attempt to allocate 8k, 64k and 4MB pages until the next boundary is reached
and then use 256MB pages from there on.
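
The alignment rule can be illustrated with a simple greedy sketch: at each address, pick the largest page size that is aligned there and still fits in what remains. This is only an illustration of the boundary behaviour, not the actual Solaris allocator, which also depends on whether free large pages are available. The function name `plan_pages` is hypothetical.

```python
KB = 1024
MB = 1024 * KB

# Page sizes supported by UltraSPARC T1, largest first.
PAGE_SIZES = [256 * MB, 4 * MB, 64 * KB, 8 * KB]


def plan_pages(start, length):
    """Greedily cover [start, start + length) with the largest aligned pages.

    Returns a list of (address, page_size, page_count) runs.
    """
    if start % (8 * KB) or length % (8 * KB):
        raise ValueError("start and length must be multiples of 8k")
    runs = []
    addr, end = start, start + length
    while addr < end:
        # Largest page that is aligned at addr and still fits before end.
        size = next(s for s in PAGE_SIZES if addr % s == 0 and addr + s <= end)
        if runs and runs[-1][1] == size:
            run_addr, _, count = runs[-1]
            runs[-1] = (run_addr, size, count + 1)
        else:
            runs.append((addr, size, 1))
        addr += size
    return runs


# The heap from the earlier pmap listing: starts at 0x104858000, 11936 Kbytes long.
for address, size, count in plan_pages(0x104858000, 11936 * KB):
    print(f"{address:016X} {count * size // KB:>9} KB on {size // KB}K pages")
```

Run against the heap shown earlier, the sketch walks up through 8k and 64k pages until it reaches a 4MB boundary, which is the same shape as the pmap output.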

One of the biggest performance gains from giant pages is in the Oracle SGA
which is allocated as System V shared memory. If the SGA is large it should
be allocated on giant pages. Again, use pmap -xs to confirm:

0000000380000000 25427968 25427968        - 25427968 256M rwxsR    [ ism shmid=0x3 ]
0000000990000000    16384    16384        -    16384   4M rwxsR    [ ism shmid=0x3 ]
0000000991000000       56       56        -       56   8K rwxsR    [ ism shmid=0x3 ]

In the previous example roughly the first 24GB of the SGA (25427968 Kbytes) is allocated
on 256MB pages. The tail at the end is first allocated on 4MB pages, and the 56k residue
is allocated on 8k pages.
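
When a process has many mappings it can help to total the pmap -xs output by page size. The sketch below assumes the column order shown in the listings above (Address, Kbytes, RSS, Anon, Locked, Pgsz, Mode, Mapped File); the function names are illustrative and the parsing is deliberately forgiving rather than complete.

```python
UNIT_KB = {"K": 1, "M": 1024, "G": 1024 * 1024}


def page_size_kb(pgsz):
    """Convert a Pgsz field such as '8K' or '256M' to kilobytes."""
    return int(pgsz[:-1]) * UNIT_KB[pgsz[-1].upper()]


def summarize_pmap(lines):
    """Return {Pgsz: (kbytes, pages)} for pmap -xs mapping lines."""
    summary = {}
    for line in lines:
        fields = line.split(None, 7)
        if len(fields) < 7:
            continue  # pid line or blank line
        try:
            int(fields[0], 16)  # mapping rows start with a hex address
            kbytes = int(fields[1])
            size_kb = page_size_kb(fields[5])
        except (ValueError, KeyError):
            continue  # header, totals, or a row without a page size
        total_kb, pages = summary.get(fields[5], (0, 0))
        summary[fields[5]] = (total_kb + kbytes, pages + kbytes // size_kb)
    return summary


# The ISM segment from the listing above.
sga_lines = [
    "0000000380000000 25427968 25427968 - 25427968 256M rwxsR [ ism shmid=0x3 ]",
    "0000000990000000 16384 16384 - 16384 4M rwxsR [ ism shmid=0x3 ]",
    "0000000991000000 56 56 - 56 8K rwxsR [ ism shmid=0x3 ]",
]

for pgsz, (kbytes, pages) in summarize_pmap(sga_lines).items():
    print(f"{pgsz:>5}: {kbytes} Kbytes in {pages} pages")
```

For the SGA above this shows that almost all of the segment sits on 256MB pages, with only a handful of 4MB and 8k pages in the tail.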

The final project added to Solaris was Large Page Availability. The aim of this
project was to increase the number of large pages in the system and improve the
efficiency of creating large pages. This project is largely hidden from the end user.
It is key, however, to ensuring applications can allocate large pages.

To determine how well the TLBs are doing, use the trapstat command. The trapstat -T
option breaks down the data as follows:

- Per hardware strand
- User and kernel
- Page size within each mode

On an 8-core, 32-thread UltraSPARC T1 system the output is very long. The example
below gives the last strand's output plus a total:

cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
 31 u   8k|       989  0.1         5  0.0 |     28050  1.2         3  0.0 | 1.3
 31 u  64k|      2510  0.2         0  0.0 |    139354  5.4         4  0.0 | 5.6
 31 u   4m|      2768  0.2         0  0.0 |     94936  4.5         0  0.0 | 4.7
 31 u 256m|         0  0.0         0  0.0 |     79590  3.6         0  0.0 | 3.6
- - - - - + - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - + - -
 31 k   8k|      1921  0.1         0  0.0 |     35701  1.3         6  0.0 | 1.4
 31 k  64k|         0  0.0         0  0.0 |       330  0.0         0  0.0 | 0.0
 31 k   4m|         0  0.0         0  0.0 |        71  0.0         0  0.0 | 0.0
 31 k 256m|         0  0.0         0  0.0 |      3388  0.2         4  0.0 | 0.2
==========+===============================+===============================+====
      ttl |    278212  0.6        68  0.0 |  12334583 16.4       368  0.0 |16.9

Note the difference from traditional SPARC processors: 512k pages have been dropped
and new 256MB entries added.

In this example we see hardly any ITLB misses; this is because of large pages for text
and libraries. There are also 256MB page misses in the kernel, indicating that large pages
for the kernel heap are also in operation.
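
Since the full output on a 32-thread system is long, it can be easier to pull the numbers into a script. The sketch below parses the data rows of the trapstat -T layout shown above and finds the page size costing the most DTLB time on a strand. It assumes the rows are split by "|" exactly as in the listing; the function and field names are made up for this example.

```python
def parse_trapstat_row(line):
    """Parse one data row of trapstat -T output into a dict, or return None."""
    parts = [part.split() for part in line.split("|")]
    if [len(p) for p in parts] != [3, 4, 4, 1]:
        return None  # separator or total row
    (cpu, mode, size), itlb, dtlb, (total_pct,) = parts
    try:
        return {
            "cpu": int(cpu),
            "mode": mode,
            "size": size,
            "itlb_miss": int(itlb[0]),
            "itlb_pct": float(itlb[1]),
            "dtlb_miss": int(dtlb[0]),
            "dtlb_pct": float(dtlb[1]),
            "total_pct": float(total_pct),
        }
    except ValueError:
        return None  # header row


# Rows for strand 31, copied from the listing above.
SAMPLE = """\
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
31 u 8k| 989 0.1 5 0.0 | 28050 1.2 3 0.0 | 1.3
31 u 64k| 2510 0.2 0 0.0 | 139354 5.4 4 0.0 | 5.6
31 u 4m| 2768 0.2 0 0.0 | 94936 4.5 0 0.0 | 4.7
31 u 256m| 0 0.0 0 0.0 | 79590 3.6 0 0.0 | 3.6
31 k 8k| 1921 0.1 0 0.0 | 35701 1.3 6 0.0 | 1.4
31 k 64k| 0 0.0 0 0.0 | 330 0.0 0 0.0 | 0.0
31 k 4m| 0 0.0 0 0.0 | 71 0.0 0 0.0 | 0.0
31 k 256m| 0 0.0 0 0.0 | 3388 0.2 4 0.0 | 0.2
ttl | 278212 0.6 68 0.0 | 12334583 16.4 368 0.0 |16.9
"""

rows = [row for row in map(parse_trapstat_row, SAMPLE.splitlines()) if row]
worst = max(rows, key=lambda row: row["dtlb_pct"])
strand_pct = sum(row["total_pct"] for row in rows)

print(f"worst DTLB offender: mode {worst['mode']}, {worst['size']} pages")
print(f"time in TLB miss handling on strand 31: {strand_pct:.1f}%")
```

A breakdown like this points at which page size to look at first when deciding whether an application would benefit from larger pages.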



UltraSPARC T1 in a Fireball - the ugly duckling

Just after I arrived in the Niagara Arch group we taped out UltraSPARC T1, and
10 weeks later we had first silicon. The guys in the bringup team spent
many late nights and got Solaris booted in a couple of weeks.

Now what? We knew what the performance should be in theory but we needed
to prove this as quickly as possible. Silicon verification was ongoing
in Sunnyvale using all the available systems but we managed to beg one
from the team to do some performance evaluation.

The system that we received was without exception one of the ugliest I have
ever seen. It was called a Fireball and was an UltraSPARC T1 board jammed sideways
into a 4U deskside server. It had an industrial 6 inch fan on it that
was irritatingly loud. There were cables and wires everywhere to aid debug.
In the front were slots for 8 old full height SCSI disks. It looked like
no mechanical engineer had been involved in its creation. A picture of this
would later appear in Jonathan's blog.

Little did we know that we would come to love these systems and that
they would still be involved in performance testing a year later.

The system had limited I/O capabilities so we decided to initially test
a CPU/memory throughput Java benchmark. Initial chips were only
rated for 800MHz but you cannot keep a good performance engineer down.
We worked out a way to hack the reset code to drive the chips to 1.2GHz
by increasing the core voltage. As these were initial silicon samples
we didn't know what to expect. We tested a number of chips until we
found 3 that could run at 1.2GHz.

After that it was mostly software. When working with very early systems,
firmware and OS bits are hand built, and panics and power cycles are common.
Because Niagara is such an exciting new technology, however, people were
prepared to take a lot of pain to run early workloads. It took about
a week to get the right software stack in place but it was worth the wait.

The initial number at 800MHz was nearly 100k Ops/sec. When we
cranked it up to 1.2GHz we got 129k with minimal tuning. We were astounded
at how the UltraSPARC T1 threads absorbed work.

The software to access the hardware performance counters was not yet in Solaris
so we scrambled to add this functionality. What the counters revealed was that the
utilization of the Niagara pipe was nearly 70%. In 12 years of SPARC performance work
I had never seen a number that high. Not only was the silicon working
beautifully but the pre-silicon simulations had been right.

We rushed to run other CPU/memory benchmarks including an internal XML
test and got similar results.

A few weeks later I gave a presentation where I first showed the standard
CMT/Niagara slide with the 4 threads on a core and the 8 cores absorbing
all the stall. I'm sure most folks in the room were moaning to themselves
"here we go again". Then I put up a slide that simply stated "And now
it is real" with the Java throughput and XML results. The age of CMT had
truly arrived.


