BM Seer Facts & Questions from an Anonymous Sun Source

World Record Single-Chip UltraSPARC T2 SPECint_rate2006 Performance with gccfss

Thursday Feb 14, 2008

World's fastest chip. The Sun SPARC Enterprise T5120 server, running at 1.4 GHz, delivered a world record single chip result of 83.9 SPECint_rate2006. Please remember it is about system performance and chips, not about things inside a chip (like perf/transistor, perf/NAND-gate, perf/metal-layers, perf/thread, perf/bore, oops perf/core, perf/silicon grain).

The Sun SPARC Enterprise T5120 using the GCC for SPARC Systems (gccfss) compiler topped all competitors' single-chip results, including beating the IBM p570 single-chip 4.7 GHz POWER6 result by 38%. IBM used its proprietary compiler, XL C/C++.

The Sun SPARC Enterprise T5120 using the GCC for SPARC Systems (gccfss) compiler beat the performance of the HP DL360 G5 with a single chip quad-core 3.16GHz Xeon X5460 by 15%.
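The two percentages above can be checked directly from the published peak scores. The following is a minimal sketch of that arithmetic; the helper name `percent_faster` is an illustrative assumption, and the scores are the peak SPECint_rate2006 values quoted in this post.

```python
def percent_faster(score: float, baseline: float) -> float:
    """Return how much higher `score` is than `baseline`, in percent."""
    return (score / baseline - 1.0) * 100.0

# Peak SPECint_rate2006 scores quoted in this post.
T5120_GCCFSS = 83.9
IBM_P570 = 60.9
HP_DL360_G5 = 73.0

vs_ibm = percent_faster(T5120_GCCFSS, IBM_P570)
vs_hp = percent_faster(T5120_GCCFSS, HP_DL360_G5)

print(f"T5120 vs IBM p570:    {vs_ibm:.0f}% higher")
print(f"T5120 vs HP DL360 G5: {vs_hp:.0f}% higher")
```

Rounded to whole percentages, the results agree with the 38% and 15% figures stated above.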

The gccfss compiler allows one to use the optimal Sun SPARC optimization tools along with the popular gcc coding conventions and deliver performance that has not been possible before without time-consuming code changes.

For more information on gccfss and how to get it, go to http://cooltools.sunsource.net/gcc/.

Sun also submitted results on the SPECfp_rate2006 benchmark suite using just a single disk. The Sun SPARC Enterprise T5120 server, running at 1.4 GHz, delivered a result of 62.1 SPECfp_rate2006.

This result was run on a single disk. The previously reported result used the electrical equivalence rule of SPEC, but the configuration used more disks than fit in a T5120. This result shows that the performance is comparable, regardless of the disk configuration.
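As a rough check of the "comparable" claim, the sketch below compares the one-disk peak score with the T5120/T5220 row of the SPECfp_rate2006 chart further down. Treating that row as the previously reported multi-disk result is an assumption made for illustration.

```python
# Peak SPECfp_rate2006 scores from the chart in this post.
# Assumption: the T5120/T5220 row is the earlier multi-disk result.
multi_disk_peak = 62.3
one_disk_peak = 62.1

# Relative gap between the two configurations, in percent.
gap_percent = (multi_disk_peak - one_disk_peak) / multi_disk_peak * 100.0

print(f"Peak score gap: {gap_percent:.2f}%")
```

Under that assumption the peak scores differ by about a third of one percent, and the base scores in the chart (57.9 in both rows) are identical.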

SPEC CPU2006 Performance Charts - bigger is better, selected recent results, see www.spec.org for complete results

SPECint_rate2006

System              Processor Type  GHz   Chips  Cores  Threads  Peak  Base
T5120 (gccfss 4.2)  UltraSPARC T2   1.4   1      8      64       83.9  76.2
T5220 (gccfss 4.2)  UltraSPARC T2   1.4   1      8      64       83.2  75.6
T5120/T5220         UltraSPARC T2   1.4   1      8      64       78.5  73.0
T5220 (gccfss)      UltraSPARC T2   1.4   1      8      64       78.0  71.6
Asus P5E3           Intel QX9650    3.0   1      4      4        76.7  69.0
HP DL360 G5         Intel X5460     3.16  1      4      4        73.0  62.1
Asus P5E3           Intel QX6850    3.0   1      4      4        69.1  64.9
Dell T3400          Intel QX9650    3.0   1      4      4        68.8  61.4
IBM p 570           Power6          4.7   1      2      4        60.9  53.2
Fujitsu RX100       Intel X3210     2.13  1      4      4        54.4  48.0

SPECfp_rate2006

System              Processor Type  GHz   Chips  Cores  Threads  Peak  Base
T6320               UltraSPARC T2   1.4   1      8      64       62.3  58.1
T5120/T5220         UltraSPARC T2   1.4   1      8      64       62.3  57.9
T5120 (one disk)    UltraSPARC T2   1.4   1      8      64       62.1  57.9
IBM p 570           Power6          4.7   1      2      4        58.0  51.5
Intel Asus P5E3     Intel QX9650    3.0   1      4      4        52.0  49.9
Dell T3400          Intel QX9650    3.0   1      4      4        47.2  44.9
HP DL360 G5         Intel X5460     3.16  1      4      4        44.5  41.3

Results as of 12 Feb 2008 from www.spec.org.

Benchmark Description

SPEC CPU2006 is SPEC's most popular benchmark. It measures:

  • "Rate" - system performance of CPUs, memory, compiler
  • "Speed" - single thread performance of chip, memory, compiler; not intended to stress multi-core designs
The strategic metrics include:

  • SPECint_rate2006: throughput for 12 integer benchmarks derived from real applications such as perl, gcc, XML processing, and pathfinding
  • SPECfp_rate2006: throughput for 17 floating point benchmarks derived from real applications, including chemistry, physics, genetics, and weather

There are "base" variants of both the above metrics that require more conservative compilation, such as using the same flags for all benchmarks.
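To make the "rate" idea concrete, here is a minimal sketch of how a throughput metric of this kind can be combined into one number. It assumes, as I understand the SPEC CPU2006 documentation, that each benchmark's rate ratio is the number of copies times the reference time divided by the elapsed time, and that the overall score is the geometric mean of those ratios. The function names and all timings below are hypothetical; see www.spec.org for the authoritative run rules.

```python
import math

def rate_ratio(copies, reference_seconds, elapsed_seconds):
    """Throughput ratio for one benchmark: copies * reference time / elapsed time."""
    return copies * reference_seconds / elapsed_seconds

def overall_rate(ratios):
    """Geometric mean of the per-benchmark throughput ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical (copies, reference seconds, elapsed seconds) for three made-up
# benchmarks; these are NOT real SPEC measurements.
runs = [(64, 9000.0, 7200.0), (64, 8000.0, 6400.0), (64, 10000.0, 8000.0)]

ratios = [rate_ratio(copies, ref, elapsed) for copies, ref, elapsed in runs]
print(f"Overall rate for the made-up runs: {overall_rate(ratios):.1f}")
```

Because the copies share no data, running more copies raises the ratio only as long as the elapsed time does not grow proportionally, which is where caches, memory bandwidth and the compiler come in.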

    Disclosure Statement:

    SPEC, SPECint, SPECfp reg tm of Standard Performance Evaluation Corporation. Results from www.spec.org as of 2/12/08. Sun SPARC Enterprise T5120 gccfss (UltraSPARC T2, 1 chip, 8 cores), 83.9 SPECint_rate2006. IBM p570 (POWER6, 1 chip, 2 cores), 60.9 SPECint_rate2006. HP DL360 G5 (Xeon X5460, 1 chip, 4 cores), 73.0 SPECint_rate2006. Sun SPARC Enterprise T5120 (UltraSPARC T2, 1 chip, 8 cores), 62.1 SPECfp_rate2006.

    Results Summary

    Results
    Reference Date: Feb 12, 2008
    System: Sun SPARC Enterprise T5120
    Processor: Sun UltraSPARC T2, 1.4 GHz
      83.9 SPECint_rate2006
      62.1 SPECfp_rate2006
    Software: Solaris 10, Sun Studio 12 Compiler gccfss

    Comments:

    My understanding of SPECint_rate2006 is that it is an extremely parallel benchmark, more cores == more performance; it doesn't really test much except core & memory performance. So how is this a fair comparison with a p570 which had two cores?

    Also what is the SPECint_rate2006 for a 2-core M8000 system? How does this compare with that of a p570 (Power6)?

    The definition of a CPU today is largely a "marketing" definition, so using that to compare is not really relevant.

    Posted by venki on February 14, 2008 at 10:29 PM PST #

    venki: yes, SPECint_rate2006 is very parallel. That's the point of the benchmark, and what bmseer demonstrates is that even if the T2 runs "only at 1.4 GHz", because it has so many cores it is able to beat other chips by a large margin.

    "it doesn't really test much" -> wrong, the purpose of SPECint_rate2006 is to evaluate the speed of real-world applications: it runs bzip2, gcc, a H.264 compression tool, etc, see: http://www.spec.org/cpu2006/Docs/readme1st.html#Q11

    Posted by zzp on February 15, 2008 at 12:04 AM PST #

    Venki is spot on.

    Why do you continue to try to compare T2 and Power 6? T2 is a niche chip for niche workloads, i.e. light, massively parallel threads.

    63 running instances of the benchmark. Realistic? I think not.

    Posted by AlexR on February 15, 2008 at 01:59 AM PST #

    And where are the SPECint2006 and SPECfp2006 results?

    It seems Sun have chosen not to publish those results for T2. I wonder why?

    Posted by AlexR on February 15, 2008 at 02:08 AM PST #

    AlexR and venki - you guys are stating the obvious. Most sales engineers I know here at Sun are not blindly advocating the Niagara based systems for tasks they are clearly not best suited for. If all you have is single threaded workloads that can't be run in parallel and need the best response times then the T2 most likely isn't for you. SpecInt and SpecFP numbers would just restate this fact.

    The T2 is a great chip - and great in many cases is an understatement - for high concurrency and high throughput workloads if you can tolerate slightly higher response times at the lower levels of the concurrency curve. I would argue that covers a much wider swath of the application space than your niche comment suggests. And you get this while using less space and power - when you can throw a lot of work at a Niagara based system there are few servers that can touch it at their price levels.

    I can comfortably state all this having done testing with my network equipment provider customer with their applications that sit in the call path at Tier 1 service providers today.

    Posted by Wayne Abbott on February 15, 2008 at 07:51 AM PST #

    For light, highly parallel, predictable threaded applications, it's a good chip.

    For single threaded apps, and mixed, random or unpredictable applications (i.e. the majority), it's not so great.

    Factor in that T2 has poor scalability, only runs Solaris 10, has poor virtualisation, etc.

    I'd consider using it as a web server, but not much else!

    Posted by AlexR on February 15, 2008 at 09:27 AM PST #

    You guys are missing the point. If you run a datacenter, take a glance at it. Do you have only 1 server with only 1 job running? If you have a datacenter, you have hundreds or "hundreds of thousands" of threaded jobs running all at the same time.

    SPECint is like testing one cylinder of your 8-cylinder car: while interesting if you are a cylinder designer, it does not say what your whole engine does. SPECint_rate tests your whole engine.

    On comparing UltraSPARC T2 to Power6. The price of a 4.7GHz IBM p570 2-core 4-thread 64GB system is about $300,000, or nearly SEVEN TIMES more expensive than an 8-core 64-thread Sun UltraSPARC T2 based T5220. Look above or at the following link and you'll see the UltraSPARC T2 is way ahead of the ultra expensive system in performance for Database, Application, and Web.
    http://www.sun.com/servers/coolthreads/t5120/benchmarks.jsp
    http://www.sun.com/servers/coolthreads/t5220/benchmarks.jsp
    Yes better whole application delivered performance and $/performance.
    So much so that IBM can only legally make per-core performance claims.

    If you seriously understand cores in modern designs you'll realize that they are completely different. IBM has the world record for the most expensive cores in the industry. Seriously, try to find the cost of a 16-core p570: you'd have to sell your million dollar home to afford a 16-core server from IBM. I couldn't afford one; I don't have a million dollar home :) Core to core performance comparisons are the only thing left that IBM can try to confuse one on. Because quite simply they lose on system performance, system price, system price performance, virtualization, manageability, chip performance, chip prices, ...

    SPEC, SPECint, SPECfp reg tm of Standard Performance Evaluation Corporation. Results from www.spec.org as of 2/12/08. Sun SPARC Enterprise T5120 gccfss (UltraSPARC T2, 1 chip, 8 cores), 83.9 SPECint_rate2006. IBM p570 (POWER6, 1 chip, 2 cores), 60.9 SPECint_rate2006. HP DL360 G5 (Xeon X5460, 1 chip, 4 cores), 73.0 SPECint_rate2006. Sun SPARC Enterprise T5120 (UltraSPARC T2, 1 chip, 8 cores), 62.1 SPECfp_rate2006.

    Posted by BM Seer on February 15, 2008 at 11:00 AM PST #

    You keep coming up with the same ludicrous argument, comparing the cost of an enterprise class p570 with just 2 cores against an entry class T2. These machines are NOT in the same class.

    Who would buy a P6-570 with 2 cores? If that's the kind of power you need, then you're looking at a P6-520, Blade, or Power 5.

    Posted by AlexR on February 15, 2008 at 12:58 PM PST #

    Mr Seer, you are the epitome of delusion ! :)

    I shouldn't bite, but go on then, let's hear it: explain how Sun has better virtualization than IBM!!

    This should be amusing ! :)

    Posted by AlexR on February 15, 2008 at 01:03 PM PST #

    Sorry, I meant to say the 4-core 8-thread p570; that is the one that costs $300,000, and that is the one beaten by the Sun T5120/T5220, which can handle all of the same workloads with higher performance than IBM, and with all of the same redundancy and availability. So why aren't they in the same class?

    Is class an ambiguous term that allows you to avoid competition because IBM is losing?

    Posted by BM Seer on February 15, 2008 at 03:09 PM PST #

    None of the Sun entries in the FP table mention gccfss - did those use gccfss, or did those use Studio? Compilers compilers everywhere (gccfss, gcc proper, and Studio) what's a developer to choose?-)

    While the rate metrics might be more "strategic" to Sun for selling T2-based systems, I doubt they are any more "strategic" to SPEC et al than the speed metrics.

    Interesting that the proprietary nature of the IBM compiler is mentioned, but not that of the Sun Code Generator for SPARC Systems 4.2.0 on which the results depend. Just like Solaris and Studio, that bit of code is covered by a license which is rather different from the GNU cloak it seems Sun wants to have us think is wrapped around gccfss. In particular it has that ever popular clause 5f - "(f) You may not publish or provide the results of any benchmark or comparison tests run on Software to any third party without the prior written consent of Sun." It would be interesting to know if the IBM compilers are similarly license constrained.

    The "rate" benchmarks are indeed rather parallel. They have no data sharing between instances at all. The only "contention" will be to see how much fits in the shared caches and TLBs and the like. If Sun were to publish the SPECcpu2006 "speed" (single instance) figures we'd have some idea of how well all those HW cores and threads scale on those workloads.

    Posted by rick jones on February 16, 2008 at 08:15 AM PST #

    OK, let's look at your comments.

    1. T2 can beat P6 on all workloads? OK, how about single threaded performance? I don't think so!! P6 1 - T2 0

    2. T2 outperforms P6? Oh really? Your benchmark results give the impression that T2 can keep up with P6, but the reality is that your benchmarks are fudged. Running 64 copies of the benchmark and adding up the results does not give you the full picture. And of those 64 threads, only 16 are running at once. P6 2 - T2 0

    3. T2 same redundancy & availability? Oh please. You have a system with a single processor, and a single processor = single point of failure. The p570 can have dark cpu/memory resources which can be activated if/when cpu/memory fails. No contest here. P6 3 - T2 0

    and in addition

    4. Virtualization. Let's face it, this again is no contest. You only have containers, which are limited in their use. Don't talk about ldoms; that's not virtualization, that's thread partitioning. Sun likes to talk about 64 servers on a chip. OK, and how much memory can you allocate to each of these 64 servers? And how would those servers perform, having to share execution time with 4 other servers? Dependency on Solaris 10, and limited hardware. P6 4 - T2 0

    5. Scalability - Still only 1 processor and 64 GB of memory maximum. Even the P6-p520 scales way beyond that, let alone the p570 which you so like to compare T2 with.

    That's why T2 systems are not in the same class as a P6-p570. I would question whether they are even in the same class as the P6 p520 and p550 too.

    Posted by AlexR on February 18, 2008 at 01:41 AM PST #

    I still don't see an answer for this part of my post...

    Also what is the SPECint_rate2006 for a 2-core M8000 system? How does this compare with that of a p570 (Power6)?

    Has Sun stopped selling, or is it planning to stop selling, these systems?
    Why are all the comparisons only between p6 570 and T5120/T5220 systems?

    Posted by venki on February 18, 2008 at 12:24 PM PST #

    Hi AlexR,

    I think your arguments need refinement. I don't believe BMSeer has said the T2 beats other CPUs on all workloads - I certainly did not say that if you recall my comments. My argument is that the type of workloads the Niagara chips excel at cover a much wider space than you claim - I base that on my coverage of 4 business units and the CIO of a major telco equipment provider for over 10 years (and before that I was a developer for 10 years prior to coming to Sun). Let's agree not to rehash the single-thread argument - we agree that most other CPUs will beat the T2 on single-thread workloads that can't be run in parallel. We disagree on how much of the application space in datacenters, etc. that type of workload covers. I would argue that many datacenter and telco oriented applications are a perfect fit for the type of workloads the T2 excels at.

    As far as availability is concerned, most MTBF numbers get worse as the number of components goes up. Although the current Niagara based systems have only 1 CPU, it is highly unlikely that a CPU will fail before other components in a chassis. I have to verify this but I'm fairly sure the CPU can operate if a core fails - it will be marked as bad and the system will come up on the remaining cores. Beyond that the systems don't have many other single points of failure as far as I'm aware (they have dual power, hot swap disks, I can have redundant paths of I/O on different PCI buses, etc.).

    Your virtualization argument about LDOMs makes no sense to me. If I can run multiple OS images on a single system and they all have a different view of the chassis resources, that's classic virtualization. I'm not sure why you are caught up on methods vs the end result - CPU and I/O are virtualized, and LDOMs at a logical level look like many other classic virtualization approaches. And yes, you can allocate memory on a per OS instance basis - what you can't do yet is dynamically reallocate memory without rebooting the affected LDOMs you're taking RAM from and giving RAM to.

    Your question as to how the OS instance would perform having to share execution time is no different from how someone deploying VMware (which of course costs you some cash vs LDOMs, which have no additional cost) has to look at the situation - it depends on what's running in the LDOMs, and your mileage will vary. There is no magic bullet, and experienced sys admins do this type of deployment assessment every day. The fact that we can also offer a more lightweight means to partition resources using containers is only a plus - no one hammer is right for every task at hand. I would imagine this is the same reason IBM recently introduced technology similar to containers in AIX.

    Your knock on the Solaris dependency is incorrect - you can run a supported version of Ubuntu on the Niagara based systems (currently 4 rackmount servers and 2 blades with more coming soon), which also can take advantage of LDOMs. The fact that you need the recent version of any OS to get improved functionality is just a fact of life. Fortunately Solaris 10 has been out for years and thousands of applications are qualified on it for those who can't get by on the binary compatibility guarantee alone.

    In general I would argue that the T2 systems can easily play in the same space as P6 systems depending on the workload and availability scenario, especially when you include cost and space in the equation. In addition you'll see systems very soon that will scale beyond 1 CPU/64 GB. There are of course many cases where our M series systems are better suited to match up against the P6 servers.

    venki: Sun has not stopped nor will it stop anytime soon the selling of the M class systems - believe me, they will be sold for years to come. Even when the Rock based systems come out there are just some classes of applications and availability scenarios for which the M series servers will always be the better choice.

    Posted by Wayne Abbott on February 19, 2008 at 07:08 AM PST #

    Wayne, the way I read it, the Seer's comment "T5120/5220 can handle all of the same workloads and higher performance than IBM" got my goat, because it's clearly untrue, hence my response regarding single thread performance.

    I did note that you had earlier agreed on this point, hence my rant was directed at BM Seer and not you.

    Regarding my opinion on virtualization and ldoms, maybe this is terminology, but my understanding is that threads are statically allocated to ldoms. If those threads are unused, then the resource is wasted. This seems more like static partitioning than virtualization to me. For example, on IBM, unused cpu cycles can be ceded back into a shared pool(s), and used elsewhere. The virtualization of the resource is much more dynamic.

    Sun's inability to dynamically move memory between ldoms is another example. This has been possible since AIX 5.2 (on P5), which is now nearly end of life. It seems ldoms have much more in common with partitioning from the Power 4 and AIX 5.1 days, which had similar restrictions.

    The sharing of execution time relates to the fact that Sun marketing hypes 64 concurrent threads and 64 servers on a chip. In reality, only 16 of the 64 threads can execute concurrently; the other threads execute on a round robin basis.

    Solaris containers is a useful technology, and naturally it's because of this feature that IBM felt it had to react and develop workload partitions. Both seem similar in concept, but both appear to have limited usability. However, I'm not averse to either, as they complement both virtualization offerings, giving more flexibility, and therefore can't be a bad thing.

    So to summarise, wpars and containers seem about equal, but ldoms seem to be a long way behind lpars, especially now that the PowerVM and mobility features are out. Yet BM Seer states the opposite! Go figure!

    OK, I wasn't aware of Ubuntu, so I accept that point. However, I still see Sun as being way behind IBM in this department. IBM can support multiple versions of AIX and Linux across Power 5 and Power 6: a vastly wider variety of kit.

    I do accept that the T2 is an interesting processor, and is especially good at certain types of workloads (web, java app), but I just really can't see how it can justifiably be compared against enterprise class P6 systems like the p570, which should really be compared against the M class.

    Posted by AlexR on February 19, 2008 at 09:31 AM PST #

    Hi AlexR,

    You are correct that with LDOMs HW threads are allocated to a specific domain. They can however be moved between domains without rebooting the affected domains if the OS supports this (Solaris obviously can - not sure about Ubuntu Linux). As mentioned, dynamic memory allocation is on the roadmap but I'd have to check to see where that is these days. Overall however I'd say this is pretty nice functionality that doesn't come at any extra cost (I'm not sure what costs if any are associated with using LPARs). I would argue that if one needs that level of flexibility and can't tolerate the downtime to reboot a domain, then they really should be looking at a different class of systems, which of course we have (and have had for over a decade). I think we're in agreement that the T2 servers can play in the space of higher end systems but they are not a 1 for 1 match depending on the circumstances.

    If you want a better and more efficient sharing of resources so that idle HW threads are not wasted then containers is the way to go (assuming your apps can all run on S10). Also with a recent S10 update you can employ much better resource controls and unique IP stacks to a container for even better management of your overall system. What LDOMs gives you is better isolation and ability to run different OSes (Solaris, Linux, rev levels, etc.). As I said having multiple tools to solve your problems is a good thing. Also now with the recent Solaris 8 Migration Assistant you can effectively have Solaris 8 based containers on any Solaris 10 sparc based systems which gives people more flexibility in how they can take advantage of new hardware and S10 functionality. That said I don't agree with your view that containers have limited usability - for security isolation, speed of provisioning, etc. they are very handy to have at your disposal.

    As far as the marketing hype goes I hear what you are saying. However the CPU excels at dealing with the memory stalls to keep those execution pipelines busy - unlike other CPUs the Niagara based systems thrive when they have a ton of work to do. Real world tests I've done with my customer, and ones that other 3rd parties have documented, in my opinion bear this out. Check out http://www.stdlib.net/~colmmacc/2006/03/23/niagara-vs-ftpheanetie-showdown/ and http://www.sun.com/servers/coolthreads/testimonials/ for some light reading ;-)

    I also don't agree with your statement on the AIX flexibility story over Solaris. Solaris runs across our gear, Fujitsu, and on many systems in the x86 space, which is a far bigger pool than what AIX can cover. Not only that, the binary compatibility story has been superior to IBM's from what I've seen over the years. I will concede that to take advantage of our latest HW you do need to run S10, but you can run S10 on systems going back almost 15 years at this point (I just did a S10 install on my old Ultra 2 server with dual 300 MHz CPUs last week in my lab). I can run Solaris at nearly all price points along the curve from my laptop all the way to the highest end datacenter class systems.

    Posted by Wayne Abbott on February 19, 2008 at 02:29 PM PST #

    My point re containers (and wpars) having limited use is by virtue of the fact that it's virtualization of a single OS image. The dependency on the host OS, with wpars/containers having to have the same patch/kernel level etc., is a major limitation.

    For example, would you run a production, pre-prod and test environment within a single OS? I wouldn't.

    They do have some uses though, and they do complement logical partitioning such as lpars and ldoms.

    LPARs and LDOMs are essentially the same thing, but LPARs are much more flexible and efficient, and do proper virtualization (e.g. resources shared dynamically, not statically). You can also do the same level of virtualization on all Power 5 and Power 6 systems, ranging from the 1U p505 to the 64 way p590. It works exactly the same way.

    Anyway, we clearly have different opinions, so may as well agree to disagree.

    Posted by AlexR on February 20, 2008 at 02:29 AM PST #
