Thursday Jun 28, 2007

Switching Subjects

Until very recently, Sun's been hard to find in the world of high performance computing (HPC). Back in the 2001 timeframe, we lost the recipe for the P in HPC - customers that wanted performance were no longer looking at small numbers of big systems (a traditional Sun specialty), they were looking to large numbers, clusters, of small systems. And in 2001, that wasn't our focus.

But over the last five years, we've been investing to change that. Our Galaxy and Niagara product lines are just about the fastest growing products at Sun. OpenSolaris is beginning to catch a wave of adoption on small systems, and we've been doubling down on compiler optimization and language innovation. All with a focus on extreme efficiency/performance. If there were ever a time to reenter the market, it's now.

Rather than mimic the competition, we started out examining the issues and challenges facing the largest HPC installations we could find. Performance was certainly the main priority. But there were others - and not what you'd expect if your idea of a cluster was three PCs in a closet.

At three or four hundred computers, the challenges of building a cluster shift around quite a bit. Dissipating heat, sourcing enough power, managing software versioning or hardware failures, just to name a few. Get to three or four thousand nodes, and all of a sudden everything from weight (floor loading), to the bend radius of optical cabling, to massive software provisioning challenges, even the speed with which data can be moved around a room become critical factors. And that's where we decided to focus our efforts, at the extreme - on the assumption it would one day become the norm (as so often is the case in this industry).

I've read quite a lot of feedback from pundits and analysts over the past few days, and wanted to be sure to respond to one item - from those who believe the high end supercomputing marketplace is small, esoteric, and has very slim profit margins.

The high end of the supercomputing marketplace is small, esoteric, and has very small profit margins - they're absolutely right.

And like the world of free software (in which no one's going to get rich selling to the open source community), no one's going to build a profitable business selling to the academics and researchers who dominate the extremes of HPC.

That's not the point.

The academic supercomputing community (there's that word again) sets the pace for enterprise computing across the world – which has grabbed on to HPC for an array of real world challenges, from virus, disease, and drug discovery, to customer purchase pattern analytics, capital markets trading, energy discovery, dynamic resource management - you name it, it's one of the fastest growing segments in the marketplace. Proving that what starts in academia, ends up on main street. Industry looks to academia and research institutions to understand the innovations that enable breakthrough scale and performance (just ask Linus - who, come to think of it, still hasn't responded to my dinner invite... I hope it's not my cooking.)

What We Announced

In Dresden, Germany, earlier in the week, we announced the Constellation System - a set of generally available building blocks any customer, educational or commercial, can use to build anything from a few-teraflops system to a system of more than 2 petaflops. As a part of this broad announcement, we unveiled a few component elements - notably...

Our commitment to the rise of OpenSolaris in the HPC community – joining Linux as a reliable, resilient platform for petaflops scale systems (those capable of executing a thousand trillion floating point operations per second). What's driving preference for OpenSolaris? Legendary support for huge memory configurations, integrated virtualization, DTrace and the ZFS file system are probably the biggest drivers – but support for ROCKS, a price tag that says FREE/open source, and the fact it'll run on any server built are a big help, too. Success in HPC is a very high priority for the Solaris team, and an area of investment for us and our partners. (And no, this doesn't lessen our focus on Linux - if we can combine licenses, it'll amplify it.)

Second, we unveiled an integrated 48 blade rack that supports all volume microprocessors, AMD, Niagara and Intel – in the same rack, with standardized I/O. Picture on the left. We also announced a new blade, Pegasus, designed purely for HPC grids. No seatbelts, no redundant anything, just raw compute performance.

Third, and most importantly, we unveiled Project Magnum (at right), an absolutely massive (3,456 port - the significance of that number is explained at the end of this post) InfiniBand (IB) switch – designed to alleviate a ton (three tons, actually) of the cabling, weight, expense and latency nightmares saddling most supercomputing facilities. This one innovation, courtesy of the extraordinary Systems team led by chief architect Andy Bechtolsheim, allows those with serious computing needs to dispense with a massive amount of complexity and expense. The largest competitive IB switch in the market today is 288 ports - so you'd need a lot of them (with an equivalent proliferation of support nodes, cabling and complexity) to match Magnum. In an industry where size matters, we're feeling plucky. (We expect the economics behind Magnum to prove out around 420 nodes – so even if you're building a little grid, Magnum pays for itself.)

Our view is we can reduce by a factor of two or three, at least, the cost and complexity of building a supercomputer – in an academic or commercial environment. Bringing general purpose systems, and volume economics, back to a market that was starting to turn proprietary. What the Constellation System allows for is a transition from this first picture...

To this, a vastly simpler, lighter, easier to manage/maintain Petaflops scale HPC installation.


Three tons lighter, three times less expensive to build, a fraction of the cabling and vastly simpler to manage. And at up to two petaflops, I'm quite convinced we could spank Bobby Fischer...

For those interested in the details behind our win at the Texas Advanced Computing Center (TACC), here's what they're running:

TFLOPs: approximately 500 TERAFLOPs
Magnums: 2 (>2000 4x IB ports each, expandable to 6,912 ports)
Thumpers: 72 (1.728 PB)
Metadata storage: STK6450 RAID (9.3 TB)
Tape storage: STK SL8500
Storage/Data Management: SAM/QFS
Racks: 82
IB NEMs: 328
Pegasus blades: 3936
Aggregate memory size: 123 TB
Number of cores: 62,976

Total racks: 94
Approx footprint: 2,037 sq ft
Approx power: 2.4 MWatts
IB cable length: ~14 kilometers
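
As a quick sanity check, the totals in the list above can be reduced to per-unit ratios. The sketch below is not from the original post: the function name is made up for illustration, and the unit conversions (1 TB = 1024 GB for memory, 1 PB = 1000 TB for storage) are assumptions chosen because they make the quoted totals divide evenly. The 48 blades per rack it derives matches the 48 blade rack described earlier.

```python
# Totals quoted in the list above.
RACKS = 82
IB_NEMS = 328
BLADES = 3936
CORES = 62_976
MEMORY_TB = 123
THUMPERS = 72
THUMPER_STORAGE_PB = 1.728


def per_unit_ratios():
    """Derive the per-unit figures implied by the quoted totals.

    Hypothetical helper for illustration only. The unit conversions
    below are assumptions, not something the post states.
    """
    return {
        "blades_per_rack": BLADES / RACKS,
        "nems_per_rack": IB_NEMS / RACKS,
        "cores_per_blade": CORES / BLADES,
        # Assumes 1 TB = 1024 GB for memory.
        "memory_gb_per_blade": MEMORY_TB * 1024 / BLADES,
        # Assumes 1 PB = 1000 TB for disk; rounded to hide float noise.
        "storage_tb_per_thumper": round(THUMPER_STORAGE_PB * 1000 / THUMPERS, 3),
    }


if __name__ == "__main__":
    for name, value in per_unit_ratios().items():
        print(f"{name}: {value:g}")
```

Under those assumptions every ratio comes out as a whole number, which suggests the list is internally consistent.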

To put that in perspective, their computing facility will be about half the size of an NBA basketball court. Not exactly small - and in fact, likely the largest on earth.

And for those curious as to why we settled on 3,456 ports...

____________________________

Begin forwarded message:
From: Andreas Bechtolsheim
Date: June 28, 2007 6:58:59 AM PDT
To: Jonathan Schwartz
Cc: John Fowler
Subject: 3,456

We implement a 5-stage fabric, and with a 24-port switching element
the maximum number of ports is n*n/2*n/2, or 24*12*12 =3456.

Other Infiniband switches in the market today are 3-stage fabrics
and they have n*n/2 or 24*12 = 288 ports.

Now you can build a 5-stage 3456 port switch with 12 288-port switches
and 288 24-port leaf switches but you end up with 300 boxes occupying
about 456U of rack space or 12 racks, and 6912 cables.
We use one double rack with 1152 cables, so it is 1/6th the space,
1/6th the cables and 1/6th the weight.

On Jun 28, 2007, at 6:36 AM, Jonathan Schwartz wrote:

so - why 3,456 ports?

----------------------------
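
The arithmetic in Andy's note can be reproduced in a few lines. This is only a sketch: the function names are hypothetical, and the breakdown of the 6912 cables (one cable per host port plus one per leaf uplink, with each leaf switch splitting its ports evenly between hosts and uplinks) is an assumption that happens to reproduce the quoted totals, not something the email spells out.

```python
def max_ports(radix: int, stages: int) -> int:
    """Maximum end ports for a fabric of `radix`-port switching elements,
    using the two formulas quoted in the email: n*n/2 for 3 stages and
    n*n/2*n/2 for 5 stages."""
    half = radix // 2
    if stages == 3:
        return radix * half
    if stages == 5:
        return radix * half * half
    raise ValueError("the email only covers 3- and 5-stage fabrics")


def discrete_build(radix: int = 24, magnum_cables: int = 1152) -> dict:
    """Size a 5-stage fabric assembled from separate boxes, as in the email.

    Assumption (not stated in the email): each leaf switch uses half its
    ports for hosts and half for uplinks to the core switches.
    """
    half = radix // 2
    total_ports = max_ports(radix, 5)    # 3456
    core_ports = max_ports(radix, 3)     # 288 ports per core switch
    leaf_switches = total_ports // half  # 288 leaf switches
    uplinks = leaf_switches * half       # 3456 leaf-to-core links
    core_switches = uplinks // core_ports  # 12 core switches
    cables = total_ports + uplinks       # host cables + uplink cables
    return {
        "total_ports": total_ports,
        "leaf_switches": leaf_switches,
        "core_switches": core_switches,
        "boxes": leaf_switches + core_switches,
        "cables": cables,
        "cable_ratio_vs_magnum": cables / magnum_cables,
    }


if __name__ == "__main__":
    print(max_ports(24, 3))  # 288
    print(max_ports(24, 5))  # 3456
    print(discrete_build())
```

With a 24-port element this gives 300 boxes and 6912 cables for the discrete build, six times the 1152 cables quoted for Magnum.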

And last, but certainly not least - if you'd like to try a supercomputer on an hourly basis, just point your browser to network.com... we've made a ton of progress in the past 6 months...


Friday Jun 15, 2007

Better Honest Than Polite

I gave a short speech a couple nights ago at a gathering organized by Pat Mitchell and the newly named Paley Center for Media. I was joined by some august guests, including California State Governor Arnold Schwarzenegger (who wore, and I'm not joking, green alligator skin boots); along with Eric Schmidt (CEO, Google - and a little known fact, my very first boss at Sun), and Terry Semel (CEO, Yahoo!).

After dinner, I found myself talking to a group of media company CEOs. I asked a simple question, "do you have a general counsel reporting to you?" The answer was universally, yes.

I do, too. Mike and his team are central to the evolution of Sun (as I've said, we are nothing less, or more, than an intellectual property company - it's hard navigating those waters without a great legal team).

But then I asked a harder question: "Do you have a chief technology officer reporting to you?"

I do, and I talk to Greg at least every day. He plays a central role at Sun. Central as in nervous system. He's involved in every major strategic decision I make (and a ton of minor ones, too).

But in response to my question, the answers from the group were more dismissive than substantive - most did not. And in my view, if you have a general counsel reporting to you, and not a CTO, you're saying legal advice is more important to you than technology counsel. Which seems backward for a media company. Why?

Because convergence isn't a legal phenomenon. It's a technical and social phenomenon first and foremost - that's why you can't talk about media without talking about software (what is an MP3? AAC? Java? Flash?). You can't talk about distribution without talking about free media, social networking or mobile devices (technical assets that reach more of the planet than all other network outlets). Ask Eric or Terry (or Steve or Mark) if they have CTOs reporting to them. Of course they do, they're media companies using technology to win. Or vice versa. It doesn't matter, they've converged.

Which brings me to a simple, and heretical conclusion - for which I'm sure I'll be apologizing for years to come. But I'd rather be honest than polite.

Media company CEOs without a CTO on their staff should prepare to be acquired or broken up - they are fighting the future rather than monetizing it.


Apologies

With apologies for the up and down nature of a post I'll be making later on today - we're rolling out a new version of our blogging infrastructure, which is today demonstrably proving it's "always in beta"...


Wednesday Jun 13, 2007

An OpenSolaris/Linux Mashup

To non-technical readers of this blog, or those uninterested in the ebbs and tides of the free software world... this might be a good entry to skip.

I was just forwarded a pointer to this note regarding Sun and OpenSolaris, written by the eponymous Linus Torvalds. And I wanted to respond directly.

__________________________

Linus,

First, I'm glad you give credit to Sun for the contributions we've made to the open source world, and Linux specifically - we take the commitment seriously. It's why we freed OpenOffice, elements of Gnome, Mozilla, delivered Java, and a long list of other contributions that show up in almost every distro. Individuals will always define communities, but Sun as a company has done its part to grow the market - for others as much as ourselves.

But I disagree with a few of your points. Did the Linux community hurt Sun? No, not a bit. It was the companies that leveraged their work. I draw a very sharp distinction - even if our competition is conveniently reckless. They like to paint the battle as Sun vs. the community, and it's not. Companies compete, communities simply fracture.

And OpenSolaris has come a very long way since you last looked. It and its community are growing, as a result of more than ZFS (although we seem to be generating a lot of interest there, not all intentional) - OpenSolaris scales on any hardware, has built-in virtualization, great web service infrastructure, fault management, diagnosability, and tons more. Feel free to try for yourself (and yes, we're fixing installability, no fair knocking us for that.)

Now despite what you suggest, we love where the FSF's GPL3 is headed. For a variety of mechanical reasons, GPL2 is harder for us with OpenSolaris - but not impossible, or even out of the question. This has nothing to do with being afraid of the community (if it was, we wouldn't be so interested in seeing ZFS everywhere, including Linux, with full patent indemnity). Why does open sourcing take so long? Because we're starting from products that exist, in which a diversity of contributors and licensors/licensees have rights we have to negotiate. Indulge me when I say it's different than starting from scratch. I would love to go faster, and we are all doing everything under our control to accelerate progress. (Remember, we can't even pick GPL3 yet - it doesn't officially exist.) It's also a delicate dance to manage this transition while growing a corporation.

But most of all, from where I sit, we should put the swords down - you're not the enemy for us, we're not the enemy for you. Most of the world doesn't have access to the internet - that's the enemy to slay, the divide that separates us. By joining our communities, we can bring transparency and opportunity to the whole planet. Are we after your drivers? No more than you're after ZFS or Crossbow or DTrace - it's not predation, it's prudence. Let's stop wasting time recreating wheels we both need to roll forward.

I wanted you to hear this from me directly. We want to work together, we want to join hands and communities - we have no intention of holding anything back, or pulling patent nonsense. And to prove the sincerity of the offer, I invite you to my house for dinner. I'll cook, you bring the wine. A mashup in the truest sense.

Best,
Jonathan

President, Chief Executive Officer,
Sun Microsystems, Inc.


Sunday Jun 10, 2007

Blackbox on a Shake Table

We're continuing to see a lot of interest in Project Blackbox - a complete datacenter we're introducing to allow customers to leave behind traditional raised floor facilities for vastly less expensive, more power efficient and faster to deploy alternatives.

With network equipment increasingly managed via technology, not people, and service operators assuming individual components and machines will fail but a web service never can ('reliable services built from unreliable parts'), our view is Project Blackbox, and fail-in-place software infrastructure like the free ZFS file system, represent real options for CIOs and CTOs out of space, power, money or patience.

We put a blackbox on one of the world's largest shake tables at University of California, San Diego, to get a sense for how it'd handle a severe earthquake. Rather than "shake and bake" a computer, we figured we'd test out the equivalent for a complete datacenter (and throw some Sun SPOT sensors into the mix to harvest data):

For those interested in our blades launch event last week, at which John Fowler and Andy Bechtolsheim unveiled our first Intel product, a blade designed for our AMD/Intel/SPARC Sun Blade 6000, please see here. There's more data, here, as well (and yes, they'll run Windows, Linux and Solaris - all under the same management software, and they'll even fit in a Blackbox...).

And if you carefully watch Chapter 3 of the launch video, you'll see John and Andy provide a sneak peek for a project internally code-named "C48" - it's behind the big black drape...
