Wednesday Dec 21, 2005

The problem with the world

My father-in-law says that the problem with the world is more people writing than reading. He has been saying it for decades. "Who wrote that?" I would riposte then, in my impertinent days before the Internet. We fight this battle one book at a time, or more. My father, for example, has multiple books in progress in different rooms. No, I haven't told them about blogs, and how blogs exacerbate the world's malady now that any impertinent person like me can write and publish.

The Web started fine, no writer surplus when pages were predominantly read-only static content. Mundane stuff did not deserve to be on the net; I was homepageless for many years. Static content concentration resulted in browsers accessing the same content over time, so Web caches made sense. Caching content at large aggregation points, like corporate Internet access proxies and at service providers, saved bandwidth and shortened response times. If you use something often, keep it close to you. What a concept... Processors keep cache lines in fast on-chip memory, operating systems keep file system caches in system memory, and restaurants keep the most popular dishes pre-cooked ready to heat and serve.

Except for mutual fund fees, which are damn predictable regardless of future performance, past behavior may not be a good predictor of the future, warns the prospectus. Such warnings fit the Web cache case, though. Web usage evolved to include much more dynamic content. Content is now highly customized to our identities, and cannot be days old. Auction sites, brokerage houses, and blogs demand content that is personal and timely. The impact on infrastructure is simple: it drives more bandwidth and end-point capacity so that this content can be assembled and served fresh. The Moore-Shannon match I described in a previous post gives us more endpoint and channel capacity in the servers and the plumbing that make up the net. I mean no disrespect by skipping the sophisticated distributed caching and tiered processing that also make up the net; I am exaggerating to highlight how brute force caching is becoming less useful.
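To make the point about static versus personalized content concrete, here is a minimal sketch of a brute force cache. It assumes a simple least-recently-used policy, and every number in it (100 users, 10 pages, room for 50 entries) is invented for illustration; it is not a model of any real proxy.

```python
from collections import OrderedDict


class LRUCache:
    """A tiny least-recently-used cache that only tracks keys and counts hits."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.requests = 0

    def fetch(self, key):
        self.requests += 1
        if key in self.entries:
            self.entries.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return
        self.entries[key] = True               # miss: fetch from origin, keep a copy
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used key

    def hit_rate(self):
        return self.hits / self.requests if self.requests else 0.0


def static_web(users=100, pages=10, capacity=50):
    """Everybody reads the same few pages, so the cache key is just the page."""
    cache = LRUCache(capacity)
    for _user in range(users):
        for page in range(pages):
            cache.fetch(("page", page))
    return cache.hit_rate()


def personalized_web(users=100, pages=10, capacity=50):
    """Every page is assembled per user, so the cache key includes the user."""
    cache = LRUCache(capacity)
    for user in range(users):
        for page in range(pages):
            cache.fetch(("page", page, "user", user))
    return cache.hit_rate()
```

With these made-up numbers, static_web() gives a 99% hit rate (only the first reader of each page misses), while personalized_web() gives 0%, because no two requests ever ask for the same thing.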

Expectedly, brute force caching is not the best culinary choice either. We sent men to the moon but haven't made a reheated pizza that tastes the same, and in spite of decades of civil aviation, the smell of pre-brewed airline coffee is as cruel a torture as coach class legroom. Stashing pre-cooked dishes in a BIG refrigerator is brute force cuisine. I am here to advocate the Big Oven approach instead.

Incidentally, we faced the same choice when we created our CMT UltraSPARC T1 processor: allocate more transistors and power resources to caching, or save them for the processing resources themselves? Larger caches in processors ARE the brute force approach. In our case, just like most restaurants, we were optimizing for throughput (and particularly throughput per Watt), and there was a better solution: vertical threading. By making each of the eight cores in the UltraSPARC T1 vertically threaded, AND by having a wide memory interface (23 Gbytes/sec of bandwidth), the cores can keep retiring instructions in the face of long latency memory accesses, which is exactly the problem that caches tackle. Long latency memory accesses are like cooking steps in culinary recipes: you must wait for the oven to do its thing, and that takes a while.

Cautious customers and, as of our product launch, some competitors as well, ask how an eight core CMT can perform with just a 3 Mbyte L2 cache. Isn't our CMT like eight processors in a single socket? Shouldn't it have eight times the cache of traditional processors to keep them individually busy? Well, the whole point is that CMT addresses the memory latency problem through vertical threads, and this makes it much less sensitive to cache size because it is less sensitive to cache misses in the first place. Instead of a large refrigerator full of pre-cooked dishes to be heated and garnished by a single overworked cook, we put in a large oven and hired eight nimble cooks. The cooks were taught how to handle four orders at a time (just like my father does with books), and whenever one of these orders goes in the oven they switch immediately to one of the other three that is not in the oven. That is why the large 23 Gbytes/sec oven is important: it holds up to four orders for each of the eight cooks at the same time.
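The cook and oven story can be sketched as a toy simulation of one core. Big caveats: this is not a model of the real UltraSPARC T1 pipeline, and the 10 cycles of work and 30 cycles of memory wait are numbers I made up. Each thread computes for a while, then waits on memory; the core switches to any thread that is not waiting.

```python
def core_utilization(threads, compute_cycles, memory_latency, total_cycles=100_000):
    """Fraction of cycles in which one core does useful work.

    Each thread repeats: run `compute_cycles` on the core, then wait
    `memory_latency` cycles for memory (the order is "in the oven").
    The core runs one ready thread per cycle and moves on to another
    ready thread as soon as the current one stalls.
    """
    ready_at = [0] * threads                 # cycle at which each thread may run again
    remaining = [compute_cycles] * threads   # work left before the next memory access
    busy = 0
    current = 0
    for cycle in range(total_cycles):
        # Look for a ready thread, round robin, starting from the current one.
        for offset in range(threads):
            t = (current + offset) % threads
            if ready_at[t] <= cycle:
                break
        else:
            continue                         # every order is in the oven: idle cycle
        busy += 1
        remaining[t] -= 1
        current = t
        if remaining[t] == 0:                # thread issues a long latency access
            remaining[t] = compute_cycles
            ready_at[t] = cycle + 1 + memory_latency
            current = (t + 1) % threads      # switch to the next order
    return busy / total_cycles


one_order = core_utilization(threads=1, compute_cycles=10, memory_latency=30)
four_orders = core_utilization(threads=4, compute_cycles=10, memory_latency=30)
```

Under these assumptions a single-threaded core is busy 25% of the time, while four threads keep it busy 100% of the time, with no cache at all in the model. In this toy, utilization is roughly min(1, threads * compute / (compute + latency)).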

We explained this approach to cautious customers through architecture and modelling data, but the most convincing step, in computers as in food, is testing and tasting. Trust the explanations but test the product. Two weeks back I heard about the “try and buy” program. Evaluate the CMT box and only buy it if you want to keep it. Not many restaurants go that far. Some won't even let you in the kitchen to look inside those BIG refrigerators. I am not sure the “try and buy” program will satisfy competitors. After all, competitors react in either the disparage or the embrace modality. Questioning the cache size is an example of the disparage modality. The embrace modality was used by another competitor, first claiming they already had multi-core vertically threaded network processors, and more recently announcing CMT plans themselves.

Network processors (aka NPUs) are indeed vertically threaded processors. Unlike NPUs, the UltraSPARC T1 is a vertically threaded general purpose processor, with all the software development advantages of standard tools and languages, full memory protection, virtual memory, cache coherency across cores (at L1 and of course the shared L2), arbitrarily large program memory, and no collaborative thread yielding constraints. The UltraSPARC T1 is a good foundation for I/O and network facing workloads without the programming quirks of network processors. Competitors arguing they already have CMT technology is akin to comedian Benny Hill's reaction when told about neutron bombs that destroy people without damaging their buildings: “Oh, we already have them in England, we call them mortgages.” That is how similar they are...

As for embracing CMT as their future direction, that would be just flattering.



[ Technorati: NiagaraCMT, ]

Friday Dec 09, 2005

Names more precious than oil

Higher driving and heating costs, attributed to supply-demand imbalances, are a reminder that energy is increasingly a scarce resource. Or is it? On one hand more oil reserves have been generally discovered than consumed each year, but on the other hand we may be about to exit that phase. So what's a man to do? A man is to forgo trying to predict the future, and buy energy positions that neutralize the cost of living impact of these runaway costs. Granted, serious imbalances would disrupt our way of life in ways that no hedging position can restore, but that is a bigger problem than a humble blogger can solve.

Having hedged the energy (and health care costs while at it), a man can then pour a glass of Cabernet and ponder other scarce resources. We could worry next about the electromagnetic spectrum. Borrowing lines from real estate agents, they don't make spectrum any more, it does not grow on trees. Spectrum is crowded by radios, TVs, cell phones, garage door openers, wireless hot spots, radars, you name it. I am ever impressed at our ingenuity for getting more out of a limited spectrum over time, from the AMPS cellular system introducing frequency reuse with variable cell sizes [Bell System Technical Journal, 1979], to spread spectrum (CDMA) and the way it packs more information in the same channel band.

I have bored many an audience by repeating that we are essentially dealing with Claude Shannon's upper bound on the amount of information per channel by exploiting the computation enabled by Moore's law. Shannon vs. Moore to 12 rounds. We have applied this to wireless links, to data center wiring, and even to processor chip input/output interfaces. Through Moore's law, God gives us the transistors to store and crunch information, but doesn't give us the pins to get all that information in and out of these circuits.

We fought valiantly against Shannon's tyranny. 1000BASE-T moves Gigabit Ethernet bits over Unshielded Twisted Pair (UTP Ethernet was 1000 times slower when it started around 1987). EV-DO pushes a couple of Megabits of packet data into the existing 1.25 MHz CDMA spectrum, and could become a global wireless DSL of sorts (no more hotel Internet access fees!). SERDES technology delivers 5 Gbps and beyond (per pin pair) in and out of ASICs and processors. All this is enabled by signal processing at the transmission line endpoints, in the form of sophisticated coding and modulation, adaptive equalization, clock recovery, echo cancellation, path diversity, and so on. But Claudes will be Claudes, and eventually we'll reach the bound, or a point of diminishing returns. We can then go to higher capacity channels, like optics for cabled applications, and like Robert Drost's work on Proximity Communications for future processor chips' immediate interfaces. No more sleep lost over scarcities other than the scarcity of sleep itself.
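For the record, the bound in question is Shannon's capacity formula for a noisy channel, C = B * log2(1 + SNR). Below is a small sketch that applies it to the EV-DO figure above. It assumes "a couple of Megabits" means 2 Mbit/s and treats the 1.25 MHz carrier as a single plain noisy channel, which glosses over everything that makes real CDMA interesting.

```python
import math


def shannon_capacity_bps(bandwidth_hz, snr_linear):
    """Shannon's upper bound on error-free bit rate for a noisy channel."""
    return bandwidth_hz * math.log2(1 + snr_linear)


def min_snr_db(bit_rate_bps, bandwidth_hz):
    """Smallest signal-to-noise ratio (in dB) at which the bound allows this rate."""
    snr_linear = 2 ** (bit_rate_bps / bandwidth_hz) - 1
    return 10 * math.log10(snr_linear)


# Assumption: "a couple of Megabits" taken as 2 Mbit/s in a 1.25 MHz channel.
evdo_snr_db = min_snr_db(2e6, 1.25e6)
```

Under those assumptions the bound asks for a signal-to-noise ratio of roughly 3 dB. Every extra bit per second per Hertz roughly doubles the required signal-to-noise ratio, which is why the endpoints need so much signal processing, and why we eventually go shopping for a wider channel.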

For those of us who always find something to worry about, I'll mention a really scarce resource in our global world: Names. Good names for our classes, methods, and variables. Good signal names for our ASIC's RTL. Good names for our children. I, for one, didn't consume a middle name when our first boy was born; we kept good names dry for later. Actually the first born, like a processor with a single register, doesn't need a name until a second baby arrives in the family. We couldn't quite convince the nurse, and had to name him before we could take the baby home.

Given names are contested, ask me, I share mine with a Mermaid and a German detergent, but they are nothing compared to the stakes around brand names, trademarks, or domain names. Armies of lawyers descend. A land grab for those easy-to-remember, short-to-type names, devoid of negative connotations in most of the 2800 languages spoken on the planet. That is global scarcity. To make things worse, perfectly good names are condemned by unfortunate events, a "Titanic", an "Enron", get out of circulation, because human memory has no cold reset input to make us forget.

The creative pace of high-tech aggravates name scarcity. Think about project names, about industry initiatives, about technologies. A chronic name deficit forces name reuse, name overload, or even worse, acronyms. Incidentally, the cool CMT (Vade Retro, an acronym!) technology we just launched with the productization of the UltraSPARC T1 processor platforms was internally known for a while as Niagara. Not a bad name. Relatively short, no residual meaning inside Sun, and even visually metaphoric for throughput. A minor weakness: Niagara rhymes with a prescription medication of singular use, but outside adolescent circles, who would have the poor taste to bring that up? (Breaking news: a spokesperson for a three letter competitor brought the medication up, so our competitors must be employing adolescents.)

But the best part is the name efficiency of Sun's CMT play. No namespace clutter, one powerful technology, one name to remember (UltraSPARC T1), one multi-core processor, one socket. This simplicity helps my strained memory. So strained that when I am asked about how UltraSPARC T1 stacks up against our Intel-based competitors, I have to pull out an Intel roadmap cheat-sheet to be sure. Do they refer to Bensley with Lindenhurst, or the Truland platform, Paxville or Tulsa, Dempsey or Woodcrest, Sossaman through Whitefield on a Conroe platform or all the way to Dunnington? The casual listener thinks I am doing a public reading of Harry Potter, and I haven't even invoked the Itanium namespace.

Energy, Spectrum, and Names. On the good side of each scarcity. CMT power savings, tackling Shannon, and wrapping it all up in a single name. I'd love to write about what we are bringing next to the Moore vs. Shannon fight, but you'll have to wait for the next round gong, or invite me for a preview.



Thursday Dec 08, 2005

Walking with a limp

You can't please everybody. This customary parent or mentor wisdom is usually thrown at our lack of academic focus, or at our design choices as engineers. Generalists are useful, but complete knowledge was EOL announced when we left The Garden of Eden (or got evicted rather), and Last Shipped around Leonardo da Vinci. Generalists cover just a different subset of the knowledge tree, exemplifying another form of a specialist. Shall we say Horizontal Specialists? Last time I had an antagonistic experience with an HMO General Practitioner, I voiced what I really thought about him. He called building security. Next time I will just use the Horizontal Specialist sobriquet, and walk away. But this blog is about the specialization of machines and devices. I will leave human and medical doctor specialization out to avoid entangling my beloved employer.

For computing devices the specialization dilemma is captured by the historical name we have been using for the processors and the servers we make, General Purpose. A machine suitable for many uses, possibly beyond its original designer's intent, say the optimists. An engineering specialty so narrow that its practitioners do not know much about the software and workloads above it, say the cynics. They are both right: "generalist" machines designed by "specialist" humans.

Those who drive teenagers to school may have seen how they drag their feet on their way to class. This teenager's mother commands him to stop being lazy and please pick up his feet. His father lectures him that men don't do whatever their mothers say, men do what they want. In an attempt to please both parents the poor teenager drags one foot and picks up the other, walking with a limp. Are we designing a limp into our machines by trying to please everybody? The CMT throughput server philosophy postulates that we can run faster if we don't limp, and indeed the UltraSPARC T1 based products we are launching these very days do just that. We decided to please throughput-oriented, horizontally scalable integer workloads at the expense of floating point and single thread performance.

The immediate payback we get from creating more specialized general purpose processors and systems is that a whole set of applications and deployment architectures take far fewer boxes. This benefit is compounded by the power efficiency of these UltraSPARC T1 boxes. Before you accuse me of spewing out unquantified generalities, I will offer quantification along two dimensions, within and across Moore's law process generations.

For purists interested in architectural prowess, comparing within a given process manufacturing technology is the fair comparison. Dealt a hand of cards (wafer cost, transistors, complexity, and power), architecture is the game of playing them best. But given exponential transistor increases across generations, ignoring the impact of process technology and just relying on architecture leads to certain defeat. We must compare both within and across generations.

Within a generation CMT is roughly an order of magnitude improvement. Take the web facing workloads I care about for some of my work: they run about 8 to 10 times faster on a Niagara system than on a contemporary general purpose SPARC processor consuming the same power, made in the same 90nm process, but having the limp of pleasing the single thread and floating point constraints. Getting one order of magnitude out of the same power envelope and manufacturing technology is pretty compelling, but in the interest of full disclosure let's show the card we pulled out of our sleeve: Memory. CMT is all about making the most of system memory bandwidth, and in comparing within a generation CMT plays with much more memory bandwidth in its hand. Did we cheat? Not really, architecture is also about interfaces and optimizing the memory interface is part of playing our cards.

Comparing across generations is projecting the throughput ratio between CMT on the current vs. the next process technology nodes, while keeping the power and cost as invariants. If you were expecting the answer to be another order of magnitude, I'd like to have some of what you are smoking. Architectural order of magnitude improvements are kind of rare, so now we fall back to riding Moore's Law. Having lowered your expectations, here is the good news. Unlike previous limping approaches, CMTs will give you nice integer factors (2x for example) across Moore's law cycles. And that is all we can ask from a new architecture: a solid one-time jump to a different curve, and climbing the new curve at least at the same rate as before the jump.
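The "jump to a different curve" picture is simple enough to sketch. The numbers here are assumptions: I take the low end of the 8 to 10 times figure for the one-time jump, the "2x for example" per process generation for the new curve, and, generously, the same 2x for the old curve.

```python
def relative_throughput(generations, jump=1.0, per_generation=2.0):
    """Throughput relative to today's conventional processor, one entry per process generation."""
    return [jump * per_generation ** g for g in range(generations + 1)]


old_curve = relative_throughput(3)             # [1.0, 2.0, 4.0, 8.0]
cmt_curve = relative_throughput(3, jump=8.0)   # [8.0, 16.0, 32.0, 64.0]
```

Under these assumptions the old curve needs three process generations to reach where the CMT curve starts, and the gap between the two never closes as long as both climb at the same rate.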

Our competitors don't neatly align their offerings to facilitate my simplistic two dimensional comparison within and across process technology; they put their stuff out and compete. I will just say that the competitive data gives me a warm feeling about this cool technology, and defer to my fellow Sun bloggers, who cover the competitive angles much better and deeper than I do.

I recommend following "Welcome to the CMT era", a great repository of all things CMT at Sun by Richard McDougall, so I can walk away from trying to please everybody and shift back to my original topic, generality vs. specialization.

Is the sacrifice in generality worth the benefit? Does the sacrifice break our axiomatic belief in layered modular system design, in not caring about and not counting on the implementation details of other modules or layers? I will state a claim so counterintuitive that you might want to invite me in to explain myself and take a Breathalyzer test. We claim that a specialized CMT architecture actually broadens the applicability of the processor beyond what was possible with its original general purpose sibling. It derives enough generality to apply to other elements of the IP and telephony network infrastructure, which hopefully is the subject of a future posting, for which I have at most the heading. But if I don't get to it, feel free to give me a call and invite me.



Wednesday Dec 07, 2005

How to tell a hardware from a software person

They both write code on a screen at a very high abstraction level, they test their code before integrating into a larger blob through highly abstracted interfaces. Verilog looks just like a structured programming language to the uninitiated, and tools keep most coders equally removed from the ultimate assembly language and transistor level details.

Some argue that the large costs of tooling modern semiconductors put a perfection burden on the hardware designers that, through insomnia, molds them into somber personalities. Software engineers are pictured, by opposition, as carefree characters always able to land on their feet by recompiling and patching. This is all passé. Hardware design has adopted a train model where fixes are phased in at multiple pre-defined points, and the impact of software defects can result in damages exceeding semiconductor tooling costs.

To tell software and hardware people apart just ask CUI BONO?
Specifically: Cui bono from Moore's law? Who benefits from Moore's law?
Moore's law renders hardware achievements obsolete, while turning slow or bloated software into achievements. A colleague and I designed the first LAN controller with embedded memory, a first that enabled packaging the entire controller in 24 pins. We put in 2 kilobytes of RAM buffers, solved the embedded memory yield issues, went for beers and felt great about our achievement. Our bragging rights for that chip lasted as little as our beers. Darn law. So the test is simple: if you are a victim of Moore's law you are in hardware; if you are a beneficiary of Moore's law you are in software.

Hardware is further victimized by the constant pressure Moore's law puts on price. Sometimes this leads to downward-spiraling prices for a given hardware function, and other times to increased capacity for a roughly constant price. Server processors have followed the latter, namely the speed bump regime. Successive processors push the clock frequency higher and thus deliver a performance benefit instead of a cost reduction. Software executes faster, subsequently making a bigger and more complex software edifice viable.

But if all good things must come to an end, how long will this regime last? Moore's law is not out of gas yet, but cranking up processor clocks is getting harder and less productive. You heard the reasons: power dissipation vs. frequency and the impact of system memory latencies. Interestingly, the UltraSPARC T1 CMT anticipates this new regime of exponential growth in transistors without a good incentive to push the frequency further. Will customers demand a price reduction now that the speed bump is dead? Not if we transition instead to a thread bump regime. UltraSPARC T1 inaugurates this transition to a thread bump regime, and consecutive CMT generations offering thread increases commensurate with Moore's law should provide the bumps. (Note to self: Contact Niagara ad agency with idea "We are the Bumps in Thread Bumps".)

But wait a minute, does this mean that our carefree software developers are no longer automatic beneficiaries? Indeed, this time around they may have to sweat a bit more to turn additional threads into software achievements. Oh, and maybe multi-core processor designers can have simpler lives now that there is more repetition and fewer unique circuits to design, verify, and lose sleep over. Not quite a reversal between victim and beneficiary, but we may need a better test than CUI BONO in the future.
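One standard way to put a number on that sweat is Amdahl's law, which this post does not invoke but which fits the thread bump argument: the speedup from extra threads is capped by whatever part of the work stays serial. A minimal sketch follows; the 95% parallel fraction is an invented example, and 32 is the thread count of one UltraSPARC T1 socket.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Amdahl's law: speedup over one thread when only part of the work runs in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


perfectly_parallel = amdahl_speedup(1.0, 32)   # every thread bump pays off in full
mostly_parallel = amdahl_speedup(0.95, 32)     # a 5% serial part already hurts
```

With a perfectly parallel workload, 32 threads give a 32 times speedup. Leave just 5% of the work serial and the same 32 threads give only about 12.5 times, and doubling the threads again helps less each time. Throughput workloads made of many independent requests sit near the lucky end of that range.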



Tuesday Dec 06, 2005

The warmth of vacuum tubes

I grew up listening to vacuum tube nostalgia. Radio technicians could diagnose a radio receiver with just a screwdriver, and sometimes even fix it. But beyond that, the transition from vacuum tubes to semiconductors was a religious topic within the Radio Amateur circles. It got harder to build your own gear, some said it didn't sound the same, the non-linearities of transistor amplifiers, you know. But the main complaint was not technical. Radio Amateur operators missed them because vacuum tubes kept their hands warm during the cold winter nights.

Radio Amateur anecdotes are some of the most memorable stories I could tell, maybe some other day, on some other blog. And I would also join the mourning of the vacuum tube, if it weren't for a more profound and recent displacement I need to mourn. The displacement of the HF Radio Amateur at the hands of the Internet... A shared 3 kHz voice channel, which may or may not work on a given day to a given place, displaced by a DSL line and a browser. I am not the only deserter. Just look at the roofs of a city like Montevideo, once the highest density of Yagi antennas on the planet: you walk its streets today and there are few antennas to be seen. Victims of the ubiquitous Internet.

Yet I appreciate irony, and with modern life's Internet addiction filling my once ham radio nights, I discover that the Athlon laptop gets warm just like the old vacuum tube transceivers. Déjà vu. We replaced hot vacuum tubes with cooler solid state radio, then things got pretty hot when we put lots of transistors in NMOS integrated circuits. I recall my first IC design in NMOS: clocked at a meager 10 MHz, it required a ceramic package and got too hot to touch. They got cooler again with CMOS, so we started building bigger and faster semiconductors. Up to the point that the semiconductors running a lowly laptop keep my hands warm. The Internet server infrastructure (replacing the ham radio ether) requires major ventilation and air-conditioning. At the rate we are going we might have to host the planet's infrastructure at the poles. How is that for an idea: dual home the entire net infrastructure to the North and South poles, no single point of failure, affordable land, maximum redundancy, cooled by keeping the windows open, and solar cells for 24 hours a day (well, half a year). I digress, but you read it here first. Hosting the net at the poles...

What is next? How do we pull the CMOS cool device trick again? For the moniker we can certainly reuse the letters CM, that is a start. As for the substance, let me narrate a customer lab visit we had here in Newark. We were showing off our first UltraSPARC T1 bringup machine, verbally conveying how naturally Horizontal Micro-scaling fits telephony infrastructure network elements. We brought Solaris up, showed our demo, and asked the customer to do the honors and check how many processors Solaris reported. Impressing somebody by printing the number 32 on a screen may not get you very far socially with your friends, or at a bar, but for a skeptical techie the number 32 out of a single processor socket was meaningful. He lived on that side of the fine line between skepticism and paranoia. He touched the processor, and feeling it cold, accused us of smoke and mirrors; basically of running the demo from a different machine. We proved the accuser wrong, and ended up making the unintended point that Niagara really is a Cooler technology. We earned the right to reuse the letters for the next cool technology, CMT.

But before you start asking about how to keep your hands warm in the CMT era, fear not, there is still Memory, that is, plenty of DIMMs to keep the operator warm. What a coincidence, every train of thought takes me back to the Memory theme.




Turning the tables on Horizontal Scaling

While previewing the UltraSPARC T1 CMT processor to many customers I have been saying that it turns the tables on horizontal scaling, and here I am to expand on what was just a bullet in my slides through this and other CMT related musings.

We hear about Google indexing and search server farms as the ultimate in horizontal scaling: 5000, 6000, 10000 servers. Every time I hear about Google the number goes up. Not the stock, the number of servers in the farm. I also remember Subodh stating that whatever can be horizontally scaled, will. Essentially the wisdom of Horizontal Scaling is to realize that the cost per processor tends to be higher in large SMP systems than in uni or dual processor systems. If the application (or the workload) is amenable to separation as loosely coupled networked processing, then the Horizontal Scaling (a.k.a. scale out) architecture is appealing based on cost and service availability metrics.

We all know that web facing workloads are horizontally scalable by virtualizing the service IP address through a load balancer box, or using Round Robin DNS. Wireless telephony servers also tend to use scaling out for most network elements, generally with a clustering or HA layer instead of the simplistic intercepting load balancer used for IP networks. Sun Ray servers are a hybrid species: they are deployed horizontally as groups of servers, but users also want fast Sun Ray sessions on fast SMPs. This point is proven by my colleague Jochen, who has mastered the art of manual load balancing, that is, repeatedly sliding a Java Card until you get the fastest Sun Ray server in the building.
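For those who have not met one, the simplistic load balancer mentioned above boils down to very little. Here is a minimal round robin sketch; the server names are hypothetical, and a real load balancer box or Round Robin DNS setup also deals with health checks, session stickiness, and caching resolvers, none of which appear here.

```python
from collections import Counter
from itertools import cycle


class RoundRobinBalancer:
    """Hands out servers in strict rotation, the way round robin DNS rotates addresses."""

    def __init__(self, servers):
        servers = list(servers)
        if not servers:
            raise ValueError("need at least one server")
        self._rotation = cycle(servers)

    def pick(self):
        return next(self._rotation)


servers = ["web-1", "web-2", "web-3"]   # hypothetical server names
balancer = RoundRobinBalancer(servers)
assignments = Counter(balancer.pick() for _ in range(9))
```

Nine requests land three on each server, with no regard for how loaded each one is. That blindness is exactly the gap Jochen's manual method fills.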

Does Subodh's prediction mean that the world is inexorably converging on server farms or clusters? Does it mean that commoditized generic 1P/2P whiteboxes will underpin our Internet and telephone networks? Not if we realize that there is one act left to Horizontal Scaling, that is, scaling processors inside a chip rather than scaling them out across the sheetmetal of discrete servers. CMT is exactly that, turning the tables on Horizontal Scaling by keeping the cost and availability benefits, but throwing out its drawbacks. And one of the drawbacks of the scale out model is the theme that got my blogging started: Memory.

Next time somebody tells you about the Google server farm, forget the number of servers they claim by then, just visualize the number of memory DIMMs, imagine them all lined up and warm just in case you want to search something at that very moment. Now ask yourselves, is there a way of getting the computation to scale horizontally without having to spread all these memory DIMMs all over the place? CMT is exactly that, distributing computation while consolidating the system memory. Need a shorter description of that? How about Horizontal Micro-scaling. You heard it first in my blog.

Now I'll stop and wait for a comment or two before I go on. And the one comment I need is: But, ain't memory cheap anyways?


