Waiting for I/O
Tuesday Jun 12, 2007
iTunes : Coming soon to a POWER6 chip near you

I am an unabashed fan of processor design, and the past few years have been exciting times. The most recent processor introduction (the POWER6 from IBM, in case you had gone walkabout over the last fortnight) was therefore much awaited by processor aficionados.

Doubling frequency not halving execution time
At 3.5-4.7GHz, the POWER6 hums a ditty rather older than recent hits from Intel, AMD and Sun. Obtaining overall system performance benefits by increasing processor clock frequency is becoming, to put it mildly, difficult. Main memory latencies do not keep pace with conventional CPU frequency increases, meaning that the gap between a CPU stalling and memory supplying the data it needs to resume execution is wide, and getting wider by the year. The illustration alongside indicates what happens in a two-year period - if all the additional transistors that Moore's Law gave us in that period were spent on doubling the CPU frequency, only the time in which the CPU is computing halves. Since memory hasn't appreciably sped up meanwhile, stall time remains the same. Net result - the thread does not finish executing in half the time.
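
The arithmetic behind that illustration can be sketched in a few lines of Python. This is a minimal model, assuming execution time splits cleanly into compute time (which scales with clock frequency) and memory stall time (which does not). The 50/50 split used below is a hypothetical figure chosen for illustration, not a measurement of any real workload.

```python
def speedup_from_clock(compute_fraction: float, clock_multiplier: float) -> float:
    """Overall speedup when only the compute share of run time scales with clock."""
    stall_fraction = 1.0 - compute_fraction
    # Compute time shrinks with the faster clock; stall time is unchanged.
    new_time = compute_fraction / clock_multiplier + stall_fraction
    return 1.0 / new_time


# Hypothetical thread: half its time computing, half stalled on memory.
speedup = speedup_from_clock(compute_fraction=0.5, clock_multiplier=2.0)
print(f"Speedup from doubling the clock: {speedup:.2f}x")
```

Under that assumed split, doubling the clock buys roughly 1.33x rather than 2x, and the more time a thread spends stalled, the smaller the gain.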

This is a simplification, of course - techniques like larger caches, pre-fetching, super-scalar execution, deeper pipelines, simultaneous multi-threading, etc., are crammed into the chip to try and hide the memory latency problem. The issue is that these can be expensive to research and implement, and one rapidly runs out of newer ways of hiding a problem one can't solve (Watch this space though).

The alternative is an option pioneered by the UltraSPARC T1. Thread-level parallelism across 8 cores on a chip (each core can switch amongst 4 threads with zero latency) replaces all of the techniques mentioned above, and 2 years ago delivered a system throughput 15x that of an equivalent chip from 4 years ago. Not content with 32 threads, we will soon be launching systems based on a 64-thread successor.

Besides, even all of those techniques do not squeeze everything out of an increase in processor frequency. A classic example is the POWER6 itself. At 4.7GHz its frequency is about 2.14x that of the POWER5+ at 2.2GHz, but performance went up only 1.4-1.45x, according to IBM's rPerf indicators for the p5 570 and the POWER6-based 570. Further, at a reported 160W the 4.7GHz POWER6 chip consumes twice the wattage of the UltraSPARC T1.
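
For anyone who wants to check the ratios, here is a small sketch that uses only the clock speeds and the rPerf gain quoted above. The "gain per unit of clock increase" it prints is a rough figure derived for illustration, not an IBM metric.

```python
POWER5_PLUS_GHZ = 2.2
POWER6_GHZ = 4.7
RPERF_GAIN_LOW, RPERF_GAIN_HIGH = 1.40, 1.45  # performance gain range quoted in the post

# How much faster the clock runs.
freq_ratio = POWER6_GHZ / POWER5_PLUS_GHZ

# How much of that clock increase showed up as delivered performance.
per_clock_low = RPERF_GAIN_LOW / freq_ratio
per_clock_high = RPERF_GAIN_HIGH / freq_ratio

print(f"Frequency ratio: {freq_ratio:.2f}x")
print(f"Performance gain per unit of clock increase: {per_clock_low:.2f}-{per_clock_high:.2f}")
```

On these numbers, roughly two thirds of the frequency increase translated into performance.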

What does all this have to do with iTunes? For reasons completely undecipherable, the IBM press announcement had the following statement :

the processor bandwidth of the POWER6 chip — 300 gigabytes per second — could download the entire iTunes catalog in about 60 seconds — 30 times faster than the Itanium processor in H-P's servers.

This gave me pause. A chip can do that? How interesting. To see whether this is possible, I ran a thought experiment. The most famous practitioner of thought experiments was Einstein, who ultimately produced the Special Theory of Relativity as a result of one. Far be it from me to claim anything remotely as revolutionary, but what I came up with was the following diagram of the POWER6.
[Diagram: POWER6]

I then tried downloading 300GB per second into that chip, ignoring the tiny problem of where I would store it while downloading the next 300GB chunk as time ticked on. There is the GX+ I/O bus, but that gives me only 20GB/s, assuming all of the bandwidth is for iTunes downloads. The L3 cache and Main Memory are just intermediate storage locations for the same data. Perhaps we can use those to temporarily store songs, especially because there is only 20GB/s coming in so far. Never mind that the processor may not be able to process 20GB of packets coming in and push them into memory in the same second. There are the inter- and intra-node buses, but they only connect to other chips. Unfortunately, IBM does not publish delivered I/O throughputs, so my thought experiment comes to a grinding halt, breaking down at 20GB/s as the theoretical best one chip can do.

Perhaps we can exploit the GX+ buses on other chips to pump in the remaining 280GB/s? That would mean we need 15 chips connected to each other, and then some incredible system and OS efficiency to saturate 15 sets of system buses. One could spend a lot of pre-thought-experiment time cramming sections of the 18 Terabytes into all the devices on all the I/O buses. Hmm, the two POWER6 processors on our particular chip will then pack up and apply for entry to the Elysian Fields. Assuming they thus began receiving the 18TB, where would they put it? The POWER6s on the other chips will have commenced a clock-down strike, demanding to know why our chip is the privileged recipient, while they are slaves working for a song.
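
The back-of-the-envelope numbers in this thought experiment can be reproduced with a short sketch. It assumes decimal units (1TB = 1000GB) and takes the 300GB/s, 60 second and 20GB/s figures at face value, and the variable names are mine.

```python
import math

CLAIMED_GB_PER_S = 300  # "processor bandwidth" quoted in the press release
DOWNLOAD_SECONDS = 60   # claimed time to download the catalog
GX_BUS_GB_PER_S = 20    # GX+ I/O bus bandwidth per chip, as used in this post

# Size of the catalog implied by the claim.
catalog_gb = CLAIMED_GB_PER_S * DOWNLOAD_SECONDS

# Chips needed if every GX+ bus were fully saturated with downloads.
chips_needed = math.ceil(CLAIMED_GB_PER_S / GX_BUS_GB_PER_S)

# Time for one chip to pull in the whole catalog over its own GX+ bus.
single_chip_seconds = catalog_gb / GX_BUS_GB_PER_S

print(f"Implied catalog size: {catalog_gb / 1000:.0f} TB")
print(f"Chips needed at {GX_BUS_GB_PER_S} GB/s each: {chips_needed}")
print(f"One chip alone: {single_chip_seconds / 60:.0f} minutes")
```

So even in the theoretical best case, a single chip would need about 15 minutes rather than 60 seconds.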

You might admonish me - press releases are full of sound and fury signifying nothing, you might say. They should not be read (at least not literally) by people who have nothing better to do than conduct frivolous thought experiments. My only riposte to this is that I did manage to get a blog post out of it all.

I must confess that I got the details on the POWER6 chip from Bradley McCredie's POWER roadmap presentation - slide 7. Some of the bandwidth numbers are CPU frequency dependent, and lower at 3.5 and 4.2GHz.

Posted at 11:53PM Jun 12, 2007 by Santhosh D'Souza in Sun
