Three kids, a dog, a cat, sunny days, ocean breezes, and way too much time online
SLO Life

 
Friday March 31, 2006
64-bit processors - Cache and Memory
 
 
Who doesn't like cache? Why, just the word makes me feel happy... ah... cache...  
 
Opteron  
 
The Opteron has an on-die L1 and L2 cache for each core. What does this mean? Cache is memory available to the processor which is faster to access than the main memory on the system. There are different levels of cache which are successively slower and larger. The level 1, or L1, cache is inside the core of the processor. Most L1 caches are split into a data cache and an instruction cache. The data cache stores recently used or computed data; the instruction cache stores program instructions. The L1 is the fastest and smallest cache. The level 2, or L2, cache is larger and slower and may not be in the processor core. For processors where the L2 cache is not in the core, it is still on the processor chip. Some processors include a level 3, or L3, cache which may or may not be on the chip. It is larger and slower than the L2 cache. If it is off-chip, it is often accessed through the memory bus. Cache is used by the processor to keep data and program instructions which have been recently used by the program for faster access in the future. Changes to the cache size and speed have significant effects on the processor's performance. Cache takes a large amount of space on the chip and increases the cost of production for the processor.

The L1 instruction and data caches are 64K split into four 8K banks and are 2-way associative. What does this mean? The splitting of memory into banks is a common technique. It allows data to be spread across the banks for quicker access. This is called interleaving. Associativity refers to how many possible places in the cache a particular piece of data can be stored. In a cache with 1-way associativity, there is only one location where a particular piece of data can be stored. This location is determined by using values such as the virtual or physical memory address of the data. Because the data can be in only one place, looking in the cache to see if the needed data is there is a fast operation.
Since the cache is not very big, often more than one piece of data will map to the same single location. This means new data will overwrite whatever data is already there. By increasing associativity, the number of places a piece of data can be in the cache increases. This decreases the frequency of data overwriting other data, but makes finding the data slower. For a 4-way associative cache the processor must look in as many as four places to find the data needed; in an 8-way associative cache it must look in up to eight places, and so on. A fully-associative cache means a piece of data can be stored in any location, which means the processor may need to search every line in the cache to find the data needed.

Both L1 caches use 64-byte lines. What does this mean? When data is stored in the cache, it is stored in lines. The size of a line can be different between the L1 and L2 cache. When data is fetched into the cache, it is done one line at a time. The bigger the line, the more data is fetched. This has advantages and drawbacks. Getting more data in a single fetch can speed up the processor since fetches are time consuming. But fetching more data means overwriting more data in the cache, which can increase the need for further fetches later because needed data was overwritten.

The unified L2 cache is 1M, 16-way associative and can handle 10 simultaneous requests. What does this mean? While the L1 cache is split between a data and instruction cache, the L2 cache is a unified cache, so both data and instructions are in the same L2 cache. Notice the associativity of the L2 cache is much higher than that of the L1 cache. If a piece of data is not in the L1 cache or in the L2 cache, then unless the machine has an L3 cache it must be loaded from memory. This is a very slow operation (hundreds of cycles). To increase the odds that needed data will be in the L2 cache, the associativity is increased. This makes it slower but it is still much faster than main memory.
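To make the associativity idea concrete, here is a minimal sketch of how an address could be mapped to a set in a cache with the Opteron L1 figures above (64K, 2-way, 64-byte lines). The simple modulo indexing and the function name split_address are assumptions for illustration; real hardware may choose the location differently.

```python
# Sketch: where an address may live in a set-associative cache.
# Figures from the post: 64K cache, 2-way associative, 64-byte lines.
CACHE_BYTES = 64 * 1024
WAYS = 2
LINE_BYTES = 64
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)  # 512 sets of 2 lines each


def split_address(addr):
    """Return (tag, set_index, offset) for a byte address (assumed modulo indexing)."""
    offset = addr % LINE_BYTES          # position of the byte inside its line
    line_number = addr // LINE_BYTES
    set_index = line_number % NUM_SETS  # the one set this line may occupy
    tag = line_number // NUM_SETS       # identifies which line is in the set
    return tag, set_index, offset


a = 0x12345
b = a + NUM_SETS * LINE_BYTES  # 32K further on: same set, different tag
print(split_address(a))
print(split_address(b))
```

Both addresses land in the same set, so in a 1-way cache they would evict each other. With two ways they can both stay cached, at the cost of checking two tags on every lookup.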
The most basic cache can handle a single request each cycle. The L2 cache in the Opteron can handle 10 requests each cycle.

The L2 cache uses an LRU replacement policy. What does this mean? LRU stands for "least recently used". This means when new data needs to be stored in the cache, it overwrites the data in the cache which was accessed the longest time ago. The idea is that data which has not been accessed in a long time is less likely to be needed in the future than data which has been accessed more recently. The Opteron does not have an L3 cache.

The on-chip memory controller (named Northbridge) is 128-bits and operates at the same frequency as the core. It has a maximum bandwidth of 5.3G/s. What does this mean? To access main memory, the processor sends requests to the memory controller. The memory controller knows how to access the memory and is responsible for returning to the cache the data requested from memory. Different memory controllers can handle different sizes of data. For the Opteron, this is 128-bits. It operates at the same frequency as the core, which means it sends and receives data at the same frequency as the processor operates. If it operated at half-core speed then it would be able to send or receive data only every other cycle. The maximum bandwidth refers to the amount of data that can be transferred per second. In this case, it can transfer a maximum of 5.3G each second to or from the memory.

Memory access on an Opteron returns lines critical-word first. What does this mean? When data is requested from memory, it must return in a whole line. A word is the base unit for data. Lines are made of words. When the processor requests a piece of data from memory, it can be returned in one of two ways. The line may come with all the words in order. In this case, if the processor needs the 5th word in the line, it has to wait for the first four to come back from memory before it gets the one it wants.
If the number of words in a line is small, this is not a big delay. If it is large, say 100 words, the processor could have to wait a while if the word it wants is the 100th one. The other way to return the data is for the needed word to come first and the rest of the line to come after. This is critical-word first. It gets the data to the cache faster but the words in the line must be reordered at both ends of the transaction.

The processor has a 40-bit physical memory size and a 48-bit virtual memory size. What does this mean? Physical memory is usually the RAM on the system, commonly installed as DIMMs, SIMMs, or RIMMs. A 40-bit physical memory size means that the processor can use memory addresses up to 40 bits in size. The larger this value is, the larger the maximum amount of memory on the system can be. In this case it is 2^40 bytes, or 1 terabyte of memory. Processors also use a type of memory addressing called virtual memory. Virtual memory is used in multitasking computers (ones that run more than one program at the same time, pretty much any computer you've used in the last 25 years) to give programs a contiguous memory space. Physical memory addresses, which may be spread all over the memory, are mapped to new virtual addresses which are in one continuous block. The virtual memory size does not need to be the same size as the physical memory size, as is seen in the case of the Opteron.
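The virtual-to-physical mapping can be sketched in a few lines. This is only an illustration: the 4096-byte page size, the page_table contents, and the translate function are all made-up assumptions, not details of the Opteron.

```python
PAGE_SIZE = 4096  # assumed page size, not from the post

# Hypothetical page table: virtual page number -> physical frame number.
# Virtual pages 0..3 are contiguous; the frames behind them are scattered.
page_table = {0: 907, 1: 12, 2: 4401, 3: 58}


def translate(virtual_addr):
    """Map a virtual address to a physical one using the table above."""
    page, offset = divmod(virtual_addr, PAGE_SIZE)
    return page_table[page] * PAGE_SIZE + offset


for vaddr in (0, 4096, 8192 + 16):
    print(hex(vaddr), "->", hex(translate(vaddr)))

# Address-space sizes implied by the bit widths in the post.
print(2**40 // 2**40, "TiB physical,", 2**48 // 2**40, "TiB virtual")
```

The program sees addresses 0 through 16383 as one continuous block even though the four frames behind them are nowhere near each other in physical memory.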
 
Pentium D  
 
The Pentium D has an 8-way associative, 12K L1 instruction cache (trace cache), which uses 64-byte lines, for each core. Compared to the Opteron, the L1 cache in the Pentium D is much smaller, 12K rather than 64K. It does have a higher degree of associativity than the Opteron's cache. Because it is smaller, it lends itself to this higher associativity because there is a smaller cache space to search through. It uses the same line size, 64 bytes, as the Opteron.

There is also a 4-way associative, 16K L1 data cache on each core. The L1 data cache is larger than the L1 instruction cache and is less associative. It is not uncommon for processors to have different sized L1 instruction and data caches, as we'll see as we look at more processors.

The L1 data cache is non-blocking and allows up to four cache requests. A non-blocking cache allows the processor to issue more than one request at a time. The Pentium D's cache can handle up to 4 requests at a time. The load latency is two cycles for an integer and six for floating point. Load latency is the amount of time it takes for requested data to be returned.

Each core has a unified L2 cache that is 1M for Smithfield and 2M for Presler. The L2 cache is 8-way associative and is non-blocking with a load latency of seven cycles. The bandwidth between the L1 and L2 cache is 48G/s. What does this mean? Again, a unified L2 cache means both instructions and data are in the same cache rather than separated as they are in the L1 cache. Although the L1 cache in the Pentium D is more associative than the L1 cache in the Opteron, its L2 cache is less associative. A maximum of 48G can be transferred each second between the L1 and L2 cache. All the caches have an LRU replacement policy.

The hardware also supports prefetching. It attempts to stay 256 bytes ahead of the current data access location. What does this mean? Prefetching is loading data or instructions into the cache before they are requested by the processor.
In the case of the Pentium D, it assumes that needed data will immediately follow data currently being requested and loads the next 256 bytes of data as well. The advantage to prefetching is that often data that will be needed in the near future is immediately following the currently requested data. By loading it before it is requested, the processor eliminates the need for the instruction to request the data from memory since it will be in the cache, which is much faster. The disadvantage to prefetching is that when data is prefetched, other data in the cache must be overwritten. If prefetching is too aggressive, it will overwrite enough existing data in the cache that it will cause extra requests to memory to bring back the needed data which was lost.

The memory controller for a Pentium D is not on the chip. It is accessed through the front-side bus (FSB). What does this mean? Unlike the Opteron, the Pentium D does not have its memory controller built into the chip. Doing this makes the processor smaller and the design simpler, and may reduce production costs. The drawback is that communication across the front-side bus is slower than on the chip. Since the memory controller must use the bus eventually to fetch data from main memory, this slowdown may not be an issue since the bottleneck of the bus will occur either way. Having an off-chip memory controller also means a multi-processor machine could share a single memory controller among multiple processors, rather than having one in each chip. Again the advantages are the same; the disadvantage is that the processors may have to wait while the memory controller is executing another processor's request. This is a general disadvantage with any shared resource.

The FSB on the Pentium D is 800MHz with a theoretical maximum bandwidth of 6.4G/s. The Pentium D uses 40-bit physical address and 64-bit virtual address sizes.
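To see why staying 256 bytes ahead helps on sequential data, here is a rough sketch that counts how many accesses find their line missing, with and without a prefetcher. It is a toy model under stated assumptions: the cache never evicts anything, prefetched lines arrive instantly, and the function name demand_misses is made up. It shows the benefit only, not the overwriting drawback described above.

```python
LINE = 64    # bytes per cache line (from the post)
AHEAD = 256  # prefetch distance in bytes (from the post)


def demand_misses(addresses, prefetch):
    """Count accesses that found their line missing from the cache."""
    cached = set()  # line numbers present; eviction is not modelled
    misses = 0
    for addr in addresses:
        line = addr // LINE
        if line not in cached:
            misses += 1
            cached.add(line)
        if prefetch:
            # Pull in every line up to AHEAD bytes past this access.
            for nxt in range(line + 1, (addr + AHEAD) // LINE + 1):
                cached.add(nxt)
    return misses


walk = range(0, 4096, 8)  # sequential 8-byte reads over 4K
print(demand_misses(walk, prefetch=False))
print(demand_misses(walk, prefetch=True))
```

Without prefetching, every new 64-byte line costs a miss. With it, only the very first access misses because the following lines are always loaded before they are needed.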
 
Power5  
 
The Power5 has a separate L1 cache for each core. The L1 instruction cache is 64K, 2-way associative, and is direct mapped from the L2 cache. What does this mean? The L1 instruction cache being direct mapped from the L2 cache means that everything in the L1 instruction cache is duplicated in the L2 cache. Since the L2 cache is shared between the cores on the Power5, this allows the cores quick read access to each other's L1 instruction cache. The L1 data cache is 32K and 4-way associative.

The L2 cache is shared between the two cores. It is 1.875M divided into three slices and is 10-way associative. What does this mean? Unlike the Pentium D and the Opteron, the Power5 shares its L2 cache between its cores. Advantages to doing this are that the size of the processor is smaller than if it had two separate L2 caches, and it allows the cores to share data and instructions more quickly. Like the Opteron's L1 caches, the L2 cache on the Power5 is divided into slices to allow interleaving of data. The L2 cache uses 128-byte lines and has a bandwidth to the L1 cache of 64G/s.

The Power5 also has an off-chip 36M L3 cache with an on-chip directory. What does this mean? The Power5 is one of the processors examined which uses an L3 cache. Its L3 cache is the largest of the processors examined in this paper. An on-chip directory of the L3 cache is provided so that the processor can determine more quickly if needed data is in the L3 cache or must be fetched from main memory.

This cache is directly connected to the L2 cache through a back-side bus. Unlike its predecessor, the Power4, and the UltraSPARC IV+, the L3 cache on the Power5 is not accessed through the on-chip memory controller. Having the L3 cache connected directly to the L2 cache rather than accessed through the memory controller speeds up access time. The bus to the L3 cache operates at half-core speed.
 
UltraSPARC IV+  
 
Panther moves from the two levels of cache in the UltraSPARC IV to three levels of cache. It has a 64K L1 instruction cache which is divided into two 32-byte subblocks. Again, this cache uses interleaving by splitting the cache into subblocks.

The L1 data cache is also 64K and uses a write-through policy. What does this mean? The job of the cache is to provide a fast access copy of data in main memory. When this data is changed from its original value, that change must be written back to main memory. There are two common ways to do this: write-through and write-back (also called copy-back). In write-through, when data is written to the cache it is simultaneously written to main memory. The advantages to this policy are that it is simpler to implement and it keeps the cache and main memory consistent at all times. With write-back, changes to data in the cache are only sent to main memory when the changed cache line is evicted (overwritten). This leads to less traffic on the memory bus and speeds up system performance but comes with the risk that if the computer were to have an event such as power loss or a system crash, the changed data in the cache may be lost.

The L1 cache includes a 2K, 64-byte line prefetch buffer accessed in parallel with the L1 instruction cache. What does this mean? With the Pentium D, we discussed data prefetching. Some processors include the ability to prefetch program instructions. The IV+ can store up to 2K of prefetched instructions in a special prefetch buffer.

Panther’s L1 cache also includes a 2K fully associative write cache. What does this mean? A write cache allows the processor to continue on with other operations rather than wait for a write to complete.

The L2 cache for the UltraSPARC IV+ was moved on-chip and is shared between the cores. It is 2M, 4-way associative, and operates at half-core speed. It also is completely inclusive of all L1 caches. The L2 cache uses a copy-back policy to decrease bus traffic.
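The difference in bus traffic between write-through and copy-back can be sketched by counting how many writes reach main memory under each policy. This is a toy model, assuming a cache that holds a single line; the function memory_writes and the sample store pattern are made up for illustration.

```python
def memory_writes(stores, policy):
    """Count writes reaching main memory for a list of (line, value) stores.

    The cache holds one line, to keep the sketch tiny (an assumption).
    """
    cached_line = None
    dirty = False
    writes = 0
    for line, _value in stores:
        if line != cached_line:
            if policy == "write-back" and dirty:
                writes += 1  # evicted line is written back now
            cached_line, dirty = line, False
        if policy == "write-through":
            writes += 1      # every store also goes to memory
        else:
            dirty = True     # memory is updated only on eviction
    if policy == "write-back" and dirty:
        writes += 1          # final flush of the last dirty line
    return writes


# Ten stores to line 0 followed by ten stores to line 1.
stores = [(0, v) for v in range(10)] + [(1, v) for v in range(10)]
print(memory_writes(stores, "write-through"))
print(memory_writes(stores, "write-back"))
```

Repeated stores to the same line are where copy-back saves the most traffic, since only the final state of each line ever crosses the bus.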
The L3 cache on the Panther is 32M and 4-way associative. It has 64-byte lines and also follows a copy-back policy. L3 tags are kept on-chip. Like the Power5, the IV+ also has an off-chip L3 cache with an on-chip directory. A difference between the Power5 and the IV+ is the Power5's use of a back-side bus for access to the L3 cache.

The L3 cache is a victim cache, only being written to when things are evicted from the L2 cache. On a hit, the L3 line is copied back to the L2 cache and then invalidated in the L3 cache. What does this mean? When needed data is found in the L3 cache, it is copied into the L2 cache and then removed from the L3 cache. The purpose of the L3 cache on the IV+ is to store data which has been used by the processor in the past but was evicted from the L2 cache. This data is a "victim" of being overwritten. When it is needed again, it is moved to the L2 cache so it can be accessed quicker in the future and, since it is no longer a victim, it is removed from the L3 cache.

In cases where the two running threads cannot cooperate using the shared L2 and L3 cache, Panther has a mechanism for pseudo-splitting the shared caches. When split, both threads can read all of the cache but can only write to half of it. What does this mean? How can threads not cooperate? If the threads keep overwriting each other's data in the cache, it slows both threads down. By splitting the cache into two separate areas and only allowing each thread to write to one half, the processor can prevent the two threads from clobbering each other. The reason for not having this be the default setup is that doing so cuts the sizes of the L2 and L3 cache in half from the view of each core. When they are not clobbering each other, having the larger cache sizes significantly increases system performance. In the extreme case, if one thread was idle, the other thread could be utilizing all the cache rather than being restricted to half.
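The victim cache behaviour can be sketched with two small dictionaries. This is a toy model and not Panther's actual design: the class name VictimHierarchy, the tiny sizes, and the LRU eviction order are all assumptions chosen to keep the example short.

```python
from collections import OrderedDict


class VictimHierarchy:
    """Tiny L2 with an L3 victim cache. Sizes and LRU are assumptions."""

    def __init__(self, l2_lines=2, l3_lines=4):
        self.l2 = OrderedDict()  # least recently used entry first
        self.l3 = OrderedDict()
        self.l2_lines, self.l3_lines = l2_lines, l3_lines

    def access(self, line):
        if line in self.l2:
            self.l2.move_to_end(line)  # mark as most recently used
            return "L2 hit"
        if line in self.l3:
            del self.l3[line]          # invalidated in L3 on a hit
            result = "L3 hit"
        else:
            result = "memory"
        self._fill_l2(line)
        return result

    def _fill_l2(self, line):
        if len(self.l2) >= self.l2_lines:
            victim, _ = self.l2.popitem(last=False)
            self.l3[victim] = True     # L3 is written only on L2 eviction
            if len(self.l3) > self.l3_lines:
                self.l3.popitem(last=False)
        self.l2[line] = True


h = VictimHierarchy()
print([h.access(x) for x in ["A", "B", "C", "A", "A"]])
```

Loading C pushes A out of the L2 and into the L3. The next access to A is an L3 hit that moves it back into the L2 (evicting B into the L3), and the access after that is a plain L2 hit.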
 
UltraSPARC T1  
 
Each core has a 16K, 4-way associative L1 instruction cache that uses 32-byte lines. The L1 data cache, also in each core, is only 8K, 4-way associative, uses 16-byte lines, and has a write-through policy. One thing you may notice right away is the big difference in size of the L1 cache from the UltraSPARC IV+. This is just one of the many big changes in the T1 from the rest of the current SPARC processor line. The T1 does not just differ significantly from other SPARCs, but from the other 64-bit processors as well. Although its L1 instruction cache is larger than the Pentium D's, and the same size as the Itanium's, its data cache is noticeably smaller than that of any other processor.

The L2 cache is shared between cores and is accessed through a crossbar interconnection network. What does this mean? An interconnection network allows communication between the core and resources such as other cores, memory, cache, I/O, etc. In the case of the T1, with eight cores sharing the L2 cache and communicating with each other, a standard linear connection network (basically a wire connecting all eight cores to each other and the L2 cache) would get bogged down quickly. A crossbar is a type of connection switch which can handle more traffic. Imagine it as a grid of wires connecting each core and the L2 cache. This provides multiple paths for data to get from point A to point B without having a collision with data going from point C to point D, or even to point B.

The crossbar provides more than 200G/s of bandwidth. Here is another difference between the T1 and other processors. The bandwidth on the crossbar interconnect is three to four times as much as the on-chip bandwidth of the other processors. The T1 does have eight cores to support, rather than two, so this bandwidth size is not surprising.

The L2 cache is 3M banked four ways, 12-way associative, and uses 64-byte lines. Data is interleaved across the banks in 64-byte granularity.
The L2 cache has a directory of all eight L1 caches.

There are four on-chip memory controllers shared by the eight cores and accessed through the crossbar. The memory bus on the T1 has significantly more bandwidth than those of the other processors, at 20G/s. Recall that the bandwidths for the other processors were less than half of this, with the next largest being the UltraSPARC IV+ at 9.6G/s.

The T1 uses 40-bit physical addresses split into two sections, memory and I/O addresses, based on bit 39. What does this mean? A 40-bit physical memory size means that memory addresses are 40 bits long. The highest bit (bits are numbered 0 to 39, not 1 to 40) tells the processor whether this is a memory or I/O address. What is an I/O address? An I/O address is an address that belongs to one of the system's I/O devices. This could be video, network, or other devices. Remember the T1 was designed to be "network facing" so it expects to do a lot of I/O functions. It uses a 48-bit virtual memory size.
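Checking bit 39 of an address takes one shift and one mask. In this small sketch the function name classify is made up, and treating a set bit as "I/O" is an assumption; the post only says the bit separates the two kinds of address.

```python
ADDRESS_BITS = 40
IO_BIT = 39  # per the post, bit 39 separates memory from I/O


def classify(addr):
    """Report which half of the 40-bit physical address space addr is in."""
    if not 0 <= addr < 2**ADDRESS_BITS:
        raise ValueError("address does not fit in 40 bits")
    # Which value of the bit means I/O is an assumption made here.
    return "I/O" if (addr >> IO_BIT) & 1 else "memory"


print(classify(0x00_1234_5678))
print(classify(0x80_0000_0000))
```

Splitting on the top bit gives each kind of address half of the 1 terabyte space, or 512G apiece.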
 
Xeon  
 
The Xeon processor follows the same general design as the Pentium D. The L1 cache on the Xeon is the same as in the Pentium D. The L2 cache on the Nocona comes in 1M or 2M sizes. Paxville uses a 2M L2 cache size only. The L2 cache is on a 200MHz, shared bus to the off-chip memory controller. What does this mean? The L2 cache is accessed via the same bus that goes to the off-chip memory controller, though the L2 cache is on the chip. The memory bus is 800MHz with a maximum bandwidth of 6.4G/s.
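The 6.4G/s figure is consistent with a simple peak-bandwidth calculation: transfers per second times bytes per transfer. In the sketch below the 64-bit bus width is an assumption (only the 800MHz and 6.4G/s figures come from the text above), and the function name peak_bandwidth is made up.

```python
def peak_bandwidth(transfers_per_second, bus_width_bits):
    """Theoretical peak in bytes per second: one full-width transfer each tick."""
    return transfers_per_second * bus_width_bits // 8


# 64-bit bus width is an assumption; the post gives only 800MHz and 6.4G/s.
fsb = peak_bandwidth(800_000_000, 64)
print(fsb / 1e9, "GB/s")
```

This is a ceiling, not a measurement. Real transfers lose time to things the formula ignores, such as bus arbitration and waiting on the memory itself.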
 
Itanium 2  
 
There are three levels of cache available on-die with this processor. Both the L1 data and instruction cache are 16K and 4-way associative. The L1 instruction cache supports simultaneous demand and prefetch. What does this mean? In the same cycle, the L1 instruction cache can prefetch instructions as well as respond to requests for instructions from the cache by the core. It uses a 64-byte line, which holds 4 instruction bundles.

The Itanium 2 is a VLIW processor. Up to three instructions are combined into instruction bundles. What does this mean? First, there are two types of processors, those that can issue more than one instruction at a time, and those that cannot. Those that can are split into two groups, called superscalar and VLIW. The more common of the two is superscalar. VLIW stands for "very long instruction word" and the way it works is that each cycle a "bundle" which can contain several instructions is issued.

The L1 data cache uses a write-through policy and can support 2 loads and 2 stores per cycle. What does this mean? In the Itanium 2 (not exclusively), there are several different execution units which allow more than one instruction to be executing in the core at the same time. The L1 data cache allows two different instructions to load and two instructions to store in the same cycle.

The Itanium 2 uses a scoreboard system to facilitate a non-blocking L1 data cache. This scoreboard allows the processor to continue executing even with multiple L1 data misses by stalling the instruction issue group of the instruction that had the miss. What does this mean? The scoreboard on the Itanium 2 keeps track of earlier L1 data cache misses. As mentioned above, instructions are issued as a group on the Itanium 2. When an instruction in a group needs data that matches an entry in the scoreboard, the entire issue group is stalled, meaning it cannot continue to execute, until the value needed becomes available.
Stalling an instruction group does not cause a pipeline flush. What does this mean? The pipeline is something like an assembly line for executing an instruction. The instruction goes through stages, each of which performs some small task. We'll go into much more detail about the instruction pipeline next time. In some cases, the pipeline must be flushed, which means emptied of all currently executing instructions. When this happens, all the work done for instructions in the pipeline is lost, and the cycles wasted. In the case of a stall, on the most simple processor, this would mean that no instructions can execute because the whole pipeline gets stopped. It would be like shutting down the conveyor belt in the assembly line. In most current processors, such as the Itanium, the pipeline is more sophisticated and can allow any instruction which is ready to execute a particular stage to go ahead. This is called out-of-order execution (which we've previously explained and will go into again in a later entry).
 
The L2 cache is a unified, 256K, 8-way associative cache which uses a 128-byte line. It has a latency as low as 5-cycles. What does this mean? The L2 cache on the Itanium can return data in 5 cycles in the best case, most likely with integer values. For larger data, such as floating-point numbers, especially double-precision or larger, it will take many more than 5 cycles. It operates out-of-order but L1 misses are stored in a FIFO for correct ordering. It can handle 4 data and 1 L3 request per cycle. What does this mean? The L2 cache on the Itanium accepts multiple requests to send or write data each cycle. It may not process these requests in order. For reading data, this is not a problem but for writing data, this presents a problem. If two instructions operate on the same value and send requests to write to that value those writes must occur in order or the value will be incorrect in memory. An example is X=5, X=6, X=7 (counters are very common in programming). At the end of this sequence, X should be 7 in the L2 cache but if the writes are processed out-of-order it could be 5 or 6. FIFO means first-in, first-out. To keep writes in order, they are not processed directly but placed in a FIFO queue and the L2 cache can write the values when it has time, always taking the value at the front of the queue (sort of like the line at the DMV) so that writes happen in order.  
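The X=5, X=6, X=7 example can be sketched with a queue. This only illustrates the first-in, first-out idea, not the Itanium's hardware; the names pending and l2 are made up.

```python
from collections import deque

pending = deque()  # write requests, oldest on the left
for value in (5, 6, 7):
    pending.append(("X", value))  # X=5, X=6, X=7 arrive in program order

l2 = {}
while pending:
    name, value = pending.popleft()  # always take the front of the queue
    l2[name] = value

print(l2["X"])
```

Because the oldest request is always handled first, the last value written is the last one the program asked for, no matter how long the requests sat in the queue.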
 
One of the big differences in the Itanium 2 compared to other 64-bit processors is the on-die L3 cache. The L3 cache can be 3, 6, or 9M, is unified, and 12-way associative. It has a minimum latency of 12 cycles. It uses 128-byte lines, does not support partial line requests, and returns lines critical-word first. What does this mean? When requesting data from the L1 or L2 cache, only the exact word needed, which is part of a line, is returned. The Itanium 2 L3 cache operates like main memory, returning whole lines at a time only. A maximum of 84.8G/s of data can be accessed on the chip. The bandwidth to main memory for the Itanium is not discussed until later in the paper; it is 6.4G/s. The Itanium 2 uses a 50-bit physical address size and a 64-bit virtual address size.
 
Sources  
 
6. Intel Corporation – “Intel Xeon Processor-based Servers: Performance, headroom, and versatility for front-end applications, small-business servers, and High-Performance Computing”. www.intel.com, 2005  
 
7. Sun Microsystems, Inc. – “UltraSPARC Architecture 2005 Specification (Hyperprivileged Edition)”. www.opensparc.net, 2006  
 
8. Sun Microsystems, Inc. – “UltraSPARC T1 Supplement to UltraSPARC Architecture 2005 Specification (Hyperprivileged Edition)”. www.opensparc.net, 2006  
 
9. Sun Microsystems – “UltraSPARC IV+ Architectural Overview”. www.sun.com, 2005  
 
10. J. De Gelas - "Opteron: Pushing x86 to the Limit". www.aceshardware.com, 2003  
 
11. S. Wasson - "AMD's dual-core Opteron processors: Because four is better than two". techreport.com, 2005

posted by kamundse Mar 31 2006, 11:27:48 AM PST

Thursday March 30, 2006
64-bit processors - A few questions from last time
 
 
I wanted to clarify a few concepts from last time that I got questions on.  
 
The first one is pretty fundamental to the whole discussion... what exactly is 64-bit? All data inside a computer is stored as a series of 1's and 0's. Each bit is like a digit in a normal (base-10) number. The more bits you have, the bigger the data you can represent, just like with numbers you use every day. Until the last 10 years or so, computers have been primarily 32-bit. What does this mean? It means that memory addresses, integers, etc. were at most 32 bits in size. If something could not be represented in just 32 bits, it had to be split into multiple parts. When doing an operation on these bigger numbers or addresses, the processor would have to handle that data in parts, rather than as a whole, which is slow. A 64-bit processor means the addresses, integers, and other data are at most 64 bits. This means the processor can handle much larger data, which increases speed. Take a look at the link above for more details.  
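Here is a rough sketch of what "handling the data in parts" looks like: adding two 64-bit numbers using only 32-bit-sized pieces and a carry. The function name add64_with_32bit_parts is made up, and Python's integers are not really limited to 32 bits, so the masks stand in for the hardware limit.

```python
MASK32 = 0xFFFFFFFF


def add64_with_32bit_parts(a, b):
    """Add two 64-bit values using only 32-bit-sized pieces."""
    lo = (a & MASK32) + (b & MASK32)
    carry = lo >> 32                    # 1 if the low halves overflowed
    hi = (a >> 32) + (b >> 32) + carry
    return ((hi & MASK32) << 32) | (lo & MASK32)


a, b = 0x00000001_FFFFFFFF, 0x00000000_00000001
print(hex(add64_with_32bit_parts(a, b)))
print(2**32 - 1, 2**64 - 1)  # largest unsigned 32-bit and 64-bit values
```

A 32-bit processor needs two additions plus carry handling for this; a 64-bit processor does it in a single operation.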
 
Another one is simultaneous multithreading. To clarify this one we're going to take a big step back. How do programs get run at all? Way back when people had to walk to school uphill both ways in the snow (and they liked it), computers were not very sophisticated. They ran one program from beginning to end, then you could load and run another program from beginning to end. As you may imagine, this was fairly limiting as to what you could do with a computer and people got tired of listening to Bob and Joe fight over whose turn it was to run a program. So, the idea of scheduling was a big hit in the computer world. Scheduling is primarily the job of your operating system. At any given time, even if you have not launched any other application, there are several different programs running. Processes which are ready to run are placed in a queue. The operating system gets a program from the queue and starts it running on the processor. Every process has a time limit so others get a chance to run. A process either runs until its time limit or until something happens that causes it to sleep. Then the operating system picks another process to run and so on. It turns out that most processes spend a lot of time waiting and so when they are having their turn running, they may actually not be doing anything but sitting there. This means while they sit doing nothing, other programs which could be running are waiting. Another interesting thing about many programs is that the work they do can be split up into several independent but related tasks, called threads. Take a web server for example: at a given time it may be serving pages to several users but that work is all independent of each other. Threading allows a programmer to split these tasks into multiple processes, which run independently of each other on the computer. Doing this can make the program run a lot faster. So, how can we make the processor better able to deal with these two things? 
Well, one way is to add more processors to the machine. This gives each process more places to run so it addresses threading, but what about the problem of processes wasting time? What if when a process was waiting, we take it off the processor and let some other process run? This is called "coarse-grained multithreading" and is a type of temporal multithreading. The Montecito processor (next generation Itanium) uses this type of threading. Another way to let processes share the processor is similar to how the operating system schedules processes. If we have a queue of threads ready to run, each cycle we can pick a thread and issue instructions from that thread into the processor. The big difference from coarse-grained multithreading is that at a given time, instructions from more than one thread are in the processor core at the same time. This is called "fine-grained" multithreading. The UltraSPARC T1 uses fine-grained multithreading. The problem with temporal threading is that, for processors which can issue more than one instruction per core per cycle (which is all but the T1), in a given cycle the thread may not be able to issue the maximum number, which leads to waste. What do I mean by this? Say processor X can issue 5 instructions per cycle. A thread ready to run this cycle may not have 5 instructions which can be issued this cycle; maybe only 3 are issued. That means this cycle, the processor wasted 2 issue slots and now inside the processor there are fewer instructions running than could be running. Simultaneous multithreading, or SMT, addresses this by allowing instructions from more than one thread to issue each cycle. So for the cycle we just mentioned on processor X, it could try to fill those two unused issue slots with instructions from another thread. All the remaining processors which do hardware multithreading use this type of threading. 
We'll discuss threading more in a future blog entry but hopefully this gives you a better idea about what SMT is.  
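If it helps to see the "processor X" example as arithmetic, here is a tiny Python sketch. It is only a toy model of issue slots, not how any real processor decides what to issue. The names ISSUE_WIDTH, issue_one_thread, and issue_smt are made up for this illustration.

```python
ISSUE_WIDTH = 5  # "processor X" from the text can issue 5 instructions per cycle


def issue_one_thread(ready):
    """Temporal multithreading: only one thread may issue in a given cycle."""
    issued = min(ready, ISSUE_WIDTH)
    return issued, ISSUE_WIDTH - issued  # (instructions issued, slots wasted)


def issue_smt(ready_per_thread):
    """SMT: leftover slots are filled with ready instructions from other threads."""
    slots_left = ISSUE_WIDTH
    issued_per_thread = []
    for ready in ready_per_thread:
        issued = min(ready, slots_left)
        issued_per_thread.append(issued)
        slots_left -= issued
    return issued_per_thread, slots_left  # (issued per thread, slots wasted)


print(issue_one_thread(3))  # (3, 2): three issued, two slots wasted
print(issue_smt([3, 4]))    # ([3, 2], 0): the second thread fills the two spare slots
```

With one thread that has only 3 instructions ready, 2 slots go unused. With a second thread available, those 2 slots get filled and nothing is wasted that cycle.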
 
The next was what programs need to do to take advantage of a dual- (or more) core processor. Applications should not need to do anything special. Operating systems need to be written to be able to properly use systems with multiple processors (either multi-core or multi-processor, or both). Unless you are running a *very* old operating system, your OS should already handle this. Applications which are threaded will be able to make more out of a multi-processor system, but they also gain benefits from a single processor system. Even if your application is not threaded, you will most likely see an increase in system performance anyway because your application can get more time to run with more than one processing core available on the system.  
 
Another one was out-of-order execution. Let's use the recipe analogy again. Imagine a program is like a recipe. To make a cake, or whatever you want to cook, you follow steps and the end result is a cake. With an in-order execution processor, each step is followed in the order it appears in the recipe. But many people who cook know that not all steps have to be in order for the cake to come out right. Say the steps are:
  1. In a bowl, mix the dry ingredients.
  2. In another bowl, mix the wet ingredients.
  3. Combine the wet and dry mixtures and mix thoroughly.
  4. Grease cake pan.
  5. Pour batter into greased cake pan.
  6. Preheat oven to 350 degrees.
  7. Bake for 20 minutes.
Looking at these steps, we can see some that must happen in a particular order. We could not pour the batter into the pan before we grease it. We cannot bake the cake without heating the oven. But there are other steps which can be reordered. We could preheat the oven earlier in the recipe. We could grease the cake pan at any time before we pour the batter in. We could mix the wet ingredients before the dry as long as we did both before combining. The same is true of a program. It is possible to take instructions and reorder them and still end up with the correct result. This is called out-of-order execution and is done by all the processors except the UltraSPARC T1 and the yet-to-be-released Montecito. How this is done is a detailed discussion for another day.  
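For the programmers in the audience, the recipe can be written down as a small dependency table. This is only a sketch of the idea of "which reorderings are still correct"; real out-of-order hardware works very differently. The table DEPENDS_ON is my reading of the recipe above and is_valid_order is a name made up for this example.

```python
# Prerequisites for each recipe step, read off the recipe above.
DEPENDS_ON = {
    1: set(),    # mix the dry ingredients
    2: set(),    # mix the wet ingredients
    3: {1, 2},   # combine wet and dry
    4: set(),    # grease the pan
    5: {3, 4},   # pour batter into the greased pan
    6: set(),    # preheat the oven
    7: {5, 6},   # bake
}


def is_valid_order(order):
    """Return True if every step runs only after all of its prerequisites."""
    done = set()
    for step in order:
        if not DEPENDS_ON[step] <= done:
            return False
        done.add(step)
    return len(done) == len(DEPENDS_ON)


print(is_valid_order([1, 2, 3, 4, 5, 6, 7]))  # True: the recipe as written
print(is_valid_order([6, 4, 2, 1, 3, 5, 7]))  # True: reordered, same cake
print(is_valid_order([1, 2, 3, 5, 4, 6, 7]))  # False: batter poured before greasing
```

Any order that passes the check gives the same cake, which is all an out-of-order processor needs to guarantee for a program.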
 
Next is instruction level parallelism. Instruction level parallelism, or ILP, is a measure of how many instructions in a program can be run at the same time. Take the recipe above: I can find four steps (instructions) which could be done simultaneously, assuming we had enough cooks: 1, 2, 4, and 6. That is pretty much it. It is a fairly simple concept with some big fancy wording.  
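Here is one way to find those four steps mechanically, again as a rough Python sketch and nothing like what hardware actually does. The dependency table is my own reading of the cake recipe and parallel_waves is a made-up helper name. It groups the steps into "waves" where everything in a wave could be done at once by enough cooks.

```python
# Prerequisites for each recipe step (my reading of the cake recipe).
DEPENDS_ON = {
    1: set(),    # mix the dry ingredients
    2: set(),    # mix the wet ingredients
    3: {1, 2},   # combine wet and dry
    4: set(),    # grease the pan
    5: {3, 4},   # pour batter into the greased pan
    6: set(),    # preheat the oven
    7: {5, 6},   # bake
}


def parallel_waves(depends_on):
    """Group steps into waves; every step in a wave can run at the same time."""
    done = set()
    waves = []
    while len(done) < len(depends_on):
        ready = sorted(
            step for step, needs in depends_on.items()
            if step not in done and needs <= done
        )
        if not ready:
            raise ValueError("circular dependency between steps")
        waves.append(ready)
        done.update(ready)
    return waves


print(parallel_waves(DEPENDS_ON))  # [[1, 2, 4, 6], [3], [5], [7]]
```

The first wave is exactly steps 1, 2, 4, and 6. After that the recipe is strictly one step at a time, so more cooks would not help.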
 
The last is EM64T. This one is really simple: EM64T is just the name for Intel's version of AMD's 64-bit extension to the x86 architecture.

posted by kamundse Mar 30 2006, 03:02:48 PM PST

20060327 Monday March 27, 2006
64-bit processors - Clocks, cores, and power usage
 
 
My original plan was to go processor by processor. My paper tries to cover all the major architectural features for each processor. In some places I was limited by information available but for most of the processors I was able to find all the same information. I want this to be understandable to people who are computer literate but are not architecture experts. As I started with the first processor, the Itanium 2, I realized I was writing a paragraph of explanation for every sentence. In the end I'd end up with five blog entries just for the Itanium 2 so I could explain what everything meant, and then rush through the rest. That didn't sound like what I wanted so instead I am going to pick one to a few architectural features and talk about all the processors.  
 
I will be using the text of my paper as I wrote it, but insert explanations and hyperlinks to explain what I am talking about. The explanations will be in a grey text, my paper is in normal black text. I'll be including a handful of my sources in each entry for those who are interested in more information.  
 
Itanium 2  
 
A cooperation between Intel and Hewlett-Packard led to the release of the Itanium processor in 2001. The Itanium was intended to replace the x86 hardware and dominate the server and workstation markets. It was also not expected that AMD would be able to clone it. The second generation Itanium 2 was released in July 2002. Intel has since focused on a new Itanium processor, code named Montecito. HP sells Itanium 2 servers that range from single processor blades to 128 processor high-end servers. The Itanium's IA-64 architecture is completely different from the IA-32 architecture of the x86 family, though it provides backwards compatibility for 32-bit x86 applications. This paper will be focusing on the versions of the Itanium 2 after the first one (McKinley), code named Madison, Deerfield, and Fanwood. What does this mean? Hardware companies have working or code names for each processor they release. For a specific processor there may be several different versions with different features. For the Itanium 2, Intel had four different versions starting with McKinley. I have found the code name is the most convenient way to reference different versions of the same processor so expect to see them used heavily in this paper.  
 
The Itanium 2 runs at 1.3 to 1.6 GHz and uses 130 watts of power. What does this mean? Processor clock frequency is probably one of the most misunderstood pieces of information about processors. So, what does GHz mean for a processor? A processor clock tick is the smallest unit of time for a processor. The simplest processor can do one operation each clock tick. The clock frequency is how many of those ticks occur in one second. The more ticks, the more operations that can be completed in a second. So far it sounds like GHz does just mean speed. Ah, but there is more. What is an operation and how many can a given processor do each tick? These can vary significantly. Programs consist of instructions. Each instruction is like a step in a recipe. The difference is that for different processor types, the number of instructions needed to run the same program is different. So, already we cannot know which processor will run an application faster because they are not doing the same steps. Each instruction can be broken down into many operations. Some processors break their instructions down into smaller operations than others. Having smaller operations means each one can be done faster, which allows a higher clock speed. But, this does not mean the whole instruction is completed any faster. So, what does knowing the clock speed tell us? For the same processor, and sometimes for a processor family, you can tell which processor is faster. It just cannot be used to compare processor families (like comparing a Pentium to a G5) and sometimes not even within a processor family (like between the Pentium D and the Xeon). Clock speed is also often tied to energy consumption (higher clock speed, more energy used). The Itanium 2 is one of the few single-core 64-bit processors still commonly available. 
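To put numbers on the clock speed argument, here is a back-of-the-envelope sketch in Python. The two processors and all of their figures are invented for illustration; they are not measurements of any chip in this paper. The only thing it relies on is that run time is instructions times cycles per instruction, divided by clock frequency.

```python
def run_time_seconds(instructions, cycles_per_instruction, clock_hz):
    """Time to finish a program: total cycles divided by cycles per second."""
    return instructions * cycles_per_instruction / clock_hz


# Hypothetical processor A: fewer instructions, one cycle each, 1.5 GHz clock.
time_a = run_time_seconds(1_000_000_000, 1.0, 1.5e9)

# Hypothetical processor B: more instructions, more cycles each, 3.0 GHz clock.
time_b = run_time_seconds(1_500_000_000, 1.5, 3.0e9)

print(f"A: {time_a:.2f} s, B: {time_b:.2f} s")  # A: 0.67 s, B: 0.75 s
```

Processor B has twice the clock speed and still loses, because it needs more instructions and more cycles per instruction to do the same job.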
There is a dual-core Itanium 2, called Hondo, available only from HP, which uses two Madison cores operating at 1.1 GHz. It is not covered in this paper due to lack of SPEC results for it. Montecito will be a dual-core processor. What does this mean? A multiple-core processor is basically like having that many separate processors together on a single chip, which share some resources such as a memory bus. An advantage to multiple-core processors is they can communicate with each other faster than separate processors. They also take up less physical space on the system motherboard. Many dual-core processors have the same footprint size as their older, single-core predecessors. Most multiple-core processors also use less energy than if they were all separate processors. The later versions of the Itanium 2 support two threads using coarse-grained multithreading. What does this mean? In a processor without threading, each cycle, one or more instructions from the same program are issued into the processor to be run. The problem is that often the processor is not doing any actual work because the instructions for the program are waiting for something such as data from memory (which is a very slow operation, 1000s of cycles). In order to make better use of the processor, a technique called threading was created. There are three general types of threading. In coarse-grained multithreading, CGMT, the processor will switch which program it is running instructions from when the current thread encounters a long-latency event. This means at any given time, the processor is still only running one program.  
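Here is a toy Python simulation of that switching behavior. It is a sketch under big simplifying assumptions: every ordinary operation takes one cycle, every long-latency event stalls its thread for a made-up MISS_LATENCY of 10 cycles, and switching threads is free (on real hardware it is not). The function name run and the two sample threads are invented for this example.

```python
MISS_LATENCY = 10  # hypothetical stall, in cycles, for a long-latency event


def run(threads):
    """Cycles needed to finish all threads on one core with CGMT switching."""
    n = len(threads)
    pos = [0] * n        # index of the next operation in each thread
    ready_at = [0] * n   # first cycle at which each thread can run again
    cycle = 0
    current = 0
    while any(pos[i] < len(threads[i]) for i in range(n)):
        runnable = [i for i in range(n)
                    if pos[i] < len(threads[i]) and ready_at[i] <= cycle]
        if not runnable:
            # Every unfinished thread is stalled, so the core sits idle.
            cycle = min(ready_at[i] for i in range(n)
                        if pos[i] < len(threads[i]))
            continue
        if current not in runnable:
            current = runnable[0]  # switch only when the current thread stalls
        op = threads[current][pos[current]]
        pos[current] += 1
        if op == "miss":
            ready_at[current] = cycle + MISS_LATENCY
        cycle += 1
    return max([cycle] + ready_at)


thread_a = ["work", "miss", "work", "work"]
thread_b = ["work", "work", "miss", "work"]

print(run([thread_a]) + run([thread_b]))  # 26: one after the other, no switching
print(run([thread_a, thread_b]))          # 15: switch to the other thread on a stall
```

Only one thread is ever running in a given cycle, just as described above. The savings come entirely from doing useful work for one thread while the other waits.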
 
Opteron  
 
The eighth generation of AMD's Hammer architecture, the Opteron processor (code names SledgeHammer - 130 nm and Venus - 90 nm), was introduced in April 2003. Designed to compete with Intel's Itanium 2, the Opteron is the most powerful of AMD's 64-bit processors. It was designed for server and enterprise applications. It has arguably become the most popular x86-based 64-bit processor. A variety of computer system producers, including all of the largest enterprise-level UNIX vendors (Fujitsu, IBM, HP, and Sun), sell Opteron systems. What does nm mean? This refers to the manufacturing process for the processor, measured in nanometers. The smaller the number, the smaller the size of the circuits. This allows the processors to be smaller and use less power. See wikipedia's page about 90 nanometer for more details.  
 
The Opteron chip comes with either one or two cores with clock speeds from 1.8 to 2.8 GHz. Both the single and dual core Opteron processors can run two threads using simultaneous multithreading and support out-of-order execution. What does this mean? In simultaneous multi-threading, instructions from more than one program (in the case of the Opteron, from two programs) issue into the processor to be run in the same cycle. In an in-order processor, instructions for a program are executed in the processor in the same order as they occur in the program. However, many instructions in a program are independent of each other, and do not have to be executed in order for the program to produce the correct result. This is called instruction level parallelism, or ILP. Why does it matter if instructions can be run out-of-order? There are many operations which take more than a single cycle to execute, such as floating-point math or loads and stores from memory. With in-order execution, all instructions have to wait for these operations to finish before they can continue. Out-of-order execution allows a processor to execute other instructions rather than stall the program waiting for a high-latency operation to finish. The average power consumption of an Opteron processor is 89-90 watts. What does this mean? Even if you aren't an environmentalist who's worried about our global energy usage, how much power your computer uses is something you should care about. If you're Joe-Average, your computer eating power means fewer burgers you get to eat. My dual processor Dell uses around 350 watts (possibly more at peak). That is more than if I turned on every light in my house (we use fluorescents). I make sure that machine is off or sleeping whenever it is not in use. If you're Bob-Admin, multiply that by however many machines you have. Bob-Admin also has to think about how he's going to keep his server room cool too because 50 machines using that much power make a great sauna in a few hours. 
If Bob-Admin puts his machines in a co-location facility, where he is paying by the sq ft and the watt, not only does he pay more for energy usage, but for space too. The racks in a co-lo can only support so much power draw per sq ft, so that means fewer machines per rack.  
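If you want to plug in your own numbers, here is a small Python sketch of the arithmetic. The 350 watts and the 50 machines come from the examples above. The price of $0.12 per kWh is purely an assumption, so substitute whatever your utility or co-lo charges. The function names are made up, and the sketch ignores cooling, which costs extra.

```python
def annual_kwh(watts, hours_per_day=24):
    """Energy used in a year by a machine drawing a steady number of watts."""
    return watts / 1000 * hours_per_day * 365


def annual_cost(watts, price_per_kwh, machines=1, hours_per_day=24):
    """Yearly electricity bill for one or more identical machines."""
    return annual_kwh(watts, hours_per_day) * price_per_kwh * machines


PRICE_PER_KWH = 0.12  # assumed price in dollars, not a real quote

print(round(annual_kwh(350)))                              # Joe-Average, kWh per year
print(round(annual_cost(350, PRICE_PER_KWH), 2))           # Joe-Average, dollars per year
print(round(annual_cost(350, PRICE_PER_KWH, machines=50))) # Bob-Admin, dollars per year
```

A 350 watt machine left on all year works out to roughly 3,066 kWh. At the assumed price that is about $368 for Joe-Average and a bit over $18,000 for Bob-Admin's 50 machines, before he pays anything to cool the room.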
 
Pentium D  
 
In May 2005, Intel introduced the Pentium D (code name Smithfield), a dual-core processor, which contains two essentially unmodified Pentium 4 Prescott processors. Unlike the Prescott, the Pentium D adds support for 64-bit through Intel's EM64T technology. Although some Pentium Prescott processors utilize Intel's Hyper Threading technology, the Pentium D examined in this paper does not. In early 2006, Intel released a 65 nm version of the Pentium D, code named Presler. Like Smithfield, the Presler chip does not support multithreading. There is a dual-thread Pentium D, the 3.2GHz Pentium Extreme, but no CPU2000 benchmarks have been published for that processor so it was not included in this paper. An Extreme Edition of Presler is scheduled to be released in mid-2006.  
 
The two cores in the Pentium D Smithfield are on the same die and have a clock speed of 2.8, 3.0, or 3.2GHz. The Presler cores are each on their own die, which decreased production cost since a defect in a die affects only one core. What does this mean? Two cores on the "same die" means that both cores are manufactured on the same integrated circuit. Cores on separate dies means the cores are on separate integrated circuits, though they are still on the same chip. For machines with both cores on one die, communication time is faster but a defect in one core makes both cores unusable since the integrated circuit must be thrown away. With the separate die approach, there is some loss in communication speed but a defect in one core means only that core must be thrown away. The cores of the Presler chip operate at 2.8, 3.0, 3.2, and 3.4GHz. With 230 million transistors, the Smithfield is significantly smaller than the dual-core Itanium processor, which has 1.7 billion transistors, yet the maximum power usage for a Pentium D is about 130W - 155W and the dual-core Itanium is 100W. The cores in the Pentium D are clocked significantly lower than the single-core Prescott in order to minimize power consumption. What does this mean? Usually, the number of transistors in an integrated circuit correlates with the amount of power used; however, with the Pentium D compared to the Montecito, this is not the case. Each core in the Pentium D operates at a lower clock frequency than the single-core equivalent so that the power usage is still reasonable. Operating at the same frequency, the Pentium D would likely be over 200W in power usage.  
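The die argument can be turned into a quick probability sketch in Python. This is a simplified model I am assuming for illustration: each core independently has the same chance p of a defect, and the 10% used below is an invented number, not a real yield figure for Smithfield or Presler.

```python
def wasted_cores_shared_die(p):
    """Expected cores thrown away per chip when both cores share one die."""
    p_die_bad = 1 - (1 - p) ** 2  # a defect in either core spoils the whole die
    return 2 * p_die_bad


def wasted_cores_separate_dies(p):
    """Expected cores thrown away per chip when each core is on its own die."""
    return 2 * p  # only the defective core itself is discarded


p = 0.10  # hypothetical chance that any one core has a defect
print(round(wasted_cores_shared_die(p), 2))     # 0.38
print(round(wasted_cores_separate_dies(p), 2))  # 0.2
```

Under these assumptions the shared-die design throws away almost twice as many cores, because every defective core takes a perfectly good neighbor with it.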
 
Power5  
 
The IBM Power5, close relative of the G5, was released in June 2003. IBM uses the Power5 for a range of machines from single processor entry-level servers to high-end multi-processor servers. Like its predecessor, the Power4, the Power5 is a dual-core processor. Both cores are on the same die. The clock speed of the Power5 ranges from 2.0 to 2.7GHz. The power usage is about 100W. The Power5 can run two threads in each core using simultaneous multi-threading. It can also operate in single thread mode.  
 
UltraSPARC IV+  
 
Code named Panther, the fifth generation processor in the SPARC family, the UltraSPARC IV+, was designed for enterprise computing and released in September 2005. Panther is a dual-core processor that supports two threads using what Sun calls "chip multi-threading", or CMT. Sun's CMT does not quite match the definition of threading commonly used when talking about processors. Threading normally means running instructions for different programs in the same core. CMT in the IV+ is running a different program in each core, not in the same core. The UltraSPARC IV+ has twice the computing power of the UltraSPARC IV yet reduces the power consumption from 108W to 90W. What does this mean? The IV+ shows how much the manufacturing process can improve processor power usage. The UltraSPARC IV used a 130 nm process but the IV+ uses a 90 nm process. This allows the processor to be the same in physical size even though it is much more complex and powerful. This also helps it use less energy than the IV.  
 
UltraSPARC T1  
 
The UltraSPARC T1, released in November of 2005, is the newest of the SPARC processor line by Sun Microsystems. The T1 has generated a lot of interest due to its departure in design from other 64-bit processors currently on the market. The T1 has eight cores, and all cores on the processor operate at the same frequency; the processor is available in a 1.0 or 1.2GHz version. Each core can execute four threads, making the T1 a 32-way processor. Despite the large number of cores, the T1 only consumes 75W on average, 79W peak.  
 
Xeon  
 
The 64-bit Intel Pentium 4 Xeon was released in June 2004 (code named Nocona). It is designed to be an enterprise-level processor for business computing. It comes in a single and dual core model (code named Paxville, released in October 2005) and supports Intel's Hyper Threading technology. The Xeon has clock speeds from 2.83 to 3.66GHz, the fastest of any of the processors examined. The single core Xeon uses 110-120W of power, the dual-core uses 135-150W.  
 
Sources  
 
This is not a complete list... I'll be putting a handful at the end of each entry.  
 
1. P. Kongetira, K. Aingaran, K. Olukotun - "Niagara: A 32-Way Multithreaded Sparc Processor". IEEE Micro, March/April 2005, Vol. 25, No. 2, pg. 21-29, 2005  
 
2. C. McNairy, D. Soltis - "Itanium 2 Processor Microarchitecture". IEEE Micro, March/April 2003, Vol. 23, No. 2, pg. 44-55, 2003  
 
3. R. Kalla, B. Sinharoy, J. Tendler - "IBM Power5 Chip: A Dual-Core Multithreaded Processor". IEEE Micro, March/April 2004, Vol. 24, No. 2, pg. 40-47, 2004  
 
4. C. McNairy, R. Bhatia - "Montecito: A Dual-Core Dual-Threaded Itanium Processor". IEEE Micro, March/April 2005, Vol. 25, No. 2, pg. 10-20, 2005  
 
5. C. Keltcher, K. McGrath, A. Ahmed, P. Conway - "The AMD Opteron Processor for Multiprocessor Servers". IEEE Micro, March/April 2003, Vol. 23, No. 2, pg. 66-76, 2003  
 
Power Consumption Sources  
 
Itanium 2 - http://www.intel.com/products/processor/itanium2/index.htm  
Opteron - http://www.epinions.com/content_18680811072  
Pentium D - PCStats.com and wikipedia.com  
Power5 - http://www.xlr8yourmac.com/G5/xserveG5.html  
UltraSPARC IV+ - http://www.extremetech.com/article2/0,1558,1667444,00.asp  
UltraSPARC T1 - http://www.sun.com/processors/UltraSPARC-T1/index.xml  
Xeon - www.news.com

posted by kamundse Mar 27 2006, 02:52:05 PM PST

20060324 Friday March 24, 2006
Using zones in a university.
 
 
Remember when you were in college and you were working on your senior project or even a class project and you had to make the decision, do it all at home (with your crappy internet connection, less powerful machine, etc) or do it on a machine at school and beg the admin to set up all the special things you needed (and be told no most of the time and when they did say yes, wait a week for it to happen)? Ok, maybe you were in college in the days when you couldn't do things at home (poor punch card folks, my condolences) so for you the problem was worse, you were totally at the mercy of the people running the machines.  
 
My university has finally solved the problem, with zones. Of course this would be just a few months before I graduate and after I am done with all of this type of work, but I am excited for the rest of the students. Now students have complete control over their (virtual) machine plus they get all the benefits of using a machine in the lab. If only we'd had this when I did my senior project (ok, I *was* the admin then so it was not too bad).  
 
One student is logging his experience, called The Bunsen Project. It looks as if he broke php already (3/27/03 - it has been fixed) but here were his first two entries after getting the zone:  
 
Virtual Machine - March 6  
 
Today I got the virtual machine from the CSL. With this, I will be able to set up a MySQL account so that I can put my project online, instead of having to resort to static box. More importantly, however, I will be able to have full administrative access to the server which will be great practice.  
 
Server Installs - March 9  
 
Well, it took several hours, but I made lots of progress today with the server. I was able to install several essential programs using the 'pkg-get' program from blastwave.org. Here are some of the programs I installed:  
 
I decided to compile and install the apache server from scratch because I figured I could get more out of it. Because of this, I've run in to some problems and my server is still not up yet :(
 
 
Doing this isn't just nice for the students, it makes the admin's life so much easier too. Now when a student or instructor needs to set up a project, he can just hand them a zone and let them go. Oh, I wish we'd had this when I worked as an admin there...

posted by kamundse Mar 24 2006, 08:46:01 AM PST

20060317 Friday March 17, 2006
64-bit processors - here we go.
 
If you'd asked me a year ago about processor architecture, I would have told you that it is not my area of interest and wouldn't have had much else to say about it. I think it all started with my two undergrad architecture classes. I hated them. I suppose I should have known better than to give up so fast. I had the same thing happen to me with a terrible 7th grade history teacher and it took until my first college history class before I realized it was an interesting topic after all.  
 
When I started the class I knew I'd have to do a research paper. I have been mildly interested in the Niagara processor since it came out a few months ago so I decided I would find some way to incorporate it into my research. What I ended up doing was a comparison of the architecture of the 64-bit processors you will find in current computers (current being anything released in the last 2-3 years) and their performance on some of the SPEC benchmarks.  
 
When I started I was pretty clueless about the subject. I had no idea Intel had EM64T in the Pentium line, that there was an Itanium 2 (and a 3rd generation on the way), or that the Power5 was being used anyplace but in Apple's computers. I realized, if I am this clueless about all of this, then likely other people (both computer-geeks and regular consumers) are too. As I started to talk to other computer-savvy friends of mine, I found they were. It seems computers are like cars: most people know a fair amount about the model car they have, but only a small percent of the population are true gear-heads. I knew my two 3.0GHz Xeon processors in my Dell were dual-core and had threading. I had no idea the Power5 did that as well, or even that the UltraSPARC IV+ was dual-core (and I work for Sun).  
 
So, I am not an architecture expert. I will probably make mistakes. I didn't take all of these processors apart and look inside, so whatever I know is based on what has been published by the hardware manufacturer or other people doing hardware research. If I make a mistake, please correct it! With that in mind, let's get started.  
 
The processors I'll be focusing on:
Alpha 21364
Itanium 2
Opteron
Pentium D
Power5
UltraSPARC IV+
UltraSPARC T1
Xeon (Nocona and Paxville)
You will notice there are a few processor families missing, such as the MIPS line of processors. I wanted to focus on processors you are likely to find in a machine you could buy today. I tried to be as fair as I could with my analysis, but I admit I had some bias going into this. I have never been a big Intel fan. For me, Windows and x86 were always "ew". I do have an x86 box at home and love it (it is screamin' fast) so I am not totally against them anymore. I also wanted to see the Sun processors do well. I think working for Sun, it is expected I might be cheering for the home team. I can say that many things I found surprised me, others disappointed me, and my opinion of most of these processors is very different than when I started.  
 
Working Backwards  
 
Rather than go through all the analysis and then write a conclusion, I am going to start this off with my conclusion and then we can look at how I got there.  
 
If I had to pick the best overall processor (based on the research I did and the specs I looked at), it would have to be the Power5. The T1 blows it away (and every other processor) in benchmarks like SPECweb2005 and jAppServer2004 but the Power5 tops all other processors in these benchmarks and also shows great performance on the CPU2000 benchmarks. The T1 has no published CPU2000 results. It is not a data crunching machine so I doubt we'll ever see CINT2000 or CFP2000 results for it. Based on processor only, if I wanted to buy a web server, I'd pick the T1; for anything else, I'd choose the Power5.  
 
The Opteron would make it to my number two place. It was in the top two for overall processor speed (CINT2000) and was a close second to the Power5 for throughput for up to four processors (CINT2000Rate). It is too bad there are not 8-way and higher Opterons so we could get a really good look at its throughput scalability.  
 
The Itanium 2 and the UltraSPARC IV+ both fall in the middle. Neither processor is very fast but they both have decent throughput and both scale fairly well. Even though neither of them is a top performer, they are the most common processors for benchmark results for machines with 36 or more processors. The next generation Itanium, Montecito, makes some movement in design in the same directions as the T1. It will be interesting to see how it performs once it is released.  
 
The Pentium family processors fall at the bottom. The single-core Xeon is fast. It had the second highest average score on the CINT2000 benchmark. The Pentium D fell in the middle for speed. Both of them have good throughput for a single processor machine. After that, things start to fall apart for the Pentiums. There are no published results for the Pentium D for more than single-CPU machines. The Xeon's throughput drops noticeably at two processors and pretty much falls off a cliff above four processors. Neither machine performs very well on the SPECweb2005 or jAppServer2004 benchmarks.  
 
You might ask, but what about the T1 and the Alpha, where do they fit in? It is not really fair to put the T1 in this ranking, at least not directly. It really is not the same type of processor as the rest of these. The processors above were designed to be all-purpose processors. The T1 was designed with a specific type of application in mind, "network-facing". The T1 would be at the top for these types of applications, and would not do as well for others. Without published results, the best I could do is speculate and I want to discuss facts, not play guessing games. With the Alpha, I didn't find out until after my paper was done that there was a chip released recently (2003). I have not had a chance to take a close look, but I will.  
 
I know that by putting this up, someone is going to try to argue that I have not provided the details. As I said, this is the conclusion, not the whole of my research. I can only fit so much into a single blog entry. Have no fear, the details, more than you probably want, are coming in future blog entries.  
 
Tom pointed out it might look bad that I didn't put a Sun processor in the #1 place. Well, if telling the truth gets me in trouble, I can live with that. The reality is that anyone can look at these results, they are all published on SPEC's web page. I only had time to look at the integer CPU2000 benchmarks, SPECweb2005 and jAppServer2004. There are so many more to look at (and I plan to) that by the end of this, those rankings will likely change.  
 
That is all for now. Stay tuned for next time when I take a look into some of the architecture designs for these processors.

posted by kamundse Mar 17 2006, 10:29:03 AM PST

20060316 Thursday March 16, 2006
New software engineering conference
 
I wanted to let you all know about a really exciting new software engineering conference coming this year.  
 
Waterfall 2006 will be held in New York state in April 2006.  
 
Some of the paper authors include: K. Schwaber, J. Highsmith, R. Martin, S. Ambler, and K. Beck.  
 
Check it out!  
 
(I have not mentioned it here before, I am a big fan of agile development... you'll get to read more about my work in this area later.)

posted by kamundse Mar 16 2006, 12:31:53 PM PST

20060314 Tuesday March 14, 2006
Dear blog, Sorry I have not written in awhile.
 
 
Wow, can you say the word busy? I knew you could!  
 
Ok, so it has been over two months since I last posted. I know it is an overused excuse, but life has been really busy. My life is pretty crazy to start with between working full-time, going to grad school, and having two kids. I am also working on my thesis and that pretty much eliminates any free time that might have been left in my day.  
 
Today I took the final for what I hope is my 2nd to last class ever. I also turned in my 23-page paper about 64-bit processor design and performance. It is so nice to be done, even if it is only for two weeks until classes start up again.  
 
I have to confess, being busy is not my only reason for not blogging. I probably could have worked in a post or two since January 5th. My main source for interesting things to write about, Tom, has not had anything exciting happen lately. Oh, he's given me a couple of good bugs to look at but none of them were really exciting enough to inspire a whole blog entry. I did learn a neat way to wedge a single CPU Solaris box though. :)  
 
The fact that Tom has not had anything "interesting" happen all quarter (meaning nothing has broken, crashed spectacularly, etc) is a good thing. I think he's finally changed enough of the old way the lab was run that things are finally working correctly. Of course saying that means he'll be IM'ing me in the next 24 hours with something.  
 
All of this is not why I am blogging today. You'd think after a 23-page paper and then an hour of intense writing for my final I'd be quick to get to the point this afternoon. Here it comes... I finally have something interesting to blog about that has nothing to do with anyone else's funny experiences with UNIX. It is true.  
 
What is this exciting new topic...  
 
     64-bit processor design and performance.  
 
For those actually reading rather than skimming, you'll notice this was the topic of my paper. My brain is just filled with SPEC results, L2 cache sizes, and interconnection network bandwidth numbers. I did not expect I'd find my graduate architecture class very exciting, but I really enjoyed it and the research for the paper turned out to be really interesting.  
 
So, for the next several blog entries, I'll be talking about 64-bit processor design, performance, cost, etc.. I hate to leave you with a cliff-hanger but I will be back soon, I promise!  
 
P.S. - I discovered the spell checker on b.s.c doesn't know the word "blog"... I thought that was funny.

posted by kamundse Mar 14 2006, 01:44:17 PM PST