
Monday April 24, 2006
"Never underestimate the power of the Schwartz!"
So... in light of today's news... I bring you quotes from Spaceballs (which are eerily relevant)...
 
Yogurt: I am the keeper of a greater power, a power known throughout the universe as the...
Barf: ...the Force?
Yogurt: No, the Schwartz!
 
Yogurt: Merchandising, merchandising, where the real money from the movie is made. Spaceballs-the T-shirt, Spaceballs-the Coloring Book, Spaceballs-the Lunch box, Spaceballs-the Breakfast Cereal, Spaceballs-the Flame Thrower.
[turns it on]
Dink, Dink, Dink, Dink, Dink, Dink: Ooooh!
Yogurt: [reacts to dinks] The kids love this one.
[a dink hands him a doll that looks like Yogurt]
Yogurt: And last but not least, Spaceballs the doll, me.
[pulls string]
Doll: May the schwartz be with you!
Yogurt: [kisses the doll] Adorable.
Lone Starr: I still don't understand how I'm going to lift that big statue with this little ring.
Yogurt: Never underestimate the power of the Schwartz!
Lone Starr: Listen! We're not just doing this for money... We're doing it for a SH*T LOAD of money!
[upon going into "ludicrous speed"]
Dark Helmet: My brains are going into my feet!
Dark Helmet: WHAT? You went over my helmet?
President Skroob: Sandurz, Sandurz. You got to help me. I don't know what to do. I can't make decisions. I'm a president!
Dark Helmet: No, it's not what you think. It's much, much worse!
President Skroob: As president of Planet Spaceball, I can assure both you and your viewers that there's absolutely no air shortage whatsoever. Yes, of course. I've heard the same rumor myself. Yes, thanks for calling and not reversing the charges. Bye-bye.
[hangs up]
President Skroob: Sh*thead.
Dark Helmet: You have the ring, and I see your Schwartz is as big as mine. Let's see how well you handle it.
Dark Helmet: Sh*t! I hate it when I get my Schwartz twisted.

Wednesday April 12, 2006
Top three OS attributes?
I've been told that for Linux, efficiency trumps all other attributes. For OpenBSD, security seems like the number one attribute. Something like Exokernel would put flexibility at the top. For Solaris, robustness and compatibility are good candidates for being in the top three.
Other operating systems have attributes they explicitly don't care about. Plan 9's creators decided against compatibility (with UNIX) in favor of doing everything over the "right" way.
So, what do you think are the top five (increased from three) attributes for Solaris, Linux (you can even break it down by distribution), OS X, the various BSDs, and any other OS you are interested in? This is a list of the top five design goals of the OS, not necessarily what attributes it actually displays.
Here's a list (not complete) of some attributes to consider:
- robustness
- maintainability
- security
- interoperability/compatibility/standardization
- useability
- efficiency/performance/speed
- portability
- flexibility
- scalability
- correctness
Here's my first pass, I'll update this based on any comments.
| OS | Design goals |
|---|---|
| Solaris | robustness, compatibility, scalability, security |
| Linux | performance (changed from efficiency), portability |
| FreeBSD | compatibility, robustness, maintainability |
| OpenBSD | security, portability, standardization, correctness |
| NetBSD | portability, interoperability, security |
| OS X | useability, robustness |
| Generic μkernel | flexibility, useability |
| Windows | profits, lock-in, global domination (from comments) |
If an OS is not here, it's because I was not able to come up with anything I felt comfortable with. Oh, and I am not only talking about unix-like OSes. I did not leave Windows out intentionally, I just don't know enough about it to know what its design goals are.
For anyone who may be following it, I will get back to the 64-bit processor series in the next post.

Friday March 31, 2006
64-bit processors - Cache and Memory
Who doesn't like cache? Why, just the word makes me feel happy... ah... cache...
Opteron
The Opteron has an on-die L1 and L2 cache for each core.
What does this mean? Cache is memory available to the processor which is faster to access than the main memory on the system. There are different levels of cache which are successively slower and larger. The level 1, or L1, cache is inside the core of the processor. Most L1 caches are split into a data cache and an instruction cache. The data cache stores recently used or computed data, the instruction cache stores program instructions. It is the fastest and smallest cache. The level 2, or L2, cache is larger and slower and may not be in the processor core. For processors where the L2 cache is not in the core, it is still on the processor chip. Some processors include a level 3, or L3, cache which may or may not be on the chip. It is larger and slower than the L2 cache. If it is off-chip, it is often accessed through the memory bus. Cache is used by the processor to keep data and program instructions which have been recently used by the program for faster access in the future. Changes to the cache size and speed have significant effects on the processor's performance. Cache takes a large amount of space on the chip and increases the cost of production for the processor.
The L1 instruction and data caches are each 64K, split into four 8K banks, and are 2-way associative.
What does this mean? The splitting of memory into banks is a common technique. It allows data to be spread across the banks for quicker access. This is called interleaving. Associativity refers to how many possible places in the cache a particular piece of data can be stored. In a cache with 1-way associativity, there is only one location a particular piece of data can be stored. This location is determined by using values such as the virtual or physical memory address of the data. Because the data can be in only one place, looking in the cache to see if the data needed is there is a fast operation. Since the cache is not very big, often more than one piece of data will map to the same single location. This means it will overwrite whatever data is already there. By increasing associativity, the number of places a piece of data can be in the cache increases. This decreases the frequency of data overwriting other data, but makes finding the data slower. For a 4-way associative cache the processor must look in as many as four places to find the data needed, in an 8-way associative cache it must look in up to eight places, and so on. A fully-associative cache means a piece of data can be stored in any location, which means the processor may need to search every line in the cache to find the data needed.
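To make the associativity trade-off concrete, here is a minimal Python sketch of a toy set-associative cache. The class name, the tiny 8-line size, and the access pattern are all made up for illustration and do not model the Opteron; eviction within a set is plain LRU, which is an assumption for this sketch.

```python
class SetAssociativeCache:
    """Toy cache that tracks which lines are resident, not the data itself."""

    def __init__(self, num_lines, ways, line_size=64):
        self.ways = ways
        self.line_size = line_size
        self.num_sets = num_lines // ways
        # Each set is a list of tags, least recently used first.
        self.sets = [[] for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        line = address // self.line_size
        index = line % self.num_sets   # which set the line must live in
        tag = line // self.num_sets    # identifies the line within that set
        entries = self.sets[index]
        if tag in entries:
            self.hits += 1
            entries.remove(tag)
        else:
            self.misses += 1
            if len(entries) == self.ways:
                entries.pop(0)         # set is full: evict the oldest entry
        entries.append(tag)            # mark as most recently used


def run(ways):
    """Alternate between two addresses that map to the same set."""
    cache = SetAssociativeCache(num_lines=8, ways=ways)
    a, b = 0, 8 * 64
    for _ in range(4):
        cache.access(a)
        cache.access(b)
    return cache.hits, cache.misses
```

With one way, the two addresses keep evicting each other and every access misses. With two ways, both fit in the same set and only the first two accesses miss.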
They both use 64-byte lines.
What does this mean? When data is stored in the cache, it is stored in lines. The size of a line can be different between the L1 and L2 cache. When data is fetched into the cache, it is done one line at a time. The bigger the line, the more data is fetched. This has advantages and drawbacks. Getting more data in a single fetch can speed up the processor since fetches are time consuming. But, fetching more data means overwriting more data in the cache, which can increase the need for further fetches later because needed data was overwritten.
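As a rough illustration of how a line size and associativity turn into an address layout, the sketch below splits an address for a 64K, 2-way cache with 64-byte lines. The function name is hypothetical, and the simple offset/index/tag split ignores banking and whatever the real hardware does with virtual versus physical addresses.

```python
CACHE_BYTES = 64 * 1024
LINE_BYTES = 64
WAYS = 2

NUM_LINES = CACHE_BYTES // LINE_BYTES   # 1024 lines in total
NUM_SETS = NUM_LINES // WAYS            # 512 sets of 2 lines each


def split_address(address):
    """Return (tag, set index, byte offset within the line)."""
    offset = address % LINE_BYTES
    line = address // LINE_BYTES
    index = line % NUM_SETS
    tag = line // NUM_SETS
    return tag, index, offset
```

For example, `split_address(0x12345)` gives tag 2, set 141, offset 5: the byte sits 5 bytes into its line, and that line can only be stored in one of the two slots of set 141.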
The unified L2 cache is 1M, 16-way associative and can handle 10 simultaneous requests.
What does this mean? While the L1 cache is split between a data and instruction cache, the L2 cache is a unified cache, so both data and instructions are in the same L2 cache. Notice the associativity of the L2 cache is much higher than the L1 cache. If a piece of data is not in the L1 cache or in the L2 cache, unless the machine has an L3 cache it must be loaded from memory. This is a very slow operation (1000s of cycles). To increase the odds that needed data will be in the L2 cache, the associativity is increased. This makes it slower but it is still much faster than main memory. The most basic cache can handle a single request each cycle. The L2 cache in the Opteron can handle 10 requests each cycle.
The L2 cache uses an LRU replacement policy.
What does this mean? LRU stands for "least recently used". This means when new data needs to be stored in the cache, it overwrites the data in the cache which was accessed longest ago. The idea is that data which has not been accessed in a long time is less likely to be needed in the future than data which has been accessed more recently.
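Here is a small software sketch of the LRU idea using Python's `OrderedDict`. Real hardware tracks recency with a few bits per set rather than a dictionary, so treat the `LRUCache` class as an illustration of the policy only.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # oldest entry first, newest last

    def access(self, key, value=None):
        """Touch a line, inserting it if absent. Returns the evicted key, if any."""
        evicted = None
        if key in self.lines:
            self.lines.move_to_end(key)  # now the most recently used
        else:
            if len(self.lines) >= self.capacity:
                # Cache is full: drop the least recently used line.
                evicted, _ = self.lines.popitem(last=False)
            self.lines[key] = value
        return evicted
```

With a capacity of two, accessing A, B, A, C evicts B rather than A, because A was touched more recently.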
The Opteron does not have an L3 cache. The on-chip memory controller (named Northbridge) is 128 bits wide and operates at the same frequency as the core. It has a maximum bandwidth of 5.3G/s.
What does this mean? To access main memory, the processor sends requests to the memory controller. The memory controller knows how to access the memory and is responsible for returning to the cache the data requested from memory. Different memory controllers can handle different sizes of data. For the Opteron, this is 128 bits. It operates at the same frequency as the core, which means it sends and receives data at the same frequency as the processor operates. If it operated at half-core speed then it would be able to send or receive data only every other cycle. The maximum bandwidth refers to the amount of data that can be transferred per second. In this case, it can transfer a maximum of 5.3G each second to or from the memory.
Memory access on an Opteron returns lines critical-word first.
What does this mean? When data is requested from memory, it must return in a whole line. A word is the base unit for data. Lines are made of words. When the processor requests a piece of data from memory, it can be returned in one of two ways. The line may come with all the words in order. In this case, if the processor needs the 5th word in the line, it has to wait for the first four to come back from memory before it gets the one it wants. If the number of words in a line is small, this is not a big delay. If it is large, say 100 words, the processor could have to wait a while if the word it wants is the 100th one. The other way to return the data is for the needed word to come first and the rest of the line comes after. This is critical-word first. It gets the data to the cache faster but the words in the line must be reordered at both ends of the transaction.
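The two return orders can be sketched in a few lines of Python. The wrap-around order used for the rest of the line is an assumption for illustration; the text above only says the needed word comes first and the rest follows.

```python
def return_order(words_per_line, critical_word, critical_word_first=True):
    """List the order in which the words of a line arrive from memory."""
    if not critical_word_first:
        return list(range(words_per_line))
    # Start at the requested word, then wrap around to the start of the line.
    return [(critical_word + i) % words_per_line for i in range(words_per_line)]
```

For an 8-word line where word 5 is needed, in-order delivery makes the processor wait for five other words first, while critical-word-first delivers `[5, 6, 7, 0, 1, 2, 3, 4]`.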
The processor has a 40-bit physical memory size and a 48-bit virtual memory size.
What does this mean? Physical memory is usually the RAM on the system, commonly installed as DIMMs, SIMMs, or RIMMs. A 40-bit physical memory size means that the processor can use memory addresses up to 40 bits in size. The larger this value is, the larger the maximum amount of memory on the system can be. In this case it is 2^40 bytes, or 1 terabyte of memory. Processors also use a type of memory addressing called virtual memory. Virtual memory is used in multitasking computers (ones that run more than one program at the same time, pretty much any computer you've used in the last 25 years) to give programs a contiguous memory space. Physical memory addresses, which may be spread all over the memory, are mapped to new virtual addresses which are in one continuous block. The virtual memory size does not need to be the same size as the physical memory size, as is seen in the case of the Opteron.
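The arithmetic behind those sizes is quick to check. This sketch assumes byte-addressable memory and uses binary terabytes (2^40 bytes); the helper name is made up.

```python
TERABYTE = 2 ** 40  # binary terabyte, in bytes


def addressable_bytes(address_bits):
    """Number of distinct byte addresses an address of this width can name."""
    return 2 ** address_bits


physical_tb = addressable_bytes(40) // TERABYTE  # Opteron physical addresses
virtual_tb = addressable_bytes(48) // TERABYTE   # Opteron virtual addresses
```

So 40 bits covers 1 terabyte of physical memory, and 48 bits covers 256 terabytes of virtual address space.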
Pentium D
The Pentium D has an 8-way associative, 12K L1 instruction cache (trace cache), which uses 64-byte lines, for each core.
Compared to the Opteron, the L1 cache in the Pentium D is much smaller, 12K rather than 64K. It does have a higher degree of associativity than the Opteron's cache. Because it is smaller, it lends itself to this higher associativity because it is a smaller cache space to search through. It uses the same line size, 64-byte, as the Opteron.
There is also a 4-way associative, 16K L1 data cache on each core.
The L1 data cache is larger than the L1 instruction cache and is less associative. It is not uncommon for processors to have different sized L1 instruction and data caches, as we'll see as we look at more processors.
The L1 data cache is non-blocking and allows up to four cache requests.
A non-blocking cache allows the processor to issue more than one request at a time. The Pentium D can handle up to 4 requests at a time.
The load latency is two cycles for an integer and six for floating point.
Load latency is the amount of time it takes for requested data to be returned.
Each core has a unified L2 cache that is 1M for Smithfield and 2M for Presler. The L2 cache is 8-way associative and is non-blocking with a load latency of seven cycles. The bandwidth between the L1 and L2 cache is 48G/s.
What does this mean? Again, a unified L2 cache means both instructions and data are in the same cache rather than separated as they are in the L1 cache. Although the L1 cache in the Pentium D is more associative than the L1 cache in the Opteron, its L2 cache is less associative. A maximum of 48G can be transferred each second between the L1 and L2 cache.
All the caches have an LRU replacement policy. The hardware also supports prefetching. It attempts to stay 256 bytes ahead of the current data access location.
What does this mean? Prefetching is loading data or instructions into the cache before they are requested by the processor. In the case of the Pentium D, it assumes that needed data will immediately follow data currently being requested and loads the next 256 bytes of data as well. The advantage to prefetching is that often data that will be needed in the near future is immediately following the currently requested data. By loading it before it is requested, the processor eliminates the need for the instruction to request the data from memory since it will be in the cache, which is much faster. The disadvantage to prefetching is that when data is prefetched, other data in the cache must be overwritten. If prefetching is too aggressive, it will overwrite enough existing data in the cache that it will cause extra requests to memory to return that needed data which was lost.
The memory controller for a Pentium D is not on the chip. It is accessed through the front-side bus (FSB).
What does this mean? Unlike the Opteron, the Pentium D does not have its memory controller built into the chip. Doing this makes the processor smaller and the design simpler, and may reduce production costs. The drawback is that communication across the front-side bus is slower than on the chip. Since the memory controller must use the bus eventually to fetch data from main memory, this slowdown may not be an issue since the bottleneck of the bus will occur either way. Having an off-chip memory controller also means a multi-processor machine could share a single memory controller among multiple processors, rather than having one in each chip. Again the advantages are the same, the disadvantage is that the processors may have to wait while the memory controller is executing another processor's request. This is a general disadvantage with any shared resource.
The FSB on the Pentium D is 800MHz with a theoretical maximum bandwidth of 6.4G/s. The Pentium D uses 40-bit physical address and 64-bit virtual address sizes.
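The 6.4G/s figure lines up with simple arithmetic if the bus moves 8 bytes per transfer at an effective 800 million transfers per second. The 64-bit bus width is my assumption here, it is not stated above, and the function name is made up for this sketch.

```python
def peak_bandwidth_bytes_per_sec(transfers_per_sec, bus_width_bits):
    """Theoretical peak: every transfer carries a full bus width of data."""
    return transfers_per_sec * bus_width_bits // 8


# Assumed: 800 million transfers per second on a 64-bit wide bus.
fsb = peak_bandwidth_bytes_per_sec(800_000_000, 64)
print(f"{fsb / 1e9:.1f} GB/s")
```

This is a ceiling, not a measurement. Real traffic sees less because of bus turnaround, arbitration, and memory latency.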
Power5
The Power5 has a separate L1 cache for each core. The L1 instruction cache is 64K, 2-way associative, and is direct mapped from the L2 cache.
What does this mean? The L1 instruction cache direct mapped from the L2 cache means that everything in the L1 instruction cache is duplicated in the L2 cache. Since the L2 cache is shared between the cores on the Power5, this allows the cores quick read access to each other's L1 instruction cache.
The L1 data cache is 32K and 4-way associative. The L2 cache is shared between the two cores. It is 1.875M divided into three slices and is 10-way associative.
What does this mean? Unlike the Pentium D and the Opteron, the Power5 shares its L2 cache between its processors. Advantages to doing this are that the size of the processor is smaller than if it had two separate L2 caches, and it allows the cores to share data and instructions more quickly. Like the Opteron's L1 caches, the L2 cache on the Power5 is divided into slices to allow interleaving of data.
The L2 cache uses 128-byte lines and has a bandwidth to the L1 cache of 64G/s. The Power5 also has an off-chip 36M L3 cache with an on-chip directory.
What does this mean? The Power5 is one of the processors examined which uses an L3 cache. Its L3 cache is the largest of the processors examined in this paper. An on-chip directory of the L3 cache is provided so that the processor can determine more quickly if needed data is in the L3 cache or must be fetched from main memory.
This cache is directly connected to the L2 cache. Having the L3 cache connected directly to the L2 cache rather than accessed through the memory controller speeds up access time. The bus to the L3 cache operates at half-core speed.
Unlike its predecessor, the Power4, and the UltraSPARC IV+, the L3 cache on the Power5 is not accessed through the on-chip memory controller. In order to speed up access time to the L3 cache, the L3 cache is connected directly to the L2 cache through a back-side bus.
UltraSPARC IV+
Panther moves from the 2 levels of cache in the UltraSPARC IV to three levels of cache. It has a 64K L1 instruction cache which is divided into two 32-byte subblocks.
Again, this cache uses interleaving by splitting the cache into subblocks.
The L1 data cache is also 64K and uses a write-through policy.
What does this mean? The job of the cache is to provide a fast access copy of data in main memory. When this data is changed from its original value, that change must be written back to main memory. There are two common ways to do this, write-through, and write-back (also called copy-back). In write-through, when data is written to the cache it is simultaneously written to main memory. The advantages to this policy are that it is simpler to implement and it keeps the cache and main memory consistent at all times. With write-back, changes to data in the cache are only sent to main memory when the changed cache line is evicted (overwritten). This leads to less traffic on the memory bus and speeds up system performance but comes with the risk that if the computer were to have an event such as power loss or a system crash, the changed data in the cache may be lost.
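A tiny counting model shows why write-back reduces bus traffic. It assumes the cache is large enough that nothing is evicted until a final flush, which is a simplification; the function name and the store sequence are invented for illustration.

```python
def memory_writes(stores, policy):
    """Count writes that reach main memory for a sequence of stores.

    `stores` is a list of cache-line names, one entry per store instruction.
    Assumes no line is evicted until everything is flushed at the end.
    """
    if policy == "write-through":
        return len(stores)       # every store also goes to memory
    if policy == "write-back":
        return len(set(stores))  # each dirty line is written once, on eviction
    raise ValueError(f"unknown policy: {policy}")


stores = ["X", "X", "X", "Y", "X"]
```

For these five stores, write-through sends five writes to memory, while write-back sends two, one per modified line.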
The L1 cache includes a 2K, 64-byte line prefetch buffer accessed in parallel with the L1 instruction cache.
What does this mean? With the Pentium D, we discussed data prefetching. Some processors include the ability to prefetch program instructions. The IV+ can store up to 2K of prefetched instructions in a special prefetch buffer.
Panther’s L1 cache also includes a 2K fully associative write cache.
What does this mean? A write cache allows the processor to continue on with other operations rather than wait for a write to complete.
The L2 cache for the UltraSPARC IV+ was moved on-chip and is shared between the cores. It is 2M, 4-way associative, and operates at half-core speed. It also is completely inclusive of all L1 caches. The L2 cache uses a copy-back policy to decrease bus traffic. The L3 cache on the Panther is 32M and 4-way associative. It has 64-byte lines and also follows a copy-back policy. L3 tags are kept on-chip.
Like the Power5, the IV+ also has an off-chip L3 cache with an on-chip directory. A difference between the Power5 and the IV+ is the Power5's use of a back-side bus for access to the L3 cache.
The L3 cache is a victim cache, only being written to when things are evicted from the L2 cache. On a hit, the L3 line is copied back to the L2 cache and then invalidated in the L3 cache.
What does this mean? When needed data is found in the L3 cache, it is copied into the L2 cache and then removed from the L3 cache. The purpose of the L3 cache on the IV+ is to store data which has been used by the processor in the past but was evicted from the L2 cache. This data is a "victim" of being overwritten. When it is needed again, it is moved to the L2 cache so it can be accessed quicker in the future and since it is no longer a victim, it is removed from the L3 cache.
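The victim-cache behavior can be sketched as two small LRU structures. The `TwoLevel` class and its sizes are hypothetical and ignore sets, ways, and dirty data; the sketch only shows lines moving between L2 and L3.

```python
from collections import OrderedDict


class TwoLevel:
    """Toy L2 plus victim L3. Both are LRU ordered, oldest entry first."""

    def __init__(self, l2_size, l3_size):
        self.l2 = OrderedDict()
        self.l3 = OrderedDict()
        self.l2_size = l2_size
        self.l3_size = l3_size

    def _fill_l2(self, line):
        if len(self.l2) >= self.l2_size:
            victim, _ = self.l2.popitem(last=False)
            if len(self.l3) >= self.l3_size:
                self.l3.popitem(last=False)  # L3 full: drop its oldest line
            self.l3[victim] = True           # L3 is only written on L2 eviction
        self.l2[line] = True

    def access(self, line):
        if line in self.l2:
            self.l2.move_to_end(line)
            return "L2 hit"
        if line in self.l3:
            del self.l3[line]                # invalidate the L3 copy
            self._fill_l2(line)              # move the line back into L2
            return "L3 hit"
        self._fill_l2(line)                  # fetched from memory into L2 only
        return "miss"
```

With a two-line L2, accessing A, B, C pushes A into the L3. Accessing A again is an L3 hit that moves A back to the L2 and pushes B out to the L3 in its place.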
In cases where the two running threads cannot cooperate using the shared L2 and L3 cache, Panther has a mechanism for pseudo-splitting the shared caches. When split, both threads can read all of the cache but can only write to half of it.
What does this mean? How can threads not cooperate? If the threads keep overwriting each other's data in the cache, it slows both threads down. By splitting the cache into two separate areas and only allowing each thread to write to one half, the processor can prevent the two threads from clobbering each other. The reason for not having this be the default setup is that doing so cuts the sizes of the L2 and L3 cache in half from the view of each core. When they are not clobbering each other, having the larger cache sizes significantly increases system performance. In the extreme case, if one thread was idle, the other thread could be utilizing all the cache rather than being restricted to half.
UltraSPARC T1
Each core has a 16K, 4-way associative L1 instruction cache that uses 32-byte lines. The L1 data cache, also in each core, is only 8K, 4-way associative, uses 16-byte lines, and has a write-through policy.
One thing you may notice right away is the big difference in size of the L1 cache from the UltraSPARC IV+. This is just one of the many big changes in the T1 from the rest of the current SPARC processor line. The T1 does not just differ significantly from other SPARCs, but from the other 64-bit processors as well. Although its L1 instruction cache is larger than the Pentium D, and the same size as the Itanium, its data cache is noticeably smaller than any other processor.
The L2 cache is shared between cores and is accessed through a crossbar interconnection network.
What does this mean? An interconnection network allows communication between the core and resources such as other cores, memory, cache, I/O, etc. In the case of the T1, with eight cores sharing the L2 cache and communicating with each other, a standard linear connection network (basically a wire connecting all eight cores to each other and the L2 cache) would get bogged down quickly. A crossbar is a type of connection switch which can handle more traffic. Imagine it is a grid of wires connecting each core and the L2 cache. This provides multiple paths for data to get from point A to point B without having a collision with data going from point C to point D, or even to point B.
The crossbar provides more than 200G/s of bandwidth.
Here is another difference between the T1 and other processors. The bandwidth on the crossbar interconnect is three to four times as much as the on-chip bandwidth of the other processors. The T1 does have eight cores to support, rather than two, so this bandwidth size is not surprising.
The L2 cache is 3M banked four ways, 12-way associative, and uses 64-byte lines. Data is interleaved across the banks in 64-byte granularity. The L2 cache has a directory of all eight L1 caches. There are four on-chip memory controllers shared by the eight cores and accessed through the crossbar. The memory bus on the T1 is significantly larger than other processors with a bandwidth of 20G/s.
Recall that the bandwidth sizes for the other processors were less than half of this, with the next largest being the UltraSPARC IV+ at 9.6G/s.
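To picture the 64-byte interleaving across the four L2 banks described above, here is a one-function sketch. The simple modulo mapping is an assumption for illustration; the exact address bits the hardware uses are not given in the text.

```python
def l2_bank(address, banks=4, granularity=64):
    """Pick a bank by which 64-byte chunk the address falls in (assumed mapping)."""
    return (address // granularity) % banks


banks_touched = [l2_bank(a) for a in (0, 64, 128, 192, 256)]
```

Consecutive 64-byte chunks land on banks 0, 1, 2, 3 and then wrap back to 0, so a sequential scan spreads its requests evenly across all four banks.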
The T1 uses 40-bit physical addresses split into two sections, memory and I/O addresses, based on bit 39.
What does this mean? A 40-bit physical memory size means that memory addresses are 40 bits long. The highest bit, bit 39 (bits are numbered 0 to 39, not 1 to 40), tells the processor whether this is a memory or I/O address. What is an I/O address? An I/O address is an address that belongs to one of the system's I/O devices. This could be video, network, or other devices. Remember the T1 was designed to be "network facing" so it expects to do a lot of I/O functions.
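Checking that one bit is a simple mask operation. The sketch below assumes that bit 39 set means I/O and bit 39 clear means memory; the text above does not say which polarity is used, so treat that as an assumption.

```python
ADDRESS_BITS = 40
IO_BIT = 1 << 39  # assumed: set means I/O space, clear means memory


def classify(address):
    """Say whether a 40-bit physical address targets memory or an I/O device."""
    if address < 0 or address >> ADDRESS_BITS:
        raise ValueError("address does not fit in 40 bits")
    return "io" if address & IO_BIT else "memory"
```

Under this assumption, each half of the physical address space covers 2^39 bytes, so `classify(0x1000)` is memory and `classify(1 << 39)` is the first I/O address.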
It uses a 48-bit virtual memory size.
Xeon
The Xeon processor follows the same general design as the Pentium D. The L1 cache on the Xeon is the same as in the Pentium D. The L2 cache on the Nocona comes in 1M or 2M sizes. Paxville uses a 2M L2 cache size only. The L2 cache is on a 200MHz, shared bus to the off-chip memory controller.
What does this mean? The L2 cache is accessed via the same bus that goes to the off-chip memory controller, though the L2 cache is on the chip.
The memory bus is 800MHz with a maximum bandwidth of 6.4G/s.
Itanium 2
There are three levels of cache available on-die with this processor. Both the L1 data and instruction cache are 16K and 4-way associative. The L1 instruction cache supports simultaneous demand and prefetch.
What does this mean? In the same cycle, the L1 instruction cache can prefetch instructions as well as respond to requests for instructions from the cache by the core.
It uses 64-byte lines, each of which holds 4 instruction bundles.
The Itanium 2 is a VLIW processor. Up to three instructions are combined into instruction bundles. What does this mean? First, there are two types of processors, those that can issue more than one instruction at a time, and those that cannot. Those that can are split into two groups, called superscalar and VLIW. The more common of the two is superscalar. VLIW stands for "very long instruction word" and the way it works is that each cycle a "bundle" which can contain several instructions is issued.
The L1 data cache uses a write-through policy and can support 2 loads and 2 stores per cycle.
What does this mean? In the Itanium 2 (not exclusively), there are several different execution units, which allow more than one instruction to be executing in the core at the same time. The L1 data cache allows two different instructions to load and two instructions to store in the same cycle.
The Itanium 2 uses a scoreboard system to facilitate a non-blocking L1 data cache. This scoreboard allows the processor to continue executing even with multiple L1 data misses by stalling the instruction issue group of the instruction that had the miss.
What does this mean? The scoreboard on the Itanium 2 keeps track of earlier L1 data cache misses. As mentioned above, instructions are issued as a group on the Itanium 2. When an instruction in a group needs data that matches an entry in the scoreboard, the entire issue group is stalled, meaning it cannot continue to execute, until the value needed becomes available.
Stalling an instruction group does not cause a pipeline flush.
What does this mean? The pipeline is something like an assembly line for executing an instruction. The instruction goes through stages, each of which performs some small task. We'll go into much more detail about the instruction pipeline next time. In some cases, the pipeline must be flushed, which means emptied of all currently executing instructions. When this happens, all the work done for instructions in the pipeline is lost, and the cycles are wasted. In the case of a stall, on the simplest processor, this would mean that no instructions can execute because the whole pipeline gets stopped. It would be like shutting down the conveyor belt in the assembly line. In most current processors, such as the Itanium, the pipeline is more sophisticated and can allow any instruction which is ready to execute a particular stage to go ahead. This is called out-of-order execution (which we've previously explained and will go into again in a later entry).
The L2 cache is a unified, 256K, 8-way associative cache which uses a 128-byte line. It has a latency as low as 5-cycles.
What does this mean? The L2 cache on the Itanium can return data in 5 cycles in the best case, most likely with integer values. For larger data, such as floating-point numbers, especially double-precision or larger, it will take many more than 5 cycles.
It operates out-of-order but L1 misses are stored in a FIFO for correct ordering. It can handle 4 data and 1 L3 request per cycle.
What does this mean? The L2 cache on the Itanium accepts multiple requests to send or write data each cycle. It may not process these requests in order. For reading data, this is not a problem but for writing data, this presents a problem. If two instructions operate on the same value and send requests to write to that value those writes must occur in order or the value will be incorrect in memory. An example is X=5, X=6, X=7 (counters are very common in programming). At the end of this sequence, X should be 7 in the L2 cache but if the writes are processed out-of-order it could be 5 or 6. FIFO means first-in, first-out. To keep writes in order, they are not processed directly but placed in a FIFO queue and the L2 cache can write the values when it has time, always taking the value at the front of the queue (sort of like the line at the DMV) so that writes happen in order.
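The X=5, X=6, X=7 example can be replayed with a FIFO queue from Python's standard library. The function and the dictionary standing in for the cache are made up for this sketch.

```python
from collections import deque


def apply_writes(writes):
    """Drain a queue of (name, value) writes from the front, oldest first."""
    queue = deque(writes)
    cache = {}
    while queue:
        name, value = queue.popleft()  # always take the front of the queue
        cache[name] = value
    return cache


in_order = apply_writes([("X", 5), ("X", 6), ("X", 7)])
scrambled = apply_writes([("X", 7), ("X", 6), ("X", 5)])
```

Draining the queue in arrival order leaves X at 7. If the same writes were applied in a scrambled order, as in the second call, X would end up as 5, which is exactly the bug the FIFO prevents.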
One of the big differences in the Itanium 2 compared to other 64-bit processors is the on-die L3 cache. The L3 cache can be 3, 6, or 9M, is unified, and 12-way associative. It has a minimum latency of 12 cycles. It uses 128-byte lines, does not support partial line request, and returns lines critical-word first.
What does this mean? When requesting data from the L1 or L2 cache, only the exact word needed, which is part of a line, is returned. The Itanium 2 L3 cache operates like main memory, returning whole lines at a time only.
A maximum of 84.8G/s of data can be accessed on the chip.
The bandwidth to main memory for the Itanium is not discussed until later in the paper; it is 6.4G/s.
The Itanium 2 uses a 50-bit physical address size and a 64-bit virtual address size.
Sources
6. Intel Corporation – “Intel Xeon Processor-based Servers: Performance, headroom, and versatility for front-end applications, small-business servers, and High-Performance Computing”. www.intel.com, 2005
7. Sun Microsystems, Inc. – “UltraSPARC Architecture 2005 Specification (Hyperprivileged Edition)”. www.opensparc.net, 2006
8. Sun Microsystems, Inc. – “UltraSPARC T1 Supplement to UltraSPARC Architecture 2005 Specification (Hyperprivileged Edition)”. www.opensparc.net, 2006
9. Sun Microsystems – “UltraSPARC IV+ Architectural Overview”. www.sun.com, 2005
10. J. De Gelas – “Opteron: Pushing x86 to the Limit”. www.aceshardware.com, 2003
11. S. Wasson – “AMD's dual-core Opteron processors: Because four is better than two”. techreport.com, 2005

Thursday March 30, 2006
64-bit processors - A few questions from last time
I wanted to clarify a few concepts from last time that I got questions on.
The first one is pretty fundamental to the whole discussion... what exactly is 64-bit? All data inside a computer is stored as a series of 1's and 0's. Each bit is like a digit in a normal (base-10) number. The more bits you have, the bigger the data you can represent, just like with numbers you use every day. Until the last 10 years or so, computers have been primarily 32-bit. What does this mean? It means that memory addresses, integers, etc. were at most 32 bits in size. If something could not be represented in just 32 bits, it had to be split into multiple parts. When doing an operation on these bigger numbers or addresses, the processor would have to handle that data in parts, rather than as a whole, which is slow. A 64-bit processor means the addresses, integers, and other data are at most 64 bits. This means the processor can handle much larger data, which increases speed. Take a look at the link above for more details.
Another one is simultaneous multithreading. To clarify this one we're going to take a big step back. How do programs get run at all? Way back when we people had to walk to both ways to school uphill in the snow (and they liked it), computers were not very sophisticated. They ran one program from beginning to end, then you could load and run another program from beginning to end. As you may imagine, this was fairly limiting as to what you could do with a computer and people got tired of listening to Bob and Joe fight over who's turn it was to run a program. So, the idea of
scheduling was a big hit in the computer world. Scheduling is primarily the job of your operating system. At any given time, even if you have not launched any other application, there are several different programs running. Processes which are ready to run are placed in a queue. The operating system takes a process from the queue and starts it running on the processor. Every process has a time limit so others get a chance to run. A process either runs until its time limit or until something happens that causes it to sleep. Then the operating system picks another process to run and so on. It turns out that most processes spend a lot of time waiting, so when they are having their turn running, they may actually not be doing anything but sitting there. This means while they sit doing nothing, other programs which could be running are waiting. Another interesting thing about many programs is that the work they do can be split up into several independent but related tasks, called threads. Take a web server, for example: at a given time it may be serving pages to several users, but each of those requests is independent of the others.
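The queue-and-time-limit idea can be sketched in a few lines of Python. This is a toy round-robin scheduler of my own invention (the process names, work units, and the `round_robin` function are all made up for illustration); a real operating system scheduler also deals with priorities, sleeping processes, and much more.

```python
from collections import deque


def round_robin(work, time_limit):
    """Run each process for at most `time_limit` units per turn.

    `work` maps a process name to the units of work it still needs.
    Returns the list of (name, units_run) turns in the order they ran.
    """
    queue = deque(work.items())
    history = []
    while queue:
        name, remaining = queue.popleft()
        ran = min(time_limit, remaining)
        history.append((name, ran))
        remaining -= ran
        if remaining > 0:
            # Not finished: go to the back of the queue and wait again.
            queue.append((name, remaining))
    return history


for name, ran in round_robin({"editor": 3, "browser": 5}, time_limit=2):
    print(name, "ran for", ran)
```

Notice that neither process gets to hog the processor: each one runs for its time limit and then goes to the back of the line.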
Threading allows a programmer to split these tasks into multiple threads, which run independently of each other on the computer. Doing this can make the program run a lot faster. So, how can we make the processor better able to deal with these two things? Well, one way is to add more processors to the machine. This gives each process more places to run, so it addresses threading, but what about the problem of processes wasting time? What if, when a process was waiting, we took it off the processor and let some other process run? This is called "coarse-grained multithreading" and is a type of
temporal multithreading. The Montecito processor (next generation Itanium) uses this type of threading. Another way to let processes share the processor is similar to how the operating system schedules processes. If we have a queue of threads ready to run, each cycle we can pick a thread and issue instructions from that thread into the processor. The big difference from coarse-grained multithreading is that, at a given time, instructions from more than one thread are in the processor core. This is called "fine-grained" multithreading. The UltraSPARC T1 uses fine-grained multithreading. The problem with temporal threading is that, for processors which can issue more than one instruction per core per cycle (which is all but the T1), a thread may not be able to issue the maximum number in a given cycle, which leads to waste. What do I mean by this? Say processor X can issue 5 instructions per cycle. A thread ready to run this cycle may not have 5 instructions which can be issued; maybe only 3 are issued. That means this cycle the processor wasted 2 issue slots, and now inside the processor there are fewer instructions running than could be running.
Simultaneous multithreading, or SMT, addresses this by allowing instructions from more than one thread to issue each cycle. So for the cycle we just mentioned on processor X, it could try to fill those two unused issue slots with instructions from another thread. All the remaining processors which do hardware multithreading use this type of threading. We'll discuss threading more in a future blog entry but hopefully this gives you a better idea about what SMT is.
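Here is the processor X example as a small Python sketch. The function and its arguments are hypothetical, and real hardware has many more constraints on what can issue together, but it shows the difference between filling issue slots from one thread and from several.

```python
def issue_cycle(ready_counts, width, smt):
    """Return how many issue slots get filled in one cycle.

    ready_counts: instructions each thread could issue this cycle,
                  listed in priority order.
    width:        how many instructions the core can issue per cycle.
    smt:          if False, only the first thread may issue (temporal
                  multithreading); if True, leftover slots are offered
                  to the other threads.
    """
    if not smt:
        first = ready_counts[0] if ready_counts else 0
        return min(width, first)

    filled = 0
    for ready in ready_counts:
        filled += min(ready, width - filled)
    return filled


width = 5
for smt in (False, True):
    filled = issue_cycle([3, 4], width, smt)
    print("SMT" if smt else "temporal", "filled", filled, "wasted", width - filled)
```

With one thread offering 3 instructions and another offering 4, the temporal case wastes 2 of the 5 slots, while the SMT case fills all 5.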
The next was what programs need to do to take advantage of a dual- (or more) core processor. Applications should not need to do anything special. Operating systems need to be written to properly use systems with multiple processors (either multi-core or multi-processor, or both). Unless you are running a *very* old operating system, your OS should already handle this. Applications which are threaded will be able to get more out of a multi-processor system, but they also gain benefits on a single processor system. Even if your application is not threaded, you will most likely see an increase in system performance anyway, because your application can get more time to run with more than one processing core available on the system.
Another one was out-of-order execution. Let's use the recipe analogy again. Imagine a program is like a recipe. To make a cake, or whatever you want to cook, you follow steps and the end result is a cake. With an in-order execution processor, each step is followed in the order it appears in the recipe. But many people who cook know that not all steps have to be in order for the cake to come out right. Say the steps are:
- In a bowl, mix the dry ingredients.
- In another bowl, mix the wet ingredients.
- Combine the wet and dry mixtures and mix thoroughly.
- Grease cake pan.
- Pour batter into greased cake pan.
- Preheat oven to 350 degrees.
- Bake for 20 minutes.
Looking at these steps, we can see some that must happen in a particular order. We could not pour the batter into the pan before we grease it. We cannot bake the cake without heating the oven. But there are other steps which can be reordered. We could preheat the oven earlier in the recipe. We could grease the cake pan at any time before we pour the batter in. We could mix the wet ingredients before the dry as long as we did both before combining. The same is true of a program. It is possible to take instructions and reorder them and still end up with the correct result. This is called
out-of-order execution and is done by all the processors except the UltraSPARC T1 and the yet-to-be-released Montecito. How this is done is a detailed discussion for another day.
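For the programmers in the audience, the recipe can be written down as a dependency table and any proposed ordering checked against it. The table below is my reading of the seven steps above (numbered 1 to 7 in the order listed), and `is_valid_order` is just an illustration; it is not how a processor decides what to reorder.

```python
# Which steps must already be done before each step can start.
DEPENDS_ON = {
    1: set(),    # mix the dry ingredients
    2: set(),    # mix the wet ingredients
    3: {1, 2},   # combine wet and dry
    4: set(),    # grease the cake pan
    5: {3, 4},   # pour batter into the greased pan
    6: set(),    # preheat the oven
    7: {5, 6},   # bake
}


def is_valid_order(order, depends_on=DEPENDS_ON):
    """True if every step appears after all the steps it depends on."""
    done = set()
    for step in order:
        if not depends_on[step] <= done:
            return False  # a prerequisite has not happened yet
        done.add(step)
    return len(done) == len(depends_on)


print(is_valid_order([1, 2, 3, 4, 5, 6, 7]))  # the recipe as written
print(is_valid_order([6, 4, 2, 1, 3, 5, 7]))  # preheat and grease first
print(is_valid_order([1, 2, 3, 5, 4, 6, 7]))  # pour before greasing
```

The first two orderings still produce a cake; the third one pours batter into an ungreased pan, so it is rejected.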
Next is instruction level parallelism.
Instruction level parallelism, or ILP, is a measure of how many instructions in a program can be run at the same time. Take the recipe above: I can find four steps (instructions) which could be done simultaneously, assuming we had enough cooks: 1, 2, 4, and 6. That is pretty much it. It is a fairly simple concept with some big fancy wording.
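If you want to see where steps 1, 2, 4, and 6 come from, here is a small sketch that groups the recipe into "waves" of steps that could all be done at once, given enough cooks. The dependency table is my own encoding of the recipe and the function name is made up.

```python
RECIPE = {
    1: set(),    # mix dry
    2: set(),    # mix wet
    3: {1, 2},   # combine
    4: set(),    # grease pan
    5: {3, 4},   # pour
    6: set(),    # preheat
    7: {5, 6},   # bake
}


def parallel_waves(depends_on):
    """Group steps into waves; every step in a wave can run at once."""
    remaining = dict(depends_on)
    done = set()
    waves = []
    while remaining:
        ready = sorted(s for s, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("circular dependency between steps")
        waves.append(ready)
        done.update(ready)
        for step in ready:
            del remaining[step]
    return waves


for number, wave in enumerate(parallel_waves(RECIPE), start=1):
    print("wave", number, wave)
```

The first wave is exactly the four steps with nothing in front of them. After that the recipe is stuck doing one step at a time, which is a nice picture of a program with very little ILP.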
The last is EM64T. This one is really simple:
EM64T is just the name for Intel's version of AMD's 64-bit extension to the x86 architecture.

Monday March 27, 2006
64-bit processors - Clocks, cores, and power usage
My original plan was to go processor by processor. My paper tries to cover all the major architectural features for each processor. In some places I was limited by the information available, but for most of the processors I was able to find all the same information. I want this to be understandable to people who are computer literate but are not architecture experts. As I started with the first processor, the Itanium 2, I realized I was writing a paragraph of explanation for every sentence. In the end I'd end up with five blog entries just for the Itanium 2 so I could explain what everything meant, and then rush through
the rest. That didn't sound like what I wanted so instead I am going to pick one to a few architectural features and talk about all the processors.
I will be using the text of my paper as I wrote it, but insert explanations and hyperlinks to explain what I am talking about. The explanations will be in a grey text, my paper is in normal black text. I'll be including a handful of my sources in each entry for those who are interested in more information.
Itanium 2
A cooperation between Intel and Hewlett-Packard led to the release of the Itanium processor in 2001. The Itanium was intended to replace the x86 hardware and dominate the server and workstation markets. It was also not expected that AMD would be able to clone it. The second generation Itanium 2 was released in July 2002. Intel has since focused on a new Itanium processor, code named Montecito. HP sells Itanium 2 servers that range from single processor blades to 128 processor high-end servers. The Itanium's IA-64
architecture is completely different than the IA-32 architecture of the x86 family, though it provides backwards compatibility for 32-bit x86 applications. This paper will be focusing on the versions of the Itanium 2 after the first one (McKinley), code named Madison, Deerfield, and Fanwood.
What does this mean? Hardware companies have working or code names for each processor they release. For a specific processor there may be several different versions with different features. For the Itanium 2, Intel had four different versions starting with McKinley. I have found the code name is the most convenient way to reference different versions of the same processor so expect to see them used heavily in this paper.
The Itanium 2 core speed is 1.3 to 1.6 GHz and uses 130 watts of power.
What does this mean? Processor clock frequency is probably one of the most misunderstood pieces of information about processors. So, what does GHz mean for a processor? A processor clock tick is the smallest moment of time for a processor. In the simplest processor, it can do one operation each clock tick. The clock frequency is how many of those ticks occur in one second. The more ticks, the more operations that can be completed in a second. So far it sounds like GHz does just mean speed. Ah, but there is more. What is an operation and how many can a given processor do each tick? These can vary significantly. Programs consist of instructions. Each instruction is like a step in a recipe. The difference is that for different processor types, the number of instructions needed to run the same program is different. So, already we cannot know which processor will run an application faster because they are not doing the same steps. Each instruction can be broken down into many operations. Some processors break their instructions down into smaller operations than others. Having smaller operations means they can be done faster, which increases the clock speed. But this does not mean the whole instruction is completed any faster. So, what does knowing the clock speed tell us? For the same processor, and sometimes for a processor family, you can tell which processor is faster. It just cannot be used between
processor families (like comparing a Pentium to a G5) and sometimes even within a processor family (like between the Pentium-D and the Xeon). Clock speed is also often tied to energy consumption (higher clock speed, more energy used).
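A quick back-of-the-envelope calculation shows why GHz alone does not settle anything. The usual textbook way to write it is run time = instructions × cycles per instruction ÷ clock frequency. The two chips and all of their numbers below are invented purely for illustration; they are not measurements of any real processor.

```python
def run_time_seconds(instructions, cycles_per_instruction, clock_hz):
    """Run time = instructions * cycles per instruction / clock frequency."""
    return instructions * cycles_per_instruction / clock_hz


# Hypothetical chip A: high clock, but the program needs more
# instructions and each one takes more ticks on average.
chip_a = run_time_seconds(1_000_000_000, 2.0, 3.0e9)

# Hypothetical chip B: half the clock speed, but fewer instructions
# and fewer ticks per instruction.
chip_b = run_time_seconds(800_000_000, 1.0, 1.5e9)

print(f"chip A (3.0 GHz): {chip_a:.3f} s")
print(f"chip B (1.5 GHz): {chip_b:.3f} s")
```

With these made-up numbers the 1.5 GHz chip finishes first, even though its clock is half as fast.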
The Itanium 2 is one of the few single-core 64-bit processors still commonly available. There is a
dual-core Itanium 2, called Hondo, available only from HP, which uses two Madison cores operating at 1.1 GHz. It is not covered in this paper due to the lack of SPEC results for it. Montecito will be a dual-core processor.
What does this mean? A multiple-core processor is basically like having that many separate processors together on a single chip, which share some resources such as a memory bus. An advantage to multiple-core processors is they can communicate with each other faster than separate processors. They also take up less physical space on the system motherboard. Many dual-core processors have the same footprint size as their older,
single-core predecessors. Most multiple-core processors also use less energy than if they were all separate processors.
The later versions of the Itanium 2 support two threads using
coarse-grained multithreading.
What does this mean? In a processor without threading, each cycle, one or more instructions from the same program are issued into the processor to be run. The problem is that often the processor is not doing any actual work because the instructions for the program are waiting for something, such as data from memory (which is a very slow operation, 1000s of cycles). In order to make better use of the processor, a technique called threading was created. There are three general types of threading. In coarse-grained multithreading, CGMT, the processor switches to issuing instructions from a different thread when the current thread encounters a long-latency event. This means at any given time, the processor is still only running one program.
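To get a feel for what switching on a long-latency event buys you, here is a toy simulation. Everything in it is an assumption made for illustration: each thread is a string of "c" (a one-cycle compute instruction) and "m" (a memory access that keeps the thread waiting for `miss_latency` extra cycles), the core issues one instruction per cycle, and switching threads is free. Real processors are far more complicated.

```python
def simulate_cgmt(threads, miss_latency):
    """Return (total_cycles, utilization) for a one-issue toy core.

    The core stays on the current thread until that thread has to wait
    on memory, then switches to another thread that is able to run.
    """
    n = len(threads)
    position = [0] * n   # next instruction to issue in each thread
    ready_at = [0] * n   # cycle at which each thread may issue again
    current = 0
    cycle = 0
    busy = 0

    while any(position[i] < len(threads[i]) for i in range(n)):
        # Prefer the current thread; otherwise try the others in order.
        candidates = [current] + [i for i in range(n) if i != current]
        runnable = [i for i in candidates
                    if position[i] < len(threads[i]) and ready_at[i] <= cycle]
        if runnable:
            current = runnable[0]
            op = threads[current][position[current]]
            position[current] += 1
            busy += 1
            if op == "m":
                # The thread cannot issue again until the data arrives.
                ready_at[current] = cycle + 1 + miss_latency
        cycle += 1

    return cycle, busy / cycle


print(simulate_cgmt(["cmc"], miss_latency=3))
print(simulate_cgmt(["cmc", "cmc"], miss_latency=3))
```

In this toy setup one thread alone keeps the core busy half the time, while two threads sharing the core keep it busy three quarters of the time, because one thread's waiting is covered by the other thread's work.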
Opteron
The eighth generation of AMD's
Hammer architecture, the Opteron processor (code names SledgeHammer - 130 nm and Venus - 90 nm), was introduced in April 2003. Designed to compete with Intel's Itanium 2, the Opteron is the most powerful of AMD's 64-bit processors. It was designed for server and enterprise applications. It has arguably become the most popular x86-based 64-bit processor. A variety of computer system producers, including all of the largest enterprise-level UNIX vendors (Fujitsu, IBM, HP, and Sun), sell Opteron systems.
What does nm mean? It stands for nanometer and refers to the manufacturing process for the processor. The smaller the number, the smaller the size of the circuits. This allows the processors to be smaller and use less power. See wikipedia's page about 90 nanometer for more details.
The Opteron chip comes with either one or two cores with clock speeds from 1.8 to 2.8 GHz. Both the single and dual core Opteron processors can run two threads using
simultaneous multithreading and support
out-of-order execution.
What does this mean? In simultaneous multi-threading, instructions from more than one program (in the case of the Opteron, from two programs) issue into the processor to be run in the same cycle. In an in-order processor, instructions for a program are executed in the processor in the same order as they occur in the program. However, many instructions in a program are independent of each other, and do not have to be executed in order for the program to produce the correct result. This is called
instruction level parallelism, or ILP. Why does it matter if instructions can be run out of order? There are many operations which take more than a single cycle to execute, such as floating-point math or loads and stores from memory. With in-order execution, all instructions have to wait for these operations to finish before they can continue. Out-of-order execution allows a processor to execute other instructions rather than stall the program waiting for a high-latency operation to finish.
The average power consumption of an Opteron processor is 89-90 watts.
What does this mean? Even if you aren't an environmentalist
who's worried about our global energy usage, how much power your computer uses is
something you should care about. If you're Joe-Average, your computer eating
power means fewer burgers you get to eat. My dual processor Dell uses around
350 watts (possibly more at peak). That is more than if I turned on
every light in my house (we use fluorescents). I make sure that machine is
off or sleeping whenever it is not in use. If you're Bob-Admin, multiply that
by however many machines you have. Bob-Admin also has to think about how he's
going to keep his server room cool, because 50 machines using that much
power make a great sauna in a few hours. If Bob-Admin puts his machines in
a co-location facility, where he is paying by the sq ft and the watt, not only
does he pay more for energy usage, but for space too. The racks in a co-lo can
only support so much power draw per sq ft, so that means fewer machines per
rack.
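If you want to put numbers on Joe-Average and Bob-Admin, the arithmetic is simple. The 350 watts and 50 machines come from the paragraph above; running 24 hours a day and the price of $0.10 per kWh are assumptions I picked for the example, so plug in your own.

```python
def yearly_energy_kwh(watts, hours_per_day, machines=1):
    """Kilowatt-hours used in a year by `machines` identical computers."""
    return watts * hours_per_day * 365 * machines / 1000


PRICE_PER_KWH = 0.10  # assumed price in dollars; check your own bill

joe = yearly_energy_kwh(350, hours_per_day=24)
bob = yearly_energy_kwh(350, hours_per_day=24, machines=50)

print(f"Joe-Average: {joe:.0f} kWh, about ${joe * PRICE_PER_KWH:.2f} a year")
print(f"Bob-Admin:   {bob:.0f} kWh, about ${bob * PRICE_PER_KWH:.2f} a year")
```

This only counts the machines themselves. Bob-Admin's cooling bill comes on top of it.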
Pentium D
In May 2005, Intel introduced the Pentium D (code name Smithfield), a
dual-core processor, which contains two essentially unmodified Pentium 4
Prescott processors. Unlike the Prescott, the Pentium D adds support for
64-bit through Intel's
EM64T technology.
Although some Pentium Prescott processors utilize Intel's
Hyper Threading
technology, the Pentium D examined
in this paper does not. In early 2006, Intel released a 65 nm version
of the Pentium D, code named Presler. Like Smithfield, the Presler chip does
not support multithreading.
There is a dual-thread Pentium
D, the 3.2GHz Pentium Extreme, but no CPU2000 benchmarks have been published
for that processor so it was not included in this paper. An Extreme Edition
of Presler is scheduled to be released in mid-2006.
The two cores in the Pentium D Smithfield are on the same die and have a clock
speed of 2.8, 3.0, or 3.2GHz. The Presler cores are each on their own die,
which decreased production cost since a defect in a die affects only one core.
What does this mean? Two cores on the "same die" means that
both cores are manufactured on the same integrated circuit. Cores on separate
dies means the cores are on separate integrated circuits, though they are still
on the same chip. For machines with both cores on one die, communication time
is faster, but a defect in one core makes both cores unusable since the
integrated circuit must be thrown away. With the separate die approach, there
is some loss in communication speed, but a defect in one core means only that
core must be thrown away.
The cores of the Presler chip operate at 2.8, 3.0, 3.2, and 3.4GHz. With 230
million transistors, the Smithfield is significantly smaller than the
dual-core Itanium processor, which has 1.7 billion transistors. Yet the maximum
power usage for a Pentium D is about 130W - 155W, while the dual-core Itanium
uses 100W. The cores in the Pentium D are clocked significantly lower than the
single-core Prescott in order to minimize power consumption.
What does this mean? Usually, the number of transistors
in an integrated circuit correlates with the amount of power used. However, with the
Pentium D compared to the Montecito, this is not the case. Each core in the
Pentium D operates at a lower clock frequency than the single-core equivalent
so that the power usage is still reasonable. Operating at the same frequency,
the Pentium D would likely be over 200W in power usage.
Power5
The IBM Power5, close relative of the G5, was released in June 2003. IBM uses the Power5 for
a range of machines from single processor entry-level servers to high-end
multi-processor servers. Like its predecessor, the Power4, the Power5 is a
dual-core processor. Both cores are on the same die. The clock speed of the
Power5 ranges from 2.0 to 2.7GHz. The power usage is about 100W. The Power5 can run two threads in each core using
simultaneous multi-threading. It can also operate in single thread mode.
UltraSPARC IV+
Code named Panther, the fifth generation processor in the SPARC family, the
UltraSPARC IV+, was designed for enterprise computing and released in September 2005. Panther is a dual-core processor that supports two threads using what Sun calls "chip multi-threading", or CMT.
Sun's CMT does not quite match the definition of threading commonly used when
talking about processors. Threading normally means running instructions for different
programs in the same core. CMT in the IV+ is running a different program in each
core, not in the same core.
The UltraSPARC IV+ has twice the computing power of the UltraSPARC IV, yet
reduces the power consumption from 108W to 90W.
What does this mean? The IV+ shows how much the manufacturing process can
improve processor power usage. The UltraSPARC IV used a 130 nm process
but the IV+ uses a 90 nm process. This allows the processor to be the same
in physical size even though it is much more complex and powerful. This also
helps it use less energy than the IV.
UltraSPARC T1
The UltraSPARC T1, released in November of 2005, is the newest of the SPARC
processor line by Sun Microsystems. The T1 has generated a lot of interest
due to its departure in design from other 64-bit processors currently on the
market. The T1 has eight cores operating at 1.0 or 1.2GHz.
All cores on the processor operate at the same frequency;
the processor is available in a 1.0 or 1.2GHz version.
Each core can execute four threads, making the T1 a 32-way processor. Despite the large
number of cores, the T1 only consumes 75W on average, 79W peak.
Xeon
The 64-bit Intel Pentium 4 Xeon was released in June 2004 (code named Nocona). It is designed to be an enterprise-level processor for business computing. It
comes in a single and dual core model (code named Paxville, released in
October 2005) and supports Intel's Hyper Threading technology. The Xeon has
clock speeds from 2.83 to 3.66GHz, the fastest of any of the processors
examined. The single core Xeon uses 110-120W of power, the dual-core uses
135-150W.
Sources
This is not a complete list... I'll be putting a handful
at the end of each entry.
1. P. Kongetira, K. Aingaran, K. Olukotun - "Niagara: A 32-Way Multithreaded Sparc Processor". IEEE Micro, March/April 2005, Vol. 25, No. 2, pg. 21-29, 2005
2. C. McNairy, D. Soltis - "Itanium 2 Processor Microarchitecture". IEEE Micro,
March/April 2003, Vol. 23, No. 2, pg. 44-55, 2003
3. R. Kalla, B. Sinharoy, J. Tendler - "IBM Power5 Chip: A Dual-Core
Multithreaded Processor". IEEE Micro, March/April 2004, Vol. 24, No. 2, pg.
40-47, 2004
4. C. McNairy, R. Bhatia - "Montecito: A Dual-Core Dual-Threaded Itanium Processor". IEEE Micro, March/April 2005, Vol. 25, No. 2, pg. 10-20, 2005
5. C. Keltcher, K. McGrath, A. Ahmed, P. Conway - "The AMD Opteron Processor for Multiprocessor Servers". IEEE Micro, March/April 2003, Vol. 23, No. 2, pg. 66-76, 2003
Power Consumption Sources
Itanium 2 -
http://www.intel.com/products/processor/itanium2/index.htm
Opteron -
http://www.epinions.com/content_18680811072
Pentium D -
PCStats.com and
wikipedia.com
Power5 -
http://www.xlr8yourmac.com/G5/xserveG5.html
UltraSPARC IV+ -
http://www.extremetech.com/article2/0,1558,1667444,00.asp
UltraSPARC T1 -
http://www.sun.com/processors/UltraSPARC-T1/index.xml
Xeon -
www.news.com

Friday March 24, 2006
Using zones in a university.
Remember when you were in college and you were working on your senior project or even a class project and you had to make the decision, do it all at home (with your crappy internet connection, less powerful machine, etc) or do it on a machine at school and beg the admin to set up all the special things you needed (and be told no most of the time and when they did say yes, wait a week for it to happen)? Ok, maybe you were in college in the days when you couldn't do things at home (poor punch card folks, my condolences) so for you the problem was worse, you were totally at the mercy of the people running the machines.
My university has finally solved the problem, with zones. Of course this would be just a few months before I graduate and after I am done with all of this type of work, but I am excited for the rest of the students. Now students have complete control over their (virtual) machine plus they get all the benefits of using a machine in the lab. If only we'd had this when I did my senior project (ok, I *was* the admin then so it was not too bad).
One student is logging his experience, called
The Bunsen Project. It looks as if he broke php already (3/27/03 - it has been fixed) but here were his first two entries after getting the zone:
- Virtual Machine - March 6
Today I got the virtual machine from the CSL. With this, I
will be able to set up a MySQL account so that I can put my project
online, instead of having to resort to static box. More importantly,
however, I will be able to have full administrative access to the
server which will be great practice.
Server Installs - March 9
Well, it took several hours, but I made lots of progress
today with the server. I was able to install several essential programs
using the 'pkg-get' program from
blastwave.org.
Here are some of the programs I installed:
I decided to compile and install the apache server from
scratch because I figured I could get more out of it. Because of
this, I've run in to some problems and my server is still not up yet :(
Doing this isn't just nice for the students, it makes the admin's life so much easier too. Now when a student or instructor needs to set up a project, he can just hand them a zone and let them go. Oh, I wish we'd had this when I worked as an admin there...

Friday March 17, 2006
64-bit processors - here we go.
If you'd asked me a year ago about processor architecture, I would have told you that it is not my area of interest and wouldn't have had much else to say about it. I think it all started with my two undergrad architecture classes. I hated them. I suppose I should have known better than to give up so fast. I had the same thing happen to me with a terrible 7th grade history teacher, and it took until my first college history class before I realized it was an interesting topic after all.
When I started the class I knew I'd have to do a research paper. I have been mildly interested in the Niagara processor since it came out a few months ago, so I decided I would find some way to incorporate it into my research. What I ended up doing was a comparison of the architecture of the 64-bit processors you will find in current computers (current being anything released in the last 2-3 years) and their performance on some of the SPEC benchmarks.
When I started I was pretty clueless about the subject. I had no idea Intel had EM64T in the Pentium line, that there was an Itanium 2 (and a 3rd generation on the way), or that the Power5 was being used anyplace but in Apple's computers. I realized, if I am this clueless about all of this, then likely other people (both computer-geeks and regular consumers) are too. As I started to talk to other computer-savvy friends of mine, I found they were. It seems computers are like cars: most people know a fair amount about the model of car they have, but only a small percentage of the population are true gear-heads. I knew my two 3.0GHz Xeon processors in my Dell were dual-core and had threading. I had no idea the Power5 did that as well, or even that the UltraSPARC IV+ was dual-core (and I work for Sun).
So, I am not an architecture expert. I will probably make mistakes. I didn't take all of these processors apart and look inside, so whatever I know is based on what has been published by the hardware manufacturer or other people doing hardware research. If I make a mistake, please correct it! With that in mind, let's get started.
The processors I'll be focusing on:
- Alpha 21364
- Itanium 2
- Opteron
- Pentium D
- Power5
- UltraSPARC IV+
- UltraSPARC T1
- Xeon (Nocona and Paxville)
You will notice there are a few processor families missing, such as the MIPS line of processors. I wanted to focus on processors you are likely to find in a machine you could buy today. I tried to be as fair as I could with my analysis, but I admit I had some bias going into this. I have never been a big Intel fan. For me, Windows and x86 were always "ew". I do have an x86 box at home and love it (it is screamin' fast) so I am not totally against them anymore. I also wanted to see the Sun processors do well. I think working for Sun, it is expected I might be cheering for the home team. I can say that many things I found surprised me, others disappointed me, and my opinion of most of these processors is very different than when I started.
Working Backwards
Rather than go through all the analysis and then write a conclusion, I am going to start this off with my conclusion and then we can look at how I got there.
If I had to pick the best overall processor (based on the research I did and the specs I looked at), it would have to be the Power5. The T1 blows it away (and every other processor) in benchmarks like
SPECweb2005 and
jAppServer2004 but the Power5 tops all other processors in these benchmarks and also shows great performance on the CPU2000 benchmarks as well. The T1 has no published CPU2000 results. It is not a data crunching machine so I doubt we'll ever see CINT2000 or CFP2000 results for it. Based on processor only, if I wanted to buy a web server, I'd pick the T1, for anything else, I'd choose the Power5.
The Opteron would make it to my number two place. It was in the top two for overall processor speed (CINT2000) and was a close second to the Power5 for throughput for up to four processors (CINT2000Rate). It is too bad there are not 8-way and higher Opterons so we could get a really good look at its throughput scalability.
The Itanium 2 and the UltraSPARC IV+ both fall in the middle. Neither processor is very fast but they both have decent throughput and both scale fairly well. Even though neither of them is a top performer, they are the most common processors for benchmark results for machines with 36 or more processors. The next generation Itanium, Montecito, makes some movement in design in the same directions as the T1. It will be interesting to see how it performs once it is released.
The Pentium family processors fall at the bottom. The single-core Xeon is fast. It had the second highest average score on the CINT2000 benchmark. The Pentium D fell in the middle for speed. Both of them have good throughput for a single processor machine. After that, things start to fall apart for the Pentiums. There are no published results for the Pentium D for more than single-CPU machines. The Xeon's throughput drops noticeably at two processors and pretty much falls off a cliff above four processors. Neither machine performs very well on the SPECweb2005 or jAppServer2004 benchmarks.
You might ask, but what about the T1 and the Alpha, where do they fit in? It is not really fair to put the T1 in this ranking, at least not directly. It really is not the same type of processor as the rest of these. The processors above were designed to be all-purpose processors. The T1 was designed with a specific type of application in mind, "network-facing". The T1 would be at the top for these types of applications, and would not do as well for others. Without published results, the best I could do is speculate and I want to discuss facts, not play guessing games. With the Alpha, I didn't find out until after my paper was done that there was a chip released recently (2003). I have not had a chance to take a close look, but I will.
I know that by putting this up, someone is going to try to argue that I have not provided the details. As I said, this is the conclusion, not the whole of my research. I can only fit so much into a single blog entry. Have no fear, the details, more than you probably want, are coming in future blog entries.
Tom pointed out it might look bad that I didn't put a Sun processor in the #1 place. Well, if telling the truth gets me in trouble, I can live with that. The reality is that anyone can look at these results, they are all published on
SPEC's web page. I only had time to look at the integer CPU2000 benchmarks, SPECweb2005 and jAppServer2004. There are so many more to look at (and I plan to) that by the end of this, those rankings will likely change.
That is all for now. Stay tuned for next time when I take a look into some of the architecture designs for these processors.

Thursday March 16, 2006
New software engineering conference
I wanted to let you all know about a really exciting new software engineering conference coming this year.
Waterfall 2006 will be held in New York state in April 2006.
Some of the paper authors include: K. Schwaber, J. Highsmith, R. Martin, S. Ambler, and K. Beck.
Check it out!
(I have not mentioned it here before, I am a big fan of agile development... you'll get to read more about my work in this area later.)

Tuesday March 14, 2006
Dear blog, Sorry I have not written in awhile.
Wow, can you say the word busy? I knew you could!
Ok, so it has been over two months since I last posted. I know it is an overused excuse, but life has been really busy. My life is pretty crazy to start with between working full-time, going to grad school, and having two kids. I am also working on my thesis and that pretty much eliminates any free time that might have been left in my day.
Today I took the final for what I hope is my 2nd to last class ever. I also turned in my 23-page paper about 64-bit processor design and performance. It is so nice to be done, even if it is only for two weeks until classes start up again.
I have to confess, being busy is not my only reason for not blogging. I probably could have worked in a post or two since January 5th. My main source for interesting things to write about, Tom, has not had anything exciting happen lately. Oh, he's given me a couple of good bugs to look at but none of them were really exciting enough to inspire a whole blog entry. I did learn a neat way to wedge a single CPU Solaris box though. :)
The fact that Tom has not had anything "interesting" happen all quarter (meaning nothing has broken, crashed spectacularly, etc) is a good thing. I think he's finally changed enough of the old way the lab was run that things are finally working correctly. Of course saying that means he'll be IM'ing me in the next 24 hours with something.
All of this is not why I am blogging today. You'd think after a 23-page paper and then an hour of intense writing for my final I'd be quick to get to the point this afternoon. Here it comes... I finally have something interesting to blog about that has nothing to do with anyone else's funny experiences with UNIX. It is true.
What is this exciting new topic...
64-bit processor design and performance.
For those actually reading rather than skimming, you'll notice this was the topic of my paper. My brain is just filled with SPEC results, L2 cache sizes, and interconnection network bandwidth numbers. I did not expect I'd find my graduate architecture class very exciting, but I really enjoyed it and the research for the paper turned out to be really interesting.
So, for the next several blog entries, I'll be talking about 64-bit processor design, performance, cost, etc. I hate to leave you with a cliff-hanger but I will be back soon, I promise!
P.S. - I discovered the spell checker on b.s.c doesn't know the word "blog"... I thought that was funny.

Thursday January 05, 2006
JDS + Sunray = SLOW
Tom: well a whole classroom full of people running JDS on the sunrays
pretty much kills them
Kristin: oh bummer
Tom: ran the memory right out
Kristin: wow
Tom: the systems spend too much time paging
Tom: cde and they run fine
Kristin: hmm
Tom: it sucked
Tom: we had to get everyone in JDS to log out and log back in
Tom: I'm not a big fan of sunrays today
Kristin: what is the sunray server?
Tom: there are 2 v240's
Kristin: those are new arent they?
Tom: yes
Kristin: and how many sunrays in the lab?
Tom: 32
Top on one of the servers:
last pid: 8832; load averages: 6.64, 7.17, 4.44
15:23:41
496 processes: 475 sleeping, 15 running, 3 zombie, 3 on cpu
CPU states: 11.8% idle, 66.9% user, 21.3% kernel, 0.0% iowait, 0.0% swap
Memory: 2048M real, 39M free, 3553M swap in use, 1973M swap free
Tom: PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
8798 fffffff 15 47 0 128M 53M run 0:03 5.12% java
8658 aaaa 8 52 0 113M 66M sleep 0:05 4.52% mozilla-bin
5563 fffffff 1 21 0 80M 35M cpu/0 0:09 3.69% Xsun
8828 fffffff 15 1 0 146M 44M run 0:01 3.67% java
4024 aaaa 1 53 0 84M 30M sleep 0:13 3.15% Xsun
7624 qqqqqqqq 8 59 0 110M 45M run 0:07 2.07% mozilla-bin
26536 oooooooo 3 24 0 122M 63M run 0:44 2.03% mozilla-bin
26362 mmmmmmmm 1 51 0 36M 28M sleep 0:05 1.86% Xsun
25426 oooooooo 1 59 0 97M 39M sleep 0:36 1.85% Xsun
24654 qqqqqqqq 1 30 0 41M 30M sleep 0:13 1.32% Xsun
8225 fffffff 19 59 0 149M 41M sleep 0:04 1.22% java
20619 cccccccc 1 43 0 76M 27M sleep 0:10 1.15% Xsun
7329 fffffff 1 53 0 64M 32M cpu/0 0:01 0.91% metacity
6838 ggggg 19 58 0 144M 36M sleep 0:05 0.87% java
3824 sssss 19 56 0 149M 39M run 0:09 0.81% java
5226 aaaa 19 47 0 149M 39M sleep 0:07 0.70% java
2083 wwwwww 19 59 0 144M 39M run 0:09 0.69% java
vmstat output:
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr m0 m1 m2 m1 in sy cs us sy id
0 0 0 687760 977984 3 21 4 2 3 0 20 1 1 1 0 366 737 553 0 1 99
13 0 1 3148808 60592 921 7577 2580 0 0 0 0 225 113 114 18 2850 22816 5966 70 30 0
10 0 1 3254736 151744 1062 6150 4451 8 8 0 0 115 59 57 441 3456 19750 3917 67 33 0
24 0 1 3256128 159080 760 5836 1204 0 0 0 0 107 54 54 17 1531 20498 3289 78 22 0
7 0 1 3248296 147712 869 9398 982 0 0 0 0 95 49 49 19 1786 18958 3722 69 31 0
So, they are getting more RAM and hoping that will make things better. People were not happy to be told they have to run CDE until further notice. It is things like this that are not going to leave the students with a good impression of Solaris and Sun hardware.
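To put a number on "ran the memory right out", here is a minimal sketch that pulls the figures out of the Memory line in the top output above. The function names are made up for illustration, and the regular expression assumes the line always has exactly this layout with sizes in megabytes.

```python
import re

# Summary line copied from the top output above.
MEMORY_LINE = "Memory: 2048M real, 39M free, 3553M swap in use, 1973M swap free"


def parse_memory_line(line):
    """Return the four megabyte figures from a top 'Memory:' summary line."""
    pattern = (r"Memory:\s*(\d+)M real,\s*(\d+)M free,\s*"
               r"(\d+)M swap in use,\s*(\d+)M swap free")
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("unrecognized Memory line: " + line)
    real, free, swap_used, swap_free = (int(g) for g in match.groups())
    return {"real": real, "free": free,
            "swap_used": swap_used, "swap_free": swap_free}


def memory_pressure(stats):
    """Express free RAM and swap in use relative to physical RAM."""
    return {
        "free_fraction": stats["free"] / stats["real"],
        "swap_to_ram_ratio": stats["swap_used"] / stats["real"],
    }


if __name__ == "__main__":
    pressure = memory_pressure(parse_memory_line(MEMORY_LINE))
    print("free RAM: {:.1%}".format(pressure["free_fraction"]))
    print("swap in use vs RAM: {:.2f}x".format(pressure["swap_to_ram_ratio"]))
```

With the numbers Tom sent, less than 2% of physical memory is free and the swap in use is larger than the RAM in the box, which fits the heavy paging in the vmstat output.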
I like the idea behind the Sunray. Tom tells me they are really easy to admin. I like the fact that I can stick my little card into one and get all my stuff just like I left it. Someday I want to do that on any computer, anywhere. But, if we're going to get to that point, then we, the developers, need to get out of the "it runs fine on my desktop" mindset.
I had never really considered the memory usage of JDS compared to CDE before. As soon as he told me about this, I had to go find out some more information. The first web page I came across belonged to a fellow Sun blogger,
John Rice. He had been following the initial tweaking of Gnome 2.0 and wanted to see how the memory usage was now. He determined that JDS was using around 35M of private and 45M of shared memory. Looking at the top output from the Sunray server, that seems reasonable. Each of those Xsun instances has a total size anywhere from 36MB to 97MB, with most in the 70-90MB range, and about 30MB or so of that resident. John believes that all this memory usage in JDS is because its init functions are pulling in every library under the Sun.
I started peeking at other web pages and found another interesting discussion at
osnews.com. It says:
- "We received reports that GNOME was orders of magnitude
slower than CDE on Sun Rays. To verify and measure this, I designed
and ran some performance tests in order to compare the time and
bandwidth usage of GNOME (JDS) with that of CDE on Sun Rays. The tests
measure the time it takes to display data using various desktop
applications: Browser, StarOffice and Terminal."
This is another Sun person,
Johan Steyn. He basically uses tcpdump to count bytes sent between the Sunray and the server. What I found interesting were some of the comments, such as
"That's why you run CDE on your old SPARCstations, and Gnome/JDS on your dual-AMD64 boxen."
and
"Personally its very close to being a perfect environment
for me. i would like better fast user switching. I don't think it is
bloated. its rare that my system ever goes to the page file."
It's rare that his system has to page? If a single desktop system can be taxed enough by JDS to start paging, does that mean we expect a Sunray server supporting 15 Sunrays to be at least 15 times as powerful as a current desktop, if not more? I think expecting our customers to buy
a couple of T2000's (at $16k-26k apiece) to support a lab of 32 Sunrays might be a bit unreasonable.
It is not just Gnome using up memory. Looking at the top output, java and mozilla are using up even more memory. I took a peek at my own system, running Solaris Express s10_52 (yea I know, I need to upgrade... in my copious spare time). From my system I see:
PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
355 kristin 1 49 0 63M 58M sleep 8:17 0.59% Xsun
1320 kristin 6 59 0 75M 63M sleep 2:22 0.51% mozilla-bin
511 kristin 1 49 0 16M 10M sleep 0:14 0.09% gnome-terminal
497 kristin 1 59 0 56M 8352K sleep 1:24 0.07% metacity
501 kristin 1 59 0 62M 13M sleep 8:44 0.05% gnome-panel
495 kristin 1 59 0 3720K 2016K sleep 0:26 0.01% gnome-smproxy
503 kristin 4 59 0 71M 20M sleep 0:04 0.01% nautilus
505 kristin 1 59 0 54M 5960K sleep 0:27 0.00% galf-server
479 kristin 1 59 0 10M 7488K sleep 0:00 0.00% gconfd-2
362 root 7 59 0 2640K 1448K sleep 1:22 0.00% mibiisa
576 root 2 49 0 4456K 2480K sleep 0:39 0.00% automountd
693 kristin 13 39 10 99M 67M sleep 0:24 0.00% java
456 kristin 1 59 0 1880K 592K sleep 0:15 0.00% dsdm
65 root 7 59 0 3464K 184K sleep 0:13 0.00% picld
481 kristin 1 59 0 4960K 2848K sleep 0:03 0.00% xscreensaver
Xsun is using about the same but my mozilla, which has been running all day and is also running my email, is using almost half the memory of the mozillas on Tom's Sunrays. My java, which is 1.5, is using significantly less memory as well. Tom tested running mozilla again on the Sunray server for me and at startup it was using 98MB; after loading 4 web pages it was at 104MB. Just adding up the memory (RES column) used by my own processes on my system (and ignoring numbers under 1MB), I get 231 MB. Imagine that multiplied by 32 users, and we have 7392 MB. If only about 1/3 of that is private, which is about what John Rice found, then we still need 2464 MB just in private memory for the 32 users (I sure hope my math is just wrong). I can see why 4GB across his two servers might not be enough memory. Let's hope no one launches StarOffice.
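The back-of-the-envelope math above can be written down as a small sketch. It takes the 231 MB per-user figure and the one-third private share as given rather than re-deriving them, and the function name is made up for illustration.

```python
from fractions import Fraction


def estimate_memory_mb(per_user_res_mb, users, private_fraction):
    """Return (total resident MB, private MB) for a lab full of users."""
    total = per_user_res_mb * users
    private = total * private_fraction
    return total, private


# Figures taken from the post; none of them are measured here.
PER_USER_RES_MB = 231              # sum of the RES column for one JDS session
USERS = 32                         # Sunrays in the lab
PRIVATE_FRACTION = Fraction(1, 3)  # rough share of memory that is not shared
SERVER_RAM_MB = 4 * 1024           # "4GB across his two servers"

if __name__ == "__main__":
    total, private = estimate_memory_mb(PER_USER_RES_MB, USERS, PRIVATE_FRACTION)
    print("total resident:", total, "MB")
    print("private only:", int(private), "MB")
    print("private share of server RAM: {:.0%}".format(float(private / SERVER_RAM_MB)))
```

Even the private third eats roughly 60% of the 4GB, before counting a single shared library, the kernel, or anything beyond the basic desktop.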
All of this goes back to what I said at the beginning. I am sure the developers for each of these applications look at the memory usage (I sure hope they look!) and think, 100MB, that is not a big deal. Alone, it is not. Add it to every other 100MB application people run just to have a minimally functioning desktop (web browser, email, file browser), then add in common tools like a word processor or IM client, and it starts to add up.
So, how many Sunrays should we expect a server to support? According to
an article on unixville.com
"The recommended server configuration for an average office with 50 active Sun Ray appliances is a Sun E250 with dual processors, 1 Gig of RAM, dual 100-Mbps Fast Ethernet controllers, and two disks to spread the swap space onto." 50 users and 1 GB of RAM? Not if we want them to run JDS. Okay, the article is 1 1/2 years old, but has the memory usage really changed that much in 1 1/2 years? My build, s10_52, is from Feb 2004, before that article, and an E250 with 1GB could not come close to supporting 50 Sunrays running Gnome + Mozilla + Java.
What is my point? There is a major disconnect between the folks considering the system requirements for applications such as Gnome and the Sunray technology, or really any thin client. I don't want to see Sunrays left behind with CDE. Can JDS be changed to use less memory? A lot of JDS fans like what the "bloat" provides them: anti-aliased fonts, pretty graphics, lots of functionality. JDS is certainly not the only GUI suffering from memory hog syndrome. OS X is a major memory hog, but it sure is pretty. What is the solution? Well, I guess step one is get more memory for those Sunray servers. I think the long term solution is for developers to stop using memory just because and start being smart about it. Yea, an individual machine may have 1 GB of memory to use, but we have to remember these applications are used other places besides our dual opteron desktop boxes. Customer feedback, it's where it's at... you can't get enough.

Wednesday December 14, 2005
Recipe of the Day - Grumpy Sysadmin.
Grumpy Sysadmin
Ingredients:
- 1 Solaris sysadmin
- 2 Solaris 10 sunray servers
- 1 Solaris Fingerprint Database, not up-to-date
- 1 Web Designer who doesn't know basic HCI principles
1. Have web designer add a date stamp to the fingerprint database web page in a format that is hard to read and easy to miss.
2. Have sysadmin enter servers' md5 signatures on web-based Solaris Fingerprint Database.
3. Allow sysadmin temperature to rise as he begins to believe his machines have been rooted when he sees they do not match the fingerprint database.
4. Mix well.
5. Allow to stew for 48 hours.
You will know it is done when the sysadmin realizes the date for the fingerprint database is over a month old and does not include the latest S10 kernel patches and that he's wasted the last 2 days trying to figure out how he was rooted and then reinstalling the servers.
Serve chilled. Makes great leftovers.
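For anyone who has not used the fingerprint database: the "md5 signatures" in step 2 are just MD5 digests of files on the system. A minimal sketch of computing them follows; the helper name is made up, and which files to check is up to the admin.

```python
import hashlib
import sys


def md5_hex(path, chunk_size=65536):
    """Return the MD5 digest of a file as a hex string, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Print one "digest path" line for every file named on the command line.
    for path in sys.argv[1:]:
        print(md5_hex(path), path)
```

A digest that does not match can mean a tampered binary, or simply a patch newer than the database, which is exactly the trap in this recipe.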

Tuesday December 13, 2005
Hey college students, read this.
I never thought I'd find myself saying, "Darn, if only I didn't work at Sun", but I just did. What could cause me to say such a thing about the job I love so? I am ineligible to enter the
Solaris 10 University Challenge Contest.
No, I am not pushing this just because I work here. I really think this is cool. As soon as I heard we were doing this, I was thinking about entering my own thesis. When I got home that day, I told Tom he should enter his thesis. Unfortunately for him, he is ineligible too. We were both pretty disappointed we could not enter.
What is so great about this
contest? To graduate, you have to do a senior project or a thesis. If you're like me, you are going to spend a good amount of your time on it (6 months on my senior project, 18 months on my thesis). At the end you'll have that piece of paper to hang on the wall, but why stop there? How about
$5,000 and an Ultra 20 Workstation? How about also getting
$100,000 in free Sun products for your school? All of this for something you're going to be doing anyway. Check out the
rules and then get your project submitted by
6/10/06.
For now, my dreams of the Kristin Amundsen Software Engineering Lab will once again have to go on hold. Maybe I'll win the lottery... I guess I have to play to win. So do you, and I think odds are much better for the S10 University Challenge.

Monday December 12, 2005
Nine years old.
That is how old my first "baby" is today. It sure doesn't feel like it has been that long. It is hard to imagine she's half-way to being an adult. Next year, double digits!! In the car she told me the most special birthdays are 1, 10, and 100. The first because you are finally a number, the second because you are double digits, and the last because it is three digits and because almost no one makes it that long. Just wait until she hears about 16, 18, and 21!
 
"Happy day toooo youuuuu, Autumn" - Laura (her two-year-old sister)

Friday December 09, 2005
Teaching old dogs new tricks.
So, one of the problems Tom encounters when he makes changes to the systems at FooU is old instructors. By old, I don't necessarily mean age, but out of the industry and in education for a long time. These instructors, as expert as they may be in their niche of computer science, don't understand a lot of the changes in the industry. I don't have any "proof" of this other than my own (and a few others') observations. It seems to me they just don't see the big picture anymore.
One story he told me, which I don't have in an IM chat log, was a conversation with one instructor shortly after the change this summer to universal home directories for all unix systems in the department. Until last summer, every person had a separate account and home directory on each server in the department. Some systems did point to the main department server, a Solaris box, just for usernames and passwords, but the accounts were separate. This instructor noticed that a file he'd created while on one of the Linux servers was in his account on the main Solaris server as well. Tom tried to say, "Yes, they are the same account now," but he was not getting it. He knew the passwords were shared, but he just was not hearing that now the actual account, files and all, were the same on Linux and Solaris. He did finally get it.
Here's another one that just happened today:
Tom: (Instructor X) thinks centralized home directories is "a step backwards"
Kristin: backwards???
Kristin: how does he figure that?
Tom: back to the days of centralized servers he says
Tom: I don't know
Tom: can't make everyone happy
Kristin: well in a sense he is right
Kristin: but i think that is the way computers are kind of going now
Kristin: we're moving away from all the keep everything on your
personal computer model back to a model where the physical machine is
almost irrelevant
Tom: yes
Kristin: i think that is a good thing
Kristin: someday i want to have it where i can sit down at any
computer anywhere and do anything
Kristin: have all my data
Kristin: already my email is like that
Kristin: tell him the network is the computer
Tom: :)
Tom: not for some people
Kristin: aparantly not
Kristin: did he really like it better when he had N accounts on N
systems and all his files spread across them using ftp to get some
file onto the machine he needs it on all the time?
Tom: yes he did like it better
Kristin: i hate saying to myself "hmm what machine did I have that
file on" and then having to look in each account until I found it
Kristin: and keep N copies of the same files so they were on each
machine when I needed them
Tom: I say now your files are centralized and you know where they are
Kristin: the overhead in managing all that...
Tom: he said: ftp was easy
Kristin: ftp is easy... not needing to is even easier!!
Kristin: what is the problem with a single home directory for all machines?
Kristin: did he mention any thing he felt was a drawback other than it
reminded him of the old days?
Tom: he compiles things on linux and they don't run on solaris, he is
also worried about the network slowness of downloading every file he
uses to the local machine
Tom: its not a local file
Kristin: well the hope is that network speeds get fast enough where
the time it takes from the perceptions of a human are
indistinguishable
Kristin: i mean, should we go back over every other thing we do now
all the time with computers that would have been "too slow" 20 yrs ago
to do
Kristin: maybe we should get rid of all interpreted languages... too
much time compiling on the fly every run
Tom: I don't notice any difference on a nfs home directory
Kristin: me either
Kristin: and i am sure he doesnt either
Why does it matter what some instructors think? They don't get it, why should I care? There are a couple of reasons that come to my mind.
The first one is that what instructors think, they pass on to the students. Each quarter they graduate a couple hundred students. Each will go to their new employer with whatever they have learned and experienced in school. I don't want them going there with the idea that Unix is doing things "a step backwards". I've seen the presence of Unix shrink and Windows grow in the Comp. Sci. Dept. at FooU over the last 12 years. It's finally turning around.
The other reason to care is that if this is what they see, this is what other customers and older system administrators will see too. How do you educate this group? They've been doing things the same way for years and something about change makes them uncomfortable. The previous Unix admin for FooU was this person. He is a nice guy and he knew his stuff. He kept the labs running... running just like it was still 1987 in 2003. Is it any wonder the instructors and students got tired of using the Unix machines for classes and started switching to Windows? Even I would pick Windows XP over Solaris 2.3. How do we get these guys into the 21st century (this is an honest serious question, I am searching for the answer)?

Thursday December 01, 2005
Another blog in the crowd.
#include <stdio.h>
#include <stdlib.h>
int main (void)
{
    printf("Hello world.\n");
    exit(0);
}
Now that the obligatory geeking out is out of the way, I can introduce myself. My name is Kristin, I've been a *nix geek since 1992. I got my very first unix account on my university's brand new AIX machines. I grew up in an Apple household and despite being used to pretty graphics, I took to the command line like a fish to water. "Ah, the absolute power!!" (Those of you with kids may recognize that reference.)
I bought my first NeXT Cube in 1993 and started collecting slabs as well by 1994. I became the Solaris admin for the computer science department a couple of years later and my first desktop machine at work was a DimensionCube. I hacked the backplane on it so it could run 2 other motherboards (sadly not as one machine, though that was a goal of NeXT's they never realized). Unfortunately, as much as I loved my beautiful black cube, its 33 MHz processor just could not keep up and I switched to a PC running Red Hat with AfterStep. My enjoyment of eye candy from my formative years was never lost and I switched to Enlightenment eventually. I always had an Apple laptop at home as well... I could not leave my first computer love.
I started at Sun straight out of college (which took me 7 years, but I did have a baby in that time too). My job at Sun involves standards conformance testing of Solaris and branding it to the Single UNIX Specification. It really is a lot more interesting than it sounds and it gives me a chance to work in all different areas of the OS.
Here are some other random details about me...
I have two daughters, Autumn, who will be 9 in 11 days, and Laura, who is 2 1/2. I am a full-time work-at-home employee. I live a mile from the Pacific Ocean half-way between Los Angeles and San Jose. I love it here. I am back in school getting my Master's in computer science, which will hopefully be done in June. My thesis is about agile software development, so expect to see stuff about that here too.
My inspiration for this blog was not work directly related to my job at Sun or my thesis but the conversations with my spouse, who is the UNIX Specialist for the computer science department of a large university (that I attend). I realized that through him, the students, and faculty I gain an amazing insight into how people use (or don't use) Solaris outside of Sun. He has agreed to let me post snippets from some of our IM conversations here. So, here are some from conversations over the last few weeks, enjoy...
On fcntl with nfsv4 between Linux and Solaris:
Tom: AHH!
Tom: linux sucks
Kristin: ?
Kristin: yes, sorry about that
Tom: can you fix it?
Kristin: uh... well sort of
Kristin: i can put solaris in its place
Kristin: that will probably fix your problem
Tom: little problems, like fcntl locking doesn't work
Kristin: uh that is more than a little problem
Tom: how can they release code with that big a problem
Kristin: i have no idea
Tom: its only in nfsv4
Kristin: heh, a little used part of the OS i'm sure
Kristin: i mean, who really uses nfs anyway?
Tom: it may not be as bad as I think
Kristin: well tell me what you find... i have to go
Follow-up to that conversation a few weeks later:
Kristin: hey did you ever get fcntl to work on linux?
Kristin: with nfs?
Tom: yes
Tom: I got a newer version of the kernel
Tom: and then it worked
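For readers who have not run into it, "fcntl locking" here means POSIX advisory file locks. The sketch below is only an illustration of how one might check whether such locks work on a given filesystem; it is not the test Tom ran, and the function names are made up. Python's fcntl.lockf is a thin wrapper around the fcntl() locking calls.

```python
import fcntl
import os
import tempfile


def try_exclusive_lock(fd):
    """Attempt a non-blocking exclusive fcntl-style lock on an open descriptor."""
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        return False


def lock_roundtrip(path):
    """Open (or create) path, take the lock, release it, report success."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        acquired = try_exclusive_lock(fd)
        if acquired:
            fcntl.lockf(fd, fcntl.LOCK_UN)
        return acquired
    finally:
        os.close(fd)


if __name__ == "__main__":
    # A local temporary directory is used here; to exercise NFS, point the
    # path at a file inside an NFS-mounted directory instead.
    with tempfile.TemporaryDirectory() as tmp:
        print("lock acquired:", lock_roundtrip(os.path.join(tmp, "lock-test")))
```

On a healthy local filesystem this should report that the lock was acquired. A failure on an NFS mount would only tell you that locking is not working for some reason; it would not say whether the client or the server is at fault.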
Linux server misbehaves:
Tom: my linux server is behaving weird
Kristin: well it is linux
Tom: the cpu is 99% idle, yet the load is at 25
Kristin: that is weird
Tom: and there are processes that I just cannot kill
Kristin: that is not right
Solaris zones + smpatch doesn't work:
Tom: Here lets make up a conversation for your blog
Tom: make sure the appropriate people at sun see it:
Kristin: ok
Tom: Tom: This sucks... smpatch doesn't work with
machines that have zones.
Tom: Kristin: yeah, thats bad
Tom: Tom: Well when is sun going to fix it... cause its hard to use
zones when you cant patch the $@%^ machine
Tom: Kristin: I hope its soon
Tom: Tom: yeah, me too
Tom: there... blog that
Kristin: hehehe
Kristin: they are aware of it
Kristin: and i agree that someone needs to fix it, like yesterday
Using zones for a couple software engineering classes:
Tom: So some of the software engineering classes next quarter are going
to use zones to do their projects
Kristin: nice
Tom: which means that at least one member of the group needs to be a sysadmin
Kristin: interesting...
Tom: I'm trying to write a one to two page intro for them
Kristin: you need to keep track of how it goes
Kristin: it would be fun to post updates on how it goes
Tom: do you think someone at sun might be interested in how it goes
Kristin: yes
(a little later)
Tom: I wonder if this is too much for students?
Kristin: what? zones?
Tom: Should the really be partially administering their own machine?
Kristin: i think they should know how to do that but I may be biased