
Thursday January 22, 2009
SLO Life just got slower...
It's been a fun 8 3/4 years here at Sun. Today I will be given the chance to sit on the beach a lot more. Good luck to all my colleagues who are also leaving. Anytime you want to sit on the beach, let me know. To all my colleagues who are staying, keep on making the best products out there. I look forward to enjoying more time using them in the future.
This blog will continue at http://kamundse.blogspot.com.
Thanks again!

Wednesday January 14, 2009
Darth Test
I am sure these must be making the rounds through techie blogs & SlashDot today but I thought they were too fun not to post.


I saw them on Flickr HERE.

Tuesday January 13, 2009
At least it is working
I did get apache running eventually yesterday. Giving the machine a static IP fixed the error. I am glad it is working but I know (or hope anyway) there must be a way to make it work with dhcp. I will save that for another day and just be glad it is working now. Small steps...

Monday January 12, 2009
The Web Stack hates me
To catch up on where we left off with my previous blog post... my server is now awol. Well, I have not lost track of it completely but I have no idea what is happening with it. [unhappy face] Amber was never able to figure out how to boot it off the CD. One of the other directors had a friend in the area who is a techie and she asked him to take a look at it. He met Amber to get the server, and that was the last I heard of it. Sigh. Some days I feel like we're never going to get that machine up and running. I do have a conference call with our new co-lo on Wednesday and I am hoping they can work with me to get this all resolved, assuming I can get the guy who has the server now to get it to them anytime.
Despite this setback, things are moving forward. We had a fundraiser to get the money to buy a new server. The server I have been trying to set up was donated by a friend and was supposed to get us off our current server to something that can sustain us right now. We need a new server in order to grow. Our members did an amazing job and we exceeded our goal so it looks like we'll be buying a new server soon.
To prepare for the move to OpenSolaris, we (my husband and I) installed 11.2008 on one of our home machines. Today I am trying to get the Web Stack up and running on OpenSolaris and get the configuration like what we have on the existing server. I am way too used to Solaris so of course I figured this would be really quick, just edit a few conf files and start it up. Nope... where is apache? Duh, it is not there, I need to add it.
Ok... time to read a little bit so I headed over to opensolaris.com and clicked on the big "Use" icon -> Customize Your System. Perfect, what I want to do is the first thing listed. This is going to be quick and easy. Why do I let myself ever think that? As soon as I do, something always happens. Why do the instructions start with the software already on my computer? The first step is "extract the .tar.gz file...". What .tar.gz file? We seem to have missed at least one step here.
Grumble, do I need to do pkg install? What is the package name? That does not make sense with the instructions to extract a tar.gz file. I suppose I can see reasons why you would not want to have all the instructions on how to download it in the install & config guide... but this is html, would it be so hard to put a single line with a link to those instructions just in case people missed them? I never found anyplace that tells you what you need to do and ended up asking my husband what I needed to get. What would I do without him around?
A "pkg install amp" later and I have apache and friends. Let's try starting it... svcadm enable and hmm... it is in maintenance mode. You would think the -v (for verbose) option to svcadm might actually help you figure this out, but I guess verbose is one line saying "svc:network/http:apache22 enabled" to some engineer out there. Except it wasn't enabled, so that was pretty much useless as well as not being verbose. So I try svcs -x and see EXIT_ERR_FATAL. Ew, no fun. Maybe a log file? You might try looking in /var/log/apache2 like I did, but nope. The directory exists, but it is empty (of course, because otherwise it wouldn't fit with my day thus far). Alright, read the httpd.conf file. Well at least I did find the error log. Hmm... why can't it figure out its own IP address? I ask the oracle of everything I do not know, my husband, and he suggests dhcp might be the reason. I pray that is not it because that is just stupid. If my 6 year old Mac laptop can do it, then OpenSolaris better be able to.
Is 2pm too early to start drinking because I need one about now?
It is a good thing I have to stop now to get the kids from school... because when you start yelling at the computer, you know it is time to take a little break.

Tuesday December 02, 2008
Setback
The goal was to get the server to the new co-lo by the end of November. We missed it. A big part of that was the Thanksgiving holiday and pretty much losing all of last week. But, another part of it is that it is really hard to do tech support. I have never envied the job of a phone tech support person, and last night just reminded me how frustrating it can be.
The member of my tech team, we'll call her Amber, who has the server right now was going to install OpenSolaris on the machine yesterday. Right after dinner I got a message from her, did I have a few minutes? She kept getting a login prompt but she did not have a username/password. That is not right, there shouldn't be a login prompt yet, we have not installed anything. She is certain she told the machine to boot off the CD-ROM. A few questions later about what various screens look like, did she see a pretty blue screen with the word "OpenSolaris" on it anyplace, etc.? No. Grub menu? Yes. Ok, we're not booting off the CD, but we're getting Linux booting off the hard drive.
Reboot... darn it the screen flashes too fast... reboot... reboot... reboot, and then she finally gets enough of a view of the initial screens to get into the BIOS. Have you ever tried to help someone use a BIOS page for a machine you've never used before? Ick. It is not fun. I do not have experience with many varieties of PC hardware and this server is a brand I know nothing about. We manage to establish that the CD-ROM is set in the 2nd position for boot order but after 2 hours and many reboots we cannot seem to get it changed so that the machine will boot from the CD. Amber is 3 hrs ahead in time so we had to abandon the efforts for the night so she could get some sleep.
So, tonight we'll try again.

Friday November 21, 2008
Step one...
I've been trying to come up with a good pseudonym for the organization that I am the technical director for so that I can talk about it here without confusing what I am writing with whatever I am doing in my real job for Sun. How about Unnamed Social Network Site... or USNS for short? I suppose it really does not matter and that is easy enough to type.
A friend of mine donated a spare 1U server to USNS recently. We are currently leasing a server with our co-lo and it is not meeting our needs anymore. The donated server is not "new", but it was free and it should meet our current workloads. Like most non-profits, we operate on a shoestring budget and not having to pay to lease a server every month, especially one that cannot handle peak usage, is a big improvement for us.
Step one for the migration from the old server to the new server (and from Ubuntu to OpenSolaris) is getting the new server ready to be taken to the co-lo. Since the co-lo is on the other side of the country and we have no remote console set up, I need the OS installed before the machine is delivered to them. One of the members of my technical team, who lives in the same area of the country as the co-lo, will be doing this for me and then delivering the machine. It will be a great test of all the changes that have been made to the download and install process for OpenSolaris. I think the process is really easy, but I've been working with Solaris for a long time. Will someone with no Solaris/OpenSolaris experience see it the same way? I am about to find out. If all goes well, we'll have the new server online by the end of the month.

Thursday November 13, 2008
New possibilities
I guess saying "it's been a while" would be an understatement. I am sorry blog, I have ignored you for so long. I guess part of it was I have been busy but I think most of it was that I ran out of fun things to say. Most of what I do for work (unix conformance) is not very exciting to anyone but the handful of us who do it. My main source of fun topics was my husband, who was working as a Solaris (and Linux) sysadmin for a university. Now he has moved to a new position focused strictly on email and not on general sysadmin work, so his stories are less related to Solaris and not nearly as fun. (Though he did notice the big drop in spam this week as McColo was taken offline, that was pretty interesting.)
I finally will have some interesting stories of my own to relate. I have joined the board of directors for a non-profit as the technical director for the organization. No, I am still doing my day job. I guess this would count as a glorified hobby since it is all volunteer time. This hobby is going to allow me to completely design and direct the development of a new infrastructure for our organization of 29,000 members, 109 different local sites (with their own associated web forums), and 400,000 page hits a day to the web server (and everything growing fast). I've convinced them to take the plunge and we're moving to OpenSolaris for the entire network. I will finally get to use in a production environment all the cool tools I've barely been able to make use of, and many I've only just gotten to read about.
Step one is to migrate the existing set up from Ubuntu to OpenSolaris, and to new hardware in a new co-location facility. On paper, it seems like it should not be too difficult. All the key apps (apache, mysql, postfix, and phpbb) and their configuration work under Solaris. But, I know how migrations go and I am sure some adventures are in store for me.
So, hold on to your hats and glasses... things are likely to get interesting from here on out!

Thursday December 07, 2006
I had a really funny story here.
I had a really good/funny story about Tom's new job here. He started a new job, still at the University but now working for the group that supports the whole campus. He was hired to be their Solaris expert. The CIO of the campus used to be Tom's boss in his old job so he knew what Tom could do. I am thankful he took the job because I think if he had not they would have eliminated Solaris completely. The problem with where he works is they don't know Solaris and so they do not know how to do things correctly. The result is they have made a lot of mistakes which make the systems difficult to admin and make them run less than optimally.
He made me take the funny story down but that is ok because it was the lesson the story teaches that made me post it. Not knowing they are doing things wrong, they are left just thinking Solaris is not as good as Linux. So, instead you get our conversation about why I posted it:
Kristin: you make great stories, http://blogs.sun.com/kamundse
Tom: don't do that
Kristin: no one from (your work) reads my blog
Kristin: ok ok, i think it is funny, and useful
Tom: it is, but I have to work here
Kristin: i dont just put it on because it is funny, it is important
Kristin: these guys could ditch Solaris because they think it sucks
Kristin: because they can't use it right
Kristin: so why is that? why are they doing it all wrong?
Tom: they are used to Linux, they want to force solaris to work like linux and they don't want to learn the advanced solaris tricks
Kristin: yes
Tom: hence the /etc/hosts file that solaris doesn't deal with correctly
Tom: not using the solaris patch process, but rather making it look as much like linux as they can
Tom: add to that disabling auto_home
Kristin: i think Sun people need to know how the system are being used
Kristin: they need to know how people expect things to work and then we either need to provide linux-like interfaces or really get the message out about how to do things right
Kristin: ok... can i put this on my blog?
Tom: ok

Monday May 22, 2006
RMI 2, Kristin 1
When we last saw our hero, she had fought with RMI for 3 hours only to emerge victorious. Alas, the victory was short-lived. Happy that her code was working, she became busy with another project for awhile and had not tried to run it again until this unfortunate day.
It is a sunny Monday morning, the sounds of the ocean and singing of birds creating a relaxing background music, when our hero sits down to look at her code after a month. She first thinks to herself, "now, what was I doing with this?", and the painful memories of a Friday one month earlier begin to surface. She makes no changes to her code and tries running it again through Netbeans. That still does not work. No problem, she says to herself, unaware that the evil overlord, RMI is waiting to attack with his ultimate weapon, the Exception. She tries to start it from the command line using the exact same command which had finally been successful last time.
The horror, the shock, the emptiness in the pit of her stomach destroyed the peaceful setting as she read:
port is 1099
RMI registry started.
Starting server...
Server failed to start.
java.security.AccessControlException: access denied (java.net.SocketPermission 127.0.0.1:1099 connect,resolve)
How can this be?? How has RMI yet again found a way to destroy our hero's happiness and force her to spend yet another day trying to simply get the server to run?
This sad story is still unfolding. Our hero is right now avoiding pulling her hair out by distracting herself writing silly blog entries. How will it end? Stay tuned!
The story continued:
Our hero's mental stability has begun to crumble. Again, for no reason that can be determined, the code mysteriously started working even though nothing was changed. The command which now works is:
java -Djava.security.manager -Djava.security.policy=policyfile -cp $HOME/nbprojects/TRProto2/dist/TRProto2.jar server.TestRunnerServer
Speed readers may not notice the subtle difference. Last time, -Djava.security.policy=common/policyfile worked, this time, our hero had to remove the common/ and use a policyfile outside of the .jar file. As we leave our hero for the day, she is questioning both her heroness and whether to finally believe that now it does indeed work.
Hopefully there will be no need to check back with our hero. Perhaps RMI has once and for all been defeated. Those who have experienced RMI know better. They know somewhere, somehow, RMI is planning its revenge, waiting to strike when least expected.

Friday April 28, 2006
I like Java, but some days...
...some days I really want to smack some Java engineer... or drop-kick my computer downstairs.
So, of all the things I ever do with Java, RMI has to be the most frustrating. It is frustrating because it is so cool and yet so hard to get running! I've used RMI on several projects and still I cannot seem to ever get it to just run on the first try. I am not talking about anything complex here like my two versions of Java being slightly off so the serialization doesn't work (which has happened to me... and between minor revisions of the same major release, 1.4.2). No, I am talking about something simple like creating a server with one function, which is just a stub still, and just making sure it can start up.
I wasted over three hours on this so I thought I'd blog my experience. Maybe I can save someone else the same waste of time. I wish I could say I got all the problems solved but at least I can start the server.
My first problem was with Netbeans. It is completely not obvious what you need to do to make it so you can run rmic on your Remote classes. I spent a good 30 minutes trying to figure this out.
To run rmic on your Remote class:
My second problem was also with Netbeans and I have not solved it yet. Once you have your code compiled and have run rmic, naturally you want to run the code. That didn't work so well for me. The first run resulted in:
Server failed to start.
java.security.AccessControlException: access denied (java.net.SocketPermission 127.0.0.1:1099 connect,resolve)
java.security.AccessControlException: access denied (java.net.SocketPermission 127.0.0.1:1099 connect,resolve)
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:264)
...
at sun.rmi.registry.RegistryImpl_Stub.rebind(Unknown Source)
at java.rmi.Naming.rebind(Naming.java:160)
at server.TestRunnerServer.main(TestRunnerServer.java:143)
(Netbeans stack traces are upside down)
Quickly I thought to myself that I probably need to add something to the java command since I do when I run RMI code at the command line. Normally I run RMI code with the following:
java -Djava.security.manager -Djava.security.policy=PATH_TO_POLICYFILE -cp MYJAR MYCLASS
So I decided to start there. I had put the policyfile in the common package, the class I want to run is in the server package. I added to "Arguments" in Properties:Run -Djava.security.manager -Djava.security.policy=common/policyfile, I figured the classpath should not need setting. It did not help, the same stack trace happens. Well, maybe that was the wrong place to put the -D args, so I tried the "VM Options" which seems to be what the Netbeans Tutorial was saying to do. Still the same problem.
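For anyone puzzling over what that exception is actually asking for, here is a minimal sketch (not my server code) that uses java.net.SocketPermission directly to check whether a candidate grant would cover the permission named in the stack trace. The class name, the helper method, and the three example grants are made up for illustration; the only thing taken from the trace is the permission itself.

```java
import java.net.SocketPermission;

class SocketPermissionDemo {
    // The permission named in the AccessControlException above.
    static final SocketPermission NEEDED =
            new SocketPermission("127.0.0.1:1099", "connect,resolve");

    /** Returns true if a grant for hostAndPorts with these actions would cover NEEDED. */
    static boolean covers(String hostAndPorts, String actions) {
        return new SocketPermission(hostAndPorts, actions).implies(NEEDED);
    }

    public static void main(String[] args) {
        // Hypothetical grants, written the way they would appear in a policy file entry like:
        //   grant { permission java.net.SocketPermission "127.0.0.1:1024-", "connect,resolve"; };
        System.out.println(covers("127.0.0.1:1024-", "connect,resolve")); // port range includes 1099
        System.out.println(covers("127.0.0.1:80", "connect,resolve"));    // different port
        System.out.println(covers("127.0.0.1:1099", "listen"));           // different action
    }
}
```

If the policy file the JVM actually loads has no grant that covers the permission, or the JVM never finds the policy file at all, you would expect exactly the "access denied" error shown above.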
At this point I decided to make sure it was not something with netbeans causing the problem so I tried to run my code from the command line. That did not work either but I did not save what the stack trace was. It was time to start looking on the Java Forums and on Google for some ideas. The first page I came across was talking about how his problem was the /etc/hosts on his machine had an entry like:
127.0.0.1 localhost mymachine
This was causing his stuff to try to connect to mymachine when it should have been localhost. That was not my problem but it made me start to wonder if maybe VPN or NIS was somehow confusing RMI. So, I moved the code to a machine at work rather than my home machine. The machine at work is the place it will run when deployed so I figured what better place to get it started.
With the code moved to my work machine, I tried to run it again, and still it failed. I did not save this stack trace either but I was wondering if maybe my code that automatically starts the registry was not working.
The code to start the registry:
try
{
    Registry registry = LocateRegistry.createRegistry(port);
    System.out.println("RMI registry started.");
}
catch (ExportException ee)
{
    System.out.println("Failed to find RMI registry: port " + port
        + " already in use.");
}
catch (RemoteException re)
{
    System.out.println("Failed to find registry");
    if (debug)
    {
        System.err.println(re);
        re.printStackTrace();
    }
    System.exit(1);
}
So, I started the rmiregistry by hand at the command line. That made the error change, which is progress of a sort. Now I was getting:
java.rmi.ServerException: RemoteException occurred in server thread; nested exception is:
java.rmi.UnmarshalException: error unmarshalling arguments; nested exception is:
java.lang.ClassNotFoundException: server.TestRunnerServer_Stub
java.rmi.ServerException: RemoteException occurred in server thread; nested exception is:
java.rmi.UnmarshalException: error unmarshalling arguments; nested exception is:
java.lang.ClassNotFoundException: server.TestRunnerServer_Stub
So, off to look up that error. It seems people get it pretty often because I found lots of places talking about how to fix it. Here is a summary of the different suggestions:
- Check your files permission carefully. The directory containing the files will be
also required a+rx. Also make sure the path is correct.
- Use the parameter -Dxxx like in the following call:
java -classpath=..
-Djava.rmi.server.codebase=file://<IP>/<Drive-Letter>:/<Path to the
root directory of your packages>/<Class with main>
Do not forget to use the trailing '/'.
- If you have CLASSPATH set to .;xxx;yyy;zzz just start rmiregistry from the directory
where you have server object (to make . work for both rmiregistry and your server
object)
- Explode the jar and call the class file directly.
- Put the .class files in a web directory.
Run rmiregistry.
java -Djava.rmi.server.codebase=http://www.yourmachine.com/~<your-username>/
-Djava.security.policy=java.policy YourClass
rmi://:<port-number>/<a-name>
- You also need to set the classpath for the registry, with -J-Dclasspath=xxx
- Your rmic command needs a -d option specifying the output directory, which should be
the same as the classpath.
- You could try this instead:
javaw -classpath xxx
sun.rmi.registry.RegistryImpl
- I believe the remote class and stub must be visible to the RMIRegistry not just the
remote application. So you will have to unpack at least those two and put them in
a path the rmiregistry can find.
- Important! No classes shall be visible for rmiregistry.
You might get problems later by having it like that.
I think your problem is the codebase, set the codebase to the stubs and I think it
will work.
-Djava.rmi.server.codebase=file:/myhome/classes
- The solution: do NOT use a COMPRESSED jar file with an RMI server.
- I put my Server in a package and when running with "java" the stub could not be found.
java -cp .
java -classpath %CLASSPATH%;d:\my\class\path;
sun.rmi.registry.RegistryImpl
- I've taken to starting the registry in code in my server using
LocateRegistry.createRegistry() and documenting in my server documentation that any
already running registry MUST have the server jar (which contains the stub) on its
classpath.
java -classpath /jars/x.jar:/jars/y.jar
-Djava.rmi.codebase=file:/export/home/guy_t/
-
-Djava.rmi.server.codebase=file:/home/crease/java/basic/rmi_first/first.jar
- If the client and server do not share the same file system (even via NFS), then -Djava.rmi.server.codebase=file:///xyz will not work.
Using ftp instead, should work, but I have never used it. Read the doc, as I understand FTP always need a user name and password .. I do not know how that is shared.
com.foo.bar.MainClass
Woah, is your head spinning yet? Ok, has anyone developing Java thought that if there can be this many ways to mess up just finding the _Stub that maybe something is not right here? So, here I am trying every one of these "solutions" (some really can't be called anything less than cheesy hack). And in the end, none of them worked.
Now, keep in mind through all of this, I have not changed a line of actual code, just played with my CLASSPATH, the args I am sending to the java command, extracting my class files out of the jar, etc.. I do not even know why, perhaps the gods finally took pity on me after 3 hours of this torture, but I decided to kill rmiregistry and then use just the following command:
java -Djava.security.manager -Djava.security.policy=common/policyfile -cp $HOME/nbprojects/TRProto2/dist/TRProto2.jar server.TestRunnerServer
And it worked. W??!?!?!?! I know what the problem was, the rmiregistry process I started did not have the classpath. But, I tried running it a few times before I started running rmiregistry by hand, and it didn't work. What is more, I did the same thing on my home machine where this started, and it worked there too!
Sadly Netbeans is still not working, but at least it runs. If I weren't so ready to be done for the weekend and enjoy my Friday I might be upset that there seems to be no reason why it suddenly works now.
Hey, yea, it is friday night... what the heck am I doing blogging? Enjoy your weekend!

Wednesday April 12, 2006
Top three OS attributes?
I've been told that for Linux, efficiency trumps all other attributes. For OpenBSD, security seems like the number one attribute. Something like Exokernel would put flexibility at the top. For Solaris, robustness and compatibility are good candidates for being in the top three.
Other operating systems have attributes they explicitly don't care about. Plan 9 creators decided against compatibility (with UNIX) in favor of doing everything over the "right" way.
So, what do you think are the top five (increased from three) attributes for Solaris, Linux (you can even break it down by distribution), OS X, the various BSDs, and any other OS you are interested in? This is a list of the top five design goals of the OS, not necessarily what attributes it actually displays.
Here's a list (not complete) of some attributes to consider:
- robustness
- maintainability
- security
- interoperability/compatibility/standardization
- useability
- efficiency/performance/speed
- portability
- flexibility
- scalability
- correctness
Here's my first pass, I'll update this based on any comments.
| Solaris | robustness, compatibility, scalability, security |
| Linux | performance (changed from efficiency), portability |
| FreeBSD | compatibility, robustness, maintainability |
| OpenBSD | security, portability, standardization, correctness |
| NetBSD | portability, interoperability, security |
| OS X | useability, robustness |
| Generic μkernel | flexibility, useability |
| Windows | profits, lock-in, global domination (from comments) |
If an OS is not here, it's because I was not able to come up with anything I felt comfortable with. Oh, and I am not only talking unix-like OSes. I did not leave Windows out intentionally, I just don't know enough about it to know what its design goals are.
For anyone who may be following it, I will get back to the 64-bit processor series in the next post.

Friday March 31, 2006
64-bit processors - Cache and Memory
Who doesn't like cache? Why, just the word makes me feel happy... ah... cache...
Opteron
The Opteron has an on-die L1 and L2 cache for each core.
What does this mean? Cache is memory available to the processor which is faster to access than the main memory on the system. There are different levels of cache which are successively slower and larger. The level 1, or L1, cache is inside the core of the processor. Most L1 caches are split into a data cache and an instruction cache. The data cache stores recently used or computed data, the instruction cache stores program instructions. It is the fastest and smallest cache. The level 2, or L2, cache is larger and slower and may not be in the processor core. For processors where the L2 cache is not in the core, it is still on the processor chip. Some processors include a level 3, or L3, cache which may or may not be on the chip. It is larger and slower than the L2 cache. If it is off-chip, it is often accessed through the memory bus. Cache is used by the processor to keep data and program instructions which have been recently used by the program for faster access in the future. Changes to the cache size and speed have significant effects on the processor's performance. Cache takes a large amount of space on the chip and increases the cost of production for the processor.
The L1 instruction and data caches are each 64K, split into four 8K banks, and are 2-way associative.
What does this mean? The splitting of memory into banks is a common technique. It allows data to be spread across the banks for quicker access. This is called interleaving. Associativity refers to how many possible places in the cache a particular piece of data can be stored. In a cache with 1-way associativity, there is only one location a particular piece of data can be stored. This location is determined by using values such as the virtual or physical memory address of the data. Because the data can be in only one place, looking in the cache to see if the data needed is there is a fast operation. Since the cache is not very big, often more than one piece of data will map to the same single location. This means it will overwrite whatever data is already there. By increasing associativity, the number of places a piece of data can be in the cache increases. This decreases the frequency of data overwriting other data, but makes finding the data slower. For a 4-way associative cache the processor must look in as many as four places to find the data needed, in an 8-way associative cache it must look in up to eight places, and so on. A fully-associative cache means a piece of data can be stored in any location, which means the processor may need to search every line in the cache to find the data needed.
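To make the overwriting problem concrete, here is a toy sketch in Java. It is not how any real processor is built: the class name, the block numbers, and the "evict the oldest block in the set" rule are all assumptions picked to keep it short. It compares a 1-way cache with a 2-way cache of the same total size when two blocks that map to the same location are used alternately.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model: a cache of numSets sets with `ways` slots each, tracking block numbers only.
class AssociativityDemo {
    /** Counts misses for a sequence of block numbers; a full set evicts its oldest block. */
    static int countMisses(int numSets, int ways, int[] blocks) {
        List<Deque<Integer>> sets = new ArrayList<>();
        for (int i = 0; i < numSets; i++) {
            sets.add(new ArrayDeque<>());
        }
        int misses = 0;
        for (int block : blocks) {
            // The block number decides the one set this block is allowed to live in.
            Deque<Integer> set = sets.get(block % numSets);
            if (set.contains(block)) {
                continue; // hit: the block is already in one of the set's slots
            }
            misses++;
            if (set.size() == ways) {
                set.removeFirst(); // set is full, so the oldest block is overwritten
            }
            set.addLast(block);
        }
        return misses;
    }

    public static void main(String[] args) {
        int[] pattern = {0, 8, 0, 8, 0, 8}; // two blocks that collide in both layouts
        // Both layouts hold 8 blocks in total.
        System.out.println("1-way misses: " + countMisses(8, 1, pattern));
        System.out.println("2-way misses: " + countMisses(4, 2, pattern));
    }
}
```

In the 1-way layout the two blocks keep knocking each other out, so every access misses. In the 2-way layout they share a set and only the first touch of each one misses.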
They both use 64-byte lines.
What does this mean? When data is stored in the cache, it is stored in lines. The size of a line can be different between the L1 and L2 cache. When data is fetched into the cache, it is done one line at a time. The bigger the line, the more data is fetched. This has advantages and drawbacks. Getting more data in a single fetch can speed up the processor since fetches are time consuming. But, fetching more data means overwriting more data in the cache, which can increase the need for further fetches later because needed data was overwritten.
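As a small illustration of what "one line at a time" means, this sketch works out which 64-byte line a byte address falls in and which bytes come along with it. The address 1000 and all of the names are made up for the example.

```java
class CacheLineDemo {
    static final int LINE_SIZE = 64; // bytes per line, as on the Opteron caches above

    /** Index of the line that contains this byte address. */
    static long lineNumber(long address) {
        return address / LINE_SIZE;
    }

    /** Position of the byte inside its line. */
    static int offsetInLine(long address) {
        return (int) (address % LINE_SIZE);
    }

    /** Address of the first byte fetched along with this address. */
    static long lineStart(long address) {
        return address - offsetInLine(address);
    }

    public static void main(String[] args) {
        long address = 1000; // hypothetical byte address
        long start = lineStart(address);
        System.out.println("line " + lineNumber(address)
                + ", offset " + offsetInLine(address)
                + ", fetch brings in bytes " + start + ".." + (start + LINE_SIZE - 1));
    }
}
```

So asking for a single byte drags its 63 neighbours into the cache as well, which is the advantage and the drawback in one.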
The unified L2 cache is 1M, 16-way associative and can handle 10 simultaneous requests.
What does this mean? While the L1 cache is split between a data and instruction cache, the L2 cache is a unified cache, so both data and instructions are in the same L2 cache. Notice the associativity of the L2 cache is much higher than the L1 cache. If a piece of data is not in the L1 cache or in the L2 cache, unless the machine has an L3 cache it must be loaded from memory. This is a very slow operation (1000s of cycles). To increase the odds that needed data will be in the L2 cache, the associativity is increased. This makes it slower but it is still much faster than main memory. The most basic cache can handle a single request each cycle. The L2 cache in the Opteron can handle 10 requests each cycle.
The L2 cache uses an LRU replacement policy.
What does this mean? LRU stands for "least recently used". This means when new data needs to be stored in the cache, it overwrites the data in the cache which was accessed the longest time ago. The idea is that data which has not been accessed in a long time is less likely to be needed in the future than data which has been accessed more recently.
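The same policy is easy to try out in software. This is only an analogy, since a hardware cache applies LRU within each set and does not use a hash map, but java.util.LinkedHashMap in access-order mode shows the behavior. The class name and the capacity of 2 are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A tiny software LRU cache built on LinkedHashMap's access-order mode.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // true = order entries by most recent access
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each put; evict the least recently used entry when over capacity.
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");    // "a" is now more recently used than "b"
        cache.put("c", 3); // over capacity, so "b" is the one overwritten
        System.out.println(cache.keySet());
    }
}
```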
The Opteron does not have an L3 cache. The on-chip memory controller (named Northbridge) is 128-bits and operates at the same frequency as the core. It has a maximum bandwidth of 5.3G/s.
What does this mean? To access main memory, the processor sends requests to the memory controller. The memory controller knows how to access the memory and is responsible for returning to the cache the data requested from memory. Different memory controllers can handle different sizes of data. For the Opteron, this is 128 bits. It operates at the same frequency as the core, which means it sends and receives data at the same frequency as the processor operates. If it operated at half-core speed then it would be able to send or receive data only every other cycle. The maximum bandwidth refers to the amount of data that can be transferred per second. In this case, it can transfer a maximum of 5.3G each second to or from the memory.
Memory access on an Opteron returns lines critical-word first.
What does this mean? When data is requested from memory, it must return in a whole line. A word is the base unit for data. Lines are made of words. When the processor requests a piece of data from memory, it can be returned in one of two ways. The line may come with all the words in order. In this case, if the processor needs the 5th word in the line, it has to wait for the first four to come back from memory before it gets the one it wants. If the number of words in a line is small, this is not a big delay. If it is large, say 100 words, the processor could have to wait a while if the word it wants is the 100th one. The other way to return the data is for the needed word to come first and the rest of the line comes after. This is critical-word first. It gets the data to the cache faster but the words in the line must be reordered at both ends of the transaction.
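Here is a small sketch of the two return orders. The 8-word line is made up, and having the rest of the line wrap around after the critical word is my assumption about one common way to do it, not something from the Opteron documentation.

```java
import java.util.Arrays;

class CriticalWordFirstDemo {
    /** Order in which words arrive when the line is returned front to back. */
    static int[] inOrder(int wordsPerLine) {
        int[] order = new int[wordsPerLine];
        for (int i = 0; i < wordsPerLine; i++) {
            order[i] = i;
        }
        return order;
    }

    /** Order in which words arrive when the requested word is sent first (wrap-around assumed). */
    static int[] criticalWordFirst(int wordsPerLine, int critical) {
        int[] order = new int[wordsPerLine];
        for (int i = 0; i < wordsPerLine; i++) {
            order[i] = (critical + i) % wordsPerLine;
        }
        return order;
    }

    public static void main(String[] args) {
        int wordsPerLine = 8; // hypothetical line of 8 words
        int critical = 4;     // the processor wants the 5th word (index 4)
        System.out.println("in order:            " + Arrays.toString(inOrder(wordsPerLine)));
        System.out.println("critical-word first: "
                + Arrays.toString(criticalWordFirst(wordsPerLine, critical)));
    }
}
```

With in-order return the processor waits for four other words before the one it asked for shows up. With critical-word first it arrives at the head of the line.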
The processor has a 40-bit physical memory size and a 48-bit virtual memory size.
What does this mean? Physical memory is usually the RAM on the system, commonly installed as DIMMs, SIMMs, or RIMMs. A 40-bit physical memory size means that the processor can use memory addresses up to 40 bits in size. The larger this value is, the larger the maximum amount of memory on the system can be. In this case it is 2^40, or 1 terabyte of memory. Processors also use a type of memory addressing called virtual memory. Virtual memory is used in multitasking computers (ones that run more than one program at the same time, pretty much any computer you've used in the last 25 years) to give programs a contiguous memory space. Physical memory addresses, which may be spread all over the memory, are mapped to new virtual addresses which are in one continuous block. The virtual memory size does not need to be the same size as the physical memory size, as is seen in the case of the Opteron.
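The arithmetic, and the idea of mapping scattered physical memory into one contiguous virtual block, can be sketched in Python. The 4096-byte page size and the page table contents below are invented for the example; real page tables are far more involved.

```python
def addressable_bytes(address_bits):
    """How many distinct byte addresses fit in the given number of bits."""
    return 2 ** address_bits

TIB = 2 ** 40
physical = addressable_bytes(40)
virtual = addressable_bytes(48)
print(physical // TIB, virtual // TIB)   # 1 256

PAGE_SIZE = 4096                    # assumed page size
page_table = {0: 7, 1: 2, 2: 9}     # virtual page -> physical frame (made up)

def translate(virtual_address):
    """Map a contiguous virtual address onto a scattered physical frame."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    return page_table[page] * PAGE_SIZE + offset

print(translate(4100))              # 8196: page 1, offset 4 -> frame 2
```

The program sees pages 0, 1, 2 in a row, while the data actually sits in frames 7, 2, and 9.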
Pentium D
The Pentium D has an 8-way associative, 12K L1 instruction cache (trace cache), which uses 64-byte lines, for each core.
Compared to the Opteron, the L1 cache in the Pentium D is much smaller, 12K rather than 64K. It does have a higher degree of associativity than the Opteron's cache. Being smaller lends itself to this higher associativity because there is less cache space to search through. It uses the same line size, 64-byte, as the Opteron.
There is also a 4-way associative, 16K L1 data cache on each core.
The L1 data cache is larger than the L1 instruction cache and is less associative. It is not uncommon for processors to have different sized L1 instruction and data caches, as we'll see as we look at more processors.
The L1 data cache is non-blocking and allows up to four cache requests.
A non-blocking cache allows the processor to issue more than one request at a time. The Pentium D can handle up to 4 requests at a time.
The load latency is two cycles for an integer and six for floating point.
Load latency is the amount of time it takes for requested data to be returned.
Each core has a unified L2 cache that is 1M for Smithfield and 2M for Presler. The L2 cache is 8-way associative and is non-blocking with a load latency of seven cycles. The bandwidth between the L1 and L2 cache is 48G/s.
What does this mean? Again, a unified L2 cache means both instructions and data are in the same cache rather than separated as they are in the L1 cache. Although the L1 cache in the Pentium D is more associative than the L1 cache in the Opteron, its L2 cache is less associative. A maximum of 48G can be transferred each second between the L1 and L2 cache.
All the caches have a LRU replacement policy. The hardware also supports prefetching. It attempts to stay 256 bytes ahead of the current data access location.
What does this mean? Prefetching is loading data or instructions into the cache before they are requested by the processor. In the case of the Pentium D, it assumes that needed data will immediately follow data currently being requested and loads the next 256 bytes of data as well. The advantage to prefetching is that often data that will be needed in the near future is immediately following the currently requested data. By loading it before it is requested, the processor eliminates the need for the instruction to request the data from memory since it will be in the cache, which is much faster. The disadvantage to prefetching is that when data is prefetched, other data in the cache must be overwritten. If prefetching is too aggressive, it will overwrite enough existing data in the cache that it will cause extra requests to memory to return the needed data which was lost.
The memory controller for a Pentium D is not on the chip. It is accessed through the front-side bus (FSB).
What does this mean? Unlike the Opteron, the Pentium D does not have its memory controller built into the chip. Doing this makes the processor smaller and the design simpler, and may reduce production costs. The drawback is that communication across the front-side bus is slower than on the chip. Since the memory controller must use the bus eventually to fetch data from main memory, this slowdown may not be an issue since the bottleneck of the bus will occur either way. Having an off-chip memory controller also means a multi-processor machine could share a single memory controller among multiple processors, rather than having one in each chip. Again the advantages are the same, the disadvantage is that the processors may have to wait while the memory controller is executing another processor's request. This is a general disadvantage with any shared resource.
The FSB on the Pentium D is 800MHz with a theoretical maximum bandwidth of 6.4G/s. The Pentium D uses 40-bit physical address and 64-bit virtual address sizes.
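The 6.4G/s figure can be reproduced with a quick calculation. The sketch below assumes a 64-bit (8-byte) wide data bus, which is not stated in the paper, and treats the 800MHz rating as 800 million transfers per second.

```python
def peak_bandwidth(transfers_per_second, bus_width_bits):
    """Theoretical peak in bytes per second: transfers times bytes each."""
    return transfers_per_second * bus_width_bits // 8

# Assumption: 64-bit wide data bus, 800 million transfers per second.
fsb = peak_bandwidth(800_000_000, 64)
print(fsb / 1e9)   # 6.4
```

This is a best-case number; real transfers rarely keep the bus busy every cycle.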
Power5
The Power5 has a separate L1 cache for each core. The L1 instruction cache is 64K, 2-way associative, and is direct mapped from the L2 cache.
What does this mean? The L1 instruction cache direct mapped from the L2 cache means that everything in the L1 instruction cache is duplicated in the L2 cache. Since the L2 cache is shared between the cores on the Power5, this allows the cores quick read access to each other's L1 instruction cache.
The L1 data cache is 32K and 4-way associative. The L2 cache is shared between the two cores. It is 1.875M divided into three slices and is 10-way associative.
What does this mean? Unlike the Pentium D and the Opteron, the Power5 shares its L2 cache between its processors. Advantages to doing this are that the size of the processor is smaller than if it had two separate L2 caches, and it allows the cores to share data and instructions more quickly. Like the Opteron's L1 caches, the L2 cache on the Power5 is divided into slices to allow interleaving of data.
The L2 cache uses 128-byte lines and has a bandwidth to the L1 cache of 64G/s. The Power5 also has an off-chip 36M L3 cache with an on-chip directory.
What does this mean? The Power5 is one of the processors examined which uses an L3 cache. Its L3 cache is the largest in size of the processors examined in this paper. An on-chip directory of the L3 cache is provided so that the processor can determine more quickly if needed data is in the L3 cache or must be fetched from main memory.
This cache is directly connected to the L2 cache. Having the L3 cache connected directly to the L2 cache rather than accessed through the memory controller speeds up access time. The bus to the L3 cache operates at half-core speed.
Unlike its predecessor, the Power4, and the UltraSPARC IV+, the L3 cache on the Power5 is not accessed through the on-chip memory controller. In order to speed up access time to the L3 cache, the L3 cache is connected directly to the L2 cache through a back-side bus.
UltraSPARC IV+
Panther moves from the 2 levels of cache in the UltraSPARC IV to three levels of cache. It has a 64K L1 instruction cache which is divided into two 32-byte subblocks.
Again, this cache uses interleaving by splitting the cache into subblocks.
The L1 data cache is also 64K and uses a write-through policy.
What does this mean? The job of the cache is to provide a fast access copy of data in main memory. When this data is changed from its original value, that change must be written back to main memory. There are two common ways to do this, write-through, and write-back (also called copy-back). In write-through, when data is written to the cache it is simultaneously written to main memory. The advantages to this policy are that it is simpler to implement and it keeps the cache and main memory consistent at all times. With write-back, changes to data in the cache are only sent to main memory when the changed cache line is evicted (overwritten). This leads to less traffic on the memory bus and speeds up system performance but comes with the risk that if the computer were to have an event such as power loss or a system crash, the changed data in the cache may be lost.
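The difference in memory traffic between the two policies can be seen in a toy Python model. The `Cache` class and the address "X" are hypothetical, and the model ignores everything except counting writes to main memory.

```python
class Cache:
    """Toy cache that counts how often main memory is written."""

    def __init__(self, policy):
        self.policy = policy      # "write-through" or "write-back"
        self.lines = {}           # address -> value held in the cache
        self.dirty = set()        # addresses changed but not yet in memory
        self.memory = {}
        self.memory_writes = 0

    def _write_memory(self, address, value):
        self.memory[address] = value
        self.memory_writes += 1

    def store(self, address, value):
        self.lines[address] = value
        if self.policy == "write-through":
            self._write_memory(address, value)   # memory updated right away
        else:
            self.dirty.add(address)              # memory updated on eviction

    def evict(self, address):
        value = self.lines.pop(address)
        if address in self.dirty:
            self.dirty.discard(address)
            self._write_memory(address, value)

def run(policy):
    cache = Cache(policy)
    for value in (5, 6, 7):
        cache.store("X", value)
    cache.evict("X")
    return cache.memory_writes, cache.memory["X"]

print(run("write-through"))   # (3, 7)
print(run("write-back"))      # (1, 7)
```

Both policies end with the same value in memory, but write-back gets there with one memory write instead of three. The cost is that between the first store and the eviction, memory holds a stale value.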
The L1 cache includes a 2K, 64-byte line prefetch buffer accessed in parallel with the L1 instruction cache.
What does this mean? With the Pentium D, we discussed data prefetching. Some processors include the ability to prefetch program instructions. The IV+ can store up to 2K of prefetched instructions in a special prefetch buffer.
Panther’s L1 cache also includes a 2K fully associative write cache.
What does this mean? A write cache allows the processor to continue on with other operations rather than wait for a write to complete.
The L2 cache for the UltraSPARC IV+ was moved on-chip and is shared between the cores. It is 2M, 4-way associative, and operates at half-core speed. It also is completely inclusive of all L1 caches. The L2 cache uses a copy-back policy to decrease bus traffic. The L3 cache on the Panther is 32M and 4-way associative. It has 64-byte lines and also follows a copy-back policy. L3 tags are kept on-chip.
Like the Power5, the IV+ also has an off-chip L3 cache with an on-chip directory. A difference between the Power5 and the IV+ is the Power5's use of a back-side bus for access to the L3 cache.
The L3 cache is a victim cache, only being written to when things are evicted from the L2 cache. On a hit, the L3 line is copied back to the L2 cache and then invalidated in the L3 cache.
What does this mean? When needed data is found in the L3 cache, it is copied into the L2 cache and then removed from the L3 cache. The purpose of the L3 cache on the IV+ is to store data which has been used by the processor in the past but was evicted from the L2 cache. This data is a "victim" of being overwritten. When it is needed again, it is moved to the L2 cache so it can be accessed quicker in the future and since it is no longer a victim, it is removed from the L3 cache.
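The victim cache behavior can be sketched in Python. This is a simplified model under stated assumptions: the `TwoLevel` class is hypothetical, the L2 holds only two lines and uses LRU replacement, and the L3 is treated as unbounded so the example stays short.

```python
from collections import OrderedDict

class TwoLevel:
    """Toy L2 cache backed by an L3 victim cache."""

    def __init__(self, l2_capacity):
        self.l2_capacity = l2_capacity
        self.l2 = OrderedDict()   # least recently used first (assumed LRU)
        self.l3 = set()           # holds only lines evicted from the L2

    def _fill_l2(self, address):
        if len(self.l2) >= self.l2_capacity:
            victim, _ = self.l2.popitem(last=False)
            self.l3.add(victim)   # the L3 is written only on an L2 eviction
        self.l2[address] = True

    def access(self, address):
        if address in self.l2:
            self.l2.move_to_end(address)
            return "L2 hit"
        if address in self.l3:
            self.l3.remove(address)   # invalidate the L3 copy
            self._fill_l2(address)    # and move the line back into the L2
            return "L3 hit"
        self._fill_l2(address)        # comes from main memory
        return "miss"

caches = TwoLevel(l2_capacity=2)
results = [caches.access(a) for a in ("A", "B", "C", "A")]
print(results)              # ['miss', 'miss', 'miss', 'L3 hit']
print(sorted(caches.l3))    # ['B']
```

Loading C pushes A out to the L3. When A is requested again it is found there, moved back to the L2, and B becomes the new victim.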
In cases where the two running threads cannot cooperate using the shared L2 and L3 cache, Panther has a mechanism for pseudo-splitting the shared caches. When split, both threads can read all of the cache but can only write to half of it.
What does this mean? How can threads not cooperate? If the threads keep overwriting each other's data in the cache, it slows both threads down. By splitting the cache into two separate areas and only allowing each thread to write to one half, the processor can prevent the two threads from clobbering each other. The reason for not having this be the default setup is that doing so cuts the sizes of the L2 and L3 cache in half from the view of each core. When they are not clobbering each other, having the larger cache sizes significantly increases system performance. In the extreme case, if one thread was idle, the other thread could be utilizing all the cache rather than being restricted to half.
UltraSPARC T1
Each core has a 16K, 4-way associative L1 instruction cache that uses 32-byte lines. The L1 data cache, also in each core, is only 8K, 4-way associative, uses 16-byte lines, and has a write-through policy.
One thing you may notice right away is the big difference in size of the L1 cache from the UltraSPARC IV+. This is just one of the many big changes in the T1 from the rest of the current SPARC processor line. The T1 does not just differ significantly from other SPARCs, but from the other 64-bit processors as well. Although its L1 instruction cache is larger than the Pentium D, and the same size as the Itanium, its data cache is noticeably smaller than any other processor.
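The size, associativity, and line size quoted for the T1's L1 caches determine how many sets each cache has. The sketch below uses the standard relationship (sets = size / (ways x line size)); the function names and the example address are only for illustration.

```python
def number_of_sets(size_bytes, ways, line_bytes):
    """Sets in a set-associative cache: total size / (ways * line size)."""
    return size_bytes // (ways * line_bytes)

def set_index(address, line_bytes, sets):
    """Which set an address falls into: line number modulo set count."""
    return (address // line_bytes) % sets

K = 1024
t1_icache = number_of_sets(16 * K, ways=4, line_bytes=32)
t1_dcache = number_of_sets(8 * K, ways=4, line_bytes=16)
print(t1_icache, t1_dcache)                   # 128 128
print(set_index(0x1234, 16, t1_dcache))       # 35
```

Although the data cache is half the size of the instruction cache, its lines are also half as long, so both caches end up with the same number of sets.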
The L2 cache is shared between cores and is accessed through a crossbar interconnection network.
What does this mean? An interconnection network allows communication between the core and resources such as other cores, memory, cache, I/O, etc. In the case of the T1, with eight cores sharing the L2 cache and communicating with each other, a standard linear connection network (basically a wire connecting all eight cores to each other and the L2 cache) would get bogged down quickly. A crossbar is a type of connection switch which can handle more traffic. Imagine it is a grid of wires connecting each core and the L2 cache. This provides multiple paths for data to get from point A to point B without having a collision with data going from point C to point D, or even to point B.
The crossbar provides more than 200G/s of bandwidth.
Here is another difference between the T1 and other processors. The bandwidth on the crossbar interconnect is three to four times as much as the on-chip bandwidth of the other processors. The T1 does have eight cores to support, rather than two, so this bandwidth size is not surprising.
The L2 cache is 3M banked four ways, 12-way associative, and uses 64-byte lines. Data is interleaved across the banks in 64-byte granularity. The L2 cache has a directory of all eight L1 caches. There are four on-chip memory controllers shared by the eight cores and accessed through the crossbar. The memory bus on the T1 is significantly larger than other processors with a bandwidth of 20G/s.
Recall that the bandwidth sizes for the other processors were less than half of this, with the next largest being the UltraSPARC IV+ at 9.6G/s.
The T1 uses 40-bit physical addresses split into two sections, memory and I/O addresses, based on bit 39.
What does this mean? A 40-bit physical memory size means that memory addresses are 40 bits long. The last bit (the range is 0 to 39, not 1 to 40) tells the processor whether this is a memory or I/O address. What is an I/O address? An I/O address is an address that belongs to one of the system's I/O devices. This could be video, network, or other devices. Remember the T1 was designed to be "network facing" so it expects to do a lot of I/O functions.
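Testing a single bit of an address is a one-line operation. In the Python sketch below, the assumption that a set bit 39 means "I/O" (and a clear bit means "memory") is mine; the paper only says the split is based on that bit.

```python
IO_BIT = 39   # bits are numbered 0 to 39 in a 40-bit address

def is_io_address(physical_address):
    """True if bit 39 is set (assumed here to mean an I/O address)."""
    return bool((physical_address >> IO_BIT) & 1)

highest_memory = (1 << IO_BIT) - 1   # bit 39 clear, all lower bits set
lowest_io = 1 << IO_BIT              # bit 39 set, all lower bits clear

print(is_io_address(highest_memory))  # False
print(is_io_address(lowest_io))       # True
```

Splitting on the top bit divides the 40-bit address space into two equal halves.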
It uses a 48-bit virtual memory size.
Xeon
The Xeon processor follows the same general design as the Pentium D. The L1 cache on the Xeon is the same as in the Pentium D. The L2 cache on the Nocona comes in 1M or 2M sizes. Paxville uses a 2M L2 cache size only. The L2 cache is on a 200MHz, shared bus to the off-chip memory controller.
What does this mean? The L2 cache is accessed via the same bus that goes to the off-chip memory controller, though the L2 cache is on the chip.
The memory bus is 800MHz with a maximum bandwidth of 6.4G/s.
Itanium 2
There are three levels of cache available on-die with this processor. Both the L1 data and instruction cache are 16K and 4-way associative. The L1 instruction cache supports simultaneous demand and prefetch.
What does this mean? In the same cycle, the L1 instruction cache can prefetch instructions as well as respond to requests for instructions from the cache by the core.
It uses a 64-byte line, which holds 4 instruction bundles.
The Itanium 2 is a VLIW processor. Up to three instructions are combined into instruction bundles. What does this mean? First, there are two types of processors, those that can issue more than one instruction at a time, and those that cannot. Those that can are split into two groups, called superscalar and VLIW. The more common of the two is superscalar. VLIW stands for "very long instruction word" and the way it works is that each cycle a "bundle" which can contain several instructions is issued.
The L1 data cache uses a write-through policy and can support 2 loads and 2 stores per cycle.
What does this mean? In the Itanium 2 (not exclusively), there are several different execution units which allow more than one instruction to be executing in the core at the same time. The L1 data cache allows two different instructions to load and two instructions to store in the same cycle.
The Itanium 2 uses a scoreboard system to facilitate a non-blocking L1 data cache. This scoreboard allows the processor to continue executing even with multiple L1 data misses by stalling the instruction issue group of the instruction that had the miss.
What does this mean? The scoreboard on the Itanium 2 keeps track of earlier L1 data cache misses. As mentioned above, instructions are issued as a group on the Itanium 2. When an instruction in a group needs data that matches an entry in the scoreboard, the entire issue group is stalled, meaning it cannot continue to execute, until the value needed becomes available.
Stalling an instruction group does not cause a pipeline flush.
What does this mean? The pipeline is something like an assembly line for executing an instruction. The instruction goes through stages, each of which performs some small task. We'll go into much more detail about the instruction pipeline next time. In some cases, the pipeline must be flushed, which means emptied of all currently executing instructions. When this happens, all the work done for instructions in the pipeline is lost, and the cycles are wasted. In the case of a stall, on the most simple processor, this would mean that no instructions can execute because the whole pipeline gets stopped. It would be like shutting down the conveyor belt in the assembly line. In most current processors, such as the Itanium, the pipeline is more sophisticated and can allow any instruction which is ready to execute a particular stage to go ahead. This is called out-of-order execution (which we've previously explained and will go into again in a later entry).
The L2 cache is a unified, 256K, 8-way associative cache which uses a 128-byte line. It has a latency as low as 5-cycles.
What does this mean? The L2 cache on the Itanium can return data in 5 cycles in the best case, most likely with integer values. For larger data, such as floating-point numbers, especially double-precision or larger, it will take many more than 5 cycles.
It operates out-of-order but L1 misses are stored in a FIFO for correct ordering. It can handle 4 data and 1 L3 request per cycle.
What does this mean? The L2 cache on the Itanium accepts multiple requests to send or write data each cycle. It may not process these requests in order. For reading data, this is not a problem but for writing data, this presents a problem. If two instructions operate on the same value and send requests to write to that value those writes must occur in order or the value will be incorrect in memory. An example is X=5, X=6, X=7 (counters are very common in programming). At the end of this sequence, X should be 7 in the L2 cache but if the writes are processed out-of-order it could be 5 or 6. FIFO means first-in, first-out. To keep writes in order, they are not processed directly but placed in a FIFO queue and the L2 cache can write the values when it has time, always taking the value at the front of the queue (sort of like the line at the DMV) so that writes happen in order.
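The X=5, X=6, X=7 example can be replayed in Python. This is only a sketch of the ordering idea: `apply_writes` and the scrambled order are made up for illustration and say nothing about how the Itanium 2 actually schedules its L2 requests.

```python
from collections import deque

def apply_writes(writes):
    """Apply (address, value) writes one after another; last write wins."""
    cache = {}
    for address, value in writes:
        cache[address] = value
    return cache

requests = [("X", 5), ("X", 6), ("X", 7)]

# FIFO: always take the request at the front of the queue.
fifo = deque(requests)
in_order = [fifo.popleft() for _ in range(len(requests))]
print(apply_writes(in_order)["X"])    # 7, the correct final value

# Without the queue, the same writes might be processed in another order.
scrambled = [requests[2], requests[0], requests[1]]
print(apply_writes(scrambled)["X"])   # 6, a stale value
```

The queue does not make the writes any faster; it only guarantees that the last value written is the last value stored.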
One of the big differences in the Itanium 2 compared to other 64-bit processors is the on-die L3 cache. The L3 cache can be 3, 6, or 9M, is unified, and 12-way associative. It has a minimum latency of 12 cycles. It uses 128-byte lines, does not support partial line request, and returns lines critical-word first.
What does this mean? When requesting data from the L1 or L2 cache, only the exact word needed, which is part of a line, is returned. The Itanium 2 L3 cache operates like main memory, returning whole lines at a time only.
A maximum of 84.8G/s of data can be accessed on the chip.
The bandwidth to main memory for the Itanium is not discussed until later in the paper; it is 6.4G/s.
The Itanium 2 uses a 50-bit physical address size and a 64-bit virtual address size.
Sources
6. Intel Corporation – “Intel Xeon Processor-based Servers: Performance, headroom, and versatility for front-end applications, small-business servers, and High-Performance Computing”. www.intel.com, 2005
7. Sun Microsystems, Inc. – “UltraSPARC Architecture 2005 Specification (Hyperprivileged Edition)”. www.opensparc.net, 2006
8. Sun Microsystems, Inc. – “UltraSPARC T1 Supplement to UltraSPARC Architecture 2005 Specification (Hyperprivileged Edition)”. www.opensparc.net, 2006
9. Sun Microsystems – “UltraSPARC IV+ Architectural Overview”. www.sun.com, 2005
10. J. De Gelas - "Opteron: Pushing x86 to the Limit". www.aceshardware.com, 2003
11. S. Wasson - "AMD's dual-core Opteron processors: Because four is better than two". techreport.com, 2005

Thursday March 30, 2006
64-bit processors - A few questions from last time
I wanted to clarify a few concepts from last time that I got questions on.
The first one is pretty fundamental to the whole discussion... what exactly is 64-bit? All data inside a computer is stored as a series of 1's and 0's. Each bit is like a digit in a normal (base-10) number. The more bits you have, the bigger the data you can represent, just like with numbers you use every day. Until the last 10 years or so, computers have been primarily 32-bit. What does this mean? It means that memory addresses, integers, etc. were at most 32 bits in size. If something could not be represented in just 32 bits, it had to be split into multiple parts. When doing an operation on these bigger numbers or addresses, the processor would have to handle that data in parts, rather than as a whole, which is slow. A 64-bit processor means the addresses, integers, and other data are at most 64 bits. This means the processor can handle much larger data, which increases speed. Take a look at the link above for more details.
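To show what "handling the data in parts" looks like, here is a Python sketch of adding two 64-bit numbers using only 32-bit pieces, the way a 32-bit processor would have to. The function name is made up, and Python integers are not really limited to 32 bits, so the masks stand in for the hardware limit.

```python
MASK32 = 0xFFFFFFFF   # the largest value that fits in 32 bits

def add64_in_32bit_parts(a, b):
    """Add two 64-bit values using only 32-bit halves plus a carry."""
    low = (a & MASK32) + (b & MASK32)       # add the lower halves
    carry = low >> 32                       # 1 if the low half overflowed
    high = (a >> 32) + (b >> 32) + carry    # add the upper halves and carry
    return ((high & MASK32) << 32) | (low & MASK32)

a = 0x00000001_FFFFFFFF
b = 1
print(hex(add64_in_32bit_parts(a, b)))   # 0x200000000
```

That is two additions, a carry check, and some reassembly for what a 64-bit processor does in a single add.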
Another one is simultaneous multithreading. To clarify this one we're going to take a big step back. How do programs get run at all? Way back when people had to walk to school uphill both ways in the snow (and they liked it), computers were not very sophisticated. They ran one program from beginning to end, then you could load and run another program from beginning to end. As you may imagine, this was fairly limiting as to what you could do with a computer and people got tired of listening to Bob and Joe fight over whose turn it was to run a program. So, the idea of scheduling was a big hit in the computer world. Scheduling is primarily the job of your operating system. At any given time, even if you have not launched any other application, there are several different programs running. Processes which are ready to run are placed in a queue. The operating system gets a program from the queue and starts it running on the processor. Every process has a time limit so others get a chance to run. A process either runs until its time limit or until something happens that causes it to sleep. Then the operating system picks another process to run and so on. It turns out that most processes spend a lot of time waiting and so when they are having their turn running, they may actually not be doing anything but sitting there. This means while they sit doing nothing, other programs which could be running are waiting. Another interesting thing about many programs is that the work they do can be split up into several independent but related tasks, called threads. Take a web server for example: at a given time it may be serving pages to several users but that work is all independent of each other.
Threading allows a programmer to split these tasks into multiple processes, which run independently of each other on the computer. Doing this can make the program run a lot faster. So, how can we make the processor better able to deal with these two things? Well, one way is to add more processors to the machine. This gives each process more places to run so it addresses threading, but what about the problem of processes wasting time? What if when a process was waiting, we take it off the processor and let some other process run? This is called "coarse-grained multithreading" and is a type of temporal multithreading. The Montecito processor (next generation Itanium) uses this type of threading. Another way to let processes share the processor is similar to how the operating system schedules processes. If we have a queue of threads ready to run, each cycle we can pick a thread and issue instructions from that thread into the processor. The big difference from coarse-grained multithreading is that at a given time, instructions from more than one thread are in the processor core at the same time. This is called "fine-grained" multithreading. The UltraSPARC T1 uses fine-grained multithreading. The problem with temporal threading is that, for processors which can issue more than one instruction per core per cycle (which is all but the T1), in a given cycle it may not be able to issue the maximum number, which leads to waste. What do I mean by this? Say processor X can issue 5 instructions per cycle. For a thread ready to run this cycle, it may not have 5 instructions which can be issued this cycle, maybe only 3 are issued. That means this cycle, the processor wasted 2 issue slots and now inside the processor there are fewer instructions running than could be running.
Simultaneous multithreading, or SMT, addresses this by allowing instructions from more than one thread to issue each cycle. So for the cycle we just mentioned on processor X, it could try to fill those two unused issue slots with instructions from another thread. All the remaining processors which do hardware multithreading use this type of threading. We'll discuss threading more in a future blog entry but hopefully this gives you a better idea about what SMT is.
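The issue slot example for processor X can be written out in Python. Processor X and its issue width of 5 come from the example above; the second thread having 4 ready instructions and the "first thread gets first pick" rule are assumptions made to keep the sketch simple.

```python
ISSUE_WIDTH = 5   # processor X can issue 5 instructions per cycle

def issue_single_thread(ready):
    """Temporal multithreading: one thread per cycle. Returns (issued, wasted)."""
    issued = min(ready, ISSUE_WIDTH)
    return issued, ISSUE_WIDTH - issued

def issue_smt(ready_per_thread):
    """SMT: later threads fill slots the earlier ones left empty."""
    slots = ISSUE_WIDTH
    issued = []
    for ready in ready_per_thread:
        take = min(ready, slots)
        issued.append(take)
        slots -= take
    return issued, slots   # instructions issued per thread, slots wasted

print(issue_single_thread(3))   # (3, 2): two issue slots wasted
print(issue_smt([3, 4]))        # ([3, 2], 0): second thread fills the gap
```

In the SMT case the second thread still has 2 instructions left over, which simply wait for the next cycle.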
The next was what programs need to do to take advantage of a dual- (or more) core processor. Applications should not need to do anything special. Operating systems need to be written to be able to properly use systems with multiple processors (either multi-core or multi-processor, or both). Unless you are running a *very* old operating system, your OS should already handle this. Applications which are threaded will be able to make more out of a multi-processor system, but they also gain benefits from a single processor system. Even if your application is not threaded, you will most likely see an increase in system performance anyway because your application can get more time to run with more than one processing core available on the system.
Another one was out-of-order execution. Let's use the recipe analogy again. Imagine a program is like a recipe. To make a cake, or whatever you want to cook, you follow steps and the end result is a cake. With an in-order execution processor, each step is followed in the order it appears in the recipe. But many people who cook know that not all steps have to be in order for the cake to come out right. Say the steps are:
1. In a bowl, mix the dry ingredients.
2. In another bowl, mix the wet ingredients.
3. Combine the wet and dry mixtures and mix thoroughly.
4. Grease cake pan.
5. Pour batter into greased cake pan.
6. Preheat oven to 350 degrees.
7. Bake for 20 minutes.
Looking at these steps, we can see some that must happen in a particular order. We could not pour the batter into the pan before we grease it. We cannot bake the cake without heating the oven. But there are other steps which can be reordered. We could preheat the oven earlier in the recipe. We could grease the cake pan at any time before we pour the batter in. We could mix the wet ingredients before the dry as long as we did both before combining. The same is true of a program. It is possible to take instructions and reorder them and still end up with the correct result. This is called out-of-order execution and is done by all the processors except the UltraSPARC T1 and the yet-to-be-released Montecito. How this is done is a detailed discussion for another day.
Next is instruction level parallelism. Instruction level parallelism, or ILP, is a measure of how many instructions in a program can be run at the same time. Take the recipe above: I can find four steps (instructions) which could be done simultaneously assuming we had enough cooks: 1, 2, 4, and 6. That is pretty much it. It is a fairly simple concept with some big fancy wording.
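The recipe can be turned into a small dependency table to check both ideas. The `depends_on` table below is my reading of which steps need which (counting the steps from 1 in the order listed); the function names are made up for the example.

```python
# Prerequisites for each step of the recipe, numbered in the order listed.
depends_on = {
    1: set(),     # mix the dry ingredients
    2: set(),     # mix the wet ingredients
    3: {1, 2},    # combine wet and dry
    4: set(),     # grease the pan
    5: {3, 4},    # pour batter into the pan
    6: set(),     # preheat the oven
    7: {5, 6},    # bake
}

def ready_steps(done):
    """Steps not yet done whose prerequisites are all finished."""
    return sorted(s for s, deps in depends_on.items()
                  if s not in done and deps <= done)

def is_valid_order(order):
    """True if every step runs only after all of its prerequisites."""
    done = set()
    for step in order:
        if not depends_on[step] <= done:
            return False
        done.add(step)
    return True

print(ready_steps(set()))                      # [1, 2, 4, 6]
print(is_valid_order([6, 4, 2, 1, 3, 5, 7]))   # True: reordered, same cake
print(is_valid_order([5, 4, 1, 2, 3, 6, 7]))   # False: pouring before mixing
```

The first result is the ILP of the recipe at the start: four steps with nothing holding them back. The second shows a legal out-of-order execution.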
The last is EM64T. This one is really simple: EM64T is just the name for Intel's version of AMD's 64-bit extension to the x86 architecture.

Monday March 27, 2006
64-bit processors - Clocks, cores, and power usage
My original plan was to go processor by processor. My paper tries to cover all the major architectural features for each processor. In some places I was limited by information available but for most of the processors I was able to find all the same information. I want this to be understandable to people who are computer literate but are not architecture experts. As I started with the first processor, the Itanium 2, I realized I was writing a paragraph of explanation for every sentence. In the end I'd end up with five blog entries just for the Itanium 2 so I could explain what everything meant, and then rush through the rest. That didn't sound like what I wanted so instead I am going to pick one to a few architectural features and talk about all the processors.
I will be using the text of my paper as I wrote it, but insert explanations and hyperlinks to explain what I am talking about. The explanations will be in grey text; my paper is in normal black text. I'll be including a handful of my sources in each entry for those who are interested in more information.
Itanium 2
A cooperation between Intel and Hewlett-Packard led to the release of the Itanium processor in 2001. The Itanium was intended to replace the x86 hardware and dominate the server and workstation markets. It was also not expected that AMD would be able to clone it. The second generation Itanium 2 was released in July 2002. Intel has since focused on a new Itanium processor, code named Montecito. HP sells Itanium 2 servers that range from single processor blades to 128 processor high-end servers. The Itanium's IA-64 architecture is completely different than the IA-32 architecture of the x86 family, though it provides backwards compatibility for 32-bit x86 applications. This paper will be focusing on the versions of the Itanium 2 after the first one (McKinley), code named Madison, Deerfield, and Fanwood.
What does this mean? Hardware companies have working or code names for each processor they release. For a specific processor there may be several different versions with different features. For the Itanium 2, Intel had four different versions starting with McKinley. I have found the code name is the most convenient way to reference different versions of the same processor so expect to see them used heavily in this paper.
The Itanium 2 core speed is 1.3 to 1.6 GHz and uses 130 watts of power.
What does this mean? Processor clock frequency is probably one of the most misunderstood pieces of information about processors. So, what does GHz mean for a processor? A processor clock tick is the smallest moment of time for a processor. In the simplest processor, it can do one operation each clock tick. The clock frequency is how many of those ticks occur in one second. The more ticks, the more operations that can be completed in a second. So far it sounds like GHz does just mean speed. Ah, but there is more. What is an operation and how many can a given processor do each tick? These can vary significantly. Programs consist of instructions. Each instruction is like a step in a recipe. The difference is that for different processor types, the number of instructions needed to run the same program is different. So, already we cannot know which processor will run an application faster because they are not doing the same steps. Each instruction can be broken down into many operations. Some processors break their instructions down into smaller operations than others. Having smaller operations means they can be done faster, which increases the clock speed. But, this does not mean the whole instruction is completed any faster. So, what does knowing the clock speed tell us? For the same processor, and sometimes for a processor family, you can tell which processor is faster. It just cannot be used between processor families (like comparing a Pentium to a G5) and sometimes even within a processor family (like between the Pentium-D and the Xeon). Clock speed is also often tied to energy consumption (higher clock speed, more energy used).
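A little arithmetic shows why GHz alone does not settle which processor is faster. The Python sketch below uses the usual relationship (run time = instructions x cycles per instruction / clock frequency). Every number in it is invented for illustration and does not describe any real processor.

```python
def run_time_seconds(instructions, cycles_per_instruction, clock_hz):
    """Time to run a program: total cycles divided by cycles per second."""
    return instructions * cycles_per_instruction / clock_hz

# Made-up processor A: fast clock, but more instructions for the same
# program and more cycles needed to finish each one.
time_a = run_time_seconds(1_000_000_000, 2.0, 3.0e9)

# Made-up processor B: slower clock, fewer instructions, one cycle each.
time_b = run_time_seconds(800_000_000, 1.0, 1.6e9)

print(round(time_a, 3), round(time_b, 3))   # 0.667 0.5
```

Processor B runs at roughly half the clock speed of A and still finishes the program first.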
The Itanium 2 is one of the few single-core 64-bit processors still commonly available. There is a
dual-core Itanium 2, called Hondo, available only from HP, which uses two Madison cores operating at 1.1 GHz. It is not covered in this paper due to lack of SPEC results for it. Montecito will be a dual-core processor.
What does this mean? A multiple-core processor is basically like having that many separate processors together on a single chip, sharing some resources such as a memory bus. An advantage of multiple-core processors is that the cores can communicate with each other faster than separate processors can. They also take up less physical space on the system motherboard. Many dual-core processors have the same footprint size as their older,
single-core predecessors. Most multiple-core processors also use less energy than the same number of separate processors would.
The later versions of the Itanium 2 support two threads using
coarse-grained multithreading.
What does this mean? In a processor without threading, one or more instructions from the same program are issued into the processor each cycle to be run. The problem is that the processor is often not doing any actual work because the program's instructions are waiting for something, such as data from memory (which is a very slow operation, 1000s of cycles). To make better use of the processor, a technique called threading was created. There are three general types of threading. In coarse-grained multithreading, CGMT, the processor switches to running instructions from a different program when the current thread encounters a long-latency event. This means that at any given time, the processor is still only running one program.
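Here is a toy simulation of the coarse-grained idea in Python. It is a sketch under simplifying assumptions of my own: one operation per cycle, a free thread switch, and a made-up 10 cycle memory miss. The function names and the thread layout are hypothetical, not how any real processor is built.

```python
def run_alone(thread):
    """Cycles for one thread when the core sits idle during every miss."""
    return sum(latency for _kind, latency in thread)


def run_cgmt(threads):
    """Cycles to finish all threads when the core switches thread on a miss."""
    pos = [0] * len(threads)       # next operation index for each thread
    ready_at = [0] * len(threads)  # cycle at which each thread can run again
    cycle = 0
    current = 0
    while any(pos[i] < len(t) for i, t in enumerate(threads)):
        unfinished = [i for i, t in enumerate(threads) if pos[i] < len(t)]
        runnable = [i for i in unfinished if ready_at[i] <= cycle]
        if not runnable:
            # Every unfinished thread is waiting on memory, so the core idles.
            cycle = min(ready_at[i] for i in unfinished)
            continue
        if current not in runnable:
            current = runnable[0]  # switch threads (assumed to cost nothing)
        kind, latency = threads[current][pos[current]]
        pos[current] += 1
        if kind == "work":
            cycle += 1
        else:
            # Long-latency event: the miss completes in the background
            # while the core moves on to another thread.
            ready_at[current] = cycle + latency
    return max([cycle] + ready_at)


# Two identical hypothetical programs: 3 cycles of work, a 10 cycle miss,
# then 3 more cycles of work.
program = [("work", 1)] * 3 + [("miss", 10)] + [("work", 1)] * 3

back_to_back = run_alone(program) * 2
with_cgmt = run_cgmt([program, program])
print(f"one after the other: {back_to_back} cycles")
print(f"coarse-grained multithreading: {with_cgmt} cycles")
```

Only one thread runs in any given cycle, but the second thread's work fills part of the time the first would have spent waiting on memory.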
Opteron
The eighth generation of AMD's
Hammer architecture, the Opteron processor (code names SledgeHammer - 130 nm and Venus - 90 nm), was introduced in April 2003. Designed to compete with Intel's Itanium 2, the Opteron is the most powerful of AMD's 64-bit processors. It was designed for server and enterprise applications. It has arguably become the most popular x86-based 64-bit processor. A variety of computer system producers, including all of the largest enterprise-level UNIX vendors (Fujitsu, IBM, HP, and Sun), sell Opteron systems.
What does nm mean? This refers to the manufacturing process for the processor, measured in nanometers. The smaller the number, the smaller the size of the circuits. This allows the processors to be smaller and use less power. See wikipedia's page about 90 nanometer for more details.
The Opteron chip comes with either one or two cores with clock speeds from 1.8 to 2.8 GHz. Both the single and dual core Opteron processors can run two threads using
simultaneous multithreading and support
out-of-order execution.
What does this mean? In simultaneous multi-threading, instructions from more than one program (in the case of the Opteron, from two programs) issue into the processor to be run in the same cycle. In an in-order processor, instructions for a program are executed in the processor in the same order as they occur in the program. However, many instructions in a program are independent of each other, and do not have to be executed in order for the program to produce the correct result. This is called
instruction level parallelism, or ILP. Why does it matter if instructions can be run out of order? There are many operations which take more than a single cycle to execute, such as floating-point math or loads and stores from memory. With in-order execution, all later instructions have to wait for these operations to finish before they can continue. Out-of-order execution allows a processor to execute other instructions rather than stall the program waiting for a high-latency operation to finish.
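The difference can be seen in a small Python sketch. This is a toy model with my own assumptions: one instruction issues per cycle, the in-order version stalls at the first instruction whose inputs are not ready, and the instruction names and latencies are invented. Real processors are far more complicated.

```python
def in_order_cycles(program):
    """Issue strictly in program order, stalling until inputs are ready."""
    done_at = {}
    cycle = 0
    for name, latency, deps in program:
        start = max([cycle] + [done_at[d] for d in deps])
        done_at[name] = start + latency
        cycle = start + 1  # the next instruction issues a cycle later at best
    return max(done_at.values())


def out_of_order_cycles(program):
    """Each cycle, issue the oldest instruction whose inputs are ready."""
    done_at = {}
    pending = list(program)
    cycle = 0
    while pending:
        for instr in pending:
            name, latency, deps = instr
            if all(d in done_at and done_at[d] <= cycle for d in deps):
                done_at[name] = cycle + latency
                pending.remove(instr)
                break  # only one instruction issues per cycle
        cycle += 1
    return max(done_at.values())


# Hypothetical program: (name, latency in cycles, names it depends on).
program = [
    ("load_a", 5, []),          # slow load from memory
    ("use_a", 1, ["load_a"]),   # needs the loaded value
    ("add_1", 1, []),           # independent work
    ("add_2", 1, []),
    ("add_3", 1, []),
]

print(f"in-order: {in_order_cycles(program)} cycles")
print(f"out-of-order: {out_of_order_cycles(program)} cycles")
```

In the in-order run the three independent adds sit behind use_a, which is stuck waiting for the load. In the out-of-order run they execute while the load is still in flight.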
The average power consumption of an Opteron processor is 89-90 watts.
What does this mean? Even if you aren't an environmentalist
who's worried about our global energy usage, how much power your computer uses is
something you should care about. If you're Joe-Average, your computer eating
power means fewer burgers you get to eat. My dual processor Dell uses around
350 watts (possibly more at peak). That is more than if I turned on
every light in my house (we use fluorescents). I make sure that machine is
off or sleeping whenever it is not in use. If you're Bob-Admin, multiply that
by however many machines you have. Bob-Admin also has to think about how he's
going to keep his server room cool, because 50 machines using that much
power make a great sauna in a few hours. If Bob-Admin puts his machines in
a co-location facility, where he is paying by the sq ft and the watt, not only
does he pay more for energy usage, but for space too. The racks in a co-lo can
only support so much power draw per sq ft, so that means fewer machines per
rack.
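To see how that adds up, here is a back-of-the-envelope calculation in Python. The 350 watts and the 50 machines come from the paragraph above. The electricity price of $0.10 per kWh and running the machines around the clock are assumptions for illustration only, so plug in your own numbers.

```python
HOURS_PER_YEAR = 24 * 365


def yearly_kwh(watts, machines=1, hours=HOURS_PER_YEAR):
    """Energy used in a year, in kilowatt-hours."""
    return watts * machines * hours / 1000


def yearly_cost(watts, price_per_kwh, machines=1):
    """Yearly electricity cost at a flat price per kilowatt-hour."""
    return yearly_kwh(watts, machines) * price_per_kwh


PRICE_PER_KWH = 0.10  # assumed price in dollars, not a real quote

joe_kwh = yearly_kwh(350)
joe_cost = yearly_cost(350, PRICE_PER_KWH)
bob_kwh = yearly_kwh(350, machines=50)
bob_cost = yearly_cost(350, PRICE_PER_KWH, machines=50)

print(f"Joe-Average, 1 machine: {joe_kwh:.0f} kWh, ${joe_cost:.2f} per year")
print(f"Bob-Admin, 50 machines: {bob_kwh:.0f} kWh, ${bob_cost:.2f} per year")
```

This counts only the power the machines draw. Cooling the room and co-lo charges would come on top of it.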
Pentium D
In May 2005, Intel introduced the Pentium D (code name Smithfield), a
dual-core processor which contains two essentially unmodified Pentium 4
Prescott processors. Unlike the Prescott, the Pentium D adds support for
64-bit through Intel's
EM64T technology.
Although some Pentium Prescott processors utilize Intel's
Hyper-Threading
technology, the Pentium D examined
in this paper does not. In early 2006, Intel released a 65 nm version
of the Pentium D, code named Presler. Like Smithfield, the Presler chip does
not support multithreading.
There is a dual-thread Pentium
D, the 3.2GHz Pentium Extreme, but no CPU2000 benchmarks have been published
for that processor so it was not included in this paper. An Extreme Edition
of Presler is scheduled to be released in mid-2006.
The two cores in the Pentium D Smithfield are on the same die and have a clock
speed of 2.8, 3.0, or 3.2GHz. The Presler cores are each on their own die,
which decreases production cost since a defect in a die affects only one core.
What does this mean? Two cores on the "same die" means that
both cores are manufactured on the same integrated circuit. Cores on separate
dies means the cores are on separate integrated circuits, though they are still
on the same chip. For machines with both cores on one die, communication time
is faster, but a defect in one core makes both cores unusable since the
integrated circuit must be thrown away. With the separate die approach, there
is some loss in communication speed, but a defect in one core means only that
core must be thrown away.
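A rough Python sketch shows why that matters for cost. It assumes each core comes out defect-free independently with the same probability, and the 80% yield figure is invented for the example. It is not a number from Intel or anyone else.

```python
def good_chips_same_die(core_pairs, core_yield):
    """Both cores share a die, so a chip is good only if both cores are."""
    return core_pairs * core_yield ** 2


def good_chips_separate_dies(core_pairs, core_yield):
    """Each core is its own die; any two good dies can be paired up."""
    good_dies = 2 * core_pairs * core_yield
    return good_dies / 2


PAIRS = 1000        # enough silicon for 1000 dual-core chips
CORE_YIELD = 0.8    # assumed chance that a single core has no defect

same_die = good_chips_same_die(PAIRS, CORE_YIELD)
separate = good_chips_separate_dies(PAIRS, CORE_YIELD)

print(f"same die: about {same_die:.0f} good chips")
print(f"separate dies: about {separate:.0f} good chips")
```

Under these assumptions the same amount of silicon gives noticeably more sellable chips with separate dies, at the price of slower communication between the cores.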
The cores of the Presler chip operate at 2.8, 3.0, 3.2, and 3.4GHz. With 230
million transistors, the Smithfield is significantly smaller than the
dual-core Itanium processor, which has 1.7 billion transistors. Yet the maximum
power usage for a Pentium D is about 130W - 155W, while the dual-core Itanium
uses 100W. The cores in the Pentium D are clocked significantly lower than the
single-core Prescott in order to minimize power consumption.
What does this mean? Usually, the number of transistors
in an integrated circuit correlates with the amount of power used. However,
comparing the Pentium D to the Montecito, this is not the case. Each core in the
Pentium D operates at a lower clock frequency than the single-core equivalent
so that the power usage is still reasonable. Operating at the same frequency,
the Pentium D would likely be over 200W in power usage.
Power5
The IBM Power5, close relative of the G5, was released in June 2003. IBM uses the Power5 for
a range of machines from single processor entry-level servers to high-end
multi-processor servers. Like its predecessor, the Power4, the Power5 is a
dual-core processor. Both cores are on the same die. The clock speed of the
Power5 ranges from 2.0 to 2.7GHz. The power usage is about 100W. The Power5 can run two threads in each core using
simultaneous multi-threading. It can also operate in single thread mode.
UltraSPARC IV+
The UltraSPARC IV+ (code named Panther), the fifth generation processor in the
SPARC family, was designed for enterprise computing and released in September 2005. Panther is a dual-core processor that supports two threads using what Sun calls "chip multi-threading", or CMT.
Sun's CMT does not quite match the definition of threading commonly used when
talking about processors. Threading normally means running instructions for different
programs in the same core. CMT in the IV+ is running a different program in each
core, not in the same core.
The UltraSPARC IV+ has twice the computing power of the UltraSPARC IV yet
reduces the power consumption from 108W to 90W.
What does this mean? The IV+ shows how much the manufacturing process can
improve processor power usage. The UltraSPARC IV used a 130 nm process
but the IV+ uses a 90 nm process. This allows the processor to be the same
in physical size even though it is much more complex and powerful. This also
helps it use less energy than the IV.
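As a rough check on the size claim, the Python below takes the 130 and 90 process figures above as nanometers and assumes ideal scaling, where the area of a circuit shrinks with the square of the feature size. Real chips do not scale this cleanly, so treat it as a ballpark only.

```python
def area_scale(old_nm, new_nm):
    """Relative area of the same circuit after an ideal process shrink."""
    return (new_nm / old_nm) ** 2


shrink = area_scale(130, 90)
room_for = 1 / shrink

print(f"same circuit at 90 nm: {shrink:.0%} of the original area")
print(f"circuits that fit in the original area: {room_for:.2f}x")
```

Under that assumption, roughly twice as much circuitry fits in the same physical area, which lines up with a processor that is much more complex but no bigger.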
UltraSPARC T1
The UltraSPARC T1, released in November of 2005, is the newest of the SPARC
processor line by Sun Microsystems. The T1 has generated a lot of interest
due to its departure in design from other 64-bit processors currently on the
market. The T1 has eight cores. All cores on the processor operate at the
same frequency, and the processor is available in 1.0 and 1.2GHz versions.
Each core can execute four threads, making the T1 a 32-way processor. Despite the large
number of cores, the T1 only consumes 75W on average, 79W peak.
Xeon
The 64-bit Intel Pentium 4 Xeon (code named Nocona) was released in June 2004. It is designed to be an enterprise-level processor for business computing. It
comes in single-core and dual-core models (the dual-core is code named Paxville, released in
October 2005) and supports Intel's Hyper-Threading technology. The Xeon has
clock speeds from 2.83 to 3.66GHz, the fastest of any of the processors
examined. The single-core Xeon uses 110-120W of power; the dual-core uses
135-150W.
Sources
This is not a complete list... I'll be putting a handful
at the end of each entry.
1. P. Kongetira, K. Aingaran, K. Olukotun - "Niagara: A 32-Way Multithreaded Sparc Processor". IEEE Micro, March/April 2005, Vol. 25, No. 2, pg. 21-29, 2005
2. C. McNairy, D. Soltis - "Itanium 2 Processor Microarchitecture". IEEE Micro,
March/April 2003, Vol. 23, No. 2, pg. 44-55, 2003
3. R. Kalla, B. Sinharoy, J. Tendler - "IBM Power5 Chip: A Dual-Core
Multithreaded Processor". IEEE Micro, March/April 2004, Vol. 24, No. 2, pg.
40-47, 2004
4. C. McNairy, R. Bhatia - "Montecito: A Dual-Core Dual-Threaded Itanium Processor". IEEE Micro, March/April 2005, Vol. 25, No. 2, pg. 10-20, 2005
5. C. Keltcher, K. McGrath, A. Ahmed, P. Conway - "The AMD Opteron Processor for Multiprocessor Servers". IEEE Micro, March/April 2003, Vol. 23, No. 2, pg. 66-76, 2003
Power Consumption Sources
Itanium 2 - http://www.intel.com/products/processor/itanium2/index.htm
Opteron - http://www.epinions.com/content_18680811072
Pentium D - PCStats.com and wikipedia.com
Power5 - http://www.xlr8yourmac.com/G5/xserveG5.html
UltraSPARC IV+ - http://www.extremetech.com/article2/0,1558,1667444,00.asp
UltraSPARC T1 - http://www.sun.com/processors/UltraSPARC-T1/index.xml
Xeon - www.news.com

Friday March 17, 2006
64-bit processors - here we go.
If you'd asked me a year ago about processor architecture, I would have told you that it is not my area of interest and wouldn't have had much else to say about it. I think it all started with my two undergrad architecture classes; I hated them. I suppose I should have known better than to give up so fast. I had the same thing happen to me with a terrible 7th grade history teacher, and it took until my first college history class before I realized it was an interesting topic after all.
When I started the class I knew I'd have to do a research paper. I have been mildly interested in the Niagara processor since it came out a few months ago so I decided I would find some way to incorporate it into my research. What I ended up doing was a comparison of the architecture of the 64-bit processors you will find in current computers (current being anything released in the last 2-3 years) and their performance on some of the SPEC benchmarks.
When I started I was pretty clueless about the subject. I had no idea Intel had EM64T in the Pentium line, that there was an Itanium 2 (and a 3rd generation on the way), or that the Power5 was being used anyplace but in Apple's computers. I realized, if I am this clueless about all of this, then likely other people (both computer-geeks and regular consumers) are too. As I started to talk to other computer-savvy friends of mine, I found they were. It seems computers are like cars: most people know a fair amount about the model of car they have, but only a small percentage of the population are true gear-heads. I knew my two 3.0GHz Xeon processors in my Dell were dual-core and had threading. I had no idea the Power5 did that as well, or even that the UltraSPARC IV+ was dual-core (and I work for Sun).
So, I am not an architecture expert. I will probably make mistakes. I didn't take all of these processors apart and look inside, so whatever I know is based on what has been published by the hardware manufacturer or other people doing hardware research. If I make a mistake, please correct it! With that in mind, let's get started.
The processors I'll be focusing on:
- Alpha 21364
- Itanium 2
- Opteron
- Pentium D
- Power5
- UltraSPARC IV+
- UltraSPARC T1
- Xeon (Nocona and Paxville)
You will notice there are a few processor families missing, such as the MIPS line of processors. I wanted to focus on processors you are likely to find in a machine you could buy today. I tried to be as fair as I could with my analysis, but I admit I had some bias going into this. I have never been a big Intel fan. For me, Windows and x86 were always "ew". I do have an x86 box at home and love it (it is screamin' fast) so I am not totally against them anymore. I also wanted to see the Sun processors do well. Since I work for Sun, it is expected I might be cheering for the home team. I can say that many things I found surprised me, others disappointed me, and my opinion of most of these processors is very different than when I started.
Working Backwards
Rather than go through all the analysis and then write a conclusion, I am going to start this off with my conclusion and then we can look at how I got there.
If I had to pick the best overall processor (based on the research I did and the specs I looked at), it would have to be the Power5. The T1 blows it away (and every other processor) in benchmarks like
SPECweb2005 and
jAppServer2004 but the Power5 tops all other processors in these benchmarks and also shows great performance on the CPU2000 benchmarks as well. The T1 has no published CPU2000 results. It is not a data crunching machine so I doubt we'll ever see CINT2000 or CFP2000 results for it. Based on processor only, if I wanted to buy a web server, I'd pick the T1, for anything else, I'd choose the Power5.
The Opteron would make it to my number two place. It was in the top two for overall processor speed (CINT2000) and was a close second to the Power5 for throughput for up to four processors (CINT2000Rate). It is too bad there are not 8-way and higher Opterons so we could get a really good look at its throughput scalability.
The Itanium 2 and the UltraSPARC IV+ both fall in the middle. Neither processor is very fast but they both have decent throughput and both scale fairly well. Even though neither of them is a top performer, they are the most common processors in benchmark results for machines with 36 or more processors. The next generation Itanium, Montecito, makes some movement in design in the same direction as the T1. It will be interesting to see how it performs once it is released.
The Pentium family processors fall at the bottom. The single-core Xeon is fast. It had the second highest average score on the CINT2000 benchmark. The Pentium D fell in the middle for speed. Both of them have good throughput for a single processor machine. After that, things start to fall apart for the Pentiums. There are no published results for the Pentium D for more than single-CPU machines. The Xeon's throughput drops noticeably at two processors and pretty much falls off a cliff above four processors. Neither machine performs very well on the SPECweb2005 or jAppServer2004 benchmarks.
You might ask, but what about the T1 and the Alpha, where do they fit in? It is not really fair to put the T1 in this ranking, at least not directly. It really is not the same type of processor as the rest of these. The processors above were designed to be all-purpose processors. The T1 was designed with a specific type of application in mind, "network-facing". The T1 would be at the top for these types of applications, and would not do as well for others. Without published results, the best I could do is speculate and I want to discuss facts, not play guessing games. With the Alpha, I didn't find out until after my paper was done that there was a chip released recently (2003). I have not had a chance to take a close look, but I will.
I know that by putting this up, someone is going to try to argue that I have not provided the details. As I said, this is the conclusion, not the whole of my research. I can only fit so much into a single blog entry. Have no fear, the details, more than you probably want, are coming in future blog entries.
Tom pointed out it might look bad that I didn't put a Sun processor in the #1 place. Well, if telling the truth gets me in trouble, I can live with that. The reality is that anyone can look at these results, they are all published on
SPEC's web page. I only had time to look at the integer CPU2000 benchmarks, SPECweb2005 and jAppServer2004. There are so many more to look at (and I plan to) that by the end of this, those rankings will likely change.
That is all for now. Stay tuned for next time when I take a look into some of the architecture designs for these processors.