Three kids, a dog, a cat, sunny days, ocean breezes, and way too much time online
SLO Life

 
Friday March 31, 2006
64-bit processors - Cache and Memory
 
 
Who doesn't like cache? Why, just the word makes me feel happy... ah... cache...  
 
Opteron  
 
The Opteron has an on-die L1 and L2 cache for each core. What does this mean? Cache is memory available to the processor which is faster to access than the main memory on the system. There are different levels of cache which are successively slower and larger. The level 1, or L1, cache is inside the core of the processor. It is the fastest and smallest cache. Most L1 caches are split into a data cache and an instruction cache. The data cache stores recently used or computed data; the instruction cache stores program instructions. The level 2, or L2, cache is larger and slower and may not be in the processor core. For processors where the L2 cache is not in the core, it is still on the processor chip. Some processors include a level 3, or L3, cache which may or may not be on the chip. It is larger and slower than the L2 cache. If it is off-chip, it is often accessed through the memory bus. Cache is used by the processor to keep data and program instructions which have been recently used by the program for faster access in the future. Changes to the cache size and speed have significant effects on the processor's performance. Cache takes a large amount of space on the chip and increases the cost of production for the processor.

The L1 instruction and data caches are 64K, split into four 8K banks, and 2-way associative. What does this mean? The splitting of memory into banks is a common technique. It allows data to be spread across the banks for quicker access. This is called interleaving. Associativity refers to how many possible places in the cache a particular piece of data can be stored. In a cache with 1-way associativity, there is only one location where a particular piece of data can be stored. This location is determined by using values such as the virtual or physical memory address of the data. Because the data can be in only one place, looking in the cache to see if the needed data is there is a fast operation.
Since the cache is not very big, often more than one piece of data will map to the same single location. This means new data will overwrite whatever data is already there. By increasing associativity, the number of places a piece of data can be in the cache increases. This decreases the frequency of data overwriting other data, but makes finding the data slower. For a 4-way associative cache the processor must look in as many as four places to find the data needed, in an 8-way associative cache it must look in up to eight places, and so on. A fully-associative cache means a piece of data can be stored in any location, which means the processor may need to search every line in the cache to find the data needed.

Both L1 caches use 64-byte lines. What does this mean? When data is stored in the cache, it is stored in lines. The size of a line can be different between the L1 and L2 cache. When data is fetched into the cache, it is done one line at a time. The bigger the line, the more data is fetched. This has advantages and drawbacks. Getting more data in a single fetch can speed up the processor since fetches are time consuming. But fetching more data means overwriting more data in the cache, which can increase the need for further fetches later because needed data was overwritten.

The unified L2 cache is 1M, 16-way associative and can handle 10 simultaneous requests. What does this mean? While the L1 cache is split between a data and instruction cache, the L2 cache is a unified cache, so both data and instructions are in the same L2 cache. Notice the associativity of the L2 cache is much higher than that of the L1 cache. If a piece of data is not in the L1 cache or in the L2 cache, then unless the machine has an L3 cache it must be loaded from memory. This is a very slow operation (hundreds of cycles). To increase the odds that needed data will be in the L2 cache, the associativity is increased. This makes it slower but it is still much faster than main memory.
The most basic cache can handle a single request each cycle. The L2 cache in the Opteron can handle 10 requests each cycle.

The L2 cache uses an LRU replacement policy. What does this mean? LRU stands for "least recently used". This means when new data needs to be stored in the cache, it overwrites the data in the cache which was accessed longest ago. The idea is that data which has not been accessed in a long time is less likely to be needed in the future than data which has been accessed more recently.

The Opteron does not have an L3 cache.

The on-chip memory controller (named Northbridge) is 128 bits wide and operates at the same frequency as the core. It has a maximum bandwidth of 5.3G/s. What does this mean? To access main memory, the processor sends requests to the memory controller. The memory controller knows how to access the memory and is responsible for returning to the cache the data requested from memory. Different memory controllers can handle different sizes of data. For the Opteron, this is 128 bits. It operates at the same frequency as the core, which means it sends and receives data at the same frequency as the processor operates. If it operated at half-core speed then it would be able to send or receive data only every other cycle. The maximum bandwidth refers to the amount of data that can be transferred per second. In this case, it can transfer a maximum of 5.3G each second to or from the memory.

Memory access on an Opteron returns lines critical-word first. What does this mean? When data is requested from memory, it must return in a whole line. A word is the base unit for data. Lines are made of words. When the processor requests a piece of data from memory, it can be returned in one of two ways. The line may come with all the words in order. In this case, if the processor needs the 5th word in the line, it has to wait for the first four to come back from memory before it gets the one it wants.
If the number of words in a line is small, this is not a big delay. If it is large, say 100 words, the processor could have to wait a while if the word it wants is the 100th one. The other way to return the data is for the needed word to come first and the rest of the line to come after. This is critical-word first. It gets the data to the cache faster but the words in the line must be reordered at both ends of the transaction.

The processor has a 40-bit physical memory size and a 48-bit virtual memory size. What does this mean? Physical memory is usually the RAM on the system, commonly installed as DIMMs, SIMMs, or RIMMs. A 40-bit physical memory size means that the processor can use memory addresses up to 40 bits in size. The larger this value is, the larger the maximum amount of memory on the system can be. In this case it is 2^40 bytes, or 1 terabyte of memory. Processors also use a type of memory addressing called virtual memory. Virtual memory is used in multitasking computers (ones that run more than one program at the same time, pretty much any computer you've used in the last 25 years) to give programs a contiguous memory space. Physical memory addresses, which may be spread all over the memory, are mapped to new virtual addresses which are in one continuous block. The virtual memory size does not need to be the same size as the physical memory size, as is seen in the case of the Opteron.
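
To make the associativity, line-size, and LRU ideas from this section concrete, here is a minimal Python sketch of a set-associative cache. It borrows the L1 numbers quoted above (64K, 2-way, 64-byte lines) and pairs them with the LRU policy described for the L2 cache; that pairing, the names `split_address` and `SetAssociativeCache`, and the choice of address bits for the set index are illustrative assumptions, not a description of the actual Opteron hardware.

```python
from collections import OrderedDict

LINE_SIZE = 64            # bytes per cache line (from the text)
CACHE_SIZE = 64 * 1024    # 64K cache (from the text)
WAYS = 2                  # 2-way associative (from the text)
NUM_SETS = CACHE_SIZE // (LINE_SIZE * WAYS)   # 512 sets


def split_address(addr):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr % LINE_SIZE
    line_number = addr // LINE_SIZE
    return line_number // NUM_SETS, line_number % NUM_SETS, offset


class SetAssociativeCache:
    """Toy model: each set holds up to WAYS tags and evicts in LRU order."""

    def __init__(self):
        self.sets = [OrderedDict() for _ in range(NUM_SETS)]

    def access(self, addr):
        """Return True on a hit, False on a miss (the line is then loaded)."""
        tag, index, _ = split_address(addr)
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)       # mark as most recently used
            return True
        if len(lines) >= WAYS:
            lines.popitem(last=False)    # evict the least recently used line
        lines[tag] = True
        return False
```

In this model, addresses that are 32K apart (512 sets times 64 bytes) land in the same set, so touching three of them in turn forces an eviction even though the rest of the cache is empty. That is the overwriting problem higher associativity is meant to reduce.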
 
Pentium D  
 
The Pentium D has an 8-way associative, 12K L1 instruction cache (trace cache), which uses 64-byte lines, for each core. Compared to the Opteron, the L1 instruction cache in the Pentium D is much smaller, 12K rather than 64K. It does have a higher degree of associativity than the Opteron's cache. Because it is smaller, it lends itself to this higher associativity: there is less cache space to search through. It uses the same line size, 64 bytes, as the Opteron.

There is also a 4-way associative, 16K L1 data cache on each core. The L1 data cache is larger than the L1 instruction cache and is less associative. It is not uncommon for processors to have different sized L1 instruction and data caches, as we'll see as we look at more processors.

The L1 data cache is non-blocking and allows up to four cache requests. A non-blocking cache allows the processor to issue more than one request at a time. The Pentium D can handle up to four requests at a time. The load latency is two cycles for an integer and six for floating point. Load latency is the amount of time it takes for requested data to be returned.

Each core has a unified L2 cache that is 1M for Smithfield and 2M for Presler. The L2 cache is 8-way associative and is non-blocking with a load latency of seven cycles. The bandwidth between the L1 and L2 cache is 48G/s. What does this mean? Again, a unified L2 cache means both instructions and data are in the same cache rather than separated as they are in the L1 cache. Although the L1 cache in the Pentium D is more associative than the L1 cache in the Opteron, its L2 cache is less associative. A maximum of 48G can be transferred each second between the L1 and L2 cache.

All the caches have an LRU replacement policy.

The hardware also supports prefetching. It attempts to stay 256 bytes ahead of the current data access location. What does this mean? Prefetching is loading data or instructions into the cache before they are requested by the processor.
In the case of the Pentium D, it assumes that needed data will immediately follow data currently being requested and loads the next 256 bytes of data as well. The advantage to prefetching is that often data that will be needed in the near future immediately follows the currently requested data. By loading it before it is requested, the processor eliminates the need for the instruction to request the data from memory since it will be in the cache, which is much faster. The disadvantage to prefetching is that when data is prefetched, other data in the cache must be overwritten. If prefetching is too aggressive, it will overwrite enough existing data in the cache that it will cause extra requests to memory to bring back the needed data which was lost.

The memory controller for a Pentium D is not on the chip. It is accessed through the front-side bus (FSB). What does this mean? Unlike the Opteron, the Pentium D does not have its memory controller built into the chip. Doing this makes the processor smaller and the design simpler, and may reduce production costs. The drawback is that communication across the front-side bus is slower than on the chip. Since the memory controller must use the bus eventually to fetch data from main memory, this slowdown may not be an issue since the bottleneck of the bus will occur either way. Having an off-chip memory controller also means a multi-processor machine could share a single memory controller among multiple processors, rather than having one in each chip. Again the advantages are the same; the disadvantage is that the processors may have to wait while the memory controller is executing another processor's request. This is a general disadvantage with any shared resource.

The FSB on the Pentium D is 800MHz with a theoretical maximum bandwidth of 6.4G/s.

The Pentium D uses 40-bit physical address and 64-bit virtual address sizes.
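
Returning to the hardware prefetching described earlier in this section, the "stay 256 bytes ahead" idea can be sketched in a few lines of Python. This is only a rough model under stated assumptions: the function name `lines_to_prefetch` and the representation of the cache as a set of line start addresses are made up for illustration, and real prefetch hardware is considerably more involved.

```python
LINE_SIZE = 64          # bytes per cache line (from the text)
PREFETCH_AHEAD = 256    # bytes the prefetcher tries to stay ahead (from the text)


def lines_to_prefetch(addr, cached_lines):
    """Return start addresses of the lines covering the 256 bytes that follow
    the line containing addr, skipping lines that are already cached."""
    current_line = addr - (addr % LINE_SIZE)
    wanted = range(current_line + LINE_SIZE,
                   current_line + PREFETCH_AHEAD + LINE_SIZE,
                   LINE_SIZE)
    return [line for line in wanted if line not in cached_lines]
```

With 64-byte lines, 256 bytes is four lines, so each access can pull in up to four lines that the program has not asked for yet. Each of those lines displaces something else in the cache, which is the trade-off described above.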
 
Power5  
 
The Power5 has a separate L1 cache for each core. The L1 instruction cache is 64K, 2-way associative, and is direct mapped from the L2 cache. What does this mean? The L1 instruction cache being direct mapped from the L2 cache means that everything in the L1 instruction cache is duplicated in the L2 cache. Since the L2 cache is shared between the cores on the Power5, this allows the cores quick read access to each other's L1 instruction cache.

The L1 data cache is 32K and 4-way associative.

The L2 cache is shared between the two cores. It is 1.875M divided into three slices and is 10-way associative. What does this mean? Unlike the Pentium D and the Opteron, the Power5 shares its L2 cache between its cores. Advantages to doing this are that the size of the processor is smaller than if it had two separate L2 caches, and it allows the cores to share data and instructions more quickly. Like the Opteron's L1 caches, the L2 cache on the Power5 is divided into slices to allow interleaving of data. The L2 cache uses 128-byte lines and has a bandwidth to the L1 cache of 64G/s.

The Power5 also has an off-chip 36M L3 cache with an on-chip directory. What does this mean? The Power5 is one of the processors examined which uses an L3 cache. Its L3 cache is the largest of the processors examined in this paper. An on-chip directory of the L3 cache is provided so that the processor can determine more quickly if needed data is in the L3 cache or must be fetched from main memory.

This cache is directly connected to the L2 cache. The bus to the L3 cache operates at half-core speed. Unlike its predecessor, the Power4, and the UltraSPARC IV+, the L3 cache on the Power5 is not accessed through the on-chip memory controller. In order to speed up access time to the L3 cache, the L3 cache is connected directly to the L2 cache through a back-side bus.
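
As a quick worked example, the L2 figures quoted in this section (1.875M, three slices, 10-way associative, 128-byte lines) are enough to work out how the cache is organized. The arithmetic below is a sketch which assumes the three slices are equal in size, something the text does not state explicitly.

```python
L2_SIZE = int(1.875 * 1024 * 1024)   # 1.875M in bytes (from the text)
SLICES = 3                           # from the text
WAYS = 10                            # 10-way associative (from the text)
LINE_SIZE = 128                      # bytes per line (from the text)

# Assumption: the three slices are the same size.
slice_size = L2_SIZE // SLICES                       # bytes per slice
sets_per_slice = slice_size // (WAYS * LINE_SIZE)    # sets in each slice

print(slice_size // 1024, "K per slice,", sets_per_slice, "sets per slice")
```

Under that assumption each slice is 640K and holds 512 sets of ten lines, which explains the otherwise odd-looking total of 1.875M.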
 
UltraSPARC IV+  
 
Panther moves from the two levels of cache in the UltraSPARC IV to three levels of cache. It has a 64K L1 instruction cache which is divided into two 32-byte subblocks. Again, this cache uses interleaving by splitting the cache into subblocks.

The L1 data cache is also 64K and uses a write-through policy. What does this mean? The job of the cache is to provide a fast access copy of data in main memory. When this data is changed from its original value, that change must be written back to main memory. There are two common ways to do this: write-through and write-back (also called copy-back). In write-through, when data is written to the cache it is simultaneously written to main memory. The advantages to this policy are that it is simpler to implement and it keeps the cache and main memory consistent at all times. With write-back, changes to data in the cache are only sent to main memory when the changed cache line is evicted (overwritten). This leads to less traffic on the memory bus and speeds up system performance but comes with the risk that if the computer were to have an event such as power loss or a system crash, the changed data in the cache may be lost.

The L1 cache includes a 2K, 64-byte line prefetch buffer accessed in parallel with the L1 instruction cache. What does this mean? With the Pentium D, we discussed data prefetching. Some processors include the ability to prefetch program instructions. The IV+ can store up to 2K of prefetched instructions in a special prefetch buffer.

Panther’s L1 cache also includes a 2K fully associative write cache. What does this mean? A write cache allows the processor to continue on with other operations rather than wait for a write to complete.

The L2 cache for the UltraSPARC IV+ was moved on-chip and is shared between the cores. It is 2M, 4-way associative, and operates at half-core speed. It also is completely inclusive of all L1 caches. The L2 cache uses a copy-back policy to decrease bus traffic.
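
The difference between the write-through and write-back (copy-back) policies described above can be sketched as two tiny Python classes. The class names, the dictionary standing in for main memory, and the explicit `evict` call are all illustrative assumptions; real caches track a dirty bit per line in hardware.

```python
class WriteThroughCache:
    """Every write goes to the cache and to main memory at the same time."""

    def __init__(self, memory):
        self.memory = memory   # dict standing in for main memory
        self.lines = {}

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value    # memory is always up to date


class WriteBackCache:
    """Writes stay in the cache until the changed line is evicted."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}
        self.dirty = set()           # addresses changed since they were loaded

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)         # main memory is now stale for this address

    def evict(self, addr):
        value = self.lines.pop(addr)
        if addr in self.dirty:
            self.dirty.discard(addr)
            self.memory[addr] = value   # only now does memory see the change
```

Writing the same address twice costs two memory writes in the write-through model but only one (at eviction) in the write-back model. Until that eviction happens, though, the only up-to-date copy of the data lives in the cache.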
The L3 cache on the Panther is 32M and 4-way associative. It has 64-byte lines and also follows a copy-back policy. L3 tags are kept on-chip. Like the Power5, the IV+ also has an off-chip L3 cache with an on-chip directory. A difference between the Power5 and the IV+ is the Power5's use of a back-side bus for access to the L3 cache.

The L3 cache is a victim cache, only being written to when things are evicted from the L2 cache. On a hit, the L3 line is copied back to the L2 cache and then invalidated in the L3 cache. What does this mean? When needed data is found in the L3 cache, it is copied into the L2 cache and then removed from the L3 cache. The purpose of the L3 cache on the IV+ is to store data which has been used by the processor in the past but was evicted from the L2 cache. This data is a "victim" of being overwritten. When it is needed again, it is moved to the L2 cache so it can be accessed more quickly in the future and, since it is no longer a victim, it is removed from the L3 cache.

In cases where the two running threads cannot cooperate using the shared L2 and L3 cache, Panther has a mechanism for pseudo-splitting the shared caches. When split, both threads can read all of the cache but can only write to half of it. What does this mean? How can threads not cooperate? If the threads keep overwriting each other's data in the cache, it slows both threads down. By splitting the cache into two separate areas and only allowing each thread to write to one half, the processor can prevent the two threads from clobbering each other. The reason for not having this be the default setup is that doing so cuts the sizes of the L2 and L3 cache in half from the view of each core. When they are not clobbering each other, having the larger cache sizes significantly increases system performance. In the extreme case, if one thread was idle, the other thread could be utilizing all the cache rather than being restricted to half.
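
The victim-cache behaviour described above (lines enter the L3 only when evicted from the L2, and leave the L3 when they are hit) can be modelled in a few lines of Python. This is a simplified sketch: it treats both levels as fully associative with LRU eviction and uses made-up names and capacities, whereas the real caches are 4-way associative.

```python
from collections import OrderedDict


class VictimHierarchy:
    """Toy L2 plus victim L3; both fully associative with LRU eviction (an assumption)."""

    def __init__(self, l2_capacity, l3_capacity):
        self.l2 = OrderedDict()
        self.l3 = OrderedDict()
        self.l2_capacity = l2_capacity
        self.l3_capacity = l3_capacity

    def _fill_l2(self, line):
        if len(self.l2) >= self.l2_capacity:
            victim, _ = self.l2.popitem(last=False)   # evict LRU line from L2
            self.l3[victim] = True                    # the victim moves to L3
            if len(self.l3) > self.l3_capacity:
                self.l3.popitem(last=False)           # oldest victim is dropped
        self.l2[line] = True

    def access(self, line):
        """Return where the line was found: 'L2', 'L3', or 'memory'."""
        if line in self.l2:
            self.l2.move_to_end(line)
            return "L2"
        if line in self.l3:
            del self.l3[line]        # invalidate in L3 ...
            self._fill_l2(line)      # ... and copy back into L2
            return "L3"
        self._fill_l2(line)
        return "memory"
```

Note that a line is never in both levels at once in this model, so the L3 adds its full capacity to the L2 rather than duplicating it.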
 
UltraSPARC T1  
 
Each core has a 16K, 4-way associative L1 instruction cache that uses 32-byte lines. The L1 data cache, also in each core, is only 8K, 4-way associative, uses 16-byte lines, and has a write-through policy. One thing you may notice right away is the big difference in size of the L1 cache from the UltraSPARC IV+. This is just one of the many big changes in the T1 from the rest of the current SPARC processor line. The T1 does not just differ significantly from other SPARCs, but from the other 64-bit processors as well. Although its L1 instruction cache is larger than the Pentium D's, and the same size as the Itanium's, its data cache is noticeably smaller than that of any other processor.

The L2 cache is shared between cores and is accessed through a crossbar interconnection network. What does this mean? An interconnection network allows communication between the core and resources such as other cores, memory, cache, I/O, etc. In the case of the T1, with eight cores sharing the L2 cache and communicating with each other, a standard linear connection network (basically a wire connecting all eight cores to each other and the L2 cache) would get bogged down quickly. A crossbar is a type of connection switch which can handle more traffic. Imagine it as a grid of wires connecting each core and the L2 cache. This provides multiple paths for data to get from point A to point B without having a collision with data going from point C to point D, or even to point B.

The crossbar provides more than 200G/s of bandwidth. Here is another difference between the T1 and other processors. The bandwidth on the crossbar interconnect is three to four times as much as the on-chip bandwidth of the other processors. The T1 does have eight cores to support, rather than two, so this bandwidth is not surprising.

The L2 cache is 3M banked four ways, 12-way associative, and uses 64-byte lines. Data is interleaved across the banks in 64-byte granularity.
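
Interleaving across four banks at 64-byte granularity, as described above, means consecutive 64-byte lines land in different banks. A minimal sketch, assuming the bank is chosen from the low-order bits of the line number (the text does not say which address bits the T1 actually uses), looks like this:

```python
LINE_SIZE = 64    # interleaving granularity in bytes (from the text)
NUM_BANKS = 4     # L2 banks (from the text)


def bank_for(addr):
    """Pick a bank from the line number; using the low-order bits is an assumption."""
    return (addr // LINE_SIZE) % NUM_BANKS


# Five consecutive lines walk through the banks and wrap around.
print([bank_for(addr) for addr in range(0, 5 * LINE_SIZE, LINE_SIZE)])
```

A program reading memory sequentially therefore spreads its requests over all four banks instead of queueing up at one of them.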
The L2 cache has a directory of all eight L1 caches.

There are four on-chip memory controllers shared by the eight cores and accessed through the crossbar. The memory bus on the T1 has significantly more bandwidth than those of the other processors, at 20G/s. Recall that the bandwidths for the other processors were less than half of this, with the next largest being the UltraSPARC IV+ at 9.6G/s.

The T1 uses 40-bit physical addresses split into two sections, memory and I/O addresses, based on bit 39. What does this mean? A 40-bit physical memory size means that memory addresses are 40 bits long. The highest bit (bit 39; the bits are numbered 0 to 39, not 1 to 40) tells the processor whether this is a memory or I/O address. What is an I/O address? An I/O address is an address that belongs to one of the system's I/O devices. This could be video, network, or other devices. Remember the T1 was designed to be "network facing" so it expects to do a lot of I/O functions.

It uses a 48-bit virtual memory size.
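
Checking bit 39 of a 40-bit physical address is a one-line bit test. The sketch below assumes that a set bit marks an I/O address and a clear bit marks a memory address; the text above only says the split is "based on bit 39", so treat that polarity, and the function name, as assumptions.

```python
PHYSICAL_ADDRESS_BITS = 40   # from the text
IO_BIT = 39                  # from the text


def is_io_address(addr):
    """Return True if bit 39 is set (assumed here to mean an I/O address)."""
    if not 0 <= addr < (1 << PHYSICAL_ADDRESS_BITS):
        raise ValueError("address does not fit in 40 bits")
    return bool((addr >> IO_BIT) & 1)
```

Either way the polarity falls, using one bit as the selector leaves 2^39 bytes, or half of the 1-terabyte physical address space, for each of the two sections.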
 
Xeon  
 
The Xeon processor follows the same general design as the Pentium D. The L1 cache on the Xeon is the same as in the Pentium D.

The L2 cache on the Nocona comes in 1M or 2M sizes. Paxville uses a 2M L2 cache size only. The L2 cache is on a 200MHz, shared bus to the off-chip memory controller. What does this mean? The L2 cache is accessed via the same bus that goes to the off-chip memory controller, though the L2 cache is on the chip.

The memory bus is 800MHz with a maximum bandwidth of 6.4G/s.
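
The 6.4G/s figure follows from simple arithmetic once a bus width is known. The sketch below assumes a 64-bit (8-byte) wide data bus and treats "800MHz" as 800 million transfers per second; the bus width is not given in the text above, so it is labelled as an assumption here.

```python
def peak_bandwidth_bytes_per_second(transfers_per_second, bus_width_bits):
    """Theoretical peak: every transfer moves one full bus width of data."""
    return transfers_per_second * bus_width_bits // 8


# Assumption: 64-bit wide data bus, 800 million transfers per second.
fsb_bandwidth = peak_bandwidth_bytes_per_second(800_000_000, 64)
print(fsb_bandwidth / 1e9, "G/s")
```

This is a theoretical maximum; it assumes data moves on every single transfer, which real workloads do not achieve.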
 
Itanium 2  
 
There are three levels of cache available on-die with this processor. Both the L1 data and instruction cache are 16K and 4-way associative.

The L1 instruction cache supports simultaneous demand and prefetch. What does this mean? In the same cycle, the L1 instruction cache can prefetch instructions as well as respond to requests for instructions from the core. It uses a 64-byte line, which holds 4 instruction bundles.

The Itanium 2 is a VLIW processor. Up to three instructions are combined into instruction bundles. What does this mean? First, there are two types of processors: those that can issue more than one instruction at a time, and those that cannot. Those that can are split into two groups, called superscalar and VLIW. The more common of the two is superscalar. VLIW stands for "very long instruction word" and the way it works is that each cycle a "bundle" which can contain several instructions is issued.

The L1 data cache uses a write-through policy and can support 2 loads and 2 stores per cycle. What does this mean? In the Itanium 2 (though not exclusively), there are several different execution units which allow more than one instruction to be executing in the core at the same time. The L1 data cache allows two different instructions to load and two instructions to store in the same cycle.

The Itanium 2 uses a scoreboard system to facilitate a non-blocking L1 data cache. This scoreboard allows the processor to continue executing even with multiple L1 data misses by stalling the instruction issue group of the instruction that had the miss. What does this mean? The scoreboard on the Itanium 2 keeps track of earlier L1 data cache misses. As mentioned above, instructions are issued as a group on the Itanium 2. When an instruction in a group needs data that matches an entry in the scoreboard, the entire issue group is stalled, meaning it cannot continue to execute, until the value needed becomes available.
Stalling an instruction group does not cause a pipeline flush. What does this mean? The pipeline is something like an assembly line for executing an instruction. The instruction goes through stages, each of which performs some small task. We'll go into much more detail about the instruction pipeline next time. In some cases, the pipeline must be flushed, which means emptied of all currently executing instructions. When this happens, all the work done for instructions in the pipeline is lost, and the cycles are wasted. In the case of a stall, on the simplest processor, this would mean that no instructions can execute because the whole pipeline gets stopped. It would be like shutting down the conveyor belt in the assembly line. In most current processors, such as the Itanium, the pipeline is more sophisticated and can allow any instruction which is ready to execute a particular stage to go ahead. This is called out-of-order execution (which we've previously explained and will go into again in a later entry).
 
The L2 cache is a unified, 256K, 8-way associative cache which uses a 128-byte line. It has a latency as low as 5 cycles. What does this mean? The L2 cache on the Itanium can return data in 5 cycles in the best case, most likely with integer values. For larger data, such as floating-point numbers, especially double-precision or larger, it will take many more than 5 cycles.

It operates out-of-order but L1 misses are stored in a FIFO for correct ordering. It can handle 4 data and 1 L3 request per cycle. What does this mean? The L2 cache on the Itanium accepts multiple requests to send or write data each cycle. It may not process these requests in order. For reading data this is not a problem, but for writing data it is. If two instructions operate on the same value and send requests to write to that value, those writes must occur in order or the value will be incorrect in memory. An example is X=5, X=6, X=7 (counters are very common in programming). At the end of this sequence, X should be 7 in the L2 cache but if the writes are processed out-of-order it could be 5 or 6. FIFO means first-in, first-out. To keep writes in order, they are not processed directly but placed in a FIFO queue and the L2 cache can write the values when it has time, always taking the value at the front of the queue (sort of like the line at the DMV) so that writes happen in order.
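
The X=5, X=6, X=7 example above can be played out with a plain FIFO queue in Python. This is only a sketch of the ordering idea; the function name and the dictionary standing in for the cache are illustrative assumptions.

```python
from collections import deque


def apply_writes_in_order(cache, pending_writes):
    """Drain pending (name, value) writes from the front of a FIFO queue."""
    queue = deque(pending_writes)
    while queue:
        name, value = queue.popleft()   # always take the oldest write first
        cache[name] = value
    return cache


writes = [("X", 5), ("X", 6), ("X", 7)]
in_order = apply_writes_in_order({}, writes)
scrambled = apply_writes_in_order({}, reversed(writes))
print(in_order["X"], scrambled["X"])
```

Processing the same three writes in reverse leaves X at 5 instead of 7, which is exactly the kind of wrong result the FIFO prevents.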
 
One of the big differences in the Itanium 2 compared to other 64-bit processors is the on-die L3 cache. The L3 cache can be 3, 6, or 9M, is unified, and 12-way associative. It has a minimum latency of 12 cycles.

It uses 128-byte lines, does not support partial line requests, and returns lines critical-word first. What does this mean? When requesting data from the L1 or L2 cache, only the exact word needed, which is part of a line, is returned. The Itanium 2 L3 cache operates like main memory, returning only whole lines at a time.

A maximum of 84.8G/s of data can be accessed on the chip. The bandwidth to main memory for the Itanium is not discussed until later in the paper; it is 6.4G/s.

The Itanium 2 uses a 50-bit physical address size and a 64-bit virtual address size.
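
Critical-word-first delivery, mentioned here and in the Opteron section, is easy to picture as a reordering of the words in a line. The sketch below assumes the remaining words follow the critical word in wrap-around order; that is one common scheme, but the text only says the needed word comes first and the rest of the line comes after, so the exact order is an assumption.

```python
def critical_word_first_order(line_words, critical_index):
    """Return the words of a line starting at the critical word and wrapping
    around to the start of the line (wrap-around order is an assumption)."""
    count = len(line_words)
    return [line_words[(critical_index + i) % count] for i in range(count)]


# An 8-word line where the processor is waiting on word 5.
print(critical_word_first_order(list(range(8)), 5))
```

The processor gets the word it is stalled on in the first transfer rather than the sixth, while the rest of the line still arrives to fill the cache.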
 
Sources  
 
6. Intel Corporation – “Intel Xeon Processor-based Servers: Performance, headroom, and versatility for front-end applications, small-business servers, and High-Performance Computing”. www.intel.com, 2005  
 
7. Sun Microsystems, Inc. – “UltraSPARC Architecture 2005 Specification (Hyperprivileged Edition)”. www.opensparc.net, 2006  
 
8. Sun Microsystems, Inc. – “UltraSPARC T1 Supplement to UltraSPARC Architecture 2005 Specification (Hyperprivileged Edition)”. www.opensparc.net, 2006  
 
9. Sun Microsystems – “UltraSPARC IV+ Architectural Overview”. www.sun.com, 2005  
 
10. J. De Gelas - "Opteron: Pushing x86 to the Limit". www.aceshardware.com, 2003  
 
11. S. Wasson - "AMD's dual-core Opteron processors: Because four is better than two". techreport.com, 2005

posted by kamundse Mar 31 2006, 11:27:48 AM PST Permalink