Darryl Gove's blog
Understanding UltraSPARC-T1 performance counters
There are two performance counters on the UltraSPARC-T1 processor. The first can be programmed to collect one of eight different event types. The second is hardcoded to always count instructions. Each event type has a cost associated with it; this is typically the number of cycles for which the processor will stall before issuing more instructions from that thread. The counters are described in the following table, together with order-of-magnitude estimates of the cost of each event:
| Counter | Comment | Cost in cycles |
|---|---|---|
| SB_full | Cycles when store buffer is full | 1 |
| FP_Instr_cnt | Floating point instruction count | 30 |
| IC_miss | Instruction cache miss | 20 |
| DC_miss | Data cache miss | 20 |
| ITLB_miss | Instruction TLB miss | 100 |
| DTLB_miss | Data TLB miss | 100 |
| L2_imiss | Instruction fetches that miss L2 cache | 100 |
| L2_dmiss_ld | Loads that miss L2 cache | 100 |
| Instr_cnt | Instruction count | 1 |
The interpretation of these counters is different for the UltraSPARC-T1 than for previous generations of processors. When interpreting the results it is very important to recognise that the processor can absorb many more stall cycles before performance is affected. The reasoning is as follows:
Each processor has eight cores, and each core executes four threads. Each core can issue one instruction per cycle. This means that on every cycle, three of a core's four threads cannot issue an instruction; these threads are either stalled or waiting for the opportunity to issue an instruction.
Assume that the processor is clocked at 1.2GHz. With eight cores each issuing one instruction per cycle, the processor can issue up to 9.6 billion instructions per second. For each one of those instructions, there are three threads that are either stalled or waiting to issue an instruction.
Hence there's a budget of 9.6 billion instructions per second. The Instr_cnt performance counter shows how much of this budget was actually used. This is a measure of the utilisation of the processor.
There is also a budget of 9.6 * 3 = 28.8 billion 'stall' cycles; cycles where the other threads can be stalled. All the hardware counter stall events come from this budget.
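The budget arithmetic above can be written out as a short sketch. The function name and its parameters are illustrative only and not part of any Sun tool; the defaults are the figures used in the text (eight cores, four threads per core, 1.2GHz).

```python
def budgets(clock_hz=1.2e9, cores=8, threads_per_core=4):
    """Return (issue_budget, stall_budget) in cycles per second.

    Assumes, as in the text, that each core issues at most one
    instruction per cycle, so on every cycle the remaining threads
    on that core are either stalled or waiting to issue.
    """
    issue_budget = clock_hz * cores
    stall_budget = issue_budget * (threads_per_core - 1)
    return issue_budget, stall_budget


issue, stall = budgets()
print(f"issue budget: {issue / 1e9:.1f} billion instructions per second")
print(f"stall budget: {stall / 1e9:.1f} billion cycles per second")
```

Changing `clock_hz` shows how both budgets scale for a part clocked at a different speed.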
In previous processors, a high stall time would indicate a performance issue with the application; but on the UltraSPARC-T1 that is no longer the case. Until the number of cycles spent stalled exceeds the 'stall' budget, there may be no impact on the performance of the application.
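As a rough illustration of that comparison, the sketch below multiplies event counts by the order-of-magnitude costs from the table and expresses the total as a fraction of the 28.8 billion cycle 'stall' budget. The function names and the sample rates are hypothetical and invented for illustration. The rates are assumed to be per-second totals summed over all threads; since only one event type can be collected at a time, they would probably have to come from separate runs.

```python
# Order-of-magnitude costs in cycles, copied from the table above.
COST_CYCLES = {
    "SB_full": 1,
    "FP_Instr_cnt": 30,
    "IC_miss": 20,
    "DC_miss": 20,
    "ITLB_miss": 100,
    "DTLB_miss": 100,
    "L2_imiss": 100,
    "L2_dmiss_ld": 100,
}

STALL_BUDGET = 28.8e9  # 'stall' cycles per second at 1.2GHz, from the text


def estimated_stall_cycles(events_per_second):
    """Sum count * cost over the events supplied (a crude estimate)."""
    return sum(COST_CYCLES[name] * count
               for name, count in events_per_second.items())


def stall_budget_used(events_per_second, budget=STALL_BUDGET):
    """Fraction of the stall budget consumed by the estimated stalls."""
    return estimated_stall_cycles(events_per_second) / budget


# Hypothetical whole-processor event rates, not measured data.
sample = {
    "DC_miss": 200_000_000,
    "L2_dmiss_ld": 50_000_000,
    "DTLB_miss": 1_000_000,
}

print(f"estimated stall cycles per second: {estimated_stall_cycles(sample):,}")
print(f"fraction of stall budget used: {stall_budget_used(sample):.0%}")
```

A fraction well below one suggests the stalls may be hidden behind other threads; this is only an estimate, because the costs are order-of-magnitude figures and one miss can be counted under more than one event.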
Posted at 10:00AM Jan 09, 2007 by Darryl Gove in Sun


