Ravindra Talashikar's weblog
Wednesday Dec 07, 2005
UltraSPARC T1 utilization explained
With the introduction of the Sun Fire T2000/T1000 servers using the UltraSPARC T1 processor, Sun has taken a radically different approach to building scalable servers. The UltraSPARC T1 processor is best perceived as a system on a chip. To understand the performance of any system, we need to start by understanding the CPU utilization of that system. Let us see how software and hardware thread scheduling is done on UltraSPARC T1, why conventional tools like mpstat don't show the complete picture, and what CPU utilization really means for the T1 processor. While thinking about this issue, I wrote "corestat", a new tool to monitor the core utilization of the T1 processor, and I will discuss the use of this tool too.
Let us start with an overview of the basic concepts, which will help explain the rationale for addressing CPU utilization separately for UltraSPARC T1's CMT architecture.
CMT and UltraSPARC T1 at a glance :
The UltraSPARC T1 processor combines chip multiprocessing with chip multithreading. The processor consists of eight cores with four hardware threads per core. Each core has one integer pipeline, and the four threads within a core share that pipeline. There are two types of shared resources on the processor: the threads of a core share that core's Level 1 (L1) instruction and data caches as well as its Translation Lookaside Buffer (TLB), and all the cores share the on-chip Level 2 (L2) cache. The L2 cache is a 12-way set associative unified (instruction and data combined) cache.
Thread scheduling on UltraSPARC T1 :
The Solaris Operating System kernel treats each hardware thread of a core as a separate CPU, which makes the T1 processor look like a 32-CPU system. In reality it's a single physical processor with 32 virtual processors. Conventional tools like mpstat and prtdiag report 32 CPUs on T1.
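To make the 8 x 4 layout concrete, here is a minimal Python sketch of how the 32 virtual CPU ids could be grouped by core. It assumes that virtual CPU ids are numbered contiguously, so that ids 0-3 belong to core 0, ids 4-7 to core 1, and so on. That numbering is an assumption for illustration, not something stated above, and the function names are hypothetical.

```python
CORES = 8             # from the text: eight cores
THREADS_PER_CORE = 4  # from the text: four hardware threads per core


def core_of(cpu_id: int) -> int:
    """Return the core index for a virtual CPU id.

    Assumes contiguous numbering (ids 0-3 on core 0, 4-7 on core 1, ...).
    """
    if not 0 <= cpu_id < CORES * THREADS_PER_CORE:
        raise ValueError(f"cpu_id {cpu_id} is outside 0..{CORES * THREADS_PER_CORE - 1}")
    return cpu_id // THREADS_PER_CORE


def cpus_by_core() -> dict:
    """Group all 32 virtual CPU ids by the core they are assumed to share."""
    groups = {core: [] for core in range(CORES)}
    for cpu_id in range(CORES * THREADS_PER_CORE):
        groups[core_of(cpu_id)].append(cpu_id)
    return groups


if __name__ == "__main__":
    for core, cpus in cpus_by_core().items():
        print(f"core {core}: virtual CPUs {cpus}")
```

Grouping by core like this is what allows per-thread numbers (such as those from mpstat) to be read alongside per-core numbers.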
The Solaris Operating System schedules software threads onto these virtual processors (hardware threads) much as it would on a conventional SMP system. There is a one-to-one mapping of software threads onto hardware threads: a software thread stays scheduled on one hardware thread until its time quantum expires or it is pre-empted by a higher-priority software thread.
The hardware scheduler decides how the pipeline is used by the hardware threads sharing the same core. Every cycle the hardware thread scheduler switches threads within a core, allowing the same hardware thread to run at least every 4th cycle. There are two specific situations in which a hardware thread can run for more than one cycle in four consecutive cycles. These situations arise when another hardware thread in the core becomes idle or gets stalled. Let us look into these cases more closely :
What does it mean by an "idle" hardware thread on UltraSPARC T1 :
Conventionally a processor is considered to be idle by the kernel when there is no runnable thread in the system which can be scheduled on that processor. On previous generation SPARC processors, an idle state meant that the pipeline of the processor remained unused. For a CMT processor like T1, if there are not enough runnable threads in the system then one or more hardware threads in a core remain idle. Main differences in behavior of an idle virtual processor (hardware thread) of T1 compared to the idle CPU in conventional SMP are :
Understanding processor utilization :
For a T1 processor, a thread being idle and a core becoming idle are two different things and hence need to be understood separately. Here are some commonly asked questions in this regard :
On UltraSPARC T1, Solaris tools like mpstat only report the state of a hardware thread and don't show the core utilization. Conventionally, if a processor is not idle it is considered busy. A stalled processor is also conventionally considered busy, because on non-CMT processors the pipeline of a stalled processor is not available to other runnable threads in the system. On a T1 processor, however, a stalled thread doesn't mean a stalled pipeline. On T1, vmstat and mpstat output should really be interpreted as a report of pipeline occupancy by software threads. For non-CMT processors, the idle time reported by mpstat or vmstat can be used to decide whether to add more load to the system. On a CMT processor like T1, we also need to look at the core utilization before making the same decision.
Core utilization of a T1 corresponds to the number of instructions executed by that core. cpustat is a tool available on Solaris to monitor system behavior using hardware performance counters. The T1 processor has two hardware performance counters per thread (there are no core-specific counters). One of the performance counters always reports the instruction count and the other can be programmed to measure other events such as cache misses and TLB misses. A typical cpustat command looks like :
cpustat -c pic0=L2_dmiss_ld,pic1=Instr_cnt 1
which will report data cache misses in the L2 cache and the instructions executed in user mode, at 1 second intervals, by all the enabled threads.
I wrote a new tool, "corestat", for online monitoring of core utilization. Core utilization is reported for all the available cores by aggregating the instructions executed by all the threads in that core. It's a Perl script which forks the cpustat command at run time and then aggregates the instruction counts to derive the core utilization. A T1 core can at best execute 1 instruction/cycle, and hence the maximum number of instructions a core can execute per second is directly proportional to the frequency of the processor.
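The aggregation described above can be sketched in a few lines of Python. This is not the corestat source (which is a Perl script); it is a minimal illustration under stated assumptions: cpustat sample lines are taken to have the shape "time cpu event pic0 pic1" with the instruction count on pic1 (as in the command shown earlier), virtual CPU ids are assumed to be contiguous within a core, and the function names are hypothetical.

```python
from collections import defaultdict

THREADS_PER_CORE = 4  # from the article: four hardware threads per core


def parse_cpustat_line(line):
    """Return (cpu_id, instr_count) for a per-CPU sample line, else None.

    Assumes lines shaped like 'time cpu event pic0 pic1', with Instr_cnt
    programmed on pic1. Header lines and anything else are skipped.
    """
    fields = line.split()
    if len(fields) != 5 or fields[2] != "tick":
        return None
    return int(fields[1]), int(fields[4])


def core_utilization(samples, clock_hz, interval_s=1.0):
    """Aggregate per-thread instruction counts into per-core utilization (%).

    samples: iterable of (cpu_id, instr_count) covering one sampling interval.
    A core executes at most 1 instruction/cycle, so its peak for the interval
    is clock_hz * interval_s instructions.
    """
    per_core = defaultdict(int)
    for cpu_id, instr in samples:
        # Assumed mapping: ids 0-3 on core 0, 4-7 on core 1, and so on.
        per_core[cpu_id // THREADS_PER_CORE] += instr
    peak = clock_hz * interval_s
    return {core: 100.0 * n / peak for core, n in sorted(per_core.items())}


if __name__ == "__main__":
    # Made-up sample lines for one core on a hypothetical 1000 MHz system.
    lines = [
        "   time cpu event      pic0      pic1",
        "  1.008   0  tick      1200  97500000",
        "  1.008   1  tick      1100  97500000",
        "  1.008   2  tick      1300  97500000",
        "  1.008   3  tick      1250  97500000",
    ]
    samples = [s for s in map(parse_cpustat_line, lines) if s is not None]
    for core, pct in core_utilization(samples, clock_hz=1_000_000_000).items():
        print(f"core {core}: {pct:.1f}% of peak")
```

Note that the counts in this sketch only cover whatever mode cpustat was asked to measure; the command shown earlier counts user-mode instructions.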
Corestat can be used in two modes :
Usage :
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
Here we can see that each core is executing at 39% of its maximum capacity. Interestingly, mpstat output for the same period shows that all the virtual CPUs are 100% busy. Together these show that, in this particular case, even 100% busy threads cannot drive any of the cores to its maximum capacity because of stalls.
From corestat data we can get an idea of the absolute capacity of the core available for more work or performance. A higher percentage of core usage means that the core is getting saturated and has less head room available for processing more load. It also means that the pipeline is being used more efficiently. However, lower core utilization doesn't simply mean more room for applying more load: all the virtual CPUs can be 100% busy while the core utilization is still low.
Core utilization (as seen above from corestat) and mpstat or vmstat need to be used together to make decisions about system utilization.
Here is some explanation of a few commonly observed scenarios :
Vmstat reports 75% idle and core utilization is only 20% : Since vmstat reports a large idle time and the core usage is also low, there is head room for applying more load. Any performance gain from increasing the load will depend on the characteristics of the application.
Vmstat reports 100% busy and core utilization is 50% : Since vmstat reports all threads being 100% busy, there is really no more head room to schedule any more software threads. Hence the system is at its peak load. Low (i.e. 50%) core utilization indicates that the application is only utilizing each core to 50% of its capacity and the cores are not saturated.
Vmstat reports 75% idle but core utilization is 50% : Since core utilization (50%) is higher than the thread utilization reported by vmstat (25% busy), this is an indication that the processor can get saturated with fewer software threads than the available hardware threads. It is also an indication of a low CPI application. In this case, scalability will be limited by core saturation, and adding more load after a certain point will not help achieve any more performance.
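The three scenarios above amount to a small decision rule, sketched here in Python. The function name, the labels, and the 95% "busy" threshold are hypothetical choices made for illustration; the comparison itself (thread busy percentage from vmstat against core utilization from corestat) follows the reasoning in the scenarios.

```python
def interpret(idle_pct, core_util_pct, busy_threshold=95.0):
    """Classify a (vmstat idle %, corestat core utilization %) pair.

    busy_threshold is an assumed cut-off for "all threads busy";
    it is not a value taken from the article.
    """
    thread_busy_pct = 100.0 - idle_pct
    if thread_busy_pct >= busy_threshold:
        # No hardware threads left to schedule on; low core utilization
        # here points to stalls (a high CPI application).
        return "thread-saturated"
    if core_util_pct > thread_busy_pct:
        # Cores fill up faster than threads: likely to saturate the cores
        # with fewer software threads than hardware threads (low CPI).
        return "core-bound"
    # Both idle threads and spare pipeline capacity are available.
    return "headroom"


if __name__ == "__main__":
    for idle, core in [(75, 20), (0, 50), (75, 50)]:
        print(f"idle={idle}% core={core}% -> {interpret(idle, core)}")
```

A rule like this is only a first-pass reading; as noted above, any gain from adding load still depends on the characteristics of the application.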
As with any other system, on the Sun Fire T2000 more threads become busy and core utilization goes up as the load increases. Since thread saturation (i.e. virtual CPU saturation) and core saturation are two different aspects of system utilization, we need to monitor both simultaneously in order to determine whether an application is likely to saturate a core while using fewer threads than are available. In that case, applying additional load to the system will not deliver any more throughput. On the other hand, if all the threads are saturated but core utilization shows more head room, the application has stalls and is a high CPI application. Application-level tuning, partitioning of resources using processor sets (psrset(1M)), or binding of LWPs (pbind(1M)) are some techniques that could improve performance in such cases.
Posted at 11:13AM Dec 07, 2005 by travi in Sun | Comments[25] |
Posted by Moazam on December 07, 2005 at 12:59 PM IST #
Posted by Glenn on December 14, 2005 at 03:05 AM IST #
(1) How long is a hardware thread typically stalled in a typical instruction mix?
(2) How does the kernel schedule software threads across cores and hardware threads? Does it just use some random allocation? Does it pile the SW threads onto the HW threads of the first core, then the second core, and so forth? Does it distribute the SW threads evenly across cores, insofar as possible? In a situation where all the cores have at least one HW thread active, does the kernel try to cluster threads on cores according to the Solaris process in which they are executing (keeping threads from the same process on the same core, as much as possible), to avoid or limit the amount of L1 cache and TLB thrashing, especially since the cache is unified? For the sake of argument, consider a system with 32 active SW threads, including 8 active threads in my single-process application. Will my own 8 threads be allocated 1 to a core, 2 to a core, 4 to a core, or some mixture of these? Then consider the same question when my 8 threads are the only threads on the machine.
(3) How large is each cache line? How much cache thrashing can we expect? Don't all the HW threads in a core tend to overlay each other's data in the L1 cache, reducing performance?
(4) What does all of this say about the best way to write an application for best performance on a T1? Sun sometimes says no restructuring is needed, but this is not believable. It seems that one would need to dramatically increase the number of threads that can be used in parallel by your application. Two threads would probably get you back to where you started (given the lower operating frequency of each core), and then you can climb from there if you can find appropriate parallelism in your application.
(5) How does 12 (-way set associativity of the cache) relate to 8 (cores) or 32 (HW threads)? 12 seems like a very strange number (not a power of two).
(6) This may be a dumb question, but ... are the L1 and L2 caches physical-memory or virtual-memory caches?
Posted by Glenn on December 14, 2005 at 03:48 AM IST #
Posted by Andrei Dorofeev on December 18, 2005 at 12:06 PM IST #
Posted by Ravindra Talashikar on December 19, 2005 at 12:59 PM IST #
Posted by Ravindra Talashikar on December 19, 2005 at 01:08 PM IST #
Playing around with cpustat, I can get one or the other but not both, and as far as I'm aware you can't run 2 instances of cpustat at the same time? cpustat refers to the T1 manual, but that manual is not available at the given URL.
Would you mind posting the cpustat command you're running?
thanks, Richard
Posted by Richard Gray on January 22, 2006 at 07:53 PM IST #
Posted by Ravindra Talashikar on February 02, 2006 at 05:39 PM IST #
Posted by Robert Halloran on February 03, 2006 at 09:03 PM IST #
Posted by Balaji on April 12, 2006 at 02:14 PM IST #
Posted by Mark Round on April 24, 2006 at 03:40 PM IST #
Posted by Niranjan B on May 03, 2006 at 12:52 AM IST #
Posted by Ravindra Talashikar on June 04, 2006 at 06:40 PM IST #
Posted by Kamal Srinivasan on June 06, 2006 at 07:14 AM IST #
Posted by Javier Iparraguirre on June 13, 2006 at 01:41 AM IST #
Posted by Ravindra Talashikar on June 13, 2006 at 04:39 PM IST #
Posted by Ravindra Talashikar on June 13, 2006 at 04:55 PM IST #
Posted by Shailesh on September 19, 2006 at 04:05 PM IST #
Posted by goker canitezer on November 19, 2006 at 06:23 AM IST #
Posted by Phil Freund on January 18, 2007 at 02:13 AM IST #
Argument "Can't" isn't numeric in array element at /usr/local/bin/corestat line 152, <fd_in> line 1.
But corestat works perfectly on a T1000; both run at 1000 MHz. What could be the reason? I'm using v1.0 -- Kind regards, Nick
Posted by Niki Kraus on July 05, 2007 at 01:53 PM IST #
Posted by John Tavares on August 02, 2007 at 08:37 PM IST #
Hi Ravi, corestat is a great tool, and I am able to run it successfully. But I am not able to redirect the corestat output to a file.
Commands like pipe or redirection are not working with corestat.
Could you please provide a solution to this?
I run corestat using the "sudo corestat" command.
Thank you
Prashant
Posted by Prashant on September 26, 2007 at 11:39 AM IST #
I also want to redirect the output to a file and further analyse. But I also found that redirection doesn't work. Anyone can give me some hints?
Posted by Karen Law on February 19, 2008 at 08:18 AM IST #
Hi:
Is it possible to get this tool for T2+ processors?
Posted by PB on October 24, 2008 at 02:55 PM IST #