Wednesday December 07, 2005
Is my workload recommended for a CoolThreads UltraSPARC T1 server (T1000 - T2000)?
Since the pre-release and announcement of UltraSPARC T1 systems (T1000 - T2000),
customers coming to the Sun Solution Benchmark Center have been very interested to know whether their
application will work well on UltraSPARC T1. While assessing the multi-threaded nature of a
workload is easy using standard system tools, it is less straightforward to obtain on demand
the number and proportion of floating-point instructions executed by a system. Some complex
tools exist, but we would like to have a simple go/no-go binary that answers
only this question. (If you are interested in a more detailed analysis of CPU behavior, please
ask me about a great tool called ripc.)
The key information coming from our UltraSPARC T1 engineers is the choice they had to make (because
of space limitations) to have a single floating-point unit shared by the 8 cores (and 32 strands).
Please note that this limitation has been addressed in the next release of this processor.
They tell us that, in their best estimation, any workload in which more than 2% of all instructions
are floating-point instructions is not recommended for UltraSPARC T1. Between 1% and 2% is the gray area where
they recommend that we test, because a number of the simpler FPU instructions were moved into the
core and do not incur the 40-cycle penalty.
The idea of this article is to explain how to get this information and provide a simple tool
(for all UltraSPARC based systems).
The UltraSPARC III (or UltraSPARC IV) core can fetch up to four instructions
from cache per clock cycle and can hold up to sixteen fetched instructions
waiting for an execution unit to become available. Six parallel execution units exist on
the chip: one load/store unit, one branch unit, two identical integer Arithmetic Logic
Units, one add (and therefore subtract) floating-point unit named FA_PIPE (see FP 1
on the diagram below), and one multiply (and therefore divide) floating-point unit named FM_PIPE
(see FP 2 below).

For the UltraSPARC III (and IV or IV+), multiple performance
instrumentation counters are provided to analyze CPU performance
behavior under load, but for our purpose we need to consider only three of them:
1. The total number of instructions completed, not counting annulled, mispredicted or
trapped instructions. This is the Instr_cnt counter.
2. The total number of instructions completed on the FA_PIPE. This is the FA_pipe_completion
counter.
3. The total number of instructions completed on the FM_PIPE. This is the FM_pipe_completion
counter.
Note that counters 2 and 3 are also incremented for some types of VIS instructions. Therefore,
they should be considered only as estimates.
For UltraSPARC T1 based systems, it is simpler, as a single counter, FP_instr_cnt, is directly provided.
As you have already deduced, we can determine the percentage of floating-point
operations with the formula:
%FP_ops = 100 * (FA_pipe_completion + FM_pipe_completion) / Instr_cnt
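As a minimal sketch of this formula in Python (the function name `fp_percentage` and the sample counter values are illustrative assumptions, not part of pfp), the computation could look like this:

```python
def fp_percentage(fa_pipe_completion, fm_pipe_completion, instr_cnt):
    """Estimate the percentage of floating-point instructions.

    Arguments are raw counter totals collected over the same interval
    (FA_pipe_completion, FM_pipe_completion and Instr_cnt).
    """
    if instr_cnt <= 0:
        raise ValueError("Instr_cnt must be positive")
    return 100.0 * (fa_pipe_completion + fm_pipe_completion) / instr_cnt


# Hypothetical counter values, for illustration only.
print(f"{fp_percentage(30_000, 15_000, 22_500_000):.2f}")  # prints 0.20
```

Because the two pipe counters also count some VIS instructions, the result should be read as an upper-bound estimate rather than an exact figure.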
We can also provide this simple heuristic:
if (%FP_ops < 1%) -> Recommended for UltraSPARC T1
else if (%FP_ops between 1% and 2%) -> Possible fit for UltraSPARC T1
else -> Not recommended for UltraSPARC T1
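The heuristic above can be sketched in Python as follows. This is only an illustration: the function name `t1_recommendation` is hypothetical, and treating exactly 1% and exactly 2% as part of the gray area is an assumption, since the text only says that more than 2% is not recommended.

```python
def t1_recommendation(fp_percent):
    """Map a floating-point percentage to the three-way recommendation."""
    if fp_percent < 1.0:
        return "Recommended for UltraSPARC T1"
    if fp_percent <= 2.0:  # assumption: both boundaries fall in the gray area
        return "Possible fit for UltraSPARC T1"
    return "Not recommended for UltraSPARC T1"


# Sample percentages, for illustration only.
for pct in (0.20, 1.97, 2.50):
    print(f"{pct:.2f}% -> {t1_recommendation(pct)}")
```

A workload that lands in the gray area should still be tested on real hardware before drawing a conclusion.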
To do this, here is a program named pfp that you can run as pfp <duration in seconds>.
If you are on a T1000 or T2000 system, please use the -n flag, as this program does not detect the CPU
type in its first release. Please remember to start your workload first and, while it is running,
run this program as shown below.
paris # ./pfp 30
We observed 22756679 instructions separated in 0.20% floating point and 99.80% others
This workload is recommended for UltraSPARC T1 systems.
ontario # ./pfp -n 30
We observed 342593950 instructions separated in 0.77% floating point and 99.33% others
This workload is recommended for UltraSPARC T1 systems.
If you just want the percentage of floating-point instructions, you can also run:
paris # ./pfp -s 30
0.20
Finally, you can also use the tool on Solaris 8 or Solaris 9 with:
Dtrace # ./pfp -ps 30
1.97
The binary of this tool can be found here.
Dec 07 2005, 05:02:21 PM PST
Posted by Igor on December 08, 2005 at 06:50 AM PST #
Posted by Ceri Davies on December 08, 2005 at 08:48 AM PST #
Thanks
Posted by Amit Kulkarni on December 08, 2005 at 09:53 PM PST #
thanks for your comments. MrBenchmark graciously emailed me with almost the same info.
Posted by Amit Kulkarni on December 09, 2005 at 11:27 AM PST #
> (for all UltraSPARC based systems).
I tried this tool on our Solaris 8 UltraSPARC systems and got:
% pfp 30
cpustat: %pic0 cannot measure event 'FA_pipe_completion' on this cpu
cpustat: two events must be specified
Usage:
cpustat [-c events] [-nhD] [interval [count]]
[...]
% uname -a
SunOS shasta 5.8 Generic_117350-12 sun4u sparc SUNW,Sun-Blade-100
We're trying to decide whether we should buy some Niagara systems.
Posted by Jim Gottlieb on December 27, 2005 at 01:43 AM PST #
Posted by Rainer Orth on January 20, 2006 at 03:20 AM PST #
Posted by Bob Rion on February 06, 2006 at 07:43 AM PST #
Posted by Michel Boitos on March 01, 2006 at 06:45 AM PST #
Posted by bbr on March 02, 2006 at 04:30 AM PST #
Posted by Bob Rion on March 23, 2006 at 09:19 AM PST #
Posted by Taj Newburn on April 06, 2006 at 01:10 PM PDT #
Posted by Olivier Chédru on April 07, 2006 at 06:21 AM PDT #
Posted by John on August 11, 2006 at 12:04 PM PDT #
Posted by MrBenchmark on August 11, 2006 at 01:52 PM PDT #
Solaris switches to integer calculations if the FPU is too high. Can you please explain what the algorithm is for deciding to switch?
Thanks,
hcc
Posted by hcc on September 25, 2007 at 09:20 AM PDT #