Wednesday November 14, 2007
OLTP performance of the Sun SPARC Enterprise M9000 on Solaris 10 08/07
I recently published a performance comparison of the Sun Fire E25k and the new Sun SPARC Enterprise M9000.
In this article, a lot of my readers noticed the following note :
"Oracle OLTP is disappointing on the M9000 with an increase in response time at peak throughput. Upcoming release of Solaris and Oracle 10g should improve this result"
Critical bug fixes
I wrote this because I knew that Sun engineering was working hard on fixing three key bugs affecting database performance on all M-series systems. Here is the list of these bugs, which were successfully fixed in Solaris 10 08/07 (Update 4) :
1. Bug 6451741
SPARC64 VI prefetch tuning needs to be completed
Impact : L2 cache efficiency is key to database memory performance. Corrected prefetch values improve memory read and write performance.
2. Bug 6486343
Mutex performance on large M-series systems needs improvement
Impact : The mutex retry and backoff algorithm needed to be retuned for M-series systems due to out-of-order execution and platform-specific branch prediction routines. The fix also improves lock concurrency on hot memory pages.
3. Bug 6487440
Memory copy operations need tuning on M-series systems
Impact : The least important of the three fixes, but still significant for Oracle stored procedures, triggers and constraints.
The big question was : how much would these fixes improve OLTP performance ?
Well, your mileage may vary, but on my workload I measured a whopping 1.33x lower response time for 1.38x higher throughput (compared to Solaris 10 Update 3). It is also interesting to note that the other workloads tested have not moved significantly, as they are not really sensitive to the issues tackled here.
Please find below the corrected comparative charts in throughput and response time after a reminder on the workloads :
Java workloads
Not exactly. So let's try to be a little bit more specific using five different 100% Java (1.6) workloads :
- iGenCPU v3 - Fractal simulation, 50% integer / 50% floating point
- iGenRAM v3 - Lotto simulation (memory allocation and search)
- iGenBATCH v2 - Oracle 10g batch using partitioning, triggers, stored procedures and sequences
- iGenOLTP v4 - Heavy-weight OLTP
Datapoints
The values shown here are peak results obtained by building the complete scalability curve. The response times reported are averages at peak throughput, in milliseconds.
| | E25k | | M9000 | |
| Workload | Throughput | RT (ms) | Throughput | RT (ms) |
| iGenCPU v3 | 303 fractals/second | 105 | 728 fractals/second | 44 |
| iGenRAM v3 | 2865 lottos/ms | 55 | 4881 lottos/ms | 17 |
| iGenBatch v2 | 35 TPS | 907 | 50 TPS | 626 |
| iGenOLTP v4 | 3938 TPM | 271 | 6194 TPM | 264 |
Since we want to compare these gains with the 1.267 clock-frequency factor, let's look at those results by giving a factor of 1 to the E25k.
First, here is throughput :
| Throughput | E25k | M9000 |
| iGenCPU v3 | 1 | 2.403 |
| iGenRAM v3 | 1 | 1.704 |
| iGenBATCH v2 | 1 | 1.450 |
| iGenOLTP v4 | 1 | 1.573 |
| Frequency | 1 | 1.267 |
Which would be this chart :

And here is the average response time at peak throughput (still using a base of 1 for the E25k) :
| RT | E25k | M9000 |
| iGenCPU v3 | 1 | 0.419 |
| iGenRAM v3 | 1 | 0.301 |
| iGenBATCH v2 | 1 | 0.690 |
| iGenOLTP v4 | 1 | 0.970 |
And the chart :

These new numbers illustrate how well placed the M-series servers are to replace the current UltraSPARC-IV servers, from the smallest Sun Fire V490 to the largest Sun Fire E25k... as long as you use at least Solaris 10 08/07.
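The normalized factors in the two tables above can be recomputed from the peak datapoints. Here is a minimal Python sketch of that arithmetic; the dictionary layout and the function name are my own illustration, not part of the iGen suite. Because the published datapoints are rounded, a few recomputed factors may differ slightly from the tables, which were presumably derived from the unrounded measurements.

```python
# Peak results copied from the datapoints table: (throughput, response time in ms).
RESULTS = {
    "iGenCPU v3": {"E25k": (303, 105), "M9000": (728, 44)},
    "iGenRAM v3": {"E25k": (2865, 55), "M9000": (4881, 17)},
    "iGenBATCH v2": {"E25k": (35, 907), "M9000": (50, 626)},
    "iGenOLTP v4": {"E25k": (3938, 271), "M9000": (6194, 264)},
}


def relative_factors(results, baseline="E25k", target="M9000"):
    """Return {workload: (throughput factor, RT factor)} with the baseline set to 1."""
    factors = {}
    for workload, systems in results.items():
        base_thp, base_rt = systems[baseline]
        new_thp, new_rt = systems[target]
        factors[workload] = (round(new_thp / base_thp, 3),
                             round(new_rt / base_rt, 3))
    return factors


if __name__ == "__main__":
    for workload, (thp, rt) in relative_factors(RESULTS).items():
        print(f"{workload}: throughput x{thp}, response time x{rt}")
```

A throughput factor above the 1.267 frequency factor means the M9000 gains more than its clock speed alone would explain.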
See you next time in the wonderful world of benchmarking...
Nov 14 2007, 05:53:42 PM PST Permalink
Solaris Vista dual-boot : No problem !
Summary of the operations :
1. The laptop already had Vista installed in C: (70G) with a D: partition (70G)
2. Using the Vista Disk Partitioner (default System tool in Vista Ultimate), I removed the D: partition
3. I downloaded Solaris Nevada build 72 and burned a DVD-R
4. I went into the Setup menu of the Ferrari 5000 and allowed boot only from the DVD
5. I booted Solaris b72 and chose the option (3) Terminal
6. I partitioned my disk to create a single Solaris partition with :
fdisk /dev/rdsk/c0d0p0
7. Rebooted and installed Solaris. Installation took about 50 minutes.
8. Booted again from the DVD and chose option (3)
9. Modified /a/boot/grub/menu.lst by adding :
title Windows Vista
rootnoverify (hd0,1)
chainloader +1
10. Went back into the boot menu (F2) and re-enabled disk booting.
11. Rebooted and verified that I could use Solaris & Vista.
12. Booted Solaris, installed SLAMD and the iGen benchmark suite
13. Ran the iGenCPU benchmark to compare the system to others. Got 27 fractals/second at 4 threads. Nice for a laptop !
Additional note : Wireless configuration is now very easy as the wificonfig tool is part of the Nevada distribution
The only thing needed is update_drv -a -i '"pciex168,1c"' ath . No reboot necessary.
Then you can do wificonfig -i ath0 plumb ; wificonfig -i ath0 scan
Final note : All the tricks that you can find in other blogs are now irrelevant, as the Solaris MBR bug was fixed in build 70.
Sep 17 2007, 03:25:33 PM PDT Permalink
Sun SPARC Enterprise M9000 vs Sun Fire E25k - Datapoints
A performance comparison of two high-end UNIX servers using the iGen benchmark suite
Aug 20 2007, 05:07:47 PM PDT
Permalink
Unbreakable Oracle 10g Release 2 : What if you have ORA-600 kcratr1_lastbwr ?
This is an interesting story that happened yesterday at one of our customer sites. An engineer powered off the wrong rack of equipment, which contained a Sun Fire X4600 running Oracle 10g Release 2. Almost no transactions were being performed at the time, so when the system came back up the customer expected the database to be up and running very quickly.
In reality this is what happened :
Tue Nov 7 11:19:42 2006
ALTER DATABASE OPEN
Tue Nov 7 11:19:42 2006
Beginning crash recovery of 1 threads
parallel recovery started with 16 processes
Tue Nov 7 11:19:44 2006
Started redo scan
Tue Nov 7 11:19:44 2006
Errors in file /xxx/oracle/oracle/product/10.2.0/db_1/admin/xxx/udump/xxx_ora_947.trc:
ORA-00600: internal error code, arguments: [kcratr1_lastbwr], [], [], [], [], [], [], []
Tue Nov 7 11:19:44 2006
Aborting crash recovery due to error 600
Tue Nov 7 11:19:44 2006
Errors in file /xxx/oracle/oracle/product/10.2.0/db_1/admin/xxxtest/udump/xxxtest_ora_947.trc:
ORA-00600: internal error code, arguments: [kcratr1_lastbwr], [], [], [], [], [], [], []
ORA-600 signalled during: ALTER DATABASE OPEN...
Not too pretty ! Checking the ASM configuration and the IO subsystem showed nothing wrong. So what to do if you do not have a backup handy ?
Well, here is the idea .... what would we do if we had a backup that was inconsistent ?
The recover database command will start an Oracle process which will roll forward all transactions stored in the restored archived logs necessary to make the database consistent again. The recovery process must run up to a point that corresponds with the time just before the error occurred after which the log sequence must be reset to prevent any further system changes from being applied to the database.
So we tried :
startup mount
Tue Nov 7 11:54:03 2006
Starting background process ASMB
ASMB started with pid=61, OS id=1070
Starting background process RBAL
RBAL started with pid=67, OS id=1074
Tue Nov 7 11:54:13 2006
SUCCESS: diskgroup xxxTESTDATA was mounted
Tue Nov 7 11:54:17 2006
Setting recovery target incarnation to 2
Tue Nov 7 11:54:17 2006
Successful mount of redo thread 1, with mount id 2364224219
Tue Nov 7 11:54:17 2006
Database mounted in Exclusive Mode
Completed: ALTER DATABASE MOUNT
Tue Nov 7 11:54:32 2006
recover database
Tue Nov 7 11:54:32 2006
Media Recovery Start
parallel recovery started with 16 processes
Tue Nov 7 11:54:33 2006
Recovery of Online Redo Log: Thread 1 Group 3 Seq 4 Reading mem 0
Mem# 0 errs 0: +xxxTESTDATA/xxxtest/onlinelog/group_3.263.605819131
Tue Nov 7 11:59:25 2006
Media Recovery Complete (xxxtest)
Tue Nov 7 11:59:27 2006
Completed: ALTER DATABASE RECOVER database
alter database open
alter database open
Tue Nov 7 12:03:01 2006
Beginning crash recovery of 1 threads
parallel recovery started with 16 processes
Tue Nov 7 12:03:01 2006
Started redo scan
Tue Nov 7 12:03:01 2006
Completed redo scan
273 redo blocks read, 0 data blocks need recovery
Tue Nov 7 12:03:01 2006
Started redo application at
Thread 1: logseq 4, block 12858574
Tue Nov 7 12:03:01 2006
Recovery of Online Redo Log: Thread 1 Group 3 Seq 4 Reading mem 0
Mem# 0 errs 0: +xxxTESTDATA/xxxtest/onlinelog/group_3.263.605819131
Tue Nov 7 12:03:01 2006
Completed redo application
Tue Nov 7 12:03:01 2006
Completed crash recovery at
Thread 1: logseq 4, block 12858847, scn 824040
0 data blocks read, 0 data blocks written, 273 redo blocks read
Tue Nov 7 12:03:02 2006
Thread 1 advanced to log sequence 5
Thread 1 opened at log sequence 5
Current log# 1 seq# 5 mem# 0: +xxxTESTDATA/xxxtest/onlinelog/group_1.261.605819081
Successful open of redo thread 1
Tue Nov 7 12:03:02 2006
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Tue Nov 7 12:03:02 2006
SMON: enabling cache recovery
Tue Nov 7 12:03:03 2006
Successfully onlined Undo Tablespace 1.
Tue Nov 7 12:03:03 2006
SMON: enabling tx recovery
Tue Nov 7 12:03:03 2006
Database Characterset is UTF8
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
QMNC started with pid=56, OS id=1128
Tue Nov 7 12:03:05 2006
Completed: alter database open
And we are up and running ! The real thing that Oracle should work on is the quality and clarity of their error messages.
At this point this is quite poor ...
Unbreakable database, maybe. Automatic (and simple) , not yet.
Nov 08 2006, 04:44:44 PM PST Permalink
Do you need an OLTP benchmark for ANSI v2 databases ?
Do you need a new OLTP benchmark ?
A benchmark that could be lightweight as well as heavyweight ?
That could be IO intensive or CPU intensive ?
That could run on any database and any operating system ?
That you could run on your laptop but also scale up to a 144 cpus Sun Fire E25k+ ?
That could run standalone, in client/server or in a 3 tier model ?
That would produce instantly color charts and comprehensive PDF or HTML reports ?
Well, send me an email or a comment ( I am sure you are smart enough to find my email address somewhere)
if you want it. If my mailbox becomes full - I'll see what I can do....
MrBenchmark
Oct 19 2006, 01:43:17 PM PDT Permalink
A second benchmark to compare V40z, V490 and T2000 : iGenRAM v1.2
The iGenRAM v1.2 benchmark is a Java-based memory application useful to compare the memory performance of different systems. Based on the functional requirements of the California Lotto, this application is simulating a California lotto play consisting of :
Players play Lotto tickets by choosing series of six numbers. Each thread is simulating 6 million tickets played. To store tickets, memory is allocated in Java in the form of multi-dimensional integer arrays.
The system generates a list of six winning numbers.
The system has to determine which ticket won and for what amount. For this, it needs to browse all tickets and compare to the winning combination.
The total duration of these tasks produces the throughput in Lottos computed per second, or iGenRAM_Thp, and the average response time, iGenRAM_RT.
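To make the three steps above concrete, here is a minimal sketch of the same idea. It is written in Python for brevity (the real benchmark is Java), and the function name, the ticket count, the number range (1 to 51) and the seed are all assumptions of mine. Prize amounts are not modeled; the sketch only counts how many tickets matched 0, 1, ... 6 winning numbers.

```python
import random


def play_lotto(num_tickets, picks=6, highest=51, seed=0):
    """Store tickets in memory, draw the winning numbers, then scan every ticket.

    Returns a list where index k holds the number of tickets with k matches.
    The number range and ticket count are illustrative, not the benchmark's.
    """
    rng = random.Random(seed)
    numbers = range(1, highest + 1)
    # Step 1: allocate the tickets as a two-dimensional array of integers.
    tickets = [rng.sample(numbers, picks) for _ in range(num_tickets)]
    # Step 2: the system generates the six winning numbers.
    winning = set(rng.sample(numbers, picks))
    # Step 3: browse all tickets and compare each one to the winning combination.
    matches = [0] * (picks + 1)
    for ticket in tickets:
        hits = sum(1 for n in ticket if n in winning)
        matches[hits] += 1
    return matches


if __name__ == "__main__":
    print(play_lotto(100_000))
```

Most of the time is spent allocating and then walking the ticket array, which is why memory latency dominates this kind of workload.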
Systems with low memory latency and scalable memory interconnect will succeed. We expect good things from the Sun Fire T2000.

Next up : iGenRAM v1.2 results on V490, V40z and T2000
Jan 02 2006, 02:28:01 PM PST Permalink
What is SWaP and iGenCPU SWaP values
- Performance: Using industry-standard benchmarks or your own benchmark !
- Space: Measuring the height of the server in rack units (RUs).
- Power: Determining the watts consumed by the system, using data from actual benchmark runs or vendor site planning guides
The SWaP metric is calculated this way :
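As commonly stated, SWaP divides performance by the product of space and power. A minimal sketch of that calculation follows; the function name and the sample server (100 fractals/second, 2 RU, 500 W) are hypothetical values chosen only to show the arithmetic.

```python
def swap_metric(performance, rack_units, watts):
    """SWaP = performance / (space x power).

    performance: any benchmark score (higher is better)
    rack_units:  height of the server in RUs
    watts:       power consumed by the system
    """
    if rack_units <= 0 or watts <= 0:
        raise ValueError("space and power must be positive")
    return performance / (rack_units * watts)


if __name__ == "__main__":
    # Hypothetical server, not a measured result.
    print(swap_metric(100.0, 2, 500))
```

SWaP values are only comparable between systems when the same benchmark supplies the performance figure.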

I recently provided iGenCPU v2.1 benchmark results for various platforms (see previous entries). I started with this benchmark as it is the absolute worst case for the T2000. Let's see how it translates into SWaP numbers :

Or as expressed in a chart :

So what is the message here ?
If you are running floating point intensive applications (the immense majority of commercial applications are not), and you need a small form factor, the AMD equipped V20Z/V40z or better the Galaxy line ( Sun Fire™ X4100 and Sun Fire™ X4200 ) are the right answer.
I hope I convinced you here that the SWaP metric was not designed to make the UltraSPARC-T1 a winner every time but to provide a critical metric for modern datacenters !
In 2006, we will look at a second microbenchmark called iGenRAM , and then explore LDAP performance (iGenLDAP), database performance (iGenOLTP) and even web performance (iGenWEB).
I am leaving for Tahoe, so see you on the Heavenly slopes or next year on this forum...
Dec 16 2005, 11:54:42 AM PST Permalink
MrBenchmark iGen benchmarks : A clarification
Thank you for your comments on my previous post.
Walter's guidelines are good and applicable to all standard benchmarks. Please note that my benchmarks are not standards.. all results are provided for your information and without any type of performance guarantee....
If you are searching for standard benchmarks results , do not hesitate to go there (BM seer blog)
Also, apples-to-apples comparisons are only a dream from my perspective. Why ? Because all of these systems (and processors) are different from the bottom up...
I see my comparison data as a way to provide an approximate idea of how systems rank against each other on a specific workload. No benchmark is universal... so you can never say System X is better/faster than System Y without being specific about the application tested.
Next up : SWaP values for iGenCPU v2.1
Dec 15 2005, 05:23:58 PM PST Permalink
iGenCPU results on UltraSPARC T1 (T2000) and UltraSPARC IV+ (V490+)
This is an update on my previous entry, adding the UltraSPARC T1 CoolThread server T2000 and the V490+. The throughput increase between the UltraSPARC IV and the UltraSPARC IV+ is fairly proportional to the clock frequency. No surprise here. The simplicity of this microbenchmark does not allow it to benefit from many of the UltraSPARC IV+ innovations.
Regarding the T2000, please note that I am not afraid to publish the results. This is not a marketing blog... Indeed, this benchmark is not recommended for the UltraSPARC T1 due to the fact that 25% of the instructions are floating point operations (see two previous blog entries).
It does not prevent us from collecting the data. Please remember that we want our customers to run the right platform for their workload. So, if your workload is generating more than 2% of floating point operations, the UltraSPARC T1 is probably not what you should choose....
Note : The column threads is the number of threads used to observe the highest throughput with a response time (RT) less than 100 ms.
| Processor | Frequency | # CPU | # Cores | Ram | OS | Threads | Fractals/s | RT (ms) |
| Intel XEON | 3 Ghz | 2 | 4 (HT) | 4 GB | S10 03/05 | 3 | 31.16 | 96.26 |
| Sun UltraSPARCIIIi | 1.2Ghz | 4 | 4 | 8 GB | S10 03/05 | 4 | 53.01 | 75.4 |
| AMD OPTERON | 2.4Ghz | 4 | 4 | 8 GB | S10 HW2 | 5 | 90.18 | 55.4 |
| Sun UltraSPARC IV | 1.2Ghz | 4 | 8 | 8 GB | S10 03/05 | 8 | 98.88 | 80.8 |
| Sun UltraSPARC IV+ | 1.5Ghz | 4 | 8 | 8 GB | S10 HW1 | 8 | 123.48 | 78.36 |
| Sun UltraSPARC T1 | 1.2Ghz | 1 | 8 | 8 GB | S10 HW2 | 9 | 18.62 | 93.08 |

Let me know your thoughts....
Next, I will publish a description of the iGenRAM 1.6 and related benchmark results. We will see that the T1 is pretty good at this....
Dec 14 2005, 02:24:01 PM PST Permalink
iGenCPU 2.1 results - V40z, V65x, V490 and V440
As promised here are our first iGenCPU v2.1 benchmark results. See previous blog entry for the benchmark description.
As reported by my pfp tool, this benchmark produces about 25% floating-point operations and 75% others...
It is therefore absolutely not recommended for the UltraSPARC T1-based T1000 and T2000...
This table shows the performance obtained on this benchmark by four popular Sun Microsystems servers :
the single-core V40z, the V65x, the V440 and the V490, all running Solaris 10. Please note that these servers may be available today at a higher frequency.
Note : The column threads is the number of threads used to observe the highest throughput with a response time (RT) less than 100 ms.
| Server | Processor | Frequency | # CPU | # Cores | Ram | OS | Threads | Fractals/s | RT (ms) |
| V65x | Intel XEON | 3 Ghz | 2 | 4 (HT) | 4 GB | S10 03/05 | 3 | 31.16 | 96.26 |
| V440 | Sun UltraSPARCIIIi | 1.2Ghz | 4 | 4 | 8 GB | S10 03/05 | 4 | 53.01 | 75.4 |
| V40z | AMD OPTERON | 2.4Ghz | 4 | 4 | 8 GB | S10 HW2 | 5 | 90.18 | 55.4 |
| V490 | Sun UltraSPARC IV | 1.2Ghz | 4 | 8 | 8 GB | S10 03/05 | 8 | 98.88 | 80.8 |
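The rule used to fill the Threads column (highest throughput with a response time under 100 ms) is easy to apply to a scalability curve. Here is a minimal sketch; the function name and the sample curve are hypothetical and are not measurements from the table.

```python
def peak_under_rt_limit(curve, rt_limit_ms=100.0):
    """Pick the point with the highest throughput whose RT stays under the limit.

    curve: iterable of (threads, throughput, rt_ms) tuples.
    Returns the selected tuple, or None if no point meets the limit.
    """
    eligible = [point for point in curve if point[2] < rt_limit_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda point: point[1])


if __name__ == "__main__":
    # Hypothetical scalability curve: (threads, fractals/s, RT in ms).
    curve = [(1, 12.0, 80.0), (2, 23.5, 85.0), (3, 31.0, 96.0), (4, 33.0, 120.0)]
    print(peak_under_rt_limit(curve))
```

In this made-up curve the 4-thread point has the best raw throughput but is rejected because its response time exceeds the limit.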

Please use the comments section for your observations, I am sure you will have plenty...
Next, I will publish iGenCPU 2.1 results for UltraSPARC T1 and UltraSPARC IV+ and provide my observations....
Dec 12 2005, 05:26:43 PM PST Permalink
MrBenchmark benchmarks : Opteron vs UltraSPARC IV vs UltraSPARC T1
Whiners came to me saying : "MrBenchmark : Enough theory, please give us some benchmark results.."
And I said, fine...so here we are ..I will publish some informal benchmark results in this forum. And Yes I will compare
UltraSPARC IV, IV+, Opteron , UltraSPARC T1 and even Xeon !
Let me present the first microbenchmark of my series. It is called iGenCPU. It is written in 100% pure Java. I am using Java 1.5.
The iGenCPU benchmark is a Java-based CPU micro-benchmark used to compare the CPU performance of different systems.
Based on a customized Java complex number library, the code is computing Benoit Mandelbrot's highly dense fractal structure using
integer and floating-point calculations. The simplicity of the code as well as its non-recursivity allow a very scalable behavior using
less than 64 Mb of memory per thread.
iGenCPU reports multiple statistics. We are mostly interested in analyzing iGenCPU_Thp (how many fractals per second can we
compute with this number of threads ?) and iGenCPU_RT (what is the average time needed to compute a complete fractal with this number of threads ?)
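For readers who have never coded a Mandelbrot set, here is a minimal non-recursive sketch of the kind of computation involved. It is written in Python with the built-in complex type rather than a customized Java complex number library, and the grid size, the iteration limit and the function names are my own assumptions, not iGenCPU's actual parameters.

```python
def mandelbrot_iterations(c, max_iter=100):
    """Count iterations of z = z*z + c before |z| exceeds 2 (escape time)."""
    z = 0j
    for i in range(max_iter):
        z = z * z + c          # floating-point multiply and add
        if abs(z) > 2.0:
            return i           # the point escaped: it is outside the set
    return max_iter            # assumed to be inside the set


def render(width=40, height=20, max_iter=100):
    """Compute one complete fractal as a grid of escape times."""
    rows = []
    for y in range(height):    # integer loop counters and index arithmetic
        row = []
        for x in range(width):
            c = complex(-2.0 + 3.0 * x / (width - 1),
                        -1.0 + 2.0 * y / (height - 1))
            row.append(mandelbrot_iterations(c, max_iter))
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in render():
        print("".join("#" if n == 100 else "." for n in row))
```

Each pixel is independent of the others, which is what lets this type of code scale cleanly with the number of threads.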
iGenCPU uses the system this way, as represented by my iTarget chart :

Next to come : our first iGenCPU benchmark result (table & diagram ): V40z (4xAMD Opteron @2.4Ghz with 8GB RAM ) vs V490 (4xUltraSPARC IV 1.2Ghz 8GB RAM)
Dec 09 2005, 04:25:28 PM PST Permalink
How to demonstrate the value of the CoolThread UltraSPARC T1 servers (T1000 - T2000) to your boss ?
Well, after a very long entry presenting my pfp tool , here is a very short one...
To demonstrate the value of the CoolThread UltraSPARC T1 servers (T1000 - T2000) to your boss
there is only one thing to do : make her/him benchmark it using Sun Sim Datacenter
(yes ! your boss is gonna run a benchmark and she/he will like it ! )
How to do it and simulate all your Datacenter with UltraSPARC T1000 or T2000 ?
Very simple, download Sim Datacenter here ,
and run it on Solaris 9 or 10 !
What, you don't have Solaris 10 on your laptop ?
Get it right now on this Solaris page ...
Easy,no ?
Dec 08 2005, 05:04:37 PM PST
Permalink
Is my workload recommended for a CoolThread UltraSPARC T1 server ( T1000 - T2000 ) ?
Since the pre-release and announcement of UltraSPARC T1 systems (T1000 - T2000),
our customers coming in the Sun Solution Benchmark Center have been very interested to know if their
application will work well on UltraSPARC T1. While assessing the multi-threaded nature of a
workload is easy using standard system tools, it is less straightforward to obtain at will
the amount and proportion of floating points instructions executed by a system. Some complex
tools exist but we would like to have a simple go/no-go binary that would answer
only this question. (If you are interested in a more detailed analysis of a cpu behavior, please
ask me about a great tool called ripc )
The key information coming from our UltraSPARC T1 engineers is the choice they had to make (because
of space limitations) to have a single floating point unit shared by the 8 cores (and 32 strands).
Please note that this challenge has been solved on the next release of this processor.
They tell us that, in their best estimation, any workload in which floating-point instructions make up more than 2% of the total
instruction count is not recommended for UltraSPARC T1. Between 1% and 2% is the gray area where
they recommend that we try, because a number of the simpler FPU instructions were moved to the
core and don't incur the 40-cycle penalty.
The idea of this article is to explain how to get this information and provide a simple tool
(for all UltraSPARC based systems).
The UltraSPARC III (or UltraSPARC IV core) has a maximum of four instructions that can
be fetched from cache in a clock cycle and a total of sixteen fetched instructions that
can wait for an execution unit to become available. Six parallel execution units exist on
the chip : one load/store unit, one branch unit, two identical integer Arithmetic Logical
Units, one add (and therefore subtract) floating point unit named FA_PIPE (see FP 1
on the schema below) and one multiply (and therefore divide) floating point unit named FM_PIPE
(see FP 2 below).

For the UltraSparc III (and IV or IV+), multiple performance
instrumentation counters are provided to analyze the CPU performance
behavior under load but for our purpose we need to consider only three of them :
1-The total number of instructions completed not counting annulled, mispredicted or
trapped instructions. This is the Instr_cnt counter
2-The total number of instructions completed on the FA_PIPE. This is the FA_pipe_completion
counter.
3-The total number of instructions completed on the FM_PIPE. This is the FM_pipe_completion
counter.
Note that the counters 2 and 3 are also incremented for some type of VIS instructions. Therefore,
they have to be considered only as estimations.
For the UltraSPARC T1 based systems, it is simpler as the single counter FP_instr_cnt is directly provided.
As you have already deduced, we can determine the percentage of floating-point
operations with the formula :
%FP_ops = 100 * (FA_pipe_completion + FM_pipe_completion) / Instr_cnt
We are also able to provide this simple heuristic :
if ( %FP_ops < 1%) -> Recommended for UltraSPARC T1
else if (%FP_ops between 1% to 2%) -> Possible fit for UltraSPARC T1
else -> Not recommended for UltraSPARC T1
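The formula and the heuristic above translate directly into code. Here is a minimal Python sketch; this is not the source of pfp, the function names are mine, and the counter values in the example are hypothetical. Reading the real counters from the CPU is not covered here.

```python
def fp_percentage(fa_pipe_completion, fm_pipe_completion, instr_cnt):
    """%FP_ops = 100 * (FA_pipe_completion + FM_pipe_completion) / Instr_cnt."""
    if instr_cnt <= 0:
        raise ValueError("Instr_cnt must be positive")
    return 100.0 * (fa_pipe_completion + fm_pipe_completion) / instr_cnt


def t1_recommendation(fp_pct):
    """Apply the go/no-go heuristic to a floating-point percentage."""
    if fp_pct < 1.0:
        return "Recommended for UltraSPARC T1"
    if fp_pct <= 2.0:
        return "Possible fit for UltraSPARC T1"
    return "Not recommended for UltraSPARC T1"


if __name__ == "__main__":
    # Hypothetical counter values sampled over some interval.
    pct = fp_percentage(fa_pipe_completion=30, fm_pipe_completion=20, instr_cnt=10_000)
    print(f"{pct:.2f}% floating point: {t1_recommendation(pct)}")
```

Since the FA and FM counters also count some VIS instructions, the percentage should be read as an upper-bound estimate near the 1% and 2% thresholds.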
To do this, here is a program named pfp that you can use as pfp <duration in seconds>
If you are on a T1000 or T2000 system, please use the flag -n as this program does not detect the cpu
type in its first release. Please remember to start your workload first and, while it is running,
use this program as shown below.
paris # ./pfp 30
We observed 22756679 instructions separated in 0.20% floating point and 99.80% others
This workload is recommended for UltraSPARC T1 systems.
ontario # ./pfp -n 30
We observed 342593950 instructions separated in 0.77% floating point and 99.33% others
This workload is recommended for UltraSPARC T1 systems.
If you just want the percentage of floating point instructions, you can also do
paris # ./pfp -s 30
0.20
Finally, you can also use the tool on Solaris 8 or Solaris 9 with :
Dtrace # ./pfp -ps 30
1.97
The binary of this tool can be found here.
Dec 07 2005, 05:02:21 PM PST
Permalink
Solaris Performance Analysis Methodology (The APM) - part 1b
As promised in my previous post, I am continuing the exposé of the way I analyze
customer workloads. Please note that the way you will approach this mission can be :
- Personal - To be successful and have a happy customer (or consumer, using a DTrace analogy), you may have to follow your own logic.
- Traditional - The traditional approach is to start with a general tool (vmstat, prstat or sar) and drill down into interesting areas.
- Based on previous analysis - You don't need to understand the whole performance picture and can focus on a specific issue.
The APM for Solaris
Part 1 (continued) - Control what's running
Based on my previous post, you now have a better idea of what's going on in the different
layers of your architecture. But what about the quality of these applications ? The most
common reaction that I get to this question is : "We can't do anything about it". This may
be true for commercially available applications like a database or a web server. In fact, I found out that for a majority of our customers, we can improve performance by doing one
very simple thing : recompile with a modern compiler and adequate flags
(for Java, read this as : use the latest supported JVM with adequate flags).
Please remember that Sun Studio 11 is now FREE. You have no excuse not to use it. A
better question is how you can find out what compiler was used for a specific binary.
An old Unix System V tool named elfdump is hidden under /usr/ccs/bin with other tools like prof or lex. (usually it is not in your path - just add it ). Note that elfdump -C can demangle C++ names. For example :
elfdump -c /iGen/iGen_all |grep "SUNWspro"
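If you want to run the same check over many binaries, or on a machine where elfdump is not in your path, a cruder alternative is to search the raw bytes of the file for the same "SUNWspro" string. The sketch below does only that: it is a plain byte search, not an ELF parser and not a replacement for elfdump, and the function name is mine.

```python
def contains_marker(path, marker=b"SUNWspro", chunk_size=1 << 20):
    """Return True if the raw bytes of the file at `path` contain `marker`.

    The file is read in chunks; the last len(marker) - 1 bytes of the data
    seen so far are kept so that a marker straddling two chunks is not missed.
    """
    overlap = len(marker) - 1
    tail = b""
    with open(path, "rb") as binary:
        while True:
            chunk = binary.read(chunk_size)
            if not chunk:
                return False
            window = tail + chunk
            if marker in window:
                return True
            tail = window[-overlap:] if overlap else b""


if __name__ == "__main__":
    import sys
    for name in sys.argv[1:]:
        print(name, contains_marker(name))
```

A hit only says the string is present somewhere in the file; it does not tell you which compiler version or flags were used.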
Convincing your customer to recompile is sometimes difficult but will most of the time
yield performance gains. Fundamentally, the quality of the instructions executed
by the processors is key. Using the -fast flag (it is automatically expanded to a set of
platform-dependent flags) is a good place to start.
Well, that's it for today.
See you next time in the wonderful world of benchmarking.
Nov 17 2005, 09:47:00 AM PST Permalink
Welcome to BM Seer and Solaris Performance Methodology
Good morning all;
As it is lightly snowing this morning in St Charles (outside of Chicago),
let me give a warm welcome to a new blogger : BM Seer .
You will find there all the latest news on Sun benchmark results.
Now, I will start today a new series on my performance analysis methodology.
I called this methodology the ASTROLABE. ( If you do not know what an
astrolabe is, check out this web site)
The intent of this approach is to be SIMPLE and POWERFUL.
This methodology has seven sections and I will present the first one today.
The ASTROLABE performance analysis methodology.
Section one : Control what's running
You may be surprised, but most of the customers I am seeing in the benchmark center
do not really control what's running in their environments.
Top three questions to answer :
1- What are the applications running in this environment
and their main characteristics ?
2- What are the main data streams and what transport mechanisms
are used (tcp, udp ...) ?
3- Most importantly, what is running that we do not know about ?
A way to detect this is to run this simple DTrace audit script :
(Note that this script is Solaris 10 zone aware... )
-----
dtrace -n 'proc:::exec{printf("%s execing %s, , uid/zone = %d/%s\n",execname,args[0],uid,zonename)}'
-----
The main issue that this script will uncover is runaway shell scripts.
They may use a very valuable chunk of your system resources.
Short-lived applications can also be uncovered this way.
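On a busy system the audit script prints a lot of lines, so it helps to aggregate them. Here is a minimal Python sketch that counts exec events per (parent, program, zone) from the saved output. The regular expression is written against the printf format of the one-liner above; the function name, the sample lines and the column prefix in them are hypothetical.

```python
import re
from collections import Counter

# Matches the printf format: "<execname> execing <path>, , uid/zone = <uid>/<zone>"
EXEC_LINE = re.compile(r"(\S+) execing (\S+?),[\s,]*uid/zone = (\d+)/(\S+)")


def count_execs(lines):
    """Count exec events per (parent process, executed program, zone)."""
    counts = Counter()
    for line in lines:
        match = EXEC_LINE.search(line)
        if match:
            parent, program, _uid, zone = match.groups()
            counts[(parent, program, zone)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample of captured output.
    sample = [
        "  0  1234  exec_common:exec sh execing /usr/bin/ls, , uid/zone = 100/global",
        "  1  1234  exec_common:exec sh execing /usr/bin/ls, , uid/zone = 100/global",
        "  0  1234  exec_common:exec cron execing /usr/bin/sh, , uid/zone = 0/web01",
    ]
    for key, n in count_execs(sample).most_common():
        print(n, key)
```

A parent that shows up thousands of times in a short capture is a good candidate for a runaway script.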
A hint : if the total cpu usage reported by prstat is lower than
the total cpu usage reported by vmstat, you should worry about these two issues.
In more than 80% of the customer workloads we analyzed, performance benefits are
achieved by tuning the software stack and the customer applications
in particular, not by tuning Solaris.
See you soon in the wonderful world of benchmarking....
Nov 16 2005, 08:32:00 AM PST
Permalink
Feel free to send me your questions by email to benoit@sun.com
Part 3
Part 4
Part 5
Part 6
Part 7
See you soon in the wonderful world of pop ...
Nov 03 2005, 11:04:54 AM PST Permalink
DTrace deep dive in Southern California
Feel free to send me your questions by email to benoit@sun.com
Please find below the two DTrace presentations.
Dtrace concepts Dtrace scenarios
Let me know if our discussion was useful.
See you soon in the wonderful world of Solaris 10 ...
Oct 18 2005, 06:48:38 PM PDT Permalink
Solaris 10 deep dive in Santa Clara,CA
For all the attendees, thank you so much for coming and staying with us all day. Presenting an operating system is
not an easy endeavor and your patience and outstanding remarks were very much appreciated !
My presentation is available here in pdf format. Enjoy !
Also, send me an email at benoit@sun.com if you have any questions that could not be answered today.
Please also check Bob Netherton and Linda Kateley weblogs for their latest update and presentations.
Sep 08 2005, 08:54:03 PM PDT Permalink
DCSS in Vegas - Feedback 2
More feedback from the DCSS in Vegas, where I presented this morning on Solaris 10 performance,
covering various topics to help you tune your environment, such as the impact of FireEngine,
ptools, libumem, mpss and so on. A great crowd of 160 partners, very much motivated to get the
very best from the best OS...
Now, the highlight of the day was the one-man show of Brian Wilson, one of the distinguished
engineers who make Sun an exciting workplace.
If you need to close a deal and the customer is questioning Sun's strategy, bring him in and the
PO will follow ...
You know the food pyramid ? Here is the BW pyramid, composed of all the components of the IT
infrastructure in the proper order, from network to business processes and ROI.
Great stuff. Need more on this ? Let us know....
Aug 17 2005, 08:16:10 PM PDT
Permalink
DCSS in Vegas - Feedback 1
Hi all;
Attending and presenting @ Sun Microsystems Data Center Summit in Vegas this week.
Great to see all these familiar faces...
This morning was marked by a bright keynote presentation by Rich Napolitano. Finally an executive that got the point : products and technology are second on the list of success. A big first is about sales force discipline, sales tactics and a reminder of the ONE SUN attitude.
Thanks Rich for your sane back to basics reminder !
benoit
Aug 16 2005, 12:03:41 PM PDT
Permalink
OpenSolaris - This is the day !

This is the day !

And to say it again :

This is the day !
Details at :
http://www.opensolaris.org/os/about/
Yours,
MrBenchmark
Jun 14 2005, 03:34:52 PM PDT Permalink
StarOffice8 beta was good to us, as it did not freeze like it did when I was reviewing the slides the day before. Just a couple of complaints about my French accent. Sorry about that, friends. I never learned English at school (only German and Italian).
See a great feedback on the bootcamp at http://www.offramp.org/~jss/afaik/
I will probably publish the DTrace performance scenarios on this line starting next week after my 3 days trip to Oregon to meet key customers there...
May 19 2005, 08:21:08 PM PDT Permalink
Oracle 10g on Solaris 10
Hidden parameters to optimize Oracle 10g on Solaris 10
You may have recently installed Oracle 10g on Solaris 10 and wandered
into the wonderful world of Oracle hidden parameters. Every time Oracle
produces a new vintage of the unbreakable database, we get a bulkload
of new mysterious parameters. To the Oracle DBA's eye, some of them have
a very explicit name (_lgwr_async_io). Some of them have names directly
extracted from a Martian dictionary (see _kghdsidx_count).
Now, of course, your noble intent is to do tuning, not debugging. But what
if you get a very sexy
"ORA-03113: end-of-file on communication channel"
on your first 1000-user attempt ?
Well, looking into the Oracle Net Dispatcher log, you will see a helpful :
"NS Primary Error: TNS-12535: TNS:operation timed out
NS Secondary Error: TNS-12606: TNS: Application timeout occurred"
So you call Oracle and they tell you : This is a bug, Sir. Please go
into sqlnet.ora and do not specify the SQLNET.INBOUND_CONNECT_TIMEOUT parameter.
One problem fixed... and fixed only by giving up a documented feature that you should
not use. Great start.
Starting the workload again and now you observe some FULL TABLE SCAN. Oops...
I know how to fix this one and here are the "create index" statements.
Unfortunately, the unbreakable database send you a very rude
ORA-00600 [kcbgtcr_5], or ORA-00600 [kcbgcur_3] error message.
Good thing this young lady from oracle had the coolest voice in the world
so it not a problem to call again. And a certain John answers the phone...
Excuse me, may I speak with Virgina ?...ok, I'll wait.
Yes, this is a bug again (3392439) and to fix it , just type :
"ALTER SYSTEM FLUSH BUFFER_CACHE" . Interesting... or you can put this in
your pfile "_db_cache_pre_warm=false" . Oracle is easy.
(By the way, some more 600 errors can occur on Oracle 10g for Solaris x86
and the previous parameters do not fix them. You will need the very
entertaining "_enable_NUMA_optimization = FALSE" to keep going...)
Here we are... my 1000 users are running.
Looking at Statspack and system statistics, I notice a lot of pressure on
the shared pool and latch contention.
First, I made sure I was using ISM with "_use_ism_for_pga = true". Yep...
Then I discovered that we can now segment the shared pool into multiple separate
zones, each protected by its own latches. How do you do this?
Just say "_kghdsidx_count = 4" and you will get four of those. The maximum
is apparently seven. No idea why... And I still cannot find this Martian dictionary.
And running again... but Oracle is still singing the latch contention hymn.
Could I have a high level of contention on certain blocks ?
To find the culprit, I queried V$LATCH_CHILDREN for the address and joined it
to V$BH to identify the blocks protected by this latch (doing so will show all
blocks that are affected by the warm block).
Two ways to fix this:
- If this is on an index (use DBA_EXTENTS to check for this common case),
use a reverse-key index.
- If not, set _db_block_hash_buckets to the prime number just larger than twice
the number of buffers.
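The second fix above asks for "the prime number just larger than twice the number of buffers". A minimal sketch of that arithmetic follows; the function names and the sample buffer count are mine, chosen for illustration only, and nothing here talks to Oracle.

```python
def is_prime(n: int) -> bool:
    """Trial division; plenty fast for numbers the size of a buffer count."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def hash_buckets_for(buffers: int) -> int:
    """Smallest prime strictly greater than 2 * buffers."""
    candidate = 2 * buffers + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate


if __name__ == "__main__":
    # 50000 buffers is a made-up example, not a figure from my system.
    buffers = 50000
    print(f"_db_block_hash_buckets = {hash_buckets_for(buffers)}")
```

Feed it your own buffer count and paste the resulting line into the pfile, at your own risk as with every hidden parameter.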
Do not forget that you must have at least one LRU latch for each database writer.
You can increase their number with a very elegant "_db_block_lru_latches=xx".
Just tell me why this is undocumented when it appears to be an absolute best practice?
And here I am, running again. Now that I fixed the latch issue, the contention
has moved to the log writer. No surprise.
A new feature of Oracle 10g is log parallelism, which you can obtain with:
_lgwr_async_io=false
_log_parallelism_dynamic=true
and the tuning of _log_parallelism_max
Looking further into this, I found that it does not provide full parallelism.
And because this is not a 24x7 production system, it looks like you can also
do a really, really exciting:
_log_private_parallelism=true
(Common sense would have suggested _log_parallelism_private=true, but these
Oracle engineers like poetry too...)
Oracle did not crash (unbreakable, right?) and I am running as fast as ever.
I realized later that I really did not need to update v$pga_advice all the time
(_smm_advice_enabled=false) or enable auto-tuning of undo_retention
(_undo_autotune=false), as I really need these CPU cycles for my transactions
and not for the Oracle kernel.
Finally, here I am using the 21st century software jewel, DTrace,
and I realize that I am not using malloc() anymore but mmap(). Great!
But can I tune the mmap byte preallocation... oh, yes. Here is our final
undocumented pearl: _realfree_heap_pagesize_hint. Only 28 characters, what
do you think?
Unbreakable, yes ! Simple, not yet ....
May 10 2005, 09:07:44 AM PDT Permalink
out-cache results - Random IO – 8 kbytes – 2x32 Gbytes
SE3510 - Scalability analysis
This analysis shows the impact of concurrency on IOPS performance
as well as the scalability differences between RAID-1 and RAID-5. Please
find below the SE3510 results and charts for tests R1 to R5:
Observations: RAID-5 is on average 9% slower in read-only, up to 33% slower at 50% read, and 61% slower in a write-only situation. Scalability is good in all cases. The RAID-5 vs RAID-1 difference is stable in percentage for every IO pattern tested; however, the IOPS difference is proportional to the concurrency. For example, at 50% read, a difference of 2123 IOPS is observed between RAID-1 and RAID-5 at 64 threads, whereas this difference is only 1024 IOPS at 16 threads. If IO is one of the critical components of the architecture's performance, choosing RAID-5 over RAID-1 may already change the end-user experience. A performance note: RAID-1 random write raw performance, at almost 10000 IOPS, is outstanding.
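To see why a stable percentage gap produces a growing IOPS gap, here is a small back-of-the-envelope sketch. It assumes the 33% figure applies exactly at both thread counts, which is a simplification, so the RAID-1 and RAID-5 values it prints are rough implied estimates and not measured results. The function name is mine.

```python
def implied_raid1_iops(gap_iops: float, raid5_slowdown: float) -> float:
    """If RAID-5 is `raid5_slowdown` (a fraction) slower than RAID-1,
    then gap = raid5_slowdown * RAID-1 IOPS, so RAID-1 = gap / raid5_slowdown."""
    return gap_iops / raid5_slowdown


# Figures quoted in the observations for the 50% read pattern.
SLOWDOWN_50_READ = 0.33          # assumed constant across thread counts
GAPS = {16: 1024, 64: 2123}      # threads -> observed RAID-1 minus RAID-5 IOPS

if __name__ == "__main__":
    for threads, gap in sorted(GAPS.items()):
        raid1 = implied_raid1_iops(gap, SLOWDOWN_50_READ)
        raid5 = raid1 - gap
        print(f"{threads:>2} threads: RAID-1 ~{raid1:.0f} IOPS, "
              f"RAID-5 ~{raid5:.0f} IOPS, gap {gap} IOPS")
```

The gap roughly doubles between 16 and 64 threads because the throughput it is a percentage of roughly doubles too.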
Please remember that the intent here is not to compare different IO subsystems, but to understand how the different RAID algorithms compare.
We already noticed that this is STORAGE DEPENDENT. In fact, while the RAID level does not affect SE6120 performance in-cache, it does make RAID-5 20% slower on the SE3510.
What about the SE9980 ?
Observations: When the working set fits in the cache, RAID-1 and RAID-5 deliver a similar level of performance. However, the central cache architecture causes more variability on the SE9980 compared to the other IO subsystems.
Next on this blog, out-cache results.
When we do not fit in-cache anymore, does the behavior change ?
Apr 18 2005, 08:47:28 PM PDT Permalink