Solaris, Performance, Stuff. Scott MacDonald

Friday Mar 06, 2009

Okay, so it's time for a rant. One of the most common performance scenarios we face is comparative performance, i.e. this system used to run at X/sec, now it runs at Y/sec, or alternatively, host A is processing N ops/sec where host B is at M ops/sec, but they are identical systems.

Oh really?

If there is a difference in performance, then they are *NOT THE SAME*. Period.

This is just logic, folks. You might not know where the differences are, but they are still there.

My point here is that I often have to deal with people telling me that no changes have been made to a system, or that two hosts are identical, yet they are seeing a performance degradation. By denying a change or difference (and it's common for people to refuse or avoid investigating in this direction) you are specifically avoiding the area where the problem must lie!

Now, such a difference could be a very subtle thing, such as load pattern, number of users, timings, etc, rather than an obvious patch/upgrade/whatever. There is often resistance to investigating the actual workload characteristics because most people simply don't know them, and it can be a lot of work to find out if the end user has no handle on what their systems are doing. They want the magic bullet that will just make things go faster, but without an understanding of the workload you're guessing at best. Starting off an investigation from the right mindset makes all the difference, and as I've said already on this blog - context is critical, otherwise performance statistics are just numbers.

I feel better now.

Monday Nov 10, 2008

While I realise that much has been written and blogged on the subject (a favourite can be found here), I want to add my own voice to the choir regarding context when analyzing data, and performance statistics in particular. When first starting to investigate a performance degradation, the questions we ask rarely cover any kind of monitoring or statistical data until we have framed the circumstances or details of a problem, such as:
  • How much slower than normal/expected is it?
  • How often is it seen?
  • How is this seen or experienced by end users?
  • When did it start?
  • What changes have been made to hardware, software or workload?
  • Etc, etc.
The answers to these types of questions should be the very first things collected before trying to interpret any data. A modern OS such as Solaris is a very complex dynamic system with many inter-related components. While it can be possible to make certain assumptions about some data (a high average disk service time, for example), generally speaking statistics gathered without any context are just numbers. Here are a few examples of the sort of thing we would like to avoid:
  • "System is slow. Here is some data. Please analyze".
  • "sar/mpstat/vmstat report high %sys utilization".
  • "System is low on memory" / "Monitoring software sent an alert".
An understanding of OS and Kernel internals is required to properly investigate a performance issue and analyze statistical data. We do not expect most people to possess this knowledge as it is very specialised, but we do expect that fundamental problem definition details are provided such as those outlined at the beginning of this article.

Friday Nov 07, 2008

I thought I would share some details from a performance escalation I had a little while back, as it has since done the rounds on a few internal mail aliases and I think it illustrates some useful points about the coolthreads servers (specifically the T2000, but still relevant to the others). This case revolves around comparative performance between a Sun T2000 server and the customer's existing V490 test system, but I have handled and assisted on several other similar cases involving comparisons with both Sun and non-Sun servers (such as Xeon based boxes).

The problem:

The customer was preparing to deploy a large in-house developed Java application onto a set of 12 x T2000 servers running BEA WebLogic and was starting to run some load tests.

What they found from their testing was a total transaction time of around 1.5s on the T2000 compared to 0.5s on their initial testbed V490 server. As components were added into the application layers, this transaction time then went up to an average of 6.6s - which was considered a show-stopper by the business.

Some initial analysis and discussion with the application developers revealed that the testing was being done with a single-threaded load, which was not representative of the end solution, but they believed that if the server could not cope with a sequential load then scaling it up would make things worse - which initially seems like a very reasonable conclusion (and was pushing the customer into a distinct panic mode).

After some further discussions, the customer agreed to run some tests for me to prove the point I was trying to make about these servers being designed for parallel scalability rather than single-threaded horsepower. Their developers quickly put together some simple code that would run on their same software stack and would create 100 million Java objects, using either 1, 10 or 100 threads to handle the work. Here are the timing results from performing this test on the older V490 platform and a newer T2000 server.

Task Description                                     V490 Time   T2000 Time
Create 100 million objects sequentially                   400s         663s
Create 10 million objects/thread with 10 threads          406s          87s
Create 1 million objects/thread with 100 threads          404s          41s

As you can see from these simple results, a single-threaded / sequential load will indeed perform somewhat slower on a T2000 server compared to a similar non-coolthreads system. For a single thread (in this test) the T2000 was roughly 1.7x slower (663s against 400s), but for 100 threads (and the same overall amount of work) the T2000 was roughly 10x faster (41s against 404s)!
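For anyone who wants to check the arithmetic, here is a small sketch that turns the timings in the table into ratios. The class and method names (SpeedupTable, speedup) are just my own illustration; the only real data is the six timings from the table.

```java
public class SpeedupTable {

    // Ratio of V490 time to T2000 time.
    // Above 1.0 means the T2000 finished sooner; below 1.0 means it took longer.
    static double speedup(double v490Seconds, double t2000Seconds) {
        return v490Seconds / t2000Seconds;
    }

    public static void main(String[] args) {
        int[] threads = {1, 10, 100};
        double[] v490 = {400, 406, 404};   // seconds, from the table
        double[] t2000 = {663, 87, 41};    // seconds, from the table

        for (int i = 0; i < threads.length; i++) {
            System.out.printf("%3d thread(s): V490 time / T2000 time = %.2f%n",
                    threads[i], speedup(v490[i], t2000[i]));
        }
    }
}
```

The single-threaded ratio is 400/663, or about 0.60 (the T2000 takes about 1.66 times as long), while the 10 and 100 thread ratios are 406/87 (about 4.67) and 404/41 (about 9.85) in the T2000's favour.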

We often get calls from customers or partners that they are seeing slower performance from a particular program on a coolthreads (T1/T2 equipped) server, especially when compared to something like a competitor Xeon-based solution. For some limited single-threaded applications this is indeed the case, but try running a hundred (or a thousand!) copies of it at the same time and see what happens...the Xeon server will top out after the first few, but the T1/T2 just keeps on going.
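If you want to try this kind of comparison yourself, here is a minimal sketch of the sort of test described above. To be clear, this is not the customer's code (I don't have it to share) and it ran inside their WebLogic stack, which this does not. The class name ObjectCreationTest, the method names and the default object count are all my own assumptions; it simply splits a fixed number of object creations evenly across 1, 10 or 100 threads and times each run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ObjectCreationTest {

    // Creates 'count' small objects and returns how many were created.
    // Storing each one in a small array is an attempt to stop the JIT
    // removing the allocation; it is not a guarantee. A serious benchmark
    // would use a proper harness such as JMH.
    static long createObjects(long count) {
        Object[] ring = new Object[16];
        long created = 0;
        for (long i = 0; i < count; i++) {
            ring[(int) (i & 15)] = new Object();
            created++;
        }
        return created;
    }

    // Splits 'total' evenly across 'threads' workers and returns the number
    // of objects actually created (any remainder from the division is dropped).
    static long run(int threads, long total) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            final long perThread = total / threads;
            List<Future<Long>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                Callable<Long> task = () -> createObjects(perThread);
                results.add(pool.submit(task));
            }
            long created = 0;
            for (Future<Long> f : results) {
                created += f.get();   // waits for that worker to finish
            }
            return created;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Assumed default of 1 million so a quick run finishes fast;
        // pass 100000000 as the first argument to match the original test size.
        long total = args.length > 0 ? Long.parseLong(args[0]) : 1_000_000L;
        for (int threads : new int[] {1, 10, 100}) {
            long start = System.nanoTime();
            long created = run(threads, total);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threads + " thread(s): created " + created
                    + " objects in " + elapsedMs + " ms");
        }
    }
}
```

The total amount of work is the same in every run; only the number of threads sharing it changes, which is exactly what makes the comparison between a few fast cores and many slower hardware threads meaningful. Don't expect the timings to resemble the table above, since they depend heavily on the JVM, heap settings and hardware.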

Wednesday Apr 30, 2008

Welcome to my shiny new blog!

Having been nudged several times recently to start blogging my experiences supporting Solaris and specifically diagnosing performance problems, I finally got around to setting this up. I'm also keen to hear from others working in this area, so if you have nuggets of wisdom please do get in touch.

I am a Lead Support Engineer working in the UK Solution Centre. I spend my time fixing Solaris issues (and sometimes, under duress, the occasional hardware issue), but I have a particular interest in the diagnosis of performance problems.

I intend to use this blog to share some of my experiences, methods, and tools in the vague hope that something might be useful to someone. Either that or get flamed badly by my colleagues ;-)

Thanks,

Scott.