Solaris, Performance, Stuff. Scott MacDonald

Monday Nov 10, 2008

While I realise that much has been written and blogged on the subject (a favourite can be found here), I want to add my own voice to the choir regarding context when analysing data, and performance statistics in particular. When we first start to investigate a performance degradation, the questions we ask rarely touch on monitoring or statistical data until we have framed the circumstances and details of the problem, such as:
  • How much slower than normal/expected is it?
  • How often is it seen?
  • How is this seen or experienced by end users?
  • When did it start?
  • What changes have been made to hardware, software or workload?
  • Etc, etc.
The answers to these types of questions should be the very first things collected before trying to interpret any data. A modern OS such as Solaris is a very complex dynamic system with many inter-related components. While it is sometimes possible to make certain assumptions about some data (a high average disk service time, for example), generally speaking statistics gathered without any context are just numbers. Here are a few examples of the sort of thing we would like to avoid:
  • "System is slow. Here is some data. Please analyze".
  • "sar/mpstat/vmstat report high %sys utilization".
  • "System is low on memory" / "Monitoring software sent an alert".
An understanding of OS and kernel internals is required to properly investigate a performance issue and analyse statistical data. We do not expect most people to possess this knowledge, as it is very specialised, but we do expect fundamental problem definition details, such as those outlined at the beginning of this article, to be provided.
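To make the point about context concrete, here is a minimal sketch in Python of why a bare figure such as "high %sys" says little on its own. The function name `interpret`, the sample values and the three-standard-deviation threshold are all assumptions made up for illustration; they are not taken from any Solaris tool or support procedure.

```python
from statistics import mean, stdev


def interpret(value, baseline, threshold=3.0):
    """Classify a single reading against earlier readings of the same metric.

    `baseline` holds readings taken while the system was behaving normally.
    The threshold of 3 standard deviations is an arbitrary illustrative choice.
    """
    if len(baseline) < 2:
        # Without history there is nothing to compare against:
        # the number is just a number.
        return "no context"
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Perfectly flat baseline: any difference at all is a change.
        return "normal" if value == mu else "abnormal"
    z = (value - mu) / sigma
    return "abnormal" if abs(z) > threshold else "normal"


# The same hypothetical 40% sys reading, seen with different context.
print(interpret(40.0, []))                    # no baseline supplied
print(interpret(40.0, [38, 41, 39, 42, 40]))  # system that always runs near 40%
print(interpret(40.0, [5, 6, 4, 5, 6]))       # system that normally runs near 5%
```

The identical reading comes back as unknown, unremarkable or alarming depending only on what "normal" looks like for that system, which is exactly the information the problem definition questions are meant to supply.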
Comments:

Once Google buys Sun, you guys should get access to their crash data reporting software and testing tools. That will make your lives a lot easier.

Posted by Arrack Osama on November 10, 2008 at 01:56 PM GMT #
