While I realise that much has been written and blogged on the subject (a favourite can be found here), I want to add my own voice to the choir on the importance of context when analysing data, and performance statistics in particular.
When we first start investigating a performance degradation, the questions we ask rarely touch on monitoring or statistical data until we have framed the circumstances and details of the problem, such as:
- How much slower than normal/expected is it?
- How often is it seen?
- How is this seen or experienced by end users?
- When did it start?
- What changes have been made to hardware, software or workload?
- Etc, etc.

By contrast, problem reports often arrive with little or no such context:
- "System is slow. Here is some data. Please analyze".
- "sar/mpstat/vmstat report high %sys utilization".
- "System is low on memory" / "Monitoring software sent an alert".
