Wednesday Jul 29, 2009
I was pondering why a large SGA segment was made up of 4M pages rather than 256M pages and decided to experiment.
As simple a bit of code as can be to create an ISM segment:
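#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    size_t sz;
    int sid;
    void *a;

    if (argc < 2) {
        fprintf(stderr, "usage: %s size_in_MB\n", argv[0]);
        exit(1);
    }
    /* Segment size in bytes; size_t avoids int overflow at 2G and above. */
    sz = (size_t)atol(argv[1]) * 1024 * 1024;

    /* Use our pid as the key; 0600 so the owner is allowed to attach. */
    if ((sid = shmget((key_t)getpid(), sz, IPC_CREAT | 0600)) == -1) {
        perror("shmget failed");
        exit(1);
    }
    /* SHM_SHARE_MMU requests an ISM (intimate shared memory) attach. */
    if ((a = shmat(sid, (void *)0, SHM_SHARE_MMU)) == (void *)-1) {
        perror("shmat failed");
        exit(1);
    }
    /* Leave time to inspect the mapping with pmap -xs. */
    sleep(60);
    return (0);
}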
On a system with UltraSPARC IV+ CPUs (Panther), I found that a 1G ISM segment was still built from 4M pages by default, according to pmap -xs. A little kernel code reading showed that the decision is made in map_pgszism, which looks like this:
map_pgszism(caddr_t addr, size_t len)
{
        uint_t szc;
        size_t pgsz;

        for (szc = mmu_page_sizes - 1; szc >= TTE4M; szc--) {
                if (disable_ism_large_pages & (1 << szc))
                        continue;

                pgsz = hw_page_array[szc].hp_size;
                if ((len >= pgsz) && IS_P2ALIGNED(addr, pgsz))
                        return (pgsz);
        }

        return (DEFAULT_ISM_PAGESIZE);
}
A little poking around with mdb shows the value of disable_ism_large_pages to be 0x36. In the common code it is set to 0x2, so some platform specific code must be resetting this value. Poking disable_ism_large_pages to 0x2 with mdb meant the pages for the ISM segment were now 256M in size, as reported by pmap. Not recommended as a spur of the moment action for your production E25K running Oracle.
disable_ism_large_pages gets set in hat_init_pagesize as an OR of disable_large_pages, which is set to a shifting and bitmasking perturbation of mmu_exported_pagesize_mask. A few more hops lead to bugid 6313025, which describes why 32M and 256M pages were turned off for the Panther CPU: executing application code from the larger (>4M) pages caused nasty things to happen. The bug is dated 2005 and I had a very distant memory of it, but it was worth tracking down the specifics.
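To make the two mask values a little more concrete, here is a minimal sketch that decodes such a bitmask into page sizes. It assumes the usual sun4u page size codes (0 to 5 standing for 8K, 64K, 512K, 4M, 32M and 256M); the function name decode_mask is just something made up for illustration.

```shell
#!/bin/sh
# Decode a disable_*_large_pages style bitmask into page size names.
# Assumption: page size codes 0..5 map to 8K, 64K, 512K, 4M, 32M, 256M.
decode_mask() {
    mask=$(( $1 ))
    szc=0
    out=""
    for name in 8K 64K 512K 4M 32M 256M; do
        # Bit szc set means that page size is disabled.
        if [ $(( mask & (1 << szc) )) -ne 0 ]; then
            out="$out $name"
        fi
        szc=$(( szc + 1 ))
    done
    echo "disabled:$out"
}

decode_mask 0x36   # value seen on the Panther system
decode_mask 0x2    # value in the common code
```

Under that assumption, 0x36 rules out 32M and 256M, so the loop in map_pgszism falls through to 4M, while 0x2 leaves 256M available.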
Wednesday Jul 29, 2009
Chris and I had a short IM exchange yesterday regarding a customer visit I made on Monday. It's a customer we have both worked with a lot over the years.
One of the significant contributory factors of the reported problems is that a line of the form
set ssd:ssd_max_throttle=32
is missing from /etc/system across the estate attached to a particular SAN. Common problem, easy diagnosis, etc.
I made the comment that a co-worker of ours in a different part of the organization would have picked up on the need to address the underlying IT governance issue.
I liked this definition from Wikipedia :
Specifying the decision rights and accountability framework to encourage desirable
behaviour in the use of IT.
and the cause of the cause of the cause of the cause of ssd_max_throttle not being set was in the structure of the established decision rights and accountability framework.
Still, far easier to stick ssd_max_throttle=32 in /etc/system and leave these battles to others this time round. However, an awareness has been sparked.
Thursday Nov 20, 2008
On a typical Wednesday afternoon in the SunRay lab in the Computer Science Department at the University of Aberystwyth, Denial of Service attacks will at best result in an audience with the Head of Department and at worst exclusion from one's degree scheme. Yesterday afternoon was different in that trying to exclude your peers by panicking or hanging the SunRay server was positively encouraged!
Thanks to those students who turned up and, in exchange for cakes, did "stuff" on the department's new T5140 SunRay server, which was running build 97 of OpenSolaris. The aim of the afternoon was in part educational for the students, in part a load test to identify possible configuration improvements and in part to see if there were any obvious performance RFEs we could chase down.
Much respect to Dafydd who managed to translate "please just log in and out, repeat until" into running this bit of code in a bash shell script
:(){ :|:& };:
I suspect he got the idea from here; however, it did make the machine hang. The lesson learned is to pay attention to project based resource controls and, if/when we do this again, for me to be specific that there will be a couple of phases and to ask that the fork bomb/malloc bomb type activity be left till the end of the session.
The kernel tunable maxuprc would have stopped him if he got beyond 29995 processes, but with each bash shell at around 3MB, we would need around 90GB of available virtual memory before this limit stopped him. A value of 1000 should not stop "normal" activities, but would also stop a Dafydd after too much sugar.
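The arithmetic behind that estimate is simple enough to sketch. The helper name mem_needed_mb is made up for illustration, and the 3MB per bash shell is the rough figure quoted above, not a measured value.

```shell
# mem_needed_mb NPROCS MB_PER_PROC
# Virtual memory (in MB) a fork bomb would consume before hitting the process limit.
mem_needed_mb() {
    echo $(( $1 * $2 ))
}

mem_needed_mb 29995 3   # limit quoted above: 89985 MB, i.e. around 90GB
mem_needed_mb 1000 3    # suggested limit: 3000 MB, i.e. around 3GB
```

With a limit of 1000 the worst case drops to around 3GB, which a server of this size should be able to absorb.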
In a similar vein, Dave Barnard came up with another simple trick:
#include <stdio.h>
#include <stdlib.h>

int main()
{
    int i = 0;
    while (1)
    {
        /* Allocate 1MB per iteration and never free it. */
        char *t = (char *)malloc(1024 * 1024);
        if (t == NULL)
        {
            printf("end\n");
            return 1;
        } else {
            printf("%d MB\n", i);
            i++;
        }
    }
    return 0;
}
and proceeded to leak over 6GB of memory on an 8GB physical + 5GB swap system. While the prospect of the wrath of Prof. Price is typically more effective than resource limits, some probably need to be put in place now the little dears have a taste for this.
rcapd is probably the right way to go here and put per user limits in place via projects. In addition the amount of swap has been doubled and we are going to do some memory usage monitoring to determine if more physical memory might be useful.
The T5140 itself was never more than 20% busy in terms of CPU utilization. Memory of various kinds was the main limitation. We also observed that NetBeans 6.1 was slow for interactive use, which needs to be followed up. NetBeans 6.5 is out, so a 1st step is to see if it has the same problem. We also found that, when under memory pressure, some SunRay sessions would exit, which also needs to be followed up given a bit more time.
Friday Oct 24, 2008
I have come across quite a few customers over the last few years who have this line in /etc/system
set c2audit:audit_load = 1
Only one set of administrators knew why it was there, what it did and how they used the output. The rest came up with a vague "we need it for security and auditing what root does" or "it is part of the standard build". Most admins did not know it was set or why, and a bit of questioning suggests that no one in the organisation has ever looked at the log files or knows what would trigger a look at the log files.
The impact on performance and scalability is made much worse in Solaris 10 by the bug 6388077 (make sure you have at least 127127-01, which was released over a year ago), but typically it is not doing what you think it is in terms of useful auditing and acts as an inhibitor to scalability. The more CPUs a system has, the greater the overhead.
lockstat -C -s 50 sleep 10
can show some very interesting stacks!
Awareness of security is good, but my experience is that this feature has been enabled without consideration of how to use the output or of its impact on performance. In light of this, is bsmunconv your friend?
Sunday Aug 10, 2008
I have been asking around, but may not have yet asked in the right place, so here goes with a wider audience!
A University I do some work with want to load test their Sun Ray setup before going live. They had some performance problems with a lab full of students logged in and want to avoid this when they put in a shiny new T5xx0 series server, so a load test before the start of term makes some sense.
Anyone got a pointer to a Sun Ray stress test harness or load generator? Comments very welcome.
Thursday Jul 24, 2008
I have been working with the MIS people at Edinburgh University and a consultant from Tribal, which has been much fun indeed. I learned a number of things and was reminded of a few more that my brain has chosen to put in long term storage.
- cron starts non-root processes with a nice value of 2, hence they will have a lower priority than jobs started on the command line or via SMF. The queuedefs man page explains more, but the syntax is arcane!
- Worth snooping traffic to and from the DNS server. It often shows up errors or performance opportunities in nscd.conf and resolv.conf, such as having enable-cache set to no for ipnodes.
- If any type of network latency is important, such as in ping-pong of packets between 2 clients, map out and understand where your firewall(s) sit(s) and benchmark without the firewall to get the scope of the impact. Firewalls are often an invisible (and hard to observe) component, so are often ignored.
- Turning TCP Nagle off via ndd is well worthwhile if ping-pong latency is a barrier.
Next race is on Saturday which is the Snowdon International Race.
Tuesday May 27, 2008
Kernel Crash Dumps are a point in time snapshot of the Solaris Kernel state. The aim is to allow post mortem analysis of the system state at the point the crash dump was taken. For system panics and hangs, the ability to look at the system state is the primary failure analysis tool and one of the reasons Solaris is as reliable as it is.
I think of system failures as a 2 dimensional problem. The interaction of data and code at the point in time of the failure can be analyzed with tools such as MDB which are designed for this type of post-mortem analysis.
Performance adds the 3rd dimension of time.
Autopsy is not commonly used as a tool for determining the root cause of individual productivity issues.
In a small subset of cases, poor individual productivity may be the result of a medical condition requiring a CAT scan (the medical version of a live Kernel Crash Dump). However, these cases are very rare and such techniques would only be used with a significant body of supporting evidence.
Kernel Crash Dumps are useful for a very small subset of performance cases. Specific performance problems rooted in memory shortfall caused by a memory leak would be one example, but these are quite rare in the big scheme of things and would need supporting evidence to use the Kernel Crash Dump approach.
I have come across a number of cases in the last few months where a crash dump has been requested and only one was possibly valid.
Before collecting the CAT scan equivalent of your system (with the associated cost) in the hope it shows up the cause of a performance problem, check the pulse, breathing and circulation 1st. If you do collect a live crash dump, make sure the supporting evidence and rationale are sound.
Friday May 02, 2008
A good friend of mine who is a Systems Engineer/Engagement Architect in the UK sent me a copy of a benchmark which his customer was using to assess the performance of various types of SPARC machines. While the benchmark is simplistic, the customer had a concern over its performance on a T5220, and any customer concern is valid.
The spirit of the customer benchmark was
#!/bin/ksh
i=0
while [ $i -lt 63 ]
do
./run2_slow &
echo Starting $i
i=`expr $i + 1`
done
time ./run2_slow
which calls
#!/bin/ksh
loop=0
while [ $loop -lt 1000 ]
do
bc <<E >/dev/null
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100
E
loop=`expr ${loop} + 1`
done
which executes in around 70 seconds.
Clive's version, which required 5 minutes of very simple coding changes
#!/bin/ksh
i=0
while [ $i -lt 63 ]
do
time ./run2_fast &
echo Starting $i
i=`expr $i + 1`
done
time ./run2_fast
calls
#!/bin/ksh
loop=0
while [ $loop -lt 1000 ]
do
n=0
((n=n+100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100
))
((loop=loop+1))
done
runs the same number of iterations on the same machine in 1.01 seconds.
For the slower version, dtrace -s /usr/demo/dtrace/whoexec.d shows a huge number of fork/exec sequences, where the script forks bc, which forks dc, and the counter using expr also requires calls to fork/exec. Less than 1% of the time spent in this script was actually calculation.
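The difference between the two loop counters can be sketched in isolation. This is not the customer's benchmark, just a cut-down illustration with made-up function names; it should behave the same in ksh, bash or any POSIX shell.

```shell
# Loop counter via an external command: one fork/exec of expr per iteration.
count_with_expr() {
    i=0
    while [ "$i" -lt "$1" ]; do
        i=`expr $i + 1`
    done
    echo "$i"
}

# Same loop using shell built-in arithmetic: no child processes at all.
count_builtin() {
    i=0
    while [ "$i" -lt "$1" ]; do
        i=$(( i + 1 ))
    done
    echo "$i"
}

count_with_expr 200
count_builtin 200
```

Both produce the same count; putting time in front of each call with a larger iteration count should show where the cost of the expr version goes.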
An interesting system level bottleneck did drop out where the text segment of libc was being faulted in as the process is being created as a result of a call to memcntl something like this
enoexec(5.11)$ truss -t memcntl /bin/true
memcntl(0xC4080000, 227576, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
Where you have many concurrent processes calling fork, you do end up with some lock contention in ufs_getpage like this
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Lock Caller
111484 18% 18% 0.00 1747392 0x30001293610 ufs_lockfs_begin_getpage+0xc8
nsec ------ Time Distribution ------ count Stack
512 | 5 ufs_getpage+0x7c
1024 | 26 fop_getpage+0x90
2048 | 43 segvn_faulta+0x114
4096 | 61 as_faulta+0x138
8192 | 131 memcntl+0x8d0
16384 | 711 syscall_trap32+0xcc
32768 |@@@ 14042
65536 |@@@@ 17764
131072 |@@@@ 17988
262144 |@@ 9247
524288 | 2818
1048576 | 3215
2097152 |@ 5994
4194304 |@@@@@@ 24221
8388608 |@@ 10884
16777216 | 3414
33554432 | 310
67108864 | 7
It would be interesting to try this on a T5220 with ZFS as a root filesystem, but where is the bottleneck? I would argue that there may be a little room for improvement in UFS, but that the benchmark is pathological. My experience of dealing with performance issues in the field over the last 6 years is that the large 15K/25K, then the T2000 and now the T5220 are very good indeed at exposing applications which don't scale. This is an example of where a simplistic benchmark can lead to incorrect, or at least very incomplete, conclusions about the underlying platform. The benchmark as the customer used it just measured fork/exec performance. You would not implement a business solution like that, or would you?
Getting a 70x improvement from changes to Solaris or the underlying hardware is going to be a significant challenge. A 70x speedup from application changes was viable in this case, and a little consulting help might not go amiss.
Tuesday Apr 29, 2008
In the voyage to apply the rational process to performance issues, this is useful insight.
For those of us working on performance issues in Services, the biggest hurdle we have to clear is getting a good definition of the problem the customer wants solved (BTW, "my system is slow" is not a problem definition we can work with), rather than knowing whether an arbitrary kstat counter is incrementing too fast. Once we get a solid problem definition to work on, we are at least 50% of the way there.
As Andy points out, in a multi-vendor world which requires a hippy holistic approach, the definition of the problem is key.
It also occurred to me that by definition a true Hippy would not rant, but go with the flow!
Thursday Feb 07, 2008
I spent a large part of a day a few weeks back on a bridge call with a large US customer for whom serious I/O performance issues developed overnight. They had made no changes to the application, platform, SAN or storage for months. The SAN checked out fine. They had very serious performance issues which came down to very high latency I/O across most LUNs.
At iostat level all I could see was high service times, and the storage vendor could only see low service times. As is par for the course, it became a finger pointing exercise as both sides dug deeper and deeper into their stacks of hardware and software.
With the customer's business in effect down (they are a web based business to a large extent), the political environment started to heat up (you could even feel the heat from 3,000 miles away). Eventually one of the storage vendor's engineers found that SCSI packets with "out of range" LBAs (Logical Block Addresses) were being sent to the array, which pointed the finger back at our platform. Some Solaris code reading resulted on my part, and I concluded that if the platform was generating out of range LBAs, then it would get recorded in the Illegal Request ssderr kstat, but we did not see that counter incrementing.
As a side remark, one of the customer's admins mentioned that they had seen similar I/O issues on a set of Wintel systems since installing a new set of HBA cards. I asked what the port addresses were for the HBAs generating the errant packets, and they were not from the 15K which had the performance problem. The call went silent as the picture unfolded, Sun and the storage vendor were thanked for their time and we left the call.
I wanted to really understand how Solaris behaves when out of range LBAs get generated, so I wrote some code to generate them. Please play with the code, but remember it uses USCSI, which bypasses the checks of the filesystem, etc., so don't use it on your production server, please.
So there is a risk in a SAN environment that other hosts can impact your mission critical business, and it becomes a real challenge to find root cause. This customer lost a number of hours of sales. The Solaris side has no visibility of the out of range LBA SCSI packets generated by other hosts. So one place it might be useful to use this code is to determine the effect on I/O latency of a workload if the back end storage has to handle "out of range" LBAs. In the case of our customer above, I suspect that resets were occurring within the array and this impacted performance. Should the storage be robust in terms of performance degradation to such requests, or is it just a good idea to limit the size and scope of each SAN?
Sunday Jan 20, 2008
Back in August I wrote an entry Too much /proc is bad for you!. I came across another command which needs to be used with care, fuser. It is a very useful command which finds and reports the processes which have a file open or files open in a file system. Like malt whisky or a fine wine, it adds value in moderation, but too much at one time and simple tasks start to take much longer.
Have a look at the code for the function dofusers. Put simply, it loops through every process in the process table, looking in turn at each file in the file table of each process and at the vnode backing each segment in its address space.
dtrace -n lockstat:::'{@[probefunc] = count()}' -c "fuser ."
shows in excess of 1 million locks acquired and released. While 1 million locks are no problem at all if uncontended, running multiple copies of fuser at the same time can add considerable contention around a number of key locks such as pidlock and various address space locks.
Running 1 fuser every few seconds would have little overall impact. Trying to run multiple copies of fuser and throwing in a few calls to ps at the same time tends to make the smtx column in mpstat light up.
Friday Jan 04, 2008
I was asked yesterday to look at a busy system with high system time. It's Solaris 9 on a big config 25K.
This output was the top of the lockstat -C -s 50 output.
-------------------------------------------------------------------------------
Count indv cuml rcnt spin Lock Hottest Caller
132614 59% 59% 1.00 199 blist_lock[8] bio_recycle+0x224
spin ------ Time Distribution ------ count Stack
1 | 186 bio_recycle+0x224
2 | 2335 bio_getfreeblk+0x4
4 | 4247 getblk_common+0x2bc
8 |@ 7190 bread_common+0x80
16 |@@ 11570 bmap_read+0x20c
32 |@@@@ 18285 ufs_directio_read+0x2e
64 |@@@@@ 25634 rdip+0x198
128 |@@@@@@ 28613 ufs_read+0x17c
256 |@@@@@ 22918 pread+0x28c
512 |@@ 9707
1024 | 1761
2048 | 157
4096 | 11
A bit of Solaris code reading led me from the stack above to question the value of bufhwm. I checked it out again on docs.sun.com to really understand what this value does. It's the high water mark, in KB, of the size of allocated buffers used for UFS indirect blocks, directories and other bits of metadata.
I went back to check some basic assumptions (always a good plan) and did an Explorer review. The following line is set in /etc/system:
set bufhwm=8000
I have no idea why it was set to 8000 on this system. I have seen it set many times on many systems and have not paid much attention to it.
8000 is proposed in many places as a reasonable value. I must admit I have never needed to suggest tuning this value, and my unconscious just assumed that it was a good idea because common wisdom said so, so I never commented when other people tuned it.
By default this value would be 2% of memory. This system had > 200GB, so it would default to around 4GB.
I expect 4GB would waste some memory, but then it's a high water mark. 8MB is far too small on this size of server, given that the buffer cache is used to store indirect blocks, directories, etc. from a set of filesystems near 2TB!
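As a quick sanity check of those numbers, here is a small sketch of the default calculation. It only encodes the "2% of memory" rule quoted above, with the result in KB to match the units of bufhwm; the function name default_bufhwm_kb is made up for illustration.

```shell
# default_bufhwm_kb MEM_GB
# 2% of physical memory, expressed in KB (the unit bufhwm is set in).
default_bufhwm_kb() {
    echo $(( $1 * 1024 * 1024 * 2 / 100 ))
}

default_bufhwm_kb 200   # a 200GB system: 4194304 KB, i.e. 4GB
```

Set against that, the 8000 in /etc/system is roughly 500 times smaller than what the system would have picked for itself.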
We can observe if buffer recycling is causing an issue using the following
echo "bfreelist$<buf" | mdb -k
echo "v::print -t struct var" | mdb -k
kstat -p -n biostats
and sar -b might also give some insight.
So the morals to repeat to myself include
- Turn off your unconscious mind when examining /etc/system. Don't assume any /etc/system setting is valid
- Never carry /etc/system tunables forward
- Put a comment in /etc/system if you set a value based on an attribute, like memory size, which has the potential to change, citing the assumption.
For various customers I have visited over the years, I leave comments of this form in /etc/system
# clive.king@sun.com 4/1/2008
# bufhwm value of 8000 assumes a memory size of 4gb and 600GB of UFS filesystem. revisit if size changes
# Check with kstat -p -n biostats before changing
set bufhwm=8000
At least if something goes wrong, then I can be emailed in capital letters.
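One way to start auditing an existing /etc/system along these lines is sketched below. It is only a rough check under a stated assumption: a set line counts as "commented" if the line directly above it starts with * or #. The function name uncommented_sets is made up for illustration.

```shell
# uncommented_sets FILE
# Print every "set ..." line that is not directly preceded by a comment line.
# Assumption: a comment line starts with * or # (optionally after whitespace).
uncommented_sets() {
    awk '
        /^[ \t]*set[ \t]/ && !prev_comment { print }
        { prev_comment = ($0 ~ /^[ \t]*[*#]/) }
    ' "$1"
}

# Hypothetical usage on a Solaris host:
if [ -r /etc/system ]; then
    uncommented_sets /etc/system
fi
```

Anything it prints is a tunable nobody has explained, which is a reasonable place to begin asking questions.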
Thursday Jan 03, 2008
For those who run clusters and also have busy public interfaces, this RFE is a real step forward in ce performance, even though it has been a while in the brewing. It is fixed in patches including 118778-11, but there are patches for S9 and x86 as well of course. The patch has been out for over a month now.
Some fiddling around in ce.conf will be required, in addition to applying the patch, so that only the private interfaces have ce_taskq_disable set. If you run a system with cluster and your networks are busy, this is a patch you need to understand to see if you will gain from applying it. The blueprint is well worth reading to understand what the various tunables for CE do.
Tuesday Sep 11, 2007
A couple of people commented that they would like a format of c0t0d0 rather than sd0 in the output of the I/O latency by colour script I posted a few weeks ago.
So, change the line
@[args[1]->dev_pathname] = lquantize(this->elapsed / MS, 0, 200, 50);
to
@[args[1]->dev_statname] = lquantize(this->elapsed / MS, 0, 200, 50);
The DTrace IO provider does not appear to provide the format cxtxdx (if you know different, please add a comment), so a bit of klunky post processing is needed.
#!/usr/bin/perl -w
use strict;
my %regex= ();
system("iostat -E | awk '/Soft/ { print \$1}' > /tmp/a");
system("iostat -En | awk '/Soft/ { print \$1 }' > /tmp/b");
system("paste /tmp/a /tmp/b > /tmp/c");
open(F, "/tmp/c") || die ("no /tmp/c");
while (<F>) {
my ($sd,$ctd) = split;
$regex{$sd} = $ctd;
}
close(F);
while (<>) {
for my $sd (keys %regex) {
s/$sd\b/$regex{$sd}/g;
}
print;
}
Then run as
pfexec dtrace -s ./io_latency_by_colour.d | ./sd_to_cxtxdx.pl
and enjoy the colours and the easier to read disk format. I shall have to talk to Brother Jon to determine if an addition to the IO provider makes sense, though the amount of work iostat goes through to get this format might explain why it's not already part of the IO provider.
Thursday Aug 30, 2007
I have been to a couple of customer sites this year where between ¼ and 1/6 of a 20-40 CPU system has been consumed by various types of monitoring. Over-monitoring is a contributor to scalability issues, which in turn cause the customer to introduce additional monitoring. If you run large systems by CPU/core count (this now includes the T2000), please read on. We introduce a few principles along the way for the purpose of summary.
One of the downsides of any monitoring is that it has overhead. This is often cited as the Heisenberg Principle, but should be called the Observer Effect when discussing computer systems. We shall put to one side a cat in a box and any associated philosophy. Let us agree that if you try to look at a system you will change it, and the deeper you look the more overhead you add. kstats and DTrace are a best case, but an overhead still exists. One enabled DTrace probe has a tiny overhead; 30,000 enabled probes have more. Few customers buy our systems to run monitoring software; most use our systems to support their business. So Principle 1: monitor what you care about now to solve today's business problem, not what you might care about in future.
Frequent use of /proc (event distribution measured in small numbers of seconds) is a big deal. The procfs filesystem has a goal of giving correct data at the point in time that an observation is made. When a choice needs to be made, Solaris architects choose correct and slower over faster and misleading. This trade-off means, in the case of process monitoring, you can't have both correct and performant. It's a lot of work to give a consistent picture of a process. A ps -ef on our Sun Ray server with in the region of 4000 processes causes a total of 3,226,292 kernel fbt probes to fire. /proc is not a lightweight interface, so we need to be selective about its use.
/proc is also very lock heavy; it needs to be to ensure it gives a consistent picture. Every /proc operation acquires and releases proclock and pidlock, among other locks. Being lock heavy means it's very easy to write applications which don't scale if /proc is used on a regular basis. I was involved in a hot customer issue where an early SF15K did not scale beyond 28 CPUs for a particular in-house Java application. The highly threaded application used the pr_fname field of the curprpsinfo structure to get the process name for every operation. The process name never changes and the developers had no idea a 3rd party native library used /proc in this way. The smtx column in mpstat lit up and lockstat -C showed the errant kernel call stack very clearly. It was an easy fix to the application once it was pointed out, with a huge drop in system time once the library cached the program name. Which leads to principle 2: If you need to do it often, don't do it often with procfs.
The Solaris kernel does not store the size of a process's address space at the current point in time. It's typically only needed by tools such as ps, so the /proc interface ps_procinfo??? calculates it as and when needed. This is a non-trivial operation, as each segment you see in the pmap output for a process needs to have the appropriate vnode locked, the size taken and the vnode unlocked. It is not unusual to have processes with in excess of 100,000 mapped segments. With 30 such processes, you can see that procfs on behalf of ps needs to do a lot of work to return the size of the address space of a process. Add a couple of ps instances running at the same time along with prstat and top, and quite a bit of your very busy multi-million dollar database server gets consumed.
The most successful approach to system monitoring that I have observed our customers employ is to monitor at a business level, such as user experienced response time. For some application types it takes a lot of work to get beyond measuring how often the help desk or CIO's phone rings and the word "slow" is uttered. Getting useful quantitative metrics which represent the user experience and provide a clear trigger when it degrades is a highly non-trivial task. This may explain why many organisations have relied on system level metrics such as user/system time ratio or even I/O wt time (don't even joke about using wt)! System level metrics typically only confuse the process of resolution if used out of context with the business problem. Taking system level metrics outside the context of the flow of data to and from the user (be it human or silicon based) typically leads to a colourful festival of explaining incrementing values of obscure kstats, rather than solving business problems and establishing actual root cause. Which leads to principle 3: Use only measures and metrics you fully understand.
I was asked by an account manager to pay a visit to a customer where staff investigating a SAP performance issue had been flown from the UK to Hong Kong to conduct network tests, with a result of "it's not the network". A morning of understanding the business need, the system (people, software, computers, networks) and the interaction between components, followed by 10 minutes of DTrace (truss would have done fine), showed the problem to be the efficiency of coding of a SAP script. I can't spell SAP, but following the flow of data between components, observing with the intent of answering the SGRT questions of where on the object and when in the lifecycle, has not let me down yet. Which leads to principle 4: Strong and relevant business metrics avoid wrong turns.
There is a school of thought that suggests a system
should have no system level performance monitoring enabled on the box
itself unless a baseline is being established, a problem is being
pursued or data is being collected for a specific capacity planning
exercise. For the most part I agree, based on the observation that
continual low level system monitoring, on balance, causes more
problems than it addresses. Storage monitoring products in particular
appear adept at consuming a cpu or 2.
One of the Sun support tools, GUDS, shares some of the same concerns. It's important to get the purpose of GUDS into perspective. GUDS is intended to capture as much potentially useful data as possible in one shot, such that we get useful data for most types of performance problem. Thus we accept that there is a non-zero overhead, and must allow for it in the analysis and be judicious in its use. Like any tool, it's the context in which you use GUDS that counts and can add great value. We (Global Performance V-team) often get asked to have a look at GUDS output and diagnose a situation where the problem definition is "system going slow". GUDS is a 1st pass tool for when you don't have a decisive problem definition or you want to gather baseline data.
GUDS adds load to the system it monitors. ::memstat in mdb, for example, takes 1 CPU and non-trivial numbers of cross-calls to walk each page in memory and determine what the page is used for. TNF and lockstat also add an overhead. GUDS, when used in the right context, with the right -X option, is highly effective. As the Welsh Rugby commentator and ex-international Jonathan Davies notes in a different context, it's the top 2 inches that count.
This leads us to principle 5: Use system level metrics only in the context of understanding a business-led performance issue.
I have mentioned the need for relevant business metrics, itself a huge and complex subject, to replace obtrusive system level monitoring as the trigger to investigate when a problem that impacts the business arises. If you set out on a journey it helps to know your objective; business level metrics assist in knowing when you are making progress and when you are done. It is also curious how often capacity planning gets confused with business metrics.
Back to /proc. Some useful one liners for finding
overhead that is just overhead.
Let's see what processes are using /proc over 60 seconds
dtrace -n procfs::entry'{@[execname] = count()}' -n tick-60s'{exit(0)}'
For an application proc_pig, let's find the userland stack which causes procfs to be called.
dtrace -n procfs::entry'/execname == "proc_pig"/{@[ustack()] = count()}' -n tick-60s'{exit(0)}'
One of the DTrace demo scripts is very useful for highlighting those monitoring processes which spawn many child processes.
dtrace -s /usr/demo/dtrace/whoexec.d -n tick-60s'{exit(0)}'
ps (or pgrep) is often used in scripts to determine if a child process, identified either by name or by PID, is still running. ps is a process monitoring lump hammer and its use in process state scripts is architecturally questionable, more so with the advent of SMF in Solaris 10.
So suppose you have a script that does something along the lines of
while true
do
    ps -ef | awk '{print $2}' | egrep "^${PID}\$" > /dev/null
    if [ $? != 0 ] ; then
        restart process
    fi
    sleep 10
done
to restart the process with PID $PID if it dies.
A step in the right direction in reducing the overhead is to use
ps -p $PID > /dev/null
in place of the ps line. If the process does not exist, then a non-zero return code is given, and when the process does exist, the overhead of traversing every process and calculating its address space size is avoided.
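Wrapped up as a function, the cheaper check might look like the sketch below. The name is_running is made up for illustration, and the restart step is left out because it depends entirely on the process being managed.

```shell
# is_running PID
# Succeed if the process exists, without walking the whole process table.
is_running() {
    ps -p "$1" > /dev/null 2>&1
}

# Example: the current shell is, by definition, still running.
if is_running $$; then
    echo "shell $$ is running"
fi
```

The monitoring loop then only needs to test is_running with the PID it cares about on each pass and restart the process when the test fails.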
To do the proper job, let SMF take the strain and manage it for you. The underlying contracts framework detects if a process dies and gets it restarted. One of the best places to learn about writing your own services is BigAdmin.
This leads us to principle 6: Use the right
architecture and tools for service management
Writing an SMF service to restart a process, while not a trivial task, is easier and less error prone than writing an efficient and correct shell script!
In summary, we have touched on a number of topics which relate to monitoring and the resultant impact on overall system performance. The obvious open question is: how do we generate meaningful business metrics beyond how long a batch job takes to run or how often the phone rings? It's a tough subject, as most situations are unique, and I would be interested in real examples of useful non-trivial business metrics in complex environments.
- Monitor what you care about now to solve today's business problem, not what you might care about in future
- If you need to do it often, don't do it often with procfs
- Use only measures and metrics you fully understand
- Strong and relevant business metrics avoid wrong turns
- Use system level metrics only in the context of understanding a business-led performance issue
- Use the right architecture and tools for service management
If you can think of any more from your experiences,
please drop me an email or add a comment.