|
Thursday December 30, 2004 I have started to investigate what the limiting factor on the Ultra 10 is when at 167 zones. Using svcadm, I have disabled the name service cache on every zone. That has let me boot all 171 created zones (I'll have to create more).
Here is the vmstat output of the "idle" Ultra 10:
kthr memory page disk faults cpu r b w swap free re mf pi po fr de sr dd dd f0 s1 in sy cs us sy id 0 0 0 4920928 729616 1 4 8 6 8 0 14 0 1 0 0 402 113 161 0 1 98 0 0 0 1773368 44480 0 15 15 0 0 0 0 0 0 0 0 409 325 619 1 6 92 0 0 0 1773320 44432 0 0 0 0 0 0 0 0 0 0 0 401 356 676 2 5 92 0 0 0 1773264 44376 0 1 0 0 0 0 0 0 0 0 0 399 319 518 2 6 92 0 0 0 1773384 44496 0 1 0 0 0 0 0 0 0 0 0 405 233 500 0 4 95
16253 root 7696K 7320K cpu0 19 0 0:00:01 4.4% prstat/1
15060 root 8568K 1720K sleep 59 0 0:00:09 0.1% sshd/1
15102 root 2936K 1984K sleep 59 0 0:00:01 0.0% bash/1
223 root 4032K 2040K sleep 59 0 0:02:11 0.0% nscd/41
8826 root 6384K 1608K sleep 59 0 0:00:29 0.0% svc.startd/13
7782 root 6424K 1552K sleep 59 0 0:00:31 0.0% svc.startd/12
6988 root 6304K 1464K sleep 59 0 0:00:33 0.0% svc.startd/12
5950 root 6304K 1440K sleep 59 0 0:00:33 0.0% svc.startd/12
4380 root 6256K 1384K sleep 59 0 0:00:32 0.0% svc.startd/12
2037 root 6448K 1240K sleep 59 0 0:00:33 0.0% svc.startd/12
29486 root 6360K 1208K sleep 59 0 0:00:33 0.0% svc.startd/12
29244 root 6224K 1216K sleep 59 0 0:00:34 0.0% svc.startd/12
29002 root 6568K 1208K sleep 59 0 0:00:32 0.0% svc.startd/12
27792 root 6288K 1256K sleep 59 0 0:00:33 0.0% svc.startd/13
26582 root 6288K 1264K sleep 59 0 0:00:33 0.0% svc.startd/12
14054 root 6224K 2840K sleep 59 0 0:00:11 0.0% svc.startd/12
23098 root 6096K 1304K sleep 59 0 0:00:34 0.0% svc.startd/12
13262 root 6432K 3088K sleep 59 0 0:00:11 0.0% svc.startd/12
21415 root 6240K 1368K sleep 59 0 0:00:33 0.0% svc.startd/14
21175 root 6240K 1304K sleep 59 0 0:00:34 0.0% svc.startd/12
20450 root 6224K 1304K sleep 59 0 0:00:34 0.0% svc.startd/12
20180 root 6304K 1296K sleep 59 0 0:00:34 0.0% svc.startd/13
17449 root 6144K 1384K sleep 59 0 0:00:34 0.0% svc.startd/12
16938 root 6160K 1320K sleep 59 0 0:00:33 0.0% svc.startd/12
16674 root 6224K 1336K sleep 59 0 0:00:34 0.0% svc.startd/12
9435 root 6096K 1464K sleep 59 0 0:00:41 0.0% svc.startd/13
12657 root 6304K 1880K sleep 59 0 0:00:12 0.0% svc.startd/12
6378 root 6296K 1336K sleep 59 0 0:00:39 0.0% svc.startd/12
3843 root 6304K 1336K sleep 59 0 0:00:41 0.0% svc.startd/12
3248 root 6160K 1336K sleep 59 0 0:00:39 0.0% svc.startd/13
3090 root 6360K 1336K sleep 59 0 0:00:39 0.0% svc.startd/13
1785 root 6176K 1464K sleep 59 0 0:00:39 0.0% svc.startd/12
583 root 6584K 976K sleep 59 0 0:02:15 0.0% svc.startd/12
7 root 6480K 392K sleep 59 0 0:03:09 0.0% svc.startd/11
...
I wonder what svc.startd is doing? Not much I'm sure, but multiplied by 171 ... hmmm ... Let's throw DTrace along with my inferior intellect at the problem. I'm at the early RTFM stage of DTrace. Let's see what I can find out. First, let's see who is making the most system calls:
syscall:::entry
{
@func[execname] = count();
}
sshd 16 nscd 23 sac 42 utmpd 54 ttymon 63 automountd 124 init 150 inetd 170 svc.configd 171 dtrace 417 svc.startd 1735
syscall:::entry
/execname == "svc.startd"/
{
@func[probefunc] = count();
}
lwp_park 112 portfs 855
What has all of this accomplished? Don't know exactly :) I'm out of time and I realize that I should't necessarily be focusing on the number of times called but on the time spent in each function call. My 'D' isn't quite that good yet.
Gotta go!
(2004-12-30 11:03:27.0) Permalink Comments [2]
John, what are the symptoms that you see when you "hit the limit"? You've got an awful lot of idle there, so looking at what is getting on-proc isn't where I would be looking first.
So you can do compares between runs, you may also find the tick probe useful.
i.e.
Which will exit the script after 10 seconds, also printing out your aggregations.
If you can tell me what exactly "hitting the limit" looks like, I'd be happy to try to give you a hand.
Posted by Alan Hargreaves on December 30, 2004 at 01:34 PM PST #
I could use the hand (especially the process of determination)
.Yeah, I got side-tracked trying to figure out what was going on with that 5-8% system/user time on the "idle" box. That was a baseline for playing around with Dtrace. Right now I am enabling nscd in each zone trying to get back to where things became unstable. I'm up to 150 zones with nscd running and things are looking good (outside of massive paging wide-ranging scan rate). I've got a feeling I am going to get all 171 running so I have no idea what happened last time at 168. If I get to 171, I'll have to create more zones and move forward.
Posted by John Clingan on December 30, 2004 at 02:18 PM PST #