BACKGROUND:
One of our Customers was seeing a performance issue with their Siteminder enabled applications each day around 4:30pm. While the Customer was using Dirtracer to capture event data, the time it took for Administration staff to see the alerts made capturing the event difficult. The event, when it occurred took approximately 60 seconds from
start to finish. By the time the Dirtracers were started, the event had already
passed 9-10 minutes before.
The Customer setup their BMC Control M monitoring facility so they could
automate a 4 way Dirtracer capture at the same time when the event was noticed.
NOTE: they did not see any alerts/issues with the Directory or
Siteminder Servers at this time but they launched Dirtracer via BMC Control M
as a precaution only.
With this said, I analyzed the Dirtracers thinking there was a problem.
SYSTEM INFO:
1) The Customer captured data from 4 systems at the same time; 01, 02, 03 and system 04.
01 #dirtracer Started: 080422-174012
02 #dirtracer Started: 080422-174013
03 #dirtracer Started: 080422-174127
04 #dirtracer Started: 080422-174126
2) Connection Load. I see not all servers are equal here; 01 and 04 are
taking a lot less of a load than 02 and 03.
01 91 connections average
02 185 connections constant
03 171 connections constant
04 124 connections average
Summary Connection Snippets: taken from cn=monitor searches.
Conn Ttl Curr Read ReqQue OpsInitiated vs.
File Peak Th Cons Cons Waitr BakLog OpsCompleted
====================================================================================
01:
[080422-175815] 1984 24 525136 101 0 0 332685-332674=11
[080422-175852] 1984 24 525143 102 0 0 332906-332899=7
[080422-175928] 1984 24 525150 103 0 0 333155-333140=15
[080422-180005] 1984 24 525159 87 0 0 331583-331576=7
[080422-180043] 1984 24 525167 82 0 0 331188-331186=2
02: Summary Snippet
[080422-174222] 1984 24 502241 185 0 0 2155474-2155468=6
[080422-174227] 1984 24 502245 185 0 0 2156076-2156073=3
[080422-174232] 1984 24 502249 185 0 0 2156687-2156680=7
[080422-174238] 1984 24 502253 185 2 2 2157573-2157565=8
[080422-174243] 1984 24 502259 185 1 0 2158588-2158583=5
03:
[080422-174513] 2256 24 503749 171 0 0 722384-722378=6
[080422-174518] 2256 24 503755 171 0 0 722607-722603=4
[080422-174522] 2256 24 503759 171 1 0 722890-722886=4
[080422-174527] 2256 24 503762 171 0 0 723230-723226=4
[080422-174531] 2256 24 503767 173 0 1 723506-723498=8
04:
[080422-174633] 8111 24 497842 123 1 0 383898-383884=14
[080422-174644] 8111 24 497848 123 0 2 384516-384511=5
[080422-174655] 8111 24 497855 124 0 1 384996-384993=3
[080422-174706] 8111 24 497860 124 0 1 385318-385313=5
[080422-174717] 8111 24 497864 124 0 0 385745-385739=6
3) The CPU % of each server mostly mirrors the load differences seen from above; 01 and 04 are taking a lot less of a load than 02 and 03.
01 1% average busy
02 16% average busy
03 20% average busy
04 8% average busy
CPU (pcpu) taken from the pms.sh (ps) file.
01:
04/22-[17:41:49] ------- 12489 5370536 5331048 1.1 growth: 0 k
02:
04/22-[17:41:39] ------- 6236 2216344 2193576 15.3 growth: 8 k
03:
04/22-[17:42:41] ------- 7930 2214432 2192016 21.3 growth: -40 k
04:
04/22-[17:42:48] ------- 10554 3717880 3695216 9.1 growth: -360 k
4) Iostats (disk performance). Since the servers are using Veritas File System I
cannot tell which disk slices the DS runs on versus which disks show
higher than normal iostat's.
The DS is installed in the following location for all servers:
- /dev/vx/dsk/app/dsserver 1048576 18603 965640 2% /opt/servers/slapd-master
01 Disks peak to 61% busy (1 time) GOOD
02 Disks peak to 58% busy (1 time) GOOD
03 Disks peak to 100% busy (2 times) FAIR
04 Disks peak to 100% busy (8 times) BAD
As a result of the 100% disk busy times the average disk io service time and service wait times increase.
01 avg service time 8.8/ms - wait service 0/ms
02 avg service time 35/ms - wait service 0/ms
03 avg service time 199.5/ms - wait service 0/ms
04 avg service time 105.6/ms - wait service 373.5/ms
01
Tue Apr 22 17:41:07 2008
extended device statistics r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 2.0 0.0 0.0 0.0 0.0 0.0 3.6 0 0 c1
1.0 4.0 0.0 0.1 0.0 0.0 0.0 8.7 0 0 c2
61.0 34.0 0.6 1.4 0.0 0.9 0.0 9.9 0 10 c4
102.0 47.0 1.1 2.0 0.0 1.3 0.0 8.7 0 10 c5
0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.5 0 0 c4t0d1
61.0 32.0 0.6 1.4 0.0 0.9 0.0 10.1 0 60 c4t0d2
0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.4 0 0 c4t0d5
102.0 44.0 1.1 1.9 0.0 1.3 0.0 8.8 0 61 c5t1d2
0.0 3.0 0.0 0.1 0.0 0.0 0.0 0.7 0 0 c5t1d5
0.0 2.0 0.0 0.0 0.0 0.0 0.0 3.6 0 1 c1t2d0
0.0 4.0 0.0 0.1 0.0 0.0 0.0 8.5 0 3 c2t2d0
1.0 0.0 0.0 0.0 0.0 0.0 0.0 9.5 0 1 c2t0d0
02
Tue Apr 22 17:40:37 2008
extended device statistics r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
1.0 23.0 0.0 1.2 0.0 1.0 0.0 41.0 0 2 c1
0.0 19.0 0.0 1.1 0.0 1.2 0.0 61.1 0 2 c2
3.0 701.0 0.0 5.8 0.0 24.6 0.0 34.9 0 20 c4
1.0 648.0 0.0 5.3 0.0 23.5 0.0 36.3 0 20 c5
0.0 4.0 0.0 0.0 0.0 0.0 0.0 9.6 0 1 c4t0d0
1.0 351.0 0.0 2.9 0.0 12.4 0.0 35.1 0 58 c4t0d2
2.0 290.0 0.0 2.3 0.0 9.7 0.0 33.3 0 50 c4t0d3
0.0 56.0 0.0 0.5 0.0 2.5 0.0 43.9 0 10 c4t0d5
0.0 4.0 0.0 0.1 0.0 0.0 0.0 7.7 0 1 c5t1d0
0.0 323.0 0.0 2.6 0.0 11.7 0.0 36.4 0 57 c5t1d2
1.0 269.0 0.0 2.2 0.0 9.1 0.0 33.8 0 54 c5t1d3
0.0 52.0 0.0 0.4 0.0 2.6 0.0 50.2 0 9 c5t1d5
1.0 0.0 0.0 0.0 0.0 0.0 0.0 5.4 0 1 c1t8d0
0.0 4.0 0.0 0.1 0.0 0.0 0.0 5.9 0 2 c1t2d0
0.0 9.0 0.0 0.6 0.0 0.6 0.0 61.6 0 13 c1t12d0
0.0 10.0 0.0 0.6 0.0 0.4 0.0 40.0 0 9 c1t9d0
0.0 8.0 0.0 0.5 0.0 0.6 0.0 76.2 0 13 c2t12d0
0.0 11.0 0.0 0.6 0.0 0.6 0.0 50.1 0 12 c2t9d0
03
Tue Apr 22 17:42:06 2008
extended device statistics r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 34.0 0.0 1.5 0.0 1.7 0.0 49.7 0 3 c1
0.0 33.0 0.0 1.7 0.0 2.1 0.0 63.3 0 3 c2
0.0 556.0 0.0 4.5 0.0 89.8 0.0 161.5 0 23 c4
2.0 279.0 0.0 3.2 0.0 12.2 0.0 43.3 0 15 c5
0.0 9.0 0.0 0.1 0.0 0.0 0.0 5.3 0 3 c4t8d0
0.0 103.0 0.0 0.9 0.0 1.1 0.0 11.1 0 38 c4t10d0
0.0 444.0 0.0 3.5 0.0 88.6 0.0 199.5 0 100 c4t11d0
2.0 249.0 0.0 2.0 0.0 11.0 0.0 43.8 0 76 c5t10d0
0.0 9.0 0.0 0.1 0.0 0.0 0.0 4.1 0 3 c5t11d0
0.0 21.0 0.0 1.1 0.0 1.1 0.0 53.6 0 13 c5t13d0
0.0 10.0 0.0 0.6 0.0 0.7 0.0 73.7 0 13 c2t11d0
0.0 8.0 0.0 0.4 0.0 0.4 0.0 52.3 0 11 c2t9d0
0.0 15.0 0.0 0.7 0.0 0.9 0.0 62.3 0 12 c2t12d0
0.0 1.0 0.0 0.0 0.0 0.0 0.0 6.8 0 1 c1t2d0
0.0 11.0 0.0 0.6 0.0 0.5 0.0 44.1 0 13 c1t11d0
0.0 14.0 0.0 0.6 0.0 0.8 0.0 55.2 0 12 c1t9d0
0.0 7.0 0.0 0.3 0.0 0.4 0.0 59.3 0 12 c1t12d0
0.0 1.0 0.0 0.0 0.0 0.0 0.0 12.0 0 1 c1t3d0
04
Tue Apr 22 17:42:23 2008
extended device statistics r/s w/s Mr/s Mw/s wait actv wsvc_t asvc_t %w %b device
0.0 299.1 0.0 2.3 100.8 29.8 337.1 99.6 16 17 c1
0.0 349.1 0.0 2.7 107.5 29.6 308.0 84.9 16 17 c2
0.0 5.0 0.0 0.1 0.0 0.0 0.0 4.5 0 0 c4
0.0 48.0 0.0 2.6 0.0 0.6 0.0 12.5 0 2 c5
0.0 5.0 0.0 0.1 0.0 0.0 0.0 4.5 0 2 c4t8d0
0.0 2.0 0.0 0.0 0.0 0.0 0.0 12.4 0 2 c5t8d0
0.0 22.0 0.0 1.2 0.0 0.3 0.0 12.2 0 4 c5t9d0
0.0 22.0 0.0 1.2 0.0 0.3 0.0 13.7 0 4 c5t10d0
0.0 1.0 0.0 0.1 0.0 0.0 0.0 4.7 0 0 c5t11d0
0.0 1.0 0.0 0.1 0.0 0.0 0.0 1.8 0 0 c5t12d0
0.0 1.0 0.0 0.0 0.0 0.0 0.0 12.5 0 1 c2t11d0
0.0 181.1 0.0 1.4 51.9 14.6 286.7 80.7 92 100 c2t0d0
0.0 1.0 0.0 0.0 0.0 0.0 0.0 11.5 0 1 c2t13d0
0.0 1.0 0.0 0.0 0.0 0.0 0.0 4.1 0 0 c2t4d0
0.0 164.0 0.0 1.3 55.6 15.0 339.0 91.4 100 100 c2t10d0
0.0 1.0 0.0 0.0 0.0 0.0 0.0 7.4 0 1 c2t5d0
0.0 4.0 0.0 0.1 0.0 0.0 0.0 6.5 0 3 c1t11d0
0.0 1.0 0.0 0.0 0.0 0.0 0.0 7.3 0 1 c1t13d0
0.0 142.0 0.0 1.1 53.0 15.0 373.2 105.6 100 100 c1t0d0
0.0 1.0 0.0 0.0 0.0 0.0 0.0 12.3 0 1 c1t4d0
0.0 1.0 0.0 0.0 0.0 0.0 0.0 6.4 0 1 c1t3d0
0.0 148.0 0.0 1.1 47.8 14.7 322.9 99.5 92 100 c1t10d0
0.0 1.0 0.0 0.0 0.0 0.0 0.0 7.7 0 1 c1t2d0
0.0 1.0 0.0 0.0 0.0 0.0 0.0 9.5 0 1 c1t5d0
5) Other ns-slapd instances running on these servers. There are 5 total
directory server instances on each of the 4 servers so I cannot see of
the other ns-slapd instances are loading the disks to the levels seen above. Significant load on the unknown instances could impact the disk used by the known instances and affect performance.
6) Dirtracer itself is not able to
complete a normal MainLoop check in the normal time as configured; Dirtracer was confiured to loop trough each check in the MainLoop at 1 second INTERVALs.
Each loop executes a pstack, prstat, netstat, cn=monitor
search and 1 x backend cache info search (1 backend in this case). Each
of these data points are very lightweight and complete sub-second.
On Average each server completes a Loop in 5-41 seconds each. It is
unknown why these servers cannot complete these loops as fast as configured.
See the following examples:
each Loop shows the date/time stamp when the loop starts...note how many seconds had passed between each loop (should have been 1 second between loops).
01
* Loop 0 - 080422-174027
* Loop 1 - 080422-174106
* Loop 2 - 080422-174148
* Loop 3 - 080422-174228
* Loop 4 - 080422-174308
02
* Loop 0 - 080422-174030
* Loop 1 - 080422-174036
* Loop 2 - 080422-174041
* Loop 3 - 080422-174045
* Loop 4 - 080422-17405003
* Loop 0 - 080422-174143
* Loop 1 - 080422-174148
* Loop 2 - 080422-174153
* Loop 3 - 080422-174158
* Loop 4 - 080422-174203
04
* Loop 0 - 080422-174139
* Loop 1 - 080422-174150
* Loop 2 - 080422-174201
* Loop 3 - 080422-174213
* Loop 4 - 080422-174225
LDAP INFO:
1) A quick check on the Dirtracer search etimes shows each are sub-second.
On 04 only a few were 0.1 to .2 seconds.
The following were taken from access log snippets captured.
root[/customerdata/6xxxxxxx/Apr22/04-dirtracernxcspld4-080422/ds-logs/working]#cat conn-49* | awk '{print $11}' | grep etime | grep -v etime=0\.0
etime=0.246540
etime=0.118270
etime=0.110770
etime=0.143630
2) Using the logs I found there was a Siteminder Policy Server that was configured to use an improper search base (dc=com). For this data set only 01 doesn't have
any Siteminder searching using an incorrect base; 02, 03 and 04 all
to. 04 having the most.
01: None
02: 4604 dc=com searches
03: 5457 dc=com searches
04: 13749 dc=com searches
I as able to track down the IP in only 1 server (04) as the connections
in the others were already running. I need to check old logs to see if
there are other IP addresses these are coming from.
10.10.10.1
[22/Apr/2008:17:04:52 -0400] conn=496746 op=-1 msgId=-1 - fd=79 slot=79 LDAP connection from 10.10.10.1 to 10.10.10.181
[22/Apr/2008:17:04:52 -0400] conn=496746 op=0 msgId=1 - BIND dn="uid=siteminder,ou=service,dc=xxxxxx,dc=com" method=128 version=3
[22/Apr/2008:17:04:52 -0400] conn=496746 op=0 msgId=1 - RESULT err=0 tag=97 nentries=0 etime=0.000910 dn="uid=siteminder,ou=service,dc=xxxxxx,dc=com"
...
[22/Apr/2008:17:04:52 -0400] conn=496746 op=3 msgId=4 - SRCH base="dc=com" scope=2 filter="(&(uniqueMember=uid=userremoved,ou=People,dc=xxxxxx,dc=com)(objectClass=groupOfUniqueNames*))" attrs="cn"
[22/Apr/2008:17:04:52 -0400] conn=496746 op=3 msgId=4 - RESULT err=32 tag=101 nentries=0 etime=0.000540
...
[22/Apr/2008:17:04:52 -0400] conn=496746 op=5 msgId=6 - SRCH base="dc=com" scope=2 filter="(&(uniqueMember=uid=userremoved,ou=People,dc=xxxxxx,dc=com)(objectClass=groupOfUniqueNames*))" attrs="cn"
[22/Apr/2008:17:04:52 -0400] conn=496746 op=5 msgId=6 - RESULT err=32 tag=101 nentries=0 etime=0.000480
...
[22/Apr/2008:17:04:52 -0400] conn=496746 op=7 msgId=8 - SRCH base="dc=com" scope=2 filter="(&(uniqueMember=uid==userremoved,,ou=People,dc==xxxxxx,,dc=com)(objectClass=groupOfUniqueNames*))" attrs="cn"
[22/Apr/2008:17:04:52 -0400] conn=496746 op=7 msgId=8 - RESULT err=32 tag=101 nentries=0 etime=0.000460
...
The above connection (conn=496746) contains searches using the incorrect base 1195 times
alone.
root[/customerdata/6xxxxxx/Apr22/04-dirtracernxcspld4-080422/ds-logs]#gc err=32 conn-496746.log
1195
3) High etimes / high etime servers. Overall server 04 is the biggest high etime
server, followed by 01.
Top 5 High Etimes
01
etime=413.555070 1
etime=300.814310 notes=F 1
etime=115.912590 1
etime=39.566400 1
etime=39.553510 notes=F 1
02
etime=8.719340 1
etime=6.210240 1
etime=6.101450 1
etime=5.814920 1
etime=5.802740 103
etime=7.325220 1
etime=5.766010 1
etime=5.558690 1
etime=5.186940 1
etime=5.147010 1
04
etime=1876.043450 notes=U 1
etime=1609.753640 notes=U 1
etime=1275.956850 notes=U 1
etime=1251.870940 notes=U 1
We now know 04's highest offending etimes come from 4 cn=* searches from
a certain application. IP 10.10.10.225
nentries=598115 4
[22/Apr/2008:16:48:50 -0400] conn=496301 op=-1 msgId=-1 - fd=47 slot=47 LDAP connection from 10.10.10.225 to 10.10.10.181
[22/Apr/2008:16:48:50 -0400] conn=496301 op=0 msgId=1 - BIND dn="uid=appl100,ou=service,dc=xxxxxx,dc=com" method=128 version=3
[22/Apr/2008:16:48:50 -0400] conn=496301 op=0 msgId=1 - RESULT err=0 tag=97 nentries=0 etime=0.002810 dn="uid=appl100,ou=service,dc=xxxxxx,dc=com"
[22/Apr/2008:16:48:50 -0400] conn=496301 op=1 msgId=2 - SRCH base="ou=people,dc=xxxxxx
,dc=com" scope=1 filter="(cn=*)" attrs="uid givenName sn mail region modifiers uid modifyTimestamp newOSUser OSmanaged OSstatus"
[22/Apr/2008:17:15:39 -0400] conn=496301 op=1 msgId=2 - RESULT err=0 tag=101 nentries=598115 etime=1609.753640 notes=U
[22/Apr/2008:17:15:40 -0400] conn=496301 op=2 msgId=3 - UNBIND
[22/Apr/2008:17:15:40 -0400] conn=496301 op=2 msgId=-1 - closing - U1
[22/Apr/2008:17:15:40 -0400] conn=496301 op=-1 msgId=-1 - closed.
CONFIGURATION INFO:
1) The Dirtracer captures highlighted differences in cache settings. Both servers 02 and 03 have 1GB total whereas
servers 01 has 3 and 04 has 2GB cache.
01
Total DB Cache: 1,048,576 Kb (1073741824 bytes)
Total Entry Cache: 2,441,406 Kb (2500000000 bytes)
--------------------------------------------------
Total Cache Size: 3 Gb, 3,408 Mb, 3,489,982 Kb, (3573741824 bytes)
02
Total DB Cache: 1,048,576 Kb (1073741824 bytes)
Total Entry Cache: 1,024,000 Kb (1048576000 bytes)
--------------------------------------------------
Total Cache Size: 1 Gb, 2,024 Mb, 2,072,576 Kb, (2122317824 bytes)
03
Total DB Cache: 1,048,576 Kb (1073741824 bytes)
Total Entry Cache: 1,024,000 Kb (1048576000 bytes)
--------------------------------------------------
Total Cache Size: 1 Gb, 2,024 Mb, 2,072,576 Kb, (2122317824 bytes)
04
Total DB Cache: 1,048,576 Kb (1073741824 bytes)
Total Entry Cache: 1,126,400 Kb (1153433600 bytes)
--------------------------------------------------
Total Cache Size: 2 Gb, 2,124 Mb, 2,174,976 Kb, (2227175424 bytes)
2) None of the four servers uses a tmpfs (tmp file system) for its db home; this affects performance on Unix systems.
"UNIX only. Used to fix a situation on UNIX platforms where the
operating system endlessly flushes pages. This flushing can be so
excessive that performance of the entire system is severely degraded."
See: nsslapd-db-home-directory
http://docs.sun.com/source/816-6699-10/plugattr.html
Each is using the main veritas partition for the db home.
db-home-directory: /opt/servers/slapd-master/db
3) Each server used VXFS (Veritas File System) which has known fsync
(file sync) problems when flushing Dirty Pages from Memory/Cache to the
DB's. This is done mostly during the transaction checkpoint
txn_checkpoint. The resulting fsync problem results in an overall
slowdown of the DS process which will affect etimes on all other
operations while the checkpoint occurs. I have seen this on Veritas 4.4
and the customer is running an even older version.
NAME: VERITAS File System
VENDOR: VERITAS Software
PSTAMP: VERITAS-3.4FS-2003-01-29PID=110435-08
The problem can be seen using the cn=monitor search in the DS to show
read waiters and the request queue backlog. See the cn=monitor summary
snippet at the top (above). Anything more than 0 for either is not optimal.
Each of the Four servers has between 1 and 3 read waiters or request
queue backlogs at any given time. Generally each is running with 0 most
of the time.
4) I asked a while ago if the customer did the following.
- Upgraded from Directory Server Patch 4 - YES
- Used the ActivateFix script in Patch 6 - NO
I saw these errors in the logs from servers 01, 03 and 04.
[22/Apr/2008:17:47:00 -0400] - WARNING<10307> - Replication - conn=-1 op=-1 msgId=-1 - Configuration warning Tombstone purging internal search is unindexed in replica dc=xxxxxx,dc=com
It is possible they have hit bug 6175472 and are seeing some result of
the tombstone purging performance hit. Allids threshold is only 4000 on these
servers. Tombstones are DELeted entries waiting to be purged. If the
number of deleted entries has gone above 4k the objectclass index
becomes allids and the searches for the tombstone entries is unindexed
and takes a long time.
6175472 MMR: High performance degradation when purging tombstones
I verified the dse.ldif's to see if the fix was enabled. Part of the
fix places a nsMatchingRule oid in the indexes entries as seen in the
dse.ldif. The Matching Rule is NOT in any indexes of any of the servers
dse.ldif
nsMatchingRule: 1.3.6.1.4.1.42.2.27.9.4.150
01:
root[/customerdata/6xxxxxxx/Apr22/01-dirtracernxcspld1-080422/ds-config-data]#grep -ic 'nsMatchingRule: 1.3.6.1.4.1.42.2.27.9.4.150' dse.ldif
0
02:
root[/customerdata/6xxxxxxx/Apr22/02-dirtracernxcspld2-080422/ds-config-data]#grep -ic 'nsMatchingRule: 1.3.6.1.4.1.42.2.27.9.4.150' dse.ldif
0
03
root[/customerdata/6xxxxxxx/Apr22/03-dirtracernxcspld3-080422/ds-config-data]#grep -ic 'nsMatchingRule: 1.3.6.1.4.1.42.2.27.9.4.150' dse.ldif
0
04
root[/customerdata/6xxxxxxx/Apr22/04-dirtracernxcspld4-080422/ds-config-data]#grep -ic 'nsMatchingRule: 1.3.6.1.4.1.42.2.27.9.4.150' dse.ldif
0
Even though the above data was capture when there was no Directory or Siteminder server problems, the Dirtracer data captured helped to highlight the following
- problematic/incorrect directory server configuration/architecture
- possible load balancing problems; i.e. sending more connections two two server as opposed to completely load balanced.
- disk capacity/io peak issues possibly from other heavily loaded ns-slapd instance
- incorrect application configuration
- overall etimes of the servers
[LT]

