70x performance improvement in 5 minutes
Friday May 02, 2008
A good friend of mine who is a Systems Engineer/Engagement Architect in the UK sent me a copy of a benchmark which his customer was using to assess the performance of various types of Sparc machines. While the benchmark is simplistic, the customer had a concern over its performance on a T5220, so any concern is valid. So here is the customer benchmark
The spirit of the customer benchmark was
#!/bin/ksh
i=0
while [ $i -lt 63 ]
do
./run2_slow &
echo Starting $i
i=`expr $i + 1`
done
time ./run2_slow
which calls
#!/bin/ksh
loop=0
while [ $loop -lt 1000 ]
do
bc < /dev/null
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100
E
loop=`expr ${loop} + 1`
done
which executes in around 70 seconds.
Clive's version which required 5 minutes of very simple coding change
#!/bin/ksh
i=0
while [ $i -lt 63 ]
do
time ./run2_fast &
echo Starting $i
i=`expr $i + 1`
done
time ./run2_fast
calls
#!/bin/ksh
loop=0
while [ $loop -lt 1000 ]
do
n=0
((n=n+100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100* \
100/100*100/100
))
((loop=loop+1))
done
run the same number of iterations on the same machine in 1.01 seconds.
For the slower version dtrace -s /usr/demo/dtrace/whoexec.d shows that there are huge number of sequences of fork/exec where the script forks bc which forks dc and also the counter using expr requires calls to fork/exec. Less than 1% of the time spent in this script was actually calculation.
An interesting system level bottleneck did drop out where the text segment of libc was being faulted in as the process is being created as a result of a call to memcntl something like this
enoexec(5.11)$ truss -t memcntl /bin/true memcntl(0xC4080000, 227576, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0Where you have many concurrent processes calling fork, you do end up with some lock contention in ufs_getpage like this
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Lock Caller
111484 18% 18% 0.00 1747392 0x30001293610 ufs_lockfs_begin_getpage+0xc8
nsec ------ Time Distribution ------ count Stack
512 | 5 ufs_getpage+0x7c
1024 | 26 fop_getpage+0x90
2048 | 43 segvn_faulta+0x114
4096 | 61 as_faulta+0x138
8192 | 131 memcntl+0x8d0
16384 | 711 syscall_trap32+0xcc
32768 |@@@ 14042
65536 |@@@@ 17764
131072 |@@@@ 17988
262144 |@@ 9247
524288 | 2818
1048576 | 3215
2097152 |@ 5994
4194304 |@@@@@@ 24221
8388608 |@@ 10884
16777216 | 3414
33554432 | 310
67108864 | 7
it would be interesting to try this on a T5220 with ZFS as a root filesystem, but where is the bottleneck? I would argue that there may be a little room for improvement in UFS, but that the benchmark is pathological. My experience of dealing with performance issues in the field over the last 6 years is that large 15K/25K and then T2000 and now T5220 are very good indeed at exposing applications which don't scale. This is an example of where a simplistic benchmark can lead to incorrect or at least very incomplete conclusions about the underlying platform. The benchmark as the customer used just measured fork/exec performance, you would not implement a business solution like that, or would you?
Getting a 70x improvement for changes to Solaris or the underlying hardware is going to be a significant challenge. A 70x speedup from application changes in this case was viable and a little consulting help might not go a miss.











-...
good comment. About 30% fast with your suggestions...
BTW: ksh93 has "for(( expr1 ; expr2 ; expr3 )...