Sunday Aug 10, 2008
I have been asking around, but may not have yet asked in the right place, so here goes with a wider audience!
A university I do some work with wants to load test their Sun Ray setup before going live. They had some performance problems with a lab full of students logged in and want to avoid this when they put in a shiny new T5xx0 series server, so a load test before the start of term makes some sense.
Anyone got a pointer to a Sun Ray stress test harness or load generator? Comments very welcome.
Tuesday Aug 05, 2008
Every day is a school day with many new things to learn.
A co-worker from Italy brought me this problem, which a customer thought was related to an LDAP performance issue (that is another discussion).
The test case code
#!/usr/bin/perl -w
# Time repeated Net::Ping probes (default protocol: TCP) against a host.
use strict;
use Net::Ping;
use Time::HiRes qw( usleep ualarm gettimeofday tv_interval );

my ( $host, $count, $rate ) = @ARGV;
unless ( $host && $count && $rate ) {
    print("usage: ping.pl <host> <count> <rate>\n");
    exit(1);
}

my $p = Net::Ping->new();
my $totalTime = 0;
my $pingNumber = 0;
foreach my $i (1..$count) {
    $p->hires();    # report durations with sub-second resolution
    # Probe with a 5.5 second timeout; $duration is in seconds.
    my ($ret, $duration, $ip) = $p->ping($host, 5.5);
    printf("$pingNumber\t%.2f\n", 1000 * $duration);    # milliseconds
    $pingNumber++;
    $totalTime += $duration;
    $p->close();
    usleep((1/$rate) * 1000000);    # pace probes at $rate per second
}
and the output
v-6000a-x6220a-gmp03(5.10)$ ./ping.pl ebusy 100 200
0 0.54
1 0.28
2 0.33
3 0.25
4 0.45
snip but assume pattern repeats
35 0.25
36 0.26
47 0.34
48 0.28
49 0.26
50 3380.21
51 0.29
52 0.27
53 0.26
54 0.23
55 0.23
snip but assume pattern repeats
93 0.23
94 0.24
95 0.26
96 0.23
97 0.24
98 0.24
99 3380.21
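Solaris bug? No, it is Solaris trying to protect you against a reset DoS. Net::Ping probes over TCP by default, so when the server withholds the RST the client sits waiting, which is presumably where the 3380 ms stalls come from.
This is the TCP routine (in tcp.c) which causes this behavior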
/*
 * If this routine returns B_TRUE, TCP can generate a RST in response
 * to a segment. If it returns B_FALSE, TCP should not respond.
 */
static boolean_t
tcp_send_rst_chk(tcp_stack_t *tcps)
{
    clock_t now;

    /*
     * TCP needs to protect itself from generating too many RSTs.
     * This can be a DoS attack by sending us random segments
     * soliciting RSTs.
     *
     * What we do here is to have a limit of tcp_rst_sent_rate RSTs
     * in each 1 second interval. In this way, TCP still generate
     * RSTs in normal cases but when under attack, the impact is
     * limited.
     */
    if (tcps->tcps_rst_sent_rate_enabled != 0) {
        now = lbolt;
        /* lbolt can wrap around. */
        if ((tcps->tcps_last_rst_intrvl > now) ||
            (TICK_TO_MSEC(now - tcps->tcps_last_rst_intrvl) >
            1*SECONDS)) {
            tcps->tcps_last_rst_intrvl = now;
            tcps->tcps_rst_cnt = 1;
        } else if (++tcps->tcps_rst_cnt > tcps->tcps_rst_sent_rate) {
            return (B_FALSE);
        }
    }
    return (B_TRUE);
}
This is easy to address with ndd on the server side, using tcp_rst_sent_rate or tcp_rst_sent_rate_enabled, if this latency causes you an issue and you can cope with the potential exposure to this DoS.
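To see why a steady stream of probes stalls at regular intervals, here is a minimal Python sketch of the same fixed-window counter. It is a toy model and not the kernel code: the class and function names are made up for illustration, the limit of 40 RSTs per second is an assumed value rather than a quoted default, and the 3380 ms client stall is simply taken from the outliers in the output above.

```python
from dataclasses import dataclass


@dataclass
class RstLimiter:
    """Toy model of tcp_send_rst_chk(): at most `rate` RSTs per interval."""
    rate: int                  # stands in for tcps_rst_sent_rate (assumed value)
    enabled: bool = True       # stands in for tcps_rst_sent_rate_enabled
    interval_ms: int = 1000    # the 1 second window from the C source
    last_interval_ms: int = 0  # stands in for tcps_last_rst_intrvl
    count: int = 0             # stands in for tcps_rst_cnt

    def may_send_rst(self, now_ms: int) -> bool:
        if not self.enabled:
            return True
        # New window if the clock wrapped or more than one interval has passed.
        wrapped = self.last_interval_ms > now_ms
        expired = now_ms - self.last_interval_ms > self.interval_ms
        if wrapped or expired:
            self.last_interval_ms = now_ms
            self.count = 1
            return True
        self.count += 1
        return self.count <= self.rate


def stalled_probes(limiter, n_probes, gap_ms, stall_ms):
    """Return the indexes of probes that get no RST and so stall the client."""
    now_ms = 0
    stalled = []
    for i in range(n_probes):
        if not limiter.may_send_rst(now_ms):
            stalled.append(i)
            now_ms += stall_ms  # the client waits before it can carry on
        now_ms += gap_ms
    return stalled


if __name__ == "__main__":
    # 100 probes at 200 per second (5 ms apart), as in the test run above.
    print(stalled_probes(RstLimiter(rate=40), 100, 5, 3380))
```

With the limiter disabled (enabled=False) no probe stalls, which is the effect of turning tcp_rst_sent_rate_enabled off.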
Thursday Jul 31, 2008
I came across a utility called mcs(1) by accident. The purpose of mcs is to manipulate the comment section of an object file, but with the -p option it also prints out the compiler and linker versions.
enoexec(5.11)$ mcs -p hack
hack:
acomp: Sun C 5.9 SunOS_sparc 2007/05/03
ld: Software Generation Utilities - Solaris Link Editors: 5.10-1.489
We will still need to use strings -a to determine what compiler flags were used and if an application has scope for speedup by using better compiler options and a more recent compiler version.
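If you want to collect this across many binaries, the output is simple enough to pick apart. The following is a rough Python sketch that only assumes the "name: value" layout shown in the sample above; the function name is made up, and real comment sections may contain lines that do not fit this pattern.

```python
def parse_mcs_comments(text):
    """Map tool name -> version string from `mcs -p` style output."""
    tools = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        # Skip lines without a colon and the "hack:" file-name header,
        # which has nothing after the colon.
        if sep and value.strip():
            tools[name.strip()] = value.strip()
    return tools


SAMPLE = """hack:
acomp: Sun C 5.9 SunOS_sparc 2007/05/03
ld: Software Generation Utilities - Solaris Link Editors: 5.10-1.489
"""

if __name__ == "__main__":
    for tool, version in parse_mcs_comments(SAMPLE).items():
        print(tool, "->", version)
```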
Monday Jul 28, 2008
As I have mentioned in a previous post, I worked in the supercomputer part of Cray Research just after it had been taken over by SGI.
I remember discussions going on about various customers wanting Teraflop machines and being willing to pay the current going rate, but vendors at that time were unable to provide systems, no matter what the customer was willing to pay.
This benchmark caught my eye as a single image machine with an output of over 2 TFLOPS. While the current clusters at the head of the top 500 list have in excess of 100,000 cores, this benchmark is quite something for a single image system with all the advantages of SMP!
Monday Jul 28, 2008
The temp gauge in our car read 27 degrees, which is tropical for Wales.
So heat was the theme of this year's Snowdon Race, with many of the competitors finishing up to 10 minutes down on last year's time. Apart, of course, from the winner Andi Jones, who runs like a machine and was only 30 seconds down on his time for the previous year. I came in 45 minutes after Andi and I don't think I could have run any harder given the heat.
The race also doubles as the Llanberis St John's Ambulance training day. I have never seen so many customers for their services at a race: heat exhaustion, blisters and bleeding legs and heads. The medical tent was full to overflowing. At least 1 runner who ran for Wales was helicoptered off; I do hope he is OK now.
So it was a tough race for those not used to running in heat, and I doubt many personal bests were set (unless, like me, it was your 1st). While Snowdon has a railway running to its summit, it is still a serious mountain which needs considerable respect if you choose to use your feet at any speed or in any weather.
Much credit to the marshals and organisers and also to the 1000s of people who turned out to cheer everyone on, not just the winners or their friends.
Thursday Jul 24, 2008
I have been working with the MIS people at Edinburgh University and a consultant from Tribal, which has been much fun indeed. I learned a number of things and was reminded of a few more that my brain has chosen to put in long term storage.
- cron starts non-root processes with a NICE value of 2, hence they will have a lower priority than jobs started on the command line or via SMF. The queuedefs man page explains more, but the syntax is arcane!
- Worth snooping traffic to and from the DNS server. This often shows up errors or performance opportunities in nscd.conf and resolv.conf, such as having enable-cache set to no for ipnodes.
- If any type of network latency is important, such as in a ping-pong of packets between 2 clients, map out and understand where your firewall(s) sit(s) and benchmark without the firewall to get the scope of the impact. Firewalls are often an invisible (and hard to observe) component, so are often ignored.
- Turning TCP NAGLE off via ndd is well worthwhile if ping-pong latency is a barrier.
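The ndd route changes behaviour for the whole system. Where you control the application code, a narrower alternative (a different technique from the ndd tuning above, shown here only as a sketch) is to disable Nagle on the individual socket with the standard TCP_NODELAY option. The helper name is made up for illustration.

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with Nagle's algorithm disabled for this socket only."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY asks the stack to send small writes immediately
    # instead of coalescing them while waiting for outstanding ACKs.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


if __name__ == "__main__":
    s = make_low_latency_socket()
    print("TCP_NODELAY:", s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
    s.close()
```

This keeps the default behaviour for bulk transfers elsewhere on the box, which is where Nagle tends to help.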
Next race is on Saturday which is the Snowdon International Race.
Friday Jul 11, 2008
Assisted by extra reading time made available by a round trip to South Africa, I have just finished reading Survival of the Fittest by Mike Stroud. This is not a typical physical adventure read, though parts of the book are hard to put down. It is a book written by a medical doctor with a serious interest in what extremes the human body can be taken to and quite a lot of practical experience of doing just that.
He draws on his own experience of extreme events such as crossing the Antarctic, running 7 marathons in 7 days and the various Eco-Challenges, in teams with ages ranging from 30 to 74.
For example, he explains how the body controls extremes of heat and why it may malfunction.
In short, it's a very practical read if you are interested in pushing your own physical limits or just living a better quality of life into your more senior years.
Saturday Jun 28, 2008
BBC Radio 4 have been running an excellent series on the history of astronomy, Cosmic Quest. During a discussion of what might happen if you fell into a black hole (not an everyday event in mid Wales, but we do lag behind), spaghettification was described: the process where the difference in the gravitational force between your head and your feet would stretch you like spaghetti.
Quite a lot of the program last night was devoted to the Anthropic Principles and the program did a very good job of presenting some very heavy material in an accessible manner. Well worth downloading and listening to if you are at all interested in Astronomy.
Monday Jun 23, 2008
Long ago in a far off land (Bracknell) I worked for the Cray bit of SGI. I was the Site Analyst at the UK Met Office in the days when 450MHz DEC Alpha processors were just about the fastest general purpose processors in the wild. The system being commissioned was the T3E, which ended up having in excess of 800 liquid cooled processors. It was as close to the real Deep Thought as you could get. The customer just wanted the fastest computer they could get their hands on and if we could have provided 10X the performance they would have eaten it as well. Power consumption was an aside that was just dealt with. Given that 3/4 of the machine was devoted to modelling climate change, this now looks ironic.
This was the 2nd largest T3E built at the time; the largest (1300 or so processor units) went somewhere in Maryland. I think this machine was Number 4 on the Top 500 at the time (12 years ago).
Look how the Top 500 has changed with power consumption now included.
Nice to see the Ranger system in the Number 4 slot.
I learned a lot from the quite short time I worked with that machine, including but not limited to :-
- For some customers, technology won't keep you ahead of their demand for processing power.
- The biggest of a new product line always exposes multiple new problems (we saw the same on every top end product line that Sun has released, so it's industry wide) which need to be debugged at the customer's pain.
- Don't assume because a script is written in ksh, that it is not critical to the performance of a customer's business (shell builtins are holy)
- Crash dumps are critical to effective diagnosis
Unicos/MK was a well developed OS in terms of debugging technology at the time, which, given that each processing element (processor) could generate its own crash dump file, was just as well.
Sunday Jun 22, 2008
The race this weekend was Red Kite Weekend, and coming in just under my target time of 1:45 was quite pleasing. I don't think the local knowledge helped much, but it was a fun race.
The race very much lived up to its name. Just after I finished it was Red Kite feeding time, and to see around 100 of these birds of prey flocking was amazing. Whatever the rights and wrongs of feeding (and I guess many of these birds are dependent on being fed), when I was a child 30 years ago in the same part of Wales it was special to see a Red Kite at all. Now a Red Kite is probably more common than a Crow, which is, I think, a worthwhile tradeoff in the big scheme of things.
I shall have to make it to Nant-y-Arian with the video camera at feeding time; the Red Kite is a spectacular bird.
I am going to have a break from races for a few months, get some more concentrated hill training in the bag over the summer, do family holidays etc. and return to a couple of interesting races in late September.
Sunday Jun 15, 2008
On Saturday I joined around 500 runners and 40+ horses to take part in one of the more curious events in the running calendar: The Man vs Horse Marathon. This year's event was won by a horse (a relay team did finish 1st, but they are not counted in this respect) 10 seconds ahead of the 1st man, which must have been some finish to watch.
It's the 1st time I have run that sort of distance (22 miles, 3000ft ascent), so I was pleased to be just outside my target time at 3:35:13. I managed to finish before quite a few horses and about 1/2 way down the field of runners. I found the last 4 miles really hard.
About 60 relay teams entered, each member doing a leg of around 7 miles. I was passed at about mile 18 by Andy Croft, an Engagement Architect in Sun UK, who was running the final leg for his team, which was a bit of a surreal moment.
One feature of this race which should be compulsory for any marathon is a river crossing a few 100 meters from the finish to help induce leg cramp.
I did manage a spurt of speed in the last 100m as the thundering hoofs of a horse named Socks approached from behind. Funny how the thought of beating just one more horse can help you find energy you wished you had 3 miles back.
This is a race I will do again! Very well organized and marshaled. Horses did not constitute a significant risk of trampling. The course was a mixture of forestry tracks, open moorland, rivers, mud, foot paths and some tarmac (< 20%).
Think I will give the Bog Snorkeling a miss which also takes place in Llanwrtyd Wells, I don't quite have the features for it.
Next weekend is The Red Kite Challenge, where the course passes within less than a mile of where I live, so it would be rude not to do it!
Thursday Jun 05, 2008
Thanks to Steve for pointing out a new patch revision. It did cause me to ask for the SRAS rules (an internal proactive analysis tool set) to be updated. There are 2 streams of firmware for the T2000 depending on version and if you want LDOM support.
This link has some useful firmware related material.
Thanks to zdz for his insight into why firmware upgrade is seen as a bigger hurdle than software patching. Diagnosis of firmware related issues is typically more difficult, and for systems like the T2000, which have a thicker firmware layer including the Hypervisor, I would argue that proactive firmware upgrades become more critical. We don't yet have truss, mdb and DTrace for firmware!
Thanks to Robert for articulating his T2000 firmware upgrade experience. You may have hit CR 6696642 'slow responsive console after firmware upgrade', for which a later firmware rev should soon be available. A workaround is documented here.
Thanks to Peter for his question about bufhwm. We don't actually close comments ourselves; there is a default period (a week from memory, but it can be modified) after which the ability to add comments is closed automatically. As regards bufhwm tuning, I would always start with: what business metric tells you that tuning is required? A good answer can be "just because I am interested"; if so, contact me via email and we can explore your query in more detail.
Thanks again to Robert, who documents the workaround of turning the machine off for 10 seconds.
Thanks to Steve for the link. Not ideal for remote management, I have to agree. We run a remote lab so have the ability to power remote machines on and off, but that does require some investment in infrastructure which I agree should not be needed for a firmware upgrade. An ideal world we do not live in.
I was thinking of entering the Man vs Horse in the hope that they have a 3 legged cart horse which has to do the race backwards to pace me.
Wednesday May 28, 2008
Many customers (and engineers!) ignore firmware as part of their patching strategy and this can result in hard to diagnose issues. Over the last couple of weeks I have come across a couple of customer performance issues (some 1st hand and some 2nd hand) on T2000 which were resolved by applying the current firmware patch.
We have very limited observability in the firmware layer, so diagnosis can be a challenge to say the least.
So in the spirit of avoiding future problems, have a quick look at the output of prtconf -V and if it does not show 4.28.1 or later, consider applying the patch 136927-01, or a later revision if you are reading this in 6 months' time. This is a patch where a long cool read of the README and install instructions is savvy.
Same principle applies to T1000 and T5220's, but the patches are different.
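If you have more than a handful of machines to check, the comparison is easy to script. Below is a minimal Python sketch of the T2000 check described above. The function names are made up, and it assumes only that the prtconf -V output contains a dotted version number somewhere in the line; the sample strings in the example are invented for illustration, not captured from real systems.

```python
import re

# Minimum T2000 firmware level quoted in the post above.
MINIMUM = (4, 28, 1)


def parse_version(text):
    """Return the first dotted version found in text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError("no dotted version found in: %r" % text)
    return tuple(int(part) for part in match.group(0).split("."))


def needs_firmware_patch(prtconf_output, minimum=MINIMUM):
    """True if the reported version is older than the minimum."""
    return parse_version(prtconf_output) < minimum


if __name__ == "__main__":
    # Hypothetical output lines, for illustration only.
    for line in ("OBP 4.27.4 2007/09/24 13:33", "OBP 4.28.1 2008/03/11 08:22"):
        print(line, "->", "patch" if needs_firmware_patch(line) else "ok")
```

For T1000 and T5220 systems the minimum level would need changing to match their own patches.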
Tuesday May 27, 2008
Kernel Crash dumps are a point in time snapshot of the Solaris Kernel state. The aim is to allow post mortem analysis of the system state at the point the crash dump was taken. For system panics and hangs, the ability to look at the system state is the primary failure analysis tool and one of the reasons Solaris is as reliable as it is.
I think of system failures as a 2 dimensional problem. The interaction of data and code at the point in time of the failure can be analyzed with tools such as MDB which are designed for this type of post-mortem analysis.
Performance adds the 3rd dimension of time.
Autopsy is not commonly used as a tool for determining the root cause of individual productivity issues.
In a small subset of cases, poor individual productivity may be the result of a medical condition requiring a CAT scan (the medical version of a live Kernel Crash Dump). However, these cases are very rare and such techniques would only be used with a significant body of supporting evidence.
Kernel Crash Dumps are useful for a very small subset of performance cases. Specific performance problems rooted in memory shortfall caused by a memory leak would be one example, but these are quite rare in the big scheme of things and would need supporting evidence to use the Kernel Crash Dump approach.
I have come across a number of cases in the last few months where a crash dump has been requested and only one was possibly valid.
Before collecting the CAT scan equivalent of your system (with the associated cost) in the hope it shows up the cause of a performance problem, check the pulse, breathing and circulation 1st. If you do collect a live crash dump, make sure the supporting evidence and rationale are sound.
Monday May 26, 2008
The Cader Idris Fell Race starts in and returns to Dolgellau. I have walked up Cader many times, including carrying both little people up there in a baby backpack. Weather was on the hot side for Wales, but on the mountain the wind kept the temperature down. Great organisation as usual, thank you Mr. Stringer and friends. I was a bit slower than I hoped, so either I need to train harder or lower my aspirations a little.
Much respect to the walker, unconnected with the race, who in a random act of kindness thrust 2 jelly babies into my hand as I stumbled past.
Bit of a break till my next race, which is local (about a mile from where I live), The Red Kite Challenge.
I am in Dublin this week running a Sun Global Resolution Troubleshooting course for the Systems Test Group. These are the people who test future Solaris releases, so effective and accurate troubleshooting is essential to work out if a problem lies in a test harness or the product, passing well defined and described problem statements back to development only where a real bug exists.
The Systems Test and the Patch Test Groups are probably the most advanced users of SGRT outside Services. SGRT really is an integral part of their processes; it delivers real gains in productivity and reduces the number of false positives sent back to Development for diagnosis.