Interconnectedness of all things

A few days in Edinburgh

Thursday Jul 24, 2008

I have been working with the MIS people at Edinburgh University and a consultant from Tribal, which has been much fun indeed. I learned a number of things and was reminded of a few more that my brain had chosen to put in long term storage.

  • cron starts non-root processes with a nice value of 2, so they run at a lower priority than jobs started on the command line or via SMF. The queuedefs man page explains more, but the syntax is arcane!
  • It is worth snooping traffic to and from the DNS server. This often shows up errors or performance opportunities in nscd.conf and resolv.conf, such as having enable-cache set to no for ipnodes.
  • If any type of network latency is important, such as in a ping-pong of packets between 2 clients, map out and understand where your firewall(s) sit(s) and benchmark without the firewall to get the scope of the impact. Firewalls are often an invisible (and hard to observe) component, so they are often ignored.
  • Turning TCP Nagle off via ndd (for example, ndd -set /dev/tcp tcp_naglim_def 1) is well worthwhile if ping-pong latency is a barrier.
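
For what it is worth, the queuedefs(4) syntax is less arcane once you pull an entry apart. Each entry has the form q.[njob]j[nice]n[nwait]w, and a missing field falls back to its default (100 jobs, nice 2, 60 seconds), which is where cron's nice value of 2 comes from. The sketch below only illustrates that format; parse_queuedef is a name I made up for the example, not a Solaris tool.

```shell
# Parse one queuedefs(4) entry of the form q.[njob]j[nice]n[nwait]w.
# parse_queuedef is a hypothetical helper written for illustration only.
parse_queuedef() {
    queue=${1%%.*}                  # queue letter: a = at, b = batch, c = cron
    rest=${1#*.}                    # everything after the dot
    njob=100 nice=2 nwait=60        # documented defaults for absent fields
    while [ -n "$rest" ]; do
        num=${rest%%[jnw]*}         # digits before the next suffix letter
        rest=${rest#"$num"}
        case $rest in
            j*) njob=$num ;;
            n*) nice=$num ;;
            w*) nwait=$num ;;
            *)  break ;;            # no suffix letter left, stop parsing
        esac
        rest=${rest#?}              # drop the suffix letter just handled
    done
    echo "queue=$queue njob=$njob nice=$nice nwait=$nwait"
}
```

For example, parse_queuedef b.2j2n90w reports a batch queue limited to 2 jobs, run at nice 2, with a 90 second wait.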

My next race is the Snowdon International Race on Saturday.

[0] Comments

Survival of the Fittest

Friday Jul 11, 2008

Assisted by extra reading time made available by a round trip to South Africa, I have just finished reading Survival of the Fittest by Mike Stroud. This is not a typical physical adventures read, though parts of the book are hard to put down. It is a book written by a medical doctor with a serious interest in what extremes the human body can be taken to, and quite a lot of practical experience of doing just that.

He draws on his own experience of extreme events such as crossing the Antarctic, running 7 marathons in 7 days and the various Eco-Challenges in teams aged from 30 to 74. For example, he explains how the body controls extremes of heat and why it may malfunction.

In short, it's a very practical read if you are interested in pushing your own physical limits or just living a better quality of life into your more senior years.

Spaghettification

Saturday Jun 28, 2008

BBC Radio 4 have been running an excellent series on the history of astronomy, Cosmic Quest. During a discussion of what might happen if you fell into a black hole (not an everyday event in mid Wales, but we do lag behind), the term Spaghettification came up: it describes the process where the difference in gravitational force between your head and your feet would stretch you out like spaghetti.

Quite a lot of the program last night was devoted to the Anthropic Principles and the program did a very good job of presenting some very heavy material in an accessible manner. Well worth downloading and listening to if you are at all interested in Astronomy.

How the top 500 changes

Monday Jun 23, 2008

Long ago in a far off land (Bracknell) I worked for the Cray bit of SGI. I was the Site Analyst at the UK Met Office in the days when 450MHz DEC Alphas were just about the fastest general purpose processors in the wild. The system being commissioned was the T3E, which ended up having in excess of 800 liquid cooled processors. It was as close to the real Deep Thought as you could get. The customer just wanted the fastest computer they could get their hands on, and if we could have provided 10X the performance they would have eaten that as well. Power consumption was an aside that was just dealt with. Given that 3/4 of the machine was devoted to modelling climate change, this now looks ironic.

This was the 2nd largest T3E built at the time; the largest (1300 or so processor units) went somewhere in Maryland. I think this machine was Number 4 on the Top 500 at the time (12 years ago).

Look how the Top 500 has changed with power consumption now included.

Nice to see the Ranger system in the Number 4 slot.

I learned a lot from the quite short time I worked with that machine, including but not limited to:

  • For some customers, technology won't keep you ahead of their demand for processing power.
  • The biggest of a new product line always exposes multiple new problems which need to be debugged at the customer's pain. (We saw the same on every top end product line that Sun has released, so it's industry wide.)
  • Don't assume that because a script is written in ksh, it is not critical to the performance of a customer's business. (Shell builtins are holy.)
  • Crash dumps are critical to effective diagnosis.

Unicos/MK was a well developed OS in terms of debugging technology at the time, which, given that each processing element (processor) could generate its own crash dump file, was just as well.

Red Kite

Sunday Jun 22, 2008

The race this weekend was the Red Kite Weekend, and coming in just under my target time of 1:45 was quite pleasing. I don't think the local knowledge helped much, but it was a fun race.

The race very much lived up to its name. Just after I finished it was Red Kite feeding time, and to see around 100 of these birds of prey flocking was amazing. Whatever the rights and wrongs of feeding (and I guess many of these birds are dependent on being fed), when I was a child 30 years ago in the same part of Wales it was special to see a Red Kite at all. Now a Red Kite is probably more common than a crow, which I think is a worthwhile tradeoff in the big scheme of things. I shall have to make it to Nant-y-Arian with the video camera at feeding time; the Red Kite is a spectacular bird.

I am going to have a break from races for a few months, get some more concentrated hill training in the bag over the summer, do family holidays etc. and return to a couple of interesting races in late September.

Clive v horse!

Sunday Jun 15, 2008

On Saturday I joined around 500 runners and 40+ horses to take part in one of the more curious events in the running calendar: the Man vs Horse Marathon. This year's event was won by a horse (a relay team did finish 1st, but they are not counted in this respect) 10 seconds ahead of the 1st man, which must have been some finish to watch.

It's the 1st time I have run that sort of distance (22 miles, 3000ft ascent), so I was pleased to be just outside my target time at 3:35:13. I managed to finish before quite a few horses and about 1/2 way down the field of runners. I found the last 4 miles really hard.

About 60 relay teams entered, each member doing a leg of around 7 miles. At about mile 18 I was passed by Andy Croft, an Engagement Architect in Sun UK, who was running the final leg for his team, which was a bit of a surreal moment.

One feature of this race which should be compulsory for any marathon is a river crossing a few hundred metres from the finish to help induce leg cramp.

I did manage a spurt of speed in the last 100m as the thundering hoofs of a horse named Socks approached from behind. Funny how the thought of beating just one more horse can help you find energy you wished you had 3 miles back.

This is a race I will do again! Very well organized and marshaled. Horses did not constitute a significant risk of trampling. The course was a mixture of forestry tracks, open moorland, rivers, mud, foot paths and some tarmac (< 20%).

I think I will give the Bog Snorkelling, which also takes place in Llanwrtyd Wells, a miss. I don't quite have the features for it.

Next weekend is the Red Kite Challenge, where the course passes within a mile of where I live, so it would be rude not to do it!

[1] Comments

T2000 firmware : Thanks for the comments

Thursday Jun 05, 2008

Thanks to Steve for pointing out a new patch revision. It did cause me to ask for the SRAS rules (an internal proactive analysis tool set) to be updated. There are 2 streams of firmware for the T2000 depending on version and if you want LDOM support.

This link has some useful firmware related material.

Thanks to zdz for his insight into why firmware upgrade is seen as a bigger hurdle than software patching. Diagnosis of firmware related issues is typically more difficult, and for systems like the T2000, which have a thicker firmware layer including the Hypervisor, I would argue that proactive firmware upgrades become more critical. We don't yet have truss, mdb and DTrace for firmware!

Thanks to Robert for articulating his T2000 firmware upgrade experience. You may have hit CR 6696642 'slow responsive console after firmware upgrade', for which a later firmware rev should soon be available. A workaround is documented here.

Thanks to Peter for his question about bufhwm. We don't actually close comments ourselves; there is a default period (a week from memory, but it can be modified) after which the ability to add comments is closed automatically. As regards bufhwm tuning, I would always start with: what business metric tells you that tuning is required? A good answer can be "just because I am interested". If so, contact me via email and we can explore your query in more detail.

Thanks again to Robert, who documents the workaround of turning the machine off for 10 seconds.

Thanks to Steve for the link. Not ideal for remote management, I have to agree. We run a remote lab so have the ability to power remote machines on and off, but that does require some investment in infrastructure which I agree should not be needed for a firmware upgrade. An ideal world we do not live in.

I was thinking of entering the Man vs Horse in the hope that they have a 3 legged cart horse which has to do the race backwards to pace me.

T2000 firmware often ignored

Wednesday May 28, 2008

Many customers (and engineers!) ignore firmware as part of their patching strategy, and this can result in hard to diagnose issues. Over the last couple of weeks I have come across a couple of customer performance issues (some 1st hand and some 2nd hand) on T2000s which were resolved by applying the current firmware patch.

We have very limited observability in the firmware layer, so diagnosis can be a challenge to say the least.

So in the spirit of avoiding future problems, have a quick look at the output of prtconf -V and if it does not show 4.28.1 or later, consider applying patch 136927-01 (or a later revision if you are reading this in 6 months' time). This is a patch where a long cool read of the README and install instructions is wise.
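
If you have a lot of machines to check, the comparison is easy to script. This is a minimal sketch, assuming prtconf -V prints a line that starts with "OBP" followed by the version number; check the output on your own systems first. obp_version and version_ge are names invented for this example.

```shell
# Print the version field from `prtconf -V` style output read on stdin.
# Assumption: the line looks like "OBP <version> <date> <time>".
obp_version() {
    awk '$1 == "OBP" { print $2; exit }'
}

# version_ge A B: succeed if dotted version A is the same as or newer than B.
# Fields are compared numerically, left to right; missing fields count as 0.
version_ge() {
    a=$1 b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        x=${a%%.*}; y=${b%%.*}
        [ "${x:-0}" -gt "${y:-0}" ] && return 0
        [ "${x:-0}" -lt "${y:-0}" ] && return 1
        case $a in *.*) a=${a#*.} ;; *) a= ;; esac
        case $b in *.*) b=${b#*.} ;; *) b= ;; esac
    done
    return 0
}

# Example use on a live system (not run here):
#   fw=$(prtconf -V | obp_version)
#   version_ge "$fw" 4.28.1 || echo "firmware $fw is older than 4.28.1"
```

The comparison is numeric per field, so 4.30 correctly counts as later than 4.28.1, which a plain string comparison would not guarantee.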

The same principle applies to the T1000 and T5220, but the patches are different.

[8] Comments

Using Kernel Crash dumps for Performance Analysis

Tuesday May 27, 2008

Kernel Crash dumps are a point in time snapshot of the Solaris Kernel state. The aim is to allow post mortem analysis of the system state at the point the crash dump was taken. For system panics and hangs, the ability to look at the system state is the primary failure analysis tool and one of the reasons Solaris is as reliable as it is.

I think of system failures as a 2 dimensional problem. The interaction of data and code at the point in time of the failure can be analyzed with tools such as MDB which are designed for this type of post-mortem analysis.

Performance adds the 3rd dimension of time.

Autopsy is not commonly used as a tool for determining the root cause of individual productivity issues. In a small subset of cases, poor individual productivity may be the result of a medical condition requiring a CAT scan (the medical version of a live Kernel Crash Dump). However, these cases are very rare and such techniques would only be used with a significant body of supporting evidence.

Kernel Crash Dumps are useful for a very small subset of performance cases. Specific performance problems rooted in memory shortfall caused by a memory leak would be one example, but these are quite rare in the big scheme of things and would need supporting evidence to use the Kernel Crash Dump approach.

I have come across a number of cases in the last few months where a crash dump has been requested and only one was possibly valid.

Before collecting the CAT scan equivalent of your system (with the associated cost) in the hope that it shows up the cause of a performance problem, check the pulse, breathing and circulation 1st. If you do collect a live crash dump, make sure the supporting evidence and rationale are sound.

[1] Comments

This weekends mountain based lunacy was ....

Monday May 26, 2008

The Cader Idris Fell Race which starts in and returns to Dolgellau. I have walked up Cader many times, including carrying both little people up there in a baby backpack. Weather was on the hot side for Wales, but on the mountain the wind kept the temperature down. Great organisation as usual, thank you Mr. Stringer and friends. I was a bit slower than I hoped, so either I need to train harder or lower my aspirations a little.

Much respect to the walker, unconnected with the race, who in a random act of kindness thrust 2 jelly babies into my hand as I stumbled past.

Bit of a break till my next race, which is local (about a mile from where I live): the Red Kite Challenge.

I am in Dublin this week running a Sun Global Resolution Troubleshooting course for the Systems Test Group. These are the people who test future Solaris releases, so effective and accurate troubleshooting is essential to work out if a problem lies in a test harness or the product. The aim is to pass well defined and described problem statements back to Development only where a real bug exists.

The Systems Test and the Patch Test Groups are probably the most advanced users of SGRT outside Services. SGRT really is an integral part of their processes; it delivers real gains in productivity and reduces the number of false positives sent back to Development for diagnosis.

Sarn Helen

Monday May 19, 2008

This weekend's race was in Lampeter, put on by the Sarn Helen club. The route was just over 16 miles and 3000ft and was about 1/3 fields and tracks and 2/3 road. I am not a big fan of running on roads. Part of the route followed the old Sarn Helen Roman road.

It is the 1st time I have run this type of distance, so after about 13 miles my legs were not happy. However, the views were quite stunning looking both north and south from the higher elevations of the race, and the road sections were on very minor and quiet roads.

Next race is Ras Cader Idris on Saturday the 24th.

An excuse to visit Church Stretton

Thursday May 15, 2008

and what a nice place Church Stretton is. Set in the Shropshire countryside with moorland on either side. Well worth spending a day walking around and lots of potential for pleasant walks.

The hills east and west of Church Stretton are not large by the standards of north Wales or the Lake District, but they do present a number of steep hills which make for some cracking fell races.

Just over 120 souls set out last night on the Caradoc Classic race, 880ft and 3 miles. Bit less mud than I am used to on the races in north Wales, where waist deep bog trotting is not considered unusual. However, the lack of mud did make the course quite fast. A perfect early summer evening and a stunning view (sorry, left my camera behind) from the top of Caer Caradoc, which few of the runners took time to soak up.

Pleased with my placing, but the race times have not yet been posted.

Sarn Helen race near Lampeter on Sunday which is a much longer and harder proposition.

Remote replication using ZFS

Tuesday May 13, 2008

So the question I was asked by one of our UK Academic Sales Account Managers was "Can you use ZFS for replication between remote sites?"

The answer is: it depends.

It depends on

  • How big a window of data can you afford to lose?
  • How much data gets written to the filesystem?
  • How much data can you send over ssh?

So, if you cannot afford to lose a single transaction in the event of needing to fail over, then ZFS replication is not for you. Look at the vast range of SNDR type products which do synchronous data replication across remote sites.

If the number of transactions you can afford to lose is non zero, then ZFS may open up an exciting world at no extra cost. Let's start by finding a few figures:

  • What is the peak change rate on your filesystem (now and projected)?
  • What transaction loss window can be tolerated?
  • How many GB/s can you send over ssh between your 2 candidate machines?

I have been working with Geoff Bell at the University of Bradford, who manages their mail service. The rate of change of the mail server's filestore has been observed at 20GB of change over a 6 day period. This is in the region of 140MB an hour, or a little over 2MB a minute, average change.

The mail servers that Geoff manages get backed up every night. So the current transaction loss window is up to 24 hours meaning that if an email comes in during the day and an improbable event such as the disk array going on fire occurs, then all messages sent in that day may be lost.

The command

ptime dd if=/dev/zero bs=16k count=10000 | ssh <hostname> dd of=/dev/null

shows that we can get just over 2GB a minute between the two X4500s using ssh. This improves by about 30% if we add -c blowfish to ssh.

So we have headroom for error/growth in the region of around 1000 times.
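
To make the sums explicit, here is a small sketch using only the figures quoted above (20GB over 6 days, about 2GB a minute over ssh). It uses binary units and integer arithmetic, so the results are rounded down; the variable names are just for illustration.

```shell
# Figures taken from the text; integer arithmetic, so results round down.
change_mb=$((20 * 1024))        # 20GB of change, expressed in MB
period_min=$((6 * 24 * 60))     # 6 days, expressed in minutes
link_mb_per_min=$((2 * 1024))   # roughly 2GB a minute measured over ssh

avg_mb_per_hour=$((change_mb * 60 / period_min))
avg_kb_per_min=$((change_mb * 1024 / period_min))
headroom=$((link_mb_per_min * 1024 / avg_kb_per_min))

echo "average change: ${avg_mb_per_hour}MB/hour (${avg_kb_per_min}KB/minute)"
echo "headroom over ssh: roughly ${headroom}x"
```

With these inputs the average works out at about 142MB an hour and the headroom at a bit under 900x, which is the same order of magnitude as the rough estimate. Remember this compares a measured link rate with an average change rate; the peak change rate is the figure that really matters.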

I put together this script to manage a loop of zfs snapshot and zfs send/recv. The experimental results show that it was good up to 2GB of filesystem change per minute.

The script is simple. It looks for a snapshot on the failover system. If there is none, it does a full snapshot and send. If there is an existing snapshot, it takes the script's argument and works from that.

It then works in a loop taking a snapshot and doing incremental send/recv until the end of time.
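
For readers who just want the shape of such a loop, here is a minimal sketch. It is not the script referred to above. The filesystem name, the standby host name and the interval are hypothetical placeholders, and by default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: FS, REMOTE and INTERVAL are hypothetical placeholders.
FS=${FS:-tank/mail}          # filesystem to replicate
REMOTE=${REMOTE:-standby}    # standby host reachable over ssh
INTERVAL=${INTERVAL:-60}     # seconds between snapshots
RUN=${RUN:-echo}             # prints commands; set RUN=eval to execute them

# Full copy of one snapshot to the standby machine.
send_full() {
    $RUN "zfs send $FS@$1 | ssh $REMOTE zfs recv -F $FS"
}

# Incremental copy of the changes between two snapshots.
send_incr() {
    $RUN "zfs send -i $FS@$1 $FS@$2 | ssh $REMOTE zfs recv -F $FS"
}

# Snapshot forever; $1 is the last snapshot already on the standby, if any.
replicate() {
    prev=$1
    while :; do
        now=$(date +%Y%m%d%H%M%S)
        $RUN "zfs snapshot $FS@$now"
        # Only move the baseline forward when the transfer succeeded.
        if [ -z "$prev" ]; then
            send_full "$now"
        else
            send_incr "$prev" "$now"
        fi && prev=$now
        sleep "$INTERVAL"
    done
}

# To start: replicate            (first run, full send)
#       or: replicate 20080513120000   (resume from a snapshot name)
```

Keeping the previous snapshot name as the baseline until a send succeeds means a failed transfer is simply retried as a larger increment on the next pass. Old snapshots are never deleted here, so something else has to prune them.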

The biggest downside is that with 1.4TB of existing mail, the 1st send/recv will take in the region of 8 hours! Still, should only have to do it once.

I have left Geoff the open problem of working out which snapshots to delete, but pointed him at Chris Gerhard's blog which gives a solution to this very problem.

Failover would of course be manual, but on the standby machine it would only require the most current complete snapshot to be promoted and renamed and the service restarted on the standby node.

Each site will have different needs in terms of filesystem layout, interval, etc. I can only really provide a template that worked in one place. The script does not need an argument, but if you want to restart from the last snapshot transferred, then just give that as an argument to the script. Any changes/improvements very welcome.

ZFS snapshots and the send/recv mechanism open up some novel options, for very little extra cost, to provide improved currency of the data in case of failover.

[2] Comments

Bringing some science into running

Tuesday May 06, 2008

This week's race was Y Garn in north Wales and my 1st race as a member of Eryri Harriers. This was a shorter race, 3 miles and 1500ft, where you run/walk up a mountain to the first checkpoint, then struggle to follow the markers in the mist down a different route to a second checkpoint and then to the finish.

As noted last week, I got myself a heart rate monitor, and this is the 1st race to put it, and the learning from Jon, the PhD student at Aberystwyth University Sports Sciences Department, into practice. Earlier in the week I spent 45 minutes running on a treadmill where Jon took a blood sample every 5 minutes and then increased the treadmill speed. Combining a record of my heart rate with the blood glucose levels etc., he suggested a number of heart rate thresholds not to go above. So for longer events, taking much more than an hour, I should stay below 156 beats per minute; with 171 as my max heart rate, for shorter events such as the Y Garn race I should stay around 165.

For the race on Saturday I set an alert for when my heart rate went above 166 on the way up, and tried to keep above 160, which worked really well. I kept close to my sustainable limit uphill and had reserves for the downhill. I just ignored the monitor alert on the way down.

Next planned race is in 2 weeks time, Sarn Helen near Lampeter which is a little longer at 16 miles and 3000ft.

70x performance improvement in 5 minutes

Friday May 02, 2008

A good friend of mine who is a Systems Engineer/Engagement Architect in the UK sent me a copy of a benchmark which his customer was using to assess the performance of various types of SPARC machines. While the benchmark is simplistic, the customer had a concern over its performance on a T5220, and any customer concern is valid.

The spirit of the customer benchmark was

#!/bin/ksh

i=0
while [ $i -lt 63 ]
do
    ./run2_slow &

    echo Starting $i

    i=`expr $i + 1`
done

time ./run2_slow

which calls

#!/bin/ksh 

loop=0

while [ $loop -lt 1000 ]
do
       bc <<E > /dev/null
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100
E
       loop=`expr ${loop} + 1`
done 

which executes in around 70 seconds.

Clive's version, which required 5 minutes of very simple coding change:

#!/bin/ksh

i=0
while [ $i -lt 63 ]
do
    time ./run2_fast &

    echo Starting $i

    i=`expr $i + 1`
done

time ./run2_fast
which calls
#!/bin/ksh 

loop=0

while [ $loop -lt 1000 ]
do
n=0
((n=n+100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100* \
               100/100*100/100
))
	((loop=loop+1))
done
runs the same number of iterations on the same machine in 1.01 seconds.

For the slower version, dtrace -s /usr/demo/dtrace/whoexec.d shows that there are a huge number of fork/exec sequences, where the script forks bc, which forks dc, and the counter using expr also requires calls to fork/exec. Less than 1% of the time spent in this script was actually calculation.
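
The effect is easy to reproduce in isolation. The two functions below are a cut-down illustration, not the customer's code: both count to N, one by running the external expr command on every pass and one using shell built-in arithmetic. The function names are made up for this example.

```shell
# Count to N by running the external expr command once per iteration,
# which costs a fork/exec every time round the loop.
count_with_expr() {
    i=0
    while [ "$i" -lt "$1" ]; do
        i=$(expr "$i" + 1)
    done
    echo "$i"
}

# Same loop using shell built-in arithmetic; no external command is run.
count_with_builtin() {
    i=0
    while [ "$i" -lt "$1" ]; do
        i=$((i + 1))
    done
    echo "$i"
}
```

Running "time count_with_expr 1000" and then "time count_with_builtin 1000" gives the same answer from both, but the first spends nearly all of its time creating processes. The exact ratio will depend on the machine and the shell.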

An interesting system level bottleneck did drop out where the text segment of libc was being faulted in as the process is being created as a result of a call to memcntl something like this

enoexec(5.11)$ truss -t memcntl /bin/true
memcntl(0xC4080000, 227576, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
Where you have many concurrent processes calling fork, you do end up with some lock contention in ufs_getpage like this
-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec Lock                   Caller                  
111484  18%  18% 0.00  1747392 0x30001293610          ufs_lockfs_begin_getpage+0xc8

      nsec ------ Time Distribution ------ count     Stack                   
       512 |                               5         ufs_getpage+0x7c        
      1024 |                               26        fop_getpage+0x90        
      2048 |                               43        segvn_faulta+0x114      
      4096 |                               61        as_faulta+0x138         
      8192 |                               131       memcntl+0x8d0           
     16384 |                               711       syscall_trap32+0xcc     
     32768 |@@@                            14042     
     65536 |@@@@                           17764     
    131072 |@@@@                           17988     
    262144 |@@                             9247      
    524288 |                               2818      
   1048576 |                               3215      
   2097152 |@                              5994      
   4194304 |@@@@@@                         24221     
   8388608 |@@                             10884     
  16777216 |                               3414      
  33554432 |                               310       
  67108864 |                               7         

It would be interesting to try this on a T5220 with ZFS as the root filesystem, but where is the bottleneck? I would argue that there may be a little room for improvement in UFS, but that the benchmark is pathological. My experience of dealing with performance issues in the field over the last 6 years is that the large 15K/25K, then the T2000 and now the T5220 are very good indeed at exposing applications which don't scale. This is an example of where a simplistic benchmark can lead to incorrect, or at least very incomplete, conclusions about the underlying platform. The benchmark as the customer used it just measured fork/exec performance. You would not implement a business solution like that, or would you?

Getting a 70x improvement from changes to Solaris or the underlying hardware is going to be a significant challenge. A 70x speedup from application changes in this case was viable, and a little consulting help might not go amiss.

[3] Comments