Tuesday Mar 31, 2009

What if you have what appears to be a hang (Solaris only at this time)?

I've broken this out of the data collection section since it was beginning to overcrowd it. If you spot anything wrong or any glaring omissions then, as ever, I would be grateful for your contribution.

What follows is for the "next time" scenario; if your machine is hung right now you more than likely can't do a lot about it. It's best to start from the ILOM for this, so from here on I'll assume you have one.

When you experience a hang you will be asked for information, and the script utility makes collecting command output a little easier. From a terminal window type script -a <filename>, then ssh to the ILOM and execute start /SP/console. This gives you a console session on the host. Reboot the machine (using the ILOM command reset /SYS, or however you want to/can; note that reset /SP restarts the service processor rather than the host) and modify the grub boot option that you normally use. Put the machine into kernel debugger mode as follows...
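As a rough sketch of the capture session described above: the little helper below only builds a timestamped log file name for script -a to use, and the comments show the interactive steps in order. The function name, the naming scheme, the host label and the ILOM address are all made up for illustration, so substitute your own.

```shell
# Build a log file name such as hang-myhost-20090331-101500.log.
# The naming scheme is an assumption; any file name will do for script -a.
hang_log_name() {
    # $1 = label for the machine being debugged (hypothetical, e.g. "myhost")
    echo "hang-$1-`date +%Y%m%d-%H%M%S`.log"
}

# Typical use, typed by hand because everything after "script" is interactive:
#   script -a "`hang_log_name myhost`"   # start recording the terminal session
#   ssh root@my-ilom                     # hypothetical ILOM address and user
#   start /SP/console                    # at the ILOM prompt: attach to the host console
#   ... reproduce the hang and run the debugger commands ...
#   exit                                 # end the script session so the log is complete
```

Starting script before the ssh means the whole console session, including anything the debugger prints, ends up in one file that you can attach to the call.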

kernel /boot/multiboot kernel/unix -k

You might want to boot the machine single user too; that just needs a -s added to the same command line.

Once the machine appears to hang, send a break via the ILOM console (remember, you should have come in from the ILOM at this point) by pressing 'Esc' followed by shift+b. You should now drop to the kmdb prompt. If you happen to be on the host itself then, so long as you have a functioning keyboard, you can drop into kmdb with the key sequence [shift]+[pause], and possibly F1+a, though be prepared for it not to work. Also, if you have a usb kvm, it would be as well not to start switching keyboards/screens, as this appears to lock up either the keyboard or kmdb (I'm not sure which one). Make sure that you do not run the gui on the console, as this tends to hide the console output.

Once at the kmdb prompt please run the commands below and capture their output. Don't forget to type the "::", otherwise you'll get either rubbish or nothing.

::cpuinfo -v
::ps
::ptree
::kmastat
::cpustack -c 0
::cpustack -c 1
::interrupts
::msgbuf -v
::fsinfo
::cpuregs
::swapinfo
::sysevent
$<threadlist

If you are able to collect a full crash dump then you can also type $<systemdump whilst in the debugger.

If the kernel isn't responding it might be worth enabling the deadman kernel code to help you out. This is done by adding the following line to /etc/system:
set snooping=1
(There are other arguments, but sunsolve can help you out with those.)

For particularly difficult hard hangs the following might help (in conjunction with set snooping). Modify /etc/system to have the following lines...

set pcplusmp:apic_panic_on_nmi=1
set pcplusmp:apic_kmdb_on_nmi=1

The lines above are intended to create a crash dump on an NMI and to drop into the kernel debugger.
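If you would rather not hand edit /etc/system, something along these lines could add the settings without duplicating them. It is only a sketch: the function name and the scratch file are my own invention, and it deliberately works on a copy so you can inspect the result before putting it in place. On Solaris 10 the grep options used here may need /usr/xpg4/bin/grep.

```shell
# Append a line to an /etc/system-style file unless it is already there.
# $1 = file to edit, $2 = exact line to add.
add_system_line() {
    file=$1
    line=$2
    # Keep one backup of the file as it was before the first change.
    [ -f "$file.orig" ] || cp -p "$file" "$file.orig"
    # -F treats the text as a fixed string, -x only matches whole lines.
    grep -F -x "$line" "$file" > /dev/null 2>&1 || echo "$line" >> "$file"
}

# Work on a scratch copy (hypothetical path), not on the live file.
SCRATCH=/tmp/system.$$
if [ -f /etc/system ]; then
    cp /etc/system "$SCRATCH"
else
    : > "$SCRATCH"        # no /etc/system here, start from an empty file
fi

add_system_line "$SCRATCH" "set snooping=1"
add_system_line "$SCRATCH" "set pcplusmp:apic_panic_on_nmi=1"
add_system_line "$SCRATCH" "set pcplusmp:apic_kmdb_on_nmi=1"

# Show the set lines so the result can be checked by eye.
grep '^set' "$SCRATCH"
```

Running it twice leaves the file unchanged the second time, which is the point of checking for the line first. A reboot is still needed before any of these settings take effect.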

---o---
What's an NMI?
Non-Maskable Interrupt: an interrupt that in most cases can't be stopped. Since the NMI is set at int 2 (vector 2 of the idt) this is fairly close to invincible. An NMI is raised either by external hardware or by the system bus or an apic. Only one NMI at a time is permitted. The NMI reset button (if your machine has one) is tied to the INTR pin of the cpu; this is classed as a maskable interrupt though, so you might not get the same result as a true NMI event. An NMI is a level 5 priority (as opposed to a level 2 interrupt; different things, you see) and comes after the following:

hardware reset and / or machine check (mce)
task switch trap
external hardware int
breakpoints and debug traps
then our NMI

There are another 5 after this (the full list runs 1-10), but we don't really need to care about them. All of this means that we have a way to cause a machine to interrupt anything it's doing and panic on certain events (hence the /etc/system lines above).

Notes:
idt = interrupt descriptor table, an index to the place where each interrupt service routine is held
interrupt = a notice for the processor to stop what it's doing and deal with this request (based on priority of course).
apic = advanced programmable interrupt controller (cpu pin based interrupts or cpu to cpu interrupts in multi cpu units).
---o--- 

It's also possible that you will need to boot in 32bit mode, and this is how you do that. Edit the grub entry that reads "kernel /platform/i86pc/multiboot" and change it to read "kernel /platform/i86pc/multiboot kernel/unix". If you want to boot 32bit and single user, add a "-s" to the end of the boot line.

Monday Oct 20, 2008

Svm is a stable and solid product born from years of development work, yet there seem to be a great many calls raised when trying to patch a machine that has Solaris x86 installed along with svm. For the most part patching is trivial and almost always successful in the Sparc world, yet it seems to give no end of trouble in x86 land.

So here's a list of things that seem to work (for me at least).
Here's a brief resume of the machine:

Solaris 10 05/08 127128-11
grub 0.95 (standard for Sun)
patch 137112-08

1). check you can boot from the mirror (important)
2). go back to the underlying devices
    make a backup of these files first!
    strip out the md portions of /etc/system
    modify the mount points in /etc/vfstab
    update the boot archive
3). boot single user (from grub "multiboot -s")
4). clear out the metadevices, then do a quick metastat just to make sure no metadevices are seen
5). start the patching process
6). some patches will leave /reconfigure created, be aware of that in case you have disks pulled.
7). update the boot archive, though a reboot is almost certain to do this anyway
8). reboot multiuser
    before re-encapsulating the disks (with svm) make sure that all appears well. Do all the applications appear to be running? Can this machine run as it used to?
9). encapsulate the disks using svm
    make sure you run metaroot and modify /etc/vfstab, and ensure vfstab points at the mirrors for both the root disk and swap
10). reboot multiuser.

If any fsck's are required during any of this, make sure you do not run them on the underlying devices while svm is running. Always remember where you are and what state your system is in.

11). make sure you check dumpadm, you might need it at some point.
12). notes:
    patching (at least 137112-08) now seems to require a password; this password is found in the cluster readme file. It seems that this may be the way forward, so now you really do need to read the README.

Getting svm out of the way is important, if only so that you can break the mirrors and pull the mirror disk from the machine.
Depending on what you pull you may need to modify bootenv.rc to cope with a new hardware path. I don't think you should have to, but you should know it's a possibility.
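To illustrate the "modify the mount points in /etc/vfstab" part of step 2, here is a minimal sketch that rewrites metadevice entries back to their underlying slices. The function name, the metadevice names (d10, d20) and the slices (c0d0s0, c0d0s1) are hypothetical, and it runs against a sample file in /tmp rather than the real vfstab. For the root filesystem itself, running metaroot against the underlying slice should make the equivalent changes to /etc/system and /etc/vfstab for you.

```shell
# Rewrite one metadevice in a vfstab-style file back to its underlying slice.
# $1 = file, $2 = metadevice name (e.g. d10), $3 = slice (e.g. c0d0s0).
# The [^0-9] guard stops "d1" from also matching "d10".
unmirror_vfstab_entry() {
    sed -e "s|/dev/md/dsk/$2\([^0-9]\)|/dev/dsk/$3\1|g" \
        -e "s|/dev/md/rdsk/$2\([^0-9]\)|/dev/rdsk/$3\1|g" "$1" > "$1.new" &&
    mv "$1.new" "$1"
}

# Sample vfstab with a mirrored swap and root (made-up devices).
VFSTAB=/tmp/vfstab.$$
cat > "$VFSTAB" <<'EOF'
/dev/md/dsk/d20 - - swap - no -
/dev/md/dsk/d10 /dev/md/rdsk/d10 / ufs 1 no -
EOF

cp -p "$VFSTAB" "$VFSTAB.orig"      # make a backup before touching anything
unmirror_vfstab_entry "$VFSTAB" d10 c0d0s0
unmirror_vfstab_entry "$VFSTAB" d20 c0d0s1
cat "$VFSTAB"

# On the real machine you would follow this with: bootadm update-archive
```

Keeping the .orig copy means that re-encapsulating later (step 9) is mostly a matter of putting the old entries back.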

---o---

Patch 137138-09 has a few problems. Check these Sun Alerts out before you start patching. You need a support contract for one of them, but the rest are public.

246206    Public
246207    Public
245626    Public
248126    Support contract required
250426    Public

Enter these search terms in sunsolve for information on patches you should have installed already.

"Install and Patch Utilities Patch"
"patch behavior patch"
"umountall patch"

Your version of Solaris may not need them, but check anyway. And the one caveat that I assume you always know to do is: "READ THE README".

BIG NOTE

Sun Alert 257908 has been released today (05/05/2009) which covers the issues encountered when adding patch 137138-09, including the hanging problems. The essence of it is as follows:

Workaround
Run svcadm enable -rst network/rpc/meta:default before adding the patch.

OR apply these patches before this patch:
119255-58 or later and 125556-03 or later

If you want to know more then look the Sun Alert up on Sunsolve.
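One way to check whether the prerequisite patches are already on the machine is to look through the output of showrev -p. The sketch below assumes the usual "Patch: <id>-<rev> Obsoletes: ..." line format, the function name is my own, and the sample listing (including its package name) is made up so that the example can be tried anywhere. On Solaris 10 use nawk or /usr/xpg4/bin/awk, since the old /usr/bin/awk does not understand -v.

```shell
# Succeed if the showrev -p style listing on stdin contains patch $1
# at revision $2 or later.
patch_at_least() {
    awk -v base="$1" -v min="$2" '
        $1 == "Patch:" {
            split($2, id, "-")          # id[1] = patch number, id[2] = revision
            if (id[1] == base && id[2] + 0 >= min + 0) found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# Made-up listing: 119255 is new enough, 125556 is one revision short.
sample='Patch: 119255-58 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWexample
Patch: 125556-02 Obsoletes:  Requires:  Incompatibles:  Packages: SUNWexample'

if echo "$sample" | patch_at_least 119255 58; then
    echo "119255-58 or later is installed"
fi
if ! echo "$sample" | patch_at_least 125556 03; then
    echo "125556-03 or later is missing"
fi

# On a real machine: showrev -p | patch_at_least 119255 58
```

If either patch comes up short, the svcadm workaround above is the other route.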
---o---

This blog copyright 2009 by Paul Scott