When a customer has a question about one of our x86 machines, we almost always need some information first. Just as in the world of SPARC, it cannot be avoided. I thought I would post this list here as a starting point at least. These are the questions I ask nearly every time I have to answer the question "What's wrong with it?".

Please be aware that we need a problem statement and we will dig for one. Simply telling us that the machine does not work really does not help us to help you.

We gather most of the information from the service processor (SP). For this we use a shell on the SP itself and also an open source product called ipmitool. It's far too useful not to have installed somewhere on the same network as your problem machine (laptop, desktop etc.), if that's allowed.

Windows ipmitool: http://www.sun.com/download/products.xml?id=46f1ff04
Look here for the latest Sun-supported ipmitool (search for ipmitool):
http://www.sun.com/download/index.jsp?tab=2
Open source ipmitool (we do not support using this one for data gathering): http://ipmitool.sourceforge.net/

Always use the very latest ipmitool you can manage, and never anything older than 1.8.8.x.
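As a rough sketch, the minimum-version rule above can be checked from a shell before you start gathering data. The function names here are my own inventions, and the sketch assumes that your sort supports -V (GNU coreutils does) and that ipmitool -V prints a line of the form "ipmitool version 1.8.x".

```shell
#!/bin/sh
# Hypothetical helpers (not part of ipmitool) to check the 1.8.8 minimum.

# Succeeds when version $1 is greater than or equal to version $2.
# Assumes a sort that understands -V (version sort).
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Prints the installed ipmitool version, assuming the output of
# "ipmitool -V" looks like: ipmitool version 1.8.18
installed_ipmitool_version() {
  ipmitool -V 2>/dev/null | awk '{print $3; exit}'
}

# Prints a warning when the installed version is missing or too old.
check_ipmitool() {
  v=$(installed_ipmitool_version)
  if [ -z "$v" ]; then
    echo "ipmitool not found or version not recognised" >&2
    return 1
  fi
  if version_at_least "$v" 1.8.8; then
    echo "ipmitool $v is new enough"
  else
    echo "ipmitool $v is older than 1.8.8, please update" >&2
    return 1
  fi
}
```

Run check_ipmitool on the machine you will gather the data from. It only checks the floor; your support contact may still ask for a newer version for a particular model.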

Please bear in mind that you may need to update ipmitool, or use the one from the Tools and Drivers CD, as directed by your support contact. This is because the newer the machine, the more up to date your ipmitool version needs to be, and vice versa. Please make sure you use the correct version: the better our information, the better our diagnosis.

BIG NOTE:

You need to check that you have a support contract for Windows or for whichever of the many Linux variants you run. It's not a given that you can expect support on these operating systems simply because they reside on Sun kit.

Run the following against the ILOM:

ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> bmc info
ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sel elist
ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> fru
ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sensor
ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sunoem led get all


On some of the newer (Intel) machines you might get "Sun OEM Get LED command failed: Parameter out of range". If that happens, use the following command to get the LED status instead (it just adds "sb" to "led" to form "sbled"):
ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sunoem sbled get all
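The led/sbled fallback can be sketched as a small shell function. This is only an illustration: the function name, the ILOM_HOST, ILOM_USER and IPMITOOL variables and the example address are my own assumptions. It uses ipmitool's -E option, which reads the password from the IPMI_PASSWORD environment variable so that you are not prompted twice.

```shell
#!/bin/sh
# Assumed settings; override them in your environment.
ILOM_HOST="${ILOM_HOST:-192.0.2.10}"   # example address only
ILOM_USER="${ILOM_USER:-root}"
IPMITOOL="${IPMITOOL:-ipmitool}"

# Try "sunoem led get all" first; if the ILOM rejects it with
# "Parameter out of range", retry with "sunoem sbled get all".
get_led_status() {
  # A non-zero exit is expected when the command is rejected, so do not stop on it.
  out=$("$IPMITOOL" -I lan -H "$ILOM_HOST" -U "$ILOM_USER" -E sunoem led get all 2>&1) || true
  case "$out" in
    *"Parameter out of range"*)
      "$IPMITOOL" -I lan -H "$ILOM_HOST" -U "$ILOM_USER" -E sunoem sbled get all
      ;;
    *)
      printf '%s\n' "$out"
      ;;
  esac
}
```

Typical use would be to export IPMI_PASSWORD, then run get_led_status > led-status.txt and send in the file.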


ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sdr elist
ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sdr elist full
ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> sdr elist all
ipmitool -I lan -H <ILOM IP address> -U <ILOM Username> chassis status
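If you would rather not type each command by hand, the list above can be run in a loop, saving each output to its own file. This is a minimal sketch rather than a supported tool: the function names, the variables, the output directory and the example address are all my own assumptions, and it again relies on -E reading the password from IPMI_PASSWORD.

```shell
#!/bin/sh
# Assumed settings; override them in your environment.
ILOM_HOST="${ILOM_HOST:-192.0.2.10}"   # example address only
ILOM_USER="${ILOM_USER:-root}"
OUT_DIR="${OUT_DIR:-./ilom-data}"
IPMITOOL="${IPMITOOL:-ipmitool}"

# The ipmitool subcommands listed in this post, one per line.
list_subcommands() {
  cat <<'EOF'
bmc info
sel elist
fru
sensor
sunoem led get all
sdr elist
sdr elist full
sdr elist all
chassis status
EOF
}

# Turn a subcommand such as "sel elist" into a file name stem "sel_elist".
outfile_for() {
  printf '%s\n' "$1" | tr ' ' '_'
}

# Run every subcommand and save stdout and stderr to OUT_DIR/<name>.txt.
gather() {
  mkdir -p "$OUT_DIR" || return 1
  list_subcommands | while IFS= read -r sub; do
    out="$OUT_DIR/$(outfile_for "$sub").txt"
    # $sub is deliberately unquoted so it splits into separate arguments.
    # stdin comes from /dev/null so ipmitool cannot swallow the list.
    "$IPMITOOL" -I lan -H "$ILOM_HOST" -U "$ILOM_USER" -E $sub > "$out" 2>&1 </dev/null \
      || echo "command failed: $sub (see $out)" >&2
  done
}
```

Export IPMI_PASSWORD, call gather, then archive the output directory and send it in. Failed commands are still worth sending, since the error text itself is often useful.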

If we are interested in something specific, we might ask for more in-depth ipmitool command output; please be patient when we do. A common additional switch we might ask for is -v, -vv or -vvv, placed directly after ipmitool (for example, ipmitool -vvv).

On the ILOM itself
SSH to the ILOM as root and run the following. If you have a blade chassis, do this on the CMM as well, and label the output so we know which is the CMM and which is the blade ILOM:

show  / -l  all

This emits a lot of information, so you would be well advised to capture it with the Solaris (or the Linux equivalent) "script" tool. At the prompt (by which I mean the prompt on another machine or terminal, most likely not the local problem machine) type: script -a <filename-of-your-choice>. When you are done typing commands and gathering output, type Ctrl-D and script will exit, leaving you with <filename-of-your-choice>, which you can email in.
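One way to keep CMM and blade ILOM captures apart is to put the label in the file name you give to script. The helper below is purely illustrative; its name, the naming scheme and the example address are my own assumptions.

```shell
#!/bin/sh
# Hypothetical helper: build a labelled capture file name,
# e.g. cmm_192.0.2.20_20090131.log
session_log_name() {
  label=$1   # "cmm" or "blade-ilom", for example
  host=$2    # address or name of the ILOM you are about to log in to
  printf '%s_%s_%s.log\n' "$label" "$host" "$(date +%Y%m%d)"
}

# Interactive usage (typed by hand, shown here as comments):
#   script -a "$(session_log_name cmm 192.0.2.20)"
#   ssh root@192.0.2.20
#     show  / -l  all
#     exit
#   Ctrl-D
```

Repeat with a different label for the blade ILOM, and send in both files.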

And on the blade / host (if possible):

General non-OS tools
PcCheck is useful for testing the machine. It comes bundled on the Tools and Drivers CD, which you can download from sun.com/download. Do remember to download the correct Tools and Drivers CD for your model type.

Memtest86+ is also a very good and useful tool if you think you might have memory problems.
http://www.memtest.org/#downiso

Windows
If the machine is running Windows, send in the System and event logs saved as text files and, if it produced a memory dump, that as well. Also go to http://www.microsoft.com/downloads/details.aspx?familyid=cebf3c7c-7ca5-408f-88b7-f9c79b7306c0&displaylang=en
and use those downloadable tools to send in various reports (you don't have to install reporting tools for facilities you don't have).

Red Hat
If the machine is running Linux (Red Hat), send in the output of "sysreport". Did you also know that Red Hat can produce crash dumps? Using http://www.redhat.com/support/wpapers/redhat/netdump/ you can also send the dump to a networked machine (not your problem machine), which makes things much easier if we ask to see it. I might add instructions here later on how to configure it (let me know if that would be useful to you). Don't be put off by the document's age; it's still quite valid (make sure it is for your version of Red Hat, obviously).

SUSE

For the latest crash dump information under openSUSE, try http://en.opensuse.org/Kdump . If you want or need to use LKCD, look here for more information: http://www.novell.com/coolsolutions/feature/15284.html

Before sending us your SUSE crash dump for analysis, make sure you're running a supported version of SUSE for your machine type. If you're not, it's not likely we can analyse the crash, since we would need the exact same kernel environment you have to do the analysis. If you're running some esoteric version of SUSE, it's not likely we'll have it and you'll be out of luck.

Install kdump (using yast2 kdump from the command line, or install software from the menu). kdump can be found on the distribution CD/DVD. Enable settings as per your configuration (memory, filters etc.). I can't tell you what the best settings are; you need to know that. Make sure you don't use too much memory for the crash kernel though (trial and error).

Make sure that the dump target is realistic, i.e. it has enough space to hold the dump and can be mounted later in an emergency. You can set the target to be a filesystem, FTP, NFS etc.

Gather a supportconfig. If you can't find that, try http://en.opensuse.org/Supportutils and let me know if the link is either broken or useless for you. On openSUSE you can install it using the following command line: sudo zypper install supportconfig. So long as you have a yum server or internet connection, it'll install for you.

Generic Linux
If the machine is running Linux, check for the presence of /var/log/mcelog and send it in. If it does not exist, please run the command /usr/sbin/mcelog >> /var/log/mcelog and send in the resultant log file. If you do not have mcelog installed, you can get it here:

http://rpmfind.net/linux/rpm2html/search.php?query=mcelog

If you are using this on an AMD Opteron/Athlon, the best command line is: mcelog --k8 --dmi --ascii >> /var/log/mcelog

Also collect the output of lspci -xxxx (this gives us a PCI dump we can interrogate using the -F (file) option).
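The generic Linux collection can be sketched as one function that copies an existing machine check log (or runs mcelog if there is none) and saves a PCI dump alongside it. The function name, the variables and the output directory are my own assumptions, and unlike the command above this sketch writes mcelog output into the collection directory rather than /var/log. Run it as root, since both mcelog and lspci -xxxx need that to see everything.

```shell
#!/bin/sh
# Assumed settings; override them in your environment.
MCE_LOG="${MCE_LOG:-/var/log/mcelog}"
LSPCI="${LSPCI:-lspci}"

# Collect the machine check log and a full PCI config dump into $1.
collect_linux_hw_logs() {
  out_dir=${1:-./linux-hw-logs}
  mkdir -p "$out_dir" || return 1

  if [ -f "$MCE_LOG" ]; then
    # An existing log is the best evidence, so copy it unchanged.
    cp "$MCE_LOG" "$out_dir/mcelog"
  elif [ -x /usr/sbin/mcelog ]; then
    # No log yet: decode whatever machine checks are pending.
    /usr/sbin/mcelog >> "$out_dir/mcelog" 2>&1 || echo "mcelog failed" >&2
  else
    echo "mcelog is not installed; no machine check data collected" >&2
  fi

  # Hex dump of PCI configuration space, readable later with lspci -F.
  "$LSPCI" -xxxx > "$out_dir/lspci-xxxx.txt" 2>&1 || echo "lspci failed" >&2
}
```

For example, collect_linux_hw_logs /tmp/hw-logs, then archive that directory and send it in.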

VMware
Collect the output of the vm-support script (vm-support is installed as part of the default install) from the console of the server.
Use vmkdump -l <core-file-name> to extract the kernel log, and make sure you also save the dump itself. Either we or VMware can look through this for you.

Gather the PSOD (use print screen, a screen grab or a digital camera if need be).
Contact VMware themselves and request details on how to use the serial logging feature. The basic instructions are: update SerialPort to equal whichever port you have connected your laptop or other machine to (it's in the advanced settings), start up your favourite terminal software and start logging/capturing the text to disk. When this is done, or you have enough, send the logs in.
Once all this is gathered, run the diagnostics, which you can download for your specific machine/model from sun.com/download.

Solaris
If the machine is running Solaris, send in an Explorer [and a crash dump if one is produced].

Some questions to ask yourself first:
Sometimes there is no hardware problem. Sometimes a configuration change, a patch or something else is the real problem. It might pay to ask yourself the following questions before you log the call. I have a small list of questions I ask which, if nothing else, clarifies things a little.

It also helps to think about priorities. You might have two, three or more questions you want answers to; it pays to prepare by putting these questions in order of importance. Make the priority known to the person who takes your case on. Make sure that you have a clear "what is the problem with what" statement. Make sure that you clearly state your timescales, and be clear about how this is affecting your project, business unit or whole company. Very often when logging a call it's possible to get caught up in a lot of other things. Stay focussed and be precise; you would be surprised how much that helps.

1). Did this ever work and, if it did, when was the last time it worked?
2). Do you have anything else the same as this that does work?
3). What drew your attention to the problem (logging, monitoring)? Provide this information.
4). Is there more than one of the same thing that's broken?
5). Have you changed anything in the last four weeks (or longer, if the problem has been around longer than that and you have only just decided to log the call)? Patching, configuration changes etc. Be prepared to have your change logs handy.

Sometimes, after a component is replaced (and there are a lot of customer-replaceable parts in the x86 kit), you will need to tell the operating system that you have repaired or replaced something (not all operating systems need this, but some, like Solaris, do). In Solaris you do this using fmadm faulty to find the problem component and fmadm repair to notify Solaris that it need no longer concern itself with this item (use the man pages to find the full syntax for these commands). It's worth doing this, since otherwise you may find the message logs or other notification paths telling you about faulty items again (the same one?) when they are not faulty.
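As a rough sketch of that Solaris sequence, the wrapper below lists the current faults and then marks one item as repaired. The function name and the FMADM variable are my own assumptions, and the argument is whatever identifier fmadm faulty reports for the replaced part; check the fmadm man page for the exact forms your Solaris release accepts.

```shell
#!/bin/sh
# Assumed setting; on Solaris this is normally just "fmadm" run as root.
FMADM="${FMADM:-fmadm}"

# Show what the fault manager currently considers faulty, then tell it
# that the item identified by $1 has been repaired or replaced.
mark_repaired() {
  if [ -z "$1" ]; then
    echo "usage: mark_repaired <identifier reported by fmadm faulty>" >&2
    return 2
  fi
  "$FMADM" faulty || return 1
  "$FMADM" repair "$1"
}
```

Run fmadm faulty on its own first, note the identifier of the part you replaced, then call mark_repaired with it. Running fmadm faulty again afterwards should no longer list that item.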
---o---

This blog copyright 2009 by Paul Scott