A couple of months ago, I got a new workstation: a quad-core Xeon box from Dell. It came with Ubuntu preinstalled, so I decided to play with that for a while. Before long I noticed that the system would randomly reset. "Ah, the darn unstable Linux" is what I thought!
I reinstalled the system with OpenSolaris 2008.11 and it ran for a day or so. Then the system reset again. Annoyed, I started looking at /var/adm/messages and found that FMA (the Fault Management Architecture) had detected a hardware fault and taken appropriate action. Now it was nice to call Dell Support and tell them definitively the cause, the diagnosis, and that I needed a new motherboard.
The rep asked me what test I was running. I said none. This functionality is built into OpenSolaris... and it's free! You cannot get this when running Linux.
Here's what fmadm faulty printed:
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 20 01:29:08 152a7687-c256-40dd-80b1-83c1f4ed74c7 INTEL-8001-43 Critical
Fault class : fault.cpu.intel.nb.ie
FRU : "MB" (hc://:product-id=Precision-WorkStation-T3400:chassis-id=65QLTH1:server-id=opensolaris/motherboard=0)
faulty
Description : Northbridge has detected an internal error Refer to
http://sun.com/msg/INTEL-8001-43 for more information.
Response : System panic or reset by BIOS
Impact : System may be unexpectedly reset
Action : Replace motherboard
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 20 01:29:08 50ba84aa-3f12-c2c5-9c0b-8fdec9454104 INTEL-8000-LE Major
Fault class : fault.cpu.intel.l1dcache
Affects : hc://:product-id=Precision-WorkStation-T3400:chassis-id=65QLTH1:server-id=opensolaris/motherboard=0/chip=0/core=3/strand=0
faulted and taken out of service
FRU : hc://:product-id=Precision-WorkStation-T3400:chassis-id=65QLTH1:server-id=opensolaris/motherboard=0/chip=0
faulty
Description : A level 1 Data Cache on this cpu is faulty. Refer to
http://sun.com/msg/INTEL-8000-LE for more information.
Response : The system will attempt to offline this cpu to remove it from
service.
Impact : Performance of this system may be affected.
Action : Schedule a repair procedure to replace the affected CPU. Use
'fmadm faulty' to identify the module.
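
If you want to pull the interesting bits out of a report like this (say, to paste into a support ticket), here is a minimal Python sketch. It is not part of FMA: `parse_fmadm_faulty` is a hypothetical helper, and it assumes the layout shown above, where each fault starts with a timestamp line and is followed by `Name : value` lines whose text may wrap onto the next line.

```python
import re

# Assumed layout: "Feb 20 01:29:08 <event-id> <msg-id> <severity>"
HEADER = re.compile(
    r"^([A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) +(\S+) +(\S+) +(\S+)\s*$"
)
# Assumed layout: "Fault class : fault.cpu.intel.l1dcache"
FIELD = re.compile(r"^([A-Z][A-Za-z ]*?)\s+:\s+(.*)$")


def parse_fmadm_faulty(text):
    """Split `fmadm faulty`-style text into one dict per fault (hypothetical helper)."""
    records = []
    current = None
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, dashed rules, and the column-title row.
        if not line or line.startswith("---") or line.startswith("TIME "):
            continue
        m = HEADER.match(line)
        if m:
            current = {
                "time": m.group(1),
                "event_id": m.group(2),
                "msg_id": m.group(3),
                "severity": m.group(4),
                "fields": {},
            }
            records.append(current)
            last_key = None
            continue
        if current is None:
            continue  # ignore anything before the first fault
        m = FIELD.match(line)
        if m:
            last_key = m.group(1)
            current["fields"][last_key] = m.group(2).strip()
        elif last_key is not None:
            # Wrapped text: append it to the field it continues.
            current["fields"][last_key] += " " + line
    return records


SAMPLE = """\
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 20 01:29:08 50ba84aa-3f12-c2c5-9c0b-8fdec9454104 INTEL-8000-LE Major
Fault class : fault.cpu.intel.l1dcache
Impact : Performance of this system may be affected.
Action : Schedule a repair procedure to replace the affected CPU. Use
'fmadm faulty' to identify the module.
"""

records = parse_fmadm_faulty(SAMPLE)
for r in records:
    print(r["severity"], r["msg_id"], r["fields"].get("Fault class"))
```

The field names are taken as-is from whatever the report prints, so nothing here depends on a fixed list of keys; a report laid out differently from the one above would need the two patterns adjusted.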
