Laptop and retired I/O devices
FMA and my laptop ....
Last Thursday was not much fun. The “reduction in force”, more appropriately called a lay-off, was a challenge. Good people lost their jobs, and those of us remaining are trying to figure out how to still get our jobs done without some key support folks.
Well, coincidentally with the RIF, my laptop started misbehaving. Upon reboot Friday (the day after the RIF) I was greeted with this error message:
NOTICE: One or more of your I/O devices have been retired
Great, now did my laptop get RIF'd?
Actually not. Thankfully!
What happened is that FMA detected too many errors from my on-board ethernet device and disabled the faulty component. How cool is that?!? The operating system on my laptop detected a faulty component and disabled it so the processor didn't have to keep dealing with the interrupts.
This is way cool, but it was a bit difficult to figure out what was happening.
To save some of you the pain of debugging this condition, here is a quick post on how to determine what's happening and how to fix the condition.
First, let's check where the error message was coming from. A quick search at src.opensolaris.org turned up a link to retire_store.c:241. A quick look around the source told me this came from the fault management portion of OpenSolaris.
To see the fault, use fmadm with the faulty sub-command:
$ pfexec fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Jul 11 06:30:19 6dc01480-53e3-6046-92c5-b88ea74e17af PCIEX-8000-0A  Critical
Fault class : fault.io.pciex.device-interr
Affects     : dev:////pci@0,0/pci8086,27d0@1c/pci1179,1@0
              faulted and taken out of service
FRU         : "MB" (hc://:product-id=TECRA-M5:chassis-id=Y6071991H:server-id=yoyo/motherboard=0)
              faulty
Description : A problem was detected for a PCIEX device.
              Refer to http://sun.com/msg/PCIEX-8000-0A for more information.
Response    : One or more device instances may be disabled
Impact      : Loss of services provided by the device instances associated with
              this fault
Action      : Schedule a repair procedure to replace the affected device. Use
              fmdump -v -u to identify the device or contact Sun for
              support.
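If you would rather not pick the device path out of that report by eye, a small filter can do it. This is only a sketch: the function name affected_fmris is made up for illustration, and it assumes the "Affects : <path>" line layout shown in the output above.

```shell
#!/bin/sh
# Print the third field of every "Affects : ..." line read on stdin.
# Assumption: the report uses the "Affects : <fmri>" layout shown above.
# affected_fmris is a hypothetical helper name, not part of FMA.
affected_fmris() {
    awk '$1 == "Affects" && $2 == ":" { print $3 }'
}

# Possible use on a live system (needs privileges):
#   pfexec fmadm faulty | affected_fmris
```

The filter only reads text, so it can be tried against a saved copy of the report before pointing it at a real system.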
With the device path in hand, I checked whether it corresponded to any known device:
$ ls -l /dev/* | grep 'pci1179,1@0'
lrwxrwxrwx 1 root root 54 2008-04-24 16:32 /dev/e1000g0 -> ../devices/pci@0,0/pci8086,27d0@1c/pci1179,1@0:e1000g0
Check that out ... e1000g0 is the driver for my ethernet!
But that device node is not available:
$ ls -l /devices/pci@0,0/pci8086,27d0@1c/pci1179,1@0:e1000g0
/usr/gnu/bin/ls: cannot access /devices/pci@0,0/pci8086,27d0@1c/pci1179,1@0:e1000g0: No such device or address
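The lookup from fault path to /dev entry can be sketched as two small helpers. Both names (fmri_to_devices_path and links_for_path) are hypothetical, and the conversion simply assumes the path starts with "dev:///" followed by the physical path, as it does in the report above. Other path forms are rejected, not guessed at.

```shell
#!/bin/sh
# Turn a path like dev:////pci@0,0/... into /devices/pci@0,0/...
# Assumption: the input begins with "dev:///" followed by the physical path.
fmri_to_devices_path() {
    case $1 in
        dev:///*) printf '/devices%s\n' "${1#dev:///}" ;;
        *) echo "unsupported path: $1" >&2; return 1 ;;
    esac
}

# Print every symlink in a directory (default /dev) whose target mentions
# the given physical path. The directory is a parameter only so the helper
# can be tried somewhere harmless first.
links_for_path() {
    phys=${1#/devices}
    dir=${2:-/dev}
    for link in "$dir"/*; do
        [ -L "$link" ] || continue
        listing=$(ls -l "$link")
        # Keep only the text after "-> ", i.e. the symlink target.
        case ${listing#*-> } in
            *"$phys"*) echo "$link" ;;
        esac
    done
}
```

The second helper parses ls -l output instead of calling readlink, purely so that it does not depend on which userland tools happen to be first in the path.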
Knowing that this was the device that was not working, I had to clear the fault. Here I used the fmadm command again, this time with the repair sub-command and the path to the device.
$ pfexec fmadm repair dev:////pci@0,0/pci8086,27d0@1c/pci1179,1@0
fmadm: recorded repair to dev:////pci@0,0/pci8086,27d0@1c/pci1179,1@0
$ ls -l /devices/pci@0,0/pci8086,27d0@1c/pci1179,1@0:e1000g0
crw-rw-rw- 1 root root 225, 1 2008-07-14 13:40 /devices/pci@0,0/pci8086,27d0@1c/pci1179,1@0:e1000g0
Now to re-enable the device and replumb my network:
$ pfexec ifconfig -a
lo0: flags=2001000849 mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
wpi0: flags=201004843 mtu 1500 index 2
        inet 10.0.231.211 netmask ffff0000 broadcast 10.0.255.255
        ether 0:18:de:6a:9e:5a
ip.tun0: flags=10008d1 mtu 1402 index 7
        inet tunnel src 10.0.231.211 tunnel dst 192.9.5.100
        tunnel security settings --> use 'ipsecconf -ln -i ip.tun0'
        tunnel hop limit 60
        inet 10.7.251.235 --> 129.146.17.123 netmask ffffffff
lo0: flags=2002000849 mtu 8252 index 1
        inet6 ::1/128
ip.tun0: flags=2204851 mtu 1480 index 7
        inet tunnel src 10.0.231.211 tunnel dst 192.9.5.100
        tunnel security settings --> use 'ipsecconf -ln -i ip.tun0'
        tunnel hop limit 60
        inet6 fe80::a07:fbeb/128 --> fe80::1
ip.tun0:1: flags=2200851 mtu 1480 index 7
        inet6 2002:8192:117b:1::a07:fbeb/128 --> 2002:8192:117b:1::1
$ pfexec ifconfig e1000g0 plumb
$ svcadm restart nwam
Everything is up and running again. It's pretty cool to have enterprise features like FMA on my laptop, but being a simple guy, it's good to know how to just get up and running again.
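For anyone who hits this more than once, the three recovery commands can be wrapped in a small script. What follows is a sketch, not a recommended tool: run and recover_nic are made-up names, the DRY_RUN switch is an addition for illustration, and the commands are copied exactly as used in this post (including svcadm without pfexec). By default it only prints what it would do.

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default); otherwise execute it.
# DRY_RUN is a hypothetical switch introduced for this sketch.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Clear the recorded fault, plumb the interface, restart NWAM.
# $1 = device path as reported by "fmadm faulty", $2 = interface name.
recover_nic() {
    fmri=$1
    ifname=$2
    run pfexec fmadm repair "$fmri" &&
    run pfexec ifconfig "$ifname" plumb &&
    run svcadm restart nwam
}

# Example (prints the three commands without running them):
#   recover_nic 'dev:////pci@0,0/pci8086,27d0@1c/pci1179,1@0' e1000g0
```

Clearing a fault tells FMA the component is fine again, so setting DRY_RUN=0 only makes sense once you have decided the device deserves another chance.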
Now I'll keep monitoring my ethernet driver and see if it keeps acting up ...
Once again, the power of an enterprise-class operating system on my laptop. This is so cool. Thanks to the FMA team for your great work!
Posted by Jeff Cheeney [Solaris] (July 14, 2008 03:15 PM)


