Today on this ol' server
Dynamic Reconfiguration on a v1280
A while back I ran across a v1280 that threw some CPU errors and needed to have its system board replaced before it caused unplanned maintenance. The server was running Solaris 10. We used dynamic reconfiguration to remove the system board while the server was running, to minimize downtime.
Today we'll look at dynamic reconfiguration on a v1280, referred to as DR from here on. The same process will work for 38/48 and 6800s. The commands are firmware dependent, not OS specific. (Check the admin guide in the firmware docs for more details on DR, and use the most current versions of cfgadm, the core kernel, and fault management if applicable.) This works best when you are logged in to the System Controller and also have a shell on the system.
I don't have any x900 series to try the commands on. The same caveats apply; the commands will work better with the latest software. Also, the newer the software, the more automatic and better integrated the DR features will be with SMF and FM. There is also a developer blog summary on cfgadm titled "A Little Bit About cfgadm(1M)".
- Identify the system board you need to replace. Send applicable logs (dmesg, /var/adm/messages, showlogs from the SC, fmdump, fmadm faulty -a) to support. Support should confirm that the errors you sent in are valid and severe enough to have the board replaced.
- Check if the system board in question is holding the main memory and kernel.
# cfgadm -av | grep permanent
N0.SB0::memory connected configured ok base address 0x0, 8388608 KBytes total, 586400 KBytes permanent
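If you'd rather script that check than eyeball the grep, here is a minimal sketch. It assumes the memory line looks like the one above (an Ap_Id ending in ::memory, with the word "permanent" on the same line). The function name permanent_boards is my own, not part of cfgadm.

```shell
# Sketch: list the boards whose memory holds permanent (kernel) pages.
# Reads `cfgadm -av` output on stdin. permanent_boards is a hypothetical
# helper name; it only parses text and changes nothing on the system.
permanent_boards() {
    # Keep memory lines flagged "permanent", then strip "::memory"
    # from the Ap_Id so only the board name (e.g. N0.SB0) is printed.
    awk '/::memory/ && /permanent/ { split($1, ap, "::"); print ap[1] }'
}

# On the server itself this would be run as:
#   cfgadm -av | permanent_boards
```

If the board you are about to pull shows up in that list, expect the kernel copy described in the next step.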
If it's not, you can DR with no impact to the running OS. If it is, you have to wait for the firmware to copy the kernel to the other board.
- Unconfigure the board so it is no longer usable by the system. (When I did it on this board, it took about three minutes for the copy to complete.)
# cfgadm -c unconfigure N0.SB0
System may be temporarily suspended, proceed (yes/no)? yes
- Check that system board SB0 shows up as unconfigured.
# cfgadm -av
Ap_Id Receptacle Occupant Condition Information
When Type Busy Phys_Id
N0.IB6 connected configured ok powered-on, assigned
Jun 3 09:30 PCI_I/O_Boa n /devices/ssm@0,0:N0.IB6
N0.IB6::pci0 connected configured ok device /ssm@0,0/pci@19,700000
Jun 3 09:30 io n /devices/ssm@0,0:N0.IB6::pci0
N0.IB6::pci1 connected configured ok device /ssm@0,0/pci@19,600000, referenced
Jun 3 09:30 io n /devices/ssm@0,0:N0.IB6::pci1
N0.IB6::pci2 connected configured ok device /ssm@0,0/pci@18,700000, referenced
Jun 3 09:30 io n /devices/ssm@0,0:N0.IB6::pci2
N0.IB6::pci3 connected configured ok device /ssm@0,0/pci@18,600000, referenced
Jun 3 09:30 io n /devices/ssm@0,0:N0.IB6::pci3
N0.SB0 connected unconfigured ok powered-on, assigned
Jun 3 10:16 CPU n /devices/ssm@0,0:N0.SB0
N0.SB0::cpu0 connected unconfigured ok cpuid 0, speed 900 MHz, ecache 8 MBytes
Jun 3 10:16 cpu n /devices/ssm@0,0:N0.SB0::cpu0
N0.SB0::cpu1 connected unconfigured ok cpuid 1, speed 900 MHz, ecache 8 MBytes
Jun 3 10:16 cpu n /devices/ssm@0,0:N0.SB0::cpu1
N0.SB0::cpu2 connected unconfigured ok cpuid 2, speed 900 MHz, ecache 8 MBytes
Jun 3 10:16 cpu n /devices/ssm@0,0:N0.SB0::cpu2
N0.SB0::cpu3 connected unconfigured ok cpuid 3, speed 900 MHz, ecache 8 MBytes
Jun 3 10:16 cpu n /devices/ssm@0,0:N0.SB0::cpu3
N0.SB0::memory connected unconfigured ok base address 0x2000000000, 8388608 KBytes total
Jun 3 10:16 memory n /devices/ssm@0,0:N0.SB0::memory
N0.SB2 connected configured ok powered-on, assigned
Jun 3 09:30 CPU n /devices/ssm@0,0:N0.SB2
N0.SB2::cpu0 connected configured ok cpuid 8, speed 900 MHz, ecache 8 MBytes
Jun 3 09:30 cpu n /devices/ssm@0,0:N0.SB2::cpu0
N0.SB2::cpu1 connected configured ok cpuid 9, speed 900 MHz, ecache 8 MBytes
Jun 3 09:30 cpu n /devices/ssm@0,0:N0.SB2::cpu1
N0.SB2::cpu2 connected configured ok cpuid 10, speed 900 MHz, ecache 8 MBytes
Jun 3 09:30 cpu n /devices/ssm@0,0:N0.SB2::cpu2
N0.SB2::cpu3 connected configured ok cpuid 11, speed 900 MHz, ecache 8 MBytes
Jun 3 09:30 cpu n /devices/ssm@0,0:N0.SB2::cpu3
N0.SB2::memory connected configured ok base address 0x0, 8388608 KBytes total, 586400 KBytes permanent
Jun 3 10:16 memory n /devices/ssm@0,0:N0.SB2::memory
N0.SB4 empty unconfigured unknown assigned
Jun 3 09:30 unknown n /devices/ssm@0,0:N0.SB4
c0 connected configured unknown
unavailable scsi-bus n /devices/ssm@0,0/pci@18,700000/ide@3:scsi
c0::dsk/c0t0d0 connected configured unknown TOSHIBA DVD-ROM SD-C2612
unavailable CD-ROM n /devices/ssm@0,0/pci@18,700000/ide@3:scsi::dsk/c0t0d0
c1 connected configured unknown
unavailable scsi-bus n /devices/ssm@0,0/pci@18,600000/scsi@2:scsi
c1::dsk/c1t0d0 connected configured unknown FUJITSU MAP3735N SUN72G
unavailable disk n /devices/ssm@0,0/pci@18,600000/scsi@2:scsi::dsk/c1t0d0
c1::dsk/c1t1d0 connected configured unknown FUJITSU MAG3182L SUN18G
unavailable disk n /devices/ssm@0,0/pci@18,600000/scsi@2:scsi::dsk/c1t1d0
c2 connected unconfigured unknown
unavailable scsi-bus n /devices/ssm@0,0/pci@18,600000/scsi@2,1:scsi
c3 connected unconfigured unknown
unavailable fc n /devices/ssm@0,0/pci@19,700000/SUNW,qlc@2/fp@0,0:fc
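That listing is a lot to scan for one board. A small filter like the sketch below pulls out just the receptacle and occupant state for a single attachment point. It assumes the Ap_Id is the first field on its line, as in the output above. board_state is a name I made up for illustration.

```shell
# Sketch: print "receptacle occupant" for one attachment point.
# Reads `cfgadm -av` (or plain `cfgadm`) output on stdin.
# board_state is a hypothetical helper name, not a cfgadm feature.
board_state() {
    # Exact match on field 1, so N0.SB0 does not also match N0.SB0::cpu0.
    awk -v ap="$1" '$1 == ap { print $2, $3 }'
}

# On the server itself this would be run as:
#   cfgadm -av | board_state N0.SB0
```

After the unconfigure you want this to say connected unconfigured. The same filter works for the disconnect check in the next step.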
- Then run cfgadm with the disconnect option to power off the board. (This takes less than a minute to complete.)
# cfgadm -c disconnect N0.SB0
- Next check the system board's status. This time it shows disconnected and unconfigured. Also, if you run showboards from the SC prompt it will display Off, although I didn't do that step.
# cfgadm -av
Ap_Id Receptacle Occupant Condition Information
When Type Busy Phys_Id
...
N0.SB0 disconnected unconfigured unknown assigned
Jun 3 10:19 CPU n /devices/ssm@0,0:N0.SB0
N0.SB2 connected configured ok powered-on, assigned
Jun 3 09:30 CPU n /devices/ssm@0,0:N0.SB2
N0.SB2::cpu0 connected configured ok cpuid 8, speed 900 MHz, ecache 8 MBytes
Jun 3 09:30 cpu n /devices/ssm@0,0:N0.SB2::cpu0
N0.SB2::cpu1 connected configured ok cpuid 9, speed 900 MHz, ecache 8 MBytes
Jun 3 09:30 cpu n /devices/ssm@0,0:N0.SB2::cpu1
N0.SB2::cpu2 connected configured ok cpuid 10, speed 900 MHz, ecache 8 MBytes
Jun 3 09:30 cpu n /devices/ssm@0,0:N0.SB2::cpu2
N0.SB2::cpu3 connected configured ok cpuid 11, speed 900 MHz, ecache 8 MBytes
Jun 3 09:30 cpu n /devices/ssm@0,0:N0.SB2::cpu3
N0.SB2::memory connected configured ok base address 0x0, 8388608 KBytes total, 586400 KBytes permanent
Jun 3 10:16 memory n /devices/ssm@0,0:N0.SB2::memory
...
Now it's ready for replacement. No need to bring the system down. At the tail end of the unconfigure command, these messages appeared on the system controller due to the reconfiguration.
lom>{/N0/SB0/P0} test case reset reason = 00000001.0404ff05
{/N0/SB0/P0} test case ecache_size=00000000.00800000, tag_size=00000000.00004000
{/N0/SB0/P0} test case Ecache Mode: 4:4:4
{/N0/SB0/P0} test case E$ control register = 00000000.07a34c00
{/N0/SB0/P0} Controller PCI Config Space Test for aid 0x18
{/N0/SB0/P0} Subtest: IDE Controller Bus Probe for aid 0x18
{/N0/SB0/P0} Removable ATAPI device, TOSHIBA DVD-ROM SD-C2612
{/N0/SB0/P0} Subtest: SCSI Controller PCI Config Space Test for aid 0x18
{/N0/SB0/P0} Subtest: SCSI Controller Register Test for aid 0x18
{/N0/SB0/P0} Subtest: SCSI Controller SCRIPTS RAM Test for aid 0x18
{/N0/SB0/P0} Subtest: SCSI Controller SCSI Timers Test for aid 0x18
{/N0/SB0/P0} Subtest: SCSI Controller DMA Test for aid 0x18
{/N0/SB0/P0} Subtest: PCI IO Controller Register Initialization for aid 0x19
{/N0/SB0/P0} Subtest: PCI IO Controller IOMMU TLB Compare Tests for aid 0x19
{/N0/SB0/P0} Subtest: PCI IO Controller IOMMU TLB Flush Tests for aid 0x19
{/N0/SB0/P0} Subtest: PCI IO Controller DMA loopback Tests for aid 0x19
{/N0/SB0/P0} Subtest: PC test case IoSram Add : 0000041c.00900000
{/N0/SB0/P0} test case Estate = 00000000.0000000b
{/N0/SB0/P0} test case Ecache control = 00000000.07a34c00
{/N0/SB0/P0} test case CPU features = 0000224f.004204ff
{/N0/SB0/P0} test case After setting CPU features, DCU = 0000ee00.0000000f
{/N0/SB0/P0} test case DCR = 00000000.0000103f
{/N0/SB0/P1} test case reset reason = 00000000.0404ff05
{/N0/SB0/P1} test case ecache_size=00000000.00800000, tag_size=00000000.00004000
{/N0/SB0/P1} test case Ecache Mode: 4:4:4
{/N0/SB0/P1} test case E$ control register = 00000000.07a34c00
{/N0/SB0/P1} @(#) lpost 5.18.1 test case IoSram Add : 0000041c.00900000
{/N0/SB0/P1} test case Estate = 00000000.0000000b
{/N0/SB0/P1} test case Ecache control = 00000000.07a34c00
{/N0/SB0/P1} test case CPU features = 0000224f.004204ff
{/N0/SB0/P1} test case After setting CPU features, DCU = 0000ee00.0000000f
{/N0/SB0/P1} test case DCR = 00000000.0000103f
{/N0/SB0/P2} @(#) lpost 5.18.1
{/N0/SB0/P3} @(#) lpost 5.18.1
- After the board is replaced, verify that the new board is running the same firmware. From the SC run:
lom>showboards -p prom
Component   Compatible  Version
---------   ----------  -------
SSC1        Reference   5.18.1 Build_01
/N0/IB6     Yes         5.18.1 Build_01
/N0/SB0     Yes         5.18.1 Build_01
/N0/SB2     Yes         5.18.1 Build_01
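With more boards than this, comparing versions by eye gets error prone. Here is a rough sketch of a check you could run on any Unix box against a saved copy of that output. showboards runs at the lom> prompt, not in a Solaris shell, so this assumes you have pasted or logged the output into a file first. The function name same_firmware and the file name showboards.txt are made up for illustration. The row matching is based only on the component names seen above (SSC1 and paths starting with /).

```shell
# Sketch: succeed only if every component reports one firmware version.
# Reads saved `showboards -p prom` output on stdin.
# same_firmware is a hypothetical helper name.
same_firmware() {
    awk '
        # Data rows: component is SSC<n> or a path such as /N0/SB0.
        # The header, the dashed rule, and the lom> prompt line are skipped.
        $1 ~ /^SSC/ || $1 ~ /^\// {
            ver = $3 " " $4            # e.g. "5.18.1 Build_01"
            if (!(ver in seen)) { seen[ver] = 1; n++ }
        }
        # Exactly one distinct version means the boards match.
        END { exit (n == 1 ? 0 : 1) }
    '
}

# Example use against a saved capture (file name is an assumption):
#   same_firmware < showboards.txt && echo "firmware matches"
```

This only tells you whether the versions differ. The Compatible column from the SC is still the authoritative answer.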
If they aren't all running the same version, you need to upgrade the firmware. (Use flashupdate from the LOM prompt; see the docs.)
- If they are the same, then from the OS run configure to configure the board for use.
# cfgadm -c configure N0.SB0
At the LOM prompt you will see a bunch of messages as the board runs through POST, before it's finally brought back online and is usable by Solaris.
lom>{/N0/SB0/P0} Running CPU POR and Set Clocks
{/N0/SB0/P2} Running CPU POR and Set Clocks
{/N0/SB0/P1} Running CPU POR and Set Clocks
{/N0/SB0/P3} Running CPU POR and Set Clocks
{/N0/SB0/P0} @(#) lpost 5.18.1 2004/12/09 12:32
{/N0/SB0/P2} @(#) lpost 5.18.1 2004/12/09 12:32
{/N0/SB0/P1} @(#) lpost 5.18.1 2004/12/09 12:32
{/N0/SB0/P3} @(#) lpost 5.18.1 2004/12/09 12:32
{/N0/SB0/P0} Copyright 2001-2004 Sun Microsystems, Inc. All rights reserved.
{/N0/SB0/P1} Copyright 2001-2004 Sun Microsystems, Inc. All rights reserved.
{/N0/SB0/P2} Copyright 2001-2004 Sun Microsystems, Inc. All rights reserved.
{/N0/SB0/P3} Copyright 2001-2004 Sun Microsystems, Inc. All rights reserved.
{/N0/SB0/P0} Use is subject to license terms.
{/N0/SB0/P2} Use is subject to license terms.
{/N0/SB0/P1} Use is subject to license terms.
...
Also there will be messages from the console:
...
Jun 3 10:16:48 j1hol-1280 genunix: [ID 408114 kern.info] /ssm@0,0/memory-controller@0,400000 (mc-us30) offline
...
Jun 3 12:25:26 j1hol-1280 lw8: [ID 477720 kern.notice] SB0, hotplug status, SB0, module removed (9,16)
Jun 3 12:58:29 j1hol-1280 lw8: [ID 328834 kern.notice] /N0/SB0, hotplug status, SB0, module inserted (9,17)
Jun 3 13:00:07 j1hol-1280 sgsbbc: [ID 402060 kern.notice] NOTICE: Timed out waiting for SC response
Jun 3 13:00:08 j1hol-1280 last message repeated 2 times
Jun 3 13:00:08 j1hol-1280 picld[107]: [ID 653604 daemon.error] sgfru ioctl 0xf handle 0xe failed: Connection timed out
Jun 3 13:00:37 j1hol-1280 sgsbbc: [ID 402060 kern.notice] NOTICE: Timed out waiting for SC response
Jun 3 13:02:03 j1hol-1280 last message repeated 1 time
Jun 3 13:04:03 j1hol-1280 sgsbbc: [ID 402060 kern.notice] NOTICE: Timed out waiting for SC response
Jun 3 13:05:18 j1hol-1280 unix: [ID 950921 kern.info] cpu0: UltraSPARC-III+ (portid 0 impl 0x15 ver 0x23 clock 900 MHz)
....
Jun 3 13:05:37 j1hol-1280 sbdp: [ID 713682 kern.info] cpu3 initialization complete - restarted
Jun 3 13:05:42 j1hol-1280 unix: [ID 700753 kern.info] kphysm_add_memory_dynamic: adding 8388608K at 0x2000000000
Jun 3 13:05:46 j1hol-1280 unix: [ID 323408 kern.info] kphysm_add_memory_dynamic: mem = 16777216K (0x400000000)
Jun 3 13:05:46 j1hol-1280 unix: [ID 401001 kern.info] kphysm_add_memory_dynamic: avail mem = 16237010944
Jun 3 13:05:46 j1hol-1280 ssm: [ID 349649 kern.info] memory-controller0 at ssm0: Node 0 Safari id 0 0x400000
...
Jun 3 13:05:46 j1hol-1280 genunix: [ID 936769 kern.info] mc-us30 is /ssm@0,0/memory-controller@0,400000
Jun 3 13:05:46 j1hol-1280 genunix: [ID 408114 kern.info] /ssm@0,0/memory-controller@0,400000 (mc-us30) online
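A quick sanity check after the configure is to confirm the memory total came back. The sketch below pulls the last "mem = ...K" figure out of the kphysm_add_memory_dynamic messages and converts it to GB. It assumes the message format shown above. mem_total_gb is a name I invented for this example.

```shell
# Sketch: report the post-DR memory total from the console messages.
# Reads syslog lines on stdin and prints the last reported total in GB.
# mem_total_gb is a hypothetical helper name.
mem_total_gb() {
    awk '/kphysm_add_memory_dynamic: mem = / {
            kb = $0
            sub(/.*mem = /, "", kb)    # drop everything before the number
            sub(/K.*/, "", kb)         # drop the K suffix and the hex value
            gb = kb / 1024 / 1024      # KBytes -> GBytes
         }
         END { if (gb != "") print gb }'
}

# On the server itself this would be run as:
#   mem_total_gb < /var/adm/messages
```

For the messages above, 16777216K works out to 16 GB, which matches two boards at 8388608 KBytes each.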
Posted at 09:02AM Jun 14, 2006 by saf in Sun