Today on this ol' server

Wednesday Jun 14, 2006

Dynamic Reconfiguration on a v1280

A while back I ran across a v1280 that threw some CPU errors and needed to have its system board replaced before it caused unplanned maintenance. The server was running Solaris 10. We used dynamic reconfiguration to remove the system board while the server was running, to minimize downtime.

Today we'll look at dynamic reconfiguration on a v1280. We'll refer to dynamic reconfiguration as DR from here on. The same process will work for 3800/4800 and 6800s. The commands are firmware dependent, not OS specific. (Check the admin guide in the firmware docs for more details on DR, and use the most current versions of cfgadm, the core kernel, and fault management, if applicable.) This works best when you are logged in to the System Controller and also have a shell on the system.
I don't have any x900 series systems to try the commands on. The same caveats apply; the commands will work better with the latest software. Also, the newer the software, the more automatic and better integrated the DR features will be with SMF and FM. There is also a developer blog summary on cfgadm titled "A Little Bit About cfgadm(1M)".

  1. Identify the system board you need to replace. Send the applicable logs (dmesg, /var/adm/messages, showlogs from the SC, fmdump, fmadm faulty -a) to support. Support will confirm whether the errors you sent in are valid and severe enough to have the board replaced.
  2. Check if the system board in question is holding the main memory and kernel.
    # cfgadm -av | grep permanent
    N0.SB0::memory                 connected    configured   ok         base address 0x0, 8388608 KBytes total, 586400 KBytes permanent
    
    If it's not, you can DR with no impact to the running OS. If it is, you have to wait for the firmware to copy the kernel to the other board.
  3. Unconfigure the board so it is no longer usable by the system. (When I did it on this board it took about three minutes for the copy to complete.)
    # cfgadm -c unconfigure N0.SB0
    System may be temporarily suspended, proceed (yes/no)? yes
    
  4. Check that system board SB0 shows up as unconfigured.
    # cfgadm -av
    Ap_Id                          Receptacle   Occupant     Condition  Information
    When         Type         Busy     Phys_Id
    N0.IB6                         connected    configured   ok powered-on, assigned
    Jun  3 09:30 PCI_I/O_Boa  n        /devices/ssm@0,0:N0.IB6
    N0.IB6::pci0                   connected    configured   ok         device
    /ssm@0,0/pci@19,700000
    Jun  3 09:30 io           n        /devices/ssm@0,0:N0.IB6::pci0
    N0.IB6::pci1                   connected    configured   ok         device
    /ssm@0,0/pci@19,600000, referenced
    Jun  3 09:30 io           n        /devices/ssm@0,0:N0.IB6::pci1
    N0.IB6::pci2                   connected    configured   ok         device
    /ssm@0,0/pci@18,700000, referenced
    Jun  3 09:30 io           n        /devices/ssm@0,0:N0.IB6::pci2
    N0.IB6::pci3                   connected    configured   ok         device
    /ssm@0,0/pci@18,600000, referenced
    Jun  3 09:30 io           n        /devices/ssm@0,0:N0.IB6::pci3
    N0.SB0                         connected    unconfigured ok
    powered-on, assigned
    Jun  3 10:16 CPU          n        /devices/ssm@0,0:N0.SB0
    N0.SB0::cpu0                   connected    unconfigured ok         cpuid
    0, speed 900 MHz, ecache 8 MBytes
    Jun  3 10:16 cpu          n        /devices/ssm@0,0:N0.SB0::cpu0
    N0.SB0::cpu1                   connected    unconfigured ok         cpuid
    1, speed 900 MHz, ecache 8 MBytes
    Jun  3 10:16 cpu          n        /devices/ssm@0,0:N0.SB0::cpu1
    N0.SB0::cpu2                   connected    unconfigured ok         cpuid
    2, speed 900 MHz, ecache 8 MBytes
    Jun  3 10:16 cpu          n        /devices/ssm@0,0:N0.SB0::cpu2
    N0.SB0::cpu3                   connected    unconfigured ok         cpuid
    3, speed 900 MHz, ecache 8 MBytes
    Jun  3 10:16 cpu          n        /devices/ssm@0,0:N0.SB0::cpu3
    N0.SB0::memory                 connected    unconfigured ok         base
    address 0x2000000000, 8388608 KBytes total
    Jun  3 10:16 memory       n        /devices/ssm@0,0:N0.SB0::memory
    
    N0.SB2                         connected    configured   ok
    powered-on, assigned
    Jun  3 09:30 CPU          n        /devices/ssm@0,0:N0.SB2
    N0.SB2::cpu0                   connected    configured   ok         cpuid
    8, speed 900 MHz, ecache 8 MBytes
    Jun  3 09:30 cpu          n        /devices/ssm@0,0:N0.SB2::cpu0
    N0.SB2::cpu1                   connected    configured   ok         cpuid
    9, speed 900 MHz, ecache 8 MBytes
    Jun  3 09:30 cpu          n        /devices/ssm@0,0:N0.SB2::cpu1
    N0.SB2::cpu2                   connected    configured   ok         cpuid
    10, speed 900 MHz, ecache 8 MBytes
    Jun  3 09:30 cpu          n        /devices/ssm@0,0:N0.SB2::cpu2
    N0.SB2::cpu3                   connected    configured   ok         cpuid
    11, speed 900 MHz, ecache 8 MBytes
    Jun  3 09:30 cpu          n        /devices/ssm@0,0:N0.SB2::cpu3
    N0.SB2::memory                 connected    configured   ok         base
    address 0x0, 8388608 KBytes total, 586400 KBytes permanent
    Jun  3 10:16 memory       n        /devices/ssm@0,0:N0.SB2::memory
    N0.SB4                         empty        unconfigured unknown    assigned
    Jun  3 09:30 unknown      n        /devices/ssm@0,0:N0.SB4
    c0                             connected    configured   unknown
    unavailable  scsi-bus     n        /devices/ssm@0,0/pci@18,700000/ide@3:scsi
    c0::dsk/c0t0d0                 connected    configured   unknown    TOSHIBA
    DVD-ROM SD-C2612
    unavailable  CD-ROM       n
    /devices/ssm@0,0/pci@18,700000/ide@3:scsi::dsk/c0t0d0
    c1                             connected    configured   unknown
    unavailable  scsi-bus     n        /devices/ssm@0,0/pci@18,600000/scsi@2:scsi
    c1::dsk/c1t0d0                 connected    configured   unknown    FUJITSU
    MAP3735N SUN72G
    unavailable  disk         n
    /devices/ssm@0,0/pci@18,600000/scsi@2:scsi::dsk/c1t0d0
    c1::dsk/c1t1d0                 connected    configured   unknown    FUJITSU
    MAG3182L SUN18G
    unavailable  disk         n
    /devices/ssm@0,0/pci@18,600000/scsi@2:scsi::dsk/c1t1d0
    c2                             connected    unconfigured unknown
    unavailable  scsi-bus     n        /devices/ssm@0,0/pci@18,600000/scsi@2,1:scsi
    c3                             connected    unconfigured unknown
    unavailable  fc           n
    /devices/ssm@0,0/pci@19,700000/SUNW,qlc@2/fp@0,0:fc
    
  5. Then run cfgadm with the disconnect option to power off the board. (This takes less than a minute to complete.)
    # cfgadm -c disconnect N0.SB0
    
  6. Next check the system board's status. This time it says disconnected and unconfigured. Also, if you run showboards from the SC prompt it will display Off, although I didn't do that step.
    # cfgadm -av
    Ap_Id                          Receptacle   Occupant     Condition  Information
    When         Type         Busy     Phys_Id
    ...
    
    N0.SB0                         disconnected unconfigured unknown    assigned
    Jun  3 10:19 CPU          n        /devices/ssm@0,0:N0.SB0
    
    N0.SB2                         connected    configured   ok
    powered-on, assigned
    Jun  3 09:30 CPU          n        /devices/ssm@0,0:N0.SB2
    N0.SB2::cpu0                   connected    configured   ok         cpuid
    8, speed 900 MHz, ecache 8 MBytes
    Jun  3 09:30 cpu          n        /devices/ssm@0,0:N0.SB2::cpu0
    N0.SB2::cpu1                   connected    configured   ok         cpuid
    9, speed 900 MHz, ecache 8 MBytes
    Jun  3 09:30 cpu          n        /devices/ssm@0,0:N0.SB2::cpu1
    N0.SB2::cpu2                   connected    configured   ok         cpuid
    10, speed 900 MHz, ecache 8 MBytes
    Jun  3 09:30 cpu          n        /devices/ssm@0,0:N0.SB2::cpu2
    N0.SB2::cpu3                   connected    configured   ok         cpuid
    11, speed 900 MHz, ecache 8 MBytes
    Jun  3 09:30 cpu          n        /devices/ssm@0,0:N0.SB2::cpu3
    N0.SB2::memory                 connected    configured   ok         base
    address 0x0, 8388608 KBytes total, 586400 KBytes permanent
    Jun  3 10:16 memory       n        /devices/ssm@0,0:N0.SB2::memory
    ...
    
    Now it's ready for replacement. No need to bring the system down. At the tail end of the unconfigure command, this message appeared on the system controller due to the reconfiguration.
    lom>{/N0/SB0/P0}     test case reset reason = 00000001.0404ff05
    {/N0/SB0/P0}     test case ecache_size=00000000.00800000,
    tag_size=00000000.00004000
    {/N0/SB0/P0}     test case Ecache Mode: 4:4:4
    {/N0/SB0/P0}     test case E$ control register = 00000000.07a34c00
    {/N0/SB0/P0} Controller PCI Config Space Test for aid 0x18
    {/N0/SB0/P0} Subtest: IDE Controller Bus Probe for aid 0x18
    {/N0/SB0/P0}    Removable ATAPI device, TOSHIBA DVD-ROM SD-C2612
    
    {/N0/SB0/P0} Subtest: SCSI Controller PCI Config Space Test for aid 0x18
    {/N0/SB0/P0} Subtest: SCSI Controller Register Test for aid 0x18
    {/N0/SB0/P0} Subtest: SCSI Controller SCRIPTS RAM Test for aid 0x18
    {/N0/SB0/P0} Subtest: SCSI Controller SCSI Timers Test for aid 0x18
    {/N0/SB0/P0} Subtest: SCSI Controller DMA Test for aid 0x18
    {/N0/SB0/P0} Subtest: PCI IO Controller Register Initialization for aid 0x19
    {/N0/SB0/P0} Subtest: PCI IO Controller IOMMU  TLB Compare Tests for aid 0x19
    {/N0/SB0/P0} Subtest: PCI IO Controller IOMMU TLB Flush Tests for aid 0x19
    {/N0/SB0/P0} Subtest: PCI IO Controller DMA loopback Tests for aid 0x19
    {/N0/SB0/P0} Subtest: PC    test case IoSram Add : 0000041c.00900000
    {/N0/SB0/P0}     test case Estate = 00000000.0000000b
    {/N0/SB0/P0}     test case Ecache control = 00000000.07a34c00
    {/N0/SB0/P0}     test case CPU features = 0000224f.004204ff
    {/N0/SB0/P0}     test case After setting CPU features, DCU = 0000ee00.0000000f
    {/N0/SB0/P0}     test case DCR = 00000000.0000103f
    {/N0/SB0/P1}     test case reset reason = 00000000.0404ff05
    {/N0/SB0/P1}     test case ecache_size=00000000.00800000,
    tag_size=00000000.00004000
    {/N0/SB0/P1}     test case Ecache Mode: 4:4:4
    {/N0/SB0/P1}     test case E$ control register = 00000000.07a34c00
    {/N0/SB0/P1} @(#) lpost         5.18.1      test case IoSram Add :
    0000041c.00900000
    {/N0/SB0/P1}     test case Estate = 00000000.0000000b
    {/N0/SB0/P1}     test case Ecache control = 00000000.07a34c00
    {/N0/SB0/P1}     test case CPU features = 0000224f.004204ff
    {/N0/SB0/P1}     test case After setting CPU features, DCU = 0000ee00.0000000f
    {/N0/SB0/P1}     test case DCR = 00000000.0000103f
    {/N0/SB0/P2} @(#) lpost         5.18.1  {/N0/SB0/P3} @(#) lpost
    5.18.1
    
  7. After the board is replaced, verify that the new board is running the same firmware. From the SC run:
    lom>showboards -p prom
    
    Component   Compatible Version
    ---------   ---------- -------
    SSC1        Reference  5.18.1 Build_01
    /N0/IB6     Yes        5.18.1 Build_01
    /N0/SB0     Yes        5.18.1 Build_01
    /N0/SB2     Yes        5.18.1 Build_01
    
    If they aren't all running the same version, you need to upgrade the firmware. (Use flashupdate from the LOM prompt; see the docs.)
  8. If they are the same, then from the OS run configure to configure the board for use.
    # cfgadm -c configure N0.SB0
    At the LOM prompt you will see a bunch of messages as the board runs through POST, before it's finally brought back online and made usable by Solaris.
    lom>{/N0/SB0/P0} Running CPU POR and Set Clocks
    {/N0/SB0/P2} Running CPU POR and Set Clocks
    {/N0/SB0/P1} Running CPU POR and Set Clocks
    {/N0/SB0/P3} Running CPU POR and Set Clocks
    {/N0/SB0/P0} @(#) lpost         5.18.1  2004/12/09 12:32
    {/N0/SB0/P2} @(#) lpost         5.18.1  2004/12/09 12:32
    {/N0/SB0/P1} @(#) lpost         5.18.1  2004/12/09 12:32
    {/N0/SB0/P3} @(#) lpost         5.18.1  2004/12/09 12:32
    {/N0/SB0/P0} Copyright 2001-2004 Sun Microsystems, Inc.  All rights reserved.
    {/N0/SB0/P1} Copyright 2001-2004 Sun Microsystems, Inc.  All rights reserved.
    {/N0/SB0/P2} Copyright 2001-2004 Sun Microsystems, Inc.  All rights reserved.
    {/N0/SB0/P3} Copyright 2001-2004 Sun Microsystems, Inc.  All rights reserved.
    {/N0/SB0/P0} Use is subject to license terms.
    {/N0/SB0/P2} Use is subject to license terms.
    {/N0/SB0/P1} Use is subject to license terms.
    ...
    
    There will also be messages on the console ...
    Jun  3 10:16:48 j1hol-1280 genunix: [ID 408114 kern.info]
    /ssm@0,0/memory-controller@0,400000 (mc-us30) offline
    ...
    Jun  3 12:25:26 j1hol-1280 lw8: [ID 477720 kern.notice] SB0, hotplug
    status, SB0, module removed (9,16)
    Jun  3 12:58:29 j1hol-1280 lw8: [ID 328834 kern.notice] /N0/SB0, hotplug
    status, SB0, module inserted (9,17)
    Jun  3 13:00:07 j1hol-1280 sgsbbc: [ID 402060 kern.notice] NOTICE: Timed
    out waiting for SC response
    Jun  3 13:00:08 j1hol-1280 last message repeated 2 times
    Jun  3 13:00:08 j1hol-1280 picld[107]: [ID 653604 daemon.error] sgfru ioctl
    0xf handle 0xe failed: Connection timed out
    Jun  3 13:00:37 j1hol-1280 sgsbbc: [ID 402060 kern.notice] NOTICE: Timed
    out waiting for SC response
    Jun  3 13:02:03 j1hol-1280 last message repeated 1 time
    Jun  3 13:04:03 j1hol-1280 sgsbbc: [ID 402060 kern.notice] NOTICE: Timed
    out waiting for SC response
    Jun  3 13:05:18 j1hol-1280 unix: [ID 950921 kern.info] cpu0:
    UltraSPARC-III+ (portid 0 impl 0x15 ver 0x23 clock 900 MHz)
    ....
    Jun  3 13:05:37 j1hol-1280 sbdp: [ID 713682 kern.info] cpu3 initialization
    complete - restarted
    Jun  3 13:05:42 j1hol-1280 unix: [ID 700753 kern.info]
    kphysm_add_memory_dynamic: adding 8388608K at 0x2000000000
    Jun  3 13:05:46 j1hol-1280 unix: [ID 323408 kern.info]
    kphysm_add_memory_dynamic: mem = 16777216K (0x400000000)
    Jun  3 13:05:46 j1hol-1280 unix: [ID 401001 kern.info]
    kphysm_add_memory_dynamic: avail mem = 16237010944
    Jun  3 13:05:46 j1hol-1280 ssm: [ID 349649 kern.info] memory-controller0 at
    ssm0: Node 0 Safari id 0 0x400000 ...
    Jun  3 13:05:46 j1hol-1280 genunix: [ID 936769 kern.info] mc-us30 is
    /ssm@0,0/memory-controller@0,400000
    Jun  3 13:05:46 j1hol-1280 genunix: [ID 408114 kern.info]
    /ssm@0,0/memory-controller@0,400000 (mc-us30) online
    
And it's done. The domain did experience another pause when it copied the kernel resident memory back to system board 0.
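
If you do this often, the checks in steps 2, 4, and 6 can be scripted. What follows is only a minimal sketch, not something I ran on the v1280: the function names permanent_board and board_state are made up for illustration, and it assumes cfgadm -av output in the column layout shown above (Ap_Id, Receptacle, Occupant, ...). It only reads output; it never changes the state of a board.

```shell
#!/bin/sh
# Sketch only: helpers that read `cfgadm -av` output on stdin.
# The function names are hypothetical; they are not part of cfgadm.
# Assumption: on Solaris 10, nawk is likely the safer choice, so the
# awk binary can be overridden with AWK=nawk.
AWK=${AWK:-awk}

# Print the board (e.g. N0.SB0) whose memory record is marked "permanent".
# Works for one-line records and for the wrapped form shown in this post,
# where "permanent" lands on the line after the Ap_Id.
permanent_board() {
    "$AWK" '
        $1 ~ /::memory$/ { split($1, part, "::"); ap = part[1] }
        /permanent/ && ap != "" { print ap; ap = "" }
    '
}

# Print "<receptacle> <occupant>" for one attachment point,
# e.g. "connected unconfigured" after step 3.
board_state() {
    "$AWK" -v id="$1" '$1 == id { print $2, $3 }'
}

# Example use on the live system (commented out so the file is safe to source):
#   cfgadm -av | permanent_board
#   cfgadm -av | board_state N0.SB0
```

If permanent_board prints the board you are about to pull, expect the pause while the kernel is copied off it. After the unconfigure, board_state should report connected unconfigured, and after the disconnect, disconnected unconfigured.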

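The firmware comparison in step 7 is easy to get wrong by eye when there are more boards. Here is a small sketch that reads showboards -p prom output on stdin and complains about any component whose version differs from the first row. Since showboards runs on the SC and not in Solaris, this assumes you have saved or pasted the output into a file first. The function name prom_versions_match is hypothetical, and the parsing assumes the three-column layout shown in step 7.

```shell
#!/bin/sh
# Sketch only: compares the Version column of `showboards -p prom` output.
# The function name is hypothetical. Assumes the layout from step 7:
# Component, Compatible, Version, with a dashed rule under the header.
# Assumption: on Solaris 10, nawk is likely the safer choice (AWK=nawk).
AWK=${AWK:-awk}

# Prints one line per mismatching component and returns non-zero if any
# component differs from the first row (the SC, marked "Reference").
prom_versions_match() {
    "$AWK" '
        /^[ \t]*---/ { in_table = 1; next }   # data rows follow the rule
        in_table && NF >= 3 {
            ver = $3
            for (i = 4; i <= NF; i++) ver = ver " " $i
            if (ref == "") {
                ref = ver                      # first row is the reference
            } else if (ver != ref) {
                print $1 ": " ver " (expected " ref ")"
                bad = 1
            }
        }
        END { exit bad + 0 }
    '
}

# Example use with a saved copy of the SC output (file name is made up):
#   prom_versions_match < showboards.txt && echo "firmware matches"
```

No output and a zero exit status means every listed component reports the same version string; anything printed is a board that needs flashupdate before you configure it.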