Sunday December 06, 2009
The package versions of Sun Cluster 3.2 11/09 Update3 are the same for the core framework and the agents as in Sun Cluster 3.2, Sun Cluster 3.2 2/08 Update1 and Sun Cluster 3.2 1/09 Update2. Therefore it's possible to patch an existing Sun Cluster 3.2, Sun Cluster 3.2 2/08 Update1 or Sun Cluster 3.2 1/09 Update2 installation up to the 11/09 Update3 level.
The package versions of Sun Cluster Geographic Edition 3.2 11/09 Update3 are NOT the same as in Sun Cluster Geographic Edition 3.2. However, it's possible to upgrade the Geographic Edition 3.2 without interrupting the service. See the documentation for details.
The following patches (with the mentioned revisions) are included/updated in Sun Cluster 3.2 11/09 Update3. If these patches are installed on a Sun Cluster 3.2, Sun Cluster 3.2 2/08 Update1 or Sun Cluster 3.2 1/09 Update2 release, then the features of the framework and agents are identical to Sun Cluster 3.2 11/09 Update3. It's always necessary to read the "Special Install Instructions of the patch" (shortcut: SIIOTP), but I added a note behind the patches where reading them is especially important.
Included/updated patch revisions of Sun Cluster 3.2 11/09 Update3 for Solaris 10 05/09 Update7 or higher
126106-38 Sun Cluster 3.2: CORE patch for Solaris 10 Note: Please read SIIOTP
125992-05 Sun Cluster 3.2: SC Checks patch for Solaris 10
126017-03 Sun Cluster 3.2: HA-DNS Patch for Solaris 10
126032-09 Sun Cluster 3.2: Ha-MYSQL Patch for Solaris 10 Note: Please read SIIOTP
126035-06 Sun Cluster 3.2: HA-NFS Patch for Solaris 10
126044-06 Sun Cluster 3.2: HA-PostgreSQL Patch for Solaris 10 Note: Please read SIIOTP
126047-12 Sun Cluster 3.2: Ha-Oracle patch for Solaris 10 Note: Please read SIIOTP
126050-04 Sun Cluster 3.2: HA-Oracle E-business suite Patch for Solaris 10 (-04 not yet on SunSolve)
126059-05 Sun Cluster 3.2: HA-SAPDB Patch for Solaris 10
126071-02 Sun Cluster 3.2: HA-Tomcat Patch for Solaris 10
126080-04 Sun Cluster 3.2: HA-Sun Java Systems App Server Patch for Solaris 10
126083-04 Sun Cluster 3.2: HA-Sun Java Message Queue Patch for Solaris 10 Note: Please read SIIOTP
126095-06 Sun Cluster 3.2: Localization patch for Solaris 9 sparc and Solaris 10 sparc
128556-04 Sun Cluster 3.2: Man Pages Patch for Solaris 9 and Solaris 10, sparc
137931-02 Sun Cluster 3.2: Sun Cluster 3.2: HA-Informix patch for Solaris 10
Included/updated patch revisions of Sun Cluster 3.2 11/09 Update3 for Solaris 10 x86 05/09 Update7 or higher
126107-38 Sun Cluster 3.2: CORE patch for Solaris 10_x86 Note: Please read SIIOTP
125993-05 Sun Cluster 3.2: Sun Cluster 3.2: SC Checks patch for Solaris 10_x86
126018-05 Sun Cluster 3.2: HA-DNS Patch for Solaris 10_x86
126033-10 Sun Cluster 3.2: Ha-MYSQL Patch for Solaris 10_x86 Note: Please read SIIOTP
126036-07 Sun Cluster 3.2: HA-NFS Patch for Solaris 10_x86
126045-07 Sun Cluster 3.2: HA-PostgreSQL Patch for Solaris 10_x86 Note: Please read SIIOTP
126048-12 Sun Cluster 3.2: Ha-Oracle patch for Solaris 10_x86 Note: Please read SIIOTP
126060-06 Sun Cluster 3.2: HA-SAPDB Patch for Solaris 10_x86
126072-02 Sun Cluster 3.2: HA-Tomcat Patch for Solaris 10_x86
126081-05 Sun Cluster 3.2: HA-Sun Java Systems App Server Patch for Solaris 10_x86
126084-06 Sun Cluster 3.2: HA-Sun Java Message Queue Patch for Solaris 10_x86 Note: Please read SIIOTP
126096-06 Sun Cluster 3.2: Localization patch for Solaris 10 amd64
128557-04 Sun Cluster 3.2: Man Pages Patch for Solaris 10_x86
137932-02 Sun Cluster 3.2:Sun Cluster 3.2: HA-Informix patch for Solaris 10_x86
Included/updated patch revisions of Sun Cluster 3.2 11/09 Update3 for Solaris 9 5/09 or higher
126105-38 Sun Cluster 3.2: CORE patch for Solaris 9 Note: Please read SIIOTP
125991-05 Sun Cluster 3.2: Sun Cluster 3.2: SC Checks patch for Solaris 9
126016-03 Sun Cluster 3.2: HA-DNS Patch for Solaris 9
126031-09 Sun Cluster 3.2: Ha-MYSQL Patch for Solaris 9 Note: Please read SIIOTP
126034-06 Sun Cluster 3.2: HA-NFS Patch for Solaris 9
126043-06 Sun Cluster 3.2: HA-PostgreSQL Patch for Solaris 9 Note: Please read SIIOTP
126046-12 Sun Cluster 3.2: HA-Oracle patch for Solaris 9 Note: Please read SIIOTP
126049-04 Sun Cluster 3.2: HA-Oracle E-business suite Patch for Solaris 9 (-04 not yet on SunSolve)
126058-05 Sun Cluster 3.2: HA-SAPDB Patch for Solaris 9
126070-02 Sun Cluster 3.2: HA-Tomcat Patch for Solaris 9
126079-04 Sun Cluster 3.2: HA-Sun Java Systems App Server Patch for Solaris 9
126082-04 Sun Cluster 3.2: HA-Sun Java Message Queue Patch for Solaris 9 Note: Please read SIIOTP
126095-06 Sun Cluster 3.2: Localization patch for Solaris 9 sparc and Solaris 10 sparc
128556-04 Sun Cluster 3.2: Man Pages Patch for Solaris 9 and Solaris 10, sparc
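To compare a node against the revisions listed above, a small helper can be useful. The following is only a minimal sketch: the `installed` list is a hypothetical example and would in practice have to be collected from the node's patch listing, and only a few of the patch IDs from the lists above are used.

```python
def parse_patch(patch):
    """Split a patch string such as '126106-38' into (patch id, revision)."""
    base, _, rev = patch.partition("-")
    return base, int(rev)


def missing_or_outdated(required, installed):
    """Return the required patches that are absent or installed at a lower revision."""
    highest = {}
    for patch in installed:
        base, rev = parse_patch(patch)
        highest[base] = max(rev, highest.get(base, -1))
    result = []
    for patch in required:
        base, rev = parse_patch(patch)
        if highest.get(base, -1) < rev:
            result.append(patch)
    return result


# Revisions taken from the Solaris 10 sparc list above.
required = ["126106-38", "125992-05", "126035-06"]
# Hypothetical state of a node; not taken from a real system.
installed = ["126106-33", "125992-05"]

print(missing_or_outdated(required, installed))
```

The comparison treats a higher installed revision as sufficient, which matches the "or higher" wording used throughout this blog.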
The quorum server is an alternative to the traditional quorum disk. The quorum server runs outside of the Sun Cluster and is accessed through the public network. Therefore the quorum server host can be of a different architecture than the cluster nodes.
Included/updated patch revisions in Sun Cluster 3.2 11/09 Update3 for quorum server feature:
127404-03 Sun Cluster 3.2: Quorum Server Patch for Solaris 9
127405-04 Sun Cluster 3.2: Quorum Server Patch for Solaris 10
127406-04 Sun Cluster 3.2: Quorum Server Patch for Solaris 10_x86
Please be aware of the following note, which is in the Special Install Instructions of Sun Cluster 3.2 core patch -38 and higher:
NOTE 17: Quorum server patch 127406-04 (or greater) needs to be installed on quorum server host first, before installing 126107-37 (or greater) Core Patch on cluster nodes.
127408-02 Sun Cluster 3.2: Quorum Man Pages Patch for Solaris 9 and Solaris 10, sparc
127409-02 Sun Cluster 3.2: Quorum Man Pages Patch for Solaris 10_x86
If some patches must be applied when the node is in noncluster mode, you can apply them in a rolling fashion, one node at a time, unless a patch's instructions require that you shut down the entire cluster. Follow procedures in How to Apply a Rebooting Patch (Node) in Sun Cluster System Administration Guide for Solaris OS to prepare the node and boot it into noncluster mode. For ease of installation, consider applying all patches at once to a node that you place in noncluster mode.
I came across some fungi when I was walking through the forest... This is a wood and tree fungus; the white was much more beautiful than in the picture.
Update 4.Dec.2009:
Support of Solaris 10 10/09 Update8 with Sun Cluster 3.2 1/09 Update2 is now announced. The recommendation is to use the 126106-39 (sparc) / 126107-39 (x86) with Solaris 10 10/09 Update8. Note: The -39 Sun Cluster core patch is a feature patch because the -38 Sun Cluster core patch is part of Sun Cluster 3.2 11/09 Update3 which is already released.
For new installations/upgrades with Solaris 10 10/09 Update8 use:
* Sun Cluster 3.2 11/09 Update3 with Sun Cluster core patch -39 (fixes problem 1)
* Use link-based IPMP (workaround for problem 2). The patches 142900-02/142901-02, which fix the issue, are coming soon.
* Add "set nautopush=64" to /etc/system (workaround for problem 3)
For patch updates to 141444-09/141445-09 use:
* Sun Cluster core patch -39 (fixes problem 1)
* If possible, configure link-based IPMP; otherwise wait for the fix (workaround for problem 2). The patches 142900-02/142901-02, which fix the issue, are coming soon.
* Add "set nautopush=64" to /etc/system (workaround for problem 3)
Note that there are some issues with the kernel patches 141444-09 (sparc) / 141445-09 (x86) in combination with Sun Cluster 3.2:
1.) The patch breaks the zpool cachefile feature if using SUNW.HAStoragePlus
a.) If the kernel patch 141444-09 (sparc) / 141445-09 (x86) is installed on a Sun Cluster 3.2 system where the Sun Cluster core patch 126106-33 (sparc) / 126107-33 (x86) is already installed then hastorageplus_prenet_start will fail with the following error message:
...
Oct 26 17:51:45 nodeA SC[,SUNW.HAStoragePlus:6,rg1,rs1,hastorageplus_prenet_start]: Started searching for devices in '/dev/dsk' to find the importable pools.
Oct 26 17:51:53 nodeA SC[,SUNW.HAStoragePlus:6,rg1,rs1,hastorageplus_prenet_start]: Completed searching the devices in '/dev/dsk' to find the importable pools.
Oct 26 17:51:54 nodeA zfs: [ID 427000 kern.warning] WARNING: pool 'zpool1' could not be loaded as it was last accessed by another system (host: nodeB hostid: 0x8516ced4). See: http://www.sun.com/msg/ZFS-8000-EY
...
b.) If the kernel patch 141444-09 (sparc) / 141445-09 (x86) is installed on a Sun Cluster 3.2 system where the Sun Cluster core patch 126106-35 (sparc) / 126107-35 (x86) is already installed then hastorageplus_prenet_start will work but the zpool cachefile feature of SUNW.HAStoragePlus is disabled. Without the zpool cachefile feature the time of zpool import increases because the import will scan all available disks. The messages look like:
...
Oct 30 15:37:45 nodeA SC[,SUNW.HAStoragePlus:8,nfs-rg,zpool1-rs,hastorageplus_validate]: [ID 148650 daemon.notice] Started searching for devices in '/dev/dsk' to find the importable pools.
Oct 30 15:37:45 nodeA SC[,SUNW.HAStoragePlus:8,nfs-rg,zpool1-rs,hastorageplus_validate]: [ID 148650 daemon.notice] Started searching for devices in '/dev/dsk' to find the importable pools.
Oct 30 15:37:49 nodeA SC[,SUNW.HAStoragePlus:8,nfs-rg,zpool1-rs,hastorageplus_validate]: [ID 547433 daemon.notice] Completed searching the devices in '/dev/dsk' to find the importable pools.
Oct 30 15:37:49 nodeA SC[,SUNW.HAStoragePlus:8,nfs-rg,zpool1-rs,hastorageplus_validate]: [ID 547433 daemon.notice] Completed searching the devices in '/dev/dsk' to find the importable pools.
Oct 30 15:37:49 nodeA SC[,SUNW.HAStoragePlus:8,nfs-rg,zpool1-rs,hastorageplus_validate]: [ID 792255 daemon.warning] Failed to update the cachefile contents in /var/cluster/run/HAStoragePlus/zfs/zpool1.cachefile to CCR table zpool1.cachefile for pool zpool1 : file /var/cluster/run/HAStoragePlus/zfs/zpool1.cachefile open failed: No such file or directory.
Oct 30 15:37:49 nodeA SC[,SUNW.HAStoragePlus:8,nfs-rg,zpool1-rs,hastorageplus_validate]: [ID 792255 daemon.warning] Failed to update the cachefile contents in /var/cluster/run/HAStoragePlus/zfs/zpool1.cachefile to CCR table zpool1.cachefile for pool zpool1 : file /var/cluster/run/HAStoragePlus/zfs/zpool1.cachefile open failed: No such file or directory.
Oct 30 15:37:49 nodeA SC[,SUNW.HAStoragePlus:8,nfs-rg,zpool1-rs,hastorageplus_validate]: [ID 205754 daemon.info] All specified device services validated successfully.
...
If the ZFS cachefile feature is not required AND the above kernel patches are installed, problem a.) is resolved by installing Sun Cluster core patch 126106-35 (sparc) / 126107-35 (x86).
Solution for a) and b):
126106-39 Sun Cluster 3.2: CORE patch for Solaris 10
126107-39 Sun Cluster 3.2: CORE patch for Solaris 10_x86
Sun Alert 272669: A Solaris Kernel Change Stops Sun Cluster Using "zpool.cachefiles" to Import zpools Resulting in ZFS pool Import Performance Degradation or Failure to Import the zpools
This is reported in Bug 6895580
2.) The patch breaks probe-based IPMP if more than one interface is in the same IPMP group
After installing the already mentioned kernel patch:
141444-09 SunOS 5.10: kernel patch or
141445-09 SunOS 5.10_x86: kernel patch
then the probe-based IPMP group feature is broken if the system is using more than one interface in the same IPMP group. This means all Solaris 10 systems which are using more than one interface in the same probe-based IPMP group are affected!
After installing this kernel patch the following errors will be sent to the system console after a reboot:
...
nodeA console login: Oct 26 19:34:41 in.mpathd[210]: NIC failure detected on bge0 of group ipmp0
Oct 26 19:34:41 in.mpathd[210]: Successfully failed over from NIC bge0 to NIC e1000g0
...
Workarounds:
a) Use link-based IPMP instead of probe-based IPMP
b) Use only one interface in the same IPMP group if using probe-based IPMP
See the blog "Tips to configure IPMP with Sun Cluster 3.x" for more details if you would like to change the configuration.
c) Do not install the listed kernel patch above. Note: Fix is already in progress and can be reached via a service request. I will update this blog when the general fix is available.
Sun Alert 271519: Solaris 10 Kernel Patches 141444-09 and 141445-09 May Cause Interface Failure in IP Multipathing (IPMP)
This is reported in Bug 6888928
3.) When applying the patch Sun Cluster can hang on reboot
After installing the already mentioned kernel patch:
141444-09 SunOS 5.10: kernel patch or
141511-05 SunOS 5.10_x86: ehci, ohci, uhci patch
the Sun Cluster nodes can hang during boot because they have exhausted the default number of autopush structures. When the clhbsndr module is loaded, it causes many more autopushes to occur than would otherwise happen on a non-clustered system. By default, only nautopush=32 of these structures are allocated.
Workarounds:
a) Do not use the mentioned kernel patch with Sun Cluster
b) Boot in non-cluster-mode and add the following to /etc/system
set nautopush=64
Sun Alert 273610: Solaris autopush(1M) Changes (with patches 141444-09/141511-04) May Cause Sun Cluster 3.1 and 3.2 Nodes to Hang During Boot
This is reported in Bug 6879232
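Whether the nautopush workaround is already in place can be checked by reading /etc/system. The sketch below is an illustration only: it works on the file's text (passed in as a string), it only understands decimal values, and it assumes comment lines start with `*` or `#`. The sample text is hypothetical.

```python
import re


def nautopush_value(etc_system_text):
    """Return the last 'set nautopush=N' value in the text, or None if absent."""
    value = None
    for line in etc_system_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines (assumed to start with '*' or '#').
        if not line or line[0] in "*#":
            continue
        match = re.match(r"set\s+nautopush\s*=\s*(\d+)\s*$", line)
        if match:
            value = int(match.group(1))
    return value


# Hypothetical file content for illustration.
sample = """* workaround for problem 3
set nautopush=64
"""

value = nautopush_value(sample)
if value is None or value < 64:
    print("nautopush workaround missing")
else:
    print("nautopush is set to", value)
```

On a real node the text would come from reading /etc/system; remember that changes to this file only take effect after a reboot.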
sparc       x86         Release / status            Relation (sparc / x86)
142900-xx   142901-xx   newest patchid              requires 141444-09 / 141445-09 SunOS 5.10: kernel patch
141444-09   141445-09   Solaris 10 10/09 Update8    requires 139555-08 / 139556-08
141414-10   141415-10   highest release             obsoleted by 141444-09 / 141445-09 SunOS 5.10: kernel patch
139555-08   139556-08   Solaris 10 05/09 Update7    requires 137137-09 / 137138-09
138888-08   138889-08   highest release             obsoleted by 139555-08 / 139556-08 SunOS 5.10: Kernel Patch
137137-09   137138-09   Solaris 10 10/08 Update6    requires 127127-11 / 127128-11
137111-08   137112-08   highest release             obsoleted by 137137-09 / 137138-09 SunOS 5.10: kernel patch
127127-11   127128-11   Solaris 10 05/08 Update5    requires 120011-14 / 120012-14
127111-11   127112-11   highest release             obsoleted by 127127-11 / 127128-11 SunOS 5.10: kernel patch
120011-14   120012-14   Solaris 10 08/07 Update4    requires 118833-36 / 118855-36
125100-10   125101-10   highest release             obsoleted by 120011-14 / 120012-14 SunOS 5.10: Kernel Update patch
118833-36   118855-36   highest release             this is a must have for Solaris 10 11/06 Update3
118833-33   118855-33   Solaris 10 11/06 Update3
118833-17   118855-14   Solaris 10 06/06 Update2
118822-30   118844-30   highest release             obsoleted by 118833-36 / 118855-36 SunOS 5.10: kernel Patch
118822-25   118844-26   Solaris 10 01/06 Update1
118822-10   118844-11   Solaris 10 03/05 HW1
-           -           Solaris 10
This is a short overview of how to configure a zone cluster. It is highly recommended to use Solaris 10 5/09 Update7 with patch baseline July 2009 (or higher) and Sun Cluster 3.2 1/09 with Sun Cluster 3.2 core patch revision -33 or higher. The name of the zone cluster must be unique throughout the global Sun Cluster, and the zone cluster must be configured on a global Sun Cluster. Please read the requirements for zone clusters in the Sun Cluster Software Installation Guide.
In some cases it's necessary to add a tagged VLAN id to the cluster interconnect. This example shows the difference in the cluster interconnect configuration with and without a tagged VLAN id. The interface e1000g2 has a "normal" setup (no VLAN id) and the interface e1000g1 has a VLAN id of 2. The Ethernet switch in use must be configured with the tagged VLAN id before the cluster interconnect can be configured. Use "clsetup" to assign a VLAN id to the cluster interconnect.
Entries for "normal" cluster interconnect interface in /etc/cluster/ccr/global/infrastructure - no tagged VLAN:
cluster.nodes.1.adapters.1.name e1000g2
cluster.nodes.1.adapters.1.properties.device_name e1000g
cluster.nodes.1.adapters.1.properties.device_instance 2
cluster.nodes.1.adapters.1.properties.transport_type dlpi
cluster.nodes.1.adapters.1.properties.lazy_free 1
cluster.nodes.1.adapters.1.properties.dlpi_heartbeat_timeout 10000
cluster.nodes.1.adapters.1.properties.dlpi_heartbeat_quantum 1000
cluster.nodes.1.adapters.1.properties.nw_bandwidth 80
cluster.nodes.1.adapters.1.properties.bandwidth 70
cluster.nodes.1.adapters.1.properties.ip_address 172.16.1.129
cluster.nodes.1.adapters.1.properties.netmask 255.255.255.128
cluster.nodes.1.adapters.1.state enabled
cluster.nodes.1.adapters.1.ports.1.name 0
cluster.nodes.1.adapters.1.ports.1.state enabled
Entries for cluster interconnect interface in /etc/cluster/ccr/global/infrastructure - with tagged VLAN:
cluster.nodes.1.adapters.2.name e1000g2001
cluster.nodes.1.adapters.2.properties.device_name e1000g
cluster.nodes.1.adapters.2.properties.device_instance 1
cluster.nodes.1.adapters.2.properties.vlan_id 2
cluster.nodes.1.adapters.2.properties.transport_type dlpi
cluster.nodes.1.adapters.2.properties.lazy_free 1
cluster.nodes.1.adapters.2.properties.dlpi_heartbeat_timeout 10000
cluster.nodes.1.adapters.2.properties.dlpi_heartbeat_quantum 1000
cluster.nodes.1.adapters.2.properties.nw_bandwidth 80
cluster.nodes.1.adapters.2.properties.bandwidth 70
cluster.nodes.1.adapters.2.properties.ip_address 172.16.2.1
cluster.nodes.1.adapters.2.properties.netmask 255.255.255.128
cluster.nodes.1.adapters.2.state enabled
cluster.nodes.1.adapters.2.ports.1.name 0
cluster.nodes.1.adapters.2.ports.1.state enabled
The tagged VLAN interface is a combination of the VLAN id and the used network interface. In this example e1000g2001, the 2 after the e1000g is the VLAN id and the 1 at the end is the instance of the e1000g driver. Normally this would be the e1000g1 interface but with the VLAN id it becomes the interface e1000g2001.
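The naming rule described above can be written as simple arithmetic: the number after the driver name is VLAN id * 1000 + driver instance. The following sketch only illustrates that rule with the values from this example; the function names are made up for illustration.

```python
import re


def vlan_interface(driver, instance, vlan_id):
    """Build the interface name: driver name followed by vlan_id * 1000 + instance."""
    return f"{driver}{vlan_id * 1000 + instance}"


def split_vlan_interface(name):
    """Split an interface name into (driver, instance, vlan_id); vlan_id 0 means untagged."""
    match = re.fullmatch(r"([a-z0-9]*[a-z])(\d+)", name)
    if match is None:
        raise ValueError(f"not an interface name: {name!r}")
    driver, number = match.group(1), int(match.group(2))
    return driver, number % 1000, number // 1000


print(vlan_interface("e1000g", 1, 2))      # e1000g2001
print(split_vlan_interface("e1000g2001"))  # ('e1000g', 1, 2)
print(split_vlan_interface("e1000g2"))     # ('e1000g', 2, 0)
```

This matches the CCR entries above, where device_instance is 1 and vlan_id is 2 for the adapter named e1000g2001.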
The ifconfig -a of the above configuration is:
# ifconfig -a
lo0: flags=20010008c9
inet 127.0.0.1 netmask ff000000
e1000g0: flags=9000843
inet 10.16.65.63 netmask fffff800 broadcast 10.16.55.255
groupname sc_ipmp0
ether 0:14:4f:20:6a:18
e1000g2: flags=201008843
inet 172.16.1.129 netmask ffffff80 broadcast 172.16.1.255
ether 0:14:4f:20:6a:1a
e1000g2001: flags=201008843
inet 172.16.2.1 netmask ffffff80 broadcast 172.16.2.127
ether 0:14:4f:20:6a:19
clprivnet0: flags=1009843
inet 172.16.4.1 netmask fffffe00 broadcast 172.16.5.255
ether 0:0:0:0:0:1
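ifconfig prints the netmask in hexadecimal, which makes it harder to relate to the dotted netmask in the CCR entries. As a rough aid, the sketch below converts the hexadecimal netmask and derives the broadcast address for the interconnect addresses shown above. It uses Python's standard ipaddress module; the helper names are assumptions for illustration.

```python
import ipaddress


def dotted_netmask(hex_netmask):
    """Convert a hexadecimal netmask such as 'ffffff80' to dotted notation."""
    return str(ipaddress.IPv4Address(int(hex_netmask, 16)))


def broadcast(addr, hex_netmask):
    """Return the broadcast address for an address and a hexadecimal netmask."""
    mask = dotted_netmask(hex_netmask)
    # strict=False because addr is a host address, not the network address.
    network = ipaddress.IPv4Network(f"{addr}/{mask}", strict=False)
    return str(network.broadcast_address)


print(dotted_netmask("ffffff80"))           # 255.255.255.128
print(broadcast("172.16.1.129", "ffffff80"))  # e1000g2
print(broadcast("172.16.2.1", "ffffff80"))    # e1000g2001
print(broadcast("172.16.4.1", "fffffe00"))    # clprivnet0
```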

Now it's time to install/upgrade to Sun Cluster 3.2 1/09 Update2. The major bugs of Sun Cluster 3.2 1/09 Update2 are fixed in
126106-33 or higher Sun Cluster 3.2: CORE patch for Solaris 10
126107-33 or higher Sun Cluster 3.2: CORE patch for Solaris 10_x86
126105-33 or higher Sun Cluster 3.2: CORE patch for Solaris 9
If problems occurred due to administration mistakes then the following errors have been seen:
NODE1# zpool import tank
cannot import 'tank': I/O error
NODE2# zpool import tankothernode
cannot import 'tankothernode': one or more devices is currently unavailable
NODE2# zpool import tankothernode
cannot import 'tankothernode': no such pool available
NODE1# zpool import tank
cannot import 'tank': pool may be in use from other system, it was last accessed by NODE2 (hostid: 0x83083465) on Fri May 8 13:34:41 2009
use '-f' to import anyway
NODE1# zpool import -f tank
cannot import 'tank': one or more devices is currently unavailable
This is addressed in bug 6783988.
The issue only occurs if Sun Cluster 3.2 1/09 Update2 is installed with a non-default netmask for the cluster interconnect.
Problems seen if the system is affected:
Errors with:
* did devices
* quorum device
* 'scstat -i' can look like:
-- IPMP Groups --
Node Name Group Status Adapter Status
--------- ----- ------ ------- ------
scrconf: RPC: Authentication error; why = Client credential too weak
scrconf: Failed to get zone information for s4u-4800f-domc-muc07 - unexpected error.
scrconf: RPC: Authentication error; why = Client credential too weak
scrconf: Failed to get zone information for s4u-4800f-doma-muc07 - unexpected error.
scrconf: RPC: Authentication error; why = Client credential too weak
scrconf: Failed to get zone information for s4u-4800e-domc-muc07 - unexpected error.
scrconf: RPC: Authentication error; why = Client credential too weak
scrconf: Failed to get zone information for s4u-4800e-doma-muc07 - unexpected error.
IPMP Group: s4u-4800f-domc-muc07 sc_ipmp0 Online qfe0 Online
IPMP Group: s4u-4800f-doma-muc07 sc_ipmp0 Online qfe0 Online
IPMP Group: s4u-4800e-domc-muc07 sc_ipmp0 Online qfe0 Online
IPMP Group: s4u-4800e-doma-muc07 sc_ipmp0 Online qfe0 Online
How does the problem occur?
After installing the Sun Cluster 3.2 1/09 Update2 product with the Java installer, it's necessary to run the scinstall command. If "Custom" installation is chosen instead of "Typical" installation, it's possible to change the default netmask of the cluster interconnect. The following questions come up within the installation procedure if the default netmask question is answered with 'no'.
Example scinstall:
Is it okay to accept the default netmask (yes/no) [yes]? no
Maximum number of nodes anticipated for future growth [64]? 4
Maximum number of private networks anticipated for future growth [10]?
Maximum number of virtual clusters expected [12]? 0
What netmask do you want to use [255.255.255.128]?
Prevent the issue by answering the virtual clusters question with '1', or with a higher value if future growth potential requires it.
Do NOT answer the virtual clusters question with '0'!
Example of the whole scinstall log when the corrupted CCR occurs:
In the /etc/cluster/ccr/global/infrastructure file, the error shows up as an empty entry for cluster.properties.private_netmask. Furthermore, some other lines do not reflect the correct netmask values as chosen within scinstall.
Wrong infrastructure file:
cluster.state enabled
cluster.properties.cluster_id 0x49F82635
cluster.properties.installmode disabled
cluster.properties.private_net_number 172.16.0.0
cluster.properties.cluster_netmask 255.255.248.0
cluster.properties.private_netmask
cluster.properties.private_subnet_netmask 255.255.255.248
cluster.properties.private_user_net_number 172.16.4.0
cluster.properties.private_user_netmask 255.255.254.0
cluster.properties.private_maxnodes 6
cluster.properties.private_maxprivnets 10
cluster.properties.zoneclusters 0
cluster.properties.auth_joinlist_type sys
If answering the virtual cluster question with value '1' then the correct netmask entries are:
cluster.properties.cluster_id 0x49F82635
cluster.properties.installmode disabled
cluster.properties.private_net_number 172.16.0.0
cluster.properties.cluster_netmask 255.255.255.128
cluster.properties.private_netmask 255.255.255.128
cluster.properties.private_subnet_netmask 255.255.255.248
cluster.properties.private_user_net_number 172.16.0.64
cluster.properties.private_user_netmask 255.255.255.224
cluster.properties.private_maxnodes 6
cluster.properties.private_maxprivnets 10
cluster.properties.zoneclusters 1
cluster.properties.auth_joinlist_type sys
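The empty private_netmask entry is easy to overlook by eye. The sketch below is a read-only check that lists cluster.properties keys without a value; it works on the file's text passed in as a string, and the sample contains only a few lines copied from the wrong infrastructure file above. It does not modify the CCR in any way.

```python
def empty_properties(infrastructure_text, prefix="cluster.properties."):
    """Return the keys starting with prefix that have no value on their line."""
    empty = []
    for line in infrastructure_text.splitlines():
        parts = line.split(None, 1)  # key, then the rest of the line as value
        if not parts or not parts[0].startswith(prefix):
            continue
        if len(parts) == 1:
            empty.append(parts[0])
    return empty


# A few lines copied from the wrong infrastructure file shown above.
sample = """cluster.properties.private_net_number 172.16.0.0
cluster.properties.cluster_netmask 255.255.248.0
cluster.properties.private_netmask
cluster.properties.private_subnet_netmask 255.255.255.248
"""

print(empty_properties(sample))
```

An empty result only means that no key is missing its value; it does not prove that the other netmask values are correct.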
Workaround if the problem has already occurred:
1.) Boot all nodes in non-cluster-mode with 'boot -x'
2.) Change the wrong values of /etc/cluster/ccr/global/infrastructure on all nodes. See example above.
3.) Write a new checksum for all infrastructure files on all nodes. Use -o (master file) on the node which is booting up first.
s4u-4800e-doma-muc07 # /usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/global/infrastructure -o
s4u-4800e-domc-muc07 # /usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/global/infrastructure
s4u-4800f-doma-muc07 # /usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/global/infrastructure
s4u-4800f-domc-muc07 # /usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/global/infrastructure
4.) First reboot s4u-4800e-doma-muc07 (master infrastructure file) into the cluster, then the other nodes.
This is reported in bug 6825948.
Update 17.Jun.2009:
The -33 revision of the Sun Cluster core patch is the first released version which fixes this issue at installation time.
126106-33 Sun Cluster 3.2: CORE patch for Solaris 10
126107-33 Sun Cluster 3.2: CORE patch for Solaris 10_x86
There is a missing/old preremove script in Sun Cluster 3.2 2/08 Update1 which is equivalent to the patches
126106-12 until -19 Sun Cluster 3.2: CORE patch for Solaris 10
126107-12 until -19 Sun Cluster 3.2: CORE patch for Solaris 10_x86
126105-12 until -19 Sun Cluster 3.2: CORE patch for Solaris 9
This means the issue can occur during an upgrade (using scinstall -u) from Sun Cluster 3.2 to Sun Cluster 3.2 Update1 or Update2.
More details available in Missing preremove script in Sun Cluster 3.2 core patch revision 12 and higher.
The issue is that, if the mentioned Sun Cluster core patches are installed, the SUNWscr package cannot be removed during the upgrade to Sun Cluster 3.2 1/09 Update2.
The problem looks like this:
# ./scinstall -u update
Starting upgrade of Sun Cluster framework software
Saving current Sun Cluster configuration
Do not boot this node into cluster mode until upgrade is complete.
Renamed "/etc/cluster/ccr" to "/etc/cluster/ccr.upgrade".
** Removing Sun Cluster framework packages **
...
Removing SUNWscrtlh..done
Removing SUNWscr.....failed
scinstall: Failed to remove "SUNWscr"
Removing SUNWscscku..done
 ...
scinstall: scinstall did NOT complete successfully!
Workaround:
Before the upgrade to Sun Cluster 3.2 Update1/Update2 install the following patch which delivers a correct preremove script for Sun Cluster 3.2
140016 Sun Cluster 3.2: CORE patch for Solaris 9
140017 Sun Cluster 3.2: CORE patch for Solaris 10
140018 Sun Cluster 3.2: CORE patch for Solaris 10_x86
If one of the following patches is already installed, then the above patches are not necessary, because these patches also include a correct preremove script for package SUNWscr.
126106-27 or higher Sun Cluster 3.2: CORE patch for Solaris 10
126107-28 or higher Sun Cluster 3.2: CORE patch for Solaris 10_x86
126105-26 or higher Sun Cluster 3.2: CORE patch for Solaris 9
This is reported in bugs 6676771 and 6747530 with further details.
With Sun Cluster 3.2 it's possible that nested mounts are mounted in the wrong order. As a result, the data on these file systems becomes inaccessible to users.
The issue happens if one of the following Sun Cluster core patches is active and nested mounts are managed with resource type SUNW.HAStoragePlus.
126106-27 or -29 or -30 Sun Cluster 3.2: CORE patch for Solaris 10
126107-28 or -30 or -31 Sun Cluster 3.2: CORE patch for Solaris 10_x86
126105-26 or -28 or -29 Sun Cluster 3.2: CORE patch for Solaris 9
The error can look like:
The correct output of df -k should be
/dev/vx/dsk/datadg/vol01 480751 1048 431628 1% /test
/dev/vx/dsk/datadg/vol02 288639 1042 258734 1% /test/test2
/dev/vx/dsk/datadg/vol03 577295 1041 518525 1% /test/test3
The mount order is defined in the HAStoragePlus resource test-rs
# clrs show -v test-rs | grep FilesystemMountPoints
FilesystemMountPoints: /test /test/test2 /test/test3
But, due to runtime problems the filesystems get mounted in wrong order and the df -k can look like:
/dev/vx/dsk/datadg/vol02 480751 1048 431628 1% /test/test2
/dev/vx/dsk/datadg/vol03 480751 1048 431628 1% /test/test3
/dev/vx/dsk/datadg/vol01 480751 1048 431628 1% /test
In this specific case, /test/test2 and /test/test3 were mounted first followed by an overlay mount of /test. Due to this, data in /test/test2 and /test/test3 would not be accessible and show the same information as /test.
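The overlay situation can be detected mechanically: a mount point must not be listed before one of its parent mount points. The sketch below assumes the mount points are given in the order in which they were mounted (as in the df -k listings above); the function name is made up for illustration.

```python
def mount_order_ok(mount_points):
    """Return True if no mount point appears before one of its parent mount points."""
    cleaned = [mp.rstrip("/") or "/" for mp in mount_points]
    for index, child in enumerate(cleaned):
        for later in cleaned[index + 1:]:
            # 'later' is a parent of 'child' if 'child' lies below it in the tree.
            if child.startswith(later.rstrip("/") + "/"):
                return False
    return True


# Order from the correct df -k output above.
print(mount_order_ok(["/test", "/test/test2", "/test/test3"]))
# Order from the wrong df -k output above: /test overlays its children.
print(mount_order_ok(["/test/test2", "/test/test3", "/test"]))
```

Note that a name such as /testing is not treated as a child of /test, because the comparison includes the trailing slash.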
Workaround:
It's possible to split the SUNW.HAStoragePlus resource. For the example above change the resource test-rs and remove the FilesystemMountPoints /test/test2 and /test/test3. Furthermore create a new resource test1-rs with the mentioned FilesystemMountPoints and add a resource dependency.
The commands to change this specific configuration will be:
# clrs set -p FilesystemMountPoints=/test test-rs
# clrs create -g test-rg -t SUNW.HAStoragePlus -p FilesystemMountPoints=/test/test2,/test/test3 -p Resource_dependencies=test-rs -p AffinityOn=True test1-rs
Due to this change the test1-rs starts after the test-rs and the problem is solved.
Details available in:
Sun Alert 256368 Nested Mounts Managed by a SUNW.HAStoragePlus Resource may Fail to Mount in the Correct Order on Solaris Cluster 3.2
Update 17.Jun.2009:
The -33 revision of the Sun Cluster core patch is the first released version which fixes this issue.
126106-33 Sun Cluster 3.2: CORE patch for Solaris 10
126107-33 Sun Cluster 3.2: CORE patch for Solaris 10_x86
A memory leak occurs in the "rgmd -z global" process on Sun Cluster 3.2 1/09 Update2. The global zone instance of the rgmd process leaks memory in most situations such as "scstat" or "cluster show" and other basic commands. The problem is severe and the rgmd heap grows to a large size and crashes the Sun Cluster node.
The issue only happens if one of the following Sun Cluster core patches is active.
126106-27 or -29 or -30 Sun Cluster 3.2: CORE patch for Solaris 10
126107-28 or -30 or -31 Sun Cluster 3.2: CORE patch for Solaris 10_x86
Because these patches are also part of the Sun Cluster 3.2 1/09 Update2 release, the issue also occurs on freshly installed Sun Cluster 3.2 1/09 Update2 systems.
The error can look as follows:
Analyze the growth of the memory allocation with prstat (or similar tools):
# prstat
3942 root 61M 11M sleep 101 - 0:00:02 0.7% rgmd/41
Some time later the increase of the memory allocation is visible:
3942 root 61M 20M sleep 101 - 0:01:15 0.7% rgmd/41
or
# pmap -x <pid_of_rgmd-z_global> | grep heap
00022000 47648 6992 6984 - rwx-- [ heap ]
Some time later the increase of the memory allocation is visible:
00022000 47648 15360 15352 - rwx-- [ heap ]
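To quantify the leak, the resident size of the heap can be taken from two pmap samples and compared. This is only a sketch: it assumes the third column of a `pmap -x` line is the resident size (RSS) in kilobytes, and it uses the two heap lines shown above as input.

```python
def heap_rss_kb(pmap_heap_line):
    """Return the RSS column (assumed to be the third field, in KB) of a pmap -x heap line."""
    fields = pmap_heap_line.split()
    return int(fields[2])


# The two heap lines from the pmap output above.
first = "00022000 47648 6992 6984 - rwx-- [ heap ]"
later = "00022000 47648 15360 15352 - rwx-- [ heap ]"

growth = heap_rss_kb(later) - heap_rss_kb(first)
print("heap RSS grew by", growth, "KB")
```

A steadily growing value across repeated samples, without a matching change in cluster configuration, is the pattern to look for.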
When the memory is full the Sun Cluster node panics with the following message:
Feb 25 07:59:23 node1 RGMD[1843]: [ID 381173 daemon.error] RGM: Could not allocate 1024 bytes; node is out of swap space; aborting node.
...
Feb 25 08:10:05 node1 cl_dlpitrans: [ID 624622 kern.notice] Notifying cluster that this node is panicking
Feb 25 08:10:05 node1 unix: [ID 836849 kern.notice]
Feb 25 08:10:05 node1 ^Mpanic[cpu0]/thread=2a100047ca0:
Feb 25 08:10:05 node1 unix: [ID 562397 kern.notice] Failfast: Aborting zone "global" (zone ID 0) because "globalrgmd" died 30 seconds ago.
Feb 25 08:10:06 node1 unix: [ID 100000 kern.notice]
...
Update 20.Mar.2009:
Available now:
Sun Alert 254908 Memory Leak in the "rgmd" Process of Solaris Cluster 3.2 may Cause a failfast Panic
Update 17.Jun.2009:
The -33 revision of the Sun Cluster core patch is the first released version which fixes this issue.
126106-33 Sun Cluster 3.2: CORE patch for Solaris 10
126107-33 Sun Cluster 3.2: CORE patch for Solaris 10_x86
Workaround: Use previous version -19 to prevent issue.
126106-19 Sun Cluster 3.2: CORE patch for Solaris 10
126107-19 Sun Cluster 3.2: CORE patch for Solaris 10_x86
The issue is reported in bug 6808508 (description: scalable services coredump during the failover due to network failure). A fix is in progress. This blog will be updated when the fix is available.