Availability Engineering
Sun Cluster Oasis
Wednesday Nov 29, 2006
How does Sun Cluster decide on which node a service runs?

A customer asked us how Sun Cluster decides where to bring a resource group online.

The selection of a primary node for a resource group is determined first of all by the Nodelist property that is configured for the group. The Nodelist specifies a list of nodes or zones where the resource group can be brought online, in order of preference. These nodes or zones are known as the potential primaries (or masters) of the resource group.

The nodelist preference order is modified by the following factors:

- "Ping pong" prevention: If the resource group has recently failed to start on a given node or zone, that node or zone is given lower priority than one on which it has not yet failed.

- Resource group affinities: This is a configurable resource group property that indicates a positive or negative affinity of one group for another. The node selection algorithm always satisfies strong affinities, and makes a best effort to satisfy weak affinities.
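
The ping-pong rule above can be pictured with a small Python sketch. This is a toy model under our own assumptions, not the RGM's actual code: the function name order_candidates and the recently_failed set are hypothetical, and the sketch covers only ping-pong prevention, not affinities.

```python
def order_candidates(nodelist, recently_failed):
    """Toy model: keep Nodelist order, but try nodes with a recent
    start failure only after the nodes without one."""
    # sorted() is stable and False sorts before True, so nodes keep their
    # relative Nodelist order inside each of the two groups.
    return sorted(nodelist, key=lambda node: node in recently_failed)


if __name__ == "__main__":
    # node1 is first in the Nodelist but recently failed to start the group.
    print(order_candidates(["node1", "node2", "node3"], {"node1"}))
    # prints ['node2', 'node3', 'node1']
```

The stable sort is the point of the sketch: a failed node is demoted, not removed, so it is still available as a last resort.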

The resource group affinities property (RG_affinities) was introduced in Sun Cluster 3.1 9/04. This property allows you to express that the RGM should do one of the following:

1. Attempt to locate the resource group on a node that is a current master of another group, referred to as positive affinity.

2. Attempt to locate the resource group on a node that is not a current master of a given group, referred to as negative affinity.

Resource group affinities come in five flavors:

+, or weak positive affinity

++, or strong positive affinity

+++, or strong positive affinity with failover delegation

-, or weak negative affinity

--, or strong negative affinity
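
To make the prefix notation concrete, here is a minimal Python sketch that splits an RG_affinities value into (flavor, target group) pairs. The parse_affinities helper is hypothetical and written only for illustration; it is not part of Sun Cluster, and it assumes a comma-separated value like the ones passed to clrg in the examples in this post.

```python
# Longest prefixes first, so "+++" is not mistaken for "++" or "+".
PREFIXES = [
    ("+++", "strong positive with failover delegation"),
    ("++", "strong positive"),
    ("--", "strong negative"),
    ("+", "weak positive"),
    ("-", "weak negative"),
]


def parse_affinities(value):
    """Split an RG_affinities string into (flavor, target_group) pairs."""
    result = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty entries
        for prefix, flavor in PREFIXES:
            if item.startswith(prefix):
                result.append((flavor, item[len(prefix):]))
                break
        else:
            raise ValueError(f"no affinity prefix in {item!r}")
    return result


if __name__ == "__main__":
    print(parse_affinities("+++ora-rg"))
    print(parse_affinities("-app2-rg,-app3-rg"))
```

Only the leading characters decide the flavor, so hyphens inside a group name such as ora-rg are harmless.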

At this point you may well be wondering, what are affinities used for, and how do they work? To answer that, here are a few examples:

Example 1: Enforcing collocation of a resource group with another resource group

Suppose that our cluster is running an Oracle database server controlled by a failover resource group, ora-rg. We also have an application in resource group dbmeasure-rg, whose job it is to measure and log the performance of the database. The dbmeasure application, if it runs, must run on the same node as the Oracle server. However, the measurement application is not mandatory, and Oracle can run fine without it.

We can force dbmeasure to run only on a node where Oracle is running, by declaring a strong positive affinity:

clrg set -p RG_affinities=++ora-rg dbmeasure-rg

When we initially switch ora-rg online, dbmeasure-rg will automatically come online on the same node. If ora-rg fails over to a different node or zone, then dbmeasure-rg will follow it automatically. While ora-rg remains online, we can switch dbmeasure-rg offline. However, dbmeasure-rg cannot switch over or fail over onto any node where ora-rg is *not* running.

Note: Besides the RG_affinities, we may also configure a dependency of the dbmeasure resource upon the oracle server resource. This assures that the dbmeasure resource does not get started until the oracle server resource is online. Resource group affinities are enforced independently of resource dependencies. While resource dependencies control the order in which resources are started and stopped, RG_affinities control the _locations_ where resource groups are brought online across multiple nodes or zones of a cluster.

What if dbmeasure is a more critical application, and it is important to keep it up and running? In that case, we might want to allow dbmeasure-rg itself to initiate a failover onto a different node, dragging ora-rg along with it. To accomplish this, we use the strong positive affinity with delegated failover:

clrg set -p RG_affinities=+++ora-rg dbmeasure-rg
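
The difference between ++ and +++ can be sketched in a few lines of Python. This is a toy model under our own assumptions, not RGM code: relocation_targets is a hypothetical helper that answers one question only, namely which nodes dbmeasure-rg could move to if it asked to leave the node where ora-rg is running.

```python
def relocation_targets(affinity, nodelist, ora_node):
    """Toy model: nodes dbmeasure-rg may relocate to, given where ora-rg runs."""
    others = [node for node in nodelist if node != ora_node]
    if affinity == "++":
        # dbmeasure-rg may only run where ora-rg runs, and ora-rg stays put,
        # so no other node is eligible.
        return []
    if affinity == "+++":
        # The failover request is delegated to ora-rg: both groups move
        # together, so any other node in the Nodelist is a possible target.
        return others
    raise ValueError(f"unsupported affinity in this sketch: {affinity!r}")


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3"]
    print(relocation_targets("++", nodes, "node1"))
    print(relocation_targets("+++", nodes, "node1"))
```

In other words, with ++ the dependent group follows but never leads, while with +++ it is allowed to lead as well.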

Example 2: Specifying a preferred collocation of a resource group with another resource group

Assume again a cluster running our Oracle database resource group, ora-rg. On the same cluster, we are running a customer service application that uses the database; this application is configured in a separate failover resource group, app-rg. The application and the database _can_ run on two different nodes, but perhaps we have discovered that the application is database-intensive and runs faster if it is hosted on the same node as the database. Therefore, we prefer to start the application on the same node as the database.

However, it might also be the case that we want to avoid switching the application from one node to another, even if the database changes nodes. To avoid breaking client connections or for some other reason, we would rather keep the application on its current master, even if it incurs some performance penalty.

To achieve these semantics, we give app-rg a weak positive affinity for ora-rg:

clrg set -p RG_affinities=+ora-rg app-rg

With this affinity, the RGM will start app-rg on the same node as ora-rg when possible, but will not force it to always run on the same node.
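
As a rough illustration of "when possible", here is a minimal Python sketch of a start-time choice with a weak positive affinity. It is a toy model, not the RGM's algorithm: choose_start_node and its parameters are hypothetical, and it ignores ping-pong prevention and every other factor.

```python
def choose_start_node(nodelist, available, ora_node):
    """Toy model: pick a start node for app-rg with a weak positive
    affinity for ora-rg."""
    candidates = [node for node in nodelist if node in available]
    if not candidates:
        return None  # nowhere to start; the group stays offline
    if ora_node in candidates:
        return ora_node  # affinity can be satisfied, so honor it
    return candidates[0]  # otherwise fall back to Nodelist order


if __name__ == "__main__":
    nodes = ["node1", "node2"]
    print(choose_start_node(nodes, {"node1", "node2"}, "node2"))
    print(choose_start_node(nodes, {"node1"}, "node2"))
```

The function is consulted only when app-rg is being started. Once app-rg is online, a weak affinity does not move it, which is what keeps the application on its current master when the database changes nodes.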

Example 3: Balancing the load of a set of resource groups

Now suppose that we have a cluster that is hosting three independent applications in resource groups app1-rg, app2-rg, and app3-rg. By giving each resource group a weak negative affinity for the other two groups, we can achieve a rudimentary form of load balancing on our cluster:

clrg set -p RG_affinities=-app2-rg,-app3-rg app1-rg
clrg set -p RG_affinities=-app1-rg,-app3-rg app2-rg
clrg set -p RG_affinities=-app1-rg,-app2-rg app3-rg

With these settings, the RGM will try to bring each resource group online on a node that is not currently hosting either of the other two groups. If there are three or more nodes available, this will place each resource group onto its own node. If there are fewer than three nodes available, then the RGM will "double-up" or "triple-up" the resource groups onto the available node(s). Conceptually, the resource group with weak negative affinity is trying to stay away from the other group, sort of like electrostatic charges that repel one another.
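
The repelling effect can be simulated with a short Python sketch. It is a toy model with a loudly stated assumption: we assume that, when no empty node is left, a group goes to the node hosting the fewest of the groups it is trying to avoid, with ties broken by node order. The post does not say how the RGM breaks such ties, and place_groups is a hypothetical helper.

```python
def place_groups(groups, nodes):
    """Toy model: bring groups online one by one, each with a weak negative
    affinity for all the others."""
    placement = {}
    for group in groups:
        # Count how many already-placed groups each node hosts.
        load = {node: 0 for node in nodes}
        for node in placement.values():
            load[node] += 1
        # min() returns the first node with the smallest count, so ties
        # are broken by the order of the nodes list (an assumption).
        placement[group] = min(nodes, key=lambda node: load[node])
    return placement


if __name__ == "__main__":
    rgs = ["app1-rg", "app2-rg", "app3-rg"]
    print(place_groups(rgs, ["node1", "node2", "node3"]))
    print(place_groups(rgs, ["node1", "node2"]))
```

With three nodes each group gets its own node; with two nodes one of them hosts two groups, which is the "double-up" case described above.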

Example 4: Specifying that a critical service has precedence

In this example, a critical service -- let's say it's our Oracle database in the ora-rg resource group -- is sharing the cluster with a non-critical service, for example, a prototype of a newer version of our software which is undergoing testing and development. Supposing that we have a two-node cluster, we want ora-rg to start on one node, and the test-rg to start on the other node. Suppose that the first node, hosting ora-rg, dies, causing ora-rg to fail over to the second node. We want the non-critical service in test-rg to go offline on that node.

To accomplish this behavior, we give test-rg a strong negative affinity for ora-rg:

clrg set -p RG_affinities=--ora-rg test-rg

When the first node dies and ora-rg fails over to the second node where test-rg is currently running, test-rg will get "bumped off" of the second node and will remain offline (assuming a two-node cluster). When the first node reboots, it takes on the role of backup node, and test-rg is automatically started on it.
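
The sequence of events can be traced with a small Python sketch of the two-node scenario. This is a toy model, not RGM code: reconcile and its arguments are hypothetical names, and the model knows about only these two groups.

```python
def reconcile(nodes_up, ora_node, test_node):
    """Toy model: ora-rg is placed first; test-rg (strong negative affinity
    for ora-rg) may only run on a node that ora-rg is not using."""
    if ora_node not in nodes_up:
        # ora-rg lost its node: fail over to the first surviving node.
        ora_node = nodes_up[0] if nodes_up else None
    free = [node for node in nodes_up if node != ora_node]
    if test_node not in free:
        # test-rg is bumped off (or was offline): restart it on a free
        # node if there is one, otherwise leave it offline (None).
        test_node = free[0] if free else None
    return ora_node, test_node


if __name__ == "__main__":
    state = ("node1", "node2")                    # (ora-rg, test-rg)
    state = reconcile(["node2"], *state)          # node1 dies
    print(state)
    state = reconcile(["node1", "node2"], *state) # node1 reboots
    print(state)
```

After node1 dies, ora-rg lands on node2 and test-rg goes offline. After node1 comes back, ora-rg stays where it is and test-rg starts on node1.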

Example 5: Combining different flavors of RG_affinities to achieve more complex behavior

In the Sun Cluster HA for SAP Replicated Enqueue Service, we configure the enqueue server in one resource group, rg-enq, and the replica server in a second resource group, rg-repl.

A requirement of this data service is that the enqueue server, if it fails on the node where it is currently running, must fail over to the node where the replica server is running. The replica server needs to move immediately to a different node. Setting a weak positive affinity from the enqueue server resource group to the replica server resource group ensures the enqueue server resource group will fail over to the node where the replica server is currently running:

clrg set -p RG_affinities=+rg-repl rg-enq

Setting a strong negative affinity from the replica server resource group to the enqueue server resource group ensures the replica server resource group is offloaded from the replica server node, before the enqueue server resource group is brought online on the same node:

clrg set -p RG_affinities=--rg-enq rg-repl

The replica server resource group will be started up on another node if one is available.
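
The combined effect of the two affinities can be walked through with a minimal Python sketch. It is a toy model under our own assumptions, not the data service's real logic: enqueue_failover is a hypothetical helper, and eligible_nodes stands for the nodes that can still host a group after the enqueue server has failed (the failed node is assumed to be excluded by the caller).

```python
def enqueue_failover(eligible_nodes, repl_node):
    """Toy model: return (new rg-enq node, new rg-repl node) after the
    enqueue server fails on its current node."""
    if not eligible_nodes:
        return None, None
    # Weak positive affinity: rg-enq prefers the node running rg-repl.
    if repl_node in eligible_nodes:
        new_enq = repl_node
    else:
        new_enq = eligible_nodes[0]
    # Strong negative affinity: rg-repl must leave the node rg-enq takes,
    # and restarts on another node only if one is available.
    free = [node for node in eligible_nodes if node != new_enq]
    new_repl = free[0] if free else None
    return new_enq, new_repl


if __name__ == "__main__":
    # rg-enq failed on node1; rg-repl was running on node2.
    print(enqueue_failover(["node2", "node3"], "node2"))
    print(enqueue_failover(["node2"], "node2"))
```

With a third node available, the replica restarts there; with only one node left, the replica stays offline until another node joins.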

Thus by using the simple declarative mechanism of RG_affinities, we can achieve robust recovery behavior for our data services running on Sun Cluster.

Martin Rattner
Sun Cluster Engineering

Posted at 12:00AM Nov 29, 2006 in Sun  |  Comments[17]

Comments:

interesting

Posted by 60.208.167.212 on April 27, 2007 at 01:07 AM PDT #

Thanks for the answer. That helps.

Posted by 207.45.248.18 on May 29, 2007 at 01:20 AM PDT #

Hi,

For example:
I have two resource groups, each of them having two resources, like:
rg1 - rg1rs1 rg1rs2
rg2 - rg2rs1 rg2rs2

Now I want rg1 and all of its resources to start first then rg2 and its resources to start.

Now if I set dependency like this:
clrg set -p RG_Dependencies=rg1 rg2

Do I need to set the resource dependencies as well? If yes then what is the difference between RG_Dependencies & Resource_dependencies?

Regards,
Mohammad Ali

Posted by Mohammad Ali on June 17, 2008 at 10:02 AM PDT #

The RG_dependencies are a weaker form of dependency, in that they are applied only within a given node. In Ali's example above, if rg1 and rg2 are starting on the same node, then both resources of rg1 would start on that node before any resources of rg2 would be started. However, in the event that rg1 and rg2 are starting on two different nodes, there would be no guarantees about start ordering of the resources. [Note, if you wanted to force rg1 and rg2 to always start on the same node, you could use RG_affinities.]

The only way to enforce resource start ordering across different nodes is to use resource dependencies:

clrs set -p resource_dependencies=rg1rs1,rg1rs2 rg2rs1 rg2rs2

Note, RG_dependencies was an older feature which we continue to support. However, it mostly has been superseded by resource dependencies.

Posted by Martin Rattner on June 17, 2008 at 12:19 PM PDT #

Hi Martin Rattner,

Thanks a lot for your quick response. This blog is really helpful. I appreciate it.

One more question, please...
On the same box, will the global zone and a local zone be considered two different nodes for resource group or resource dependencies?

Regards,
Mohammad Ali

Posted by Mohammad Ali on June 17, 2008 at 11:11 PM PDT #

That's a good question. The answer is yes, currently each zone including the global zone is considered as if it were a different "node" on which to locate a resource group.

Currently, resource group affinities work only at the zone level. For example, if rg1 has a strong positive affinity for rg2, then rg1 must start in the same zone (global or non-global) in which rg2 is started.

We have discovered that many applications would prefer a physical node affinity; that is, rg1 should start on the same physical node (but not necessarily in the same zone) as rg2. We are working on this enhancement to RG affinities behavior, and hope to provide it in a future release of Solaris Cluster and/or Open HA Cluster.

Posted by Martin Rattner on June 17, 2008 at 11:55 PM PDT #

Hi Martin Rattner,

I really appreciate your support. Thanks again.

Regards,
Mohammad Ali

Posted by Mohammad Ali on June 18, 2008 at 12:28 AM PDT #

Hi Martin,

In your example 4, when a resource group has precedence over another, what happens if the subordinate resource group is only configured on one node?

I have a two-node cluster with a production Oracle database failing over between nodes, and a QA Oracle database running solely on node 2. My customer doesn't want his QA Oracle database to fail over off of node 2, but to be taken offline. Setting the RG_affinities parameter to "--<proddb-rg>" in my qadb-rg resource group doesn't appear to do this.

Also, suppose I have an application group where I want the same behaviour, but against each group, i.e. the RG_affinities for my qadb-rg needs to be "--<proddb-rg> --<prodapp-rg>". I basically want an "OR" behaviour and not an "AND" behaviour?

Regards

Andy

Posted by Andrew Dibbins on July 11, 2008 at 06:14 AM PDT #

Andy,

In example 4, if the subordinate RG (in your example, the QA Oracle database) is configured to run on only one node, and it declares a strong negative affinity for the production RG; then if the production RG fails over onto that node, the subordinate RG will go offline as your customer would want.

Setting RG_affinities=--<proddb-rg> on the subordinate RG should have the desired effect of forcing it offline. I don't know why you are not observing this behavior. Check for syslog messages in /var/adm/messages on each node to see if you can find further information about what is happening.

In your second example where an application RG declares strong negative affinity for two different production RGs, then either production rg alone (if it fails over onto the node where the application rg is running) should force the subordinate RG offline. I think this is what you are referring to as "OR" behavior and not AND behavior.

Posted by Martin Rattner on July 11, 2008 at 12:26 PM PDT #

If you've been experimenting with RG affinities, you might have noticed the following problem:

Suppose that two resource groups run on the same set of physical nodes, but are configured to run in different zones within those nodes. For example, suppose that RG1's nodelist is "node1:zoneA,node2:zoneA" while RG2's nodelist is "node1:zoneB,node2:zoneB". In this case, RG_affinities cannot currently be used between RG1 and RG2, because the affinity is interpreted at the zone level, i.e., that the two RGs should run in the same zone or in different zones.

The question is: Is there a way to get the RG_affinities to ignore the zone component of the node list, and just to concentrate on the physical machines, rather than the combination of physical machine and zone?

This issue is considered to be a design defect, addressed by OpenSolaris change request number 6443496. A fix is underway.

Posted by Martin Rattner on July 23, 2008 at 11:35 AM PDT #

Correction to the preceding comment: The change request number given (6443496) is a Sun CR number, not an OpenSolaris CR number. I still need to get familiar with the OpenSolaris processes, so I am not sure how the community gets access to existing Sun Cluster (now Open HA Cluster) bug reports.

Posted by Martin Rattner on July 23, 2008 at 04:49 PM PDT #

What about the case where you want each node to have locally installed binaries but still have the HA Oracle (failover) depend on those filesystems (devices) online before the Oracle RG starts to turn up?
thanks!

Posted by David Schramm on August 06, 2008 at 02:39 PM PDT #

Re. David's question about locally installed Oracle binaries: Even though locally installed binaries are not strictly HA resources, you can use an HAStoragePlus resource on each node to manage the availability of the local file system on that node. The HASP resource for each node would be configured in its own single-mastered resource group (i.e., there would be one node in the RG nodelist). The failover ha-oracle resource group would declare RG_dependencies upon all of the single-node HASP resource groups. The RGM has special logic in this case, such that RG_dependencies upon single-node RGs are enforced only on the local node where HA-oracle is going online.

If anyone is curious about the details, this RGM feature is described in Sun change request (CR) number 4778869.

Posted by Martin Rattner on August 06, 2008 at 06:06 PM PDT #

Ok, I set up a special HASP RG for the Oracle Home FS on the "local" diskset. I set the dependency like this:
"clrs set -p Resource_dependencies_weak=orahome-rs{LOCAL_NODE} oracle_serverdb480rd1-rs"

It appears to be working. Our local diskset has multiple file systems so will I still be able to create multiple HASP RG's for them? This will allow me to set weak dependencies on those FS' for the other various resource groups which rely on them being there before they mount up/run.
thanks for your help.

Posted by Dave Schramm on August 07, 2008 at 05:35 AM PDT #

I have also been working the forum to see if Tim or anyone has some ideas on this. Please check this out and comment....
http://forums.sun.com/thread.jspa?threadID=5320982&tstart=0

Posted by Dave Schramm on August 07, 2008 at 07:28 AM PDT #

OK, you opted to use Resource_dependencies_weak. That should work OK if the HASP resource goes online. However, if the HAStoragePlus resource remains offline, the Oracle resource will attempt to start anyway due to the weak dependency. To avoid this problem, you can set a strong dependency (Resource_dependencies) with the {LOCAL_NODE} qualifier. This will cause the dependency to be enforced only on the local node where the Oracle RG is going online. For more info, see the r_properties(5) man page.

In my earlier comment above, I had suggested using RG_dependencies instead of resource dependencies. That should also work as an alternative.

Posted by Martin Rattner on August 07, 2008 at 10:04 AM PDT #

Just to give an update on the physical-node affinities feature (Change Request #6443496, mentioned in the comments above).

This feature was integrated in Solaris Cluster 3.2 1/09, the same update release in which Zone Cluster support was introduced. Physical-node affinities are now the default behavior, except when zones are configured as logical nodes (usually for prototyping or demo purposes) on a single-machine cluster.

Posted by Martin Rattner on July 27, 2009 at 01:41 PM PDT #
