Wednesday Feb 07, 2007
Wednesday Feb 07, 2007
I've often been asked what value the Sun Cluster 3.x software brings to Oracle 10g R2 RAC deployments when 10g already has a built-in cluster framework. The short answer is increased robustness, availability and, above all, data integrity. These qualities stem mainly from Sun Cluster's membership and failure fencing features that have been present in the product since its initial release.
To try and condense the all technical arguments behind these assertions into a suitable blog posting would be almost impossible. Instead, my colleagues and I have written a whitepaper that provides detailed technical information on the benefits of using Sun Cluster 3.x when deploying Oracle 10g R2 RAC databases. The whitepaper can be downloaded from the Sun web site.
Tim Read
Staff Engineer
Monday Oct 16, 2006
Oracle is by far and away the most popular service running on Sun Cluster 3.x . Sun Cluster supports highly available (HA) Oracle, Oracle Parallel Server (OPS) and Oracle Real Application Cluster ( RAC ) giving users a very wide choice. Here it is the breadth of release, operating system and platform coverage that drives its appeal.
The HA Oracle agent on SPARC supports a long list of Oracle releases from 8.1.6.x on Solaris 8 to 10.2.0.x on Solaris 10 and numerous options in between. Additionally, the Sun Cluster 3.1u4 for (64 bit) x86 HA Oracle agent supports Oracle 10g R1 (32 bit) and 10g R2 (64 bit).
The parallel database coverage is similarly extensive with the SPARC platform supporting a broad set of volume manager (Solaris Volume Manager and Veritas Volume Manager) and Oracle releases from 8.1.7 up to 10.2.0.x. In addition Oracle 10g R2 (10.2.0.x) is also supported on the 64 bit x86 platform.
There are also a wide set of Oracle data storage options: raw disk, highly available local file systems and global file systems for HA Oracle; raw disk or network attached storage for Oracle OPS and raw disk, network attached storage or shared QFS file system for Oracle RAC.
But why even mention that Sun support these releases, why don't Sun support all releases in every hardware and software combination? The answer is that high availability is Sun Cluster's number one goal and achieving this doesn't happen by accident. It demands careful design and implementation of the software using extensive peer review of all code changes, followed by extremely thorough testing.
Having only joined the engineering group in the last year or so, I was staggered by the sheer volume of testing that is actually performed. It was also encouraging to see how close the engineering relationship was with Oracle too. For the recent release of Oracle 10g R2 on 64 bit x86 Solaris, the team I work with performed numerous Oracle designed tests on the product. These checked the installation process, its 'flexing' capability, i.e. adding or removing nodes, and its co-existence with previous releases, each for the various types of storage option. These tests numbered in the 100s and often required re-tests if bugs were found and these were just the Oracle mandated tests. In addition the Sun Cluster QA performed extensive load and fault injection tests.
It's these latter two items that set Sun Cluster apart in the robustness stakes. What makes an insurance policy worth the investment is the degree of confidence the user has that it will 'do the right thing' when a failure occurs. When system is sick or under load, user land processes often don't respond or may only respond after a long delay. It may also be difficult to determine whether other cluster nodes are alive or dead. Here, Sun Cluster comes into its own; the kernel based membership monitor very quickly determines whether cluster nodes are alive or not and takes action, i.e. failure fencing, to ensure that failed or failing nodes do not corrupt critical customer data.
By using automated test harnesses, Sun Cluster's Quality Assurance (QA) team are able to simulate a wide variety of fault conditions, e.g. killing critical processes or aborting nodes. These can be performed repeatably at any point during the test cycle. Faults are also injected even while the cluster is recovering from previous faults. In addition, the QA team perform a comprehensive set of manual, physical fault injections, such as disconnecting network cables and storage connections. All of this helps ensure that the cluster survives and continues to provide service, even in the event of cascading failures, and under extreme load.
This level of "certification", rather than simple functional regression testing, means that Sun Cluster has the capability to achieve levels of service availability that competing products may struggle to match.
Tim Read
Staff Engineer