This post discusses an issue encountered installing Oracle 11.2.0.2 on a two-node x86-64 cluster running Oracle Enterprise Linux 5 Update 5.
The Grid Infrastructure installation went smoothly until we tried to run root.sh on the second node. The script failed with the following error:
CRS-4402: The CSS daemon was started in exclusive mode but found an active CSS daemon on node <nodename>, number 1, and is terminating
An active cluster was found during exclusive startup, restarting to join the cluster
Failed to start Oracle Clusterware stack
Failed to start Cluster Synchorinisation Service in clustered mode at /u01/app/11.2.0.2/grid/crs/install/crsconfig_lib.pm line 1016
/u01/app/11.2.0.2/grid/perl/bin/perl -I/u01/app/11.2.0.2/grid/perl/lib -I/u01/app/11.2.0.2/grid/crs/install /u01/app/11.2.0.2/grid/crs/install/rootcrs.pl execution failed
We decided to try installing Oracle 11.2.0.1 and then upgrading to Oracle 11.2.0.2. This process requires the latest 11.2.0.1 Grid Infrastructure PSU to prevent a failure when the rootupgrade.sh script is executed on the first node. Once this patch had been applied, rootupgrade.sh succeeded on the first node, but failed on the second node with the error described above.
This issue is described in MOS Note 1212703.1 “Grid Infrastructure install or upgrade may fail due to Multicasting”.
If multicasting is not enabled on the private network, then root.sh will be successful on the first node, but will fail on the second and subsequent nodes when attempting to start CSSD. This affects both installations and upgrades.
Multicasting is required to enable the new HAIP interconnect feature. If multicast is not enabled, the node will not be able to join the cluster.
According to the note, the only solution is to enable multicasting on the private network (interconnect). This could be nasty on a production system, particularly for an out-of-hours upgrade where the relevant network specialists may not be available to modify the switch configurations.
However, we did some research and it appears that multicasting is already enabled by default in OEL5U5. Each network interface listed by ifconfig already had MULTICAST enabled. However, past experience tells us that just because something is configured at operating system level, we cannot assume it is configured at switch level. Remember jumbo frames?
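The operating system side of that check can also be scripted. The following is a minimal sketch, assuming a Linux host that exposes /sys/class/net/<interface>/flags and a Python 3 interpreter; the function names are illustrative only and the default interface name bond1 is simply the one used in this post. It only reports the interface flag, so it says nothing about the switch.

```python
# IFF_MULTICAST bit as defined in the Linux header <linux/if.h>.
IFF_MULTICAST = 0x1000


def flags_allow_multicast(flags_text):
    """Return True if a hex flags string such as '0x1003' has the MULTICAST bit set."""
    return bool(int(flags_text.strip(), 16) & IFF_MULTICAST)


def interface_allows_multicast(ifname):
    """Read the flags of a Linux network interface from sysfs and test the bit."""
    with open("/sys/class/net/%s/flags" % ifname) as handle:
        return flags_allow_multicast(handle.read())


if __name__ == "__main__":
    import sys

    # Interface name is an assumption; pass your own as the first argument.
    name = sys.argv[1] if len(sys.argv) > 1 else "bond1"
    state = "MULTICAST set" if interface_allows_multicast(name) else "MULTICAST not set"
    print("%s: %s" % (name, state))
```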
Since we originally discovered this problem, Oracle have released a utility to test the availability of multicast addresses. The utility is called mcasttest and can be downloaded from MOS Note 1212703.1 “Grid Infrastructure install or upgrade may fail due to Multicasting”.
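For readers who cannot download the utility straight away, the general idea of such a test can be sketched with plain UDP sockets. This is not Oracle's implementation, which I have not inspected; it is a hypothetical probe that sends one datagram to a multicast group from one node and waits for it on another. The default group 230.0.1.0 and port 42000 are taken from the mcasttest output quoted in this post, and the local IP address of the private interface must be supplied by the caller.

```python
import socket
import struct
import sys


def build_membership_request(group, local_ip):
    """Pack the ip_mreq structure (group address, interface address)."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(local_ip))


def receive_once(group, port, local_ip, timeout=5.0):
    """Join the group on the given interface and wait for one datagram."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        build_membership_request(group, local_ip))
        sock.settimeout(timeout)
        try:
            data, sender = sock.recvfrom(1024)
        except socket.timeout:
            return None  # nothing arrived: multicast may be blocked
        return sender[0], data
    finally:
        sock.close()


def send_once(group, port, local_ip, payload=b"mcast-probe"):
    """Send one datagram to the group through the given interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                        socket.inet_aton(local_ip))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        sock.sendto(payload, (group, port))
    finally:
        sock.close()


if __name__ == "__main__":
    # Usage (hypothetical script name): probe.py send|recv LOCAL_IP [GROUP] [PORT]
    mode, local_ip = sys.argv[1], sys.argv[2]
    group = sys.argv[3] if len(sys.argv) > 3 else "230.0.1.0"
    port = int(sys.argv[4]) if len(sys.argv) > 4 else 42000
    if mode == "recv":
        print(receive_once(group, port, local_ip))
    else:
        send_once(group, port, local_ip)
```

Start the receiver on one node, then run the sender on the other within the timeout. A result of None on the receiver suggests the datagram was dropped somewhere between the two interfaces, but the Oracle utility should be treated as the authoritative test.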
In the environment discussed in this post the mcasttest utility returned the following output:
$ ./mcasttest.pl -n server23,server24 -i bond1
########### Setup for node server23 ##########
Checking node access 'server23'
Checking node login 'server23'
Checking/Creating Directory /tmp/mcasttest for binary on node 'server23'
Distributing mcast2 binary to node 'server23'
########### Setup for node server24 ##########
Checking node access 'server24'
Checking node login 'server24'
Checking/Creating Directory /tmp/mcasttest for binary on node 'server24'
Distributing mcast2 binary to node 'server24'
########### testing Multicast on all nodes ##########
Test for Multicast address 230.0.1.0
Nov 19 11:29:11 | Multicast Failed for bond1 using address 230.0.1.0:42000
Test for Multicast address 224.0.0.251
Nov 19 11:29:12 | Multicast Succeeded for bond1 using address 224.0.0.251:42001
The mcasttest utility first attempts to use 230.0.1.0, which is the default address. It then repeats the test for 224.0.0.251. If the first test fails but the second succeeds, as in the example above, Oracle recommends installing the patch for Bug 9974223 (“Grid Infrastructure needs multicast communication on 230.0.1.0 address working”) on each node in the cluster after installing the Oracle binaries, but before running root.sh or rootupgrade.sh.
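The decision described above can be expressed as a small script. This is only a sketch: it assumes the result lines look exactly like the ones in the output quoted in this post ("Multicast Succeeded/Failed for <interface> using address <ip>:<port>"), and the function names are made up for illustration. Other versions of the utility may print something different.

```python
import re

# Matches result lines in the format shown in the mcasttest output above.
RESULT_LINE = re.compile(
    r"Multicast (Succeeded|Failed) for (\S+) using address "
    r"(\d+\.\d+\.\d+\.\d+):(\d+)"
)


def parse_results(output):
    """Map each tested multicast address to True (succeeded) or False (failed)."""
    return {m.group(3): m.group(1) == "Succeeded"
            for m in RESULT_LINE.finditer(output)}


def advise(results):
    """Turn the two test results into the action discussed in this post."""
    if results.get("230.0.1.0"):
        return "230.0.1.0 works: no multicast patch indicated"
    if results.get("224.0.0.251"):
        return "apply patch 9974223 before running root.sh or rootupgrade.sh"
    return "no successful multicast test: check the private network configuration"


if __name__ == "__main__":
    sample = (
        "Nov 19 11:29:11 | Multicast Failed for bond1 using address 230.0.1.0:42000\n"
        "Nov 19 11:29:12 | Multicast Succeeded for bond1 using address 224.0.0.251:42001\n"
    )
    print(advise(parse_results(sample)))
```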
I have subsequently successfully installed Oracle 11.2.0.2 Grid Infrastructure at another site without any issues. The second site is also a 2-node Linux x86-64 cluster, this time running Red Hat Enterprise Linux 5 Update 4. Both the public and private networks are bonded. In this case Oracle 11.2.0.2 installed without any problems at the first attempt.
We also hit similar issues attempting to install Oracle 11.2.0.2 Grid Infrastructure on a new Solaris 5.10 SPARC cluster. These are described in a separate post.
Yes, it’s a nasty thing that the damn Oracle developers have done it quietly. This is supposed to be a patchset. Also we have so many VCS clusters and we never ran into this multicasting issue. This is something unique to Oracle developers and their software. Oh, well, this is what happens when a database company tries to take over the world by developing clusterware and volume managers (ASM) etc.
Hi Julian
How are things?
Just wondering if you’ve managed to sort out this issue? We’ve run into exactly the same issue trying to do a new install on RHEL 5.4. Currently checking with the network admins if it’s been disabled on the switch…
Cheers
Fred
Hi Fred
It is interesting to see you are hitting this issue in RHEL5U4. We hit it on OEL5U5. I have not been able to return to the original site since I wrote the post. I have been working at a different site this week with almost identical architecture, but we have not encountered the problem here. This new site where it works is RHEL5U4. I will be here for a few more days so feel free to contact me by e-mail and we can compare environments.
I ran into this same issue. After trying several different options, I found that removing NIC bonding at the OS level resolved this issue.
That is a very interesting approach since my understanding is that the new HAIP feature is intended to provide similar functionality to NIC bonding. It is entirely possible that there is some conflict between operating system level bonding and HAIP.
Linux-only users may not be aware, but the Solaris equivalent of NIC bonding is IPMP. However, IPMP has an asymmetrical active/backup configuration that is very different from NIC bonding and which did not work in 11.2.0.1 until a patch was released. I think HAIP is an attempt by Oracle to provide their own private interconnect bonding to eliminate these porting issues. It is possible that HAIP will not work with proprietary bonding solutions.
I’ll have to take back what I posted: unbonding the 2 NICs isn’t the solution for this problem. It only resolved my issue coincidentally, because the one NIC I was using after unbonding was plugged into a switch that had been configured to allow multicast. The issue still lies in the switch’s multicast configuration. In our case, we are using Cisco switches, and we found that if you configure “no ip igmp snooping” on the switch, multicast works; if you remove that configuration, multicast stops working. There’s a Metalink note, 1212703.1, that provides a program to test whether multicast works or not. Configuring the switch with “no ip igmp snooping” doesn’t make a lot of sense, but it does make multicast work.
Thanks for this comment Li Li. From the reaction on this blog and personal messages I have received over the last couple of weeks, there seem to be a lot of people experiencing this problem. It would be really useful to know from others whether this solution works in all cases.
Thanks also for the MOS note number; I will investigate this week.
[...] issue related to Multicasting (required to enable HAIP interconnect feature) in below post http://juliandyke.wordpress.com/2010/10/06/oracle-11-2-0-2-requires-multicasting-on-the-interconnect… Julian also published test program to check if multicasting is enabled on your box [...]
It looks like my test programs may have already been superseded by MOS Note 1212703.1.
Hi Julian,
We hit this issue too on RHEL5u4. Our network team is saying multicasting is enabled and there is nothing else to configure, but when we ran Oracle’s multicasting tests we found that they fail on address 230.0.1.0.
After contacting Oracle Support they’ve told us they’re releasing a patch (patch 9974223 – not yet public) to change the multicasting address to 224.0.0.251. When running the Multicasting tests on this address they work successfully.
Hopefully when Oracle makes this patch public it will fix our problem
Hi Julian
Hope things are good.
We managed to solve the issue a couple of weeks ago too.
To give a bit of background, our installations are two three-node grids in two different data centres. The OS is RHEL 5.4 with EMC DMX-4s for storage and Cisco switches for the private interconnect. Different switches are used in the two data centres (older Cisco switches in the one and newer Cisco Nexus switches at the other).
Initially our networking team indicated that multicasting was configured. Further investigation, however, indicated that it was not. Enabling the switches for multicasting proved to be a trivial task. I think one of your users above indicated that they got it to work by disabling IGMP snooping on the switch. We had to implement another small change in addition to this. Will post details soon.
One interesting thing was that the Oracle-provided utilities to test multicasting (they provide two: one written in Java and one in C) showed multicasting as configured at the one site, but not the other. In both cases it turned out not to be configured. The networking guys are still troubleshooting this.
Just to be clear though, no change is necessary at the OS level.
Cheers
Frederik
Frederik,
Please post all details for enabling switches for multicast. My network team is also indicating that multicasting is configured, but mcast1 (Oracle test program) is not showing it.
Thanks,
Miladin
I’m hitting a similar issue with a fresh installation of 11gR2 on AIX 6.1 too. root.sh failed when starting ora.cluster_interconnect.haip. After patching with 9974223, no luck again, but the failing line in crsconfig_lib.pm changed from 6484 to 1074, with the ASM instance and voting disk successfully created. Case reported to Oracle.