Introduction:
In my earlier post we saw how to upgrade the Grid Infrastructure from 12.1.0.1 to 12.1.0.2 in a Flex Cluster environment. That article also demonstrated how to handle node failure/unavailability issues during the upgrade. Starting with 12c Release 1 (12.1.0.1) there is an option to complete the upgrade even if one or more nodes in the cluster are unreachable because of a hardware or software failure: we can use "rootupgrade.sh -force" to complete the upgrade. The detailed steps for this option are demonstrated in the first part of this article:
http://www.toadworld.com/platforms/oracle/w/wiki/11710.upgrading-oracle-12c-flex-cluster-parti
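For reference, the forced-upgrade step from that article takes roughly the following form (a sketch only; $GI_HOME stands for the new 12.1.0.2 Grid home and the node name is illustrative):

# On the last reachable node, after rootupgrade.sh has completed on all
# other available nodes, force the upgrade to completion (run as root):
[root@flexrac2 ~]# $GI_HOME/rootupgrade.sh -force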
In this article we will see how to join the failed cluster node back to the cluster after the upgrade has completed. The table below lists the current environment of the Flex Cluster:
S.No | Node Name | Current Version | Upgrade Status | Node Mode |
1 | flexrac1 | 12.1.0.2 | completed | Cluster Hub-Node |
2 | flexrac2 | 12.1.0.2 | completed | Cluster Hub-Node |
3 | flexrac3 | 12.1.0.1 | Failed (Hardware Issue) | Cluster Hub-Node |
4 | flexrac4 | 12.1.0.2 | completed | Cluster Leaf-Node |
5 | flexrac5 | 12.1.0.2 | completed | Cluster Leaf-Node |
We can see from the table above that cluster node "flexrac3" failed due to a hardware issue and was not able to complete the grid upgrade to 12.1.0.2. Prior to 12c Grid Infrastructure, if any node failed during the upgrade it had to be brought back online in time to continue the upgrade process; if the node could not be made available within the required time, the failed node had to be deleted from the existing cluster before the upgrade could proceed, and that required downtime.
But starting with 12c we can complete the upgrade even if one or more cluster nodes fail during it, and those nodes can be joined to the upgraded cluster later, whenever they are back and available.
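To confirm the hub/leaf layout and the status of each node shown in the table, checks along these lines can be run as root from the Grid home's bin directory on any available hub node (flexrac1 here is just an example):

# Node list with node numbers and Active/Inactive status:
[root@flexrac1 bin]# ./olsnodes -n -s
# Hub/Leaf role of every node in the Flex Cluster:
[root@flexrac1 bin]# ./crsctl get node role status -all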
The figure below illustrates the overall process of upgrading the cluster when a node fails and how we can handle such situations:
Demonstration:
This demonstration shows how to join a failed node to an already upgraded Flex Cluster. Here "flexrac1, flexrac2, flexrac4 & flexrac5" were successfully upgraded to 12.1.0.2, but "flexrac3" could not complete the upgrade because of the hardware failure. We will see how to join this failed node to the existing cluster once it is back online after the issue has been fixed.
Check the Clusterware version on the failed node:
[root@flexrac3 bin]# ./crsctl query crs releaseversion
Oracle High Availability Services release version on the local node is [12.1.0.1.0]
Check the Cluster Services on flexrac3:
The cluster services on this node should be down and not running before the join is attempted.
[root@flexrac3 bin]# ./crs_stat -t -v
CRS-0184: Cannot communicate with the CRS daemon.
[root@flexrac3 bin]#
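If the stack were still up, or set to start automatically at boot, a quick check and disable along these lines could be used before the join (a sketch; run as root from the Grid home's bin directory on flexrac3):

# Confirm the Clusterware stack is down on the failed node:
[root@flexrac3 bin]# ./crsctl check crs
# Check whether Oracle High Availability Services is set to auto-start:
[root@flexrac3 bin]# ./crsctl config crs
# If needed, keep it from auto-starting until the node has been joined:
[root@flexrac3 bin]# ./crsctl disable crs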
Joining the failed cluster node to the existing upgraded cluster:
[root@flexrac3 install]# perl rootcrs.pl -upgrade -join -existingnode flexrac2
Using configuration parameter file: ./crsconfig_params
2016/10/16 23:05:06 CLSRSC-4015: Performing install or upgrade action for Oracle Trace File Analyzer (TFA) Collector.
2016/10/16 23:05:06 CLSRSC-4012: Shutting down Oracle Trace File Analyzer (TFA) Collector.
2016/10/16 23:05:41 CLSRSC-4013: Successfully shut down Oracle Trace File Analyzer (TFA) Collector.
2016/10/16 23:05:53 CLSRSC-4003: Successfully patched Oracle Trace File Analyzer (TFA) Collector.
2016/10/16 23:05:55 CLSRSC-464: Starting retrieval of the cluster configuration data
2016/10/16 23:06:29 CLSRSC-465: Retrieval of the cluster configuration data has successfully completed.
2016/10/16 23:06:29 CLSRSC-363: User ignored prerequisites during installation
ASM configuration upgraded in local node successfully.
OLR initialization - successful
2016/10/16 23:08:03 CLSRSC-329: Replacing Clusterware entries in file '/etc/inittab'
CRS-4133: Oracle High Availability Services has been stopped.
CRS-4123: Oracle High Availability Services has been started.
2016/10/16 23:10:43 CLSRSC-343: Successfully started Oracle Clusterware stack
clscfg: EXISTING configuration version 5 detected.
clscfg: version 5 is 12c Release 1.
Successfully taken the backup of node specific configuration in OCR.
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
PRCC-1015 : LISTENER was already running on flexrac3
PRCR-1004 : Resource ora.LISTENER.lsnr is already running
2016/10/16 23:11:24 CLSRSC-325: Configure Oracle Grid Infrastructure for a Cluster ... succeeded
[root@flexrac3 install]#
- This command should be executed on flexrac3 from the crs/install directory of the 12.1.0.2 Grid Infrastructure home (GI_HOME/crs/install).
- For the -existingnode argument, specify any cluster node that was successfully upgraded and is currently available in the cluster (the general form is sketched below).
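In general form (a sketch; $GI_HOME stands for the new 12.1.0.2 Grid home and <upgraded_node> is a placeholder for an already upgraded, available node):

# Run as root on the failed node:
[root@flexrac3 ~]# cd $GI_HOME/crs/install
[root@flexrac3 install]# perl rootcrs.pl -upgrade -join -existingnode <upgraded_node>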
Now we can see that the node has been successfully joined to the existing upgraded Grid Infrastructure cluster.
Check the Clusterware version after joining the node:
[root@flexrac3 bin]# ./crsctl query crs activeversion
Oracle Clusterware active version on the cluster is [12.1.0.2.0]
[root@flexrac3 bin]# ./crsctl query crs softwareversion
Oracle Clusterware version on node [flexrac3] is [12.1.0.2.0]
[root@flexrac3 bin]#
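Beyond the version checks, the join can be verified cluster-wide with checks along these lines (a sketch; run as root from the Grid home's bin directory):

# Cluster-wide health of the Clusterware stack:
[root@flexrac3 bin]# ./crsctl check cluster -all
# Confirm flexrac3 now appears as an active cluster member:
[root@flexrac3 bin]# ./olsnodes -n -s
# Optionally, review resource status across the cluster:
[root@flexrac3 bin]# ./crsctl stat res -t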
Recommendation:
The Oracle documentation says to use the "rootupgrade.sh" script to join the cluster, which I believe is not correct. The correct script to use for joining the cluster node in such situations is "rootcrs.pl".
The documentation states:
B.8.3 Upgrading Inaccessible Nodes After Forcing an Upgrade
Starting with Oracle Grid Infrastructure 12c, after you complete a force cluster upgrade command,
you can join inaccessible nodes to the cluster as an alternative to deleting the nodes,
which was required in earlier releases. To use this option, Oracle Grid Infrastructure 12c Release 1 (12.1) software must already be installed on the nodes.
To complete the upgrade of inaccessible or unreachable nodes:
Log in as the Grid user on the node that is to be joined to the cluster.
Change directory to the /crs/install directory in the Oracle Grid Infrastructure 12c Release 1 (12.1) Grid home. For example:
$ cd /u01/12.1.0/grid/crs/install
Run the following PERL command, where existingnode is the name of the option and upgraded_node is the upgraded node:
$ rootupgrade.sh -join -existingnode upgraded_node
Note:
The -join operation is not supported for Oracle Clusterware releases earlier than 11.2.0.1.0. In such cases, delete the node and add it to the clusterware using the addNode command.
The highlighted section above is taken from "https://docs.oracle.com/database/121/CWAIX/procstop.htm#CWAIX623".
The documentation asks us to run a Perl command, yet the command it shows is the shell script rootupgrade.sh; moreover, you cannot find the rootupgrade.sh script in the $GI_HOME/crs/install directory. If we look at the header of the "rootcrs.pl" script, it clearly states that the script can be used for joining an upgraded cluster.
Header of "rootcrs.pl" script:
# madoming 10/28/13 - Add validation for current working directory
# xyuan 10/07/13 - Fix bug 17549800: add -rollback option for
# rootcrs.pl
# sidshank 04/07/13 - Add an option LANG to be passed during automatic
# execution.
# xyuan 08/31/12 - Fix bug 14535011 - Add '-init' option
# xyuan 08/12/12 - Fix bug 14464512
# xyuan 04/24/12 - Add options for joining an upgraded cluster <<--- option for joining node with upgrade option
# sidshank 04/20/12 - adding -auto option to be used internally by root
# automation alone.
So the correct script to use in such situations is "$GI_HOME/crs/install/rootcrs.pl".
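This can be verified on your own system with a quick look like the following (a sketch; in my environment rootupgrade.sh sits in the Grid home root while rootcrs.pl lives under crs/install):

# Locate the two scripts within the Grid home:
[root@flexrac3 ~]# ls -l $GI_HOME/rootupgrade.sh $GI_HOME/crs/install/rootcrs.pl
# Look for the join-related options mentioned in the rootcrs.pl header:
[root@flexrac3 ~]# grep -in "join" $GI_HOME/crs/install/rootcrs.pl | head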
Conclusion:
Before 12cR1, all cluster nodes had to be available during the upgrade; if any node crashed during the upgrade process, the upgrade could not continue until the issue was fixed or the problematic node was removed from the existing cluster. But starting with 12c this has been simplified considerably.
If a cluster node fails during the upgrade, that is fine >>> complete the upgrade using the "rootupgrade.sh -force" option.
Once the failed node is back after the issues have been resolved >>> join it to the existing upgraded cluster using "perl rootcrs.pl -upgrade -join -existingnode <upgraded_node>".
These options are really helpful in large-scale cluster environments.