Fixing Oracle RAC Node Problems With Addnode: GRID Binaries

by Rodrigo Lima, September 4th, 2023


Previously, I explained how to fix a node where only the DB binaries were corrupted. Now, let's try to fix a node with only the GRID binaries corrupted.


My configuration
Oracle RAC with 2 nodes: ol8-19-rac1 and ol8-19-rac2
CDB: cdbrac
PDB: pdb1

Grid Version
34318175;TOMCAT RELEASE UPDATE 19.0.0.0.0 (34318175)
34160635;OCW RELEASE UPDATE 19.16.0.0.0 (34160635)
34139601;ACFS RELEASE UPDATE 19.16.0.0.0 (34139601)
34133642;Database Release Update : 19.16.0.0.220719 (34133642)
33575402;DBWLM RELEASE UPDATE 19.0.0.0.0 (33575402)

DB Version
34086870;OJVM RELEASE UPDATE: 19.16.0.0.220719 (34086870)
34160635;OCW RELEASE UPDATE 19.16.0.0.0 (34160635)
34133642;Database Release Update : 19.16.0.0.220719 (34133642)

When you see <dbenv>, load the DB HOME variables.
When you see <gridenv>, load the GRID HOME variables.

When you see <bnode>, execute the command on the broken node.
When you see <anode>, execute the command on any other working node.


I am using an installation where both the DB and GRID homes are owned by the same user, ORACLE, and I set environment variables to switch between the two environments. The same procedure works even if the installation uses two different users (usually oracle and grid).
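For reference, this is a minimal sketch of the aliases I keep in .bash_profile to load each environment. The grid home matches the one used in this article, but the DB home path and the SIDs are assumptions; adjust them to your own installation.

# Sketch only: paths and SIDs below are examples, not taken from this cluster's actual DB home.
alias gridenv='export ORACLE_HOME=/u01/app/19.0.0/grid; export ORACLE_SID=+ASM2; export PATH=$ORACLE_HOME/bin:$PATH'
alias dbenv='export ORACLE_HOME=/u01/app/oracle/product/19.0.0/dbhome_1; export ORACLE_SID=cdbrac2; export PATH=$ORACLE_HOME/bin:$PATH'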


Always validate any procedure before you try it in a production environment.


Scenario B - GRID Binaries Corrupted On ol8-19-rac2 After Patch Apply

In this scenario, only the GRID binaries were affected, so let's try to preserve the DB binaries. The stack is already down since the binaries are corrupted.


01. Backup the OCR configuration
================================
As root, on any other node, create a backup of the OCR. Better safe than sorry.


<anode gridenv root>
[root@ol8-19-rac2 scripts]# ocrconfig -manualbackup
ol8-19-rac2     2022/11/17 13:40:44     +CRS:/ol8-19-cluster/OCRBACKUP/backup_20221117_134044.ocr.258.1121002845     896235792
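If you want to confirm that the backup was written, you can list the manual OCR backups from any working node; the command below is read-only and changes nothing.

<anode gridenv root>
[root@ol8-19-rac1 ~]# ocrconfig -showbackup manual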


02. Check whether both nodes are unpinned (they must be Unpinned to proceed)
============================================================================
<anode gridenv>
[oracle@ol8-19-rac1 ~]$ olsnodes -s -t
ol8-19-rac1     Active  Unpinned
ol8-19-rac2     Active  Unpinned

If a node is pinned, unpin it first with crsctl unpin css, as shown below.
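For example, to unpin the broken node (run as root from a working node, and only if the previous output shows Pinned):

<anode gridenv root>
[root@ol8-19-rac1 ~]# crsctl unpin css -n ol8-19-rac2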


03. Backup the $ORACLE_HOME/network/admin
=========================================
<bnode gridenv>
[oracle@ol8-19-rac2 ~]$ mkdir -p /tmp/oracle; tar cvf /tmp/oracle/grid_netadm.tar -C $ORACLE_HOME/network/admin .
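Keep this tar file around: once the node is back in the cluster (after step 09), you can restore any custom listener/sqlnet configuration with the reverse command. This is just a sketch; only restore the files if you actually had customizations in that directory.

<bnode gridenv>
[oracle@ol8-19-rac2 ~]$ tar xvf /tmp/oracle/grid_netadm.tar -C $ORACLE_HOME/network/admin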


04. Deinstall the GRID binaries
===============================
<bnode gridenv>
[oracle@ol8-19-rac2 ~]$ $ORACLE_HOME/deinstall/deinstall -local
Confirm that everything is OK, then answer yes.

Afterwards, execute the root step in another terminal session.
[root@ol8-19-rac2 ~]# /u01/app/19.0.0/grid/crs/install/rootcrs.sh -force  -deconfig -paramfile "/tmp/deinstall2022-11-17_01-49-44PM/response/deinstall_OraGI19Home1.rsp"
...
2022/11/17 13:58:09 CLSRSC-336: Successfully deconfigured Oracle Clusterware stack on this node

Now, press enter to continue and finish the deinstall.
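If you want to double-check that the deinstall detached the grid home on the broken node, you can look for it in the central inventory; after a successful deinstall, the entry should be gone. The inventory path below is the usual default under /u01/app and may differ in your environment.

<bnode>
[oracle@ol8-19-rac2 ~]$ grep -i OraGI19Home1 /u01/app/oraInventory/ContentsXML/inventory.xml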


05. Remove the broken node from the cluster
===========================================
<anode gridenv root>
[root@ol8-19-rac1 ~]# crsctl delete node -n ol8-19-rac2
CRS-4661: Node ol8-19-rac2 successfully deleted.
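You can quickly confirm the removal with olsnodes from the working node; only ol8-19-rac1 should be listed now.

<anode gridenv>
[oracle@ol8-19-rac1 ~]$ olsnodes -s -t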


06. Remove the broken node's VIP
================================
<anode gridenv root>
[root@ol8-19-rac1 ~]# srvctl stop vip -vip ol8-19-rac2
[root@ol8-19-rac1 ~]# srvctl remove vip -vip ol8-19-rac2
Please confirm that you intend to remove the VIPs ol8-19-rac2 (y/[n]) y
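To confirm the VIP is really gone, you can query its configuration; srvctl should now report that this VIP does not exist.

<anode gridenv>
[oracle@ol8-19-rac1 ~]$ srvctl config vip -vip ol8-19-rac2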


07. Check if the cluster is OK after the node removal
=====================================================
<anode gridenv>
[oracle@ol8-19-rac1 ~]$ cluvfy stage -post nodedel -n ol8-19-rac2 -verbose

Performing following verification checks ...

  Node Removal ...
    CRS Integrity ...PASSED
    Clusterware Version Consistency ...PASSED
  Node Removal ...PASSED

Post-check for node removal was successful.

CVU operation performed:      stage -post nodedel
Date:                         Nov 17, 2022 2:04:22 PM
Clusterware version:          19.0.0.0.0
CVU home:                     /u01/app/19.0.0/grid
Grid home:                    /u01/app/19.0.0/grid
User:                         oracle
Operating system:             Linux5.4.17-2136.312.3.4.el8uek.x86_64


08. Pre node addition validation
================================
If we were adding an actual new node, we would run the cluster verification checks to confirm that everything is OK. Since this node was already part of the cluster, I usually skip this validation, but you can run it if you want.

[oracle@ol8-19-rac1 ~]$ cluvfy stage -pre nodeadd -n ol8-19-rac2 -verbose -fixup


09. Add the node back
=====================
From any other node, add the formerly broken node back. This step takes a while, as it copies the binaries from the working node to the broken one.

<anode gridenv>
[oracle@ol8-19-rac1 ~]$ $ORACLE_HOME/addnode/addnode.sh -silent "CLUSTER_NEW_NODES={ol8-19-rac2}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={ol8-19-rac2-vip}" "CLUSTER_NEW_NODE_ROLES={hub}"

On the formerly broken node, execute the root script.
<bnode root>
[root@ol8-19-rac2 tmp]# /u01/app/19.0.0/grid/root.sh
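Before moving on, it doesn't hurt to confirm that the stack came up on the rebuilt node; this is optional, since the formal check comes in the next step.

<bnode gridenv>
[oracle@ol8-19-rac2 ~]$ crsctl check crs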


10. Check if the cluster is OK
==============================
From any other node, run the cluster verification.

<anode gridenv>
[oracle@ol8-19-rac1 ~]$ cluvfy stage -post nodeadd -n ol8-19-rac2 -verbose
...
Post-check for node addition was successful.


CVU operation performed:      stage -post nodeadd
Date:                         Nov 17, 2022 2:32:12 PM
Clusterware version:          19.0.0.0.0
CVU home:                     /u01/app/19.0.0/grid
Grid home:                    /u01/app/19.0.0/grid
User:                         oracle
Operating system:             Linux5.4.17-2136.312.3.4.el8uek.x86_64


Voilà, your cluster and your database should be back online.


[oracle@ol8-19-rac2 ~]$ crsctl check cluster -all
**************************************************************
ol8-19-rac1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
ol8-19-rac2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************

[oracle@ol8-19-rac2 ~]$ srvctl status database -db cdbrac
Instance cdbrac1 is running on node ol8-19-rac1
Instance cdbrac2 is running on node ol8-19-rac2
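If the cdbrac2 instance does not come back on its own, you can start it manually. The command below assumes the database resource still knows about that instance, which should be the case here since the DB home and its configuration were untouched.

<anode dbenv>
[oracle@ol8-19-rac1 ~]$ srvctl start instance -db cdbrac -instance cdbrac2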


In my next article, I'll show you how to recover from a scenario where we lost both GRID and DB binaries.