This note discusses how to achieve high availability in an Oracle 11g GoldenGate environment using Oracle 11gR2 Grid Infrastructure Clusterware services.
The GoldenGate Manager process has configuration parameters such as AUTOSTART and AUTORESTART which we can set to protect against outages: they ensure that the GoldenGate Extract and Replicat processes are restarted in the event of a failure. Note, however, that for this to happen the Manager process itself needs to be up and running.
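For illustration, a minimal Manager parameter file (dirprm/mgr.prm) using these parameters might look like the sketch below; the retry and wait values are just examples, not the settings used in this environment.

PORT 7809
AUTOSTART ER *
AUTORESTART ER *, RETRIES 3, WAITMINUTES 5, RESETMINUTES 60

With AUTOSTART the Manager starts all Extract and Replicat groups when it starts, and with AUTORESTART it retries any group that abends – which is exactly why the Manager itself must stay up.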
But what happens if the manager process itself fails as a result of a server or OS crash or network failure or database crash?
So let us look at how to provide high availability for the manager process in a clustered environment.
In this example we are using GoldenGate in a two-node 11gR2 RAC cluster with ASM storage configured.
In a two-node RAC cluster configuration, Oracle GoldenGate runs on only one server at any given time. If that server goes down, GoldenGate is restarted automatically by CRS on the other available node.
While we could install GoldenGate on both nodes, we would then need to ensure that the checkpoint files (located in the dirchk directory) and the trail files reside in a location that is accessible from any node, and that the parameter files are identical on both nodes of the cluster.
So in our case we are using ACFS, the Automatic Storage Management (ASM) Cluster File System, which acts as a shared storage location. We need to install GoldenGate just once in this shared location, and it can then be accessed from either node in the cluster as the ACFS is mounted on both nodes.
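Because the entire GoldenGate home lives under the shared mountpoint, the dirchk checkpoint files and any trail files created under it are automatically visible from both nodes. As a quick sketch (the trail prefix aa is just an example), adding a local trail for the Extract under the shared /goldengate location from GGSCI would look like this:

ADD EXTRACT ext1, TRANLOG, BEGIN NOW
ADD EXTTRAIL /goldengate/dirdat/aa, EXTRACT ext1, MEGABYTES 100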
In addition to configuring the ACFS, we also need to obtain an unused IP address on the public subnet which is registered in DNS – this is the VIP or Virtual IP address. In the event of a node failure, Oracle Clusterware will migrate the VIP to a surviving node in the cluster.
Have a read of the Oracle white paper on the subject – I have followed that in setting up the GoldenGate clusterware.
Oracle White Paper—Oracle GoldenGate high availability with Oracle Clusterware
Let us now take a look at the steps involved.
Install the GoldenGate software on the shared location
The first thing we are doing here is to set up and configure ACFS and install the 11gR2 GoldenGate software in this shared location.
In our case, we have a mountpoint /goldengate which is the ASM Cluster File System and this file system is the shared location for the GoldenGate software mounted on both nodes in the cluster.
In short: create the ASM disk group, then the ASM volume, and finally the ASM Cluster File System on that volume, as sketched below.
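As a rough sketch of those steps – the disk group name ACFS_DG, the volume name ggvol and the volume device suffix are placeholders, not the actual names used here. Create the disk group first (via asmca or a CREATE DISKGROUP statement), then:

As the grid user, create the ASM volume and note its device name:
asmcmd volcreate -G ACFS_DG -s 10G ggvol
asmcmd volinfo -G ACFS_DG ggvol

As root, create the ACFS file system on that volume device and mount it on both nodes:
/sbin/mkfs -t acfs /dev/asm/ggvol-123
mkdir -p /goldengate
mount -t acfs /dev/asm/ggvol-123 /goldengate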
We then install the GoldenGate 11gR2 software in this location. As it is a shared location, we need to install GoldenGate just once.
Create the application VIP
From the GRID_HOME/bin directory run the appvipcfg command to create the application VIP. Oracle Clusterware assigns this VIP to a physical server in the cluster and will migrate the VIP to a surviving node in the cluster in the event of a server failure.
As root:
[root@mycorp-racnode1 bin]# ./appvipcfg create -network=1 -ip=10.50.20.52 -vipname=mycorp-oragg-vip -user=root
Production Copyright 2007, 2008, Oracle. All rights reserved
2012-11-28 03:21:48: Skipping type creation
2012-11-28 03:21:48: Create the Resource
2012-11-28 03:21:48: Executing /u01/app/11.2.0.3/grid/bin/crsctl add resource mycorp-oragg-vip -type app.appvip_net1.type -attr "USR_ORA_VIP=10.50.20.52,START_DEPENDENCIES=hard(ora.net1.network) pullup(ora.net1.network),STOP_DEPENDENCIES=hard(ora.net1.network),ACL='owner:root:rwx,pgrp:root:r-x,other::r--,user:root:r-x',HOSTING_MEMBERS=mycorp-racnode1,APPSVIP_FAILBACK="
2012-11-28 03:21:48: Executing cmd: /u01/app/11.2.0.3/grid/bin/crsctl add resource mycorp-oragg-vip -type app.appvip_net1.type -attr "USR_ORA_VIP=10.50.20.52,START_DEP
We also have to grant the Grid Infrastructure owner (grid) and the Oracle database and GoldenGate software owner (oracle) permission to run the script which starts the VIP.
As root:
[root@mycorp-racnode1 bin]# ./crsctl setperm resource mycorp-oragg-vip -u user:oracle:r-x
[root@mycorp-racnode1 bin]# ./crsctl setperm resource mycorp-oragg-vip -u user:grid:r-x
As oracle, start the application VIP via the crsctl start resource command from GRID_HOME/bin.
[oracle@mycorp-racnode1 bin]# ./crsctl start resource mycorp-oragg-vip
CRS-2672: Attempting to start 'mycorp-oragg-vip' on 'mycorp-racnode2'
CRS-2676: Start of 'mycorp-oragg-vip' on 'mycorp-racnode2' succeeded
We can now verify whether the VIP is running, and on which node, via the crsctl status resource command.
[oracle@mycorp-racnode1 bin]# ./crsctl status resource mycorp-oragg-vip
NAME=mycorp-oragg-vip
TYPE=app.appvip_net1.type
TARGET=ONLINE
STATE=ONLINE on mycorp-racnode2
From another node in the cluster we should now be able to ping the VIP’s IP address.
Create the Agent script
The agent script is used by Oracle Clusterware to check if the manager process is running and also to stop and start the manager as the case may be.
There is a sample script available in the appendix of the Oracle white paper (linked above) which can be modified accordingly.
Here is the script I have used in my case. You will need to change the location of the GGS_HOME, CRS_HOME, ORACLE_HOME, LD_LIBRARY_PATH, and if you are using ASM, the ASMPASSWORD value which is embedded in the agent script.
This script is called 11gr2_gg_action.sh and is copied to the /goldengate directory location so that it is accessible from either node in the cluster.
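The full script is not reproduced here, but the outline below is a minimal sketch of what such an action script looks like: it implements the start, stop, check and clean entry points that Clusterware calls, driving the Manager through ggsci. The GGS_HOME and ORACLE_HOME paths are assumptions for this environment, and the white paper's version additionally handles things like the ASM login (ASMPASSWORD) and Extract/Replicat status checks.

#!/bin/bash
# Sketch of a Clusterware action script for the GoldenGate Manager.
# GGS_HOME and ORACLE_HOME below are assumptions - adjust for your environment.
GGS_HOME=/goldengate
export ORACLE_HOME=/u01/app/oracle/product/11.2.0.3/dbhome_1
export LD_LIBRARY_PATH=$ORACLE_HOME/lib:$GGS_HOME

case "$1" in
  start)
    # Ask GGSCI to start the Manager
    cd $GGS_HOME
    echo "START MANAGER" | ./ggsci > /dev/null
    exit 0 ;;
  stop|clean)
    # Stop the Manager; the ! suppresses the confirmation prompt
    cd $GGS_HOME
    echo "STOP MANAGER !" | ./ggsci > /dev/null
    exit 0 ;;
  check)
    # Return 0 (ONLINE) only if a Manager started from this GoldenGate home is running
    if ps -ef | grep "PARAMFILE ${GGS_HOME}/dirprm/mgr.prm" | grep -v grep > /dev/null; then
      exit 0
    else
      exit 1
    fi ;;
  *)
    exit 0 ;;
esac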
Register a resource in Oracle Clusterware
We run the crsctl add resource command to register Oracle GoldenGate as a resource in Oracle Clusterware. The resource name in this case is ggate.
[oracle@mycorp-racnode1 bin]# ./crsctl add resource ggate -type cluster_resource
-attr "ACTION_SCRIPT=/goldengate/11gr2_gg_action.sh,CHECK_INTERVAL=30,
START_DEPENDENCIES='hard(mycorp-oragg-vip,ora.asm) pullup(mycorp-oragg-vip)', STOP_DEPENDENCIES='hard(mycorp-oragg-vip)'"
The START_DEPENDENCIES and STOP_DEPENDENCIES attributes indicate that the VIP and the ggate resource should always start and stop together.
Since in our case the Oracle GoldenGate software owner is not the same as the Grid Infrastructure owner, we need to run the crsctl setperm command to set the ownership of the application to the GoldenGate software owner.
As root:
[root@mycorp-racnode1 bin]# ./crsctl setperm resource ggate -o oracle
Start the Application
We can now use Oracle Clusterware to start GoldenGate. We connect as oracle and run the crsctl start resource command from GRID_HOME/bin.
[oracle@mycorp-racnode1 bin]$ ./crsctl start resource ggate
CRS-2672: Attempting to start 'ggate' on 'mycorp-racnode1'
CRS-2676: Start of 'ggate' on 'mycorp-racnode1' succeeded
We can use the crsctl status resource command to verify the state of the resource and which node in the cluster it is currently running on.
[oracle@mycorp-racnode1 bin]$ ./crsctl status resource ggate
NAME=ggate
TYPE=cluster_resource
TARGET=ONLINE
STATE=ONLINE on mycorp-racnode1
If we connect to GoldenGate, we can see that the manager process has been started up automatically.
GGSCI (mycorp-racnode1) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
I found that if the manager process was already running and we then issued the crsctl start resource command, we got the following error message:
[oracle@mycorp-racnode1 bin]$ ./crsctl start resource ggate
CRS-2672: Attempting to start 'ggate' on 'mycorp-racnode1'
CRS-2674: Start of 'ggate' on 'mycorp-racnode1' failed
CRS-2679: Attempting to clean 'ggate' on 'mycorp-racnode1'
CRS-2681: Clean of 'ggate' on 'mycorp-racnode1' succeeded
CRS-2527: Unable to start 'ggate' because it has a 'hard' dependency on 'mycorp-oragg-vip'
CRS-2525: All instances of the resource 'mycorp-oragg-vip' are already running; relocate is not allowed because the force option was not specified
CRS-4000: Command Start failed, or completed with errors.
What happens if we manually stop the manager process?
Oracle Clusterware starts it up!
GGSCI (mycorp-racnode1) 2> stop manager
Manager process is required by other GGS processes.
Are you sure you want to stop it (y/n)? y

Sending STOP request to MANAGER ...
Request processed.
Manager stopped.

GGSCI (mycorp-racnode1) 3> quit

[oracle@mycorp-racnode1 goldengate]$ ./ggsci

Oracle GoldenGate Command Interpreter for Oracle
Version 11.2.1.0.3 14400833 OGGCORE_11.2.1.0.3_PLATFORMS_120823.1258_FBO
Linux, x64, 64bit (optimized), Oracle 11g on Aug 23 2012 20:20:21

Copyright (C) 1995, 2012, Oracle and/or its affiliates. All rights reserved.

GGSCI (mycorp-racnode1) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
While GoldenGate is running we can test the failover process by relocating it to another node in the cluster via the crsctl relocate resource command.
Note that we have to use the -f (force) option because the resource is already running when we manually relocate it to another node.
[oracle@mycorp-racnode1 bin]$ ./crsctl relocate resource ggate -f
CRS-2673: Attempting to stop 'ggate' on 'mycorp-racnode1'
CRS-2677: Stop of 'ggate' on 'mycorp-racnode1' succeeded
CRS-2673: Attempting to stop 'mycorp-oragg-vip' on 'mycorp-racnode1'
CRS-2677: Stop of 'mycorp-oragg-vip' on 'mycorp-racnode1' succeeded
CRS-2672: Attempting to start 'mycorp-oragg-vip' on 'mycorp-racnode2'
CRS-2676: Start of 'mycorp-oragg-vip' on 'mycorp-racnode2' succeeded
CRS-2672: Attempting to start 'ggate' on 'mycorp-racnode2'
CRS-2676: Start of 'ggate' on 'mycorp-racnode2' succeeded
We can now see that the Manager process has been started on the second node, and via the crsctl status resource command we can verify that the ggate resource is now running on the second node in the cluster (mycorp-racnode2).
GGSCI (mycorp-racnode2) 3> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING

[oracle@mycorp-racnode1 bin]$ ./crsctl status resource ggate
NAME=ggate
TYPE=cluster_resource
TARGET=ONLINE
STATE=ONLINE on mycorp-racnode2
What happens if we kill the Manager process?
[grid@mycorp-racnode1 goldengate]$ ps -ef | grep mgr
grid     11796     1  0 05:41 ?        00:00:00 ./mgr PARAMFILE /goldengate/dirprm/mgr.prm REPORTFILE /goldengate/dirrpt/MGR.rpt PROCESSID MGR PORT 7809
grid     12763 12669  0 05:50 pts/0    00:00:00 grep mgr
[grid@mycorp-racnode1 goldengate]$ kill -9 11796
As soon as I killed the Manager process from the OS, I quickly checked the status of the Manager as well as the Extract process which was running at the time.
Note the status of both processes and how quickly the status changes.
GGSCI (mycorp-racnode1) 1> info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     STOPPED
EXTRACT     RUNNING     EXT1        00:00:00      00:00:02

GGSCI (mycorp-racnode1) 2> ! info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
EXTRACT     ABENDED     EXT1        00:00:00      00:00:06

GGSCI (mycorp-racnode1) 3> ! info all

Program     Status      Group       Lag at Chkpt  Time Since Chkpt

MANAGER     RUNNING
EXTRACT     RUNNING     EXT1        00:00:00      00:00:10