How To Use HVR and Oracle Data Guard for Source Database Failover
Consider the scenario where HVR is capturing changes from an Oracle database that is kept in sync with a Disaster Recovery (DR) site using Oracle Data Guard. When the source database fails over to the DR site, special steps are needed for HVR to resume capture. In general, three situations must be considered.
Graceful failover: A planned failover where all applications and all HVR processes are stopped in a controlled manner and brought up again on the DR site. No data loss should occur.
Abrupt failover: An emergency occurs, for example, a hardware failure, and applications are switched over without being stopped in a controlled way. In this case, changes that were committed to the old production database may be missing from the new production database.
Break in continuous replication: Replication is interrupted, for example, by a misconfiguration or the loss of an Oracle archive file. This can occur irrespective of any failover.
A refresh is sometimes needed in Situation 2 (only if there was data loss) and in Situation 3. This document outlines these procedures. The following assumptions are made.
The configuration consists of the following databases:
Oracle source database
DR copy maintained by Data Guard
The HVR hub database, which is not part of the objects that are failed over
The target database, which can be any of the supported target databases (not necessarily Oracle). This database is also not failed over.
The capture location will be reconfigured to point at the new source, either automatically (if a virtual IP or CNAME used in the Location configuration fails over with the source database) or manually (in the HVR GUI).
Data Guard is running in physical standby mode. This is important because it means that HVR capture cannot be re-initialized to an SCN from before the failover; that would be possible only with a logical standby.
Graceful Failover
The procedure for a graceful failover:
Stop all active applications/end-users on the production database.
Check that HVR has no latency, or wait until the latency of the capture and integrate jobs drops to zero. Simultaneously confirm that Data Guard also has no latency.
Switch the capture location to the DR site, either automatically (VIP or CNAME changes) or manually (reconfigure node in Location Property dialog).
Switch the DR site to become primary.
Run HVR Initialize; select only the source location. In the Advanced Options, check:
- Table Enrollment (because the redo and archive may be named differently)
- Transaction Files and Capture Time
- Scripts and Jobs
For Capture Rewind, choose options:
- To time: a time after the DR site became primary, but before end-users are allowed back in.
- Emit from that time
Start capture.
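Before the switch in the steps above, both HVR latency and the Data Guard apply lag must be zero. The check can be sketched as below; this is an illustrative helper, not part of HVR, and it assumes the apply lag is read from Oracle's v$dataguard_stats view, which returns an interval-style value such as '+00 00:00:00'.

```python
import re

def parse_apply_lag(value: str) -> int:
    """Parse a Data Guard 'apply lag' value from v$dataguard_stats
    (interval text such as '+00 00:00:05') into total seconds."""
    m = re.match(r"\+?(\d+)\s+(\d+):(\d+):(\d+)", value)
    if not m:
        raise ValueError(f"unrecognized apply lag: {value!r}")
    days, hours, minutes, seconds = map(int, m.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def safe_to_switch(capture_latency_s: int, integrate_latency_s: int,
                   apply_lag: str) -> bool:
    """True only when the HVR capture latency, the HVR integrate
    latency, and the Data Guard apply lag are all zero."""
    return (capture_latency_s == 0 and integrate_latency_s == 0
            and parse_apply_lag(apply_lag) == 0)
```

For example, `safe_to_switch(0, 0, "+00 00:00:00")` is true, while any remaining capture or integrate latency, or a non-zero apply lag, makes it false.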
Abrupt Failover
This procedure starts after the emergency has happened and the DR site has become primary. In this case, the HVR capture job has hit an error and is probably in RETRY state. It will be able to connect to the new primary but will not be able to resume capture, because it cannot find the correct starting record in the Oracle redo or archive files.
It is important to identify whether the failover resulted in any data loss or data inconsistencies. If both HVR and Data Guard were running without latency, or if no changes were being made to the source database at the time of the failover, it can be assumed that all databases are in sync and that no extra steps are necessary; the steps for Graceful Failover can be followed.
However, if there was latency, data can be out of sync, for example, because of changes that Data Guard replicated to the new production database but that HVR has not yet replicated to the target (or vice versa). In that case, an additional HVR Refresh should be done to correct for these missed records. The procedure is:
Check whether HVR or Data Guard was running with non-zero latency at the time of the failover. If both latencies were zero, follow the Graceful Failover procedure instead.
Suspend the HVR capture job.
Leave the HVR integrate job running; it may still be processing transactions captured from the old production database. Wait until it has processed all transaction files.
Determine the timestamp of the failover.
Run HVR Initialize to this timestamp; select only the source location. In the Advanced Options, check the following:
- Table Enrollment
- Transaction Files and Capture Time
- Scripts and Jobs
For Capture Rewind, choose options:
- To time: the time when the DR site became primary
- Emit from this time
Run HVR Refresh in Online mode to correct for data inconsistencies. See the procedure “Recovery from data loss” below for details.
Start capture.
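The failover timestamp needed in the steps above can be read on the new primary: Oracle records the SCN at which a standby became primary in the STANDBY_BECAME_PRIMARY_SCN column of V$DATABASE, and SCN_TO_TIMESTAMP converts it to a time (valid only while the SCN is still within Oracle's SCN-to-time mapping retention). A minimal sketch using any Python DB-API cursor connected to the new primary:

```python
def failover_timestamp(cursor):
    """Return the time at which the standby became primary, by
    converting STANDBY_BECAME_PRIMARY_SCN (from V$DATABASE) with
    Oracle's SCN_TO_TIMESTAMP function."""
    cursor.execute(
        "SELECT scn_to_timestamp(standby_became_primary_scn) "
        "FROM v$database"
    )
    return cursor.fetchone()[0]
```

The returned timestamp can be used directly as the “To time” value for Capture Rewind.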
Recovery from data loss
HVR comes with two utilities that can identify or resolve data loss: HVR Compare and HVR Refresh. HVR Compare can be used to determine which tables are out of sync. Those tables can subsequently be synced again using HVR Refresh. An alternative is to skip HVR Compare and immediately run HVR Refresh for all tables. The latter can be preferable when it is certain that data was lost and a complete refresh does not take too much time.
Identifying whether there is data loss or inconsistency
HVR Compare compares the data in the tables on the source and target locations. It uses the channel definition as the basis for comparison, including any transformations defined in the channel, and takes into consideration data type mappings between source and destination.
Compare provides two modes:
Bulk: bulk mode computes a checksum of all data in a table. HVR reports a difference between the source and destination tables if the checksums differ, even when the row counts (which are also reported) are identical. Bulk mode runs on the database server and passes only the checksum to the hub, so it is very efficient.
Row-by-row: in row-by-row mode, all data is sorted and passed to the hub for detailed comparison. Row-by-row comparison in verbose mode provides a detailed report of all differences, including a set of SQL statements that can be run on the destination to bring it back in sync with the source.
The compare job does not take in-flight transactions into consideration, so expect differences on a live system. Compare jobs are therefore ideally run during downtime or when the system is relatively idle.
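The two modes can be illustrated with a simplified sketch. This is not HVR's actual algorithm, only a model of the trade-off: bulk compare reduces each table to one checksum, while row-by-row compare indexes both sides by key and emits the repair actions for the target.

```python
import hashlib

def bulk_checksum(rows):
    """Bulk-style compare: reduce a whole table to a single checksum.
    (Sorted here so the result is order-independent; HVR's real
    checksum algorithm is internal to the product.)"""
    h = hashlib.sha256()
    for row in sorted(rows):
        h.update(repr(row).encode())
    return h.hexdigest()

def row_by_row_diff(source, target, key=lambda r: r[0]):
    """Row-by-row-style compare: emit the insert/update/delete
    actions needed on the target to match the source."""
    src = {key(r): r for r in source}
    tgt = {key(r): r for r in target}
    actions = []
    for k in src.keys() - tgt.keys():
        actions.append(("insert", src[k]))
    for k in tgt.keys() - src.keys():
        actions.append(("delete", tgt[k]))
    for k in src.keys() & tgt.keys():
        if src[k] != tgt[k]:
            actions.append(("update", src[k]))
    return actions
```

A differing checksum only says that the tables differ; the row-by-row diff says where and how, which is why it is the mode that can produce repair SQL.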
In order to identify tables that have data consistency issues, the following procedure may be used:
Run HVR Compare in bulk mode for all tables.
For tables that report differences in step 1:
a. Run Compare in row-by-row mode. Keep the number of reported differences in a spreadsheet.
b. Repeat this step multiple times for the tables that have differences.
c. When the row count of the differences goes up and down with a low minimum (for example, 100, 10, 0, 1, 0 on subsequent runs), this indicates that the differences are due to in-flight transactions. If the row count fluctuates but has a high minimum (for example, 10100, 10010, 10000 on subsequent runs), this indicates that the table has missed data.
For tables that have been identified with data loss, use Refresh in Online mode.
Using Refresh in Online Mode to Recover
The Online Refresh procedure enables resynchronization without requiring end-user downtime on the source database. Like HVR Compare, HVR Refresh has two modes of operation:
Bulk: Refresh truncates the target table and uses a bulk load method to load the data. This is very efficient, but cannot be used while end-users are accessing the data in the target tables.
Row-by-row: Refresh does a row-by-row compare first and, after identifying the differences, applies them using insert/update/delete statements. This is less intrusive on the target tables.
The steps for Online Refresh are:
Suspend the integrate jobs. You can leave the capture jobs running.
Run Refresh in Online mode. For skipping changes there are different options:
Skip changes from before the refresh on both the read and write side. This can be used when all data is refreshed (no refresh condition) and there is only one integrate location.
Skip changes from before the refresh on the write side only. This can be used when all data is refreshed (no refresh condition) and there are multiple integrate locations.
No skipping; only be resilient. This should be used in combination with Restrict /RefreshCondition (for example, when not all data is being refreshed).
After the refresh, the integrate jobs can be restarted. During the first integrate cycle, changes from before the refresh will be skipped (if that option was chosen) and changes made during the refresh will be applied with resilience.
The target databases will be consistent after the first integrate cycle has finished.
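The choice between the three skip options above can be summarized as a small decision helper. The option names here are descriptive labels for this document's three bullets, not HVR option identifiers:

```python
def choose_skip_option(full_refresh: bool, integrate_locations: int) -> str:
    """Map the online-refresh situation to the skip option described
    above: a full refresh (no refresh condition) with one integrate
    location skips on both sides; a full refresh with several
    integrate locations skips on the write side only; a partial
    refresh relies on resilience with Restrict /RefreshCondition."""
    if not full_refresh:
        return "no skipping; be resilient (with Restrict /RefreshCondition)"
    if integrate_locations == 1:
        return "skip pre-refresh changes on read and write side"
    return "skip pre-refresh changes on write side only"
```

For example, a full refresh of a channel with three integrate locations maps to skipping on the write side only.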