HVR High Availability
This section provides a high-level description of how to make HVR data replication highly available in your environment.
Introduction
An essential consideration for High Availability (HA) in data replication is understanding the impact of downtime on operations and its associated business costs. When replication is inactive, the flow of data between source(s) and target(s) ceases. If the replication downtime is unrelated to the availability of source(s) and target(s), latency gradually increases until replication is restored. Data remains accessible on the target during this period, but it becomes progressively stale until replication resumes and once again provides access to the most current data from the source. From a business perspective, sustained high latency is comparable to replication unavailability.
The availability of data replication depends on various factors, including the availability of source(s) and target(s), open network communication, and the proper functioning of software. Failures in individual components within this complex setup can lead to replication downtime. However, downtime in replication does not necessarily imply a complete loss of data access; rather, data on the target(s) becomes progressively stale for the duration of the outage.
Careful planning of your High Availability (HA) strategy for data replication is crucial to ensure optimal data access across your organization. For further assistance with implementing these strategies, reach out to your Fivetran representative for professional services help.
Fundamentally, the HVR technology is designed to resume replication from the point of interruption. With an HA approach to data replication, downtime is either avoided or limited, ensuring continuous data flow between source(s) and target(s).
The ability to replicate data between data stores has a number of dependencies, including:
Source and target databases/data stores must be available and accessible.
All of the infrastructure involved in the data replication must be available and allow connectivity.
The software that makes data replication possible must be functioning as designed and configured.
At any given moment, one or more components may fail, prompting your organization to implement strategies aimed at minimizing or preventing data replication downtime.
Backup, Disaster Recovery, High Availability
There are three strategies to resume data replication after an unforeseen failure:
Restore from backup, and recover: A backup is a copy of your data replication definitions. If the data replication setup fails due to a component failure or corruption, it can be restored on the same or new equipment, and from there, data replication can be recovered.
Implement Disaster Recovery (DR): A DR environment is a completely separate setup that can take over in case of a major event - a disaster such as flooding or an earthquake - taking out many components at a time (e.g. entire data centers, the electricity grid, or network connectivity for a large region).
Implement High Availability (HA) architecture: A setup with no single point of failure. HA introduces redundancy into the setup to allow for components to step in if/when there is a failure. The goal of an HA implementation is to prevent replication downtime.
The combination of availability/recovery strategies that might work best for your organization to minimize data replication downtime depends on a number of factors, which include the complexity of the environment, the budget required to ensure availability, and the extent of replication downtime your organization can afford. The length of downtime you can afford guides your Recovery Time Objective.
Recovery Time Objective
Recovery Time Objective (RTO) is the duration between the replication becoming unavailable and its complete restoration. Note that there are two important moments following replication downtime:
When replication resumes.
When replication has caught up from its backlog.
The RTO for replication is defined by the second moment (i.e. when replication operations have been completely restored). The amount of time between when replication resumes and when the backlog has been processed depends on a number of factors, including:
Duration of the downtime.
Volume of the transactions generated on the source database, both during the downtime and when the replication is processing the backlog.
How quickly the replication system can process changes compared to the source generating them.
Establish an RTO for the worst-case scenario, which is an outage during peak load. For some organizations, particularly those with mission-critical systems relying on data replication for high availability, the RTO may be as low as minutes or seconds. In contrast, organizations running non-mission-critical workloads on replicated datasets may find occasional data replication downtime more manageable.
Weigh your RTO against the cost and complexity of implementing HA, as you architect your data replication configuration.
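To make this trade-off concrete, it helps to estimate the catch-up time: during downtime the source builds a backlog, and once replication resumes, the backlog drains only at the rate by which processing outpaces change generation. The sketch below is a back-of-the-envelope illustration; the rates and durations are hypothetical placeholders to replace with measurements from your own peak workload.

```python
# Rough catch-up estimate after replication downtime (illustrative only).
# Assumes steady rates; real workloads are bursty, so size for peak load.

downtime_hours = 4.0     # how long replication was down (hypothetical)
generate_rate = 50_000   # source changes per hour (hypothetical)
process_rate = 200_000   # changes replication can apply per hour (hypothetical)

# Changes accumulated while replication was down.
backlog = downtime_hours * generate_rate

# While catching up, the source keeps generating changes, so the backlog
# drains at (process_rate - generate_rate) changes per hour.
catchup_hours = backlog / (process_rate - generate_rate)

print(f"backlog: {backlog:,.0f} changes")
print(f"catch-up time: {catchup_hours:.2f} h")
print(f"effective recovery time: {downtime_hours + catchup_hours:.2f} h")
```

Note that if the processing rate does not comfortably exceed the peak generation rate, the backlog never drains and no RTO can be met.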
HVR Architecture
HVR’s real-time replication solution features a distributed approach to deploying log-based Change Data Capture (CDC) and continuous or scheduled integration. The HVR software is a single installation that, on Linux and Windows platforms, can act as a hub server, controlling data replication, or as an agent, performing work as instructed by a hub. An installation on Unix can only act as an agent.
The hub server and agents communicate over TCP/IP, always using encrypted messages. However, the architecture is flexible, and the use of agents in a setup is optional.
The image below shows the distributed architecture, in this case, a single source agent, a hub server, and a single target agent. Real-world deployments often use a separate agent per source, and a separate agent per target.
A distributed architecture using a combination of HVR agents with a hub provides several benefits, including:
Scalability: offloading/distributing the most resource-intensive processing to avoid a central bottleneck when processing changes.
Performance: optimizing network communication with data compressed by the agent before being sent over the wire.
Security: improving security with unified authorization and always encrypted communication.
Hub Server
The hub server is an installation of HVR on Linux or Windows that is used for configuring the data replication flows. The metadata is stored in a repository database that can be one of multiple relational databases that HVR supports. A hub server can manage one or more hubs.
The hub server runs a scheduler for each of the hubs it manages. The role of the scheduler is to start the jobs that perform the data movement between source and target, and to make sure jobs are restarted if they fail.
To operate, the hub server requires a connection to the repository database to serve the hubs. The repository database can either be local to the host running the hub, or remote using a remote database connection.
The main functions of the hub server are:
Host the lightweight web server for the browser-based UI and allow REST requests.
Be the access point for operator maintenance and monitoring.
Manage (start/stop) the hub scheduler(s).
Hub
The main functions of a hub are:
Maintain the replication state.
Restart/resume replication in case of a failure.
Route compressed transaction files arriving from the source(s) to the correct target(s).
Environments with a dedicated hub server will see the bulk of the data replication processing take place on the agent servers, with very little data replication load on the hub server.
HVR Agent
Agents are typically installed directly on or close to the source and destination systems. Agents may store limited state information (configuration dependent). Based on the state information stored on the hub, data replication can always resume at the point of stoppage if replication is interrupted or fails.
A single hub may orchestrate data replication between hundreds of agents.
HA for the HVR Hub Server and Hub
For replication to function, the hub scheduler must be running. The HVR hub server starts the scheduler and makes sure it keeps running. In case of a scheduler failure, the hub server automatically restarts the scheduler so replication can continue.
The hub server requires a connection to the repository to function properly. However, the hub server does not fail if/when the repository database is (temporarily) unavailable. To ensure the hub server is always running, set the service to automatically restart.
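On Linux, automatic restart of the hub server is typically arranged through the service manager. The unit file below is a minimal illustrative sketch, not the supported installation procedure: the ExecStart path, user, binary name, and HVR_CONFIG value are hypothetical placeholders, so consult the HVR installation documentation for the actual service setup on your platform.

```ini
# /etc/systemd/system/hvrhubserver.service -- illustrative sketch only;
# the paths, user, and binary name below are hypothetical placeholders.
[Unit]
Description=HVR Hub Server
After=network.target

[Service]
Type=simple
User=hvr
Environment=HVR_CONFIG=/opt/hvr/hvr_config
ExecStart=/opt/hvr/bin/hvrhubserver
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```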
The current condition of the replication jobs (e.g. RUNNING or SUSPENDED) is stored in the repository. The data replication definitions, also stored in the repository, are not required at runtime; however, data replication can be re-instantiated from the current definitions.
HVR data replication is recoverable based on the state information stored on the hub. Any state information about the data replication beyond the job condition is stored on the file system in the directory HVR_CONFIG/hubs/hubname, where HVR_CONFIG is an environment setting. To resume replication where it left off prior to a failure, without the additional downtime of re-activating the channel(s), the files in this directory must be accessible.
An HA setup for the HVR hub server requires:
Repository database to be accessible; note that the repository database can be remote from the hub server or local to it.
Access to up-to-date and consistent data in the HVR_CONFIG directory.
One way to implement HA for the HVR hub is to use a cluster with shared storage. Fivetran does not certify specific cluster solutions; the HVR hub server is a self-contained software package that should be interoperable with cluster management software. Use a REST call to the HVR hub server to validate that the web server is running, and utilities like hvrtestscheduler to identify the state of a scheduler and help decide whether to fail over.
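As an illustration, a cluster manager's health probe could combine both checks. The sketch below is a minimal example and not a supported interface: the port, URL, hub name, and the argument and exit-code semantics of hvrtestscheduler are assumptions to verify against your own installation and the HVR documentation.

```python
# Minimal health probe sketch for an HVR hub server (illustrative only).
import subprocess
import sys
import urllib.error
import urllib.request

HUB_URL = "http://localhost:4340/"  # hypothetical port and path
HUB_NAME = "hvrhub"                 # hypothetical hub name

def web_server_up() -> bool:
    try:
        with urllib.request.urlopen(HUB_URL, timeout=5):
            return True
    except urllib.error.HTTPError:
        return True   # any HTTP response means the web server answered
    except OSError:
        return False  # connection refused, timeout, DNS failure, ...

def scheduler_up() -> bool:
    # hvrtestscheduler ships with HVR; its arguments and exit-code
    # semantics are assumptions here -- verify against the HVR docs.
    try:
        result = subprocess.run(["hvrtestscheduler", HUB_NAME],
                                capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0

if __name__ == "__main__":
    healthy = web_server_up() and scheduler_up()
    # A non-zero exit tells the cluster manager to consider failing over.
    sys.exit(0 if healthy else 1)
```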
HA on Cluster
In a clustered setup, the cluster manager must manage the HVR hub server as a cluster resource, making sure that only one hub server runs within the cluster at any one point in time. HVR users must connect through a cluster identifier, virtual IP address, or floating DNS name so that they always reach the active hub server.
The HVR_CONFIG directory must be on attached storage that is shared between the nodes in the cluster or switched over during a cluster failover (e.g. in a Windows cluster). If the repository database is local to the hub, such as an Oracle RAC database or a SQL Server AlwaysOn cluster, then the connection can always be established to the database on the local node or through the cluster identifier. For a remote database, (network) connectivity to the database must be available, and the database should have its own HA setup that allows remote connectivity in case of failure.
HA on Cloud
Cloud environments provide services that support an HVR HA configuration for the HVR_CONFIG file system, such as Elastic File System (EFS) on Amazon Web Services (AWS), Azure Files on Microsoft Azure, and Filestore on Google Cloud Platform (GCP). The cloud providers also offer database services with built-in HA capabilities, and redundancy in connectivity to allow for failures without impacting availability.
HA for HVR Agents
HA for an HVR agent can be achieved by having a duplicate installation of the agent available. HVR identifies an agent in the HVR location configuration through a hostname or IP address, where the HVR agent remote listener must be running to establish a connection.
The HVR agent remote listener is a lightweight process, similar to a database listener, that forks new processes to perform work. If the agent runs on the source or target database server, then the agent is available whenever the server is available and the HVR agent remote listener is running. Make sure the remote listener is configured to start at system startup, through systemd or (x)inetd on Linux/Unix or as a service on Windows. For more information about the configuration steps, see section Configuring HVR Agent.
Consider using a floating virtual IP address/host name to identify the agent, or use a load balancer in front of the agent for automatic failover.
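For example, a load balancer or failover script can use a plain TCP probe against the listener port to decide which agent instance is reachable. The sketch below is illustrative; the host names and port number are hypothetical placeholders for whatever your agent listeners are configured with.

```python
# TCP reachability probe for HVR agent listeners (illustrative only).
import socket

AGENT_PORT = 4343  # hypothetical listener port
AGENTS = ["agent-primary.example.com", "agent-standby.example.com"]

def listener_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Pick the first agent whose listener accepts connections.
active = next((h for h in AGENTS if listener_reachable(h, AGENT_PORT)), None)
print(f"active agent: {active}" if active else "no agent reachable")
```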
Cloud providers have services available that can help implement HA for agents.
State Information Stored by the HVR Agent
The HVR agent performing CDC on the source may store state information about long-running open transactions to prevent a lengthy recovery if/when the capture process restarts. By default, the capture process writes a checkpoint to disk every five minutes, persisting the in-memory state of transactions that have been open for longer than five minutes. Upon restart, the most recent complete checkpoint becomes the starting point of CDC, followed by re-reading (backed up) transaction logs from the point of the checkpoint forward, ensuring no data changes are lost. If no checkpoint is available, then replication resumes from the beginning of the oldest open transaction. This can lead to longer recovery times and, in the case of very long-running transactions, to recovery issues because log backups (or archived logs) are no longer accessible.
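The choice of restart point can be summarized as follows. This is a simplified sketch of the logic described above, not HVR's actual implementation; the data structures and positions are hypothetical.

```python
# Simplified sketch of choosing the CDC restart point (not HVR's code).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Checkpoint:
    log_position: int    # position up to which open-transaction state was saved

@dataclass
class OpenTransaction:
    begin_position: int  # log position where the transaction started

def restart_point(checkpoint: Optional[Checkpoint],
                  open_txns: list[OpenTransaction]) -> int:
    if checkpoint is not None:
        # Resume from the checkpoint; older log is not needed because the
        # in-memory state of long-running transactions was persisted.
        return checkpoint.log_position
    if open_txns:
        # No checkpoint: re-read the log from the oldest open transaction,
        # which may require log backups that are no longer retained.
        return min(t.begin_position for t in open_txns)
    return 0  # nothing open: simplified placeholder for "last processed position"

# With a checkpoint at position 980, an old open transaction that began at
# position 120 does not force a rewind all the way back to 120.
print(restart_point(Checkpoint(log_position=980),
                    [OpenTransaction(begin_position=120)]))  # -> 980
```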
Long-running transactions may occur on Oracle Databases, especially when using a packaged application such as Oracle eBusiness Suite. Long-running transactions are less likely on other databases.
The default location for storing the checkpoint is the agent location. The Capture Checkpoint Tuning option, available under the Advanced Capture Properties (on the Create New Location page or on the Source and Target Properties pane in the Location Details page), allows you to store checkpoints on the hub server instead, so that in case of a restart the checkpoint stored on the hub server can be the starting point. Note that storing the checkpoint on the hub server takes more time than storing it locally where the agent runs.
The integration agent stores state information:
In the state tables, in the target location (only if the target is a database).
Implicitly, in a directory _hvr_state, if the target is a file system (except for S3 as a file system, in which case the hvrmanifestagent plugin should be used to publish files).
With the agent state information stored in the actual target location, data replication can leverage the current state irrespective of whether the same or a different instance of an agent is used, so there is no need to take integrate state into account when planning HA for the integration agents.
Recovering HVR Replication
If HVR replication fails and there is no HA setup, or if for whatever reason the HA configuration was unable to prevent an outage (e.g., a perfect storm of failures, or a disaster scenario in which DR could have provided business continuity but HA could not), then you must recover replication.
This section describes how to recover replication. Irrespective of the strategy you choose, take advantage of HVR Compare to assess, at any point in time, whether source and target are out of sync.
Restore and Recover from a Backup
Make sure to always have a current backup of your HVR hub definitions by using the Export Hub Definition function. In the worst case, with data replication otherwise unavailable, you can restore the environment from the hub definition export:
Create a new repository database (with empty tables).
Import the export file.
Depending on your target, use one of the following approaches:
For a database target, the state information required for loss-less capture rewind is stored in the HVR state table. Activate the channel with the capture rewind option Recovery Rewind to Target Databases' Integrate Sequence selected to recreate the channel state and resume replication where it left off.
Activate the channel with capture rewind, using resilient processing on the target if applicable. It is important to re-capture, in their entirety, any database transactions that were open at the point when replication failed, so use a start time (well) before data replication failed to avoid loss of data.
For a target that writes an audit trail of changes (a TimeKey channel), you will need to identify the most recently applied transaction and use this information to re-initialize data replication to avoid data overlap, or you must have downstream data consumption cope with possible data overlap.
Use HVR Refresh as needed to resync the tables between source and target. Consider the use of row-wise Refresh, also referred to as Repair, for a database target if data volumes are modest. Also consider the use of action Restrict to define a suitable filter condition when resyncing the data.
Recovering replication is generally a lot easier if database transaction log backups are still available and you can use CDC to get the environments back in sync. Consider the backup retention policy for your transaction logs if your recovery strategy for data replication includes recovery from a backup. If transaction log backups are no longer available, then you have to resync data from the source.
Disaster Recovery
Recovering data replication using a DR setup for any failed component(s) is similar to the recovery from a backup. In the DR case, however, the recovery time is likely lower because the DR environment is already configured. The DR environment may even be able to connect to the primary repository database for up-to-date data replication definitions. However, the DR environment would not have access to an up-to-date HVR_CONFIG location, so the data replication state would have to be recreated.
The steps to recover in a DR environment are:
Depending on your target, use one of the following approaches:
For a database target, the state information required for loss-less capture rewind is stored in the HVR state table. Activate the channel with the capture rewind option Recovery Rewind to Target Databases' Integrate Sequence selected to recreate the channel state and resume replication where it left off.
Activate the channel with capture rewind, using resilient processing on the target if applicable. It is important to re-capture, in their entirety, any database transactions that were open at the point when replication failed, so use a start time (well) before data replication failed to avoid loss of data.
For a target that writes an audit trail of changes (a TimeKey channel), you will need to identify the most recently applied transaction and use this information to re-initialize data replication to avoid data overlap, or you must have downstream data consumption cope with possible data overlap.
Use HVR Refresh as needed to resync the tables between source and target.