Hub Disk Requirements

This section describes the disk requirements for the machine running HVR Hub System.

The HVR Hub System runs Scheduler service(s) to manage the jobs that move data between source location(s) and target location(s) (Capture jobs, Integrate jobs, Refresh jobs, Compare jobs).

To operate, the Scheduler must connect to a repository database consisting of a number of relational tables. By design, the HVR Hub System is limited to job orchestration, recording system state information, and temporary storage of router files and transaction files. For the Refresh process, no data is persisted on the HVR Hub System, so the hub acts as a simple pass-through. Therefore, the HVR Hub System needs storage to hold the following:

HVR state information (HVR_CONFIG)
Repository database (including statistics retention, location information, etc.)
HVR installation (<1 GB standard)
Metering data for consumption-based pricing. Starting with 6.1.0/76, 6.2.0/14 and 6.2.5/5, HVR collects resync detection samples. The samples take up to 50 MB per replicated table, for every source/destination pair.
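
As a rough worked example of the sample-storage figure above, the upper bound scales with the number of replicated tables and the number of source/destination pairs. This is a minimal sketch: the function name and inputs are illustrative and not part of HVR, and it assumes every table is replicated across every pair.

```python
SAMPLE_MB_PER_TABLE_PAIR = 50  # upper bound quoted in this section


def resync_sample_storage_mb(replicated_tables: int, source_target_pairs: int) -> int:
    """Upper-bound disk space (in MB) for resync detection samples.

    Assumes every replicated table is present in every source/destination pair.
    """
    if replicated_tables < 0 or source_target_pairs < 0:
        raise ValueError("counts must be non-negative")
    return SAMPLE_MB_PER_TABLE_PAIR * replicated_tables * source_target_pairs


# Example: 200 replicated tables across 2 source/destination pairs.
print(resync_sample_storage_mb(200, 2))  # 20000 MB, i.e. about 20 GB
```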

Resource Consumption

HVR is designed to distribute work between HVR Agents. As a result, resource-intensive processing rests on the HVR Agents, with the HVR Hub System machine performing as little processing as possible. The HVR Hub System machine controls all the jobs that move data between sources and targets, and stores the system's state to enable recovery without any loss of changes. All data transferred between sources and targets passes through the HVR Hub System machine, including data from a one-time load (hvrrefresh) and detailed row-wise comparison (hvrcompare).

The HVR Hub System machine needs resources to:

Run the Scheduler.
Spawn jobs to perform one-time data movement (Compare and Refresh) and continuous replication (Capture and Integrate). In all cases, the resource-intensive part of data processing, including data compression, is implemented on the HVR Agent machine, with the HVR Hub System machine simply passing the data from source to target. For Compare or Refresh, the data is passed through without being written to disk. During normal capture activity, data is temporarily stored on disk to allow the quickest possible recovery, with capture(s) and integrate(s) running asynchronously for optimal efficiency. If the data transfer is encrypted, the HVR Hub System machine decrypts the data and encrypts it again (typically using different encryption certificates) as needed to deliver it to the target.
Transfer compressed data from source to target. Because compression reduces the amount of data transferred by a factor of 5 to 10, large amounts of data can be transferred without the need for very high network bandwidth.
Collect metrics from the log files to be stored in the repository database.
Provide real-time process metrics to any Graphical User Interfaces (GUIs) connected to the HVR Hub System machine. HVR runs as a service regardless of whether any UI is connected, and real-time metrics are provided for monitoring purposes.
Allow configuration changes in the design environment.

CPU

Every HVR job spawns a process, i.e. one for every Capture and one for every Integrate. The CPU utilization for each of these processes on the HVR Hub System machine is generally very low unless heavy transformations are processed on the HVR Hub System machine (this depends on the channel design). In addition, Refresh or Compare may spawn multiple processes when running, and a row-by-row Refresh or Compare can use a lot of CPU.

Memory

Memory consumption is slightly higher on the HVR Hub System machine than on the source, but still fairly modest. Some customers run dozens of channels on a dedicated HVR Hub System machine with a fairly modest configuration. Row-by-row Refresh and Compare may use a lot of memory but are not run on an ongoing basis.

Storage Space

Storage utilization on the HVR Hub System machine can be high. If Capture is running but Integrate is not running for at least one destination, transaction files accumulate on the HVR Hub System machine. These files are compressed, but depending on the activity on the source database and the time it takes until the target starts processing transactions, a lot of storage space may be used. Start with at least 10 GB, and allow more if the HVR Hub System machine manages multiple channels or network connectivity is unreliable. Large row-by-row Refresh or Compare can also use a lot of storage space.
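
The accumulation described above can be estimated with simple arithmetic. The sketch below is illustrative only: the function and its parameters are hypothetical, and it assumes that transaction files compress at roughly the 5 to 10 times ratio this section quotes for transferred data, which may not hold for your workload.

```python
def backlog_gb(captured_gb_per_hour: float, outage_hours: float,
               compression_ratio: float) -> float:
    """Estimate disk space (GB) used by transaction files that pile up
    while Capture keeps running and Integrate is stopped.

    compression_ratio is an assumption, e.g. 5 to 10.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return captured_gb_per_hour * outage_hours / compression_ratio


# Example: 20 GB/hour of captured changes, target unavailable for 12 hours.
best_case = backlog_gb(20, 12, 10)   # 24.0 GB with 10x compression
worst_case = backlog_gb(20, 12, 5)   # 48.0 GB with 5x compression
print(f"{best_case:.0f}-{worst_case:.0f} GB")  # 24-48 GB
```

In this example even a single channel exceeds the 10 GB starting point during a half-day outage, which is why the longest tolerated outage is a useful sizing input.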

Starting with 6.1.0/76, 6.2.0/14 and 6.2.5/5 HVR collects resync detection samples. The samples take up to 50 MB per replicated table, for every source/destination pair, per hub.

I/O

If Capture is running and keeping up with the transaction log generation on a busy system that processes many small transactions, then transaction files will be created at a rapid pace. Ensure that the file system can handle frequent I/O operations. Typically, a storage system cache or file system cache or SSD (or a combination of these) can take care of this.

Sizing Guidelines for Hub Machine

The most important factor impacting the HVR Hub System size is whether the hub also performs the role of a source and/or a target HVR Agent. General recommendations include:

Co-locate the HVR Hub System with a production source database only if the server(s) hosting the production database has (have) sufficient available resources (CPU, memory, storage space, and I/O capacity) to support the HVR Hub System for your setup.
Capture may run against a physical standby of the source database with no direct impact on the source production database. In this case, consider CPU utilization of the capture process(es) running on the source database. For an Oracle RAC production database, there is one log parser per node in the source database, irrespective of the standby database configuration.
Sorting data, both to coalesce changes for burst mode and to perform row-wise Compare (also part of row-wise Refresh), is CPU, memory and (temporary) storage space intensive.
Utilities to populate the database target like TPT (for Teradata) and gpfdist (for Greenplum) can be very resource-intensive.

The change rate mentioned in the sizing guideline below is the volume of transaction log changes produced by the database (irrespective of whether HVR captures all table changes from the source or only a subset).

HUB SIZE	RESOURCES	STANDALONE HUB	ONLY CAPTURE, NO INTEGRATE	ONLY INTEGRATE, NO CAPTURE	BOTH CAPTURE AND INTEGRATE
Small	CPU cores: 4-8; Memory: 16-32 GB; Disk: 50-500 GB SSD; Network: 10 GigE HBA (or equivalent)	5 channels with average change rate up to 20 GB/hour	2 channels with average change rate up to 20 GB/hour	2 channels with average change rate up to 20 GB/hour	1 channel processing up to 20 GB/hour
Medium	CPU cores: 8-32; Memory: 32-128 GB; Disk: 300 GB - 1 TB SSD; Network: 2 x 10 GigE HBA	20 channels, up to 5 with high average change rate of 100 GB/hour	8 channels, up to 2 with high average change rate of 100 GB/hour	6 channels, up to 2 with high average change rate of 100 GB/hour	4 channels, up to 2 with high average change rate of 100 GB/hour
Large	CPU cores: 32+; Memory: 128 GB+; Disk: 1 TB+ SSD; Network: 4+ x 10 GigE HBA	50+ channels	15+ channels	12+ channels	8+ channels
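
The table can also be read as a lookup. The following sketch transcribes the channel limits above into a small helper. The function, the role keys, and the rule that anything beyond the Medium limits is treated as Large are assumptions made for illustration; the table itself does not define the ranges between Medium and Large.

```python
# (max channels, max channels with a high change rate) per hub size,
# transcribed from the sizing table. Small hubs list no high-rate channels.
LIMITS = {
    "standalone":     {"Small": (5, 0), "Medium": (20, 5)},
    "capture_only":   {"Small": (2, 0), "Medium": (8, 2)},
    "integrate_only": {"Small": (2, 0), "Medium": (6, 2)},
    "both":           {"Small": (1, 0), "Medium": (4, 2)},
}


def suggest_hub_size(role: str, channels: int, high_rate_channels: int = 0) -> str:
    """Return the smallest hub size whose table limits cover the workload."""
    if role not in LIMITS:
        raise ValueError(f"unknown role: {role}")
    if high_rate_channels > channels:
        raise ValueError("high_rate_channels cannot exceed channels")
    for size in ("Small", "Medium"):
        max_channels, max_high_rate = LIMITS[role][size]
        if channels <= max_channels and high_rate_channels <= max_high_rate:
            return size
    return "Large"  # assumption: everything beyond Medium is sized as Large


print(suggest_hub_size("standalone", 4))       # Small
print(suggest_hub_size("capture_only", 6, 2))  # Medium
print(suggest_hub_size("both", 10))            # Large
```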

Storage for HVR_CONFIG

The most important resource for the HVR Hub System machine to function well is fast I/O (in terms of IOPS), especially for the HVR_CONFIG directory, where runtime data and state are written. To support capture on a busy source system, transaction files can be written to disk every second or two, with updates to the (tiny) capture state file at the same rate, as well as very frequent updates to the log files that keep track of the activity. With multiple channels running, there will be many small I/O operations into the HVR_CONFIG directory every second. A disk subsystem with a sizable cache, preferably backed by Solid-State Drives (SSDs), is a good choice for the HVR Hub System storage.
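
To get a first impression of how a file system copes with many small synchronous writes, a rough probe such as the one below can be pointed at a scratch directory on the same file system as HVR_CONFIG. This is not an HVR utility; the function name, file size, and write count are arbitrary assumptions, and a dedicated benchmarking tool such as fio will give more reliable numbers.

```python
import os
import tempfile
import time


def small_write_rate(directory: str, writes: int = 50, size: int = 4096) -> float:
    """Create, fsync and delete `writes` small files; return files per second."""
    payload = b"x" * size
    start = time.perf_counter()
    for _ in range(writes):
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            os.write(fd, payload)
            os.fsync(fd)  # force the data to storage, bypassing the OS write cache
        finally:
            os.close(fd)
            os.remove(path)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against division by zero
    return writes / elapsed


# Example run against a throwaway directory; substitute a scratch directory
# on the file system that holds HVR_CONFIG.
with tempfile.TemporaryDirectory() as scratch:
    print(f"{small_write_rate(scratch):.0f} small synced writes per second")
```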

Starting with 6.1.0/76, 6.2.0/14 and 6.2.5/5 HVR collects resync detection samples. The samples take up to 50 MB per replicated table, for every source/destination pair, per hub.

Repository Database

The HVR Hub System stores channel metadata, a very limited amount of runtime data, as well as aggregated process metrics (statistics) in its repository database. The most important resource for the repository database is storage, and even that need is quite modest for a single hub (up to 20 GB of disk space allocated for the repository database can support virtually all hub setups). Traditionally, the repository database is stored locally to the HVR Hub System, but there are cases when a database service is used to host the repository database away from the HVR Hub System. The trade-off is as follows: a local repository database lowers the likelihood that the database connection fails (which would stop all data flows, because the Scheduler fails in such a case), whereas a remote repository database offloads the resources the repository requires to another system.

The statistics data stored in the repository database (hvr_stats) can take up a large amount of storage space.

Sizing Guidelines for Repository Database

Review the guidelines and decide which HVR Hub System configuration best fits your situation. For example:

Your HVR Hub System may capture changes for one of multiple sources, using HVR Agent for the other sources.
One of your sources may be a heavily-loaded 8-node Oracle Exadata database that requires far more resources to perform CDC than a single mid-size SQL Server database.
You may plan to run very frequent (resource-intensive) CDC jobs, etc.

Monitoring Disk Space on Hub Machine

Even though the HVR Hub System uses limited storage, a shortage of free disk space can significantly degrade repository database performance and therefore the performance of HVR. Standard database monitoring tools can be used to track the remaining disk space on the HVR Hub System machine, taking into account the type of repository database that has been installed. Because every database has its own storage requirements for healthy operation, it is important to set the alerting thresholds accordingly. The following values are to be used as guidelines only and not as reference architectures. In most cases, disk alerts should be set at 80%, 85% and 90% of capacity. Usage above 90% should be treated as a production support issue that requires immediately adding disk or freeing up storage.
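
Where no monitoring tool is in place yet, the thresholds above can be checked with a few lines of script. The sketch below is a minimal illustration using only the Python standard library; the level names and function names are assumptions, and the path should be replaced with the file system that holds HVR_CONFIG or the repository database.

```python
import shutil

# Guideline thresholds from this section, highest first.
# The level names are illustrative, not HVR terminology.
THRESHOLDS = ((90, "critical"), (85, "high"), (80, "warning"))


def classify_usage(percent_used: float) -> str:
    """Map a disk usage percentage to an alert level."""
    for limit, level in THRESHOLDS:
        if percent_used >= limit:
            return level
    return "ok"


def disk_alert_level(path: str) -> tuple[str, float]:
    """Return (alert level, percent used) for the file system containing path."""
    usage = shutil.disk_usage(path)
    percent_used = 100 * usage.used / usage.total
    return classify_usage(percent_used), percent_used


level, percent = disk_alert_level(".")
print(f"{percent:.1f}% used: {level}")
```

Run from a scheduler such as cron, a check like this can feed whatever alerting channel is already in use.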