Compare
After you have run the Refresh job and the data replication process has started, you might want to check that the data in source and target Locations is in sync. For that, Fivetran HVR has the Compare feature. It allows you to compare data in two or more locations (databases) in a Channel. It compares the table structures as well as the data they contain. You can compare a single source location against multiple target locations.
The benefits of using HVR Compare feature are the following:
Compare will ensure that no data is missing, for example due to a human error
Compare allows to compare an entire database (all tables) or specific tables only
Compare works heterogeneously (across a variety of databases). This means that your source can be compared against the target location(s) no matter if they are the same or different database types
The following methods are available for executing Compare:
- via UI – see Comparing Data
- via CLI – see hvrcompare
- via API – see /api/latest/hubs/{hub}/channels/{channel}/compare in Activate, Refresh, and Compare Interface.
Compare Types
If your source table has been pre-populated with data, there are two types of Compare you can choose from:
Bulk Compare: HVR calculates the checksum for each table in the channel and compares these checksum to report whether the replicated tables are identical.
Row-by-Row Compare: HVR extracts the data from a source (read) location, compresses it, and transfers the data to a target (write) location(s) to perform a row level Compare. Each individual row is compared to produce a 'diff' result. For each detected difference, an SQL statement is written: an insert, update, or delete. This compare type is also referred to as row-wise compare in this documentation.
Bulk and online row-wise compare are not supported for packed SAP tables; only offline row-wise compare is available.
You can set a Compare type in the CLI using -g option of hvrcompare or in the UI enable Bulk Compare by selecting the Table Checksums Only option.
Online Compare
You can choose to perform online compare, which is a live Compare between locations with rapidly changing data. While performing a compare, if the online compare option is defined, HVR processes the changes that occur during the compare and does not miscount them as differences. You can define the online compare in CLI using -o option of hvrcompare or in the UI by selecting the Online Compare option.
Direct File Compare
HVR also allows you to perform the Compare on the file locations using the Direct File Compare method, which is performed against a file location. This Compare method is a faster alternative for file compare via Hive External Tables and also helps to avoid compare mismatches caused by data type coercion through Hive deserializer.
During direct file Compare, HVR reads and parses (deserialize) files directly from the file location instead of using the HIVE external tables (even if it is configured for that location). In direct file compare, the files of each table are sliced and distributed to prereader subtasks. Each prereader subtasks reads, sorts and parses (deserialize) the files to generate compressed(encrypted) intermediate files. These intermediate files are then compared with the database on the other side.
The number of prereader subtasks used during direct file Compare can be configured using the compare option File Prereaders per Table (CLI option -w).
The location to store the intermediate files generated during Compare can be configured using the location property Intermediate_Directory.
To perform a direct file compare:
against a source file location, action Capture with parameter Pattern should be defined.
against a target file location, action Integrate with parameter ComparePattern should be defined.
Limitations
Direct file Compare does not support Avro, Parquet or JSON file formats.
Direct file Compare is not supported if action Restrict with parameter RefreshCondition is defined on a file location involved in the compare.
Direct file Compare is not supported when the channel is a 'blob' file channel. A blob file channel has no table information and simply treats each file as a sequence of bytes without understanding their file format.
Direct file Compare for XML files requires each XML file to contain a single table.
Slicing
Sometimes, the amount of data that the Compare job needs to process is too big. In this case, you can choose to divide the table into a few batches and process them in parallel. In HVR, this is achieved via the Slicing functionality. By configuring Slicing, you can divide your database table into a few pieces that will be processed in parallel saving you a lot of time.
HVR suggests a few types of slicing, each fitting best for a specific business case. For more information about slicing types and when it is best to use them, refer to the Slicing article.