Keys
In a database management system, a key is an attribute or a set of attributes that help to uniquely identify a row in a table. This section describes the types of keys and their roles in Fivetran HVR.
Primary Keys
Primary Key is an ID unique for each specific row in a database table. It consists of a single mandatory column or a set of mandatory columns. Primary keys are typically validated by transaction processing databases. However, analytical databases can consider the primary key to be declarative and not enforce it.
Replication Keys
Replication Key is a unique key that HVR uses to identify a row in a target table. HVR requires every table in a channel to have a replication key. When a channel is created, HVR inspects a source database for a "good" replication key. If it sees a primary key, then that is used for its replication key. If there is no primary key, HVR will look for a unique index as a replication key. If none of the above exists, then HVR will define the replication key using all non-LOB columns. This last case is called an implicit replication key.
If there are multiple uniqueness constraints on the capture table (e.g., a primary key and several unique indices), HVR uses a certain hierarchy rule to decide which one to use as a replication key (e.g. a primary key would 'win' and be considered).
The hierarchy rule is based on:
the source dictionary;
a unique key with mandatory columns;
or all non-LOB columns in the table.
You can add a column to a replication key (assuming that the implicit key has no explicitly-defined columns). This can be achieved using action ColumnProperties with parameter Key or TimeKey.
NOTE: HVR ignores a table's non-unique indexes. On Oracle, it also ignores unique indexes if that index only contains a single nullable column.
Distribution Keys
For distributed DBMSes, such as Greenplum, HVR has a concept of Distribution Key for each table. A distribution key is a column or group of columns used to determine the distribution of portions of the table across logical slices or nodes. These are the column(s) that HVR uses for hashing the storage when HVR creates its tables in a target database.
Some databases do not support distribution keys, and some have their own limitations. For example, Redshift only supports a single column as a distribution key, while Greenplum and Teradata support multi-column distribution keys. For good distribution of data across a database cluster, the distribution key values should ideally be unique, or otherwise, be well-distributed and almost unique.
There are two types of distribution keys in HVR:
- Explicit Distribution Key means that the distribution key has been explicitly defined by a user in HVR using action ColumnProperties with parameter DistributionKey.
- Implicit Distribution Key means that the distribution key has not been explicitly defined by a user in HVR. In this case, HVR will derive the distribution key from the first column of the replication key. To influence the selection of the key, refer to the TableProperties action. For example, you can eliminate unsuitable columns (DistributionKeyAvoidPattern) or set a maximum number of columns (DistributionKeyLimit) to be considered for the distribution key.
If action Integrate with parameter Method=Burst is defined, HVR creates target tables in a distributed DBMS, but they can also be created directly. When HVR creates the target tables, it will use a distribution key for the distributed DBMS. Additionally, the table will have the HVR replication key as an index (either with or without uniqueness depending on whether action TableProperties with parameter NoDuplicateRows is defined).
HVR also creates staging tables (ending with '__b') for delivering bursts of changes into its target tables. They do not get an index (not needed), but HVR will use for them the same distribution key as in the target tables. It is essential that the distribution key of the burst table matches the distribution key of the target table, otherwise the burst updates will be misaligned and the integrate job will either fail or be too slow.
NoDuplicateRows Parameter
Parameter NoDuplicateRows only affects tables that have an implicit replication key. It has no effect on tables with an explicit replication key. This parameter can be used if a table without a unique constraint does not contain multiple identical rows in the database. Doing so will increase performance because more efficient queries can be used in the target database.
If no replication key column(s) are defined on the table and this parameter is not set, then duplicates are allowed and all updates are treated as key updates and are replicated as a delete and an insert. Additionally, each delete is integrated using a special SQL subselect, which ensures only a single row is deleted, not multiple rows. Incorrect use of the NoDuplicateRows parameter can lead to errors if there are duplicates.