Primary, Replication, and Distribution Keys in HVR
On this page:
- PK = Primary Key
- RK = Replication Key
- DK = Distribution Key
Question
- Is a PK mandatory for HVR replication?
- If a table does not have a PK, how does HVR handle updates and deletes?
- What effect does the parameter /DuplicateRows have?
- Does HVR take all the columns in a table to define a distribution key?
- Does HVR create the target table in a distributed DBMS itself, and if so, how does it know which distribution key to use?
Environment
HVR 5
Answer
Is a PK mandatory for HVR replication?
No. When a channel is built, HVR will inspect the source database for a good Replication Key (RK). If there is a Primary Key (PK), it will be used as RK. If there is no PK, it will look for a unique index to use as RK. If there are multiple unique indexes, it decides which to use as RK. If no unique key exists, then HVR will define a RK using all non-LOB datatypes. This last case is called an implicit replication key.
HVR also allows the RK to be defined or overridden manually, so it can (for example) contain different columns from the PK in the source.
NOTE: HVR ignores table's non-unique indexes. On Oracle, it also ignores unique indexes if that index only contain a single nullable column (because Oracle has a weak grasp of "uniqueness").
If a table does not have a PK, how does HVR handle updates and deletes?
As described above, all HVR tables have a RK. This key will typically match the PK if that existed. Normally HVR will do an UPDATE or a DELETE by putting the RK in the SQL WHERE clause.
However, if the RK is "implicit" (see above) and the HVR parameter /DuplicateRows is defined, then the updates will be performed instead as a DELETE
(with the RK in the WHERE
clause) and then an INSERT
(see next question for more details on /DuplicateRows).
What effect does the parameter /DuplicateRows have?
The parameter /DuplicateRows only affects tables that have an "implicit key" (see above). It has no effect on tables with an explicit RK, so it can safely be defined with \[table='*'\]
to only affect tables with implicit RKs. The parameter /DuplicateRows should be used if a table can actually contain multiple identical rows (for many tables with no PK or unique key in the DBMS dictionary the application will ensure duplicate rows never really happen).
In this case, /DuplicateRows will change the way how updates are integrated, i.e., updates are applied as a DELETE
and an INSERT
instead. Additionally, the SQL flavor affected that is used by HVR for deletes. Unnecessary use of /DuplicateRows can have performance costs, but a missing /DuplicateRows can lead to errors (e.g., "update affected more than X rows expected").
Does HVR take all the columns in a table to define a distribution key?
For a distributed DBMS, e.g., Greenplum, HVR has a concept of the Distribution Key (DK) for each table. These are the column(s) that HVR uses for hashing the storage when HVR creates its tables in the target DB. The DK should be defined manually using HVR parameter /DistributionKey. Choosing a good DK for each table is a manual process HVR is not involved in.
If no explicit DK is defined, then HVR just uses the first column of the RK. This default rule is weak normally, and terrible for a SAP table, because the first column of the RK is typically mandt
, which is constant.
Does HVR create the target table in a distributed DBMS itself, and if so, how does it know which distribution key to use?
If Integrate /Burst is defined, HVR is used to create the DBMS target table, but they can also be created directly. When HVR creates the DBMS target table, it will use the DK as a "distribution key" for the distributed DBMS.
Additionally, the table will have HVR's RK as an index; either with or without uniqueness, depending on whether HVR parameter /DuplicateKeys is defined. HVR also needs a temp
table for delivering bursts of changes into its target tables. These are called __b
. HVR always creates these itself. They don't get an index (not needed), but HVR will use the same DK for these as for the target table.
It is essential that the DK of this burst table matches the DK of the target table, otherwise HVR's burst-updates will be misaligned, which will either fail or be too slow. So if the DK of a target table is changed, the __b
burst table must also be dropped and recreated with the new DK.