Slicing
The more data you have, the more time it takes to process. When the amount of data you want to replicate is massive, it is more efficient to separate the data into a few slices and process them separately. In Fivetran HVR, this is achieved with the Slicing functionality.
Slicing divides your source table into a few pieces while Refreshing or Comparing the data.
Slicing has the following benefits:
It speeds up the Refresh or Compare job;
It helps to avoid high memory or CPU load for long-running jobs.
Slicing Types
HVR provides a few types of slicing, each of them best fitting a specific business case.
Modulo Slicing
During Modulo Slicing, HVR performs a modulo operation using the values from the column you choose. This type of slicing is the easiest of the four Slicing types to set up.
Basically, the modulo operation returns the remainder of a division.
For example:
- 15 mod 5 = 0 (15 divided by 5 equals 3, with the remainder of 0);
- 15 mod 4 = 3 (15 divided by 4 equals 3, with the remainder of 3).
When it comes to databases, the same operation is performed to slice the source table. The modulo value you choose will define two things: which rows will be in each slice, and the number of slices.
For example, you have a table with the column customer_id that has values from 0 to 1000. If you set the modulo value to 10, HVR will slice your table into 10 slices as follows:
customer_id mod 10 = 0 (where customer_id = 0, 10, 20, ... 1000)
customer_id mod 10 = 1 (where customer_id = 1, 11, 21, ... 991)
[...]
customer_id mod 10 = 8 (where customer_id = 8, 18, 28, ... 998)
customer_id mod 10 = 9 (where customer_id = 9, 19, 29, ... 999)
For each customer_id, HVR performs a modulo operation and sorts the row into the respective slice depending on the remainder.
- Only a column with numeric values can be used for Modulo Slicing.
- Modulo slicing is limited only to the databases that support modulo database function.
- For Modulo slicing, parameters SliceCountCondition and SliceSeriesCondition in the Restrict action must not be defined.
Count Slicing
Count Slicing is a bit more advanced version of Modulo slicing. With it, you can choose to perform Slicing on a column that contains values such as strings or dates.
To use Count Slicing, you need to define the Restrict action with the parameter SliceCountCondition={hvr_slice_num} and {hvr_slice_total}.
For example, for the Oracle source, the defined Restrict might look as follows:
Restrict SliceCountCondition=mod(coalesce(ora_hash(mycol),0),{hvr_slice_total})={hvr_slice_num}
Boundary Slicing
Boundary Slicing requires a column with the known range of possible values. It groups the data set depending on defined boundaries for the chosen column.
For this type of Slicing, you set the boundaries for each slice of your table. For example you have chosen the column user_id (the boundaries you set can be strings, numerics, or dates) and specify boundaries to 1000, 3500, and 6000.
As the result, you'll have 4 slices:
with user_id from the minimum user_id to 1000
with user_id from 1001 to 3500
with user_id from 3501 to 6000
with user_id from 6001 to the maximum user_id in the table
For Boundary slicing, parameters SliceCountCondition and SliceSeriesCondition in the Restrict action must not be defined.
Series Slicing
Series Slicing is an advanced option of Boundary Slicing. It groups the data set on the basis of a column value, where you define a slice condition that uses a particular value. It will work well if you have a few distinct groups of data, for example divided by country code.
Series Slicing requires a column with a small set of unique values.
For example, one of the columns in this table has country code values, and the rows are distributed as follows:
mycol | Number of Rows with a specific 'mycol' value |
---|---|
DE | 19081 |
LX | 10094 |
BE | 14159 |
NL | 13196 |
Here, if you apply Slicing based on the mycol column, the NL slice would have 13196 rows, and the LX slice would have 10094 rows.
To use Series Slicing, you need to define action Restrict with parameter SliceSeriesCondition={hvr_slice_value}.
Example
Assuming SliceSeriesCondition="{hvr_slice_value}", the Series Values field should be filled in as follows:
mycol = 'DE' mycol = 'LX' mycol = 'BE' mycol = 'NL'
Assuming SliceSeriesCondition="mycol = {hvr_slice_value}", the Series Values field should be filled in as follows:
'DE' 'LX' 'BE' 'NL'
Slicing Limitations
For Bulk Refresh, slicing can only be done on the source table.
For Compare and for row-wise Refresh, slicing can be done on the source or the target table. But only the columns that are present in both source and target can be used for slicing.
Limitations of Slicing for Refresh
Modulo Slicing
Following are the limitations when using slicing with modulo of numbers (col%num):
- Modulo Slicing only works on numeric data types. It may work with binary float values depending on DBMS data type handling.
- For some supported DBMSes (SQL Server, PostgreSQL, Greenplum, Redshift), Modulo Slicing on a float column is not allowed as it may result in SQL query error.
- For some DBMSes, float values above the limits of DBMS’s underlying precision ("big float values") may produce inaccurate results during Modulo Slicing. This affects only heterogeneous environments.
- Refresh with Modulo Slicing on a column with "big float values" may produce inaccurate results in HANA even in a homogeneous environment (HANA-to-HANA).
- Heterogeneous Refresh (between different DBMSes or file locations) has a limitation with Modulo slicing on the Oracle’s NUMBER(*) column: if a value has an exponent larger than 37 (e.g. if a number is larger than 1E+37 or smaller than -1E+37), then this row might be associated with a wrong slice. This column should not be used for Modulo Slicing. The exact limits of the values depend on the number of slices.
A workaround for the above limitations is to use Boundary or Count Slicing with custom SQL expressions.
Boundary Slicing
Boundary Slicing does not work for dates in heterogeneous DBMSes.
Limitations of Slicing for Direct File Compare
An Oracle’s NUMBER(*) column or a similar column with big numbers (namely, if a number is larger than 1E+37 or smaller than -1E+37) cannot be used in direct file compare with slicing. A workaround would be to use another column for slicing.
Limitations of Slicing for SAP Unpack
Slicing is only possible on columns that exist in the source and only on columns that do not break document consistency. For cluster tables, slicing on PAGENO is not allowed, for long text (STXL) slicing on SRTF2 is not allowed.
For SAP Unpack, slicing based on row-count from last Refresh or Compare might not work, since the information on the cluster table RFBLG may not be there.
When slicing tables that have SapUnpack parameter, you must select the PackedInside table during Refresh. Slicing will be done on the PackedInside table. For example, when refreshing BSEG, select both BSEG and RFBLG and slice on RFBLG.
In most SAP systems, slicing on Client ID (MANDT) does not result in efficient slicing.
Proposed slicing:
- Important cluster tables include BSEG and KONV.
- Slicing on RFBLG (BSEG) can be done on columns BUKRS (company code), BELNR (document number) and GJAHR (fiscal year).
- Slicing on KOCLU (KONV) can be done on the column KNUMV (document number) only.
- Slicing on STXL is best done on the column TDOBJECT (text object information).
- Pool tables are best sliced on column TABNAME. This column contains the name of the unpack target table, values should be based on tables selected to unpack.