Databricks as Target
Fivetran HVR supports integrating changes into a Databricks location. This section describes the configuration requirements for integrating changes using Integrate and Refresh into a Databricks location. For the list of supported Databricks versions into which HVR can integrate changes, see Integrate changes into location in Capabilities.
The preferred methods for writing data into Databricks are Burst Integrate and Bulk Refresh because they provide better performance. However, Burst Integrate and Bulk Refresh require staging files to be created in a temporary location. For more information about staging, see section Staging for Databricks.
Due to technical limitations, Databricks on Azure is not supported in HVR releases 6.1.5/3 through 6.1.5/9.
Grants for Integrate and Refresh
This section lists the grants/permissions required for integrating changes into Databricks.
The following permissions are required for a cluster that is part of a Unity Catalog:
To create target tables using Refresh, the HVR database user must be granted the CREATE TABLE, USE SCHEMA, and USE CATALOG privileges:

GRANT CREATE TABLE ON SCHEMA schema_name TO `user or group`;
GRANT USE SCHEMA ON SCHEMA schema_name TO `user or group`;
GRANT USE CATALOG ON CATALOG catalog_name TO `user or group`;
To Integrate changes or Refresh data into target tables, the HVR database user must be granted the MODIFY privilege:

GRANT MODIFY ON TABLE table_name TO `user or group`;
To query a table, the HVR database user must be granted the SELECT, USE SCHEMA, and USE CATALOG privileges:

GRANT SELECT ON TABLE table_name TO `user or group`;
GRANT USE SCHEMA ON SCHEMA schema_name TO `user or group`;
GRANT USE CATALOG ON CATALOG catalog_name TO `user or group`;
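If you want to confirm that these grants are in place, you can query them from a Databricks SQL session; the catalog, schema, and principal names below are placeholders:

-- List privileges granted on the schema and its catalog
SHOW GRANTS ON SCHEMA catalog_name.schema_name;
SHOW GRANTS ON CATALOG catalog_name;
-- Check the privileges of a specific principal (hypothetical user name)
SHOW GRANTS `hvr_user` ON SCHEMA catalog_name.schema_name;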
For more information about these privileges, refer to the Databricks documentation - Azure, AWS.
Additional Considerations or Permissions
- Cluster access to cloud store: The Databricks cluster must have access to the cloud store configured as the staging area. This is crucial for the COPY INTO command to function without errors when using Staging for Databricks.
- External tables and locations: When using Unity Catalog and external tables, configure an external location for the root location of external tables on the Azure Databricks or AWS Databricks platforms. This ensures external tables are correctly accessed and managed within Databricks. For more information about external locations, refer to the Databricks documentation - Azure, AWS. An illustrative example of defining an external location is shown after this list.
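The statements below are only a sketch of defining an external location in Unity Catalog; the location name, URL, credential name, and principal are placeholders, and the storage credential is assumed to already exist:

-- Register the cloud path that will hold the external tables (placeholder URL)
CREATE EXTERNAL LOCATION IF NOT EXISTS hvr_ext_location
  URL 'abfss://container@storageaccount.dfs.core.windows.net/hvr/'
  WITH (STORAGE CREDENTIAL hvr_storage_cred);
-- Allow the HVR database user to read and write files in that location
GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION hvr_ext_location TO `user or group`;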
Dropping Columns from Delta Tables
Databricks Runtime version 11 introduced support for the ALTER TABLE DROP COLUMN command. If you have action AdaptDDL configured and want to avoid a full Refresh of the target table when using ALTER TABLE DROP COLUMN on a source table, you must enable Delta Lake column mapping on the target table in Databricks.
For more information about enabling Delta Lake column mapping, refer to the Databricks documentation.
Prior to Databricks Runtime version 11, only the ALTER TABLE ADD COLUMN operation was supported by Databricks. HVR managed any other DDL changes through an extensive process, which involved copying the target table in Databricks to a temporary table, dropping and re-creating the target table, and then transferring the rows from the temporary table back to the new target table.
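The following statements are only an illustrative sketch of such a copy-and-recreate sequence, using hypothetical table and column names; HVR performs these steps internally, and the exact statements it issues may differ:

-- Preserve the existing data in a temporary copy
CREATE TABLE target_table_tmp AS SELECT * FROM target_table;
-- Drop and re-create the target table with the new column layout
DROP TABLE target_table;
CREATE TABLE target_table (id BIGINT, name STRING);
-- Move the surviving columns back and clean up
INSERT INTO target_table SELECT id, name FROM target_table_tmp;
DROP TABLE target_table_tmp;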
To enable column mapping for existing tables, use the following command:
ALTER TABLE table_name SET TBLPROPERTIES (
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5',
  'delta.columnMapping.mode' = 'name'
)
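After enabling column mapping, you can verify the table properties and, if required, drop a column directly; the table and column names below are placeholders:

-- Confirm that delta.columnMapping.mode is set to 'name'
SHOW TBLPROPERTIES table_name;
-- With column mapping enabled, a column can be dropped in place
ALTER TABLE table_name DROP COLUMN column_name;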
For new tables, Delta Lake column mapping is not enabled by default. When creating target tables in Databricks using Refresh, you must set the Delta table properties to enable column mapping. This setting affects only new tables and does not override or replace properties set on existing tables.
For more information about Delta table properties, refer to the Databricks documentation.
To enable column mapping for new tables, use the following SparkSession configuration commands:
SET spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true
SET spark.databricks.delta.properties.defaults.autoOptimize.autoCompact = true
SET spark.databricks.delta.properties.defaults.minReaderVersion = 2
SET spark.databricks.delta.properties.defaults.minWriterVersion = 5
SET spark.databricks.delta.properties.defaults.columnMapping.mode = name
Alternatively, you can set these Delta table properties by defining the action Environment for the Databricks location:
| Action | Parameters |
|---|---|
| Environment | Name=HVR_DATABRICKS_TBLPROPERTIES<br>Value="delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true, delta.minReaderVersion = 2, delta.minWriterVersion = 5, delta.columnMapping.mode = 'name'" |
Column Names Forced Case
Since v6.1.0/37
This option allows you to control the case of column names created in the target Databricks tables. It applies to Activating Replication, Refresh, and Compare.
Available options are:
- No (default): Create column names in the same case as received from the source, which can be either lowercase or uppercase.
- Upper: Create column names in uppercase.
- Lower: Create column names in lowercase.
This option is equivalent to the location property ForceCase.
Location Time Zone
Since v6.1.5/5
The time zone of the HVR Agent and the Databricks database must be the same. This option allows you to specify the time zone for the Databricks location. In the HVR User Interface, you can specify the time zone using this option while creating a location (under the Configure Capture/Integrate for Databricks Location section) or by editing the existing location's source and target properties.
The equivalent location property for this field is Connection_Timezone_Name.
Instructions for HVR versions up to 6.1.5/5
If the time zone of the HVR Agent and the Databricks database are not the same, define the environment variables TZ and HVR_SQL_INIT as follows to make them the same:
| Group | Table | Action | Parameter(s) |
|---|---|---|---|
| Databricks | * | Environment | Name=TZ<br>Value=time_zone |
| Databricks | * | Environment | Name=HVR_SQL_INIT<br>Value=SET TIME ZONE 'time_zone' |
The environment variable HVR_SQL_INIT runs the SQL statement SET TIME ZONE 'time_zone' when the database connection is established.
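For example, with an illustrative time zone value, the statement issued at connection time and a quick way to verify the session setting look like this:

-- Statement run via HVR_SQL_INIT when the connection is established (time zone value is a placeholder)
SET TIME ZONE 'America/New_York';
-- Verify the session time zone in Databricks SQL
SELECT current_timezone();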
Limitations
For Databricks Runtime versions earlier than 11.2, the DEFAULT constraint is not supported. As a result, Databricks cannot automatically provide a default value for a column when changing its nullability. Therefore, if you need to alter a nullable column to NOT NULL in the target, you must first ensure that all existing NULL values in the source are updated with non-NULL values.
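A minimal sketch of that sequence, using hypothetical table and column names (the UPDATE runs on the source database, while the constraint change applies to the Delta table on the target):

-- On the source: replace existing NULLs with a non-NULL value before changing nullability
UPDATE source_table SET status = 'unknown' WHERE status IS NULL;
-- On the Databricks target: tighten the column once no NULLs remain
ALTER TABLE target_table ALTER COLUMN status SET NOT NULL;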