Databricks as Target
Fivetran HVR supports integrating changes into a Databricks location. This section describes the configuration requirements for integrating changes using Integrate and Refresh into a Databricks location. For the list of supported Databricks versions into which HVR can integrate changes, see Integrate changes into location in Capabilities.
The preferred methods for writing data into Databricks are Burst Integrate and Bulk Refresh, as they provide better performance. However, Burst Integrate and Bulk Refresh require staging files to be created in a temporary location. For more information about staging, see section Staging for Databricks.
Due to technical limitations, Databricks on Azure is not supported in HVR releases 6.1.5/3 through 6.1.5/9.
Grants for Integrate and Refresh
This section lists the grants/permissions required for integrating changes into Databricks.
Following are the permissions required for a cluster that is part of a Unity Catalog:
To create target tables using Refresh, the HVR database user must be granted CREATE TABLE, USE SCHEMA, and USE CATALOG privileges:

```sql
GRANT CREATE TABLE ON SCHEMA schema_name TO `user or group`;
GRANT USE SCHEMA ON SCHEMA schema_name TO `user or group`;
GRANT USE CATALOG ON CATALOG catalog_name TO `user or group`;
```

To Integrate changes or Refresh data into target tables, the HVR database user must be granted the MODIFY privilege:

```sql
GRANT MODIFY ON TABLE table_name TO `user or group`;
```

To query a table, the HVR database user must be granted SELECT, USE SCHEMA, and USE CATALOG privileges:

```sql
GRANT SELECT ON TABLE table_name TO `user or group`;
GRANT USE SCHEMA ON SCHEMA schema_name TO `user or group`;
GRANT USE CATALOG ON CATALOG catalog_name TO `user or group`;
```
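For example, assuming a hypothetical catalog `main`, schema `main.hvr_target`, table `main.hvr_target.orders`, and database user `hvr_user` (none of these names are required by HVR), the combined set of grants could look like this:

```sql
-- Hypothetical names: catalog "main", schema "hvr_target", user "hvr_user".
GRANT USE CATALOG ON CATALOG main TO `hvr_user`;
GRANT USE SCHEMA ON SCHEMA main.hvr_target TO `hvr_user`;
GRANT CREATE TABLE ON SCHEMA main.hvr_target TO `hvr_user`;

-- Per-table privileges for tables HVR integrates into or compares.
GRANT SELECT, MODIFY ON TABLE main.hvr_target.orders TO `hvr_user`;
```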
For more information about these privileges, refer to the Databricks documentation - Azure, AWS.
Additional Considerations or Permissions
- Cluster access to cloud store: The Databricks cluster must have access to the cloud store configured as the staging area. This is crucial for the COPY INTO command to function without errors when using Staging for Databricks.
- External tables and locations: When using Unity Catalog and external tables, configure an external location for the root location of external tables on the Azure Databricks or AWS Databricks platform. This ensures external tables are correctly accessed and managed within Databricks. For more information about external locations, refer to the Databricks documentation - Azure, AWS. A minimal setup sketch follows this list.
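As a sketch of what the external location setup can look like in Databricks SQL (the location name, storage URL, credential, and principal below are placeholders, not values HVR requires):

```sql
-- Register the staging path as a Unity Catalog external location
-- (URL, credential, and principal are hypothetical).
CREATE EXTERNAL LOCATION IF NOT EXISTS hvr_stage_location
  URL 'abfss://staging@mystorageaccount.dfs.core.windows.net/hvr'
  WITH (STORAGE CREDENTIAL hvr_stage_credential);

-- Allow the HVR database user to read and write staged files.
GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION hvr_stage_location TO `hvr_user`;
```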
Dropping Columns from Delta Tables
Databricks Runtime version 11 introduced support for the ALTER TABLE DROP COLUMN command. If you have action AdaptDDL configured and want to avoid a full Refresh of the target table when using ALTER TABLE DROP COLUMN on a source table, you must enable Delta Lake column mapping on the target table in Databricks.
For more information about enabling Delta Lake column mapping, refer to the Databricks documentation.
Prior to Databricks Runtime version 11, only the ALTER TABLE ADD COLUMN operation was supported by Databricks. HVR managed any other DDL changes through an extensive process, which involved copying the target table in Databricks to a temporary table, dropping and re-creating the target table, and then transferring the rows from the temporary table back to the new target table.
To enable column mapping for existing tables, use the following command:

```sql
ALTER TABLE table_name SET TBLPROPERTIES (
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5',
  'delta.columnMapping.mode' = 'name'
);
```
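Once column mapping is enabled, a column drop on the target becomes an in-place operation. For illustration only (table and column names are hypothetical), the kind of statement that can then succeed is:

```sql
-- With columnMapping.mode = 'name' in place, dropping a column no longer
-- requires rewriting the table (names below are placeholders).
ALTER TABLE my_schema.orders DROP COLUMN legacy_flag;
```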
For new tables, Delta Lake column mapping is not enabled by default. When creating target tables in Databricks using Refresh, you must set the Delta table properties to enable column mapping. This setting affects only new tables and does not override or replace properties set on existing tables.
For more information about Delta table properties, refer to the Databricks documentation.
To enable column mapping for new tables, use the following SparkSession configuration commands:
```sql
SET spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true
SET spark.databricks.delta.properties.defaults.autoOptimize.autoCompact = true
SET spark.databricks.delta.properties.defaults.minReaderVersion = 2
SET spark.databricks.delta.properties.defaults.minWriterVersion = 5
SET spark.databricks.delta.properties.defaults.columnMapping.mode = name
```
Alternatively, you can set these Delta table properties by defining the action Environment for the Databricks location:
| Action | Parameters |
|---|---|
| Environment | Name=HVR_DATABRICKS_TBLPROPERTIES Value="delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true, delta.minReaderVersion = 2, delta.minWriterVersion = 5, delta.columnMapping.mode = 'name'" |
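To confirm that a table created by Refresh picked up these defaults, you can inspect its Delta properties; the table name below is a placeholder:

```sql
-- Check the column mapping mode on a newly created (hypothetical) target table.
SHOW TBLPROPERTIES main.hvr_target.orders ('delta.columnMapping.mode');
```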
Column Names Forced Case
Since v6.1.0/37
This option allows you to manage the case sensitivity of column names created in the target Databricks tables. This applies to Activating Replication, Refresh, and Compare.
Available options are:
- No (default): Create column names in the same case as received from the source, which can be either lowercase or uppercase.
- Upper: Create column names in uppercase.
- Lower: Create column names in lowercase.
This option is equivalent to the location property ForceCase.
Location Time Zone
Since v6.1.5/5
The time zone of the HVR Agent and the Databricks database must be the same. This option allows you to specify the time zone for the Databricks location. In the HVR User Interface, you can specify the time zone using this option while creating a location (under the Configure Capture/Integrate for Databricks Location section) or by editing the existing location's source and target properties.
The equivalent location property for this field is Connection_Timezone_Name.
Instructions for HVR versions up to 6.1.5/5
If the time zone of the HVR Agent and the Databricks database are not the same, the environment variables TZ and HVR_SQL_INIT must be defined to make them the same:
| Group | Table | Action | Parameter(s) |
|---|---|---|---|
| Databricks | * | Environment | Name=TZ Value=time_zone |
| Databricks | * | Environment | Name=HVR_SQL_INIT Value=SET TIME ZONE 'time_zone' |
The environment variable HVR_SQL_INIT runs the SQL statement SET TIME ZONE 'time_zone' when the database connection is established.
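For example, with a placeholder time zone of America/New_York, the statement issued at connect time and a quick check of the session time zone would be:

```sql
-- Statement executed via HVR_SQL_INIT when the connection is established
-- ('America/New_York' is a placeholder value for time_zone).
SET TIME ZONE 'America/New_York';

-- Verify the session time zone on the Databricks side.
SELECT current_timezone();
```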
Intermediate Directory
This option in the HVR UI allows you to specify a directory path for storing intermediate (temporary) files generated during Compare. These files are created during both "direct file compare" and "online compare" operations.
Using an intermediate directory can enhance performance by ensuring that temporary files are stored in a location optimized for the system's data processing needs.
This setting is particularly relevant for target file locations, as it determines where the intermediate files are placed during the Compare operation. If this option is not enabled, the intermediate files are stored by default in the integratedir/_hvr_intermediate directory, where integratedir is the replication DIRECTORY (File_Path) defined for the target file location.
This option is equivalent to the location property Intermediate_Directory.
Intermediate Directory is Local
This option in the HVR UI specifies that the Intermediate Directory is created on the local drive of the file location's server.
This setting is crucial for optimizing performance, as it reduces network latency and avoids potential permission issues associated with remote storage. By storing intermediate files locally, HVR can process data more efficiently, taking advantage of the speed and reliability of local storage.
This option is particularly beneficial when the HVR Agent has access to ample local storage, enabling it to handle large data volumes without relying on networked storage solutions.
This option is equivalent to the location property Intermediate_Directory_Is_Local.
Limitations
For Databricks runtime versions earlier than 11.2, the DEFAULT constraint is not supported. As a result, a default value cannot be supplied automatically for a column when its nullability changes. Therefore, if you need to alter a nullable column to NOT NULL on the target, you must first ensure that all existing NULL values in the source are updated to non-NULL values.
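As an illustration of that order of operations (table, column, and fallback value are hypothetical): backfill NULLs on the source first, then tighten the constraint on the target.

```sql
-- 1. On the source, replace existing NULLs with a real value.
UPDATE orders SET status = 'UNKNOWN' WHERE status IS NULL;

-- 2. On the Databricks target (runtime 11.2 or later), the column can then
--    be made NOT NULL.
ALTER TABLE main.hvr_target.orders ALTER COLUMN status SET NOT NULL;
```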