Staging for Databricks
While running Burst Integrate or Bulk Refresh, Fivetran HVR can write data into a temporary directory ('staging files') before loading it into a target database. For best performance, HVR performs Burst Integrate and Bulk Refresh into Databricks using staging files. HVR implements file staging on Databricks as follows:
- HVR first writes data into the staging directory on the configured platform (AWS or Azure).
- HVR then uses the Databricks SQL command COPY INTO to ingest data into the target tables from the staging directory.
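For illustration only, the statement HVR issues is conceptually similar to the following Databricks SQL sketch; the schema, table, bucket, directory, and format options shown here are hypothetical, and the statement HVR actually generates may differ.

-- Hypothetical sketch of a COPY INTO load from an S3 staging directory.
-- HVR generates the real statement; names, paths, and options will differ.
COPY INTO my_schema.my_table
FROM 's3://my-hvr-bucket/my_staging_dir/my_table/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'false')
COPY_OPTIONS ('force' = 'true');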
Configuring Staging
HVR must be configured to stage the data on a staging platform (AWS or Azure) before loading it into Databricks.
If the managed secrets feature is enabled, the USE TOKEN INSTEAD option is displayed in the fields designated for entering secrets.
Databricks on AWS
Since v6.1.0/13
For staging the data on AWS S3 and performing Burst Integrate or Bulk Refresh, the following are required:
An AWS S3 location (bucket) to store temporary data to be loaded into Databricks. For more information about creating and configuring an S3 bucket, refer to the AWS Documentation.
An AWS user with the AmazonS3FullAccess permission policy to access the S3 bucket. Alternatively, an AWS user with the following minimal set of permissions can be used:
- s3:GetBucketLocation
- s3:ListBucket
- s3:ListBucketMultipartUploads
- s3:AbortMultipartUpload
- s3:GetObject
- s3:PutObject
- s3:DeleteObject
Sample JSON with a user role permission policy for S3 location
{ "Statement": [ { "Sid": <identifier>, "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::<account_id>:<user>/<username>", }, "Action": [ "s3:GetObject", "s3:GetObjectVersion", "s3:PutObject", "s3:DeleteObject", "s3:DeleteObjectVersion", "s3:AbortMultipartUpload" ], "Resource": "arn:aws:s3:::<bucket_name>/*" }, { "Sid": <identifier>, "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::<account_id>:<user>/<username>" }, "Action": [ "s3:ListBucket", "s3:GetBucketLocation", "s3:ListBucketMultipartUploads" ], "Resource": "arn:aws:s3:::<bucket_name>" } ] }
For minimal permissions, HVR also supports AWS temporary security credentials in IAM. There are two ways to request temporary credentials from the AWS Security Token Service (STS):
Using a combination of AWS STS Role ARN, AWS Access Key Id, and AWS Secret Access Key
Sample JSON
{ "Version": "2012-10-17", "Statement": [ { "Sid": "", "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::<account_id>:<user>/<username>" }, "Action": "sts:AssumeRole" } ] }
Using a combination of AWS STS Role ARN and AWS IAM Role (a role that has access to an EC2 machine)
Sample JSON
{ "Version": "2012-10-17", "Statement": [ { "Sid": "", "Effect": "Allow", "Principal": { "AWS": [ "arn:aws:iam::<account_id>:<user>/<username>", "arn:aws:iam::<account_id>:<role>/<username>" ] }, "Action": "sts:AssumeRole" } ] }
For more information on the Amazon S3 permissions policy, temporary security credentials, and IAM roles, refer to the AWS documentation.
Define the following location properties (while creating a location or by editing the existing location's source and target properties) for the Databricks location:
The Location Property equivalent to each UI field is shown in parentheses below.
STORAGE (File_Scheme): Protocol for connecting HVR to S3.
- Select Amazon S3 (HTTPS).
BUCKET (S3_Bucket): Name or IP address of the Amazon S3 bucket.
Credentials or Instance Profile: AWS security credentials. For more information about getting your AWS credentials or Instance Profile Role, refer to the AWS Documentation.
- If Credentials is selected, the location property KEY ID (AWS_Access_Key_Id) and SECRET KEY (AWS_Secret_Access_Key) must be supplied.
- KEY ID: Access key ID of the IAM user for connecting HVR to Amazon S3.
- SECRET KEY: Secret access key of the IAM user for connecting HVR to Amazon S3.
- If Instance Profile is selected, the location property IAM ROLE (AWS_IAM_Role) must be supplied.
- IAM ROLE: AWS IAM role name for connecting HVR to Amazon S3.
STAGING DIRECTORY (Staging_Directory): Location where HVR will create the temporary staging files inside the S3 bucket (e.g., /my_staging_dir).
Create Delta Tables in an External Location: Select this option if you want to create external tables, and then define the following location properties available under this option (an illustrative sketch of such an external table follows this list):
EXTERNAL LOCATION (Databricks_Location): Path for the external tables in Databricks. This can be a mount path /mnt/... (optionally prefixed with dbfs:) or an s3:// URL. If a path is defined without specifying the dbfs:/ or s3://, it is assumed to be a mount path beginning with dbfs:/.
EXTERNAL LOCATION S3S URL (Databricks_Location_S3S): URL (s3s://) for the external tables in Databricks. This is required only if the EXTERNAL LOCATION is set to a mount path (/mnt/... or dbfs:/).
s3s is the same as accessing Amazon S3 using HTTPS.
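For illustration only, an external Delta table created under an EXTERNAL LOCATION is conceptually equivalent to a definition like the Databricks SQL sketch below; the schema, table, columns, and S3 path are hypothetical, and HVR derives the actual definition from the channel's table definitions.

-- Hypothetical sketch of an external Delta table under an s3:// external location.
-- HVR generates the actual DDL; names, columns, and the path will differ.
CREATE TABLE my_schema.my_table (
  id BIGINT,
  name STRING,
  updated_at TIMESTAMP
)
USING DELTA
LOCATION 's3://my-hvr-bucket/my_external_dir/my_table';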
By default, HVR connects to us-east-1 once to determine your S3 bucket region. If a firewall restriction or a service such as AWS PrivateLink prevents HVR from determining your S3 bucket region, you can change this region (us-east-1) to the region where your S3 bucket is located by defining the following action:
- Group: Databricks
- Table: *
- Action: Environment
- Parameters: Name=HVR_S3_BOOTSTRAP_REGION, Value=s3_bucket_region
Databricks on Azure
Due to technical limitations, Databricks on Azure is not supported in HVR releases 6.1.5/3 through 6.1.5/9.
For staging the data on Azure (ADLS) and performing Burst Integrate or Bulk Refresh, the following are required:
An ADLS storage location (container) to store temporary data to be loaded into Databricks on Azure.
An ADLS user (storage account) to access the container.
Define the following location properties (while creating a location or by editing the existing location's source and target properties) for the Databricks on Azure location:
The Location Property equivalent to each UI field is shown in parentheses below.
- STORAGE (File_Scheme): Protocol for connecting HVR to Databricks on Azure.
- Select Azure Data Lake Storage Gen2 (HTTPS).
- ACCOUNT (ABFS_Account): Name of the ADLS account.
- CONTAINER (ABFS_Container): Name of the container available within the ADLS account.
- AUTHENTICATION METHOD (ABFS_Authentication_Method): Authentication method for connecting HVR to the ADLS server. The available option is:
- Client Credentials
- OAUTH2 ENDPOINT (Azure_OAuth2_Endpoint): URL of the OAuth2 token endpoint used for obtaining the bearer token with the client credentials.
- CLIENT ID (Azure_OAuth2_Client_Id): Client ID (or application ID) used for obtaining the Microsoft Entra ID (formerly Azure Active Directory) access token.
- CLIENT SECRET KEY (Azure_OAuth2_Client_Secret): Secret key of the CLIENT ID.
- STAGING DIRECTORY (Staging_Directory): Location where HVR will create the temporary staging files inside the ADLS container (e.g., /my_staging_dir).
- Create Delta Tables in an External Location: If you want to create external tables, then define the following location properties available under this option (an illustrative sketch of such an external table follows this list):
- EXTERNAL LOCATION (Databricks_Location): Path for the external tables in Databricks. This can be a mount path /mnt/... (optionally prefixed with dbfs:) or an abfss://... URL. If a path is defined without specifying dbfs:/ or abfss://, it is assumed to be a mount path beginning with dbfs:/.
- EXTERNAL LOCATION ABFSS (Databricks_Location_ABFSS): URL (abfss://) for the external tables in Databricks. This is required only if the EXTERNAL LOCATION is set to a mount path (/mnt/... or dbfs:/...).
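For illustration only, an external Delta table created under an abfss:// EXTERNAL LOCATION is conceptually equivalent to a definition like the Databricks SQL sketch below; the schema, table, columns, storage account, and container names are hypothetical, and HVR derives the actual definition from the channel's table definitions.

-- Hypothetical sketch of an external Delta table under an abfss:// external location.
-- HVR generates the actual DDL; names, columns, and the path will differ.
CREATE TABLE my_schema.my_table (
  id BIGINT,
  name STRING,
  updated_at TIMESTAMP
)
USING DELTA
LOCATION 'abfss://mycontainer@myaccount.dfs.core.windows.net/my_external_dir/my_table';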