S3 Data Lake Setup Guide
Follow our setup guide to connect your Amazon S3 data lake to Fivetran.
Prerequisites
To connect your Amazon S3 data lake to Fivetran, you need the following:
- An AWS account that does not have multiple resource groups in the same AWS Region.
- An Amazon S3 bucket in one of the supported AWS Regions. For faster uploads and downloads and for optimum load times, we recommend that you create the bucket in the same Region as the data processing location of your destination. For more information about creating an Amazon S3 bucket, see AWS' documentation.
- (Applicable only to Iceberg tables) Access to AWS Glue Data Catalog in the same Region as the S3 bucket.
NOTE: You can create multiple groups within the same AWS Region in your AWS account. However, all groups in a given AWS Region share the same AWS Glue database. Avoid using the same schema/table combination across multiple groups within the same Region, because it can cause conflicts in the AWS Glue database tables and lead to sync failures.
Setup instructions
Choose your deployment model
Before setting up your destination, decide which deployment model best suits your organization's requirements. This destination supports both SaaS and Hybrid deployment models, offering flexibility to meet diverse compliance and data governance needs.
See our Deployment Models documentation to understand the use cases of each model and choose the model that aligns with your security and operational requirements.
NOTE:
- Hybrid Deployment support for S3 Data Lake destinations is in Private Preview.
- To use the Hybrid Deployment model, you must deploy your agent container on Kubernetes.
- You must have an Enterprise or Business Critical plan to use the Hybrid Deployment model.
Find External ID
In the destination setup form, find the automatically-generated External ID and make a note of it. You will need it to create an IAM role for Fivetran.
NOTE: The automatically-generated External ID is tied to your account. The ID does not change even if you close and re-open the setup form. For your convenience, you can keep the browser tab open in the background while you configure your destination.
Create IAM policy for S3 bucket
Open your Amazon IAM console.
Go to Policies, and then click Create Policy.
Go to the JSON tab.
Copy the following policy and paste it in the JSON editor.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowListBucketOfASpecificPrefix",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::{your-bucket-name}"
      ],
      "Condition": {
        "StringLike": {
          "s3:prefix": [
            "{prefix_path}/*"
          ]
        }
      }
    },
    {
      "Sid": "AllowAllObjectActionsInSpecificPrefix",
      "Effect": "Allow",
      "Action": [
        "s3:DeleteObjectTagging",
        "s3:ReplicateObject",
        "s3:PutObject",
        "s3:GetObjectAcl",
        "s3:GetObject",
        "s3:DeleteObjectVersion",
        "s3:PutObjectTagging",
        "s3:DeleteObject",
        "s3:PutObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::{your-bucket-name}/{prefix_path}/*"
      ]
    }
  ]
}
NOTE: Setting the "s3:prefix" condition to ["*"] grants access to all prefixes within the specified bucket, while setting it to ["{prefix_path}/*"] restricts access to a specific prefix path within the bucket.
In the policy, replace {your-bucket-name} with the name of your S3 bucket and {prefix_path} with the prefix path of your S3 bucket.
NOTE: If you do not specify a prefix path, the policy grants access to the entire S3 bucket instead of limiting our access to the objects under a specific prefix.
Click Next.
In the Policy name field, enter a name for your policy, and then click Create policy.
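If you prefer to generate the policy document rather than edit the placeholders by hand, the following is a minimal sketch of the substitution described above. The function name `build_s3_policy` and the example bucket and prefix values are hypothetical. The fallback to the "*" pattern when no prefix is given is an assumption based on the NOTE about "s3:prefix", not a documented Fivetran behavior.

```python
import json

# Object-level actions copied from the policy in this step.
OBJECT_ACTIONS = [
    "s3:DeleteObjectTagging",
    "s3:ReplicateObject",
    "s3:PutObject",
    "s3:GetObjectAcl",
    "s3:GetObject",
    "s3:DeleteObjectVersion",
    "s3:PutObjectTagging",
    "s3:DeleteObject",
    "s3:PutObjectAcl",
]


def build_s3_policy(bucket_name: str, prefix_path: str = "") -> dict:
    """Fill in {your-bucket-name} and {prefix_path} in the policy template."""
    prefix = prefix_path.strip("/")
    # Assumption: with no prefix, use "*" so the whole bucket is covered.
    list_pattern = f"{prefix}/*" if prefix else "*"
    object_path = f"{prefix}/*" if prefix else "*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowListBucketOfASpecificPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
                "Condition": {"StringLike": {"s3:prefix": [list_pattern]}},
            },
            {
                "Sid": "AllowAllObjectActionsInSpecificPrefix",
                "Effect": "Allow",
                "Action": OBJECT_ACTIONS,
                "Resource": [f"arn:aws:s3:::{bucket_name}/{object_path}"],
            },
        ],
    }


# Hypothetical bucket and prefix, for illustration only.
policy = build_s3_policy("example-bucket", "fivetran/landing")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the JSON editor in place of the template.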
Create IAM policy for AWS Glue Data Catalog
IMPORTANT:
- Perform this step only if you want to use Iceberg tables in your destination. Skip to the next step if you want to use Delta Lake tables.
- We do not recommend configuring multiple connectors with the same schema and table name in a single AWS Glue database. For more information, see our troubleshooting article.
In the Policies page, click Create Policy, and then go to the JSON tab.
Depending on your access requirements, copy one of the following policies and paste it in the JSON editor:
To enable the policy to access all your Glue databases and their tables, copy the following policy.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SetupFormTest",
      "Effect": "Allow",
      "Action": [
        "glue:DeleteDatabase"
      ],
      "Resource": [
        "arn:aws:glue:{your-catalog-region}:{your-account-id}:database/fivetran*",
        "arn:aws:glue:{your-catalog-region}:{your-account-id}:catalog",
        "arn:aws:glue:{your-catalog-region}:{your-account-id}:table/fivetran*/*",
        "arn:aws:glue:{your-catalog-region}:{your-account-id}:userDefinedFunction/fivetran*/*"
      ]
    },
    {
      "Sid": "AllConnectors",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:UpdateDatabase",
        "glue:CreateTable",
        "glue:GetTables",
        "glue:CreateDatabase",
        "glue:UpdateTable",
        "glue:BatchDeleteTable",
        "glue:DeleteTable",
        "glue:GetTable"
      ],
      "Resource": [
        "arn:aws:glue:{your-catalog-region}:{your-account-id}:*"
      ]
    }
  ]
}
To limit the access of the policy to specific Glue databases, copy the following policy.
NOTE: Whenever you add a new connector for your destination, you must update the policy with the new connector's details under the Sid:AllConnectors identifier.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SetupFormTest",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:UpdateDatabase",
        "glue:DeleteDatabase",
        "glue:CreateTable",
        "glue:GetTables",
        "glue:CreateDatabase",
        "glue:UpdateTable",
        "glue:BatchDeleteTable",
        "glue:DeleteTable",
        "glue:GetTable"
      ],
      "Resource": [
        "arn:aws:glue:{your-catalog-region}:{your-account-id}:database/fivetran*",
        "arn:aws:glue:{your-catalog-region}:{your-account-id}:catalog",
        "arn:aws:glue:{your-catalog-region}:{your-account-id}:table/fivetran*/*",
        "arn:aws:glue:{your-catalog-region}:{your-account-id}:userDefinedFunction/fivetran*/*"
      ]
    },
    {
      "Sid": "AllConnectors",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:UpdateDatabase",
        "glue:CreateTable",
        "glue:CreateDatabase",
        "glue:UpdateTable",
        "glue:DeleteTable",
        "glue:BatchDeleteTable",
        "glue:GetTable",
        "glue:GetTables"
      ],
      "Resource": [
        "arn:aws:glue:{your-catalog-region}:{your-account-id}:database/{schema_name}",
        "arn:aws:glue:{your-catalog-region}:{your-account-id}:catalog",
        "arn:aws:glue:{your-catalog-region}:{your-account-id}:table/{schema_name}/*"
      ]
    }
  ]
}
NOTE: We need the DeleteDatabase permission only to perform the setup tests.
In the policy, replace {your-catalog-region} with the Region of your S3 bucket and {your-account-id} with your AWS account ID.
If you copied the policy with limited access to specific databases, replace {schema_name} with your connector's schema name.
Click Next.
In the Policy name field, enter a name for your policy, and then click Create policy.
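Because the limited-access policy must be updated every time you add a connector, it can help to regenerate the AllConnectors statement from a list of schema names. The sketch below shows one way to do that. The function name `all_connectors_statement`, the Region, the account ID, and the schema names are hypothetical. Listing one database ARN and one table ARN per schema in the same Resource array is our reading of "update the policy with the new connector's details", not a format confirmed by this guide.

```python
import json
from typing import Iterable

# Actions copied from the AllConnectors statement of the limited-access policy.
ALL_CONNECTORS_ACTIONS = [
    "glue:GetDatabase",
    "glue:UpdateDatabase",
    "glue:CreateTable",
    "glue:CreateDatabase",
    "glue:UpdateTable",
    "glue:DeleteTable",
    "glue:BatchDeleteTable",
    "glue:GetTable",
    "glue:GetTables",
]


def all_connectors_statement(region: str, account_id: str,
                             schema_names: Iterable[str]) -> dict:
    """Build the AllConnectors statement for a set of connector schemas."""
    base = f"arn:aws:glue:{region}:{account_id}"
    resources = [f"{base}:catalog"]
    for schema in schema_names:
        # Assumption: one database ARN and one table ARN per connector schema.
        resources.append(f"{base}:database/{schema}")
        resources.append(f"{base}:table/{schema}/*")
    return {
        "Sid": "AllConnectors",
        "Effect": "Allow",
        "Action": ALL_CONNECTORS_ACTIONS,
        "Resource": resources,
    }


# Hypothetical values, for illustration only.
statement = all_connectors_statement(
    "us-east-1", "111122223333", ["salesforce", "postgres_prod"]
)
print(json.dumps(statement, indent=2))
```

Replace the existing AllConnectors statement in the policy with the generated one, and leave the SetupFormTest statement unchanged.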
Create IAM role
IMPORTANT: This step applies only to SaaS Deployment and not to Hybrid Deployment.
Go to Roles, and then click Create role.
Select AWS account, and then select Another AWS account.
In the Account ID field, enter Fivetran's account ID, 834469178297.
Select the Require external ID checkbox, and then enter the External ID you found.
Click Next.
Select the checkboxes for the IAM policies you created in Step 3 and Step 4.
NOTE: For Delta Lake tables, the IAM policy in Step 4 is not required.
Click Next.
In the Role name field, enter a name for the role, and then click Create role.
In the Roles page, select the role you created.
Make a note of the ARN. You will need it to configure Fivetran.
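For reference, the console steps above result in a role trust policy that lets Fivetran's account assume the role only when it presents your External ID. The sketch below builds the standard AWS cross-account trust document with an `sts:ExternalId` condition. The function name `build_trust_policy` and the sample External ID are hypothetical, and the exact document the console generates in your account may differ slightly.

```python
import json

FIVETRAN_ACCOUNT_ID = "834469178297"  # Fivetran's account ID from this guide


def build_trust_policy(external_id: str,
                       account_id: str = FIVETRAN_ACCOUNT_ID) -> dict:
    """Trust policy allowing another AWS account to assume the role,
    but only when the caller supplies the expected External ID."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"sts:ExternalId": external_id}
                },
            }
        ],
    }


# Hypothetical External ID; use the value shown in your setup form.
trust_policy = build_trust_policy("example_external_id")
print(json.dumps(trust_policy, indent=2))
```

You can compare this output with the Trust relationships tab of the role you created to confirm that the account ID and External ID are correct.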
Create IAM user and security credentials
IMPORTANT: This step applies only to Hybrid Deployment and not to SaaS Deployment.
Create IAM user
On the Amazon IAM console, go to Users, and then click Create user.
Enter a User name and click Next.
Select Attach policies directly.
Select the checkboxes for the IAM policies you created for your S3 bucket and AWS Glue Data Catalog.
Click Next.
Click Create User.
Create security credentials for IAM user
Go to Users, and then select the IAM user you created.
Go to the Security credentials tab, and then click Create access key.
Select Application running outside AWS, and then click Next.
(Optional) Enter the Description tag value for the access key.
Click Create access key.
Save the Access key and make a note of the Secret access key. You will need them to configure Fivetran.
Click Done.
(Optional) Configure IAM policy for AWS Lake Formation
IMPORTANT: This step is mandatory if AWS Lake Formation is enabled for your S3 bucket. Skip this step if AWS Lake Formation is not enabled for your bucket.
Go to the AWS Lake Formation console.
On the navigation menu, go to Permissions > Data locations.
Click Grant.
Choose My account.
In the IAM users and roles drop-down menu, select the IAM role you created for your bucket.
In the Storage locations field, enter the prefix of your S3 bucket where you want to store your data.
Click Grant.
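The same data location grant can also be scripted. The sketch below only assembles the request arguments for the AWS Lake Formation `GrantPermissions` operation and does not call AWS. The function name `data_location_grant`, the role ARN, and the bucket and prefix are hypothetical. Passing the result to boto3's `lakeformation` client, as shown in the final comment, assumes that boto3 is installed and that your credentials are allowed to grant Lake Formation permissions.

```python
def data_location_grant(role_arn: str, bucket_name: str,
                        prefix_path: str = "") -> dict:
    """Build keyword arguments for Lake Formation GrantPermissions that
    give a principal DATA_LOCATION_ACCESS on an S3 storage location."""
    prefix = prefix_path.strip("/")
    location = f"{bucket_name}/{prefix}" if prefix else bucket_name
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"DataLocation": {"ResourceArn": f"arn:aws:s3:::{location}"}},
        "Permissions": ["DATA_LOCATION_ACCESS"],
    }


# Hypothetical role ARN, bucket, and prefix, for illustration only.
grant_kwargs = data_location_grant(
    "arn:aws:iam::111122223333:role/fivetran-s3-data-lake",
    "example-bucket",
    "fivetran/landing",
)
print(grant_kwargs)

# With boto3 installed and credentials configured, the call would be:
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**grant_kwargs)
```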
(Optional) Configure AWS PrivateLink
IMPORTANT:
- This step applies only to SaaS Deployment and not to Hybrid Deployment.
- You must have a Business Critical plan to use AWS PrivateLink.
AWS PrivateLink allows VPCs and AWS-hosted or on-premises services to communicate with one another without exposing traffic to the public internet. PrivateLink is the most secure connection method. Learn more in AWS’ PrivateLink documentation.
Follow our AWS PrivateLink setup guide to configure PrivateLink for your S3 bucket.
(Optional) Set up automatic schema migration for Delta Lake tables in Databricks
IMPORTANT: This step applies only to SaaS Deployment and not to Hybrid Deployment.
Prerequisites
To configure automatic schema migration for the Delta Lake tables, you need the following:
- A Databricks account.
- Unity Catalog enabled in your Databricks workspace. Unity Catalog is a unified governance solution for all data and AI assets including files, tables, machine learning models and dashboards in your lakehouse on any cloud. We recommend that you use Fivetran with Unity Catalog as it simplifies access control and sharing of the tables that Fivetran creates. Legacy deployments can continue to use Databricks without Unity Catalog.
- A SQL warehouse. SQL warehouses are optimized for data ingestion and analytics workloads, start and shut down rapidly and are automatically upgraded with the latest enhancements by Databricks. Legacy deployments can continue to use Databricks clusters with Databricks Runtime v7.0 or above.
Configure Unity Catalog
NOTE: Skip this step if your Unity Catalog is already configured in Databricks.
Create workspace
Log in to the Databricks account console as an account admin.
Create a workspace by following the instructions in Databricks documentation.
Create metastore and attach workspace
Create a metastore and attach your workspace by following the instructions in Databricks documentation.
Enable Unity Catalog for workspace
Enable Unity Catalog for your workspace by following the instructions in Databricks documentation.
Configure external data storage
NOTE: Skip this step if your external data storage is already configured in Databricks.
Create storage credentials
Create your storage credentials by following the instructions in Databricks documentation.
Create external location
Log in to your Databricks workspace.
Go to Catalog > External Data.
Click External Locations.
Click Create location.
Select Manual and then click Next.
Enter the External location name.
In the Storage credential drop-down menu, select the credential you created.
In the URL field, enter the path to the S3 bucket you configured for your S3 Data Lake destination in Fivetran.
Click Create.
(Optional) Connect using AWS PrivateLink
IMPORTANT: You must have a Business Critical plan to use AWS PrivateLink.
AWS PrivateLink allows VPCs and AWS-hosted or on-premises services to communicate with one another without exposing traffic to the public internet. PrivateLink is the most secure connection method. Learn more in AWS PrivateLink's documentation.
How it works:
Fivetran accesses the data plane in your AWS account using the control plane network in Databricks' account.
You set up a back-end AWS PrivateLink connection between your AWS account and Databricks' AWS account (shown as (Workspace 1/2) Link-1 in the diagram above).
Fivetran creates and maintains a front-end AWS PrivateLink connection between Fivetran's AWS account and Databricks' AWS account (shown as Regional - Link-2 in the diagram above).
Prerequisites
To set up AWS PrivateLink, you need:
- A Fivetran instance configured to run in AWS
- A Databricks destination in one of our supported regions
- All of Databricks' requirements
Configure AWS PrivateLink
Follow the instructions in Databricks documentation to enable private connectivity for your workspaces. Your workspaces must have the following:
- A registered back-end VPC endpoint for secure cluster connectivity relay
- A registered back-end VPC endpoint for REST APIs
- A PAS object with access to Fivetran's VPC endpoints
- If the Private Access Level on the PAS object is set to Account, a Fivetran VPC endpoint (for the applicable AWS region) that is registered once per account
- If the Private Access Level on the PAS object is set to Endpoint, a Fivetran VPC endpoint (for the applicable AWS region) that is registered using the allowed_vpc_endpoint_ids property
Register Fivetran endpoint details
Register the Fivetran endpoint for the applicable AWS region with your Databricks workspaces. We cannot access your workspaces until you register the endpoint.
AWS Region | VPC endpoint |
---|---|
ap-south-1 Asia Pacific (Mumbai) | vpce-089f13c9231c2b729 |
ap-southeast-1 Asia Pacific (Singapore) | vpce-03f0abf5b0d840936 |
ap-southeast-2 Asia Pacific (Sydney) | vpce-0e5f79a1613d0cf05 |
ca-central-1 Canada (Central) | vpce-09f0049f9a92177f1 |
eu-central-1 Europe (Frankfurt) | vpce-049699737170c880d |
eu-west-1 Europe (Ireland) | vpce-0b32cb6c08f6fe0df |
eu-west-2 Europe (London) | vpce-03fde3e4804f537eb |
us-east-1 US East (N. Virginia) | vpce-0ff9bd04153060180 |
us-east-2 US East (Ohio) | vpce-05153aa99bf7a4575 |
us-west-2 US West (Oregon) | vpce-0884ff0f23dcbf0dc |
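If you script the registration, the table above can be kept as a simple lookup. This is a minimal sketch: the endpoint IDs are copied from the table, while the names `FIVETRAN_VPC_ENDPOINTS` and `fivetran_vpc_endpoint` are hypothetical.

```python
# Fivetran VPC endpoint per AWS Region, copied from the table above.
FIVETRAN_VPC_ENDPOINTS = {
    "ap-south-1": "vpce-089f13c9231c2b729",
    "ap-southeast-1": "vpce-03f0abf5b0d840936",
    "ap-southeast-2": "vpce-0e5f79a1613d0cf05",
    "ca-central-1": "vpce-09f0049f9a92177f1",
    "eu-central-1": "vpce-049699737170c880d",
    "eu-west-1": "vpce-0b32cb6c08f6fe0df",
    "eu-west-2": "vpce-03fde3e4804f537eb",
    "us-east-1": "vpce-0ff9bd04153060180",
    "us-east-2": "vpce-05153aa99bf7a4575",
    "us-west-2": "vpce-0884ff0f23dcbf0dc",
}


def fivetran_vpc_endpoint(region: str) -> str:
    """Return the Fivetran VPC endpoint ID to register for a Region."""
    try:
        return FIVETRAN_VPC_ENDPOINTS[region]
    except KeyError:
        raise ValueError(
            f"No Fivetran VPC endpoint is listed for region {region!r}"
        ) from None


print(fivetran_vpc_endpoint("eu-west-1"))  # prints vpce-0b32cb6c08f6fe0df
```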
Connect SQL warehouse
NOTE: Perform this step only if you want to use a Databricks SQL warehouse with Fivetran. We recommend that you always use Databricks SQL warehouses with Fivetran.
In the Databricks console, go to SQL > SQL warehouses > Create SQL warehouse. If you want to select an existing SQL warehouse, skip to step 5 in this section.
In the New SQL warehouse window, enter a Name for your warehouse.
Choose your Cluster Size and configure the other warehouse options.
Click Create.
Go to the Connection details tab.
Make a note of the following values. You will need them to configure Fivetran.
- Server Hostname
- Port
- HTTP Path
IMPORTANT: Skip to the Choose authentication type step after completing this step.
(Optional) Connect Databricks cluster
NOTE: Perform this step only if you want to use a Databricks cluster instead of a Databricks SQL warehouse. We recommend that you use Databricks SQL warehouses with Fivetran.
Log in to your Databricks workspace.
In the Databricks console, go to Data Engineering > Cluster > Create Cluster.
Enter a Cluster name of your choice.
Set the Databricks Runtime Version to the latest LTS release.
NOTE: Make sure you choose v7.3 or above (recommended version: 10.4).
Select the Cluster mode.
In the Advanced Options window, in the Security mode drop-down menu, select either Single user or User isolation.
In the Advanced Options section, select Spark.
If your Databricks Runtime Version is older than v9.1, copy the following code and paste in the Spark config field:
spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl.disable.cache true
spark.hadoop.fs.s3.impl.disable.cache true
spark.hadoop.fs.s3a.impl.disable.cache true
spark.hadoop.fs.s3.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
Click Create Cluster.
In the Advanced Options window, select JDBC/ODBC.
Make a note of the following values. You will need them to configure Fivetran.
- Server Hostname
- Port
- HTTP Path
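The version rules in the cluster steps above (v7.3 or above, with the extra Spark config only for runtimes older than v9.1) can be expressed as a small check. This is a sketch with hypothetical function names, and it assumes that version strings look like "10.4" or "v7.3", which may not cover every runtime label Databricks uses.

```python
from typing import Tuple

MIN_SUPPORTED = (7, 3)      # minimum runtime named in the cluster steps
S3A_CONFIG_BEFORE = (9, 1)  # runtimes older than this need the Spark config


def parse_runtime_version(version: str) -> Tuple[int, int]:
    """Parse a version such as '10.4' or 'v7.3' into (major, minor)."""
    major, minor = version.strip().lstrip("vV").split(".")[:2]
    return int(major), int(minor)


def is_supported_runtime(version: str) -> bool:
    """True if the runtime meets the v7.3 minimum."""
    return parse_runtime_version(version) >= MIN_SUPPORTED


def needs_s3a_spark_config(version: str) -> bool:
    """True if the S3A Spark config from the steps above should be pasted."""
    return parse_runtime_version(version) < S3A_CONFIG_BEFORE


for v in ["7.3", "9.1", "10.4"]:
    print(v, is_supported_runtime(v), needs_s3a_spark_config(v))
```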
Choose authentication type
You can use one of the following authentication types for Fivetran to connect to Databricks:
Databricks personal access token authentication
To use the Databricks personal access token authentication type, create a personal access token by following the instructions in Databricks documentation.
Assign the following catalog privileges to the user or service principal you want to use to create your access token:
- CREATE SCHEMA
- CREATE TABLE
- MODIFY
- REFRESH
- SELECT
- USE CATALOG
- USE SCHEMA
To create external tables in a Unity Catalog-managed external location, assign the following privileges to the user or service principal you want to use to create your access token:
- On the external location:
  - CREATE EXTERNAL TABLE
  - READ FILES
  - WRITE FILES
- On the storage credentials:
  - CREATE EXTERNAL TABLE
  - READ FILES
  - WRITE FILES
NOTE: When you grant a privilege on the catalog, it is automatically granted to all current and future schemas in the catalog. Similarly, the privileges that you grant on a schema are inherited by all current and future tables in the schema.
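The privileges listed above can be turned into SQL statements to run in a Databricks SQL editor. The sketch below only generates strings. It assumes the Unity Catalog form `GRANT <privileges> ON <securable> TO <principal>`, and the function name, catalog, location, credential, and principal names are all hypothetical. Verify the statements against the Databricks documentation for your workspace before running them.

```python
from typing import List, Optional

# Privileges copied from the lists in this section.
CATALOG_PRIVILEGES = [
    "CREATE SCHEMA", "CREATE TABLE", "MODIFY", "REFRESH",
    "SELECT", "USE CATALOG", "USE SCHEMA",
]
EXTERNAL_PRIVILEGES = ["CREATE EXTERNAL TABLE", "READ FILES", "WRITE FILES"]


def grant_statements(principal: str, catalog: str,
                     external_location: Optional[str] = None,
                     storage_credential: Optional[str] = None) -> List[str]:
    """Generate GRANT statements for the privileges listed in this guide."""
    catalog_privs = ", ".join(CATALOG_PRIVILEGES)
    external_privs = ", ".join(EXTERNAL_PRIVILEGES)
    statements = [
        f"GRANT {catalog_privs} ON CATALOG `{catalog}` TO `{principal}`;"
    ]
    if external_location:
        statements.append(
            f"GRANT {external_privs} ON EXTERNAL LOCATION "
            f"`{external_location}` TO `{principal}`;"
        )
    if storage_credential:
        statements.append(
            f"GRANT {external_privs} ON STORAGE CREDENTIAL "
            f"`{storage_credential}` TO `{principal}`;"
        )
    return statements


# Hypothetical names, for illustration only.
statements = grant_statements(
    "fivetran-service-principal", "main",
    external_location="fivetran_s3_location",
    storage_credential="fivetran_s3_credential",
)
for sql in statements:
    print(sql)
```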
OAuth machine-to-machine (M2M) authentication
To use the OAuth machine-to-machine (M2M) authentication type, create your OAuth client ID and secret by following the instructions in Databricks documentation.
NOTE: You cannot use this authentication type if you use the AWS PrivateLink or Azure Private Link connection method.
Complete Fivetran configuration
Log in to your Fivetran account.
Go to the Destinations page and click Add destination.
Enter a Destination name of your choice and then click Add.
Select S3 Data Lake as the destination type.
(Enterprise and Business Critical accounts only) Select the deployment model of your choice:
- SaaS Deployment
- Hybrid Deployment
If you choose Hybrid Deployment, select an existing Hybrid Deployment Agent in the Select an existing agent drop-down menu or configure a new agent.
NOTE: For more information about configuring a new agent, see our Hybrid Deployment with Kubernetes setup guide.
Enter your S3 Bucket name.
NOTE: You cannot change the bucket name after you save the setup form.
(SaaS Deployment only) In the Fivetran Role ARN field, enter the ARN you found.
(Hybrid Deployment only) Enter the AWS Access Key ID and AWS Secret Access Key of the IAM user you created.
(Optional) Enter the S3 Prefix Path of your bucket.
NOTE: The prefix path must not start or end with a forward slash (/).
You cannot change the prefix path after you save the setup form.
Enter your S3 Bucket Region.
(SaaS Deployment only) To always connect using AWS PrivateLink, set the Require PrivateLink toggle to ON.
NOTE: By default, we use PrivateLink to connect if your S3 bucket and Fivetran are in the same AWS Region. Enabling this option ensures that we always use PrivateLink to connect. If you set this toggle to OFF and if your S3 bucket and Fivetran are not in the same AWS region, Fivetran does not use a PrivateLink connection and skips the PrivateLink setup test.
(SaaS Deployment only) In the Table Format drop-down menu, select the format you want to use for your destination tables.
NOTE: You cannot change the table format after you save the setup form.
In the Snapshot Retention Period drop-down menu, select how long you want us to retain your table snapshots.
NOTE: We perform regular table maintenance operations to delete the table snapshots that are older than the retention period you select in this field. You can select Retain All Snapshots to disable the deletion of table snapshots.
(Optional and not applicable to Hybrid Deployment) To automate schema migration of Delta Lake tables in Databricks, set the Maintain Delta Tables In Databricks toggle to ON and do the following:
i. Choose the Databricks Connection Method.
ii. Enter the following details of your Databricks account:
- Catalog name
- Server Hostname
- Port number
- HTTP Path
iii. Select the Authentication Type you configured.
iv. If you selected PERSONAL ACCESS TOKEN as the Authentication Type, enter the Personal Access Token you created.
v. If you selected OAUTH 2.0 as the Authentication Type, enter the OAuth 2.0 Client ID and OAuth 2.0 Secret you created.
(SaaS Deployment only) Choose your Data processing location.
(SaaS Deployment only) Choose your Cloud service provider and its region as described in our Destinations documentation.
NOTE:
- For faster uploads and downloads and for optimum load times, we recommend that you select AWS as the Cloud service provider and the Region in which your S3 bucket is located as the AWS Region.
- For S3 data lake destinations, AWS is supported in all pricing plans. For information about the supported AWS Regions, see our destination overview documentation.
Choose your Time zone.
(Optional for Business Critical accounts and not applicable to Hybrid Deployment) To enable regional failover, set the Use Failover toggle to ON, and then select your Failover Location and Failover Region. Make a note of the IP addresses of the secondary region and safelist these addresses in your firewall.
Click Save & Test.
Fivetran tests and validates the S3 Data Lake connection. On successful completion of the setup tests, you can sync your data using Fivetran connectors to the S3 Data Lake destination.
In addition, Fivetran automatically configures a Fivetran Platform Connector to transfer the connector logs and account metadata to a schema in this destination. The Fivetran Platform Connector enables you to monitor your connectors, track your usage, and audit changes. The connector sends all these details at the destination level.
IMPORTANT: If you are an Account Administrator, you can manually add the Fivetran Platform Connector on an account level so that it syncs all the metadata and logs for all the destinations in your account to a single destination. If an account-level Fivetran Platform Connector is already configured in a destination in your Fivetran account, then we don't add destination-level Fivetran Platform Connectors to the new destinations you create.
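Because the bucket name and prefix path cannot be changed after you save the setup form, you may want to check the prefix before you enter it. The sketch below applies the rule from the configuration steps above: the prefix path must not start or end with a forward slash. The function name `validate_prefix_path` and the sample paths are hypothetical.

```python
def validate_prefix_path(prefix_path: str) -> str:
    """Return the prefix unchanged if it satisfies the setup form rule,
    otherwise raise ValueError. An empty prefix is allowed because the
    S3 Prefix Path field is optional."""
    if prefix_path.startswith("/") or prefix_path.endswith("/"):
        raise ValueError(
            f"Prefix path {prefix_path!r} must not start or end with '/'"
        )
    return prefix_path


# Hypothetical prefix paths, for illustration only.
for candidate in ["fivetran/landing", "/fivetran/landing", "fivetran/landing/"]:
    try:
        validate_prefix_path(candidate)
        print(f"{candidate!r}: ok")
    except ValueError as err:
        print(f"{candidate!r}: rejected ({err})")
```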
Setup tests
Fivetran performs the following S3 Data Lake connection tests:
The S3 Read and Write Access test checks the accessibility of your S3 bucket and validates the resources you provided in the IAM policy.
(Applicable only to Iceberg tables) The Glue Access test checks the accessibility of AWS Glue Data Catalog and validates the resources you provided in the IAM policy.
(Applicable only to SaaS Deployment) The PrivateLink test checks whether your S3 bucket is in the same AWS Region as Fivetran. We perform this test only if you set the Require PrivateLink toggle to ON.
The Bucket Region test checks whether you specified a valid S3 Bucket Region.
The Validate Permissions test checks whether the Databricks credentials you provided have the READ/WRITE permissions needed to CREATE, ALTER, or DROP tables in the database. We perform this test only if you set the Maintain Delta Tables In Databricks toggle to ON.
NOTE: The tests may take a couple of minutes to complete.
Related articles
Destination Overview
API Destination Configuration