S3 Data Lake Setup Guide

Follow our setup guide to connect your Amazon S3 data lake to Fivetran.

Prerequisites

To connect your Amazon S3 data lake to Fivetran, you need the following:

An AWS account that does not have multiple resource groups in the same AWS Region.
An Amazon S3 bucket in one of the supported AWS Regions. For faster uploads and downloads and for optimum load times, we recommend that you create the bucket in the same Region as the data processing location of your destination. For more information about creating an Amazon S3 bucket, see AWS' documentation.
(Applicable only to Iceberg tables) Access to AWS Glue Data Catalog in the same Region as the S3 bucket.

In your AWS account, you can create multiple groups within the same AWS Region. However, it's important to note that all groups in a particular AWS Region share the same AWS Glue database. Therefore, you must avoid having the same schema/table combination across multiple groups within the same Region, as it could lead to conflicts in AWS Glue database tables and potential synchronization failures.

Setup instructions

Find External ID

In the destination setup form, find the automatically-generated External ID and make a note of it. You will need it to create an IAM role for Fivetran.

The automatically-generated External ID is tied to your account. The ID does not change even if you close and re-open the setup form. For your convenience, you can keep the browser tab open in the background while you configure your destination.

Create IAM policy for S3 bucket

Open your Amazon IAM console.
Go to Policies, and then click Create Policy.
Go to the JSON tab.

Copy the following policy and paste it in the JSON editor.

{
    "Version": "2012-10-17",
    "Statement": [
        {
        "Sid": "AllowListBucketOfASpecificPrefix",
        "Effect": "Allow",
        "Action": [
            "s3:ListBucket"
        ],
        "Resource": [
            "arn:aws:s3:::{your-bucket-name}"
        ],
        "Condition": {
            "StringLike": {
                "s3:prefix": [
                    "{prefix_path}/*"
                 ]
             }
          }
        },
        {
        "Sid": "AllowAllObjectActionsInSpecificPrefix",
        "Effect": "Allow",
        "Action": [
            "s3:DeleteObjectTagging",
            "s3:ReplicateObject",
            "s3:PutObject",
            "s3:GetObjectAcl",
            "s3:GetObject",
            "s3:DeleteObjectVersion",
            "s3:PutObjectTagging",
            "s3:DeleteObject",
            "s3:PutObjectAcl"
        ],
        "Resource": [
            "arn:aws:s3:::{your-bucket-name}/{prefix_path}/*"
        ]
        }
    ]
}

Setting the "s3:prefix" condition to ["*"] grants access to all prefixes within the specified bucket, while setting it to ["{prefix_path}/*"] restricts access to a specific prefix path within the bucket.

In the policy, replace {your-bucket-name} with the name of your S3 bucket and {prefix_path} with the prefix path of your S3 bucket.
If you do not specify a prefix path, the policy will grant access to the entire S3 bucket instead of limiting our access to the objects of a specific prefix.
Click Next.
In the Policy name field, enter a name for your policy, and then click Create policy

Create IAM policy for AWS Glue Data Catalog

Perform this step only if you want to use Iceberg tables in your destination. Skip to the next step if you want to use Delta Lake tables.
We do not recommend configuring multiple connections with the same schema and table name in a single AWS Glue database. For more information, see our troubleshooting article.

In the Policies page, click Create Policy, and then go to the JSON tab.

Depending on your access requirements, copy one of the following policies and paste it in the JSON editor:

To enable the policy to access all your Glue databases and its tables, copy the following policy.

{
    "Version": "2012-10-17",
    "Statement": [
        {
          "Sid": "SetupFormTest",
          "Effect": "Allow",
          "Action": [
              "glue:DeleteDatabase"
           ],
           "Resource": [
               "arn:aws:glue:{your-catalog-region}:{your-account-id}:database/fivetran*",
               "arn:aws:glue:{your-catalog-region}:{your-account-id}:catalog",
               "arn:aws:glue:{your-catalog-region}:{your-account-id}:table/fivetran*/*",
               "arn:aws:glue:{your-catalog-region}:{your-account-id}:userDefinedFunction/fivetran*/*"
           ]
       },
       {
          "Sid": "AllConnectors",
          "Effect": "Allow",
          "Action": [
              "glue:GetDatabase",
              "glue:UpdateDatabase",
              "glue:CreateTable",
              "glue:GetTables",
              "glue:CreateDatabase",
              "glue:UpdateTable",
              "glue:BatchDeleteTable",
              "glue:DeleteTable",
              "glue:GetTable"
           ],
           "Resource": [
               "arn:aws:glue:{your-catalog-region}:{your-account-id}:*"
           ]
       }
    ]
}

To limit the access of the policy to specific Glue databases, copy the following policy.

Whenever you add a new connection for your destination, you must update the policy with the new connection's details under the Sid:AllConnectors identifier.

  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Sid": "SetupFormTest",
              "Effect": "Allow",
              "Action": [
                  "glue:GetDatabase",
                  "glue:UpdateDatabase",
                  "glue:DeleteDatabase",
                  "glue:CreateTable",
                  "glue:GetTables",
                  "glue:CreateDatabase",
                  "glue:UpdateTable",
                  "glue:BatchDeleteTable",
                  "glue:DeleteTable",
                  "glue:GetTable"
              ],
              "Resource": [
                  "arn:aws:glue:{your-catalog-region}:{your-account-id}:database/fivetran*",
                  "arn:aws:glue:{your-catalog-region}:{your-account-id}:catalog",
                  "arn:aws:glue:{your-catalog-region}:{your-account-id}:table/fivetran*/*",
                  "arn:aws:glue:{your-catalog-region}:{your-account-id}:userDefinedFunction/fivetran*/*"
              ]
          },
          {
              "Sid": "AllConnectors",
              "Effect": "Allow",
              "Action": [
                  "glue:GetDatabase",
                  "glue:UpdateDatabase",
                  "glue:CreateTable",
                  "glue:CreateDatabase",
                  "glue:UpdateTable",
                  "glue:DeleteTable",
                  "glue:BatchDeleteTable",
                  "glue:GetTable",
                  "glue:GetTables"
              ],
              "Resource": [
                  "arn:aws:glue:{your-catalog-region}:{your-account-id}:database/{schema_name}",
                  "arn:aws:glue:{your-catalog-region}:{your-account-id}:catalog",
                  "arn:aws:glue:{your-catalog-region}:{your-account-id}:table/{schema_name}/*"
              ]
          }
       ]
      }

We need the DeleteDatabase permission only to perform the setup tests.

In the policy, replace {your-catalog-region} with the Region of your S3 bucket and {your-account-id} with your AWS account ID.
If you copied the policy with limited access to specific databases, replace {schema_name} with your connection's schema name.
Click Next.
In the Policy name field, enter a name for your policy, and then click Create policy.

Create IAM role

Go to Roles, and then click Create role.
Select AWS account, and then select Another AWS account.
In the Account ID field, enter Fivetran's account ID, 834469178297.
Select the Require external ID checkbox, and then enter the External ID you found.
Click Next.
Select the checkboxes for the IAM policies you created in Step 2 and Step 3.
For Delta Lake tables, the IAM policy in Step 3 is not required.
Click Next.
In the Role name field, enter a name for the role, and then click Create role.
In the Roles page, select the role you created.
Make a note of the ARN. You will need it to configure Fivetran.

Note ARN

(Optional) Configure IAM policy for AWS Lake Formation

This step is mandatory if AWS Lake Formation is enabled for your S3 bucket. Skip this step if AWS Lake Formation is not enabled for your bucket.

Go to the AWS Lake Formation console.
On the navigation menu, go to Permissions > Data locations.
Click Grant.
Choose My account.
In the IAM users and roles drop-down menu, select the IAM role you created for your bucket.
In the Storage locations field, enter the prefix of your S3 bucket where you want to store your data.
Click Grant.

(Optional) Set up automatic schema migration for Delta Lake tables in Databricks

Expand for instructions

Prerequisites

To configure automatic schema migration for the Delta Lake tables, you need the following:

A Databricks account.
Unity Catalog enabled in your Databricks workspace. Unity Catalog is a unified governance solution for all data and AI assets including files, tables, machine learning models and dashboards in your lakehouse on any cloud. We recommend that you use Fivetran with Unity Catalog as it simplifies access control and sharing of the tables that Fivetran creates. Legacy deployments can continue to use Databricks without Unity Catalog.
A SQL warehouse. SQL warehouses are optimized for data ingestion and analytics workloads, start and shut down rapidly and are automatically upgraded with the latest enhancements by Databricks. Legacy deployments can continue to use Databricks clusters with Databricks Runtime v7.0 or above.

Configure Unity Catalog

Skip this step if your Unity Catalog is already configured in Databricks.

Create workspace

Login to the Databricks account console as an account admin.
Create a workspace by following the instructions in Databricks documentation.

Create metastore and attach workspace

Create a metastore and attach your workspace by following the instructions in Databricks documentation.

Enable Unity Catalog for workspace

Enable Unity Catalog for your workspace by following the instructions in Databricks documentation.

Configure external data storage

Skip this step if your external data storage is already configured in Databricks.

Create storage credentials

Create your storage credentials by following the instructions in Databricks documentation.

Create external location

Log in to your Databricks workspace.
Go to Catalog > External Data.
Click External Locations.
Click Create location.
Select Manual and then click Next.
Enter the External location name.
In the Storage credential drop-down menu, select the credential you created.
In the URL field, enter the path to the S3 bucket you configured for your S3 Data Lake destination in Fivetran.
Click Create.

(Optional) Connect using AWS PrivateLink

You must have a Business Critical plan to use AWS PrivateLink.

Expand for instructions

AWS PrivateLink allows VPCs and AWS-hosted or on-premises services to communicate with one another without exposing traffic to the public internet. PrivateLink is the most secure connection method. Learn more in AWS PrivateLink's documentation.

How it works:

Fivetran accesses the data plane in your AWS account using the control plane network in Databricks' account.
You set up a back-end AWS PrivateLink connection between your AWS account and Databricks' AWS account (shown as (Workspace 1/2) Link-1 in the diagram above).
Fivetran creates and maintains a front-end AWS PrivateLink connection between Fivetran's AWS account and Databricks' AWS account (shown as Regional - Link-2 in the diagram above).

Prerequisites

To set up AWS PrivateLink, you need:

A Fivetran instance configured to run in AWS
A Databricks destination in one of our supported regions
All of Databricks' requirements

Configure AWS PrivateLink

Follow the instructions in Databricks documentation to enable private connectivity for your workspaces. Your workspaces must have the following:

A registered back-end VPC endpoint for secure cluster connectivity relay
A registered back-end VPC endpoint for REST APIs
A PAS object with access to Fivetran's VPC endpoints
If the Private Access Level on the PAS object is set to Account, a Fivetran VPC endpoint (for the applicable AWS region) that is registered once per account
If the Private Access Level on the PAS object is set to Endpoint, a Fivetran VPC endpoint (for applicable AWS region) that is registered using the allowed_vpc_endpoint_ids property

Register Fivetran endpoint details

Register the Fivetran endpoint for the applicable AWS region with your Databricks workspaces. We cannot access your workspaces until you register the endpoint.

AWS Region	VPC endpoint
ap-south-1 Asia Pacific (Mumbai)	`vpce-089f13c9231c2b729`
ap-southeast-1 Asia Pacific (Singapore)	`vpce-03f0abf5b0d840936`
ap-southeast-2 Asia Pacific (Sydney)	`vpce-0e5f79a1613d0cf05`
ap-northeast-2 Asia Pacific (Seoul)	`vpce-08125a2271630478c`
ca-central-1 Canada (Central)	`vpce-09f0049f9a92177f1`
eu-central-1 Europe (Frankfurt)	`vpce-049699737170c880d`
eu-west-1 Europe (Ireland)	`vpce-0b32cb6c08f6fe0df`
eu-west-2 Europe (London)	`vpce-03fde3e4804f537eb`
us-east-1 US East (N. Virginia)	`vpce-0ff9bd04153060180`
us-east-2 US East (Ohio)	`vpce-05153aa99bf7a4575`
us-west-2 US West (Oregon)	`vpce-0884ff0f23dcbf0dc`

Connect SQL warehouse

Perform this step only if you want to use a Databricks SQL warehouse with Fivetran. We recommend that you always use Databricks SQL warehouses with Fivetran.

In the Databricks console, go to SQL > SQL warehouses > Create SQL warehouse. If you want to select an existing SQL warehouse, skip to step 5 in this section.
In the New SQL warehouse window, enter a Name for your warehouse.
Choose your Cluster Size and configure the other warehouse options.
Click Create.
Go to the Connection details tab.
Make a note of the following values. You will need them to configure Fivetran.
- Server Hostname
- Port
- HTTP Path

Skip to the Choose authentication type step after completing this step.

(Optional) Connect Databricks cluster

Perform this step only if you want to use a Databricks cluster instead of a Databricks SQL warehouse. We recommend that you use Databricks SQL warehouses with Fivetran.

Expand for instructions

Log in to your Databricks workspace.
In the Databricks console, go to Data Engineering > Cluster > Create Cluster.
Enter a Cluster name of your choice.
Set the Databricks Runtime Version to the latest LTS release.
Make sure you choose v7.3 or above.
Select the Cluster mode.
Set the Databricks Runtime Version to 7.3 or above (recommended version: 10.4).
In the Advanced Options window, in the Security mode drop-down menu, select either Single user or User isolation.
In the Advanced Options section, select Spark.

If your Databricks Runtime Version is older than v9.1, copy the following code and paste in the Spark config field:

spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl.disable.cache true
spark.hadoop.fs.s3.impl.disable.cache true
spark.hadoop.fs.s3a.impl.disable.cache true
spark.hadoop.fs.s3.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem

set-aws-true

Click Create Cluster.
In the Advanced Options window, select JDBC/ODBC.
Make a note of the following values. You will need them to configure Fivetran.

Server Hostname
Port
HTTP Path

Choose authentication type

You can use one of the following authentication types for Fivetran to connect to Databricks:

Databricks personal access token authentication

To use the Databricks personal access token authentication type, create a personal access token by following the instructions in Databricks documentation.
Assign the following catalog privileges to the user or service principal you want to use to create your access token:
- CREATE SCHEMA
- CREATE TABLE
- MODIFY
- REFRESH
- SELECT
- USE CATALOG
- USE SCHEMA
To create external tables in a Unity Catalog-managed external location, assign the following privileges to the user or service principal you want to use to create your access token:
- On the external location:
  - CREATE EXTERNAL TABLE
  - READ FILES
  - WRITE FILES
- On the storage credentials:
  - CREATE EXTERNAL TABLE
  - READ FILES
  - WRITE FILES

When you grant a privilege on the catalog, it is automatically granted to all current and future schemas in the catalog. Similarly, the privileges that you grant on a schema are inherited by all current and future tables in the schema.

OAuth machine-to-machine (M2M) authentication

To use the OAuth machine-to-machine (M2M) authentication type, create your OAuth client ID and secret by following the instructions in Databricks documentation.

Complete Fivetran configuration

Log in to your Fivetran account.
Go to the Destinations page and click Add destination.
Enter a Destination name of your choice and then click Add.
Select S3 Data Lake as the destination type.
Enter your S3 Bucket name.
You cannot change the bucket name after you save the setup form.
In the Fivetran Role ARN field, enter the ARN you found.
(Optional) Enter the S3 Prefix Path of your bucket.
The prefix path must not start or end with a forward slash (/).
You cannot change the prefix path after you save the setup form.
Enter your S3 Bucket Region.
In the Table Format drop-down menu, select the format you want to use for your destination tables.
You cannot change the table format after you save the setup form.
In the Snapshot Retention Period drop-down menu, select how long you want us to retain your table snapshots.
We perform regular table maintenance operations to delete the table snapshots that are older than the retention period you select in this field. You can select Retain All Snapshots to disable the deletion of table snapshots.
(Optional) To automate schema migration of Delta Lake tables in Databricks, set the Maintain Delta Tables In Databricks toggle to ON and do the following:
i. Choose the Databricks Connection Method.
ii. Enter the following details of your Databricks account: - Catalog name - Server Hostname - Port number - HTTP Path
iii. Select the Authentication Type you configured.
iv. If you selected PERSONAL ACCESS TOKEN as the Authentication Type, enter the Personal Access Token you created.
v. If you selected OAUTH 2.0 as the Authentication Type, enter the OAuth 2.0 Client ID and OAuth 2.0 Secret you created.
Choose your Data processing location.
Choose your Cloud service provider and its region as described in our Destinations documentation.
For faster uploads and downloads and for optimum load times, we recommend that you select AWS as the Cloud service provider and the Region in which your S3 bucket is located as the AWS Region.
For S3 data lake destinations, we support AWS in all our pricing plans. For information about the supported AWS Regions, see our destination overview documentation.
Choose your Time zone.
(Optional for Business Critical accounts) To enable regional failover, set the Use Failover toggle to ON, and then select your Failover Location and Failover Region. Make a note of the IP addresses of the secondary region and safelist these addresses in your firewall.
Click Save & Test.

Fivetran tests and validates the S3 Data Lake connection. On successful completion of the setup tests, you can sync your data using Fivetran connectors to the S3 Data Lake destination.

In addition, Fivetran automatically configures a Fivetran Platform Connector to transfer the connection logs and account metadata to a schema in this destination. The Fivetran Platform Connector enables you to monitor your connections, track your usage, and audit changes. The Fivetran Platform Connector sends all these details at the destination level.

If you are an Account Administrator, you can manually add the Fivetran Platform Connector on an account level so that it syncs all the metadata and logs for all the destinations in your account to a single destination. If an account-level Fivetran Platform Connector is already configured in a destination in your Fivetran account, then we don't add destination-level Fivetran Platform Connectors to the new destinations you create.

Setup tests

Fivetran performs the following S3 Data Lake connection tests:

The S3 Read and Write Access test checks the accessibility of your S3 bucket and validates the resources you provided in the IAM policy.
(Applicable only to Iceberg tables) The Glue Access test checks the accessibility of AWS Glue Data Catalog and validates the resources you provided in the IAM policy.
The Bucket Region test checks whether you specified a valid S3 Bucket Region.
The Validate Permissions test checks on the databricks creds if we have the necessary READ/WRITE permissions to CREATE, ALTER, or DROP tables in the database. We perform this test only if you set the Require PrivateLink toggle to ON
The tests may take a couple of minutes to complete.

Destination Overview

Release Notes

API Destination Configuration

Documentation Home

S3 Data Lake Setup Guide

Prerequisites

Setup instructions

Find External ID

Create IAM policy for S3 bucket

Create IAM policy for AWS Glue Data Catalog

Create IAM role

(Optional) Configure IAM policy for AWS Lake Formation

(Optional) Set up automatic schema migration for Delta Lake tables in Databricks

Prerequisites

Configure Unity Catalog

Create workspace

Create metastore and attach workspace

Enable Unity Catalog for workspace

Configure external data storage

Create storage credentials

Create external location

(Optional) Connect using AWS PrivateLink

Prerequisites

Configure AWS PrivateLink

Register Fivetran endpoint details

Connect SQL warehouse

(Optional) Connect Databricks cluster

Choose authentication type

Complete Fivetran configuration

Setup tests

Related articles