S3 Data Lake

Amazon Simple Storage Service (Amazon S3) provides scalable cloud storage services to build secure data lakes. Fivetran supports data lakes built on Amazon S3 as a destination.

Our S3 Data Lake destination can sync your data from multiple sources to S3 data lakes. We use AWS Glue as the data catalog for the Iceberg tables in your destination. AWS Glue is a serverless data integration service that enables other services to quickly query and integrate the data stored in your data lake.

We will gradually discontinue S3 Data Lake and migrate all existing S3 Data Lake destinations to Managed Data Lakes Service.

Supported deployment models

We support the SaaS Deployment model for the destination.

Setup guide

Follow our step-by-step S3 Data Lake setup guide to connect your S3 Data Lake destination with Fivetran.

Type transformation and mapping

The data types in your S3 Data Lake destination follow Fivetran's standard data type storage.

We use the following data type conversions:

Fivetran Data Type	Destination Data Type (Delta Lake Table Format)	Destination Data Type (Iceberg Table Format)	Notes
BOOLEAN	BOOLEAN	BOOLEAN
SHORT	SHORT	INTEGER
INT	INTEGER	INTEGER
LONG	LONG	LONG
BIGDECIMAL	DECIMAL(38, 10), DOUBLE, or STRING	DECIMAL(38, 10), DOUBLE, or STRING	If a primary key's precision exceeds `38` or scale exceeds `10`, we convert its data type to STRING. If a column has no precision or scale defined, we convert its data type to DOUBLE. For all other columns, we convert the data type to DECIMAL(38,10).
FLOAT	FLOAT	FLOAT
DOUBLE	DOUBLE	DOUBLE
LOCALDATE	DATE	DATE
LOCALDATETIME	TIMESTAMP	TIMESTAMP
INSTANT	TIMESTAMP	TIMESTAMPTZ
STRING	STRING	STRING
XML	STRING	STRING
JSON	STRING	STRING
BINARY	BINARY	BINARY

Supported AWS Regions

We can store your data in S3 buckets located in the following AWS regions:

Name	Code
US East (N. Virginia)	us-east-1
US East (Ohio)	us-east-2
US West (Oregon)	us-west-2
Europe (Frankfurt)	eu-central-1
Europe (Ireland)	eu-west-1
Asia Pacific (Mumbai)	ap-south-1
Asia Pacific (Singapore)	ap-southeast-1
Canada (Central)	ca-central-1
Europe (London)	eu-west-2
Asia Pacific (Sydney)	ap-southeast-2
Asia Pacific (Tokyo)	ap-northeast-1

We use gateway endpoints to bypass the public internet and communicate directly with S3 through the AWS network. If your S3 bucket is in the same AWS region as your Fivetran account, network traffic does not traverse the public internet. Learn more in AWS' Types of VPC endpoints for Amazon S3 documentation.

Data format

Fivetran stores your data in a structured format in the destination. We write your source data to Parquet files in the Fivetran pipeline and then store these files in specific tables in your data lake. We support the following table formats for S3 Data Lake:

While setting up your destination, you can choose the table format you want us to use for your destination.

Supported query engines

You can use various query engines to extract data from your destination tables. For example:

See Apache Iceberg's documentation and Delta Lake's documentation for more query engines you can use to query your tables.

We support Starburst Galaxy only for the Iceberg tables. Also, to use Starburst Galaxy as your query engine, you must integrate it using AWS Glue metastore.
If you are unable to extract your data using the query engine of your choice, contact our support team.

Folder structure

We can sync your data to any destination folder of your choice. If you do not specify any folder, we write the data to the following directory: <root_folder>/schema name/<table_name>

Unity Catalog

You can create external tables in Databricks Unity Catalog for the data stored in the Delta Lake tables of your S3 Data Lake destination. You can then query your data from these external tables.

To integrate Unity Catalog with your S3 Data Lake destination, do one of the following:

Configure automatic schema migration of Delta Lake tables in Databricks. You can do this in your destination setup form. Once configured, Fivetran will automatically create and maintain the Delta Lake tables in Databricks. The schema and table names in Databricks are same as the corresponding names in your S3 Data Lake destination. The schema migration does not impact your syncs. If the schema migration fails due to any error, the sync between your data source and destination does not fail and we display a warning on the Fivetran dashboard. For more information about configuring automatic schema migration, see our setup guide.
Create the tables manually by following the instructions in our Unity Catalog setup instructions.

Databricks uses the table definition to understand the structure of the data. It stores the metadata of the tables in the metastore and allows us to interact with them like regular tables within Databricks by accessing the data in its original location.

Table maintenance operations

We regularly perform maintenance operations on your destination tables to maintain an efficient data storage environment. The maintenance operations vary based on the format of the table.

You may observe a sync delay for your connections while the table maintenance operations are in progress.

Maintenance operations for Delta Lake tables

We perform the following maintenance operations on the Delta Lake tables in your destination:

Delete old snapshots and removed files: We delete the table snapshots that are older than the Snapshot Retention Period you specify in the destination setup form. However, we always retain the last 4 checkpoints of a table before deleting its snapshots. We also delete removed files. Removed files are the files that are not referenced in the latest table snapshots but were referenced in the older snapshots. These removed files contribute to your AWS subscription costs. We identify such files that are older than the snapshot retention period and delete them. This cleanup process runs once daily.
Delete orphan files: Orphan files are created because of unsuccessful operations within your data pipeline. The orphan files are stored in your S3 bucket but are no longer referenced in the Delta Lake table metadata. These orphan contribute to your AWS subscription costs. We identify such files that are older than 7 days and delete them every alternate Saturday.

Maintenance operations for Iceberg tables

We perform the following maintenance operations on the Iceberg tables in your destination:

Delete old snapshots: We delete the table snapshots that are older than the Snapshot Retention Period you specify in the destination setup form. We also delete the data files that were referenced only by the deleted snapshots and are not referenced by any active snapshot. This cleanup process runs once daily.
Delete previous versions of metadata files: In addition to the current version, we retain 3 previous versions of the metadata files and delete all the prior versions.
Delete orphan files: Orphan files are created because of unsuccessful operations within your data pipeline. The orphan files are stored in your S3 bucket but are no longer referenced in the Iceberg table metadata. These files contribute to your S3 subscription costs. We identify these orphan files and delete them every alternate Saturday.

To track the changes made to the Iceberg tables, we create a sequence_number.txt file in each table's metadata folder. You must never delete these files from your destination.
To guarantee that every table is queryable, it is best practice to not delete any metadata files. Deletions can lead to a corruption of the Iceberg table. Also, review any lifecycle rule configurations, because they can cause accidental file deletions that corrupt your Iceberg table. If your data lake tables are corrupted, the connection sync fails and you won't be able to query the data downstream. If you see the Data Lake Tables are Corrupted error in your Fivetran dashboard, you must delete the AWS Glue entry for these tables, delete all the underlying files of these tables from the bucket, and perform a full re-sync.

Column statistics

We update two column-level statistics, minimum value and maximum value, for your destination tables.

Column statistics for Delta Lake tables

By default, we update the statistics for the _fivetran_synced column and all primary keys. We also update the statistics for the _fivetran_active, _fivetran_end, and _fivetran_start columns in history mode tables.

Column statistics for Iceberg tables

We update the statistics as follows:

If you set up your connection on or after June 20, 2024 and if the table contains 200 or less columns, we update the statistics for all the columns.
If you set up your connection on or after June 20, 2024 and if the table contains more than 200 columns, we update the statistics for the _fivetran_synced column and all primary keys.
If history mode is enabled for the table, we update the statistics for the _fivetran_active, _fivetran_end, and _fivetran_start columns.
If you set up your connection before June 20, 2024, we update the statistics only for the new Parquet files that we sync into your destination after June 20, 2024. We do not update the statistics for the data that was synced before June 20, 2024.

Reserved column names

Iceberg table format does not allow columns with the following names:

_deleted
_file
_partition
_pos
_spec_id
file_path
pos
row

To avoid naming conflicts, we prefix the reserved column names with # before writing them to the Iceberg tables in your S3 Data Lake destination.

For more information about Iceberg's reserved field names, see Iceberg's documentation.

Troubleshooting data issues

We can troubleshoot the issues in the data stored in your destination. If you want us to troubleshoot the issues, you must allow us to access your destination data. Follow our troubleshooting documentation to allow Fivetran to access the data in your destination.

In some cases, Fivetran Support may require you to provide server access logging or CloudTrail logs to complete our investigation. Since AWS logging has a cost, we do not require you to enable it. However, you can’t enable AWS logs retroactively because they only start capturing events from the moment they are enabled. If you want to be proactive about potential future Data Lake issues, you can enable an AWS logging mechanism.

Limitations

Fivetran does not support position deletes for Iceberg tables. To avoid errors, we recommend that you avoid running any query that generates position deletes.
Fivetran does not support the Change Data Feed feature for Delta Lake tables. You must not enable Change Data Feed for the Delta Lake tables that Fivetran creates in your S3 Data Lake destination.
The AWS Glue Catalog supports only one table per combination of schema and table names within a region. Consequently, using multiple Fivetran Platform Connectors with the same Glue Catalog across different S3 Data Lake destinations can cause conflicts and result in various integration issues. To avoid such conflicts, we recommend configuring Fivetran Platform Connectors with distinct schema names for each S3 Data Lake destination.
Fivetran does not support the following Amazon S3 storage tiers due to their long retrieval times (ranging from a few minutes to 48 hours):
- S3 Glacier Flexible Retrieval
- S3 Glacier Deep Archive
- S3 Intelligent-Tiering Archive Access tier
- S3 Intelligent-Tiering Deep Archive Access tier