Managed Data Lakes Service
Managed Data Lakes Service provides a flexible and open approach to storing and managing data in your data lake. It leverages open standards, formats, and interfaces, ensuring compatibility across different ecosystems. The service writes data as Parquet files and maintains metadata for both Iceberg and Delta Lake tables simultaneously, allowing you to use the format that best fits your workflows without committing to a single table type.
The service supports Amazon S3 and Azure Data Lake Storage (ADLS), accommodating a range of storage configurations. By default, it manages metadata through the Fivetran Iceberg REST Catalog, which any Iceberg REST Catalog client can access. Additionally, you can configure the service to update AWS Glue or Databricks Unity Catalog, ensuring consistent data governance across different environments.
Fivetran simplifies the setup process by allowing you to configure multiple catalogs from a single setup form. This flexibility lets you choose from various catalog and query engine options, minimizing vendor lock-in and enabling you to work with the tools that best suit your needs.
Supported storage providers
Managed Data Lakes Service supports data lakes built using the following storage providers:
- Amazon Web Services (AWS)
- Azure Data Lake Storage (ADLS)
Supported deployment models
We support the SaaS deployment model for this destination.
Setup guide
Follow our step-by-step Managed Data Lakes Service setup guide to connect your data lake with Fivetran.
Type transformation and mapping
The data types in your data lake follow Fivetran's standard data type storage.
We use the following data type conversions:
FIVETRAN DATA TYPE | DESTINATION DATA TYPE (ICEBERG TABLE FORMAT) | DESTINATION DATA TYPE (DELTA LAKE TABLE FORMAT) |
---|---|---|
BOOLEAN | BOOLEAN | BOOLEAN |
INT | INTEGER | INTEGER |
LONG | LONG | LONG |
BIGDECIMAL | DECIMAL | DECIMAL |
FLOAT | FLOAT | FLOAT |
DOUBLE | DOUBLE | DOUBLE |
LOCALDATE | DATE | DATE |
INSTANT | TIMESTAMPTZ | TIMESTAMP |
LOCALDATETIME | TIMESTAMP | TIMESTAMP |
STRING | STRING | STRING |
BINARY | BINARY | BINARY |
SHORT | INTEGER | INTEGER |
JSON | STRING | STRING |
XML | STRING | STRING |
Supported catalogs
Managed Data Lakes Service enables you to efficiently organize and manage the metadata of your data lake tables using various catalog options. By default, it includes a pre-configured Fivetran Iceberg REST Catalog for managing the metadata of Iceberg tables. In addition to the default catalog, you have the option to integrate AWS Glue for Iceberg tables in AWS data lakes and Databricks Unity Catalog for Delta Lake tables in both AWS and Azure data lakes. These additional catalog options provide enhanced flexibility and control over your data management.
Fivetran Iceberg REST Catalog
Fivetran Iceberg REST Catalog serves as the default catalog for all Iceberg tables in AWS and Azure data lakes. Each data lake you set up has its own dedicated Fivetran Iceberg REST Catalog, which we configure based on your data lake details. You can use this catalog with any query engine to retrieve data from your data lake. Additionally, we provide a unique SQL script for each data lake that you can run in Snowflake to integrate Snowflake with your catalog. For more information about integrating Snowflake with Fivetran Iceberg REST Catalog, see our integration guide.
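Fivetran generates this SQL script for you, so the exact statement varies per data lake. As a rough illustration only, a Snowflake catalog integration for an Iceberg REST catalog generally looks like the following sketch; every identifier, URI, and credential below is a placeholder, not a value Fivetran actually issues:

```sql
-- Hypothetical sketch of a Snowflake catalog integration pointing at an
-- Iceberg REST catalog. The real script Fivetran provides is specific to
-- your data lake; all values below are placeholders.
CREATE CATALOG INTEGRATION fivetran_iceberg_rest
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'my_schema'
  REST_CONFIG = (
    CATALOG_URI = 'https://<fivetran-catalog-endpoint>'
    WAREHOUSE = '<warehouse_identifier>'
  )
  REST_AUTHENTICATION = (
    TYPE = BEARER
    BEARER_TOKEN = '<secret-token>'
  )
  ENABLED = TRUE;
```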
AWS Glue
Managed Data Lakes Service provides you with the option to integrate AWS Glue with your AWS data lake and use it to manage your Iceberg tables. This is in addition to the pre-configured Fivetran Iceberg REST Catalog we provide for all AWS data lakes. To integrate this catalog with your data lake, you must create an IAM policy for AWS Glue Data Catalog. For more information about integrating the catalog, see our setup guide.
Unity Catalog
Managed Data Lakes Service enables you to integrate Databricks Unity Catalog with your AWS and Azure data lakes, providing a structured and efficient way to organize and manage Delta Lake tables.
To integrate this catalog with your data lake, you must create an external location in Databricks and specify the storage credentials required to access the data lake. The storage location in Databricks must match the data lake storage used for Delta Lake tables. Once the external location is set up, you can query your data directly from these external tables. For more information about integrating the catalog with your data lake, see our setup guide.
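As an illustration of what that setup involves, a minimal Databricks SQL sketch follows; the location name, URL, and credential name are hypothetical, and the URL must match the storage your data lake actually uses:

```sql
-- Hypothetical sketch: registering the data lake storage as an external
-- location in Databricks. All names and paths are placeholders.
CREATE EXTERNAL LOCATION IF NOT EXISTS fivetran_data_lake
  URL 'abfss://my-container@mystorageaccount.dfs.core.windows.net/fivetran'
  WITH (STORAGE CREDENTIAL my_storage_credential)
  COMMENT 'Storage for Fivetran-managed Delta Lake tables';
```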
TIP: When filtering timestamp values from Unity Catalog, use the `string(<col_name>)` clause to get accurate values.
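For example, a minimal sketch assuming a hypothetical table `my_schema.orders` with a timestamp column `updated_at`; only the `string()` cast comes from the tip above:

```sql
-- Hypothetical example: casting a Unity Catalog timestamp column to a string
-- before filtering on it. Table and column names are placeholders.
SELECT *
FROM my_schema.orders
WHERE string(updated_at) >= '2024-01-01 00:00:00.000';
```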
Supported query engines
You can use various query engines to retrieve data from your data lake. The following table provides examples of query engines you can use for each data lake:
Query Engine | AWS | ADLS |
---|---|---|
Amazon Athena | ✓ | |
Apache Spark | ✓ | ✓ |
Azure Databricks | | ✓ |
Azure Synapse Analytics | | ✓ |
Dremio | ✓ | ✓ |
Redshift | ✓ | |
Snowflake | ✓ | |
Starburst Galaxy | ✓ | |
NOTE: For more information about the query engines you can use, see the Apache Iceberg and Delta Lake documentation. If you are unable to retrieve your data using the query engine of your choice, contact our support team.
Data and table formats
Fivetran organizes your data in a structured format within the data lake. We convert your source data into Parquet files within the Fivetran pipeline and store them in designated tables in your data lake. We support Delta Lake and Iceberg table formats for managing these tables in the data lake.
Storage directory
We write your data to the following directory in your data lake: `<root>/<prefix_path>/<schema_name>/<table_name>`
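For example, with a hypothetical S3 bucket `my-bucket` as the root, a prefix path of `fivetran`, a connector schema `salesforce`, and a table `account`, the data files would land under `s3://my-bucket/fivetran/salesforce/account/` (the ADLS layout is analogous).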
Table maintenance operations
To maintain an efficient and optimized data storage environment, Fivetran performs regular maintenance operations on your data lake tables. These operations are designed to manage storage consumption and enhance query performance for your data lake.
The maintenance operations we perform are as follows:
- Deletion of old snapshots and removed files: We delete table snapshots that are older than the Snapshot Retention Period you specify in the destination setup form. For Delta Lake tables, we always retain the last 4 checkpoints of a table before deleting its snapshots. In addition to the table snapshots, we also delete files that are no longer referenced in the latest table snapshots but were referenced in older snapshots; such removed files contribute to your storage provider subscription costs, so we identify the ones older than the snapshot retention period and delete them. This cleanup process runs once daily. To inspect a table's snapshots yourself, see the sketch after this list.
- Deletion of previous versions of metadata files: In addition to the current version, we retain the 3 most recent previous versions of the metadata files and delete all older versions.
- Deletion of orphan files: Orphan files are created because of unsuccessful operations within your data pipeline. The orphan files are stored in your S3 bucket or ADLS container but are no longer referenced in the table metadata. These files contribute to your storage provider subscription costs. We identify these orphan files and delete them every alternate Saturday.
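Snapshot retention is observable from the outside: Iceberg exposes metadata tables that most query engines can read. A minimal sketch, assuming a Spark SQL session and a hypothetical Fivetran-created table `my_schema.account` registered in a catalog named `my_catalog` (all names are placeholders):

```sql
-- List the snapshots that currently exist for an Iceberg table, e.g. to
-- confirm that snapshots older than the retention period have been removed.
SELECT snapshot_id, committed_at, operation
FROM my_catalog.my_schema.account.snapshots
ORDER BY committed_at DESC;
```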
IMPORTANT:
- To track the changes made to the Iceberg tables, we create a `sequence_number.txt` file in each table's metadata folder. You must never delete these files from your data lake.
- To ensure that every table is queryable, we recommend not deleting any metadata file, because such deletions can corrupt the Iceberg tables.
Column statistics
Fivetran updates two column-level statistics, minimum value and maximum value, for your data lake tables. We update column-level statistics to enhance query performance and optimize storage in your data lake. The specific statistics maintained depend on the table format and the number of columns in your tables.
Depending on the number of columns in the table, we update the statistics as follows:
- If the table contains 200 or fewer columns, we update the statistics for all columns.
- If the table contains more than 200 columns, we update the statistics only for the primary key columns.
Reserved column names
The Iceberg table format does not allow columns with the following names:
- `_deleted`
- `_file`
- `_partition`
- `_pos`
- `_spec_id`
- `file_path`
- `pos`
- `row`
To avoid these reserved column names, Fivetran prefixes them with a hash symbol (`#`) before writing the columns to the Iceberg tables. For example, a source column named `pos` is written as `#pos`.
NOTE: For more information about Iceberg's reserved field names, see Iceberg documentation.
Supported regions
Supported AWS Regions
For AWS data lakes, we support S3 buckets located in the following AWS Regions:
Region | Code |
---|---|
US East (N. Virginia) | us-east-1 |
US East (Ohio) | us-east-2 |
US West (Oregon) | us-west-2 |
AWS GovCloud (US-West) | us-gov-west-1 |
Europe (Frankfurt) | eu-central-1 |
Europe (Ireland) | eu-west-1 |
Asia Pacific (Mumbai) | ap-south-1 |
Asia Pacific (Singapore) | ap-southeast-1 |
Canada (Central) | ca-central-1 |
Europe (London) | eu-west-2 |
Asia Pacific (Sydney) | ap-southeast-2 |
Asia Pacific (Tokyo) | ap-northeast-1 |
Supported Azure regions
For Azure data lakes, we support ADLS containers located in the following Azure regions:
Region | Code |
---|---|
East US 2 | eastus2 |
Central US | centralus |
East US | eastus |
West US 3 | westus3 |
Australia East | australiaeast |
UK South | uksouth |
West Europe | westeurope |
Germany West Central | germanywestcentral |
Canada Central | canadacentral |
UAE North | uaenorth |
Southeast Asia | southeastasia |
Japan East | japaneast |
Central India | centralindia |
Limitations
Limitations for AWS
- Fivetran does not support position deletes for Iceberg tables. To avoid errors, we recommend that you avoid running any query that generates position deletes.
- Fivetran supports only the AWS Standard storage class for AWS data lakes.
- Fivetran does not support the Change Data Feed feature for Delta Lake tables. You must not enable Change Data Feed for the Delta Lake tables that Fivetran creates in your AWS data lake.
- Fivetran does not support the following storage tiers due to their long retrieval times (ranging from a few minutes to 48 hours):
  - S3 Glacier Flexible Retrieval
  - S3 Glacier Deep Archive
  - S3 Intelligent-Tiering Archive Access tier
  - S3 Intelligent-Tiering Deep Archive Access tier
- AWS Glue Catalog supports only one table per combination of schema and table names within a Region. Consequently, using multiple Fivetran Platform Connectors with the same Glue Catalog across different AWS data lakes can cause conflicts and result in various integration issues. To avoid such conflicts, we recommend configuring Fivetran Platform Connectors with distinct schema names for each AWS data lake.
Limitations for ADLS
- Fivetran does not support position deletes for Iceberg tables. To avoid errors, we recommend that you avoid running any query that generates position deletes.
- Fivetran creates DECIMAL columns with the maximum precision and scale, (38, 10).
- Spark SQL pool queries cannot read the maximum values of the DOUBLE and FLOAT data types.
- Fivetran does not support the archive access tier because its retrieval time can extend to several hours.
- Spark SQL pool queries truncate TIMESTAMP values to seconds. To query a table using a TIMESTAMP column, you can use the `from_unixtime(unix_timestamp(<col_name>, 'yyyy-MM-dd HH:mm:ss.SSS'), 'yyyy-MM-dd HH:mm:ss.ms')` clause in your queries to get the accurate values, including milliseconds and microseconds; see the sketch after this list.
- Fivetran does not support the Change Data Feed feature for Delta Lake tables. You must not enable Change Data Feed for the Delta Lake tables that Fivetran creates in your data lake.
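The timestamp workaround above, shown in context: a minimal sketch assuming a hypothetical table `my_schema.orders` with a TIMESTAMP column `updated_at`, using only the expression from the limitation above:

```sql
-- Hypothetical example: reformatting a TIMESTAMP column in a Spark SQL pool
-- query so that sub-second precision is not silently truncated. Table and
-- column names are placeholders; the format strings come from the limitation
-- described above.
SELECT
  order_id,
  from_unixtime(unix_timestamp(updated_at, 'yyyy-MM-dd HH:mm:ss.SSS'),
                'yyyy-MM-dd HH:mm:ss.ms') AS updated_at_precise
FROM my_schema.orders;
```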