Managed Data Lakes Service
Managed Data Lakes Service provides a flexible and open approach to storing and managing data in your data lake. It leverages open standards, formats, and interfaces, ensuring compatibility across different ecosystems. The service writes data as Parquet files and maintains metadata for both Iceberg and Delta Lake tables simultaneously, allowing you to use the format that best fits your workflows without committing to a single table type.
The service supports Amazon S3 and Azure Data Lake Storage (ADLS), accommodating a range of storage configurations. By default, it manages metadata through the Fivetran Iceberg REST Catalog, which any Iceberg REST Catalog client can access. Additionally, you can configure the service to update AWS Glue or Databricks Unity Catalog, ensuring consistent data governance across different environments.
Fivetran simplifies the setup process by allowing you to configure multiple catalogs from a single setup form. This flexibility lets you choose from various catalog and query engine options, minimizing vendor lock-in and enabling you to work with the tools that best suit your needs.
Supported storage providers
Managed Data Lakes Service supports data lakes built using the following storage providers:
- Amazon Web Services (AWS)
- Azure Data Lake Storage (ADLS)
Supported deployment models
We support the SaaS deployment model for this destination.
Setup guide
Follow our step-by-step Managed Data Lakes Service setup guide to connect your data lake with Fivetran.
Type transformation and mapping
The data types in your data lake follow Fivetran's standard data type storage.
We use the following data type conversions:
FIVETRAN DATA TYPE | DESTINATION DATA TYPE (ICEBERG TABLE FORMAT) | DESTINATION DATA TYPE (DELTA LAKE TABLE FORMAT) |
---|---|---|
BOOLEAN | BOOLEAN | BOOLEAN |
INT | INTEGER | INTEGER |
LONG | LONG | LONG |
BIGDECIMAL | DECIMAL | DECIMAL |
FLOAT | FLOAT | FLOAT |
DOUBLE | DOUBLE | DOUBLE |
LOCALDATE | DATE | DATE |
INSTANT | TIMESTAMPTZ | TIMESTAMP |
LOCALDATETIME | TIMESTAMP | TIMESTAMP |
STRING | STRING | STRING |
BINARY | BINARY | BINARY |
SHORT | INTEGER | INTEGER |
JSON | STRING | STRING |
XML | STRING | STRING |
Supported catalogs
Managed Data Lakes Service enables you to efficiently organize and manage the metadata of your data lake tables using various catalog options. By default, it includes a pre-configured Fivetran Iceberg REST Catalog for managing the metadata of Iceberg tables. In addition to the default catalog, you have the option to integrate AWS Glue for Iceberg tables in AWS data lakes and Databricks Unity Catalog for Delta Lake tables in both AWS and Azure data lakes. These additional catalog options provide enhanced flexibility and control over your data management.
Fivetran Iceberg REST Catalog
Fivetran Iceberg REST Catalog serves as the default catalog for all Iceberg tables in AWS and Azure data lakes. Each data lake you set up has its own dedicated Fivetran Iceberg REST Catalog, which we configure based on your data lake details. You can use this catalog with any query engine to retrieve data from your data lake. Additionally, we provide a unique SQL script for each data lake that you can run in Snowflake to integrate Snowflake with your catalog. For more information about integrating Snowflake with Fivetran Iceberg REST Catalog, see our integration guide.
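Fivetran generates this SQL script for you, so the exact statement varies per data lake. As a rough illustration only, a Snowflake catalog integration for an Iceberg REST catalog generally looks like the following sketch; every identifier, URI, and credential below is a placeholder, not a value Fivetran actually issues:

```sql
-- Hypothetical sketch of a Snowflake catalog integration pointing at an
-- Iceberg REST catalog. The real script Fivetran provides is specific to
-- your data lake; all values below are placeholders.
CREATE CATALOG INTEGRATION fivetran_iceberg_rest
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'my_schema'
  REST_CONFIG = (
    CATALOG_URI = 'https://<fivetran-catalog-endpoint>'
    WAREHOUSE = '<warehouse_identifier>'
  )
  REST_AUTHENTICATION = (
    TYPE = BEARER
    BEARER_TOKEN = '<secret-token>'
  )
  ENABLED = TRUE;
```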
AWS Glue
Managed Data Lakes Service provides you with the option to integrate AWS Glue with your AWS data lake and use it to manage your Iceberg tables. This is in addition to the pre-configured Fivetran Iceberg REST Catalog we provide for all AWS data lakes. To integrate this catalog with your data lake, you must create an IAM policy for AWS Glue Data Catalog. For more information about integrating the catalog, see our setup guide.
Unity Catalog
Managed Data Lakes Service enables you to integrate Databricks Unity Catalog with your AWS and Azure data lakes, providing a structured and efficient way to organize and manage Delta Lake tables.
To integrate this catalog with your data lake, you must create an external location in Databricks and specify the storage credentials required to access the data lake. The storage location in Databricks must match the data lake storage used for Delta Lake tables. Once the external location is set up, you can query your data directly from these external tables. For more information about integrating the catalog with your data lake, see our setup guide.
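As an illustration of what that setup involves, a minimal Databricks SQL sketch follows; the location name, URL, and credential name are hypothetical, and the URL must match the storage your data lake actually uses:

```sql
-- Hypothetical sketch: registering the data lake storage as an external
-- location in Databricks. All names and paths are placeholders.
CREATE EXTERNAL LOCATION IF NOT EXISTS fivetran_data_lake
  URL 'abfss://my-container@mystorageaccount.dfs.core.windows.net/fivetran'
  WITH (STORAGE CREDENTIAL my_storage_credential)
  COMMENT 'Storage for Fivetran-managed Delta Lake tables';
```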
TIP: When filtering timestamp values from Unity Catalog, use the `string(<col_name>)` clause to get accurate values.
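For example, a minimal sketch assuming a hypothetical table `my_schema.orders` with a timestamp column `updated_at`; only the `string()` cast comes from the tip above:

```sql
-- Hypothetical example: casting a Unity Catalog timestamp column to a string
-- before filtering on it. Table and column names are placeholders.
SELECT *
FROM my_schema.orders
WHERE string(updated_at) >= '2024-01-01 00:00:00.000';
```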
Supported query engines
You can use various query engines to retrieve data from your data lake. The following table provides examples of query engines you can use for each data lake:
Query Engine | AWS | ADLS |
---|---|---|
Amazon Athena | ✓ | |
Apache Spark | ✓ | ✓ |
Azure Databricks | | ✓ |
Azure Synapse Analytics | | ✓ |
Dremio | ✓ | ✓ |
Redshift | ✓ | |
Snowflake | ✓ | |
Starburst Galaxy | ✓ | |
NOTE: For more information about the query engines you can use, see the Apache Iceberg and Delta Lake documentation. If you are unable to retrieve your data using the query engine of your choice, contact our support team.
Data and table formats
Fivetran organizes your data in a structured format within the data lake. We convert your source data into Parquet files within the Fivetran pipeline and store them in designated tables in your data lake. We support Delta Lake and Iceberg table formats for managing these tables in the data lake.
Storage directory
We write your data to the following directory in your data lake: `<root>/<prefix_path>/<schema_name>/<table_name>`
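For example, with a hypothetical S3 bucket `my-bucket` as the root, a prefix path of `fivetran`, a connector schema `salesforce`, and a table `account`, the data files would land under `s3://my-bucket/fivetran/salesforce/account/` (the ADLS layout is analogous).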
Table maintenance operations
To maintain an efficient and optimized data storage environment, Fivetran performs regular maintenance operations on your data lake tables. These operations are designed to manage storage consumption and enhance query performance for your data lake.
The maintenance operations we perform are as follows:
- Deletion of old snapshots and removed files: We delete table snapshots that are older than the Snapshot Retention Period you specify in the destination setup form. For Delta Lake tables, we always retain the last 4 checkpoints of a table before deleting its snapshots. In addition to the table snapshots, we also delete files that are no longer referenced in the latest table snapshots but were referenced in older snapshots; such removed files contribute to your storage provider subscription costs, so we identify the ones older than the snapshot retention period and delete them. This cleanup process runs once daily. To inspect a table's snapshots yourself, see the sketch after this list.
- Deletion of previous versions of metadata files: In addition to the current version, we retain the 3 most recent previous versions of the metadata files and delete all older versions.
- Deletion of orphan files: Orphan files are created because of unsuccessful operations within your data pipeline. The orphan files are stored in your S3 bucket or ADLS container but are no longer referenced in the table metadata. These files contribute to your storage provider subscription costs. We identify these orphan files and delete them every alternate Saturday.
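Snapshot retention is observable from the outside: Iceberg exposes metadata tables that most query engines can read. A minimal sketch, assuming a Spark SQL session and a hypothetical Fivetran-created table `my_schema.account` registered in a catalog named `my_catalog` (all names are placeholders):

```sql
-- List the snapshots that currently exist for an Iceberg table, e.g. to
-- confirm that snapshots older than the retention period have been removed.
SELECT snapshot_id, committed_at, operation
FROM my_catalog.my_schema.account.snapshots
ORDER BY committed_at DESC;
```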
IMPORTANT:
- To track the changes made to the Iceberg tables, we create a `sequence_number.txt` file in each table's metadata folder. You must never delete these files from your data lake.
- To ensure that every table is queryable, we recommend not deleting any metadata file, because such deletions can corrupt the Iceberg tables.
Column statistics
Fivetran updates two column-level statistics, minimum value and maximum value, for your data lake tables. We update column-level statistics to enhance query performance and optimize storage in your data lake. The specific statistics maintained depend on the table format and the number of columns in your tables.
Depending on the number of columns in the table, we update the statistics as follows:
- If the table contains 200 or fewer columns, we update the statistics for all columns.
- If the table contains more than 200 columns, we update the statistics only for the primary key columns.
Reserved column names
The Iceberg table format does not allow columns with the following names:
- `_deleted`
- `_file`
- `_partition`
- `_pos`
- `_spec_id`
- `file_path`
- `pos`
- `row`
To avoid these reserved column names, Fivetran prefixes them with a hash symbol (`#`) before writing the columns to the Iceberg tables. For example, a source column named `pos` is written as `#pos`.
NOTE: For more information about Iceberg's reserved field names, see Iceberg documentation.
Supported regions
Supported AWS Regions
For AWS data lakes, we support S3 buckets located in the following AWS Regions:
Region | Code |
---|---|
US East (N. Virginia) | us-east-1 |
US East (Ohio) | us-east-2 |
US West (Oregon) | us-west-2 |
AWS GovCloud (US-West) | us-gov-west-1 |
Europe (Frankfurt) | eu-central-1 |
Europe (Ireland) | eu-west-1 |
Asia Pacific (Mumbai) | ap-south-1 |
Asia Pacific (Singapore) | ap-southeast-1 |
Canada (Central) | ca-central-1 |
Europe (London) | eu-west-2 |
Asia Pacific (Sydney) | ap-southeast-2 |
Asia Pacific (Tokyo) | ap-northeast-1 |
Supported Azure regions
For Azure data lakes, we support ADLS containers located in the following Azure regions:
Region | Code |
---|---|
East US 2 | eastus2 |
Central US | centralus |
East US | eastus |
West US 3 | westus3 |
Australia East | australiaeast |
UK South | uksouth |
West Europe | westeurope |
Germany West Central | germanywestcentral |
Canada Central | canadacentral |
UAE North | uaenorth |
Southeast Asia | southeastasia |
Japan East | japaneast |
Central India | centralindia |
Limitations
Limitations for AWS
- Fivetran does not support position deletes for Iceberg tables. To avoid errors, we recommend that you avoid running any query that generates position deletes.
- Fivetran supports only the AWS Standard storage class for AWS data lakes.
- Fivetran does not support the Change Data Feed feature for Delta Lake tables. You must not enable Change Data Feed for the Delta Lake tables that Fivetran creates in your AWS data lake.
- Fivetran does not support the following storage tiers due to their long retrieval times (ranging from a few minutes to 48 hours):
  - S3 Glacier Flexible Retrieval
  - S3 Glacier Deep Archive
  - S3 Intelligent-Tiering Archive Access tier
  - S3 Intelligent-Tiering Deep Archive Access tier
- AWS Glue Catalog supports only one table per combination of schema and table names within a Region. Consequently, using multiple Fivetran Platform Connectors with the same Glue Catalog across different AWS data lakes can cause conflicts and result in various integration issues. To avoid such conflicts, we recommend configuring Fivetran Platform Connectors with distinct schema names for each AWS data lake.
Limitations for ADLS
- Fivetran does not support position deletes for Iceberg tables. To avoid errors, we recommend that you avoid running any query that generates position deletes.
- Fivetran creates DECIMAL columns with the maximum precision and scale, (38, 10).
- Spark SQL pool queries cannot read the maximum values of the DOUBLE and FLOAT data types.
- Fivetran does not support the archive access tier because its retrieval time can extend to several hours.
- Spark SQL pool queries truncate TIMESTAMP values to seconds. To query a table using a TIMESTAMP column, you can use the `from_unixtime(unix_timestamp(<col_name>, 'yyyy-MM-dd HH:mm:ss.SSS'), 'yyyy-MM-dd HH:mm:ss.ms')` clause in your queries to get the accurate values, including milliseconds and microseconds; see the sketch after this list.
- Fivetran does not support the Change Data Feed feature for Delta Lake tables. You must not enable Change Data Feed for the Delta Lake tables that Fivetran creates in your data lake.
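The timestamp workaround above, shown in context: a minimal sketch assuming a hypothetical table `my_schema.orders` with a TIMESTAMP column `updated_at`, using only the expression from the limitation above:

```sql
-- Hypothetical example: reformatting a TIMESTAMP column in a Spark SQL pool
-- query so that sub-second precision is not silently truncated. Table and
-- column names are placeholders; the format strings come from the limitation
-- described above.
SELECT
  order_id,
  from_unixtime(unix_timestamp(updated_at, 'yyyy-MM-dd HH:mm:ss.SSS'),
                'yyyy-MM-dd HH:mm:ss.ms') AS updated_at_precise
FROM my_schema.orders;
```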