Managed Data Lake
Our Managed Data Lake is the single destination for all data lake integrations, consolidating support for Amazon S3, Azure Data Lake Storage (ADLS), and OneLake. This unified approach simplifies data management, ensuring consistent ingestion, transformation, and query capabilities across all supported storage providers.
Supported storage providers
Fivetran’s Managed Data Lake supports the following storage providers:
Amazon S3
Azure Data Lake Storage (ADLS)
OneLake
Google Cloud Storage (GCS) (Planned Support)
Setup guide
Follow our step-by-step Managed Data Lake setup guide to connect your data lake destination with Fivetran.
Type transformation and mapping
This table outlines how Fivetran's standard data types are converted into the corresponding destination data types for each storage provider.
Fivetran Data Type | Amazon S3 Data Type | ADLS Data Type (Delta Lake) | ADLS Data Type (Iceberg) | OneLake Data Type |
---|---|---|---|---|
BOOLEAN | BOOLEAN | BOOLEAN | BOOLEAN | BOOLEAN |
SHORT | SHORT | SHORT | INTEGER | SHORT |
INT | INTEGER | INTEGER | INTEGER | INTEGER |
LONG | LONG | LONG | LONG | LONG |
BIGDECIMAL | DECIMAL (38, 10) | DECIMAL (38, 10) | DECIMAL (38, 10) | DECIMAL (38, 10) |
FLOAT | FLOAT | FLOAT | FLOAT | FLOAT |
DOUBLE | DOUBLE | DOUBLE | DOUBLE | DOUBLE |
LOCALDATE | DATE | DATE | DATE | DATE |
INSTANT | TIMESTAMP | TIMESTAMP | TIMESTAMPTZ | TIMESTAMP |
STRING | STRING | STRING | STRING | STRING |
XML | STRING | STRING | STRING | STRING |
JSON | STRING | STRING | STRING | STRING |
BINARY | BINARY | BINARY | BINARY | BINARY |
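The mapping above can be expressed as a lookup: most destinations share one conversion, and the ADLS (Iceberg) column differs only for SHORT and INSTANT. The following sketch is illustrative only — the table is the authoritative reference, and the function name is ours, not Fivetran's:

```python
# Illustrative transcription of the type-mapping table above.
# COMMON holds the mapping shared by S3, ADLS (Delta Lake), and OneLake;
# ICEBERG_OVERRIDES holds the two rows where ADLS (Iceberg) differs.
COMMON = {
    "BOOLEAN": "BOOLEAN", "SHORT": "SHORT", "INT": "INTEGER",
    "LONG": "LONG", "BIGDECIMAL": "DECIMAL(38, 10)", "FLOAT": "FLOAT",
    "DOUBLE": "DOUBLE", "LOCALDATE": "DATE", "INSTANT": "TIMESTAMP",
    "STRING": "STRING", "XML": "STRING", "JSON": "STRING", "BINARY": "BINARY",
}
ICEBERG_OVERRIDES = {"SHORT": "INTEGER", "INSTANT": "TIMESTAMPTZ"}

def destination_type(fivetran_type: str, adls_iceberg: bool = False) -> str:
    """Resolve a Fivetran data type to the destination data type."""
    if adls_iceberg and fivetran_type in ICEBERG_OVERRIDES:
        return ICEBERG_OVERRIDES[fivetran_type]
    return COMMON[fivetran_type]
```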
Supported query engines
This table provides a comprehensive view of the query engines compatible with each data lake destination.
Query Engine | Amazon S3 Data Lake | ADLS Data Lake | OneLake Data Lake |
---|---|---|---|
Amazon Athena | ✓ | | |
Azure Databricks | | ✓ | ✓ |
Azure Synapse Analytics | | ✓ | ✓ |
Dremio | ✓ | ✓ | |
Redshift | ✓ | | |
Snowflake | ✓ | | |
Starburst Galaxy | ✓ | | |
NOTE
- Make sure Unity Catalog is not integrated with your Databricks workspace.
- We support Starburst Galaxy only for Iceberg tables. Also, to use Starburst Galaxy as your query engine, you must integrate it using the AWS Glue metastore. If you are unable to extract your data using the query engine of your choice, contact our support team.
Data formats
Fivetran stores your source data in a structured format within the destination. The data is written to Parquet files during the Fivetran pipeline process and subsequently stored in specific tables in your data lake.
Supported table formats
Depending on your chosen storage provider, Fivetran supports the following table formats:
Amazon S3 Data Lake:
- Delta Lake (Beta)
- Apache Iceberg
Azure Data Lake Storage (ADLS):
- Delta Lake
- Apache Iceberg (Beta)
OneLake:
- Delta Lake
NOTE: During the destination setup, you can select the table format you prefer for your data lake.
Folder structure
Fivetran organizes your data into destination folders based on your configuration:
Amazon S3 Data Lake:
- Default directory:
<root_folder>/<schema_name>/<table_name>
Azure Data Lake Storage (ADLS):
- If a prefix path is specified:
<root>/<prefix_path>/<schema>/<table>
- If no prefix path is specified, the default is:
<root>/fivetran/<schema>/<table>
OneLake:
- Directory structure:
<lakehouse_name>.lakehouse/Tables/<table_name>
or <lakehouse_guid>/Tables/<table_name>
This structured approach ensures organized storage and efficient data retrieval across all supported data lake destinations.
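The folder layouts above can be sketched as a single path-building function. This is illustrative only — Fivetran constructs these paths internally, and the function name, defaults, and placeholder values below are ours:

```python
# Sketch of the destination folder layouts described above.
def destination_path(provider: str, schema: str, table: str,
                     root: str = "root", prefix: str = "",
                     lakehouse: str = "my_lakehouse") -> str:
    """Build the destination directory for a table, per storage provider."""
    if provider == "s3":
        # <root_folder>/<schema_name>/<table_name>
        return f"{root}/{schema}/{table}"
    if provider == "adls":
        # <root>/<prefix_path>/<schema>/<table>, with 'fivetran' as the
        # default when no prefix path is specified
        middle = prefix if prefix else "fivetran"
        return f"{root}/{middle}/{schema}/{table}"
    if provider == "onelake":
        # OneLake paths are keyed by lakehouse, not schema
        return f"{lakehouse}.lakehouse/Tables/{table}"
    raise ValueError(f"unknown provider: {provider}")
```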
Table maintenance operations
To maintain an efficient and optimized data storage environment, Fivetran performs regular maintenance operations on your destination tables. These operations vary based on the table format and are designed to manage storage consumption and enhance query performance.
Maintenance operations for Delta Lake tables
Fivetran performs the following maintenance tasks on Delta Lake tables across all supported destinations:
Delete old snapshots and removed files: We remove table snapshots older than the Snapshot Retention Period specified during your destination setup. However, we always retain the last four checkpoints of a table before deleting its snapshots. Additionally, we delete removed files—those not referenced in the latest table snapshots but present in older snapshots—to optimize storage costs.
Delete orphan files: Orphan files result from unsuccessful operations within your data pipeline and are no longer referenced in the Delta Lake table metadata. These files, if left unmanaged, contribute to unnecessary storage costs. Fivetran identifies orphan files older than seven days and deletes them every two weeks.
NOTE: You may observe a sync delay for your connectors while table maintenance operations are in progress. To minimize disruptions, we schedule these operations only on Saturdays.
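The orphan-file policy above — delete files that are no longer referenced in the table metadata and are older than seven days — can be sketched as a filter. The inputs (a file listing and a set of referenced paths) are hypothetical stand-ins for what the real pipeline tracks:

```python
# Sketch of the orphan-file deletion policy described above.
from datetime import datetime, timedelta

def orphans_to_delete(files: dict, referenced: set, now: datetime,
                      min_age: timedelta = timedelta(days=7)) -> list:
    """files maps path -> last-modified time; referenced holds paths still
    tracked in the table metadata. Returns deletable orphan paths, sorted."""
    return sorted(
        path for path, modified in files.items()
        if path not in referenced and now - modified >= min_age
    )
```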
Maintenance operations for Iceberg tables
For Iceberg tables, Fivetran performs the following maintenance tasks:
Expire snapshots: We remove snapshots and associated metadata files that are older than the configured retention threshold to free up storage space.
Remove orphan files: Similar to Delta Lake tables, we identify and delete orphan files not referenced by any snapshots to maintain storage efficiency.
NOTE: The specific maintenance operations for Iceberg tables may vary based on the destination and configuration. Please refer to the destination-specific documentation for detailed information.
By performing these maintenance operations, Fivetran ensures that your data lake remains optimized for both storage and performance, providing a seamless and efficient data experience.
Column statistics
Fivetran updates column-level statistics to enhance query performance and optimize storage in your data lake destination. The specific statistics maintained depend on the table format and the number of columns in your tables.
Delta Lake tables
For tables using the Delta Lake format:
Primary key columns: Fivetran updates the following statistics:
- Minimum Value
- Maximum Value
NOTE: Statistics are maintained only for primary key columns in Delta Lake tables.
Iceberg tables
For tables using the Iceberg format, the maintenance of column statistics is determined by the number of columns:
Tables with 200 or Fewer Columns:
- Fivetran updates the minimum and maximum values for all columns.
Tables with More Than 200 Columns:
- Fivetran updates the minimum and maximum values only for primary key columns.
NOTE: This approach ensures efficient storage and performance optimization for large tables.
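The 200-column rule for Iceberg tables can be sketched as follows — min/max statistics for every column on narrow tables, but only for primary key columns once a table exceeds the threshold. The function name is ours, for illustration:

```python
# Sketch of the Iceberg column-statistics rule described above.
def stats_columns(columns: list, primary_keys: list, threshold: int = 200) -> list:
    """Return the columns for which min/max statistics are maintained."""
    if len(columns) <= threshold:
        return list(columns)          # narrow table: all columns
    return [c for c in columns if c in primary_keys]  # wide table: PKs only
```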
Implementation details
For Connectors Set Up On or After June 20, 2024:
- The above policies are applied based on the number of columns in the table.
For Connectors Set Up Before June 20, 2024:
- Fivetran updates statistics only for new Parquet files synced into your destination after June 20, 2024.
- Statistics for data synced before this date remain unchanged.
Reserved column names
Certain column names are reserved in specific table formats and may cause conflicts during data ingestion. To prevent these conflicts, Fivetran modifies such column names before writing them to your destination tables.
Iceberg table format
The Iceberg table format reserves the following column names:
_deleted
_file
_partition
_pos
_spec_id
file_path
pos
row
To avoid naming conflicts, Fivetran prefixes these reserved column names with a hash symbol (#) before writing them to the Iceberg tables in your destination.
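The renaming rule above amounts to a simple lookup against the reserved set. A minimal sketch (the function name is ours):

```python
# Sketch of the reserved-name handling above: Iceberg-reserved column
# names are prefixed with '#' before being written to the destination.
ICEBERG_RESERVED = {
    "_deleted", "_file", "_partition", "_pos", "_spec_id",
    "file_path", "pos", "row",
}

def safe_column_name(name: str) -> str:
    """Prefix reserved Iceberg column names; pass others through unchanged."""
    return f"#{name}" if name in ICEBERG_RESERVED else name
```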
Delta Lake table format
The Delta Lake table format does not have specific reserved column names that require modification. Therefore, Fivetran does not alter column names when writing to Delta Lake tables.
OneLake
In OneLake destinations, Fivetran uses the Delta Lake table format. As mentioned above, there are no specific reserved column names in Delta Lake that necessitate modification.
By handling reserved column names appropriately, Fivetran ensures seamless data ingestion and prevents potential conflicts in your data lake destination.
Limitations
When using Fivetran's Managed Data Lake as your destination, please be aware of the following limitations:
General limitations
Decimal Data Type Precision: Fivetran creates DECIMAL columns with a maximum precision and scale of (38, 10).
Table and Schema Naming:
- Names must not start or end with an underscore (_).
- Names must not contain multiple consecutive underscores (__).
Change Data Feed: Fivetran does not support the Change Data Feed feature for Delta Lake tables. Do not enable this feature for tables that Fivetran creates in your destination.
Query Engine Compatibility (only for OneLake): Ensure that Unity Catalog is not integrated with your Databricks workspace when querying data from your OneLake destination.
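The table and schema naming rules above can be sketched as a validator (the function name is ours, for illustration):

```python
# Sketch of the table/schema naming rules described above:
# no leading or trailing underscore, no consecutive underscores.
def valid_name(name: str) -> bool:
    """Return True if the name satisfies the documented naming rules."""
    if name.startswith("_") or name.endswith("_"):
        return False
    if "__" in name:
        return False
    return True
```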
Data type limitations
Floating-Point Precision:
- Spark SQL and SparkR queries cannot read the maximum values of DOUBLE and FLOAT data types.
- SparkR queries cannot read the minimum and maximum values of the LONG data type.
Timestamp Precision: Spark SQL and SparkR queries truncate timestamp values to seconds. To retrieve data with millisecond or microsecond precision, use the following clause in your queries:
from_unixtime(unix_timestamp(<col_name>, 'yyyy-MM-dd HH:mm:ss.SSS'), 'yyyy-MM-dd HH:mm:ss.SSS')
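As a Python analog of the precision issue above (illustrative only — the actual truncation happens in Spark SQL/SparkR, not Python): formatting a timestamp with a seconds-only pattern silently drops sub-second detail, while an explicit fractional pattern preserves it.

```python
# Illustrative analog of second-level truncation vs. millisecond formatting.
from datetime import datetime

ts = datetime(2024, 6, 20, 12, 30, 45, 123000)  # 123 ms of sub-second detail
seconds_only = ts.strftime("%Y-%m-%d %H:%M:%S")           # drops milliseconds
with_millis = ts.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]    # trim micros to millis
```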