Azure Data Lake Storage
Azure Data Lake Storage (ADLS) is a cloud-based, scalable data storage solution for big data analytics. ADLS allows you to store and manage massive amounts of data in any format. Fivetran supports data lakes built on ADLS as a destination.
We use the Hadoop Catalog as the data catalog for the Iceberg tables in your destination. The Hadoop Catalog follows the Iceberg specification for file system tables.
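If you want to query these Iceberg tables directly with Apache Spark, you can register a Hadoop catalog that points at the ADLS path Fivetran writes to. The snippet below is a minimal PySpark sketch, not Fivetran-provided configuration: it assumes your Spark environment already includes the ABFS filesystem connector, and the catalog name (adls), container, storage account, prefix path, account key, and schema/table names are placeholders you must replace.

```python
# Minimal sketch of reading Fivetran-written Iceberg tables through a Hadoop
# catalog. All angle-bracketed values and the schema/table names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-fivetran-iceberg")
    # Pull in the Iceberg Spark runtime; adjust the coordinates for your Spark version.
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    # Register a Hadoop catalog whose warehouse is the ADLS path Fivetran writes to.
    .config("spark.sql.catalog.adls", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.adls.type", "hadoop")
    .config("spark.sql.catalog.adls.warehouse",
            "abfss://<container>@<account>.dfs.core.windows.net/<prefix_path>")
    # Authenticate to ADLS (an account key is shown here; any supported ABFS auth works).
    .config("spark.hadoop.fs.azure.account.key.<account>.dfs.core.windows.net",
            "<storage-account-key>")
    .getOrCreate()
)

# Iceberg tables are addressed as <catalog>.<schema>.<table>.
df = spark.table("adls.my_schema.my_table")
df.show()
```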
Setup guide
Follow our step-by-step Azure Data Lake Storage setup guide to connect your Azure Data Lake Storage destination with Fivetran.
Type transformation and mapping
The data types in your Azure Data Lake Storage destination follow Fivetran's standard data type storage.
We use the following data type conversions:
Fivetran Data Type | Destination Data Type (Delta Lake Table Format) | Destination Data Type (Iceberg Table Format) |
---|---|---|
BOOLEAN | BOOLEAN | BOOLEAN |
SHORT | SHORT | INTEGER |
INT | INTEGER | INTEGER |
LONG | LONG | LONG |
BIGDECIMAL | DECIMAL(38, 10) | DECIMAL(38, 10) |
FLOAT | FLOAT | FLOAT |
DOUBLE | DOUBLE | DOUBLE |
LOCALDATE | DATE | DATE |
INSTANT | TIMESTAMP | TIMESTAMPTZ |
STRING | STRING | STRING |
XML | STRING | STRING |
JSON | STRING | STRING |
BINARY | BINARY | BINARY |
Supported query engines
To extract data from your ADLS destination, use a supported query engine. For example, you can query your tables with the query engines available in Azure Synapse Analytics, such as the Spark SQL pool.
Data format
Fivetran stores your data in a structured format in the destination. We write your source data to Parquet files in the Fivetran pipeline and then store these files in specific tables in your data lake. We support the following table formats for Azure Data Lake Storage:
- Delta Lake
- Iceberg
NOTE: While setting up your destination, you can choose the table format you want us to use for your destination.
Folder structure
We can sync your data to any destination folder of your choice. If you specify a prefix path, we write your data to the following directory: <root>/<prefix_path>/<schema>/<table>. If you do not specify a prefix path, we create a folder and set the prefix path to fivetran by default. We then write your data to the following directory: <root>/fivetran/<schema>/<table>.
Unity Catalog
You can create external tables in Databricks Unity Catalog for the data stored in the Delta Lake tables of your ADLS destination. You can then query your data from these external tables.
To integrate Unity Catalog with your ADLS destination, do one of the following:
- Configure automatic schema migration of Delta Lake tables in Databricks. You can do this in your destination setup form. Once configured, Fivetran will automatically create and maintain the Delta Lake tables in Databricks. The schema and table names in Databricks are the same as the corresponding names in your ADLS destination. The schema migration does not impact your syncs. If the schema migration fails due to any error, the sync between your data source and destination does not fail, and we display a warning on the Fivetran dashboard. For more information about configuring automatic schema migration, see our setup guide.
- Create the tables manually by following our Unity Catalog setup instructions; see the example after the note below.
NOTE: Databricks uses the table definition to understand the structure of the data. It stores the metadata of the tables in the metastore and allows us to interact with them like regular tables within Databricks by accessing the data in its original location.
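If you create the external tables manually, the table definition typically points Unity Catalog at the Delta Lake table's location in your ADLS container. The snippet below is an illustrative sketch from a Databricks notebook, not the exact statements from our setup instructions: it assumes a Unity Catalog external location and storage credential already cover the ADLS path, and the catalog, schema, table, and path names are placeholders.

```python
# Illustrative sketch: registering one Fivetran-written Delta Lake table as an
# external table in Unity Catalog. In a Databricks notebook, `spark` is the
# preconfigured SparkSession. The catalog, schema, table, and ADLS path below
# are placeholders.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_catalog.my_schema.my_table
    USING DELTA
    LOCATION 'abfss://<container>@<account>.dfs.core.windows.net/<prefix_path>/my_schema/my_table'
""")

# The external table can then be queried like any other Unity Catalog table.
spark.sql("SELECT * FROM my_catalog.my_schema.my_table LIMIT 10").show()
```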
Table maintenance operations
We regularly perform maintenance operations on your destination tables to maintain an efficient data storage environment. The maintenance operations vary based on the format of the table.
NOTE: You may observe a sync delay for your connectors while the table maintenance operations are in progress. To ensure a seamless experience with minimal sync delays, we perform the table maintenance operations only on Saturdays.
Maintenance operations for Delta Lake tables
We perform the following maintenance operations on the Delta Lake tables in your destination:
- Delete old snapshots: We delete the table snapshots that are older than the Snapshot Retention Period you specify in the destination setup form. However, we always retain the last 4 checkpoints of a table before deleting its snapshots.
- Delete orphan and removed files: Orphan files are created because of unsuccessful operations within your data pipeline. The orphan files are stored in your ADLS container but are no longer referenced in the Delta Lake table metadata. Removed files are files that are not referenced in the latest table snapshots but were referenced in older snapshots. These orphan and removed files contribute to your ADLS subscription costs. We identify such files that are older than 7 days and delete them at regular intervals of 2 weeks.
Maintenance operations for Iceberg tables
We perform the following maintenance operations on the Iceberg tables in your destination:
- Delete old snapshots: We delete the table snapshots that are older than the Snapshot Retention Period you specify in the destination setup form. We also delete the data files that were referenced only by the deleted snapshots and are not referenced by any active snapshot.
- Delete previous versions of metadata files: In addition to the current version, we retain the 3 most recent previous versions of the metadata files and delete all older versions.
- Delete orphan files: Orphan files are created because of unsuccessful operations within your data pipeline. The orphan files are stored in your ADLS container but are no longer referenced in the Iceberg table metadata. These files contribute to your ADLS subscription costs. We identify these orphan files and delete them at regular intervals of 2 weeks.
IMPORTANT:
- To track the changes made to the Iceberg tables, we create a sequence_number.txt file in each table's metadata folder. You must never delete these files from your destination.
- To guarantee that every table is queryable, it is best practice to not delete any metadata files. Deletions can lead to corruption of the Iceberg tables.
Column statistics
We update two column-level statistics, minimum value and maximum value, for your destination tables.
Column statistics for Delta Lake tables
We update the statistics only for the primary keys.
Column statistics for Iceberg tables
Depending on the number of columns in the table, we update the statistics as follows:
- If the table contains 200 or fewer columns, we update the statistics for all the columns.
- If the table contains more than 200 columns, we update the statistics only for the primary keys.
Limitations
- Fivetran does not support position deletes for Iceberg tables. To avoid errors, we recommend that you do not run any query that generates position deletes.
- Fivetran creates DECIMAL columns with the maximum precision and scale (38, 10).
- Spark SQL pool queries cannot read the maximum values of the DOUBLE and FLOAT data types.
- Spark SQL pool queries truncate TIMESTAMP values to seconds. To query a table using a TIMESTAMP column, you can use the from_unixtime(unix_timestamp(<col_name>, 'yyyy-MM-dd HH:mm:ss.SSS'),'yyyy-MM-dd HH:mm:ss.ms') clause in your queries to get the accurate values, including milliseconds and microseconds, as shown in the example after this list.
- Fivetran does not support the Change Data Feed feature for Delta Lake tables. You must not enable Change Data Feed for the Delta Lake tables that Fivetran creates in your ADLS destination.
Unity Catalog limitations
- To query any table using an XML column, you can use a regex.
- To filter timestamp values, you can use the string(<col_name>) clause to get the accurate values.
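The snippet below sketches both workarounds from a Databricks notebook; the catalog, schema, table, and column names, the regex pattern, and the timestamp literal are all placeholders for illustration.

```python
# Illustrative sketch of the Unity Catalog query workarounds described above.
# `spark` is the notebook's SparkSession; all names, the regex, and the
# timestamp literal are placeholders.

# Extract a value from an XML column (stored as STRING) with a regex.
xml_values = spark.sql("""
    SELECT regexp_extract(xml_col, '<status>(.*?)</status>', 1) AS status
    FROM my_catalog.my_schema.my_table
""")
xml_values.show(truncate=False)

# Filter timestamp values by comparing their string representation.
recent_rows = spark.sql("""
    SELECT *
    FROM my_catalog.my_schema.my_table
    WHERE string(updated_at) > '2024-01-01 00:00:00.000'
""")
recent_rows.show(truncate=False)
```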