Databricks (Beta)
The Databricks Lakehouse Platform combines the key features of data lakes and data warehouses. The platform is built on open source and open standards.
The Fivetran Databricks connector allows you to sync tables from your Databricks catalog to your destination. We sync all the schemas (databases) and tables from within your Databricks catalog.
NOTE: Fivetran supports Databricks as both a database connector and a destination.
TIP: If you want to sync multiple Databricks catalogs, we recommend that you create a connector for each catalog.
Features
Feature Name | Supported | Notes |
---|---|---|
Capture deletes | ✓ | |
History mode | ✓ | |
Custom data | ✓ | |
Data blocking | ✓ | |
Column hashing | ✓ | |
Re-sync | ✓ | |
API configurable | ✓ | API configuration |
Priority-first sync | | |
Fivetran data models | | |
Private networking | | |
Authorization via API | | |
Setup guide
Follow our step-by-step Databricks setup guide to connect Databricks to Fivetran.
Sync overview
Once Fivetran is connected to your Databricks account, we pull a full dump of all selected data from your catalog. We then use Databricks' change data feed to pull your new and changed data at regular intervals. If data in your source catalog changes (for example, you add a new table), Fivetran automatically detects and persists these changes into your destination.
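Change data feed is controlled by the `delta.enableChangeDataFeed` table property on each Delta table. The sketch below is a minimal, illustrative example of enabling it with PySpark; the table name `main.sales.orders` is a placeholder, and this is not a required Fivetran setup step.

```python
# Minimal sketch: enabling the change data feed on a Delta table with PySpark.
# The table name main.sales.orders is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    ALTER TABLE main.sales.orders
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")
```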
Limitations
The Databricks connector only supports tables that are stored in Delta format. Learn more in Databricks' Delta Tables documentation.
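If you are unsure whether a table is stored in Delta format, the following sketch shows one way to check from a Databricks notebook; the table name is a placeholder.

```python
# Minimal sketch: checking a table's storage format with DESCRIBE DETAIL.
# The table name main.sales.orders is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

detail = spark.sql("DESCRIBE DETAIL main.sales.orders")
print(detail.select("format").first()[0])  # prints "delta" for Delta tables
```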
Schema information
Fivetran tries to replicate the exact schema and tables from your Databricks catalog to your destination.
We name each destination schema by appending the source schema (database) name to the connector name you provided in the connector setup form. For example, if the connector name is `databricks`, the schema (database) name is `schema`, and the table name is `table`, the destination table name is `databricks_schema.table`.
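As an illustration of this naming convention, the short sketch below builds the destination table name from placeholder connector, schema, and table names:

```python
# Toy illustration of the destination naming convention described above.
connector_name = "databricks"   # connector name from the setup form
source_schema = "schema"        # source schema (database) name
source_table = "table"          # source table name

destination_table = f"{connector_name}_{source_schema}.{source_table}"
print(destination_table)  # databricks_schema.table
```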
Type transformations and mapping
As we extract your data, we match Databricks data types to types that Fivetran supports. If we don't support a data type, we automatically change that type to the closest supported type or, in some cases, don't load that data at all. Our system automatically skips columns with data types that we don't accept or transform.
The following table illustrates how we transform your Databricks data types into Fivetran-supported types:
Databricks Type | Fivetran Type | Supported |
---|---|---|
ARRAY | STRING | Yes |
BIGINT | LONG | Yes |
BINARY | BINARY | Yes |
BOOLEAN | BOOLEAN | Yes |
DATE | LOCALDATE | Yes |
DOUBLE | DOUBLE | Yes |
DECIMAL | BIGDECIMAL | Yes |
FLOAT | BIGDECIMAL | Yes |
INT | INTEGER | Yes |
INTERVAL | N/A | No |
MAP | JSON | Yes |
SMALLINT | SHORT | Yes |
STRING | STRING | Yes |
STRUCT | JSON | Yes |
VARCHAR | STRING | Yes |
VOID | N/A | No |
TIMESTAMP | INSTANT | Yes |
TIMESTAMP_NTZ | INSTANT | Yes |
TINYINT | SHORT | Yes |
Fivetran does not support the VOID and INTERVAL data types because Databricks Delta Lake does not support them.
Updating data
Fivetran performs incremental updates of any new or modified data from your source catalog. We read each table's change data feed and use it to update the corresponding data in your destination.
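For readers who want to see what a change data feed read looks like in Databricks itself, here is a minimal PySpark sketch; the table name and starting version are placeholders, and this illustrates the Delta feature rather than Fivetran's internal implementation.

```python
# Minimal sketch: reading row-level changes from a Delta table's change data feed.
# Table name and starting version are placeholders; change data feed must be enabled.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")  # read changes instead of the current table state
    .option("startingVersion", 5)      # earliest table version to read changes from
    .table("main.sales.orders")
)

# Each row includes _change_type (insert, delete, update_preimage, update_postimage)
# plus _commit_version and _commit_timestamp metadata columns.
changes.show()
```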
Table updates
Databricks does not enforce primary keys on tables, so we designate the Fivetran-generated `_fivetran_id` column as the primary key.
We merge the following changes into the corresponding tables in your destination:
- An INSERT in the source table generates a new row in the destination with `_fivetran_deleted = FALSE`.
- A DELETE in the source table updates the corresponding row in the destination with `_fivetran_deleted = TRUE`.
- An UPDATE in the source table updates the existing row in the destination with `_fivetran_deleted = TRUE` and generates a new row in the destination with `_fivetran_deleted = FALSE`.
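Because deletes and updates are soft-deleted rather than removed, a common pattern is to filter on `_fivetran_deleted` when querying the destination. The sketch below assumes a destination that is queryable through Spark SQL and uses placeholder schema and table names; with other destinations, the equivalent SQL filter applies.

```python
# Minimal sketch: selecting only the live (non-deleted) rows from a synced table.
# Schema and table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

live_rows = spark.sql("""
    SELECT *
    FROM databricks_schema.orders
    WHERE _fivetran_deleted = FALSE
""")
live_rows.show()
```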