Azure Cosmos DB
Azure Cosmos DB is Microsoft's fully-managed NoSQL database. It is serverless and designed for high-performance applications.
Supported services
Fivetran supports the following Azure Cosmos DB database services:
Supported configurations
Fivetran supports the following Azure Cosmos DB configurations:
Supportability Category | Supported Values |
---|---|
Connector limit per database | Depends on how many RUs you have provisioned. Each connector can consume up to 2,000 RUs. |
Transport Layer Security (TLS) | TLS 1.1 - 1.3 |
Limitations
- We support Azure Cosmos DB with the native MongoDB API and NoSQL API. We do not support Cassandra API, Gremlin API, or Table API instances. Contact Fivetran Support if you would like us to integrate with these other API instances.
- We utilize Azure Cosmos DB's change feed to sync your data. The change feed includes INSERT and UPDATE operations made to items within the container, but delete operations are not captured by the change feed. To capture deleted source data, we suggest that you add a soft-delete flag within your documents. Alternatively, if you are interested in tracking deleted data through Fivetran Teleport Sync, refer to our official documentation.
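The soft-delete suggestion above can be sketched as follows. This is a minimal illustration, not a required schema: the field names `is_deleted` and `deleted_at` and the helper `soft_delete` are hypothetical, and the step that writes the item back to the container is left out.

```python
from datetime import datetime, timezone


def soft_delete(item: dict, flag: str = "is_deleted") -> dict:
    """Return a copy of `item` marked as deleted instead of removing it.

    `flag` and `deleted_at` are hypothetical field names chosen for this sketch.
    """
    marked = dict(item)
    marked[flag] = True
    marked["deleted_at"] = datetime.now(timezone.utc).isoformat()
    return marked


item = {"id": "42", "customer": "acme", "is_deleted": False}
updated = soft_delete(item)
# Writing `updated` back to the container is an UPDATE, so it appears in the
# change feed, whereas removing the item outright would not.
print(updated["is_deleted"])
```

Downstream queries can then filter on the flag to exclude logically deleted rows.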
Features
Azure Cosmos DB for MongoDB
Feature Name | Supported | Notes |
---|---|---|
Capture deletes | ||
History mode | check | Selectable for all tables |
Custom data | check | All collections and fields |
Data blocking | check | Data blocking for databases and containers is supported for all Cosmos DB connectors. Partial data blocking is supported for connectors created after July 17, 2023 only |
Column hashing | check | Connectors created after July 17, 2023 |
Re-sync | check | Collection level |
API configurable | check | API configuration |
Priority-first sync | ||
Fivetran data models | ||
Private networking | check | Azure Private Link |
Azure Cosmos DB for NoSQL
Feature Name | Supported | Notes |
---|---|---|
Capture deletes | check | Enabled upon request |
History mode | check | Selectable for all tables |
Custom data | check | All containers and fields |
Data blocking | check | Data blocking for databases and containers is supported for all Azure Cosmos DB connectors. Partial data blocking is supported for connectors created after July 17, 2023 only |
Column hashing | check | Connectors created after July 17, 2023 |
Re-sync | check | Container level |
API configurable | check | API configuration |
Priority-first sync | ||
Fivetran data models | ||
Private networking | check | Azure Private Link |
Setup guide
This overview will give you a general idea of the kind of work needed to set up an Azure Cosmos DB connector. For specific instructions on how to set up your database, see the guide for your Azure Cosmos DB database type:
Sync overview
Once Fivetran is connected to your Azure Cosmos DB resource, we pull a full dump of all selected data from your database. The initial sync finishes when all containers that existed when the sync started have finished importing. Once the initial sync is complete, we use each container's change feed to pull all your new and changed data at regular intervals.
Data access methods
Azure Cosmos DB for MongoDB
Fivetran uses a username and password to access your Azure Cosmos DB for MongoDB source database.
Azure Cosmos DB for NoSQL
We use one of the following methods to access your Azure Cosmos DB for NoSQL source data:
TIP: To learn more about data access control in Azure Cosmos DB, see Microsoft's Secure access to data in Azure Cosmos DB documentation.
Account key (recommended)
Fivetran uses an account key to authenticate to the source database. Primary/secondary keys provide access to all administrative resources for the database account.
Choose this method if you want Fivetran to automatically detect all readable databases and containers.
Resource token
Fivetran uses a resource token to access the source database. Resource tokens provide access to specific Azure Cosmos DB resources within a database. For this method, you must provide the source database name, the container name, and the matching resource token.
Choose this method if you want Fivetran to only access specific Azure Cosmos DB resources within your database.
Pack mode options
Pack modes determine the form in which Fivetran delivers your data. To sync your data in Fivetran, you must select a pack mode. There are two pack modes, packed and unpacked.
NOTE: In the tables below, the text in parentheses next to the column name indicates the data type of that column. For example, "foo (INTEGER)" means the column name is foo and it stores INTEGER data.
Unpacked mode
Fivetran unpacks one layer of nested fields and infers data types. For example, the following source data:
{
"_id": 1, <== key
"foo": 2,
"nested": {
"baz": 3
}
}
is delivered to your destination as follows for Azure Cosmos DB for MongoDB:
_id (INTEGER) | foo (INTEGER) | nested (JSON) |
---|---|---|
1 | 2 | {"baz":3} |
For Azure Cosmos DB for NoSQL, it is delivered to your destination as follows:
_fivetran_id (STRING) | _id (INTEGER) | foo (INTEGER) | nested (JSON) |
---|---|---|---|
356a192b7913b04c54574d18c28d46e6395428ab | 1 | 2 | {"baz":3} |
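As a rough sketch of what unpacking one layer means, the snippet below turns the example document into a flat row. The helpers `infer_type` and `unpack_one_level` are hypothetical, and the type inference is deliberately simplified; it is not Fivetran's actual inference logic.

```python
import json


def infer_type(value) -> str:
    # bool is checked before int because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, str):
        return "STRING"
    return "JSON"  # objects, arrays, and anything else stay as JSON


def unpack_one_level(doc: dict) -> dict:
    """Map each top-level field to its own column; nested values stay as JSON text."""
    row = {}
    for key, value in doc.items():
        col_type = infer_type(value)
        if col_type == "JSON":
            value = json.dumps(value, separators=(",", ":"))
        row[f"{key} ({col_type})"] = value
    return row


source = {"_id": 1, "foo": 2, "nested": {"baz": 3}}
row = unpack_one_level(source)
print(row)
```

Only the first level is flattened: `nested` keeps its inner structure as a JSON string rather than becoming a `baz` column.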
Packed mode (default)
In packed mode, the following source data:
{
"_id": 1, <== key
"foo": 2,
"nested": {
"baz": 3
}
}
is delivered to your destination as follows for Azure Cosmos DB for MongoDB:
_id (INTEGER) | data (JSON) |
---|---|
1 | {"_id":1, "foo":2, "nested":{"baz":3}} |
For Azure Cosmos DB for NoSQL, it is delivered to your destination as follows:
_fivetran_id (STRING) | data (JSON) |
---|---|
356a192b7913b04c54574d18c28d46e6395428ab | {"_id":1, "foo":2, "nested":{"baz":3}} |
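For comparison with unpacked mode, packed mode can be pictured as keeping only the key column and storing the whole document in a single JSON column. The helper `pack` below is a hypothetical illustration that assumes the MongoDB-style `_id` key; for NoSQL the key column would be `_fivetran_id` instead.

```python
import json


def pack(doc: dict, key_field: str = "_id") -> dict:
    """Return a two-column row: the key and the full document as JSON text."""
    return {
        key_field: doc[key_field],
        "data": json.dumps(doc, separators=(",", ":")),
    }


source = {"_id": 1, "foo": 2, "nested": {"baz": 3}}
packed_row = pack(source)
print(packed_row["data"])
```

Because the document is stored whole, fields are read back in the destination with JSON functions rather than as separate columns.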
Switching pack modes
You can switch pack modes for a table at any time in your Fivetran dashboard.
IMPORTANT: We automatically perform a full connector re-sync when you change pack modes.
To change the pack mode for a connector, do the following:
- In the connector dashboard, go to the Setup tab.
- Click Edit connection details.
- In the connector setup form, change the Pack Mode.
- Click Save & Test.
History mode (Private Preview)
History mode is a sync mode that tracks the history of the changes in your source data. We leverage Azure Cosmos DB's all versions and deletes change feed mode to capture all intermediate changes. This change feed mode must be enabled in order for your connector to use history mode. The all versions and deletes change feed mode is currently in the preview phase, and it is only compatible with Azure Cosmos DB for NoSQL accounts.
TIP: To sign up for the all versions and deletes change mode preview, follow Microsoft's Get Started with Change feed modes in Azure Cosmos DB instructions.
Once you've enabled this feature along with continuous backups on your Azure Cosmos DB account, reach out to our Support Team to enable history mode for your existing connectors.
Replication speeds
Two major factors can cause disparities between our estimated replication speed and the actual replication speed for your Fivetran-connected databases: network latency and the number of request units (RUs) you have provisioned for your Azure Cosmos DB resource. Make sure your monitored container is not experiencing throttling; otherwise, you will experience delays when syncing the change feed.
Azure Cosmos DB for MongoDB
We extract data sequentially from Azure Cosmos DB for MongoDB.
Azure Cosmos DB for NoSQL
We recommend that you provision at least 10,000 RUs for each container, though the actual number may vary depending on your Cosmos usage. We scale up the number of parallel processing threads for data extraction proportionally to the number of RUs available. Each thread can achieve up to 2.5 MB/s in data extraction speed, so more parallel threads allow for faster syncs.
Container Throughput (RU/s) | Parallel Threads | Max extraction rate (MB/s) |
---|---|---|
400 - 9,999 | 1 | 2.5 |
10,000 - 19,999 | 2 | 5.0 |
20,000 - 29,999 | 3 | 7.5 |
30,000 - 39,999 | 4 | 10 |
40,000 - 49,999 | 5 | 12.5 |
50,000 - 59,999 | 6 | 15 |
60,000 - 69,999 | 7 | 17.5 |
70,000 - 79,999 | 8 | 20 |
80,000 - 89,999 | 9 | 22.5 |
90,000 and higher | 10 | 25.0 |
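The table follows a simple pattern: one thread, plus one more for every full 10,000 RU/s, capped at 10 threads, with each thread contributing up to 2.5 MB/s. The sketch below reproduces the table rows under that reading; the function and constant names are hypothetical, and the result is an upper bound rather than a guaranteed rate.

```python
MB_PER_SECOND_PER_THREAD = 2.5
MAX_THREADS = 10
RUS_PER_ADDITIONAL_THREAD = 10_000
MIN_TABLE_RUS = 400  # lowest throughput listed in the table


def parallel_threads(container_rus: int) -> int:
    """Number of extraction threads for a container's provisioned RU/s."""
    if container_rus < MIN_TABLE_RUS:
        raise ValueError("the table starts at 400 RU/s")
    return min(MAX_THREADS, container_rus // RUS_PER_ADDITIONAL_THREAD + 1)


def max_extraction_rate(container_rus: int) -> float:
    """Upper bound on extraction speed in MB/s."""
    return parallel_threads(container_rus) * MB_PER_SECOND_PER_THREAD


for rus in (400, 10_000, 35_000, 90_000):
    print(rus, parallel_threads(rus), max_extraction_rate(rus))
```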
The ability to sync changes quickly also depends on the sync frequency you configure. The risk of the sync falling behind, or being unable to keep up with data changes, decreases as the sync frequency increases. We recommend a higher sync frequency for data sources with a high rate of data changes.
Schema information
Fivetran tries to replicate the exact databases and containers from your Azure Cosmos DB resource to your destination according to our standard database update strategies. For every schema in the Azure Cosmos DB container that you connect, we create a schema in your destination that maps directly to its native schema. This ensures that the data in your destination is in a familiar format to work with.
Fivetran-generated columns
Fivetran adds the following columns to every table in your destination:
- _fivetran_id (STRING) is a one-way hashed value that uniquely identifies each row. It is generated from the id and optional partition key value of each Azure Cosmos DB item.
- _fivetran_deleted (BOOLEAN) marks rows that were deleted in the source collection.
- _fivetran_synced (UTC TIMESTAMP) indicates the time when Fivetran last successfully synced the row.
We add these columns to give you insight into the state of your data and the progress of your data syncs. For more information about these columns, see our System Columns and Tables documentation.
NOTE: _fivetran_id is not applicable to Azure Cosmos DB for MongoDB.
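To make the idea of a one-way hashed identifier concrete, here is a minimal sketch. The hash function (SHA-1), the `|` separator, and the helper name `row_identifier` are all assumptions made for illustration; this page does not specify how the id and partition key are actually combined or hashed.

```python
import hashlib
from typing import Optional


def row_identifier(item_id: str, partition_key: Optional[str] = None) -> str:
    """Derive a stable hex digest from an item's id and optional partition key.

    Assumption: SHA-1 over the id, joined with the partition key when present.
    The real algorithm and input layout are not documented here.
    """
    parts = [item_id] if partition_key is None else [item_id, partition_key]
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


same_a = row_identifier("order-1", "eu-west")
same_b = row_identifier("order-1", "eu-west")
other = row_identifier("order-1", "us-east")
print(same_a == same_b, same_a == other)
```

The useful properties are that the same id and partition key always yield the same value, and that the original values cannot be read back from the digest.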
Type transformations and mapping
As we extract your data, we match Azure Cosmos DB document-based data types to types that Fivetran supports. Fivetran supports all Azure Cosmos DB CORE data types.
The following table illustrates how we transform your Azure Cosmos DB data types into Fivetran-supported types:
Azure Cosmos DB Type | Fivetran Type | Fivetran Supported |
---|---|---|
BOOLEAN | BOOLEAN | True |
TEXT | STRING | True |
INTEGER | INT | True |
LONG | LONG | True |
SHORT | SHORT | True |
DOUBLE | DOUBLE | True |
FLOAT | FLOAT | True |
OBJECT | JSON | True |
ARRAY | JSON | True |
NOTE: We do not support OBJECT as a primary key (_id) type for Azure Cosmos DB for MongoDB.
If we are missing an important data type that you need, reach out to support.
In some cases, when loading data into your destination, we may need to convert Fivetran data types into data types that are supported by the destination. For more information, see the individual data destination pages.
Nested data in unpacked mode
If the first-level field is a simple data type, we map it to its own type. If it's a complex data type such as an array or JSON data, we map it to a JSON type without unpacking. We do not automatically unpack nested JSON objects to separate tables in the destination. Any nested JSON objects are preserved as is in the destination so that you can use JSON processing functions.
For example, the following JSON...
{"street" : "Main St.",
"city" : "New York",
"country" : "US",
"phone" : "(555) 123-5555",
"zip code" : 12345,
"people" : ["John", "Jane", "Adam"],
"car" : {"make" : "Honda",
"year" : 2014,
"type" : "AWD"}
}
...is converted to the following table when we load it into your destination:
_id | street | city | country | phone | zip code | people | car |
---|---|---|---|---|---|---|---|
1 | Main St. | New York | US | (555) 123-5555 | 12345 | ["John", "Jane", "Adam"] | {"make" : "Honda", "year" : 2014, "type" : "AWD"} |
Excluding source data
If you don’t want to sync all the data from your source database, you can exclude databases, containers, or partial data from your syncs on your Fivetran dashboard. To do so, go to your connector details page and uncheck the objects you would like to omit in subsequent syncs. For more information, see our Data Blocking and Column Hashing documentation.
Initial sync
When Fivetran connects to a new Azure Cosmos DB resource, we first copy all data from every container in every database (except for those you have excluded in your Fivetran dashboard) and add Fivetran-generated columns. We copy data by performing a read on the container change feed from its beginning.
NOTE: We mark the progress at a regular interval throughout the initial sync. In case of sync stoppage or failure, we will pick up from the last successful point of data replication and continue importing in the next sync.
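The resume behavior described in the note can be sketched as a loop that saves its position periodically and restarts from the last saved position. Everything here is hypothetical (the `run_initial_sync` helper, the in-memory `checkpoint` dict, the checkpoint interval of two items, and the simulated failure); it only illustrates the general pattern.

```python
def run_initial_sync(items, checkpoint, destination, checkpoint_every=2, fail_at=None):
    """Copy items into `destination`, starting from the last saved position."""
    start = checkpoint.get("position", 0)
    for index in range(start, len(items)):
        if index == fail_at:
            raise RuntimeError("simulated sync failure")
        item = items[index]
        destination[item["id"]] = item  # keyed write, so replaying an item is harmless
        if (index + 1) % checkpoint_every == 0:
            checkpoint["position"] = index + 1  # mark progress at a regular interval
    checkpoint["position"] = len(items)


items = [{"id": str(n)} for n in range(5)]
checkpoint, destination = {}, {}
try:
    run_initial_sync(items, checkpoint, destination, fail_at=3)
except RuntimeError:
    pass  # the next sync continues from the saved position
resume_from = checkpoint["position"]
run_initial_sync(items, checkpoint, destination)
print(resume_from, len(destination))
```

Items processed after the last checkpoint but before the failure are read again on resume, which is why the sketch writes by key instead of appending.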
Updating data
Fivetran performs incremental updates of any new or modified data from your source database. We use Azure Cosmos DB's change feed to detect changes to the selected containers.
Fivetran uses Azure Cosmos DB's built-in id field, along with the partition key value that may be present in each container, to uniquely identify rows. This unique identifier is stored in the destination as a new column, _fivetran_id. Once we identify updated records, we merge the changes to your documents into the corresponding tables in your destination using the identifier:
- Every inserted row in the source generates a new row in the destination with _fivetran_deleted = FALSE.
- Every updated row in the source updates the data in the corresponding row in the destination, with _fivetran_deleted = FALSE.
NOTE: _fivetran_id is not applicable to Azure Cosmos DB for MongoDB. The _id field uniquely identifies rows.
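The merge rules for inserts and updates amount to an upsert keyed on the row identifier. The sketch below models a destination table as a dictionary; `merge_changes` and the sample rows are hypothetical and stand in for whatever merge statement the destination actually runs.

```python
from datetime import datetime, timezone


def merge_changes(destination: dict, changes: list, key_field: str = "_fivetran_id") -> None:
    """Upsert change-feed rows into `destination`, keyed by the identifier column."""
    synced_at = datetime.now(timezone.utc)
    for change in changes:
        row = dict(change)
        row["_fivetran_deleted"] = False
        row["_fivetran_synced"] = synced_at
        destination[row[key_field]] = row  # new key inserts, existing key overwrites


table = {}
merge_changes(table, [{"_fivetran_id": "a1", "foo": 1}])
merge_changes(table, [{"_fivetran_id": "a1", "foo": 2}, {"_fivetran_id": "b2", "foo": 9}])
print(len(table), table["a1"]["foo"])
```

For Azure Cosmos DB for MongoDB, the same pattern applies with `key_field="_id"`.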
Deleted data
Azure Cosmos DB for MongoDB
We cannot track deleted data in Azure Cosmos DB for MongoDB.
Azure Cosmos DB for NoSQL
Azure Cosmos DB's change feed does not log deletes. To keep track of deleted data, we use Fivetran Teleport Sync to identify deleted records and apply the changes to the destination tables.
Fivetran Teleport Sync
Fivetran Teleport Sync is a proprietary incremental sync method that can incrementally replicate your database with no additional setup other than a read-only connection.
Fivetran Teleport Sync performs the following operations:
- Does a full table scan of each synced table for the id and partition key
- Aggregates a compressed table (container) snapshot in the application memory
- Compares the aggregated snapshot to the previous snapshot to deduce the differences
- If there are differences in the snapshots, deletes missing source items in the corresponding destination tables
NOTE: We only perform snapshot comparisons during incremental syncs, as initial syncs cannot have deleted data.
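The snapshot comparison can be pictured as a set difference over (id, partition key) pairs. This is a simplified sketch with hypothetical helper names: it uses plain Python sets rather than a compressed snapshot, and it assumes deletions are recorded by setting _fivetran_deleted on the destination row.

```python
def take_snapshot(items):
    """Collect the (id, partition key) pair of every item seen in a full scan."""
    return {(item["id"], item.get("partition_key")) for item in items}


def find_deleted(previous_snapshot, current_snapshot):
    """Keys present in the previous snapshot but missing from the current one."""
    return previous_snapshot - current_snapshot


def apply_deletes(destination, deleted_keys):
    # Assumption: a delete is recorded by flagging the row, not by removing it.
    for key in deleted_keys:
        if key in destination:
            destination[key]["_fivetran_deleted"] = True


previous = take_snapshot([{"id": "1", "partition_key": "a"}, {"id": "2", "partition_key": "a"}])
current = take_snapshot([{"id": "1", "partition_key": "a"}])
destination = {key: {"_fivetran_deleted": False} for key in previous}

deleted = find_deleted(previous, current)
apply_deletes(destination, deleted)
print(sorted(deleted))
```

On the first sync there is no previous snapshot to compare against, which matches the note that comparisons only happen during incremental syncs.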