Cosmos DB
Updated November 16, 2023
Azure Cosmos DB is Microsoft's fully-managed NoSQL database. It is serverless and designed for high-performance applications.
Supported services
Fivetran supports the following Azure Cosmos DB database services:
Supported configurations
Fivetran supports the following Cosmos DB configurations:
Supportability Category | Supported Values |
---|---|
Connector limit per database | 3 |
Transport Layer Security (TLS) | TLS 1.1 - 1.3 |
Limitations
- We support Cosmos DB with the native NoSQL API and the MongoDB API. We do not support Cassandra API, Gremlin API, or Table API instances. Please contact Fivetran Support if you would like us to integrate with these other API instances.
- We use the Cosmos DB change feed to sync your data. The change feed includes INSERT and UPDATE operations made to items within the container, but it does not capture delete operations. To capture deletes, you can use a soft-delete flag within your documents. Alternatively, if you are interested in tracking deleted data through Fivetran Teleport Sync, refer to our official documentation.
Features
Feature Name | Supported | Notes |
---|---|---|
Capture deletes | check | Cosmos DB: Enabled upon request. Azure Cosmos DB for MongoDB: Not supported. |
Custom data | | |
Data blocking | check | Data blocking for databases and containers is supported for all Cosmos DB connectors. Partial data blocking is supported for connectors created after July 17, 2023 only |
Column hashing | check | Connectors created after July 17, 2023 |
Re-sync | check | Container level |
History | check | Supports history mode. |
API configurable | check | |
Priority-first sync | | |
Fivetran data models | | |
Private networking | check | Azure Private Link |
Setup guide
This overview will give you a general idea of the kind of work needed to set up a Cosmos DB connector. For specific instructions on how to set up your database, see the guide for your Cosmos DB database type:
Sync overview
Once Fivetran is connected to your Cosmos DB resource, we pull a full dump of all selected data from your database. The initial sync finishes when all containers that existed when the sync started have finished importing. Once the initial sync is complete, we use each container's change feed to pull all your new and changed data at regular intervals.
Data access methods
We use one of the following methods to access your Cosmos DB source data:
TIP: To learn more about data access control in Cosmos DB, see Microsoft's Secure access to data in Azure Cosmos DB documentation.
Account key (recommended)
Fivetran uses an account key to authenticate to the source database. Primary/secondary keys provide access to all administrative resources for the database account.
Choose this method if you want Fivetran to automatically detect all readable databases and containers.
Resource token
Fivetran uses a resource token to access the source database. Resource tokens provide access to specific Cosmos DB resources within a database. For this method, you must provide the source database name, the container name, and the matching resource token.
Choose this method if you want Fivetran to only access specific Cosmos DB resources within your database.
Pack mode options
Pack modes determine the form in which Fivetran delivers your data. To sync your data in Fivetran, you must select a pack mode. There are two pack modes: packed and unpacked.
Unpacked mode
Fivetran unpacks one layer of nested fields and infers data types. For example, the following source data:
{
  "_id": 1, <== key
  "foo": 2,
  "nested": {
    "baz": 3
  }
}
For Cosmos DB, this data is delivered to your destination as follows:
_fivetran_id STRING | _id INTEGER | foo INTEGER | nested JSON |
---|---|---|---|
356a192b7913b04c54574d18c28d46e6395428ab | 1 | 2 | {"baz":3} |
For Azure Cosmos DB for MongoDB, this data is delivered to your destination as follows:
_id INTEGER | foo INTEGER | nested JSON |
---|---|---|
1 | 2 | {"baz":3} |
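As a rough illustration of the unpacked layout, the following Python sketch flattens one layer of a document: top-level scalar fields become columns, and nested objects or arrays are kept as JSON text. The function name `unpack_one_layer` is hypothetical and is not part of any Fivetran API; it only mirrors the tables above.

```python
import json


def unpack_one_layer(document):
    """Flatten one layer: scalars stay as-is, nested values become JSON text."""
    row = {}
    for key, value in document.items():
        if isinstance(value, (dict, list)):
            # Nested objects and arrays are not unpacked further.
            row[key] = json.dumps(value, separators=(",", ":"))
        else:
            row[key] = value
    return row


source = {"_id": 1, "foo": 2, "nested": {"baz": 3}}
row = unpack_one_layer(source)
print(row)
```

The `nested` column holds the JSON text `{"baz":3}`, matching the `nested JSON` column in the tables above.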
Packed mode (default)
Example:
In packed mode, the following source data:
{
  "_id": 1, <== key
  "foo": 2,
  "nested": {
    "baz": 3
  }
}
For Cosmos DB, this data is delivered to your destination as follows:
_fivetran_id STRING | data JSON |
---|---|
356a192b7913b04c54574d18c28d46e6395428ab | {"_id":1, "foo":2, "nested":{"baz":3}} |
For Azure Cosmos DB for MongoDB, this data is delivered to your destination as follows:
_id INTEGER | data JSON |
---|---|
1 | {"_id":1, "foo":2, "nested":{"baz":3}} |
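The packed layout can be sketched in Python in the same spirit: the key is kept in its own column and the whole document is stored as JSON text in a single `data` column. The helper `pack_document` is a hypothetical name used only for illustration, and the sketch follows the Azure Cosmos DB for MongoDB table (keyed by `_id`).

```python
import json


def pack_document(document, key_field="_id"):
    """Keep the key as a column and store the full document as JSON text."""
    return {
        key_field: document[key_field],
        "data": json.dumps(document, separators=(",", ":")),
    }


source = {"_id": 1, "foo": 2, "nested": {"baz": 3}}
packed = pack_document(source)
print(packed)
```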
Switching pack modes
You can switch pack modes for a table at any time in your Fivetran dashboard.
IMPORTANT: We automatically perform a full connector re-sync when you change pack modes.
To change the pack mode for a connector, do the following:
- In the connector dashboard, go to the Setup tab.
- Click Edit connection details.
- In the connector setup form, change the Pack Mode.
- Click Save & Test.
History mode (Private Preview)
History mode is a sync mode that tracks the history of the changes in your source data. We leverage Cosmos DB's all versions and deletes change feed mode to capture all intermediate changes. This change feed mode must be enabled in order for your connector to use history mode. The all versions and deletes change feed mode is currently in the preview phase, and it is only compatible with Azure Cosmos DB for NoSQL accounts.
TIP: To sign up for the all versions and deletes change feed mode preview, follow Microsoft's Get Started with Change feed modes in Azure Cosmos DB instructions.
Once you've enabled this feature along with continuous backups on your Cosmos DB account, reach out to our Support Team to enable history mode for your existing connectors.
Replication speeds
Two major factors can cause disparities between our estimated replication speed and the actual replication speed for your Fivetran-connected databases: network latency and the number of request units (RUs) you have provisioned for your Cosmos DB resource. Make sure your monitored container is not being throttled; otherwise, you will experience delays when syncing the change feed.
We recommend that you provision at least 10,000 RUs for each container, though the actual number may vary depending on your Cosmos DB usage. We scale up the number of parallel processing threads for data extraction in proportion to the number of RUs available. Each thread can achieve up to 2.5 MB/s in data extraction speed, so more parallel threads allow for faster syncs.
Container Throughput (RU/s) | Parallel Threads | Max extraction rate (MB/s) |
---|---|---|
400 - 9,999 | 1 | 2.5 |
10,000 - 19,999 | 2 | 5.0 |
20,000 - 29,999 | 3 | 7.5 |
30,000 - 39,999 | 4 | 10 |
40,000 - 49,999 | 5 | 12.5 |
50,000 - 59,999 | 6 | 15 |
60,000 - 69,999 | 7 | 17.5 |
70,000 - 79,999 | 8 | 20 |
80,000 - 89,999 | 9 | 22.5 |
90,000 and higher | 10 | 25.0 |
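The table above follows a simple pattern: one thread per started block of 10,000 RU/s, capped at 10 threads, with up to 2.5 MB/s per thread. The Python sketch below reproduces that arithmetic; the function names are hypothetical, and it assumes the throughput is at least the 400 RU/s minimum shown in the table.

```python
MB_PER_THREAD = 2.5  # maximum extraction rate per thread, in MB/s
MAX_THREADS = 10     # the table caps out at 10 threads


def parallel_threads(ru_per_second):
    """Return the thread count the table lists for a container throughput."""
    return min(MAX_THREADS, ru_per_second // 10_000 + 1)


def max_extraction_rate(ru_per_second):
    """Return the maximum extraction rate in MB/s for a container throughput."""
    return parallel_threads(ru_per_second) * MB_PER_THREAD


for ru in (400, 10_000, 25_000, 90_000):
    print(ru, parallel_threads(ru), max_extraction_rate(ru))
```

For example, a container provisioned with 25,000 RU/s falls in the 20,000 - 29,999 row: 3 threads and up to 7.5 MB/s.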
The ability to sync changes quickly also depends on the sync frequency you configure. The risk of the sync falling behind, or being unable to keep up with data changes, decreases as the sync frequency increases. We recommend a higher sync frequency for data sources with a high rate of data changes.
Schema information
Fivetran tries to replicate the exact databases and containers from your Cosmos DB resource to your destination according to our standard database update strategies. For every schema in the Cosmos DB container that you connect, we create a schema in your destination that maps directly to its native schema. This ensures that the data in your destination is in a familiar format to work with.
Fivetran-generated columns
Fivetran adds the following columns to every table in your destination:
- _fivetran_id (STRING) a one-way hashed value that uniquely identifies each row. This is generated from the id and optional partition key value of each Cosmos DB item.
- _fivetran_deleted (BOOLEAN) marks rows that were deleted in the source collection.
- _fivetran_synced (UTC TIMESTAMP) indicates the time when Fivetran last successfully synced the row.
We add these columns to give you insight into the state of your data and the progress of your data syncs.
NOTE: _fivetran_id is not applicable to Azure Cosmos DB for MongoDB.
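To make the idea of a one-way hashed identifier concrete, here is a minimal Python sketch. The exact hashing scheme is not described in this document, so everything here is an assumption: the sketch uses SHA-1 over the string form of the `id`, and joins an optional partition key value with a hypothetical `|` separator. The sample `_fivetran_id` shown in the pack mode examples happens to equal the SHA-1 hex digest of the string "1", which is why SHA-1 was chosen for the illustration.

```python
import hashlib


def fivetran_id_sketch(item_id, partition_key=None):
    """Hash the item id and optional partition key into one identifier.

    Assumption: SHA-1 and the "|" separator are illustrative choices only.
    """
    parts = [str(item_id)]
    if partition_key is not None:
        parts.append(str(partition_key))
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


print(fivetran_id_sketch(1))
print(fivetran_id_sketch(1, "us-east"))
```

Because the hash is one-way, the original `id` and partition key cannot be recovered from the identifier, but the same inputs always produce the same value.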
Type transformations and mapping
As we extract your data, we match Cosmos DB document-based data types to types that Fivetran supports. Fivetran supports all Cosmos DB CORE data types.
The following table illustrates how we transform your Cosmos DB data types into Fivetran-supported types:
Cosmos DB Type | Fivetran Type | Fivetran Supported |
---|---|---|
BOOLEAN | BOOLEAN | True |
TEXT | STRING | True |
INTEGER | INT | True |
LONG | LONG | True |
SHORT | SHORT | True |
DOUBLE | DOUBLE | True |
FLOAT | FLOAT | True |
OBJECT | JSON | True |
ARRAY | JSON | True |
NOTE: We do not support OBJECT as a primary key (_id) type for the Azure Cosmos DB for MongoDB API.
If we are missing an important data type that you need, please reach out to support.
In some cases, when loading data into your destination, we may need to convert Fivetran data types into data types that are supported by the destination. For more information, see the individual data destination pages.
Nested data
If the first-level field is a simple data type, we map it to its own type. If it's a complex data type such as an array or JSON data, we map it to a JSON type without unpacking. We do not automatically unpack nested JSON objects to separate tables in the destination. Any nested JSON objects are preserved as is in the destination so that you can use JSON processing functions.
For example, the following JSON...
{"street" : "Main St.",
 "city" : "New York",
 "country" : "US",
 "phone" : "(555) 123-5555",
 "zip code" : 12345,
 "people" : ["John", "Jane", "Adam"],
 "car" : {"make" : "Honda",
          "year" : 2014,
          "type" : "AWD"}
}
...is converted to the following table when we load it into your destination:
_id | street | city | country | phone | zip code | people | car |
---|---|---|---|---|---|---|---|
1 | Main St. | New York | US | (555) 123-5555 | 12345 | ["John", "Jane", "Adam"] | {"make" : "Honda", "year" : 2014, "type" : "AWD"} |
Excluding source data
If you don’t want to sync all the data from your source database, you can exclude databases, containers, or partial data from your syncs on your Fivetran dashboard. To do so, go to your connector details page and uncheck the objects you would like to omit in subsequent syncs. For more information, see our Column Blocking and Hashing documentation.
Initial sync
When Fivetran connects to a new Cosmos DB resource, we first copy all data from every container in every database (except for those you have excluded in your Fivetran dashboard) and add Fivetran-generated columns. We copy data by performing a read on the container change feed from its beginning.
NOTE: We mark the progress at a regular interval throughout the initial sync. In case of sync stoppage or failure, we will pick up from the last successful point of data replication and continue importing in the next sync.
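The checkpointing behavior in the note above can be sketched as follows. This is a simplified Python model, not Fivetran's implementation: the source is a plain list, the checkpoint is a dictionary holding a position, and `fail_at` is a hypothetical parameter that simulates a sync failure.

```python
def run_initial_sync(items, destination, checkpoint, batch_size=2, fail_at=None):
    """Copy items in batches, saving the position after each batch."""
    position = checkpoint.get("position", 0)
    while position < len(items):
        if fail_at is not None and position >= fail_at:
            raise RuntimeError("simulated sync failure")
        batch = items[position:position + batch_size]
        destination.extend(batch)
        position += len(batch)
        # Record progress so a later run can resume from here.
        checkpoint["position"] = position


items = list(range(10))
destination = []
checkpoint = {}

try:
    run_initial_sync(items, destination, checkpoint, fail_at=4)
except RuntimeError:
    pass  # the first run stops partway through

# The next run resumes from the saved position instead of starting over.
run_initial_sync(items, destination, checkpoint)
print(destination)
```

The second run picks up at the saved position, so no item is copied twice.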
Updating data
Fivetran performs incremental updates of any new or modified data from your source database. We use Cosmos DB's change feed to detect changes to the selected containers.
Fivetran uses Cosmos DB's built-in id field, along with the partition key value that may be present in each container, to uniquely identify rows. This unique identifier is stored in the destination as a new column, _fivetran_id. Once we identify updated records, we merge the changes to your documents into the corresponding tables in your destination using the identifier:
- Every inserted row in the source generates a new row in the destination with _fivetran_deleted = FALSE.
- Every updated row in the source updates the data in the corresponding row in the destination, with _fivetran_deleted = FALSE.
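The merge described above behaves like an upsert keyed on `_fivetran_id`. The Python sketch below models the destination table as a dictionary; the function `apply_changes` and the identifier value `"a1"` are hypothetical and only illustrate the insert and update rules.

```python
def apply_changes(destination, changes):
    """Upsert changed documents into a destination keyed by _fivetran_id."""
    for fivetran_id, document in changes:
        row = dict(document)
        row["_fivetran_id"] = fivetran_id
        # Inserted and updated rows are both written as not deleted.
        row["_fivetran_deleted"] = False
        destination[fivetran_id] = row
    return destination


destination = {}
# An insert creates a new row.
apply_changes(destination, [("a1", {"id": "1", "foo": 2})])
# An update to the same item replaces the data in the existing row.
apply_changes(destination, [("a1", {"id": "1", "foo": 5})])
print(destination)
```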
Deleted data
Cosmos DB's change feed does not log deletes. To keep track of deleted data, we use Fivetran Teleport Sync to identify deleted records and apply the changes to the destination tables.
Fivetran Teleport Sync
Fivetran Teleport Sync is a proprietary database replication method that offers the completeness of snapshots while approaching the speed of log-based systems. With this sync mechanism, Fivetran can incrementally replicate your database with no additional setup other than a read-only connection.
Fivetran Teleport Sync performs the following operations:
- Do a full table scan of each synced table for the id and partition key
- Aggregate a compressed table (container) snapshot in the application memory
- Compare the aggregated snapshot to the previous snapshot to deduce the differences
- If there are differences in the snapshots, delete missing source items in the corresponding destination tables
NOTE: We only perform snapshot comparisons during incremental syncs, as initial syncs cannot have deleted data.
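The snapshot comparison can be illustrated with a short Python sketch. It is a simplification under stated assumptions: each snapshot is a plain, uncompressed set of (id, partition key) pairs, the destination is a dictionary keyed by the same pairs, and deleted items are flagged with `_fivetran_deleted` rather than physically removed. The function names are hypothetical.

```python
def find_deleted_keys(previous_snapshot, current_snapshot):
    """Return keys that were in the previous snapshot but are now missing."""
    return previous_snapshot - current_snapshot


def mark_deleted(destination, deleted_keys):
    """Flag destination rows whose source items no longer exist."""
    for key in deleted_keys:
        if key in destination:
            destination[key]["_fivetran_deleted"] = True


previous = {("1", "us"), ("2", "us"), ("3", "eu")}
current = {("1", "us"), ("3", "eu")}  # item ("2", "us") was deleted at the source

destination = {key: {"_fivetran_deleted": False} for key in previous}
deleted = find_deleted_keys(previous, current)
mark_deleted(destination, deleted)
print(deleted)
```

Only keys are compared, which is why the scan needs just the id and partition key of each item rather than the full documents.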