How to Optimize Integration Performance
Question
How can I optimize the performance of my Fivetran integration?
Environment
Fivetran
Answer
Most of our users consolidate data from multiple sources into one destination. Most integrations target an analytical rather than a transaction-processing destination.
You should optimize the performance of your integration for the following reasons:
- Replication must keep up with volumes coming from all sources. Some users have very high data volumes coming from multiple sources.
- Faster integration performance allows for earlier data availability in the target. Some use cases require data in near real-time.
- Destination costs on some platforms results in higher cost for lower latency.
How integration works
Fivetran stages the data in the destination before applying the changes to the destination tables. Unless the source data is insert-only, the destination is always updated, regardless of the synchronization mode, or deleted (in some cases). If supported by the destination, we use merge to apply the changes from the source in a single pass to the destination table.
NOTE: Fivetran's approach to integration is more efficient per row as the volume of data in a single run increases. As a result, this method generally keeps up with the change volume, based on the sync frequency or the resulting latency.
Performance bottlenecks
Fivetran proactively looks for bottlenecks for you as part of our managed service.
Target system diagnostics and execution plans
Consider using target system diagnostics. Fivetran submits statements on the target to be executed by the destination. Generally, these statements are straightforward for a database optimizer. However, if you look at the individual steps, you may find that merge statements can perform full table scans.
Depending on the data volumes and the frequency of changes, the full table scans may take a relatively long time. Your target's query execution plan visualizes this process (for example, Snowflake's Query Profile). Based on the technology you use, you may be able to optimize data delivery using strategies such as clustering or partitioning. If you access the replicated data directly, then you must weigh the data access patterns with an optimal load strategy.