How to Optimize Integration Performance
Question
How can I optimize the performance of my HVR 6 integration?
Environment
HVR 6
Answer
Most of our users consolidate data from multiple sources into a single destination, and most integrations target an analytical destination rather than a transaction-processing one.
You should optimize the performance of your integration for the following reasons:
- Replication must keep up with the combined change volume from all sources. Some users replicate very high data volumes from multiple sources.
- Faster integration performance allows for earlier data availability in the target. Some use cases require data in near real-time.
- On some destination platforms, the pricing model means that lower latency comes at a higher cost.
How integration works
HVR 6 supports two integration modes, burst mode and continuous mode:
- Continuous mode replays changes from the source as single-row operations on the target, in change order. It is useful for targets whose integrity constraints must match those on the source, and for achieving minimal latency. We recommend that you use continuous mode only on a well-indexed technology built for transaction processing, such as Oracle, SQL Server, or PostgreSQL.
- Burst mode stages data, similarly to Fivetran, then applies the changes as micro-batches to the target table. Burst mode is better at keeping up with high source data volumes than continuous mode. We recommend that you use burst mode for analytical targets such as Snowflake, BigQuery, or Redshift.
NOTE: HVR's burst mode becomes more efficient per row as the volume of data in a single run increases. As a result, this method generally keeps up with the change volume, whatever sync frequency or resulting latency you target.
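To see why larger runs are more efficient per row, consider the coalesce step that burst mode performs before applying a micro-batch. The following is a minimal sketch, not HVR's actual implementation: the change format, operation names, and table shape are assumptions for illustration only.

```python
# Minimal sketch of burst-style coalescing (illustrative only, not HVR code).
# Each change is (op, key, row); we keep only the net effect per key, so a
# micro-batch of many changes shrinks to far fewer operations to apply.

def coalesce(changes):
    """Collapse ordered row changes into one net operation per key."""
    net = {}  # key -> (op, row)
    for op, key, row in changes:
        prev_op = net.get(key, (None,))[0]
        if op == "delete":
            if prev_op == "insert":
                net.pop(key)                 # insert then delete cancels out
            else:
                net[key] = ("delete", None)
        elif op == "insert":
            if prev_op == "delete":
                net[key] = ("update", row)   # delete then re-insert nets to update
            else:
                net[key] = ("insert", row)
        else:  # update
            if prev_op == "insert":
                net[key] = ("insert", row)   # still a net insert, latest row image
            else:
                net[key] = ("update", row)
    return net

changes = [
    ("insert", 1, {"id": 1, "qty": 5}),
    ("update", 1, {"id": 1, "qty": 7}),  # supersedes the inserted row image
    ("insert", 2, {"id": 2, "qty": 1}),
    ("delete", 2, None),                 # cancels out entirely
]
print(coalesce(changes))  # {1: ('insert', {'id': 1, 'qty': 7})}
```

The more changes a single run contains, the more of them collapse away, which is why the per-row cost of applying the batch drops as the batch grows.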
Performance bottlenecks
When you're optimizing your integration's performance, it's important to look out for performance bottlenecks. With the HVR 6 solution running in your environment, there are some performance aspects you can control based on the integration mode you choose.
- When running in burst mode:
  - Does data reach the staging area in a timely manner? The hvr.out log file contains statistics that can help you answer this question. Could the network be the bottleneck, and would an agent (an installation of HVR) closer to the target be beneficial?
  - Burst integration is temporarily CPU and memory intensive. Ideally, you'd reach 100% CPU utilization. If you have multiple sources, is your target system adequately sized? Do you have enough memory to allow for memory allocation without swapping to disk? (See the memory-check sketch after this list.)
  NOTE: HVR 6 prints a message to the log if the coalesce step cannot be performed entirely in memory. When this happens, HVR spills to disk for sorting. While running in memory would be faster, spilling to disk is not necessarily slow, provided you have fast storage for the HVR_TMP location. However, if memory is over-allocated and the OS starts swapping, performance slows.
  - Fast storage (high IOPS, but also good throughput) is also important, depending on the target platform and the method of staging.
- When running in continuous mode:
  - Do you have a low-latency connection to the target? With single-row operations, there is far more chatter between HVR and the target system than with the burst method, so a local agent (an installation of HVR) as close to the target database as possible is critical. (See the latency-check sketch after this list.)
  - Is your target well-indexed? If HVR created the target tables, it created unique indexes on the key columns. If a table does not have a key column, no unique index is created. However, that doesn't mean indexes can't be useful to you, and you may have to analyze tables without a key for optimal performance.
  - Is resilient processing taking place? When source and target are not quite in sync, replication can continue with resilient processing enabled. In continuous mode, resilient processing can be less efficient than non-resilient processing. On indexed tables, the difference is hardly noticeable; on tables without a key, however, resilient processing can be dramatically slower.
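For the burst-mode memory question above, a quick way to check whether the target host is under memory pressure is to sample OS counters. This is a minimal sketch assuming the third-party psutil package; the thresholds are illustrative, not HVR recommendations.

```python
# Minimal sketch: warn when memory pressure could force the OS to swap
# during a burst run. Uses the third-party psutil package (an assumption;
# any OS monitoring tool gives the same information).
import psutil

mem = psutil.virtual_memory()
swap = psutil.swap_memory()

print(f"memory used: {mem.percent}%  swap used: {swap.percent}%")
if swap.percent > 5 or mem.percent > 90:
    print("WARNING: the OS may be swapping; integrate runs will slow down.")
```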
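For the continuous-mode latency question, you can estimate the round-trip time between the agent host and the target database by timing a trivial statement. This is a minimal sketch assuming a PostgreSQL target and the psycopg2 driver; the connection string is a placeholder.

```python
# Minimal sketch: measure round-trip time to the target database, which
# dominates single-row operations in continuous mode. Assumes a PostgreSQL
# target and the psycopg2 driver; the DSN below is a placeholder.
import time
import psycopg2

conn = psycopg2.connect("host=target-db dbname=mydb user=hvr")  # placeholder DSN
cur = conn.cursor()

samples = []
for _ in range(20):
    start = time.perf_counter()
    cur.execute("SELECT 1")
    cur.fetchone()
    samples.append(time.perf_counter() - start)

avg_ms = 1000 * sum(samples) / len(samples)
print(f"average round trip: {avg_ms:.2f} ms")
conn.close()
```

Every single-row operation pays roughly this round-trip cost, so if the average is in the tens of milliseconds, moving the agent closer to the target usually helps more than any other tuning.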
Target system diagnostics
Regardless of which integration mode you use, consider using the target system's own diagnostics to find out where time is being spent.
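As one example of such diagnostics, assuming a PostgreSQL target, you could watch for statements that run long while HVR is integrating by querying the standard pg_stat_activity view; the connection string is a placeholder, and other platforms have equivalent views.

```python
# Minimal sketch: list long-running statements on a PostgreSQL target via
# the standard pg_stat_activity view. The DSN is a placeholder; adapt the
# query for other platforms, which expose similar diagnostic views.
import psycopg2

conn = psycopg2.connect("host=target-db dbname=mydb user=hvr")  # placeholder DSN
cur = conn.cursor()
cur.execute("""
    SELECT pid, now() - query_start AS runtime, state, query
    FROM pg_stat_activity
    WHERE state <> 'idle'
      AND now() - query_start > interval '30 seconds'
    ORDER BY runtime DESC
""")
for pid, runtime, state, query in cur.fetchall():
    print(pid, runtime, state, query[:80])
conn.close()
```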