Optimize Database Log-Based Change Data Capture Performance
Question
How can I optimize the performance of change data capture (CDC) on my database?
Environment
- Fivetran
- HVR 6
Answer
Most relational online transaction processing (OLTP) databases use a transaction log. The transaction log is the ledger for changes to the database and the foundation for database recovery. Database changes are only committed when they have successfully been written to the transaction log. If the system or software crashes, the transaction log enables recovery of the database to its most recent committed state. Unrecoverable crashes are becoming increasingly rare due to the maturity of relational databases.
Log-based replication reads the committed changes asynchronously from the transaction log. It parses out the changes to the tables you have selected for replication. Because log-based replication is asynchronous, there is always some capture latency.
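Conceptually, the capture step is a filter over the committed change stream. The sketch below illustrates this; the record fields (`lsn`, `table`, `operation`, `row`) and the list-based log are illustrative stand-ins, since real transaction logs are binary and vendor-specific:

```python
from dataclasses import dataclass

@dataclass
class LogRecord:
    lsn: int          # log sequence number (commit order)
    table: str        # table the change applies to
    operation: str    # INSERT / UPDATE / DELETE
    row: dict         # changed column values

def capture(log_records, selected_tables):
    """Yield only the committed changes for tables selected for replication."""
    for record in log_records:
        if record.table in selected_tables:
            yield record

log = [
    LogRecord(1, "orders", "INSERT", {"id": 1}),
    LogRecord(2, "audit", "INSERT", {"id": 9}),   # not selected -> skipped
    LogRecord(3, "orders", "UPDATE", {"id": 1}),
]
captured = list(capture(log, selected_tables={"orders"}))
print([r.lsn for r in captured])  # -> [1, 3]
```

Filtering this early is also why capture close to the source is efficient: unselected changes never leave the capture process.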
With HVR 6, latency can be reduced significantly since CDC is running continuously (you'd likely see at least one to two seconds of capture latency due to efficiency optimizations). With Fivetran, latency depends on the sync frequency you choose. See your sync frequency options in our Sync frequency and scheduling documentation.
Why consider maximum CDC performance?
There are several reasons to consider CDC performance, whether you sync at an interval (using Fivetran) or you run continuously (using HVR 6):
- Faster performance helps CDC keep up with change volume (for example, a busy database may generate changes faster than CDC can capture them, so CDC never catches up).
- With faster performance, you limit both the amount and the duration of any overhead CDC places on the system.
- You can achieve closer to real-time replication with faster performance, allowing for near real-time use cases that make your organization more successful.
Maximum performance is reached when the serial portion of CDC can saturate a CPU core at 100% utilization. Log-based CDC must be serialized at some point to follow the commit order from the source database, which is required for transactionally consistent replication in a heterogeneous setup. This serialization requirement follows from the ACID properties that OLTP databases adhere to.
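The serialization step can be sketched as re-ordering parsed changes by their commit position. In the hypothetical example below, changes arrive out of order (as they might from parallel parsing) and are emitted in commit-LSN order; the `(commit_lsn, change)` pair format is an assumption for illustration:

```python
def serialize_by_commit(parsed_changes):
    """parsed_changes: (commit_lsn, change) pairs in arbitrary arrival order.
    Returns the changes re-ordered by commit LSN, i.e. source commit order."""
    return [change for _, change in sorted(parsed_changes)]

# Changes parsed by parallel workers can arrive out of order:
arrived = [(30, "txn C"), (10, "txn A"), (20, "txn B")]
apply_order = serialize_by_commit(arrived)
print(apply_order)  # -> ['txn A', 'txn B', 'txn C']
```

However much of the pipeline is parallelized, this final ordered emit is inherently serial, which is why a single CPU core can become the ceiling.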
What limits performance?
Log-based CDC is complex because many components are involved. Common performance bottlenecks include the following:
- Reading the transaction log from disk is too slow to allow CDC to achieve maximum performance.
- Retrieving changes is limited by available network bandwidth.
- Frequent round trips slow down processing due to latency.
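The cost of frequent round trips can be estimated with a back-of-the-envelope model. The numbers below (100 MB of change data, 20 ms round-trip latency, 12.5 MB/s of bandwidth) are illustrative assumptions, not measurements:

```python
def transfer_seconds(total_bytes, round_trips, latency_s, bandwidth_bps):
    """Rough model: each round trip pays the network latency once,
    plus the time to move the payload at the available bandwidth."""
    return round_trips * latency_s + total_bytes / bandwidth_bps

MB = 1_000_000
# Fetching 100 MB in 10,000 small requests vs. 100 batched requests:
chatty  = transfer_seconds(100 * MB, round_trips=10_000, latency_s=0.02,
                           bandwidth_bps=12.5 * MB)
batched = transfer_seconds(100 * MB, round_trips=100, latency_s=0.02,
                           bandwidth_bps=12.5 * MB)
print(round(chatty), round(batched))  # -> 208 10
```

With identical bandwidth and data volume, the chatty pattern spends most of its time waiting on latency, which is why batching and streaming protocols matter for CDC throughput.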
If CDC reaches 100% CPU utilization, only a few options remain to improve performance further:
- Limit the amount of work CDC must perform (for example, reduce the number of tables to be replicated).
- Parallelize the processing to the extent that you can.
- Shift CDC to an environment with faster CPUs.
- Use different software that is either more efficient and doesn't reach 100% CPU utilization for the same workload, or has (native) abilities to parallelize.
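One parallelization strategy is to partition the log into chunks, parse the chunks concurrently, and merge the results back in log order so the serialization requirement still holds. The sketch below uses threads and a trivial `parse_chunk` stand-in for what would be CPU-heavy binary parsing; a real implementation would use processes or native threads to escape Python's GIL:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_chunk(chunk):
    """Illustrative stand-in for CPU-heavy parsing of one log chunk."""
    return [f"parsed:{rec}" for rec in chunk]

def parallel_parse(chunks, workers=4):
    # Executor.map preserves input order, so the merged output still
    # follows log order even though chunks are parsed concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parsed = pool.map(parse_chunk, chunks)
    return [rec for chunk in parsed for rec in chunk]

chunks = [["r1", "r2"], ["r3"], ["r4", "r5"]]
print(parallel_parse(chunks))
# -> ['parsed:r1', 'parsed:r2', 'parsed:r3', 'parsed:r4', 'parsed:r5']
```

Note that only the parsing is parallelized; the ordered merge at the end remains serial, consistent with the commit-order requirement described earlier.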
Native log-based CDC methods
Many databases provide a native method to perform log-based CDC. For example, Oracle provides LogMiner and XStream Out (with a separate license cost), SQL Server supports CDC tables, and PostgreSQL provides replication slots to read from the WAL (Write-Ahead Log).
When you use your database's native options, you depend on the database software to parse changes, which limits your control over performance, efficiency, and throughput. Your maximum throughput is bound by the throughput the database can provide. You also must accept the level of database processing overhead that is introduced by the native CDC solution. For many applications running well below the limits of maximum database processing performance/throughput, this is not a concern, but it may be for others.
Binary log reading
HVR 6 and HVA connectors use binary log readers, providing more options to maximize performance. A binary log reader implements logic to parse changes from the transaction log. The binary log reader runs outside the database, reading the transaction logs directly at the file system level.
Using a binary log reader requires trust: you grant read access directly at the file system level, which lets the binary log reader bypass the database-internal processing that ultimately limits native CDC methods. In most cases, the binary log reader outperforms database-based CDC capabilities.
Agents
HVR 6 promotes a distributed architecture through the use of agents. In the context of HVR, a capture agent is an installation of the software that performs CDC on the machine where it runs. One of the benefits of the distributed architecture is improved performance:
- Local access to the transaction logs (at the file system level) provides the lowest latency and maximum bandwidth.
- Agents perform CDC close to the source, filtering the amount of data that must be sent downstream.
- Agents compress change data, giving another 5-10x reduction of change data that must be sent downstream.
- Communication with agents is optimized for maximum throughput and minimal sensitivity to high latency. By contrast, remote access to change data depends on how efficient remote database connectivity is.
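The compression benefit in the list above follows from how repetitive change data is (the same column names and similar values repeat across rows). The sketch below demonstrates this with zlib on synthetic JSON rows; the actual agent compression algorithm and the ratio you see depend on your data and configuration:

```python
import json
import zlib

# Synthetic change rows: repeated keys and similar values, like real CDC data.
rows = [{"op": "INSERT", "table": "orders", "id": i, "status": "shipped"}
        for i in range(1000)]
raw = json.dumps(rows).encode()
packed = zlib.compress(raw, level=6)
print(f"{len(raw) / len(packed):.1f}x smaller")
```

Because the compression happens at the agent, close to the source, only the compressed stream crosses the network to the downstream destination.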
The distributed architecture enables near real-time CDC for mission-critical, high volume databases across industries.
Fivetran now includes the concept of agents as well. Learn more in our High-Volume Agent Connectors documentation.