Best Practices for Using an Integration Agent
Question
What are the best practices for using an HVR integration agent?
Environment
HVR 6
Answer
Why should I use an integration agent?
The most important reasons to use integration agents are as follows:
- Performance: An agent close to the target has low-latency access to the data store. Communication between the hub and the agent is always compressed, with compression ratios up to 10x. Network communication is further optimized to limit sensitivity to high latency and achieve maximum bandwidth.
- Scalability: The agent performs a subset of the work. If more work is needed, then an additional stateless agent can be added to distribute the load. When you consolidate multiple data flows (pipelines or channels) into the same target, you can use an agent farm consisting of multiple agents and a load balancer that automatically scales the number of required agents.
- Security: HVR's communication between the hub and agents uses TLS 1.3.
What processing does the integration agent perform?
The amount of work required to integrate changes depends on the destination technology and the pipeline (channel) configuration. There are two main approaches: continuous mode and burst (micro-batch) mode.
Continuous mode
Continuous mode applies to operational use cases with either a transaction processing (OLTP) database or Kafka as the target. For example, continuous mode would apply to replication from Oracle to Kafka, from PostgreSQL to MySQL, or to a homogeneous use case where the target technology is identical to the source. In continuous mode, HVR applies the changes to the target in commit order, row by row. The bulk of the processing is performed by the target technology, but the row-by-row nature of the integration requires low latency to achieve fast performance.
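To make the row-by-row pattern concrete, here is a minimal Python sketch of continuous-mode apply logic. It is not HVR code: the change format, the `orders` table, and the SQLite connection are illustrative assumptions only.

```python
# Sketch of continuous-mode integration: changes are applied to the target
# one row at a time, in source commit order. The change format, table, and
# the SQLite stand-in for the target database are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the target database
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")

# Each entry: (commit_sequence, operation, key, values)
captured_changes = [
    (1, "insert", 101, {"status": "new"}),
    (2, "update", 101, {"status": "shipped"}),
    (3, "delete", 101, {}),
]

for _, op, key, values in sorted(captured_changes):  # preserve commit order
    if op == "insert":
        conn.execute("INSERT INTO orders (id, status) VALUES (?, ?)",
                     (key, values["status"]))
    elif op == "update":
        conn.execute("UPDATE orders SET status = ? WHERE id = ?",
                     (values["status"], key))
    elif op == "delete":
        conn.execute("DELETE FROM orders WHERE id = ?", (key,))
    conn.commit()  # a round trip per change: network latency dominates throughput
```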
Continuous mode therefore benefits from an integration agent that is close to the target database, ideally on the database server itself, to keep latency low.
Burst (micro-batch) mode
Burst (micro-batch) mode applies to all analytical database technologies such as Snowflake, Google BigQuery, and Databricks. It also applies to use cases with files as the destination (for example, S3, ADLS, or GCS). HVR uses micro-batches because, without them, the destination technology would not be able to keep up with the rate of changes coming from one or more sources. Burst mode computes a net operation (insert, update, or delete) for each unique row, then prepares a data set that is processed as a micro-batch. The process of computing the net operation is called coalescing, and it is both CPU- and memory-intensive. If memory thresholds are exceeded, coalescing spills to disk, writing temporary files. Formatting files, whether staging files or regular destination files, is also CPU-intensive, as are operations such as client-side encryption, depending on the configuration.
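The following is a minimal Python sketch of the coalescing idea: folding a stream of changes into one net operation per key before the micro-batch is written. The change format and the exact folding rules are illustrative assumptions, not HVR's actual implementation.

```python
# Sketch of coalescing in burst mode: reduce a stream of row changes to one
# net operation per key. Rules and data shapes are illustrative only.
def coalesce(changes):
    """changes: iterable of (operation, key, values) in commit order."""
    net = {}  # key -> (net operation, latest values)
    for op, key, values in changes:
        prev = net.get(key)
        if op == "delete":
            if prev and prev[0] == "insert":
                net.pop(key)                   # insert followed by delete cancels out
            else:
                net[key] = ("delete", None)
        elif op == "insert":
            net[key] = ("insert", values)
        else:  # update
            if prev and prev[0] == "insert":
                net[key] = ("insert", values)  # still a net insert, with latest values
            else:
                net[key] = ("update", values)
    return net

changes = [
    ("insert", 1, {"qty": 5}),
    ("update", 1, {"qty": 7}),   # net effect for key 1: a single insert with qty=7
    ("insert", 2, {"qty": 3}),
    ("delete", 2, None),         # net effect for key 2: nothing to apply
]
print(coalesce(changes))         # {1: ('insert', {'qty': 7})}
```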
Burst mode benefits from the scalability of an integration agent. It is also the mode where an agent farm on the target side is most commonly used.