Data integration can easily become non-performant. Learn about some common bottlenecks.
In the context of data integration, network performance is an important determinant of how quickly your data can synchronize from source to destination.
For traffic from most sources, particularly commodity SaaS applications, network performance generally matters very little because the overall volumes of data are low, and data is quick to ingest regardless of network performance. The pace of data ingestion is limited by the behavior of the application and/or by the vendor intentionally throttling API bandwidth.
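A little arithmetic shows why the API cap, not the network, sets the pace in these cases. The numbers below (a vendor limit of 5 requests per second and a 10,000-page backfill) are purely hypothetical:

```python
def paced_requests(n_pages: int, requests_per_second: float) -> float:
    """Minimum seconds needed to pull n_pages from an API that caps
    clients at requests_per_second (a vendor-imposed limit)."""
    return n_pages / requests_per_second

# 10,000 pages at a hypothetical cap of 5 requests/second takes
# roughly 33 minutes, no matter how much bandwidth is available.
print(round(paced_requests(10_000, 5) / 60))  # -> 33
```

Doubling the network bandwidth changes nothing here; only a higher rate limit (or fewer, larger pages) would speed up the sync.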
Imagine a chain of pipes in which the width of each pipe corresponds with the available bandwidth: no matter how wide the rest are, data flows only as fast as the narrowest pipe allows.
However, network performance is a far more serious matter for high-volume data sources, specifically operational databases. The huge volume of records stored in an operational database heavily impacts the speed and turnaround time of both historical and incremental updates.
Historical syncs, which ingest the full contents of a data source and are necessary both to initialize a data connector and to recover from serious errors, are especially impacted by network performance. Syncs that take many days or weeks can interfere with a company’s operations and deter a company from modernizing its data integration infrastructure altogether. At sufficiently high volumes, even incremental syncs can bog down.
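A back-of-the-envelope calculation makes the stakes concrete. The figures below (a 2 TB historical sync over a sustained 50 Mbps effective rate) are illustrative assumptions, not benchmarks:

```python
def transfer_time_days(volume_gb: float, effective_mbps: float) -> float:
    """Estimate wall-clock days to move volume_gb at a sustained rate."""
    bits = volume_gb * 1e9 * 8              # decimal GB -> bits
    seconds = bits / (effective_mbps * 1e6) # bits / (bits per second)
    return seconds / 86400                  # seconds -> days

# A 2 TB historical sync over a sustained 50 Mbps link:
print(round(transfer_time_days(2000, 50), 1))  # -> 3.7 days
```

At that rate a historical sync monopolizes the link for the better part of a week, which is exactly the kind of disruption described above.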
When the source is a database, there are several common ways syncs can be slowed down:
Database instances can be too small and lack the resources (e.g. processing power, RAM, disk space, local network bandwidth) to quickly transfer large amounts of data.
High transactional traffic can also leave databases too busy to service replication.
Databases can be gated behind other servers. Security protocols using SSH tunneling, bastion servers, port-forwarders, and network address translation (NAT) intentionally route traffic through often narrow checkpoints in order to prevent sensitive data from being intercepted.
Inter-region transfer (e.g. from AWS East to AWS West) is slower than intra-region transfer, meaning that you can face slowdowns by moving data from one region to another.
Inter-cloud transfer (e.g. from AWS to GCP), or moving data from one network to another, can cause slowdowns.
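To make the idea concrete, here is a minimal sketch, with entirely hypothetical per-hop bandwidths, of how end-to-end throughput is capped by the slowest link in the path:

```python
# Hypothetical per-hop bandwidths (Mbps) along one sync path:
# database host -> bastion tunnel -> NAT gateway -> inter-region link.
hops = {
    "database_host": 1000,
    "bastion_tunnel": 200,
    "nat_gateway": 400,
    "inter_region_link": 250,
}

# End-to-end throughput is capped by the narrowest hop, so that is
# the one to investigate (and widen) first.
bottleneck = min(hops, key=hops.get)
print(bottleneck, hops[bottleneck])  # -> bastion_tunnel 200
```

In this sketch, upgrading the database host or the NAT gateway would accomplish nothing; the tunnel sets the ceiling.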
The relative severity of each chokepoint varies from one environment to another, and it will take careful troubleshooting to identify which one impacts your data integration workflow the most.
The measurement of bandwidth and throughput can be complicated by the presence of buffers.
Buffers are staging areas that temporarily hold data, either in memory or on disk, wherever data is handed off between machines. The more bottlenecked a subsequent stage of a handoff is relative to the previous stage, the more the buffer between them will fill.
The presence of buffers can radically alter the behavior of a network. Without buffers, the movement of data along every node of the network takes place at a constant rate and is limited by the slowest bottleneck. With buffers, the flow is less brittle as nodes can act as staging areas to store batches of data.
Observing how full buffers get (and how quickly they fill) can offer clues to where your network has bottlenecks.
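A simple tick-by-tick simulation illustrates the diagnostic: when an upstream stage emits records faster than the downstream stage drains them, the buffer's depth grows steadily, and the growth rate itself tells you how mismatched the two stages are. The rates below are hypothetical:

```python
# Simulate a two-stage pipeline tick by tick: the upstream stage emits
# 10 records per tick, the downstream stage drains only 4 per tick.
produce_rate, consume_rate = 10, 4  # records per tick (hypothetical)

buffer_depth = 0
history = []
for tick in range(20):
    buffer_depth += produce_rate
    buffer_depth -= min(consume_rate, buffer_depth)
    history.append(buffer_depth)

# Depth grows by 6 records per tick; the consumer is the bottleneck.
print(history[:5])  # -> [6, 12, 18, 24, 30]
```

A buffer that grows without bound (or sits pinned at its maximum size) points at the consumer; a buffer that stays near empty points upstream.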
A consideration that further complicates the measurement of bandwidth is compression (as well as deduplication). A 100 GB file may very well reduce to 40 GB. Does this mean that 100 GB is moved through the network, or 40 GB? Moreover, compression and decompression add additional complexity to a system, and the question of how quickly a file can be compressed and decompressed can determine whether compression is even appropriate.
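One way to decide is simply to measure. The sketch below uses Python's standard zlib on a synthetic, highly repetitive payload (real exports will compress less dramatically) to get both the compression ratio and the time spent compressing, which can then be weighed against the link speed:

```python
import time
import zlib

# Hypothetical payload: repetitive text compresses well, as many log
# files and database exports do.
payload = b"2024-01-01 INFO sync batch committed\n" * 200_000  # ~7 MB

start = time.perf_counter()
compressed = zlib.compress(payload, level=6)
elapsed = time.perf_counter() - start

ratio = len(payload) / len(compressed)
# Compression pays off only if (compress time + transfer of the smaller
# file) beats transferring the original over the available link.
print(f"ratio {ratio:.0f}x in {elapsed:.3f}s")
```

If the link is fast and the CPU is the scarce resource, skipping compression (or choosing a faster, lighter codec) can be the right call; on a narrow inter-region link, a high ratio usually wins.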
Every point of transition from one platform to another has the potential to throttle traffic. There are several ways to address this problem. The first, and most obvious, is simply to scale up infrastructure, i.e. widen every part of the path. The second is to re-architect the data integration process so that there are fewer points of transition. Yet another way is to optimize algorithms and parallelize execution to speed up processing. Finally, it is important to properly leverage compression and decompression as data moves between platforms.
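As a sketch of the parallelization remedy, a large table export can be split into independent key ranges and moved through the path as several concurrent streams. The table size, chunk size, and `export_range` stand-in below are all hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def export_range(lo: int, hi: int) -> int:
    """Stand-in for exporting rows with lo <= id < hi; a real
    implementation would run a range-bounded query and stream the
    results. Returns the number of rows handled."""
    return hi - lo

total_rows = 1_000_000
chunk = 250_000
ranges = [(i, i + chunk) for i in range(0, total_rows, chunk)]

# Four workers move four key ranges through the path at once.
with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(lambda r: export_range(*r), ranges))

print(sum(counts))  # -> 1000000
```

Parallelism helps most when the bottleneck is per-connection (e.g. a throttled tunnel or a single busy reader), and least when a shared link is already saturated.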