Data pipeline performance hinges on several important techniques, including algorithmic optimization and parallelization.
In this post, we will take a closer look at data pipelines and the various practices that can help or hinder performance. Data pipelines, specifically the workflows for extracting, loading and transforming data, present a number of possible performance bottlenecks.
There are three basic ways to overcome bottlenecks in a data pipeline:
The first way consists of algorithmic improvements and code optimization. These are the cheapest fixes, as they don't involve major changes to the software's architecture.
The next way involves architectural changes to parallelize simple, independent processes.
The final way is pipelining. This involves architectural changes to separate the data integration process into distinct, sequential stages that can each simultaneously handle a workload.
The full workflow of extracting, loading and transforming data can be a complex, path-dependent process. Architectural changes to software affect both how the code is organized and the hardware requirements of the system. This means parallelization and pipelining are potentially costly ways to improve performance, and should be pursued only when algorithmic optimization has been exhausted.
Algorithmic optimization means choosing the most efficient method for each computation; there are better and worse ways to perform any given task. It is also one of the few ways to squeeze direct monetary savings out of improved performance, since parallelization and pipelining require additional infrastructure.
Suppose you want to test whether a number is divisible by 2. The slow way is to actually perform the division. A faster approach is to check the trailing bit of the number's binary representation: it is 0 for every even number.
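As a minimal sketch, the bit check looks like this in Python (function name is ours, for illustration):

```python
def is_even(n: int) -> bool:
    # The trailing (least significant) bit of an even number is 0,
    # so n & 1 evaluates to 0 for even numbers -- no division needed.
    return n & 1 == 0

print(is_even(42))  # True
print(is_even(7))   # False
```

A single bitwise AND is cheaper than a division instruction, which is the whole point of the optimization.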
In a data pipeline, suppose some data arrives as JSON and you want to determine whether a key has already been converted into a field. Rather than looping through a list of all the keys, you can build an index of your fields and match the key against that index. A hash map is usually the best choice here, since it replaces an O(n) scan with an O(1) average-case lookup.
Fundamentally, a data pipeline extracts data from a source and loads it into a destination, usually a data warehouse. Fetching and replicating data from one platform to another is a relatively simple process.
Suppose you have 1,000 records and spawn eight processes. The data can be split into eight queues of 125 records each and processed in parallel, rather than 1,000 records in sequence.
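A minimal sketch of that split, assuming the per-record work can be stood in for by a trivial function (both function names here are ours):

```python
from multiprocessing import Pool

def chunk(records, n):
    # Split records into n near-equal queues: 1,000 records across
    # eight workers yields eight chunks of 125 each.
    size, rem = divmod(len(records), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        chunks.append(records[start:end])
        start = end
    return chunks

def process_batch(batch):
    # Stand-in for the real per-record work done on each queue.
    return sum(batch)

if __name__ == "__main__":
    records = list(range(1000))
    with Pool(processes=8) as pool:
        results = pool.map(process_batch, chunk(records, 8))
    print(sum(results))  # 499500, same total as processing sequentially
```

Because the chunks are independent, the eight workers never coordinate, which is exactly what makes this kind of work easy to parallelize.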
A concrete example is making API queries in parallel. Network request-response times impose a more-or-less fixed interval on each query. To prevent request-response times from adding up sequentially, it’s better to send a pool of requests across the network at once.
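A sketch of pooled requests, with a simulated round trip standing in for a real HTTP call (the endpoint URL and `fetch` function are hypothetical):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for a real HTTP request; sleep() simulates the
    # more-or-less fixed network round trip each query pays.
    time.sleep(0.1)
    return f"response for {url}"

urls = [f"https://api.example.com/records?page={i}" for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    responses = list(pool.map(fetch, urls))
elapsed = time.perf_counter() - start

# The eight round trips overlap, so total wall-clock time is close to
# a single round trip rather than the sum of all eight.
print(f"{len(responses)} responses in {elapsed:.2f}s")
```

Threads suit this case because the work is I/O-bound: each thread spends its time waiting on the network, not competing for the CPU.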
However, the full data integration workflow still involves a number of intermediate steps, including deduplication and some forms of aggregation. Operations like deduplication act on entire datasets: they must see all of the records at once to produce correct outputs, so the data cannot simply be split apart.
This calls for a different approach to optimizing performance. Rather than splitting the source data into pieces whose outputs are reassembled later, the key is to separate the process into sequential stages that can all be busy at once, with buffers between them to ensure each batch is fully written before it is handed off to the next stage.
Consider a queue with 8 separate batches that must each be processed en bloc through steps A, B and C:
The rate at which batches are completed depends on the slowest step in the sequence. With pipelining, step B takes the longest at 5 minutes, so a new batch completes every 5 minutes. Without pipelining, all steps run back to back, so a new batch completes only every 10 minutes.
You can liken this to an assembly line where there are staging areas between each process to ensure that every step remains active, so that there is always work being performed.
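The staged flow above can be sketched with worker threads and queues acting as the buffers between stages (the per-stage delays are illustrative stand-ins, not real timings):

```python
import queue
import threading
import time

def stage(delay, inbox, outbox):
    # Each stage pulls a batch from its inbound buffer, works on it,
    # then hands it to the next stage's buffer. All stages run
    # concurrently, so throughput is bounded by the slowest one.
    while True:
        batch = inbox.get()
        if batch is None:          # sentinel: pass the shutdown along
            outbox.put(None)
            return
        time.sleep(delay)
        outbox.put(batch)

# Buffers between stages A -> B -> C; B is deliberately the slowest.
buffers = [queue.Queue() for _ in range(4)]
for delay, inbox, outbox in zip([0.01, 0.05, 0.02], buffers, buffers[1:]):
    threading.Thread(target=stage, args=(delay, inbox, outbox),
                     daemon=True).start()

# Feed 8 batches into the first buffer, then the shutdown sentinel.
for batch in range(8):
    buffers[0].put(batch)
buffers[0].put(None)

results = []
while (item := buffers[-1].get()) is not None:
    results.append(item)
```

Once the pipeline fills, a finished batch emerges roughly once per slowest-stage interval, rather than once per full A+B+C pass.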
Pipeline performance can be optimized through algorithmic optimization, parallelization and pipelining. Algorithmic optimization is the first priority, as it is purely a matter of rewriting code to be more efficient. Parallelization and pipelining are more costly because they may involve larger-scale changes to the software, as well as to the pipeline’s infrastructure configurations.
We have also touched on network performance, a specific subtopic of pipeline performance, and the bottlenecks that can appear in a system running on distributed infrastructure.