Process Isolation in Data Pipelines

Process isolation is an important pillar of software engineering that can keep your data pipelines (and you) out of trouble.

By Charles Wang, March 18, 2021

Process isolation is a characteristic of modern operating systems that allows multiple processes to run side by side without interfering with one another. It strictly partitions the memory space and computational resources used by each process. By putting different instances of an application into separate processes, you can leverage the basic workings of the operating system to guarantee that one process cannot write to, or read from, another’s memory, thus preventing interference between different applications or between separate instances of the same application.
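As a minimal sketch of this partitioning, the Python snippet below forks a child process that modifies a list and reports back over a pipe. It assumes a POSIX system (os.fork is not available on Windows), and the function name mutate_in_child is hypothetical, chosen only for illustration.

```python
import os


def mutate_in_child(records):
    """Fork a child that appends to `records`; return (child_len, parent_len)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: it works on its own copy of the parent's memory.
        try:
            os.close(read_fd)
            records.append("written by child")
            os.write(write_fd, str(len(records)).encode())
        finally:
            os._exit(0)  # exit immediately, skipping the parent's cleanup code
    # Parent process: read what the child saw, then inspect our own list.
    os.close(write_fd)
    child_len = int(os.read(read_fd, 64).decode())
    os.close(read_fd)
    os.waitpid(pid, 0)
    return child_len, len(records)


if __name__ == "__main__":
    child_len, parent_len = mutate_in_child(["row 1", "row 2"])
    print(f"child saw {child_len} records, parent still has {parent_len}")
```

The child's append never reaches the parent's list; the only data that crosses the boundary is what the child explicitly writes to the pipe.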

In the context of data pipelines and data integration, process isolation means that every data connector for every customer is separated at a low level, ensuring utmost security and reliability.

Security and Reliability

Well-written software should prevent data from separate instances from becoming mixed up and cross-contaminated, but hacks and programming bugs can, in theory, cause such problems. Process isolation entirely forestalls this possibility.

Process isolation guarantees security in two ways. First, different customers use separate processes, preventing one customer from accidentally receiving another’s data even if both happen to use the same machine.

Second, process isolation offers security from the standpoint of data governance. Within a single customer’s account, running each connector in its own process also prevents data mix-ups between connectors. When customers set fine-grained access permissions on each connector, process isolation guarantees that the data flowing through each connector is only accessible to the appropriate parties.

Assigning exclusive memory segments to each connector also promotes reliability. Crashes that result from bugs or memory shortages will only halt a single instance of a connector, without affecting others. This lowers the overall failure rate. It also makes it easier to identify and fix the exact source of a failure, and means that only the failed connector needs to be restarted.
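The containment described above can be sketched by launching each connector as a separate Python interpreter and inspecting exit codes. The connector names and bodies below are hypothetical stand-ins, and the sketch runs them one after another for simplicity; it is not a description of any particular pipeline's scheduler.

```python
import subprocess
import sys

# Hypothetical connector bodies: one crashes, the other two succeed.
CONNECTORS = {
    "orders": "print('synced orders')",
    "events": "raise MemoryError('simulated out-of-memory failure')",
    "users": "print('synced users')",
}


def run_connectors(scripts):
    """Run each connector body in its own interpreter; return {name: exit_code}."""
    results = {}
    for name, code in scripts.items():
        completed = subprocess.run(
            [sys.executable, "-c", code], capture_output=True
        )
        results[name] = completed.returncode
    return results


def failed_connectors(results):
    """Names of connectors whose process exited with a non-zero code."""
    return sorted(name for name, code in results.items() if code != 0)


if __name__ == "__main__":
    outcome = run_connectors(CONNECTORS)
    print("needs restart:", failed_connectors(outcome))
```

Because each failure is tied to one process and one exit code, only the connector named in the result needs to be investigated and restarted.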

Challenges Posed by Process Isolation

By default, operating systems place little practical limit on the number of processes spawned or on their memory usage. As a result, as usage grows, the machine can run out of memory, and some or all of the processes will die. Since processes are not restricted from reserving unallocated memory, one process can easily hoard enough memory to starve and shut down other processes.

Without process isolation, it’s impossible to impose granular control over memory consumption at the connector level in the data pipeline. By contrast, assigning each connector to a separate process allows us to ration memory usage for each process. This, combined with limits on the number of processes spawned, ensures that the total system memory is never exceeded.
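One way to ration memory per process on Linux is an address-space limit applied to the child before it starts. The sketch below uses Python's resource.setrlimit with RLIMIT_AS; the 512 MiB cap and the helper names are assumptions for illustration, and RLIMIT_AS is not reliably enforced on every platform (macOS in particular).

```python
import resource
import subprocess
import sys

MEMORY_CAP_BYTES = 512 * 1024 * 1024  # hypothetical per-connector ration


def run_with_memory_cap(code, cap_bytes=MEMORY_CAP_BYTES):
    """Run `code` in a child interpreter whose address space is capped."""

    def apply_cap():
        # Executed in the child just before the new interpreter starts.
        resource.setrlimit(resource.RLIMIT_AS, (cap_bytes, cap_bytes))

    completed = subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=apply_cap,
        capture_output=True,
    )
    return completed.returncode


def max_processes(total_bytes, cap_bytes=MEMORY_CAP_BYTES):
    """How many capped processes fit without exceeding total memory."""
    return total_bytes // cap_bytes


if __name__ == "__main__":
    small = run_with_memory_cap("buf = bytearray(16 * 1024 * 1024)")
    large = run_with_memory_cap("buf = bytearray(1024 * 1024 * 1024)")
    print("16 MiB sync exit code:", small)
    print("1 GiB sync exit code:", large)
```

A process that tries to exceed its ration fails on its own, and capping the process count at max_processes keeps the sum of all rations within the machine's memory.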

Determining the correct allocation of memory is difficult because the size of data syncs can be highly unpredictable. If a data pipeline allocates memory based on typical memory usage, a significant number of syncs will fail because memory is under-allocated, while memory is often over-allocated for those that do succeed. 

Specifically, under-allocating memory causes out-of-memory failures that stop the affected process entirely, while over-allocating memory earmarks resources that are never used.

To ensure that syncs virtually always succeed, the only practical solution is to generously over-allocate memory. Ultimately, this means running fewer processes per machine, purchasing additional machines, and accepting that there will be wasted and underutilized capacity.
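The trade-off can be made concrete with a little arithmetic. The peak-memory figures below are invented for illustration, not measurements, and the 16 GiB machine size is likewise an assumption.

```python
import statistics

# Hypothetical peak memory per sync, in MB (illustrative values only).
PEAKS_MB = [200, 250, 300, 350, 400, 450, 500, 900, 1500, 4000]
MACHINE_MB = 16384  # assumed machine size


def allocation_outcome(peaks_mb, allocation_mb):
    """Return (failure_rate, wasted_mb) for a fixed per-process allocation."""
    failures = sum(1 for peak in peaks_mb if peak > allocation_mb)
    wasted = sum(allocation_mb - peak for peak in peaks_mb if peak <= allocation_mb)
    return failures / len(peaks_mb), wasted


if __name__ == "__main__":
    typical = int(statistics.median(PEAKS_MB))
    for allocation in (typical, 4096):
        rate, wasted = allocation_outcome(PEAKS_MB, allocation)
        per_machine = MACHINE_MB // allocation
        print(
            f"{allocation} MB each: {rate:.0%} of syncs fail, "
            f"{wasted} MB idle, {per_machine} processes per machine"
        )
```

With these made-up numbers, allocating the median lets half the syncs fail, while allocating enough for the largest sync removes the failures at the cost of far fewer processes per machine and much more idle memory.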

Generously over-allocating memory still raises the question of precisely how much to over-allocate. With experience, it is possible to predict a range of practical values for different data sources, but there is no accounting for extreme edge cases with abnormally large values. Fields in databases can, without warning, contain gigabyte-scale values in JSON or binary fields. Large values can not only cause out-of-memory failures but also be refused by destinations that enforce limits on the size of fields.
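A pipeline can at least detect abnormally large values before they reach a destination. The sketch below measures each field in a row and flags those over a size limit; the function name, the sample row, and the 64-byte limit are hypothetical, and real destinations document their own limits.

```python
import json


def oversized_fields(row, limit_bytes):
    """Return {field_name: size_in_bytes} for values larger than the limit."""
    flagged = {}
    for name, value in row.items():
        if isinstance(value, (bytes, bytearray)):
            size = len(value)  # binary fields: raw length
        elif isinstance(value, str):
            size = len(value.encode("utf-8"))  # text: encoded length
        else:
            # Other values (numbers, nested JSON): size of the JSON encoding.
            size = len(json.dumps(value).encode("utf-8"))
        if size > limit_bytes:
            flagged[name] = size
    return flagged


if __name__ == "__main__":
    row = {"id": 1, "payload": "x" * 100, "blob": b"\x00" * 50}
    print(oversized_fields(row, limit_bytes=64))
```

A check like this does not remove the need for memory headroom, since the value must still be read to be measured, but it turns a rejected write into an explicit, reportable condition.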

There is no question that process isolation is essential to security and reliability. The chief remaining challenge is wastefulness, and its solution fundamentally depends on forecasting the right allocation of memory for every process.
