Idempotence is an important characteristic of many systems, including Fivetran. Here's what it means.
Idempotence means that if you execute an operation multiple times, the result will not change after the initial execution. For example, if the operation is a mathematical function, it means that the output does not change when applied again to its own output:
f(f(x)) = f(x)
Most functions you might think of are not idempotent. If you repeatedly compute f(x) = x + 1, for instance, the result will increase by “1” with every iteration.
Given x = 1
f(x) = 1 + 1 = 2
f(f(x)) = 3
f(f(f(x))) = 4
By contrast, the function f(x) = abs(x) is idempotent.
Given x = -11
f(x) = 11
f(f(x)) = 11
f(f(f(x))) = 11
A real-world example of idempotence is a button for a floor in an elevator. An initial press of the button will cause the elevator to move its passengers toward the specified floor. Subsequent pushes will not change the elevator’s behavior.
Suppose we have a data warehouse with the following records in a table:
Let’s say the data source adds the following record and we want to sync it with our data warehouse
Our desired output looks like this:
Specifically, we want to add the record with id 3.
A good practice for syncing data is to update records incrementally, so that we don’t have to sync an entire data set every time there is an update. This means that we must perform the following actions to update our data warehouse:
Look up the previous saved progress, called a cursor, that tells us we have previously synced up to row 2
Based on this, we know we can start after row 2, so the pipeline only syncs the row with id 3
Record a new cursor pointing to id 3 so that the next sync starts after id 3.
Suppose the sync fails after saving the row with id 3 but before recording the state of the data warehouse into a new cursor. This means the pipeline will sync row 3 again on the next attempt. If our data pipeline is idempotent, the subsequent attempt will still produce the desired output with three rows.
Suppose the system is not idempotent. The data pipeline failed before recording the cursor, so we must try again. But because the system is not idempotent, we end up with the following output:
We have duplicated the row with id 3. This extraneous record will throw off any queries, analyses or operations that depend on the data. In a real-world data warehouse with billions of rows, the problem scales accordingly.
In the context of data integration, idempotence makes your data pipeline self-correcting. Most importantly, idempotence prevents duplicate records from being loaded. Duplication poses the following dangers to business intelligence and analytics:
Misleading or erroneous insights – e.g. you will not have correct rank orders, counts, or sums of records. An unbalanced duplication will alter medians, averages, distributions, and other metrics.
Broken operations downstream – e.g. any systems that rely on one-to-one correspondence between records and identifiers may throw an error or produce the wrong behavior.
Wasted storage space and computational bandwidth – you just have extra records that don’t add value yet still have to be accounted for in queries (and disk space).
To the extent that a company relies on data to make high-stakes decisions, duplication can lead to the wrong decisions, to say nothing of the additional overhead costs of wasted storage space and rewriting queries to compensate for duplicate records.
We will further discuss the importance of idempotence for data integration in a subsequent piece. Stay tuned!