Learn about the technical challenges involved in building an idempotent data pipeline.
We have previously established that idempotence is crucial to failure-proofing your data pipeline, and that it depends on the ability to correctly identify unique records. Identifying unique records prevents duplicate or conflicting data from being produced when the same records are sent twice. A primary key is a field, or combination of fields, within a record that is unique to that record and relatively immutable, which means you can use primary keys to identify unique records. Here, we will discuss the exact mechanisms behind correctly identifying primary keys and using them to ensure extraction without duplication.
Two of the most common types of data sources that organizations extract data from are operational databases and API feeds. In order to construct idempotent data pipelines from both databases and API endpoints, we need to correctly identify primary keys. Databases and API endpoints differ somewhat in how they identify the primary key and read updated data in general.
In databases, the primary key is generally explicitly declared, allowing easy programmatic identification of primary keys.
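For instance, SQLite exposes declared primary keys through its PRAGMA table_info interface, whose "pk" field is nonzero for primary key columns. A minimal sketch, using an in-memory database and an illustrative users table:

```python
import sqlite3

def primary_key_columns(conn, table):
    # PRAGMA table_info returns one row per column as
    # (cid, name, type, notnull, dflt_value, pk); pk is nonzero for
    # primary key columns and gives their ordinal position in the key.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [name for (_cid, name, _type, _notnull, _default, pk)
            in sorted(rows, key=lambda r: r[5]) if pk]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
print(primary_key_columns(conn, "users"))  # ['id']
```

Other relational databases offer equivalents, such as the information_schema views in PostgreSQL and MySQL.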
An initial sync from a relational database requires the use of a SELECT * query, as it must return all records and all values. By contrast, subsequent syncs should not retrieve entire tables. Instead, a data connector for a database should query the changelog. The challenge posed by changelogs is that they do not contain original data, only a list of updates. These updates may be whole-row, in which a primary key and all values are tracked, offering a full snapshot of each record. They may also be partial-row, in which only a primary key and changed values are tracked. Either way, entries from changelogs with existing primary keys are merged with corresponding rows in the destination, while entries with new primary keys are added as new rows.
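The merge step can be sketched as follows, assuming an in-memory destination keyed by primary key; the field names and values are illustrative. Note that partial-row entries only overwrite the fields they carry, and that replaying the same entry twice has no additional effect:

```python
def apply_changelog(destination, changelog, pk="id"):
    """Merge changelog entries into a destination keyed by primary key."""
    for entry in changelog:
        key = entry[pk]
        if key in destination:
            destination[key].update(entry)   # existing key: merge, don't duplicate
        else:
            destination[key] = dict(entry)   # new key: add as a new row
    return destination

dest = {1: {"id": 1, "name": "Joe", "email": "joe@example.com"}}
log = [
    {"id": 1, "name": "Joseph"},                            # partial-row update
    {"id": 2, "name": "Ana", "email": "ana@example.com"},   # new record
    {"id": 1, "name": "Joseph"},                            # replayed entry: no effect
]
apply_changelog(dest, log)  # dest still has exactly two rows
```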
Without primary keys and idempotence, the destination will create new rows for all log entries, duplicating some rows in the corresponding tables.
When primary keys are not explicitly declared, the data pipeline can’t programmatically identify them. Without a primary key, the only alternative is to impute a primary key by hashing the contents of the entire row. The problem here is that the data pipeline can only prevent exact duplication. It has no way to prevent new, conflicting rows from being created as individual values in a row change.
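An imputed key of this kind can be sketched in a few lines (the row fields here are hypothetical). The hash is stable for identical rows, so exact resends deduplicate correctly, but any changed value produces a different key and therefore a new row:

```python
import hashlib
import json

def imputed_key(row):
    # Serialize the full row in a stable field order, then hash it.
    canonical = json.dumps(row, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

row = {"name": "Joe", "email": "joe@example.com"}

# An exact resend of the same row hashes identically and is deduplicated.
assert imputed_key(row) == imputed_key(dict(row))

# A single changed value yields a new key, so the update looks like a new record.
assert imputed_key(row) != imputed_key({**row, "name": "Joseph"})
```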
With an API endpoint, there is typically no programmatic way to identify primary keys. Instead, they are found in the documentation. If primary keys are not well-documented, the alternative is to construct one by choosing a field or combination of fields that are fixed and unique to a particular entity. This can be highly error-prone, as fields that appear to be fixed and unique, such as email addresses, may not actually be. In order to determine a usable primary key, you have to understand the exact behavior of the application that uses the data, specifically which fields can and can’t be changed.
For instance, a naive and tempting approach is to identify a user by contact information such as a username or email address. But don’t take for granted that usernames and emails can’t change. Consider a table with a sample of basic user data:
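A hypothetical sample of such basic user data (all names and values invented here for illustration) might look like this; the point is that every one of these fields is something a user could plausibly change:

```python
# Invented sample records; none of these fields is guaranteed to stay fixed.
users = [
    {"username": "jdoe", "name": "Joe Doe",
     "email": "joe@example.com", "phone": "555-0100",
     "address": "12 Main St"},
    {"username": "asmith", "name": "Ana Smith",
     "email": "ana@example.com", "phone": "555-0101",
     "address": "34 Oak Ave"},
]
```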
There are plenty of practical reasons why a person might change any of these (moving, marriage, new phone plan, etc.). You must use a more-or-less fixed reference point for a unique identifier, and therefore determine exactly which fields are fixed within that particular system and which are not. In addition, you will need to ensure that any fields you use as primary keys don’t (and won’t ever) contain null values.
Consider the following record, where you guess that email is the primary key, but both email and name are mutable at the source:
What if Joe changes his name to Joseph? The data will update just fine in the destination:
But what if the application allows Joe to change his email address, and he does? The source can use its insider knowledge to make the change without duplication, but the data pipeline will produce a duplicate in the destination because the data pipeline only sees the new, changed email. Because the data pipeline considers email a primary key, it will interpret the changed record as a new, additional record:
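This failure mode can be reproduced in a few lines, assuming a naive merge keyed on email; the names and addresses are illustrative:

```python
def sync(destination, records, pk):
    # Naive merge keyed on whichever field we guessed is the primary key.
    for rec in records:
        destination[rec[pk]] = dict(rec)
    return destination

dest = {}
sync(dest, [{"email": "joe@example.com", "name": "Joe"}], pk="email")

# Joe renames himself to Joseph: the key is unchanged, so the
# row updates in place and the destination still holds one record.
sync(dest, [{"email": "joe@example.com", "name": "Joseph"}], pk="email")
assert len(dest) == 1

# Joe changes his email address: the pipeline sees a brand-new key
# and creates a duplicate row for the same person.
sync(dest, [{"email": "joseph@example.com", "name": "Joseph"}], pk="email")
assert len(dest) == 2
```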
Email and name are clearly unsuited for use as primary keys for this source.
Without documented primary keys, you will need to reverse-engineer the underlying data model and functionality of the application in order to identify the fields that really are both unique and immutable.
If it turns out that there are no unique or immutable fields, then you have no choice but to treat each full record as unique. In this case, the primary key would be imputed as a hash of the entire row. This at least prevents duplication when the pipeline recovers from a failure, but not duplication caused by updates at the source.
The challenge of building an idempotent data integration system fundamentally hinges on identifying correct primary keys. Your data integration platform should stipulate that a primary key is always present, whether designated by the source, reverse-engineered from a source’s data model, or simply hashed from all values of a row.
Mandatory primary keys are not necessarily an obvious design consideration, and it can be burdensome to build and maintain a system that makes primary keys of paramount importance, especially as schemas at the source continue to change. It is tempting to build a pipeline that simply reads records and dumps them to destinations, but doing so introduces serious problems with duplication and general data integrity.
The more sustainable approach is to understand the underlying data model of each of your sources and assign the right primary keys. As your sources grow in number and complexity, the challenges associated with identifying primary keys will only grow accordingly.