Sync data incrementally, or watch your pipeline grind to a halt.
A key feature of an efficient data pipeline is the ability to incrementally update. The alternative to incremental updates is to sync entire datasets every time you update from a source.
Full syncs are necessary to capture all data during an initial sync, and occasionally to fix corrupted records or other data integrity issues. However, full syncs are inappropriate for routine updates because they:

- Often take a long time, sometimes to the tune of hours, days, or even weeks for exceptionally large data sources. This makes full syncs incapable of providing timely, let alone real-time, updates. The common practice of daily snapshots exemplifies this point; the term "daily" itself describes the slow cadence of full syncs.
- Can bog down both the data source and the destination, consuming resources otherwise needed for operations and analytics, respectively.
- Consume excessive network bandwidth.
As data sources grow and add records, these problems naturally grow with them.
To perform incremental updates, you must be able to identify changes made to a previous state of the source. This practice is often referred to as change data capture. There are several methods of change data capture, but two of the most prominent methods use changelogs and last-modified timestamps.
Changelogs are most commonly used to make incremental updates from databases. The initial sync from a relational database is always a full sync using a SELECT * query, as it must return all records and all values. By contrast, subsequent syncs should not retrieve all records and all values.
Instead, a data connector for a database should query the changelog. Changelogs contain a full history of updates, new records, and deleted records. These entries may be whole-row, in which a record's unique primary key and all values are tracked, offering a full snapshot of each record. They may also be partial-row, in which only the primary key and changed values are tracked.
The sync workflow looks like so:

1. Import a full table using a SELECT * query. This provides a snapshot of the table's entire contents, primary keys and all, at a particular moment in time. Make note of the primary key column; it will later be essential for attributing updates correctly and preventing duplicate or conflicting records.
2. Find the position in the changelog that corresponds with the time of the snapshot. It is usually impractical to find a position that corresponds exactly, so in practice you should resume from a somewhat earlier position. If you update from a position later than the snapshot, you will miss updates.
3. Start reading the changelog from the position you identified in the previous step. As you progress through the changelog, based on the identified primary keys:
   - Merge updated rows with the corresponding records in the destination
   - Add new records as new rows with new primary keys
   - Remove deleted records from the destination
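The apply loop in step 3 can be sketched as follows. The entry format here (an operation type, a primary key, and changed values) is a hypothetical simplification; real changelogs, such as the MySQL binlog or the Postgres WAL, carry richer structures.

```python
def apply_changelog(destination, entries):
    """Apply changelog entries to a destination table keyed by primary key."""
    for entry in entries:
        op, pk, values = entry["op"], entry["pk"], entry.get("values", {})
        if op == "delete":
            destination.pop(pk, None)          # remove deleted records
        elif pk in destination:
            destination[pk].update(values)     # merge a (possibly partial-row) update
        else:
            destination[pk] = dict(values)     # add a new record under its new key
    return destination

# Initial snapshot, as if imported with SELECT *
table = {1: {"name": "Ada", "city": "London"}}
log = [
    {"op": "update", "pk": 1, "values": {"city": "Cambridge"}},
    {"op": "insert", "pk": 2, "values": {"name": "Grace", "city": "DC"}},
    {"op": "delete", "pk": 1},
]
apply_changelog(table, log)
# table is now {2: {"name": "Grace", "city": "DC"}}
```

Because every entry is routed by primary key, replayed updates overwrite the same row rather than creating duplicates, which is what makes reading from a slightly earlier log position safe.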
The incremental data sync workflow depends on idempotence. Without idempotence and the ability to properly attribute rows using primary keys, reading from a changelog position that predates the snapshot can introduce several versions of a row, one for each update that occurred in the interval between the log position and the snapshot. This creates a serious danger of duplicate and conflicting rows.
API data sources may not have changelogs, but may instead have timestamps that indicate when a record was last changed. The goal is to identify, extract, and load records that have been added or updated since the last sync.
The process of incrementally updating from a timestamp looks like so:

1. For each entity, find the "last modified timestamp," "last modified date," or similar attribute that denotes when a record was last updated.
2. After the initial full sync, make subsequent updates by retrieving records whose timestamp is "greater than" the latest value seen in the previous sync, called a cursor. The cursor is advanced with every subsequent sync.
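The cursor pattern can be sketched as below. `fetch_records` is a hypothetical stand-in for an API call that filters on a "last modified" attribute; ISO 8601 timestamp strings are used because they sort chronologically.

```python
def incremental_sync(fetch_records, cursor):
    """Fetch records modified after the cursor; return them and the new cursor."""
    records = fetch_records(modified_after=cursor)       # "greater than" filter
    if records:
        cursor = max(r["modified_at"] for r in records)  # advance the cursor
    return records, cursor

# Example source data with last-modified timestamps
source = [
    {"id": 1, "modified_at": "2024-01-01T10:00:00"},
    {"id": 2, "modified_at": "2024-01-02T09:30:00"},
]

def fetch_records(modified_after):
    # Simulates the source-side filter on the "last modified" attribute
    return [r for r in source if r["modified_at"] > modified_after]

records, cursor = incremental_sync(fetch_records, "2024-01-01T12:00:00")
# records contains only id 2; cursor advances to "2024-01-02T09:30:00"
```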
Timestamps offer a simpler, arguably more intuitive approach to incremental updates than changelogs, and are more commonly used. The downsides to using timestamps are that they:

- Don't track deletes, unlike changelogs, which track actions; a deleted record is simply absent.
- Impose a heavy load on the source, because you have to query and scan an entire table or feed, leading to slowdowns.
- Present a high fixed cost for each update, because you must scan tables in their entirety. By contrast, the cost of updating from a changelog is directly proportional to the number of changed rows.
- Only give you access to a snapshot of the data at any given time. If you ever need to track every change ever made to a record, you can't use this method.
Although incremental updates are clearly advantageous over full syncs, you can still encounter serious engineering challenges when using them.
One of the challenges is that of scale. If a data source grows quickly enough, even incremental syncs can become so large and lengthy that the duration of a sync exceeds the scheduled update interval, creating a form of growing data sync debt. Suppose, for instance, that a data connector is scheduled to sync incrementally every five minutes, but the size of the sync is such that it takes seven minutes. Once seven minutes have elapsed, we are two minutes late starting the next sync, which means two minutes of extra data has accumulated at the source. This adds records to the next sync, which will then take even longer than seven minutes. Sync durations will continue to snowball as more and more data accumulates. The solution to this problem is to write more performant code and more efficient algorithms to minimize sync times.
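The snowball effect can be illustrated with a toy model, under the simplifying assumption that sync duration is proportional to the data accumulated since the previous sync began. The five-minute interval and seven-minute first sync follow the example above.

```python
def sync_durations(interval=5.0, first_duration=7.0, rounds=5):
    """Model sync durations when duration is proportional to accumulated data."""
    rate = first_duration / interval  # minutes of sync time per minute of data
    durations = [first_duration]
    for _ in range(rounds - 1):
        # Each sync must cover the data that accumulated while the
        # previous sync was still running.
        durations.append(durations[-1] * rate)
    return durations

print([round(d, 1) for d in sync_durations()])
# Durations snowball geometrically: 7.0, 9.8, 13.7, 19.2, 26.9 minutes
```

Because the backlog compounds by a constant factor (7/5 here), the durations grow geometrically rather than linearly, which is why a small overrun becomes unsustainable quickly.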
Another challenge concerns the granularity of timestamps. Suppose the “last modified” field in an API feed is only accurate to the second, and your latest timestamp was 5:00:01 PM. You have no way of knowing where between 5:00:01 PM and 5:00:02 PM you ended. If you use the “greater than” approach described earlier, your next sync is likely to miss some records that were updated in the interval from 5:00:01 PM to 5:00:02 PM. These missing or outdated records are easily overlooked until they cause serious problems, because so few records tend to be affected at any particular time. It can be impossible to recover the missing data. Here, as with changelogs, you must use “greater than or equal” logic. This will likely introduce duplicate records into the workflow, meaning you must once again rely on idempotence.
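The "greater than or equal" approach, combined with primary-key upserts, can be sketched as below. Names and the one-second timestamp granularity are illustrative.

```python
def sync_with_overlap(fetch_records, destination, cursor):
    """Re-read records at the cursor boundary, relying on upserts for safety."""
    # ">=" re-fetches records sharing the cursor timestamp, so boundary
    # records can't be missed; some will duplicate rows we already have.
    for record in fetch_records(modified_on_or_after=cursor):
        destination[record["id"]] = record   # idempotent upsert by primary key
        cursor = max(cursor, record["modified_at"])
    return cursor

source = [
    {"id": 1, "modified_at": "17:00:01"},
    {"id": 2, "modified_at": "17:00:01"},  # same second as the saved cursor
    {"id": 3, "modified_at": "17:00:02"},
]

def fetch_records(modified_on_or_after):
    return [r for r in source if r["modified_at"] >= modified_on_or_after]

dest = {1: {"id": 1, "modified_at": "17:00:01"}}  # id 1 was already synced
cursor = sync_with_overlap(fetch_records, dest, "17:00:01")
# dest holds ids 1, 2, and 3 exactly once; cursor is "17:00:02"
```

Record 2, which shares the boundary second with the cursor, would have been missed under strict "greater than" logic; the upsert keeps the re-read of record 1 harmless.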
Sometimes, APIs contain records that are modified at a particular timestamp but aren’t available to read until after that timestamp has already elapsed, leading them to be missed. You may have to use your discretion to determine how many records you are willing to resync in order to catch late arrivals.
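One common mitigation, sketched here with illustrative names, is to subtract a fixed lookback window from the cursor so that late-arriving records are re-read; as before, upserting by primary key keeps the re-reads harmless.

```python
from datetime import datetime, timedelta

# How much history you are willing to resync to catch late arrivals;
# the ten-minute value is an arbitrary example.
LOOKBACK = timedelta(minutes=10)

def effective_cursor(cursor: datetime) -> datetime:
    # Widening the window trades extra re-reads for fewer missed records.
    return cursor - LOOKBACK

print(effective_cursor(datetime(2024, 1, 1, 17, 0, 0)))
# 2024-01-01 16:50:00
```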
Finally, sometimes a data source, or parts of it, will lack both changelogs and timestamps. In the best-case scenario, you might be able to identify a reasonable proxy for a timestamp, such as a transaction ID or serial number. In the worst-case scenario, you may be forced to schedule full syncs for part or all of the data source.