I’ve been a Fivetran developer since close to the beginning of our journey from the corner of someone else's office to one of the principal players in the world of data pipelines. I work on our database connectors, a product we're happy with precisely because it’s the result of the many failed experiments that led us to our current database replication approach.
Going back to 2015, which could be described as our dark ages, we realized that some of our customers' most important data sources were databases, so we started building connectors for them.
At that stage, our database connector efforts were no more sophisticated than an average homebrew data pipeline syncing a Postgres or MySQL database:
- We had yet to optimize our data extraction methods according to the underlying storage mechanisms of the database.
- We had no concept of capturing deleted data or attempting to piece together a history of the data.
- Finally, we had yet to incorporate the zillions of little fixes and tweaks that make our current connector so much more effective than our early efforts.
Perfecting our data replication approach
To give some concrete examples of the problems one might face while trying to figure these things out, we can first look at our Postgres connector. If you had to build a database connector, how would you do it?
You might start by thinking, "Well, I guess I'll need to select everything that's in the table ... and once that's done, find some sort of cursor that works to track our progress, like a date column." That's a fine first step, but what if some of your tables don't have a handy date column you can use as a cursor? You'll have to use something every Postgres database has — and a transaction ID could do the trick. So every time you want to sync from the table, you'll have to first retrieve the max transaction ID from the last sync and then sort the whole table by transaction ID. That's ... okay. It's slow to sort a table, but at least you're not loading everything every time. You've got a cursor that works.
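To make that concrete, here is a minimal sketch of what such a sync might look like in Python, using Postgres's hidden xmin system column as the transaction-ID cursor. Everything named here is illustrative: the connection string, the last_synced_xid bookkeeping and the load_into_warehouse helper are made up, and real transaction IDs wrap around at 2^32, so production code can't compare them this naively.

```python
# A naive "transaction-ID cursor" sync, for illustration only.
# Assumes psycopg2; DSN, last_synced_xid and load_into_warehouse are
# hypothetical. Real xids wrap around at 2^32, so this comparison is
# not safe in production.
import psycopg2

DSN = "dbname=app user=replicator"  # hypothetical connection string
last_synced_xid = 0                 # cursor saved from the previous sync


def sync_table(table):
    global last_synced_xid
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        # Sort the whole table by xmin and pull only the rows whose last
        # modifying transaction is newer than the saved cursor.
        cur.execute(
            f"SELECT xmin::text::bigint AS xid, * FROM {table} "
            "WHERE xmin::text::bigint > %s "
            "ORDER BY xmin::text::bigint",
            (last_synced_xid,),
        )
        for xid, *row in cur:
            load_into_warehouse(table, row)  # hypothetical loader
            last_synced_xid = max(last_synced_xid, xid)
```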
Success! Kinda.
We can think of this approach as "snapshot-style replication." By this, I mean we look at a snapshot of the table, and record how it was at that time. It’s kind of like taking a picture of a busy street every minute so you can count how many cars are on the road to understand traffic through an intersection. But what if some cars move through the intersection so fast, they don't spend a whole minute inside of it? They might never get captured at all!
What if someone says to you, "I want to learn which rows were deleted from this table!" That’s tricky, because you’d have to run snapshot syncs more and more frequently, and you could never run them fast enough to know for sure that you caught every row. Or, trickier still, what if someone says to you, "I want to see a full history of how a row changed"? You simply can’t do it, because values can flit in and out of existence faster than your syncs could ever capture them.
So you’re stuck. What now?
The preferred approach: Log-based replication
This brings us to log-based replication, which is the real key to fully replicating a source database. This approach solves many of the problems we spoke of before. Let’s see how by restarting this exercise, but settling on the log-based replication method instead of the transaction-ID method.
You want to build a database connector. But, having heard of the problems inherent in transaction-ID-based methods, you do a bit more research to begin with. You learn about this fancy log present in Postgres databases: the write-ahead log (WAL). It records all the DML operations in the database. (DML is short for data manipulation language, and is a term people use to refer to changing data values when they want to confuse people and feel smart.)
Fantastic! You set up logical decoding on the WAL and start reading changes from it. The transaction history is laid out before you, and like an omniscient deity, you watch history unfold as the pale glow of your monitor casts your triumphant silhouette on the wall behind you at 1 a.m. (Perhaps this experience is mine alone.)
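If you want to see what "reading from the log" looks like in practice, here's a rough sketch using Postgres's built-in logical decoding functions and the test_decoding demo plugin that ships with Postgres. The slot name and connection string are invented, the server has to run with wal_level=logical and a user that has replication privileges, and a real connector would use a sturdier output plugin and manage the slot's lifecycle far more carefully.

```python
# A rough sketch of reading change events out of the WAL via logical
# decoding. Assumes psycopg2, wal_level=logical and replication
# privileges; the slot name is illustrative and test_decoding is the
# demo output plugin that ships with Postgres.
import psycopg2

DSN = "dbname=app user=replicator"  # hypothetical connection string

conn = psycopg2.connect(DSN)
conn.autocommit = True
with conn.cursor() as cur:
    # One-time setup: a replication slot that decodes WAL records into
    # readable change events and remembers how far we've read.
    cur.execute(
        "SELECT pg_create_logical_replication_slot(%s, %s)",
        ("demo_slot", "test_decoding"),
    )
    # Each call returns (and consumes) everything committed since the
    # last read: every INSERT, UPDATE and DELETE, in commit order.
    cur.execute(
        "SELECT lsn, xid, data FROM pg_logical_slot_get_changes(%s, NULL, NULL)",
        ("demo_slot",),
    )
    for lsn, xid, data in cur:
        print(lsn, xid, data)  # e.g. "table public.users: UPDATE: id[integer]:42 ..."
```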
Now it’s possible to do things like update your tables incrementally without scanning the full table to find the right transaction IDs, which vastly improves the connector's performance and slashes its resource usage on the source database. With this method, you don’t need to rely on specialized gimmicks like date columns or auto-incrementing integer primary keys.
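To sketch the apply side, suppose each decoded event has already been parsed into a small record carrying the operation, the table name, the primary-key values and the new row values. That shape is invented purely for illustration (it isn't Fivetran's internal format or anything Postgres emits directly), but it shows why no gimmicks are needed: the log hands you the key and the operation, and you simply apply them.

```python
# A toy apply step for parsed change events. The event shape and the
# in-memory destination are purely illustrative, not any real format.
from typing import Any

Destination = dict[str, dict[tuple, dict[str, Any]]]  # table -> pk -> row


def apply_change(dest: Destination, event: dict[str, Any]) -> None:
    table = dest.setdefault(event["table"], {})
    key = tuple(sorted(event["key"].items()))
    if event["op"] in ("INSERT", "UPDATE"):
        # Upsert by primary key: no date column or auto-incrementing
        # integer needed, because the log tells us exactly what changed.
        table[key] = event["row"]
    elif event["op"] == "DELETE":
        # Deletes show up in the log explicitly, so we can drop the row
        # (or mark it deleted if we want to preserve history).
        table.pop(key, None)
```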
You can track deletions to get a view of your currently active records, or, as Fivetran is developing right now (!), track full histories of tables. Your DBA sends you flowers, and your analysts pop champagne bottles when you walk into the office.
You realize that this is a dream, not because everyone is actually paying attention to an engineer, but rather because you’re in an office, and it’s 2020. You awake from your dream, ready to build a better database connector.