How to Think About Schema Changes
Change is the only constant. Tools, technologies, and business requirements all change, and with them the schemas of your underlying data sources and your data models for analytics. So how do you handle all of these changes?
Why Schema Changes Will Ruin Your Day
Schema changes can seriously hamper your workflow, whether you put together manual reports or rely on a traditional ETL process.
If you have ever built data pipelines by hand, you have experienced the struggle firsthand, scrambling to rewrite your data ingestion code every time an API endpoint or file from one of your data sources changes.
If your company performs traditional ETL, then your data engineers know the pain, as they deal not only with periodic updates to your data sources but also with alterations to the “transformation” stage of the ETL process. This happens whenever analysts and data scientists demand new data models to meet changing business needs.
What can be especially galling about these changes is that the data fields affected by the change might be completely irrelevant to the questions you are trying to answer and the analyses you are trying to conduct, yet you must accommodate them.
The Fivetran Approach
At Fivetran, we solve the problem of schema changes in two ways. First, to address changing data sources, we automatically, incrementally, and comprehensively propagate schema changes from the source to your data warehouse. How we handle a change depends on the situation:
- If your source adds a new column, we will detect the change and add the same column in your warehouse, backfilling the data if applicable.
- If a column is removed, we won’t delete it outright from your data warehouse but will “soft-delete” it, marking future records NULL so that the old data remains as a historical point of reference.
- If a column’s data type changes, we will do our best to losslessly accept both the old and new data by retaining the old column and creating a new column with a data type that accommodates both. You can then create views off that table.
- If your source adds a new table, we will begin syncing it like any other.
- If a table is deleted in your source, we will leave it alone in your warehouse.
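The column-level rules above can be sketched in miniature. The following is a hypothetical illustration, not Fivetran’s actual connector code: it uses Python’s built-in sqlite3 module as a stand-in warehouse, adds any columns that are new in the source, and leaves removed columns in place as soft deletes (the `users` table and its columns are invented for the example).

```python
import sqlite3

def sync_schema(conn, table, source_columns):
    """Reconcile a warehouse table's columns with the source's.

    Hypothetical sketch: columns new in the source are added to the
    warehouse; columns dropped from the source are kept ("soft-deleted")
    so historical data survives, and future rows simply leave them NULL.
    """
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    warehouse_columns = {row[1] for row in rows}  # row[1] is the column name
    for col in source_columns:
        if col not in warehouse_columns:
            # New column in the source: mirror it in the warehouse.
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {col}")
    # Columns present in the warehouse but gone from the source are
    # deliberately left untouched (soft delete).
    return warehouse_columns - set(source_columns)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id, email)")
conn.execute("INSERT INTO users VALUES (1, 'a@example.com')")

# The source now has a new column "plan" and has dropped "email".
soft_deleted = sync_schema(conn, "users", ["id", "plan"])
print(soft_deleted)  # {'email'} — kept in the warehouse, NULL going forward
```

The key design point is that the sync never destroys warehouse data: it only ever adds columns or stops writing to them.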
The second way we address the problem of schema changes is to shift the “transform” stage of ETL to the end of the process, so that the data models analysts rely on are only ever refactored after the data has already been warehoused. This spares data engineers the burden of reengineering pipelines whenever business needs change, making the entire pipeline far less brittle. It’s better for analysts to build data models using SQL than to have engineers hardcode them further upstream. Some of our customers use Looker and other BI tools as an ad hoc transformation layer.
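As a hypothetical illustration of transforming after loading: an analyst can define a data model as a plain SQL view over the raw, already-warehoused table, so that when business requirements change, only the view is rewritten, never the pipeline. The sketch below uses sqlite3 as a stand-in warehouse, and the `raw_orders` table and its columns are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Raw table exactly as loaded by the pipeline, untouched by any transform.
conn.execute("CREATE TABLE raw_orders (id, amount_cents, status)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "paid"), (2, 800, "refunded"), (3, 4300, "paid")],
)

# The "transform" lives in the warehouse as a view the analyst owns.
# Changing business needs mean rewriting this SQL, not the pipeline.
conn.execute("""
    CREATE VIEW paid_orders AS
    SELECT id, amount_cents / 100.0 AS amount_dollars
    FROM raw_orders
    WHERE status = 'paid'
""")

rows = conn.execute(
    "SELECT id, amount_dollars FROM paid_orders ORDER BY id"
).fetchall()
print(rows)  # [(1, 12.5), (3, 43.0)]
```

Because the view reads from the raw table, a schema change upstream lands in `raw_orders` without breaking anything; the analyst adjusts the view’s SQL only if the change actually matters to the analysis.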
No More Pain, Only Gain
Schema changes can be vexing, if not paralyzing. Our goal is to take the pain out of coping with what should be a solved problem. This means less time and effort spent agonizing over upstream changes that might not even affect your analysis, and more time spent actually analyzing data.
If you’re tired of handling schema changes, consider giving Fivetran a try.