A good database schema is indispensable for making data warehouses work. Get it wrong, and you’ll be in a world of hurt.
A database schema is an important part of building the best possible data warehouse. Think of it as the blueprint for your data structures, or an outline of how your data is organized and categorized. You’ll typically use a database schema to give database users, developers, and analysts a logical overview of your data, so they can obtain actionable information from your data.
Schemas can comprise physical elements, such as where the data will be stored and in what forms, and the logical structure used, such as tables, fields and relationships. They can take several different types, such as a simple two-dimensional table or more hierarchical models that look like trees with different branches or network nodes.
A good database schema design simplifies everything, for data engineers, analysts, and end-users. But a badly designed schema will make things harder for data scientists and analysts to do their jobs. Bad schemas can be massive time sinks, confusing to interpret by data users and difficult to modify and maintain. You’ll end up wasting effort and spending time doing even the simplest of tasks.
Here are a few of our worst-case schema scenarios you want to avoid. But don’t stop here — make sure you know what the database schema best practices are as well.
These diagrams should try to explain the underlying data model that is being used by your data warehouse with clarity and accuracy. That last point — accuracy — means you’ll need to ensure that you update these diagrams whenever you make changes to the data model. The bad entity-relationship diagrams are confusing, obscure these relationships, or are woefully outdated. You can find a good example of an ERD in our database schema definition blog post.
Speaking about being out of date, the digital age and the pandemic have both brought about significant changes in how we collect, consume and communicate data. If you have written your schema in such a way that either you don’t know how to update it or can’t easily make changes to it, it can be a major stumbling block in having a useful data structure.
Part of pre-planning your schema is knowing what your data is, why you want to retrieve it, and how it is going to be used. Without a clear understanding of these elements ahead of time, it’s going to be difficult to create an appropriate schema.
If you didn’t do much in the way of up-front design and planning, your schema will suffer the consequences — and anyone trying to use it to understand your data will suffer, too. It might be inappropriate, ineffective or add additional work to produce actionable reports. Another consequence of poor planning can be omitting data that could be useful to your data structures.
All of your data should follow some simple and consistent rules in storage formatting. This should also be part of your pre-planning efforts.
Consistency also applies to the various naming conventions that you will use for field and table names. If you choose from cryptic, seemingly random combinations of letters and numbers, that will just create confusion and make it difficult for everyone to find what they are looking for. Or worse yet, using the same name for similar elements in different tables — at best this creates confusion and at worst it can lead to syntax errors with your database software.
Engineers who build data structures rarely approach data management and database schema design in the way that data analysts do. While it is great to collaborate with all stakeholders — including your devops team — the wrong schema will make your data unusable.
For a relational database schema, getting key field linking right is critical. You don’t want to be updating the same data across multiple locations, and you don’t want to be caught by too many key fields — that can lengthen the processing time needed to make updates and queries.
Ideally, your choice of indexes should help, not hurt, these analytical processes too. You don’t need — and shouldn’t have — every column indexed “just in case.” Doing so only creates massive processing overhead for your database software.
That is what computers are good at. Attempting these tasks manually will quickly bog down and drive even the most precise analyst crazy. Not to mention that they are subject to human error, which might defeat the entire purpose to begin with.
Having little or no documentation, or poorly written documentation, means that the useful life of your data is limited to the job tenure of the people who created the schema. Once they have left the company, it will be difficult to recall decisions made or understand the logical flows.
Our aim at Fivetran is to make data access as easy as electricity: after all, you don’t need an engineering degree to turn on your lights. In our next blog post, we’ll talk about best practices for creating database schemas.