Follow these tips to ensure that your database schema delivers maximum utility for data warehouse users, data analysts, and data scientists.
In our last blog post, we covered some of the biggest mistakes you can make in designing your database schemas. (And if you need a database schema definition, look no further.) Let’s take a look at how you can make better choices to ensure that your database schema design helps users, data scientists, and data analysts find what they need and use it well.
As we mentioned in our last post, a good entity-relationship diagram is an important part of a relational database schema: It is a visual representation of the underlying data model contained to be sent from your source to your data warehouse. This includes tables, folder structure, business teams, data labels and so forth. The provenance of a data model can be complicated, and this diagram can help by illustrating the potential multitude of tables and interrelationships that support a particular data model. Finally, the diagram should also be a living document and reflect the current state of your databases as they change.
Your data schema should be flexible enough to evolve with your changing needs. And you should be able to make these updates with some understanding of the consequences of how your data infrastructure will change too. This is why you need to update your relationship diagrams, as mentioned above: Your data is a living, breathing organism.
While you don’t have to go full-out Tony Robbins, you should have a firm grasp of your data’s purpose in life. You’ll need to know what it will be used for and what business decisions it will enable in order to design the appropriate data structures, anticipate the volume of data queries, choose the best possible database engine and other environmental and management issues. A schema should satisfy multiple goals, including reducing redundant data, enforcing data consistency, ensuring data integrity and so forth.
A good way to determine purpose is to mock up some sample reports that your stakeholders will need. If you start with the end product in mind like this, you should have a better understanding of the shape of your data needs. This also applies to data dashboards. You want the dashboards not to just look pretty, but provide actionable insights that managers can use in their daily jobs.
As we hinted at in our last blog post, the more effort you put into upfront planning, the easier it will be to produce the most effective database. Understand who the data consumers and the target audience are for the reports that will be generated from this data ahead of time. While it is great to be able to create ad-hoc queries and reports, your users still will benefit from creating a few samples as part of this planning effort to guide your plans.
Make sure your various data abstraction layers, application interface, and data feeds are useful to the data users and analysts that will produce reports and other data-oriented products. Engineers don’t typically approach issues the same way that data analysts do, so having both of these as part of your schema design team is important.
If you have designed your schema properly, you will have just the right amount of indexing for the different types of queries used by your analysts. You can have too many indices or too few, and neither situation is optimal. Figuring out the right balance will take some effort and experimentation.
You should label your fields, tables, and other data elements with meaningful names so that everyone who uses the data can figure out what these elements mean at first glance. This also means keeping these names consistent across your entire database. Don’t use system-reserved labels for column fields or table names. Don’t use hyphens or other punctuation marks that will just confuse things or require special programming to avoid errors. Keep your names short without any necessary modifiers. Finally, if it isn’t obvious what something is from its name, think about changing it to something else.
The concept of “security by design” has been around for decades. If this term is unfamiliar to you, take a moment to review what it means and how you can implement it. Part of this design means not giving administrative rights to every user and every developer, but rather ensuring that each user has the right access level that is appropriate for their needs. An entire other topic is ensuring that your sensitive and confidential data is protected by the right amount of encryption and that it can’t be exported by malicious hackers.
Because your schema is a living organism, you want to ensure that it has a happy and useful life long after the creators have moved on to other jobs or other employers. The best way to ensure this is to produce a careful and complete document that explains your choices, has plenty of comments in the code, and other information. Have you explained the relationship among the various fields in each table, or the relationship among your various tables?
Follow these best practices, and your database schema will be maximally useful — and that, in turn, will help ensure that you get the most utility, actionable insights, and life out of your data warehouse.