Principles of Good Schema Design
A schema is a blueprint that represents the tables and relations of a data set. Good schema design is essential to making your data tractable so that you can make sense of it and build the dashboards, reports, and data models that you need.
Raw data feeds from the APIs of SaaS apps are seldom organized in a way that is readily usable to analysts. Oftentimes, data from APIs are organized to accommodate general analytics use cases predicted by the provider. This often involves pre-joined tables and pre-calculated fields, i.e. denormalization.
At Fivetran, we serve a multitude of customers with different use cases, and strive to build a solution that can accommodate as many as possible. Our philosophy is that any additional customization specific to your use case should be performed by you at the warehouse level. In addition, we optimize the throughput of every sync to preserve bandwidth and shorten sync times.
The Fivetran approach to designing schemas overcomes these challenges through:
- A clear understanding of the application’s underlying data model
- Normalization to create a single source of truth and maintain data integrity
- Constructing an entity-relationship diagram (ERD) that clearly represents the following types of relationships:
- Standardization to allow templated analytics products to be built on top of schemas
These principles represent a substantial investment of time, effort and expertise.
Understanding the Underlying Data Model
The first and most important step to leveraging data from an application is to understand the underlying data model. Unfortunately, many APIs are poorly designed and documented, and some API endpoints can be both complicated and inconsistent. Sometimes, the columns representing the same data are named inconsistently between tables. In any case, data from API endpoints tends to arrive in a non-tabular format like JSON.
There are several ways to build an understanding of how an app behaves. During schema development at Fivetran, our developers:
- Read the documentation
- Use the app themselves
- Speak with the users of the app
- Speak with the developers of the app (and/or the API)
Every SaaS app implicitly contains a representation of the world, inasmuch as the world consists of organizations, people, transactions and other common business concepts. Understanding what data columns correspond to what real-world equivalents is essential to making sense of the data.
A broader discussion of the normal forms is beyond the scope of this article, but, in a nutshell, normalization strives to eliminate duplicate, redundant and derived data values.
At Fivetran, we often define our data integration workflow as ELT (extract, load, transform), but in reality it’s more like E-t-LT. Fivetran performs a small transformation for all of our app connectors that involves a modest amount of cleaning, massaging and reorganizing before the data is loaded into the data warehouse.
Normalization is only possible after a large, upfront investment of work to understand the underlying data model of the app. The knowledge that is painstakingly gleaned from API documentation, experience using the app and consultation with the app developers, helps us understand the business logic performed by the app and how it represents real-world concepts. Then, we identify data columns that are redundant, duplicative, or derived from other values, disaggregate or eliminate them, and construct an entity-relationship model.
Entity-relationship diagrams (ERDs) are visual representations of schemas or entity-relationship models. They represent the interrelations between tables. The ERD below is what Fivetran builds for the Asana connector.
Every table has a primary key that denotes the basic entity that the table represents and uniquely identifies each row in that table. Foreign keys are primary keys from another table. They connect the entity represented by the primary key with another entity represented in another table. Using primary and foreign keys, we can represent the following types of relationships:
- One-to-one. Two entities can only map onto each other and no other elements. A real-world example would be Social Security numbers, which only ever map onto one person. In Fivetran ERDs, if we encounter a nested object that has a one-to-one relationship with the main table, we simply flatten its data into columns.
One-to-many. One entity in a table can correspond with multiple records in another table, but not vice-versa. An example of one-to-many might be a list of favorite colors and the customers who like them. Each color can have many devoted customers, but every customer only has a single favorite. If we encounter a nested object that has a one-to-many relationship to the main table, it is turned into a separate table.
- Many-to-many. These relationships are represented in join tables. The composite primary key in a join table consists of the primary keys of the two related entities. For instance, a person’s shopping habits might bring them to many stores, and each store will have many customers.
The ERD of a normalized schema allows the app’s underlying data model to be represented in a clear and concise manner. All Fivetran ERDs can be found in our documentation.
Data from a normalized schema is not necessarily useful for analytical purposes out-of-the-box, as the data has been arranged on the basis of real-world logical relationships rather than the computed and derived values commonly used for reporting, analytics and training algorithms.
Luckily, while there are as many ways to design a denormalized schema as there are opinions, there are only a handful of plausible normal forms. Standardizing on normalized schemas is a way to create the lowest common denominator that can be later transformed to enable virtually every use case.
Since the KPIs of many business intelligence dashboards, such as sales and marketing funnels, revenue reports, customer attribution and so forth are well known and similar across customers, it is possible to transform the data into a denormalized form for such use cases. That is what we at Fivetran are doing with our “Analytics Templates,” which are a set of tools to help our analyst users denormalize the Fivetran data with SQL and dbt (data build tool) into a form that meets their specific analytical needs.
Good schema design requires a substantial investment of time and talent. Availing yourself of a standardized solution allows you to leverage the expertise of people who have made themselves intimately familiar with the data you want so that you don’t have to.