Use schemas to make life easier for your analysts and engineers.
The word “schema” is used in many ways across different disciplines. In the context of databases, a schema is a blueprint that organizes data into tables with columns, keys and interrelations.
Every application contains a representation of reality called a data model that specifies the data necessary to the functioning of the application, including how users interact with it. A logical database schema translates that data model into something that is suited for storage in a relational database.
Schemas are important to data integration and analytics, as well. Since data warehouses are a type of relational database, data stored in a data warehouse must also be organized into schemas. When a data pipeline extracts and loads data into a data warehouse, it performs preprocessing, such as cleaning and normalization, to make the data consistent and legible, and then populates the schema. This ensures that your data can be easily understood and turned into dashboards and reports by your analysts.
Here, we will discuss the importance of schemas in the context of data integration and analytics.
Whether in an application or a data warehouse, a well-designed database schema clearly organizes data into distinct entities with their appropriate attributes, shows the interrelationships between entities, and imposes constraints such as data types.
A well-designed schema in a data warehouse makes life easier for analysts in the following ways:
It largely removes cleaning and other preprocessing from the analyst’s workflow.
It absolves analysts of the need to reverse-engineer the underlying data model of an application.
It provides analysts with a clear, easily understood starting point for analytics projects.
These benefits free analysts to focus on modeling and transforming data in order to produce reports and dashboards.
Raw data from an application or an operational system is seldom analytics-ready. Engineers who design and build API feeds seldom approach issues in the way that analysts do. Instead, most data sources deliver data in whatever structure is most expedient from the perspective of the SaaS provider. At best, the provider might make strong assumptions about the specific metrics their users will want, and organize their data models accordingly. This often creates redundancy and a lack of clarity.
To overcome this redundancy and lack of clarity, you must reverse-engineer the underlying data model of the data source and normalize it.
A broader discussion of the normal forms is beyond the scope of this article, but, in a nutshell, normalization strives to eliminate duplicate, redundant and derived data values. Specifically, to normalize a data set, you must identify data columns that are redundant, duplicative or derived from other values, disaggregate or eliminate them, and construct an entity-relationship model that clearly organizes the attributes and relationships of entities in the data model.
There are as many ways to model data as there are opinions, but there are relatively few ways to normalize a data set. By normalizing a data set, you strip it down to a succinct but comprehensive expression. Schemas can be visually communicated as entity-relationship diagrams (ERDs). A normalized schema is illustrated below:
ERDs illustrate the tables, fields, interrelationships, and keys between tables
Data from a normalized schema is not necessarily useful for analytical purposes out of the box, as analytics use cases often depend on computed and derived values. As a result, normalization is inextricably tied to the extract-load-transform (ELT) process. In ELT, you must perform transformations to turn data into analytics-ready models after the data has been loaded to the destination.
In order to build and populate a good schema in a data warehouse, you will need to:
Fully understand the data source’s underlying data model.
Sketch, and continuously iterate on, an entity-relationship diagram (ERD) to visually communicate the various interrelations (one-to-one, one-to-many and many-to-many) in the data model. You will also have to apply a consistent naming convention to your tables and fields.
Design a normalized schema in the destination. Draft and populate the tables from your ERD. You have succeeded when you can import and recognize all of the fields.
Productionize the actual software to extract the data, preprocess it into the normalized schema, and load it into the destination.
Publish your final ERD as a reference for your analysts.
The most labor-intensive stage of the process is gaining the necessary understanding of your data source. The most obvious step is to study the documentation for your data source, but not all documentation is comprehensive or up to date. You may be forced to actually use the application to supplement what you can glean from the documentation.
The process described above applies whenever the data source changes. This can become tricky if tables or columns are deleted or reorganized, as you can easily destroy data that might be useful in audits or historical analyses. Data integration should be net-additive or lossless.
Good schema design requires a substantial investment of time and talent. Availing yourself of a standardized solution allows you to leverage the expertise of people who have made themselves intimately familiar with the data you want so that you don’t have to.
To experience firsthand how Fivetran automates data integration, sign up for a free trial.