Successful analytics depends on choosing the right approach to storing your enterprise data.
Data warehouses and data lakes collect business data and provide users with a platform to guide business decisions. They have distinct peculiarities and use cases. In this post, we’ll focus on what makes each one distinctive.
A data warehouse is a relational database designed for analytical queries. Analytical queries typically involve combining and summarizing values from a large number of records. For instance, an analytics team might turn a stream of events into a trendline with a count of events by day.
Data warehouses differ in design philosophy from transactional or operational databases, which perform frequent queries and updates to individual records. An example is adding, removing, and purchasing items from a cart on an ecommerce website. This basic difference in design means you must not use the two interchangeably, as they are optimized, at a very basic structural level, for fundamentally opposite kinds of operations. Another way to describe the difference is OLTP vs. OLAP.
Traditionally, data warehouses were hosted in on-premise data centers. Increasingly, data warehouses are now based in the cloud. The most advanced cloud-based data warehouses are “serverless,” meaning that compute and storage resources can be independently scaled up and down as needed. Modern cloud data warehouses have become extremely accessible to organizations with modest resources.
Modern cloud data warehouses easily integrate with business intelligence platforms, through which analysts access business data to produce reports and dashboards. Also, data warehouses offer a fine degree of user permissions and access control, an essential feature of data governance.
Data warehouses are predicated on the assumption that important enterprise data is structured, with a defined data model or database schema. Structured data follows predictable formats, is easily interpreted by a machine, and can be stored in a relational database.
A data lake, by contrast, is an object or file store that can easily accommodate a large volume of raw data in an unstructured format, such as free-form text, images, videos and other media, as well as structured data. This is the kind of data that might be stored in a NoSQL database, or perhaps in no database at all, before landing in a data lake.
The most basic use of a data lake is storing data, comprehensively collecting huge volumes of data before deciding what to do with it. In this approach, the data lake is a staging area for a data warehouse.
Another use is to train a machine learning application using a very large set of unstructured training data.
The chief disadvantage of data lakes is their “murkiness.” Data lakes can be comprehensive at the expense of easily accessible content. An exceptionally disorganized and poorly governed data lake can quickly become so murky that it becomes a data swamp.
Data in data lakes can’t easily be accessed or joined using SQL or most business intelligence platforms, making it generally unsuited for use by analysts. In most cases, data warehouses are the more appropriate repository for structured business data used in analytics.
From the standpoint of data management or data governance, data lakes do not offer a fine level of user permission and access control.
What a data lake can do that a data warehouse cannot is store large quantities of media such as documents, images, videos, and audio. These media can serve as training and validation sets for machine learning models.
Some new technologies combine characteristics of both data warehouses and data lakes. Some data lakes now incorporate characteristics of data warehouses such as ACID (atomicity, consistency, isolation, durability) transactions and schema enforcement as features to make data less “murky.”
Likewise, data warehouses now sometimes support less-structured data, or data science tools and languages usually associated with data lakes such as Apache Spark and Python. A data repository that combines characteristics of a data lake as well as a data warehouse may be referred to as a data lakehouse.
In short, we can compare the different technologies like so:
|Data Warehouse||Data Marts||Data Lake|
|Best Use Case||Analytics||Analytics||Staging area|
|Access Using SQL||Yes||Yes||No|
|BI Tool Compatibility||Yes||Yes||No|
|Store Files and Media||No||No||Yes|
|Easy Data Governance||Yes||Yes||No|
|Cloud-Native||Increasingly yes||Traditionally no||Yes|
Data warehouses, data marts, and data lakes form the lynchpin of the modern data stack, a suite of tools and technologies used to make data from disparate sources available on a single platform. These activities are collectively known as data integration and are a prerequisite for analytics.
The simplest approach to data integration, and the one Fivetran recommends to most organizations, is to use a data warehouse as a data repository. The data stack consists of the following components:
Business intelligence tool
Data lakes can be used as a central repository for both structured and unstructured data, while (usually) sacrificing a relational structure. Although this data stack is possible, it is not advisable:
Business intelligence tool
BI tools typically support access to data warehouses, not data lakes. A BI reporting and dashboard tool that feeds from a data lake is likely to be a custom solution built by a data scientist or data engineer, with all of the associated expenses.
A data lake can also be used as a staging environment for data warehouses.
Business intelligence tool
Your exact configuration will depend on your exact use case, the size and composition of your company, and the skill sets of your analysts and engineers.
If all else fails, though, just start with a data warehouse.