Data Lakes vs. Data Warehouses vs. Data Marts

Successful analytics depends on choosing the right approach to storing your enterprise data.

By Charles Wang, December 8th 2020

Data warehouses, data marts and data lakes combine business data and provide users with a platform to guide business decisions. They all have distinct peculiarities and use cases. Although different individuals and companies might define each technology slightly differently, we will describe their essential attributes.

What is a Data Warehouse?

A data warehouse is a relational database designed for analytical queries. Analytical queries typically involve combining and summarizing values from a large number of records. For instance, an analytics team might turn a stream of events into a trendline with a count of events by day.

Data warehouses differ in design philosophy from transactional or operational databases, which perform frequent queries and updates to individual records. An example is adding, removing, and purchasing items from a cart on an ecommerce website. This basic difference in design means you must not use the two interchangeably, as they are optimized, at a very basic structural level, for fundamentally opposite kinds of operations. Traditionally, data warehouses were hosted in on-premise data centers. Increasingly, data warehouses are now based in the cloud. The most advanced cloud-based data warehouses are “serverless,” meaning that compute and storage resources can be independently scaled up and down as needed. Modern cloud data warehouses have become extremely accessible to organizations with modest resources.

Modern cloud data warehouses easily integrate with business intelligence platforms, through which analysts access business data to produce reports and dashboards. Also, data warehouses offer a fine degree of user permissions and access control, an essential feature of data governance.

What is a Data Mart?

Traditionally, data marts were physically separate databases. In the modern conception, a data mart is an organizational structure within a data warehouse. Data marts may contain data from a smaller range of sources and summarized, rather than raw data. The purpose of a data mart is to make analytics convenient and accessible to specific teams and business units.

The tables in traditional data marts were often of minimal size, typically <100,000 rows, and completely siloed. The modern approach is to build logical data models, i.e. views and materialized views, directly in a cloud-based, enterprise-wide data warehouse. This a simpler, more flexible approach that does not require separate machines, enables whatever levels of access the organization deems appropriate, and leverages the capabilities of a cloud-based data warehouse.

Depending on how an organization implements its technology and organizes its analytics team, the specifics of ownership and access for a data mart can vary. In some cases, teams and business units may be wholly responsible for their own data marts, and the data marts may effectively be siloed. In other cases, boundaries and access may be looser.

Like data warehouses, data marts easily integrate with business intelligence platforms.

What is a Data Lake?

Data warehouses and data marts are predicated on the assumption that important enterprise data is structured. Structured data follows predictable formats, is easily interpreted by a machine, and can be stored in a relational database. A data lake, by contrast, is an object or file store that can easily accommodate a large volume of raw, unstructured data such as free-form text, images, videos and other media, as well as structured data. The most basic use of a data lake is to comprehensively store huge volumes of data before deciding what to do with it. In this approach, the data lake is a staging area for a data warehouse. Another use is to train a machine learning application using a very large set of unstructured training data.

The chief disadvantage of data lakes is their “murkiness.” Data lakes can be comprehensive at the expense of easily accessible content. An exceptionally disorganized and poorly governed data lake can quickly become so murky that it becomes a data swamp. Data in data lakes can’t easily be accessed or joined using SQL or most business intelligence platforms, making it generally unsuited for use by analysts. In most cases, data warehouses are the more appropriate repository for structured business data used in analytics.

From the standpoint of data governance, data lakes do not offer a fine level of user permission and access control.

What a data lake can do that a data warehouse cannot is store large quantities of media such as documents, images, videos, and audio. These media can serve as training and validation sets for machine learning models.

Some new technologies combine characteristics of both data warehouses and data lakes. Some data lakes now incorporate characteristics of data warehouses such as ACID (atomicity, consistency, isolation, durability) transactions and schema enforcement as features to make data less “murky.” Likewise, data warehouses now sometimes support less-structured data and data science tools and languages usually associated with data lakes such as Apache Spark and Python. A data repository that combines characteristics of a data lake as well as a data warehouse may be referred to as a data lakehouse.

Data Warehouses vs. Data Marts vs. Data Lakes

In short, we can compare the three different technologies like so:

When to Use Data Warehouses, Data Marts, and Data Lakes

Data warehouses, data marts, and data lakes form the lynchpin of the modern data stack, a suite of tools and technologies used to make data from disparate sources available on a single platform. These activities are collectively known as data integration and are a prerequisite for analytics.

The simplest approach to data integration, and the one Fivetran recommends to most organizations, is to use a data warehouse as a data repository. The data stack consists of the following components:

  1. Sources

  2. Data pipeline

  3. Data warehouse

  4. Business intelligence tool

Data marts are subsets of data warehouses. Although data marts as traditionally defined are obsolete, you may still use views and materialized views to divide your data into models for specific teams and business units. A stack including data marts would look like so:

  1. Sources

  2. Data pipeline

  3. Data warehouse

  4. Data marts

  5. Business intelligence tool

Data lakes can be used as a central repository for both structured and unstructured data, while (usually) sacrificing a relational structure. Although this data stack is possible, it is not advisable:

  1. Sources

  2. Data pipeline

  3. Data lake

  4. Business intelligence tool

BI tools typically support access to data warehouses, not data lakes. A BI reporting and dashboard tool that feeds from a data lake is likely to be a custom solution built by a data scientist or data engineer, with all of the associated expenses.

A data lake can also be used as a staging environment for data warehouses.

  1. Sources

  2. Data pipeline

  3. Data lake

  4. Data warehouse

  5. Business intelligence tool

If your organization is especially large, complicated and stores a lot of unstructured data, you may even have a reason to combine every type of technology:

  1. Sources

  2. Data pipeline

  3. Data lake

  4. Data warehouse

  5. Data marts

  6. Business intelligence tool

Your exact configuration will depend on your exact use case, the size and composition of your company, and the skill sets of your analysts and engineers.

If all else fails, though, just start with a data warehouse.

Start analyzing your data in minutes, not months

Launch any Fivetran connector instantly.

Adblock Detection