Data warehouses are a particular kind of database. Learn how they are uniquely suited to analytics.
In computing, as in most things in life, speed matters. No one wants to wait in front of a screen while a computer “thinks” of an answer for them.
That simple fact has driven hardware development — faster CPUs, faster networks, faster storage — and also motivates software developers, who over the years have come up with ever more intelligent ways to get data to people more quickly. Case in point: relational databases and data warehouses.
In this post, we’ll explain the difference between a database and a data warehouse.
Businesses have been using databases for almost as long as they’ve been storing data electronically. Conceptually, a database management system (DBMS) is just a way to make data accessible quickly.
When we talk about databases, we generally mean relational database management systems (RDBMS), because relational databases have had an overwhelming share of the market for several decades. Businesses use them because storing and retrieving data from an RDBMS is faster than other alternatives.
A database is an organized collection of data. In relational databases, data is organized in tables, which group together related objects. You can think of a table as a grid with rows and columns.
Each row is an instance of the object the table holds — a customer record, for instance, or transportation data.
Each column is a field of information — a customer number, a name, an address, and so on.
Tables, rows, and columns are defined by a schema, which is a definition of all the components in the database.
Databases often serve as the back end of online transaction processing applications (OLTP), or transactional databases, in which data is added, modified, and deleted one record at a time. Table data is accessed a row at a time, which means the most efficient way to store records is by row, with indexes on key fields to make it efficient to retrieve any given record.
Not all systems are transaction-based, however. Sometimes you want to see trends in data over time. To do that, you don’t need to know the values of individual records. You need aggregated information — how many sales were made, how many trips did passengers take. And, again, you want to get that information quickly.
There’s a tool for that: a data warehouse.
Like a database, a data warehouse has a relational structure, in that data is organized into tables, rows, and columns — but there’s one key difference.
While the data in a database is organized and stored by row, the data in a data warehouse is stored by columns, to facilitate online analytical processing (OLAP). Business intelligence consists of reports that aggregate many of the same kinds of records — purchases per month, or travel by destination and cost, for example. You don’t care about individual rows, you care about whole columns.
Data warehouses are columnar databases, which are organized, stored, and indexed according to column values, in contrast to the row-oriented storage used by databases. They use columnstore indexes, which, while complex to create, are simple to understand: They make it efficient to pull information from across all aggregated rows in a table at once for reporting.
You may wonder: If you already have your data in a database, should you duplicate it by copying it to a data warehouse? Isn’t it unwise to keep multiple copies of the same data?
But that’s not exactly what you’re doing when you maintain a data warehouse. Data in a database is updated frequently, one record at a time, and represents transactions and events in the real world. Data in a data warehouse is updated only in batches as new data comes in for analysis, and represents systems as a whole.
If you’re comparing data warehouses vs. databases, think of it like this: Databases show the current state of a system; data warehouses can provide a historical perspective useful for data analysis.
You could think of the data in your OLTP systems as a kind of living organism. In that analogy, the data in your OLAP system, in your data warehouse, is an X-ray — a near real-time image of a particular subset of the organism.
It’s always worth using the right tool for the job. If the job is running data analytics in the most efficient manner to get results quickly, then a data warehouse is the right tool for the job.
Since both databases and data warehouses use relational data structures, you might consider using one where you really ought to use the other. Most data professionals become familiar with databases before they work with data warehouses, and it’s always tempting to stick with familiar tools.
In the best of all possible worlds, you’d never use a database as a data warehouse. A row-based database simply won’t give you the performance you need for data analytics, especially when you have a high volume of data.
However, small organizations, or developers doing prototyping, sometimes do use a database in place of a data warehouse. If you’ve done that, we’re not going to make you renounce membership in the Association of Data Analytics Professionals, but we will encourage you not to do it anymore.
With cloud-based, modern data warehouses so cheap these days, and the fact that you pay only for the resources you use (storage and processor time, depending on the data warehouse you choose) there’s really no excuse not to use a data warehouse right from day one of prototyping. Amazon Redshift, Google BigQuery, Snowflake, and Microsoft Synapse are all excellent cloud data warehouses. One of them is probably right for your use case, and our guide to enterprise data warehouses can help you evaluate the differences.
Regardless of which cloud data warehouse you choose, you should use Fivetran to replicate data from your OLTP systems — both databases and SaaS applications — into your cloud data warehouse. Sign up today to try Fivetran for free.