13 Aug 2018 | Analyst Recipe

Why You Shouldn't Use Postgres or MySQL as Your Data Warehouse

George Fraser
George Fraser
Why You Shouldn't Use Postgres or MySQL as Your Data Warehouse
Columnar databases, not OLTP databases, are the appropriate choice for a data warehouse.

If you're an experienced developer or DBA, you've probably spent a lot of time working with one of the major online transaction processing (OLTP) databases: - MySQL - Postgres - MSSQL - Oracle

You've become an expert in how to optimize your favorite database, how to write queries that execute fast, and how to troubleshoot slow queries. At some point, your company will probably want to set up a data warehouse and a business intelligence (BI) tool like Tableau or Looker. Your first instinct is probably to spin up another instance of the database you already know and love, and use it as your data warehouse. This is a huge mistake.

For BI workloads, you should use a columnar database like Redshift, Snowflake, or BigQuery, even if you have no prior experience with columnar warehouses. Columnar databases are optimized for the style of queries generated by BI tools; OLTP databases are optimized for the style of queries generated by apps. This is not a question of one database being "faster" or "slower" than another; this is about trade-offs. BI-style queries are so different from OLTP-style queries that it's impossible to create a single database that is good at both [1]. A typical OLTP-style query looks like this:

select *
from github.issues
where number = ?

The key thing to notice is that we are selecting many column but only a single row. OLTP databases optimize for this scenario by creating indices. In this example, we might create an index on the column number, which would allow us to quickly look up a single row if we know its issue number.

In contrast, BI-style queries look like this:

select author, count(*)
from github.issues
where status = 'open'
group by author

Here we are selecting only a few columns, but many rows: sometimes all the rows in the table. Also, notice the group by clause: we are aggregating a large number of rows down to a small number of rows. Columnar databases are optimized for this style of query. It's not a small difference: a columnar database can easily be 1000x faster for a BI-style query than an OLTP database running on the same hardware. Furthermore, columnar databases are all designed to scale-out across multiple nodes, which can give you additional orders-of-magnitude increase in performance.

OLTP databases and columnar data warehouses look similar on the outside: they both process SQL queries, they both have relational schemas. But on the inside, they are totally different creatures. If you're setting up a database and you're planning to use it as a BI data warehouse, you should use a column store [2]. Future generations of analysts will be glad you did.


[1] In principal, it's possible to optimize for both styles of queries by operating two databases under the hood and present the user with the appearance of a single database. There are several commercial databases that implement this strategy to various degrees. However, in practice you almost always want to keep your BI database separate from your OLTP database, because the OLTP database has much higher availability requirements. Thus, a single database that supports both styles is not as useful as you would expect.

[2] The big 3 major cloud providers each offer a columnar data warehouse: Amazon Redshift, Azure SQL DWH, and Google BigQuery. Fivetran supports all of them as destinations.

Are You A Data Expert?

Get started with a free trial today.

Discover the smartest solution for data-driven results.