Michael Kaminsky of Gradient Metrics takes us through a benchmark comparison of data architectures.
Many have wondered, and some have even claimed to know, which data-organization style provides the best performance in a data warehouse. The two schools of thought are, roughly:
Denormalize the data into one-big-table (OBT) so that the warehouse never has to do any joins on-the-fly
Maintain a star schema that can take advantage of sort keys on the dimension tables
Before writing this blog post, I was very much on team "star schema" -- I had been taught that organizing data in a star schema was critical to performance in analytic data warehouses. It turns out, I was wrong.
The first sign that I was wrong was when I wrote this slack message in the dbt slack channel:
If I had learned anything from my history of making statements that claim 100% certainty, I'd know that this is a strong predictor of me being wrong and having to come back and eat crow. So here we are.
The objective of this analysis is to understand the performance implications of these different warehouse distribution patterns under normal BI-style workloads within a given warehouse. That is, we aren't trying to benchmark warehouses against each other or understand their relative performance and cost tradeoffs. We want to understand how different data architecture patterns perform once you've chosen which warehouse to use.
In particular, this analysis is focused on architecture patterns to support business-intelligence style workloads, not necessarily the query performance of miscellaneous, arbitrarily complex ad-hoc queries. The way many people build their warehouses today (using an ELT paradigm with a tool like dbt), the star schema is constructed at the end of an ELT run and is explicitly designed to support BI-type queries in tools like Looker or Periscope. With that in mind, the queries that we use to test these different distribution styles are not especially complex as they're intentionally designed to reflect the common sorts of queries that are run by a BI tool -- aggregating measures over a variety of different dimensions, occasionally with a CTE or window function thrown in.
You can review the queries we used for the test here.
For all three of the warehouses we tested, Redshift, Snowflake, and Bigquery, using a single denormalized table instead of a star schema leads to a substantial improvement in query times. The speed improvement of using a single denormalized table represents an improvement of 25%-50% depending on which warehouse you're using. This amounts to a difference of about 10 seconds on a single-node cluster in Redshift. Excluding redshift query compilation time, the improvements are:
Redshift: 25%-30% (depending on warehouse size and number of clusters)
For the redshift results, we present data from runs using both a large multi-node cluster as well as a small single-node cluster. We also split the results between the first time a query was executed (which will include the time Redshift needs to compile the query) as well as subsequent runs that only include compute time
Here we can see that the OBT (denormalized) model out-performs the star-schema model in all but one of the 10 queries we tested. With the exception of the query-4 enigma, the denormalized table out performs the star schema from 10% to 45% depending on the query.
For Snowflake, the results are more mixed. While the OBT (denormalized) model is definitely faster than the star schema in the slowest of queries (queries 8, 9, and 10), the star schema actually does appear to out-perform the OBT model in some of the simpler queries (namely 3, 4, and 7). Note that these queries include query compilation time.
I do not have a good enough intuition for the inner-workings of snowflake to cast any light on why this might be happening, but if any of our readers are Snowflake experts we'd love to hear your hypotheses!
For BigQuery, the results are even more dramatic than what we saw in Redshift -- the average improvement in query response time is 49%, with the denormalized table out-performing the star schema in every category. Note that these queries include query compilation time.
One thing that's interesting to note is how dramatically different the variances in the query response times are between the two different distribution styles -- the star schema has a much higher variance in query response time which I assumes has to do with how bigquery is planning the execution under the hood (but I'm definitely not a BQ expert, so would love someone with more knowledge to weigh-in on what's going on here).
This comparison was made using a subset of the data from the TPC-DS benchmark, kindly made available by the folks at Fivetran. For all analyses, we used the TPC-DS "100" data.
dc2.large with 1 node
dc2.8xlarge cluster with three nodes
X-Large warehouse (16 servers)
I used whatever the default configuration comes with a fresh warehouse
We make use of the following tables:
For the star schema, I just kept these tables as-is (distributing the fact table by
ss_item_key and distributing the dimension tables across all nodes. In redshift, I distribute this by
ss_item_key as well.
For Snowflake, we exclude the initial run for the queries that serve the purpose of "warming the cache" by reading the data from S3 onto the SSD that snowflake uses for caching. So while the results of the query aren't being cached directly, we do want to assess snowflake under the circumstances where the data has been read off of S3. You can read more about Snowflake caching here.
For the denormalized tables, I just do a simple join to bring everything together:
LEFT JOIN public.date_dim
ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
LEFT JOIN public.store
ON store_sales.ss_store_sk = store.s_store_sk
LEFT JOIN public.household_demographics
ON store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk
LEFT JOIN public.customer_address
ON store_sales.ss_addr_sk = customer_address.ca_address_sk
All of the code to reproduce the analysis can be found in this repo.
There are a few reasons why you might still want to consider using the star schema (or something like it):
The star schema promotes better ELT / ETL code conceptualization and organization.
The star schema is easier for end-users (analysts and other query-writers) to navigate.
The star schema takes up less disk space.
While the first two concerns are important, I think they can be handled pretty easily by staging your ELT process such that the data all get transformed into something like a star schema before everything gets re-joined back together for end-user querying.
The third point deserves more consideration, especially in a datawarehouse like Redshift -- materializing the denormalized takes up a significant amount of disk space on the cluster. Simply materializing the table bumped the disk-space usage up from a bit over 30 gigabytes to over 90.
tablename | megabytes
household_demographics | 8
date_dim | 31
store | 32
customer_address | 56
store_sales | 29778
one_big_table | 60250
And this is only a subset of the data we could have joined to
store_sales! In fact, when I initially started on this analysis task I wanted to join all of the possible dimensions onto
store_sales but couldn't because redshift ran out of disk-space (on a dc2.large cluster with 1 node)
Depending on the scale of your data, the storage cost of duplicating all of the dimensions on disk could just be too high.
Determining why the star-schema out performs the denormalized table on query 4 in the single-node cluster (but not the multi-node cluster) is left as an exercise for the reader. Mostly because I have no idea.
Because dbt doesn't have the ability to specify column compression or encoding style in Redshift, this is probably the worst-possible-case in terms of disk storage size. I suspect that with proper column encoding you could alleviate a fair amount of this issue.
Update your browser to view this website correctly. Update my browser now