10 Sep 2018 | Data Strategy

Cloud Data Warehouse Benchmark: Redshift, Snowflake, Azure, Presto and BigQuery

George Fraser
Eric O'Connor
The Fivetran data warehousing benchmark compares price, performance and differentiated features for Azure, BigQuery, Presto, Redshift and Snowflake.

Updated Sept. 2018

Fivetran is a data pipeline that syncs data from apps, databases and file stores into our customers’ data warehouses. The question we get asked most often is “what data warehouse should I choose?” In order to better answer this question, we’ve performed a benchmark comparing the speed and cost of five of the most popular data warehouses:

  • Amazon Redshift
  • Snowflake
  • Azure SQL Data Warehouse
  • Presto
  • Google BigQuery

Benchmarks are all about making choices: what kind of data will I use? How much? What kind of queries? How you make these choices matters a lot: change the shape of your data or the structure of your queries and the fastest warehouse can become the slowest. We’ve tried to make these choices in a way that represents a typical Fivetran user, so that the results will be useful to the kind of company that uses Fivetran.

A typical Fivetran user might sync Salesforce, Zendesk, Marketo, Adwords and their production MySQL database into a data warehouse. These data sources aren’t that large: a typical source will contain tens to hundreds of gigabytes. They are complex: they contain hundreds of tables in a normalized schema, and our customers write complex SQL queries to summarize this data.

The source code for this benchmark is available at https://github.com/fivetran/benchmark

What Data Did We Query?

We generated the TPC-DS [1] dataset at 100 GB and 1 TB scales. TPC-DS has 24 tables in a snowflake schema; the tables represent web, catalog, and store sales of an imaginary retailer. The largest fact table had 400 million rows at 100 GB scale and 4 billion rows at 1 TB scale [2].
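For reference, here is a minimal sketch of how a dataset like this can be generated with the TPC-DS toolkit’s dsdgen tool; it assumes dsdgen has already been compiled, and the output directory is a placeholder:

    import subprocess

    # Generate the 100 GB TPC-DS dataset as flat files. Assumes the TPC-DS
    # toolkit has been built and dsdgen is in the current directory;
    # "/data/tpcds_100gb" is a placeholder output path.
    subprocess.run(
        ["./dsdgen", "-SCALE", "100", "-DIR", "/data/tpcds_100gb", "-FORCE"],
        check=True,
    )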

What Queries Did We Run?

We ran 99 TPC-DS queries [3] in August and September 2018. These queries are complex: they have lots of joins, aggregations, and subqueries. We ran each query only once, to prevent the warehouse from caching previous results.

Because we ran each query only once, these times include both the time to compile and the time to run each query. Redshift in particular is highly sensitive to this choice; it spent more time compiling these queries than running them. We explore this issue further in our deep dive.
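A minimal sketch of the measurement loop (this is not the actual benchmark code, which lives in the GitHub repository above; run_query is a hypothetical helper standing in for each warehouse’s Python driver):

    import time

    # Run each of the 99 queries exactly once, so the measured time
    # includes both compilation and execution. `run_query` is a
    # hypothetical helper that submits SQL through the warehouse's
    # driver and blocks until the full result has been returned.
    def time_queries(queries, run_query):
        timings = {}
        for name, sql in queries.items():
            start = time.monotonic()
            run_query(sql)
            timings[name] = time.monotonic() - start
        return timings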

How Did We Configure the Warehouses?

We set up each warehouse in a small and a large configuration, for the 100 GB and 1 TB scales respectively:

Warehouse      100 GB Configuration    Cost / Hour    1 TB Configuration    Cost / Hour
Redshift       8x dc2.large            $2.00          4x dc2.8xlarge        $19.20
Snowflake      X-Small                 $2.00          Large                 $16.00
Azure [4]      DW200                   $2.42          DW1500c               $18.12
Presto [5]     4x n1-standard-8        $1.23          32x n1-standard-8     $9.82
BigQuery [6]   On-demand               n/a            On-demand             n/a

How Did We Tune the Warehouses?

These data warehouses each offer advanced features like sort keys, clustering keys, and date-partitioning. We chose not to use any of these features in this benchmark [7]. We did apply column compression encodings in Redshift; Snowflake, Azure and BigQuery apply compression automatically; Presto used ORC files in HDFS, which is a compressed format.
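For concreteness, the snippet below shows what these knobs look like as Redshift DDL. The column subset is hypothetical, and the commented-out keys are examples of the tuning we deliberately skipped:

    # Illustrative Redshift DDL for one TPC-DS fact table (hypothetical
    # column subset). Compression encodings like these were applied;
    # the commented-out keys show the tuning this benchmark left out.
    CREATE_STORE_SALES = """
    CREATE TABLE store_sales (
        ss_sold_date_sk INTEGER      ENCODE delta,
        ss_item_sk      INTEGER      ENCODE lzo,
        ss_net_paid     DECIMAL(7,2) ENCODE zstd
    )
    -- DISTKEY (ss_item_sk)       -- would tune JOINs
    -- SORTKEY (ss_sold_date_sk)  -- would tune WHERE clauses
    ;
    """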

We explore various tuning optimizations in our deep dive.


Results

Which Warehouse Was the Fastest?

[Chart: Which data warehouse is the fastest?]

All warehouses had excellent execution speed, suitable for ad-hoc, interactive querying.

Redshift's slower times were primarily due to its slow query compilation; in a scenario where you run similar queries repeatedly, the second query will be much faster.

Azure's slower times at the 100 GB scale are due to the Gen1 architecture we used; the Gen2 architecture is not yet available in a small warehouse size.

Which Warehouse Was the Cheapest?

[Chart: Which data warehouse is the cheapest?]

BigQuery charges per-query, so we are showing the actual costs billed by Google Cloud. To calculate cost-per-query for other warehouses, we made an assumption about how much time a typical warehouse spends idle. For example, if you run a Snowflake X-Small warehouse for 1 hour at $2 / hour, and during that time you run 1 query that takes 30 minutes, that query cost you $2 and your warehouse was idle 50% of the time. On the other hand, if you run two 30-minute queries and the warehouse spends 0% of the time idle, each query only costs you $1. We looked at the actual usage data of a sample of Fivetran users with Redshift warehouses. The median Redshift cluster was idle 82% of the time [8].

This comparison depends heavily on our assumption about idleness [9]. If you have a very “spiky” workload, BigQuery would be much cheaper than the other warehouses. If you have a very “steady” workload, BigQuery would be much more expensive.
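As a worked version of this cost model (the formula is spelled out in note [9]; the example inputs below come from the Snowflake scenario above, and the idle fractions are assumptions):

    # Cost model from note [9]: a query's effective cost is its execution
    # time multiplied by the cluster's hourly cost, with idle time
    # amortized into the queries that do run.
    def cost_per_query(execution_hours, cluster_cost_per_hour, idle_fraction):
        return execution_hours * cluster_cost_per_hour / (1 - idle_fraction)

    # The X-Small example above: a 30-minute query on a $2/hour warehouse
    # that sits idle 50% of the time costs the full $2 ...
    assert cost_per_query(0.5, 2.00, 0.50) == 2.00
    # ... while the same query on a never-idle warehouse costs $1.
    assert cost_per_query(0.5, 2.00, 0.00) == 1.00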

Which Warehouse Is "Best"?

Speed and cost are not the only considerations; these warehouses have important qualitative differences. Based on our experience with real customers, we believe there are 5 key features that differentiate data warehouses:

  • Elasticity: how quickly can a data warehouse increase and decrease capacity in response to changing workloads?
  • Availability: how does the warehouse provide high uptime?
  • JSON support: can you store and query JSON data?
  • Partitioning: can you tune WHERE clauses by partitioning the data?
  • Distribution: can you tune JOINs by specifying data distribution?

We've summarized how Redshift, Snowflake and BigQuery compare on these criteria; we don't yet have enough customers with Azure and Presto to include them in a qualitative comparison:

[Table: compare data warehouses and their features; see notes 10–17]

Why Are Our Results Different From Previous Benchmarks?

Amazon’s Redshift vs BigQuery benchmark

In October 2016, Amazon ran a version of the TPC-DS queries on both BigQuery and Redshift. Amazon reported that Redshift was 6x faster and that BigQuery execution times were typically greater than 1 minute. The key differences between their benchmark and ours are:

  • They used a 10x larger dataset (10 TB versus 1 TB) and a 2x larger Redshift cluster ($38.40 / hour versus $19.20 / hour).
  • They tuned the warehouse using sort and dist keys, whereas we did not.
  • BigQuery Standard-SQL was still in beta in October 2016; it may have gotten faster by late 2018, when we ran this benchmark.

Benchmarks from vendors that claim their own product is the best should be taken with a grain of salt. There are many details not specified in Amazon’s blog post. For example, they used a huge Redshift cluster: did they allocate all memory to a single user to make this benchmark complete super-fast, even though that’s not a realistic configuration? We don’t know. It would be great if AWS would publish the code necessary to reproduce their benchmark, so we could evaluate how realistic it is.

Periscope’s Redshift vs. Snowflake vs. BigQuery Benchmark

Also in October 2016, Periscope Data compared Redshift, Snowflake and BigQuery using three variations of an hourly-aggregation query that joined a 1-billion-row fact table to a small dimension table. They found that Redshift was about the same speed as BigQuery, but Snowflake was 2x slower. The key differences between their benchmark and ours are:

  • They ran the same queries multiple times, which eliminated Redshift's slow compilation times.
  • Their queries were much simpler than our TPC-DS queries.

The problem with doing a benchmark with “easy” queries is that every warehouse is going to do pretty well on this test; it doesn’t really matter if Snowflake does an easy query fast and Redshift does an easy query really really fast. What matters is whether you can do the hard queries fast enough.

Periscope also compared costs, but they used a somewhat different approach to calculate cost-per-query. Like us, they looked at their customers’ actual usage data; but instead of using percentage-of-time idle, they looked at the number of queries per hour. They determined that most (but not all) Periscope customers would find Redshift cheaper, but it was not a huge difference.

Mark Litwintschik’s 1.1 Billion Taxi Rides Benchmark

Mark Litwintschik benchmarked BigQuery in April 2016 and Redshift in June 2016. He ran 4 simple queries against a single table with 1.1 billion rows. He found that BigQuery was about the same speed as a Redshift cluster about 2x bigger than ours ($41 / hour). Both warehouses completed his queries in 1–3 seconds, so this probably represents the “performance floor”: there is a minimum execution time for even the simplest queries.

Conclusion

These warehouses all have excellent price and performance. We shouldn’t be surprised that they are similar: the basic techniques for making a fast columnar data warehouse have been well-known since the C-Store paper was published in 2005. These data warehouses undoubtedly use the standard performance tricks: columnar storage, cost-based query planning, pipelined execution, and just-in-time compilation. We should be skeptical of any benchmark claiming one data warehouse is orders-of-magnitude faster than another.

The most important differences between warehouses are the qualitative differences caused by their design choices: some warehouses emphasize tunability, others ease-of-use. If you're evaluating data warehouses, you should demo multiple systems, and choose the one that strikes the right balance for you.

Notes

[1] TPC-DS is an industry-standard benchmark for data warehouses. Even though we used TPC-DS data and queries, this benchmark is not an official TPC-DS benchmark, because we only used one scale, we modified the queries slightly, and we didn’t tune the data warehouses or generate alternative versions of the queries.

[2] This is a small scale by the standards of data warehouses, but most Fivetran users are interested in data sources like Salesforce or MySQL, which have complex schemas but modest size.

[3] We had to modify the queries slightly to get them to run across all warehouses. The modifications we made were small, mostly changing type names. We used BigQuery standard-SQL, not legacy-SQL.

[4] The DW1500c configuration for Azure uses Microsoft's Gen2 architecture; the Gen2 architecture is not yet available in a small warehouse size appropriate for the 100 GB scale, so we used the Gen1 DW200 configuration at that scale.

[5] Presto is an open-source query engine, so it isn't really comparable to the commercial data warehouses in this benchmark. But it has the potential to become an important open-source alternative in this space. We used v0.195e of the Starburst distribution of Presto. Cost is based on the on-demand cost of the instances on Google Cloud.

[6] BigQuery is a pure shared-resource query service, so there is no equivalent “configuration”; you simply send queries to BigQuery, and it sends you back results.

[7] If you know what kind of queries are going to run on your warehouse, you can use these features to tune your tables and make specific queries much faster. However, typical Fivetran users run all kinds of unpredictable queries on their warehouses, so there will always be a lot of queries that don’t benefit from tuning.

[8] Some readers may be surprised these data warehouses are idle most of the time. Fivetran users tend to use their warehouses for interactive queries, where a human is waiting for the result, so they need queries to return in a few seconds. To achieve this performance target, you need to provision a large warehouse relative to the size of your data, and that warehouse is going to be idle most of the time. Also, note that this doesn’t mean these warehouses are doing nothing for hours; it means that there are many small gaps of idleness interspersed between queries.

[9] The formula for calculating cost-per-query is: [Query cost] = [Query execution time] × [Cluster cost] / (1 − [Cluster idle time])

[10] Redshift takes minutes-to-hours to restore a cluster; in practice this means you will run your cluster 24/7, even if you only use it for part of each day.

[11] Snowflake bills per-second, with a minimum of 1 minute, so you can save money by configuring your cluster to turn off during periods of inactivity.

[12] BigQuery bills per-query, so you only pay for exactly what you use.

[13] Redshift automatically backs up to S3, but in the event of a node failure you will lose a few hours of data and experience downtime while you wait for a restore.

[14] Redshift doesn't have an UNNEST or FLATTEN operator, so it's impractical to work with nested JSON arrays.

[15] Snowflake supports JSON data natively via its VARIANT type.

[16] In BigQuery, you can convert JSON to typed STRUCT/ARRAY types by writing a user-defined function in JavaScript.

[17] BigQuery only allows partitioning on date columns, and date-partitioning limits how you can use DML operations; but it works well for event tables.

About Fivetran: Our standardized technology delivers data into your warehouse the right way. Shaped by the real-world needs of data analysts, Fivetran technology supports agile analytics, enabling data-backed decisions across organizations. After a five-minute setup, Fivetran replicates all your applications, databases, events and files into a high-performance data warehouse. Our standardized cloud pipelines are fully managed and zero-maintenance.

If you're interested in learning more about Fivetran, sign up for a demo of our service here.
