What Is the Ideal Cloud Data Warehouse Platform?
This is a guest post by John Ryan, Data Warehouse Solution Architect at Snowflake. John Ryan’s blog can be found here.
A few years ago I joined a Tier 1 Investment Bank in London as a Data Warehouse Architect and SQL performance tuning expert, and immediately faced one of the biggest challenges of my entire career.
The mission-critical, on-premise, multi-terabyte Oracle warehouse system I was responsible for was struggling to meet SQL processing demands, and was failing to meet the SLA for both ETL and business intelligence query performance. This became one of the most challenging periods of my entire career.
In this article, I’ll describe the difficulties I faced, the solutions available at the time, and discuss new and innovative cloud-based data warehouses with incredible benefits in scalability, elasticity and SQL performance.
Data Warehouse Query Challenges
Data warehouse architects deal with (often conflicting) requirements that include:
- Maximising Query Performance: Which means minimising the time taken for SQL queries to complete.
- Maximising Throughput: Batch ETL or ELT tasks must complete as quickly as possible which means running multiple processes in parallel to make best use of machine resources.
- Maximising Machine Usage: To make efficient use of machine resources, the warehouse should ideally be run at 100% CPU capacity. Clearly this often conflicts with the two priorities above, but running at any less is a waste of multimillion-dollar CPU resources. Unlike a car, a computer system won’t last any longer if it’s sitting idle most of the time.
"You cannot put CPU in the bank and save it for later. So, if you are running with idle cycles you should be looking for more ways to use it" - Tom Kyte (Oracle)
To make matters worse, data warehouses must accommodate several competing uses:
- Micro-Batch ETL and ELT Processes: Which typically run on a repeating cycle. The machine tends to be overloaded during this time, either at or above 100% capacity in an attempt to process and deliver new data as quickly as possible.
- Business Intelligence Reporting: Which tends to be erratic, with load varying between zero and maximum capacity throughout the day.
- Dashboard Queries: Which need very fast (sub-second) performance. Machine load in these cases is extremely unpredictable, but when the capacity is needed, results must be available immediately.
The chart below illustrates a typical ETL or ELT load profile with repeating batch processes which run the machine at nearly 100% capacity. These tasks maximise throughput to complete as much work in as little time as possible.
This contrasts with the much more erratic Business Intelligence query profile illustrated below.
The most common approach to manage these competing demands involves a Workload Management Queue whereby each user group is allocated 50% of machine resources. This places queries in two separate queues, with each limited to a fixed number of concurrent queries or tasks executed in parallel. If insufficient resources are available, the queries are queued, and sometimes time out.
The combined profile might look like this:
The difficulties with this solution quickly become apparent:
- Capped Maximum Usage: As each group is given half the machine resources, the ETL processes which would normally run at 100% CPU are limited to 50%, and the ETL run times are therefore extended. This problem worsens as additional data sources are connected.
- Built-in Inefficiency: Dividing the machine resources means reserving spare capacity, which means that CPUs seldom run at 100%. Equally, when both ETL and analytic queries are running, usage can exceed 100%, leading to long elapsed times and timeouts.
In conclusion, neither group is fully satisfied, and there’s an on-going tug-of-war for increasingly scarce machine resources.
Potential Performance Tuning Options
There are several methods to manage the performance of a data warehouse:
- Limit Resource Usage: Use increasingly aggressive workload management to control the number of concurrent online queries and batch tasks.
- Database and Application Tuning: Often the most effective way to improve SQL throughput, this requires deep understanding of database design principles. Even with highly skilled engineers, it can take months to deliver meaningful results. In my case it took a team of three experts almost a year to completely resolve the performance issues.
- Improve the Hardware:
- Scale Up: Which involves additional memory, CPUs or faster disk storage or even migrating the entire application to a new, larger machine.
- Scale Out: Which is an option for Massively Parallel Processing (MPP) architectures, which involves adding nodes to the existing system.
Scale Up for Fast SQL Performance
When running on a shared memory architecture, the only realistic option is to scale up the hardware.
The diagram below illustrates this architecture, and the solution involves adding CPUs or memory or migrating to a bigger machine. Like the tuning option described above, this can take months to complete. A typical data warehouse system can have a huge number of incoming and outgoing data feeds, many undocumented, and the preparation, testing and migration process can take considerable time.
It took my team almost a year to migrate a large and complex Oracle data warehouse to a new Exadata platform.
To make matters worse, like induced demand on motorways, the workload typically increases to make use of the additional capacity, and a few years down the line the system will need upgrading again.
Clearly, this is not a sustainable long term strategy.
Scale Out for Bigger Workloads (MPP)
The diagram below illustrates the massively parallel processing (MPP) architecture used by many databases and warehouse appliances including Vertica, Teradata and Greenplum. This solution involves adding compute and storage nodes to an existing cluster.
While this can be easier than the scale up approach described above it does have the following drawbacks:
- Time to Market: Adding nodes to an existing on-premises MPP system still requires a potentially lengthy ordering, deployment and installation process which can take days or even weeks to complete.
- Database Reorganisation: On MPP systems, the data is sharded (distributed) across nodes in the cluster which enables parallel query execution. When an additional node is added, often the existing tables need to be reorganised to make use of the additional capacity, and this can be a complex and resource-intensive operation, which can cause downtime.
- Over-capacity: Since storage and compute are tightly coupled on each node, if storage demands exceed processing capacity needs you may pay for far more compute than you need. Research by Clarity Insights shows that demand for storage can outpace demand for compute by up to a factor of 300.
The Ideal Cloud Data Warehouse Platform
The ideal cloud-based data warehouse circumvents the aforementioned limitations. It should accommodate the ability to independently scale compute and storage up (or down) within seconds, or even on the fly. This delivers the advantages of scale-out architecture while avoiding the need to reorganise the database.
Such a data warehouse could deploy multiple independent clusters of compute resources over a shared data pool. Users and business units can independently be allocated compute resources on which to run ETL/ELT load processes or business intelligence queries. This avoids the tug of war for machine resources found on almost every other database solution.
The benefits to such a setup include:
- Fast to Deploy: As the solution is provided as a cloud based service, a data warehouse can be deployed within minutes without installing new hardware or even a long term financial commitment.
- Complete Flexibility: Additional nodes can be scaled up or down as needed. You can even suspend the service when it isn’t used, with compute resources charged on a per-second basis. You can automatically start up additional clusters immediately without any data reorganisation. The entire operation can run without DBA involvement.
- Pay Separately for Storage: As storage is charged separately from compute resources, it is possible to scale up to petabyte sizes, but pay only for the storage and processing power on an as-needed basis. This differs greatly from MPP solutions like Teradata or Vertica where adding nodes adds to both compute and storage at the same time.
- Zero Contention: With the ability to allocate multiple clusters of varying sized virtual warehouses to different tasks, there is no contention for compute resources. This would allow ETL/ELT processes to run completely separately from end user queries while also being shielded from the massively unpredictable analytic query demands of data scientists.
- Absolute Simplicity: Unlike the legacy database solutions, there is no indexing, partitioning or data sharding go manage. There are also no query plan statistics to capture, and almost no options to tune the system. It just works.
Performance Challenge Revisited
The charts below illustrate the separation between batch ETL/ELT and business intelligence query profiles on a modern, cloud-based data warehouse. Different business units can be deployed on entirely independent clusters, each sized as appropriate. This is simply not possible on most database platforms, including cloud-based solutions that are direct descendents of on-premise systems.
The benefits of this solution include:
- Independent Sizing: Each business unit can maintain a separate virtual warehouse with the appropriate compute and storage resources. It is also easier to allocate and control budgets on different projects.
- Ability to Scale Up/Down: Often the batch or query workload increases on a regular basis, for example at month or financial year end. Normally this would lead to reduced performance leading to delays and frustration, but modern cloud data warehouses can scale the system up or down to cope with the additional demands. This can be scripted directly using SQL.
- Automatic Scale Out: While the number of concurrent ETL/ELT tasks is largely predictable, business intelligence query workloads are often erratic, and during peak periods, it is still possible queries will be queued. To maintain end user performance, the option exists to configure the virtual warehouse to run in multi-cluster mode. Unlike the scale up option described above, this adds additional same size nodes to an existing cluster. Additional nodes are automatically added as query demands exceed the ability of the cluster to cope, and once the peak demand subsides, these can be transparently suspended to manage costs.
"Benchmarks are demonstrating that separate compute and storage can outperform dedicated MPP relational database platforms." - Tripp Smith. Clarity Insights
Just a few years ago, the only ways to scale a data warehouse platform were by application tuning or migrating to a bigger machine. If you were lucky enough to be running Teradata, Vertica or Greenplum, you could extend the existing cluster, but the short term options were limited, and it could take weeks to respond to increases in demand.
Cloud data warehouses have changed the SQL data warehouse landscape by delivering innovative multi-cluster shared data platforms which deliver the flexibility to spin up, suspend or resize storage and compute clusters to exactly match load requirements.
In addition, because storage and compute are deployed separately, it is possible to independently scale either as storage, or processing as requirements change.
Best of all, you only pay for the compute resources you need, and these are charged on a per-second basis which makes this both a compelling and potentially cost effective solution.