Databricks' release of Delta Lake last year was one of the most important developments in the data and analytics ecosystem. Databricks is the primary sponsor of Apache Spark, an open-source distributed computing platform that is an alternative to commercial analytical database systems like Snowflake. There are a variety of reasons why you might use Spark instead of a traditional analytical database:
- You need to run a query that can't be expressed in SQL.
- You dislike SQL and you prefer writing queries in Scala or Python.
- You want to do advanced statistics and your data doesn't fit in memory.
Historically, Spark has been a pure query engine. This means every Spark user operated a separate storage layer, typically an HDFS cluster or an S3 data lake.
Delta Lake is a built-in storage layer for Spark and Databricks that also provides many of the features of traditional analytical databases:
- A schema for each table.
- Transactional inserts, updates, and deletes.
- Table and column statistics, which allows an optimizer to plan efficient joins.
If these features seem familiar, they should: these are the key features of a relational database. In particular, the ability to do efficient joins is what makes a database relational.
Delta Lake is especially significant because it represents a kind of convergence between data lakes and data warehouses. Databricks calls this the "Lakehouse". We see the exact same phenomenon, in the opposite direction, from Snowflake. Their original innovation was the separation of compute from storage, which brought the cost advantages of data lakes into a familiar SQL data warehouse. More recently they announced Snowpark, which will allow Snowflake customers to write queries using procedural programming languages.
Over time, we expect both data warehouses and data lakes to co-opt each others’ key advantages. The ideal analytical database system:
- Supports multi-table transactions.
- Has table and column-level permissions.
- Supports structured and semi-structured data.
- Executes joins efficiently.
- Can be queried using SQL and procedural programming languages.
- Natively supports advanced statistics.
You can call this system a cloud data warehouse, or a data lake, or a lakehouse. Or you can be old-fashioned, and call it a relational database management system.