The data lakehouse is a promising new technology that combines aspects of data warehouses and data lakes.
Fairy tales often emphasize the importance of moderation, compromise, and combining the best characteristics of things. Goldilocks needed a bowl of porridge that was not scalding or frigid, but just right.
The scalding bowl is a data lake, in which data swims without a schema. You can replicate any data, structured or unstructured, into it, and impose order on what you’ve stored only when you need to analyze it. Just as you can blow on a spoonful of hot porridge to cool it down, you can organize and analyze portions of data from a data lake as needed.
The frigid bowl is the data warehouse, a scalable repository in which data is defined by schemas, which makes it suitable for analyzing organized data. The downside is its inability to absorb media files and unstructured data. It’s thick and cool, and unable to properly dissolve many ingredients.
The third, palatable bowl is the data lakehouse. It’s at the right temperature to be neither too ordered nor too unstructured for timely analytics.
In the battle of data lakes vs. data warehouses, each offers key advantages and disadvantages. Data lakehouses aim to provide the best of both.
One of the key advantages of a data warehouse is the use of relational database schemas to define structured data, which makes for fast analytics and SQL compatibility.
Like a data warehouse, a data lakehouse supports schemas for structured data, and implements schema enforcement to ensure that the data uploaded to a table matches the schema. It also supports ACID transactions, another data warehouse feature, to ensure consistency as multiple parties concurrently read and write data.
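To make schema enforcement concrete, here is a minimal sketch in plain Python. It does not use Delta Lake or any vendor's API; the `Table` class, the `SchemaMismatchError` exception, and the column names are all hypothetical, invented for illustration. The idea is simply that a write is checked against the declared schema before anything is stored, and a batch containing a bad row is rejected as a whole.

```python
from dataclasses import dataclass, field


class SchemaMismatchError(ValueError):
    """Raised when a row does not match the table's declared schema."""


@dataclass
class Table:
    # Hypothetical in-memory table: maps column name -> expected Python type.
    schema: dict
    rows: list = field(default_factory=list)

    def append(self, new_rows):
        # Validate every row before storing any of them, so a batch with
        # one bad row leaves the table unchanged.
        for row in new_rows:
            if set(row) != set(self.schema):
                raise SchemaMismatchError(
                    f"columns {sorted(row)} do not match {sorted(self.schema)}"
                )
            for column, expected in self.schema.items():
                if not isinstance(row[column], expected):
                    raise SchemaMismatchError(
                        f"column {column!r} expects {expected.__name__}, "
                        f"got {type(row[column]).__name__}"
                    )
        self.rows.extend(new_rows)


orders = Table(schema={"order_id": int, "customer": str, "total": float})
orders.append([{"order_id": 1, "customer": "Goldilocks", "total": 4.5}])

try:
    orders.append([
        {"order_id": 2, "customer": "Papa Bear", "total": 9.0},
        {"order_id": "three", "customer": "Mama Bear", "total": 7.25},  # wrong type
    ])
except SchemaMismatchError as err:
    print(f"rejected: {err}")

print(len(orders.rows))  # still 1: the rejected batch stored nothing
```

Rejecting the whole batch only hints at the all-or-nothing behavior of a transaction. Real ACID support also has to handle concurrent readers and writers and survive failures, which this sketch does not attempt.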
Data lakes, on the other hand, have data warehouses beat with their flexibility to hold unstructured data.
Like data lakes, data lakehouses can hold both structured and unstructured data, so you can use them to store, transform, and analyze things like images, video, audio, and text, as well as semistructured data like JSON files. They support schema-on-read, in which the software accessing the data determines its structure on the fly. Data lakehouses also support huge volumes of storage more cost-effectively than data warehouses.
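Schema-on-read can be sketched the same way. In the example below, which uses only Python's standard library, the raw records are stored exactly as they arrived, and each reader supplies its own schema when it queries them. The `RAW_EVENTS` data, the `read_with_schema` function, and the field names are assumptions made up for this illustration, not part of any lakehouse product.

```python
import json

# Raw, semistructured records stored as-is; no schema was applied on write.
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.99", "note": "gift"}',
    '{"user": "bo", "amount": 5}',
    '{"user": "cy"}',
]


def read_with_schema(lines, schema):
    """Yield one dict per record, shaped by `schema` at read time.

    `schema` maps a column name to a function that casts the raw value.
    Columns missing from a record come back as None; fields the schema
    does not mention are ignored.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            column: cast(record[column]) if column in record else None
            for column, cast in schema.items()
        }


# Two readers impose two different structures on the same stored data.
purchases = list(read_with_schema(RAW_EVENTS, {"user": str, "amount": float}))
notes = list(read_with_schema(RAW_EVENTS, {"user": str, "note": str}))

print(purchases)
print(notes)
```

The trade-off is that malformed values surface only when someone reads them, whereas schema enforcement catches them at write time.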
On top of these features, data lakehouses run on cloud platforms, so they scale readily. Storage resources are decoupled from compute resources, so you can scale each one independently to meet the needs of your workloads, whether they involve machine learning, business intelligence and analytics, or data science.
And because data lakehouses use open storage formats, you can use a variety of tools with them; you’re not locked into one vendor’s monolithic data analytics architecture.
So it seems like a data lakehouse really does offer a way for an organization to avoid maintaining a separate data warehouse (for a single source of truth) and data lake (for cost-effective storage of historical data and media files).
Is the data lakehouse the best of both worlds? It’s too soon to tell. The lakehouse is a fairly new arrival on the data analytics scene. The term was coined by Databricks in 2020 for its Delta Lake software. Delta Lake is an open source project aimed at bringing reliability to data lakes.
In a paper presented at the Conference on Innovative Data Systems Research earlier this year, Databricks developers posit that
the data warehouse architecture as we know it today will wither in the coming years and be replaced by a new architectural pattern, the Lakehouse, which will (i) be based on open direct-access data formats, such as Apache Parquet, (ii) have first-class support for machine learning and data science, and (iii) offer state-of-the-art performance. Lakehouses can help address several major challenges with data warehouses, including data staleness, reliability, total cost of ownership, data lock-in, and limited use-case support.
No one is recommending that businesses immediately abandon their investments in Amazon Redshift, Google BigQuery, Microsoft Azure Data Lake, or Snowflake and migrate to Delta Lake on Databricks.
But the ecosystem is growing, and most vendors appear to be moving in that direction anyway. Microsoft, one of the top four cloud data warehouse vendors, last year unveiled capabilities for its Azure Synapse Analytics platform that fit the data lakehouse definition pretty well. AWS is also trying to jump on the bandwagon, and Snowflake increasingly refers to its platform as a data lakehouse.
Fivetran supports Delta Lake on Databricks, as well as every one of the most popular data warehouses and data lakes. Wherever you pipe your data, we’ve got you covered. Sign up today for a free trial.