Trying to decide where to store your data? The good news is that there are plenty of options to choose from. Data warehouses are fantastic for structured data, but they’re fairly rigid and expensive to run. Data lakes excel with unstructured and semi-structured data, but they lack strict governance. Using both, a data lake for staging and a data warehouse for analytics-ready models, is an option, but costs rise quickly and errors creep in as data moves between the two.
A data lakehouse allows you to easily store all kinds of data, model it into actionable information, and feed downstream systems without performance issues. We’ll explore exactly how this composite architecture works and outline why teams are moving to lakehouse-first data storage.
What’s a data lakehouse?
A data lakehouse is a composite architecture that combines the scalability and low-cost storage of a data lake with the structured, analytics-ready management capabilities of a data warehouse.
Lakehouses ingest both unstructured and structured data into object storage, a medium that traditionally works best with unstructured information. But by using open table formats to add rigorous schema enforcement, ACID (Atomicity, Consistency, Isolation, and Durability) transactions, indexing, and extra governance controls, lakehouses enrich the data lake structure with warehouse capabilities.
The result is a single architecture that gives you the best of both worlds: users can work with structured, semi-structured, and unstructured data without compromising performance or compliance.
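To make this concrete, here’s a minimal sketch of schema enforcement in action, using the open-source deltalake Python package. The table path and columns are illustrative; a real deployment would point at object storage such as an S3 bucket.

```python
import pandas as pd
from deltalake import write_deltalake

# Create a Delta table on plain files; a local path stands in for object storage
write_deltalake(
    "/tmp/lakehouse/orders",
    pd.DataFrame({"order_id": [1, 2], "amount": [9.99, 4.50]}),
)

# Schema enforcement: an append with an unexpected column is rejected
# outright rather than silently written into the table
try:
    write_deltalake(
        "/tmp/lakehouse/orders",
        pd.DataFrame({"order_id": [3], "coupon": ["SAVE10"]}),
        mode="append",
    )
except Exception as exc:
    print(f"Append rejected: {exc}")
```

Because every write lands as an atomic commit, concurrent readers never see the table in a half-written state.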
How does a data lakehouse work?
A data lakehouse architecture comprises five layers. Here’s a high-level overview of how they work and contribute to your end goal.
Ingestion
Extract data from multiple sources like databases, event streams, and SaaS applications. Modern ingestion tools like Fivetran even automate schema handling, conversion, and deduplication to make sure everything is analytics-ready.
Storage
Manage data in low-cost cloud object storage, decoupled from the compute layer. Lakehouses support structured, semi-structured, and unstructured formats while enabling warehouse features like ACID transactions through open table formats. These formats pair the data files themselves with metadata that records table state, partition information, and column-level statistics.
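As a rough illustration of what “data files plus metadata” means, the same deltalake package used above can surface the pieces an open table format manages (the path is again illustrative):

```python
from deltalake import DeltaTable

# A Delta table is just Parquet data files plus a _delta_log/ directory
# of JSON commits that record the table's state over time
dt = DeltaTable("/tmp/lakehouse/orders")
print(dt.schema())   # schema recorded in the log, not inferred from files
print(dt.files())    # Parquet data files in the current snapshot
print(dt.version())  # each atomic commit bumps the version number
```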
Metadata
Use a centralized metadata layer to track versions, data lineage, and schemas. This layer provides much-needed structure, keeping your lakehouse compliant and organized.
Examples of open table formats that combine storage and metadata include Iceberg and Delta Lake.
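Delta Lake’s commit log, for example, doubles as a version history, which makes audits and time travel straightforward. Here’s a brief sketch, assuming the illustrative table created earlier:

```python
from deltalake import DeltaTable

# The commit history records each operation, its timestamp, and its parameters
print(DeltaTable("/tmp/lakehouse/orders").history())

# Time travel: open the table exactly as it existed at an earlier version
snapshot = DeltaTable("/tmp/lakehouse/orders", version=0)
print(snapshot.to_pandas())
```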
Query engine/API
The query layer uses SQL engines and DataFrame APIs to transform and query data directly in the lakehouse. As querying is decoupled from storage, a single storage layer can be paired with multiple engines, each optimized for BI or other use cases.
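One way to see that decoupling in practice is to expose a Delta table as an Arrow dataset and let an embedded engine like DuckDB run SQL over it, with no copy into a separate warehouse. A minimal sketch, reusing the illustrative table from above:

```python
import duckdb
from deltalake import DeltaTable

# One storage layer, many engines: the table becomes an Arrow dataset...
orders = DeltaTable("/tmp/lakehouse/orders").to_pyarrow_dataset()

# ...and DuckDB queries it in place through its Arrow integration
con = duckdb.connect()
print(con.sql("SELECT COUNT(*) AS n, SUM(amount) AS total FROM orders").df())
```

Swapping in a different engine, say Spark for heavy transformations, requires no change to the storage layer itself.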
Consumption
Downstream analytics tools or connected ML models consume data from the lakehouse, giving your end users consistent reporting and analytics.
Given their flexibility, scalability, and agility, it’s easy to see why organizations are gravitating toward data lakehouses as company-wide storage options.
Data lakehouse vs. data warehouse vs. data lake
Data lakehouses combine two previously established data architectures: data warehouses and data lakes. Here’s how the three differ and why businesses may choose each.
Data warehouse
A data warehouse is a centralized system that stores structured data. It’s ideal for BI and analytics, letting you quickly query data using SQL. One downside to warehousing is that it’s not very flexible: storage and compute are tightly coupled, so scaling one necessarily scales the other. In addition, data warehouses typically handle raw or unstructured data poorly, if at all, so engineers must stick with structured, processed data.
Data lake
A data lake provides low-cost object storage for raw, unstructured, and semi-structured data, excelling exactly where warehouses fall short. Lakes are also extremely scalable and well suited to systems that need a bulk flow of raw data, like many AI and ML workloads. That said, if governance and consistency aren’t priorities, data quality can deteriorate and leave you stuck in a data swamp.
Data lakehouse
A data lakehouse is a composite architecture: inexpensive, lake-style storage combined with warehouse-style structure, governance, and analytics.
The main reason lakehouses exist is that many modern use cases don’t fall neatly into either lakes or warehouses. Even machine learning, which fits a lake best, might need structured data from a warehouse. Pulling data from separate, siloed analytical systems gets costly, not to mention the risk of duplicated data and errors.
Key features of a data lakehouse
With the combined capabilities of both architectures, the data lakehouse offers the following key features:
- ACID transactions: Lakehouses enable ACID transactions, normally associated with data warehouses, to ensure consistency as multiple parties concurrently read and write data.
- BI support: With ACID support and SQL query engines, analysts can directly connect data lakehouses with business intelligence platforms.
- Open storage formats: Data teams can combine their compute engine of choice with the lakehouse’s open storage format.
- Schema and governance capabilities: Lakehouses support schema-on-read, letting the consuming application define the data’s structure at query time (see the sketch after this list). For structured data, schemas are enforced on write so that every uploaded record matches the declared structure.
- Support for diverse data types and workloads: This storage option holds both structured and unstructured data, meaning you can use it to store, transform, and analyze images, video, audio, and text, as well as semi-structured data like JSON files.
- Decoupled storage and compute: Storage resources are decoupled from compute resources, providing modularity to control costs and scale your workloads.
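To illustrate the schema-on-read feature from the list above: raw, semi-structured records can land in the lake untouched, and the query engine decides the structure at read time. A small sketch using DuckDB, with a made-up file path and fields:

```python
import json
import os

import duckdb

# Semi-structured events land in the lake as-is, with no upfront schema
os.makedirs("/tmp/lakehouse/raw", exist_ok=True)
with open("/tmp/lakehouse/raw/events.json", "w") as f:
    f.write(json.dumps({"username": "alice", "action": "login"}) + "\n")
    f.write(json.dumps({"username": "bob", "action": "purchase", "amount": 20}) + "\n")

# Schema-on-read: the engine infers a structure at query time; fields a
# record lacks simply come back as NULL
con = duckdb.connect()
print(con.sql(
    "SELECT username, action, amount "
    "FROM read_json_auto('/tmp/lakehouse/raw/events.json')"
).df())
```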
For many organizations, lakehouses provide the best of both worlds, feeding modern data-first systems efficiently and at lower cost.
Data lakehouse examples
With the ability to manage data in any format, lakehouses directly support a range of modern use cases. Here are top examples of when to use a lakehouse.
Real-time streaming analytics
Lakehouses are great at ingesting large volumes of data while keeping transactions consistent and reliable. For downstream engines that need data immediately, lakehouses deliver both the volume and the quality required for regularly updated insights.
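Here’s a sketch of what that can look like, assuming a Spark cluster with the Delta Lake and Kafka connector packages available, and a hypothetical Kafka topic:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("stream-to-lakehouse")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

# Subscribe to a continuous event stream (broker and topic are illustrative)
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clickstream")
    .load()
)

# Each micro-batch lands as one ACID commit, so downstream consumers
# only ever see complete, consistent snapshots of the table
(
    events.selectExpr("CAST(value AS STRING) AS payload")
    .writeStream.format("delta")
    .option("checkpointLocation", "/tmp/lakehouse/_checkpoints/clicks")
    .start("/tmp/lakehouse/clicks")
)
```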
Scalable data science and exploration
Because a lakehouse holds large volumes of data in one place, data scientists can readily access datasets and experiment directly, without hopping between systems to find the applicable data.
Enterprise-wide advanced analytics
Lakehouses feed downstream analytics engines like BI tools and dashboards. When your entire organization acts based on data from the same source of truth, you’ll stay aligned and consistent.
Data lakehouse benefits
A lakehouse combines the scalability of a data lake with the compliance-first structure and analytical capabilities of a data warehouse. Combining these architectures offers a range of benefits:
- Scalability: Because lakehouses use object storage, you can scale their repositories to house larger volumes of data.
- Improved management: Offering warehousing features like ACID transactions and schema enforcement makes for more effective data management.
- Streamlined data architecture: Lakehouses combine the best parts of warehouses and lakes, collapsing two separate architectural components into one.
- Lower costs: With a single data architecture, you don’t pay to maintain multiple storage systems at the same time.
- Eliminate redundant data: Moving data between lakes and warehouses creates redundant copies. A lakehouse removes the need for that movement, limiting the likelihood of stray copies floating around.
Data lakehouse challenges
Although lakehouses come with several upsides, that doesn’t mean there aren’t challenges to consider when using them:
- Technology complexity: Managing the distinct architecture of a data lakehouse with open table formats can be a problem for teams without lakehouse experience.
- Need for new skill sets: Engineers may need to separately provision new tools, systems, and query engines to combine with the lakehouse.
- Choosing the right components: A data lakehouse has five distinct layers, and the right tool must be selected for each one. Every choice requires planning and expertise.
How Fivetran feeds and fuels the data lakehouse
Fivetran’s automated data movement platform and its extensive library of pre-built connectors make it easy to continuously ingest data into lakehouse storage. With 700+ connectors, you can absorb information from an enormous variety of sources into a single storage layer.
And by offering fully managed, end-to-end ELT pipelines, Fivetran removes the burden of manual pipeline maintenance. Scale your data architecture with easy-to-use, powerful, and industry-leading data movement tools. Get started for free or book a live demo to find out more.
FAQs
What is a Databricks data lakehouse?
Databricks pioneered the lakehouse concept, framing it as a unified, open, and scalable architecture that combines the best of data warehousing and data lakes.
Is AWS a data lake or a data warehouse?
Neither. AWS is a cloud provider that offers distinct data lake and data warehouse services, such as Amazon S3 and AWS Lake Formation for lakes and Amazon Redshift for warehousing.
Is a data lake an ETL tool?
No, a data lake is not an ETL tool. An ETL (Extract, Transform, Load) tool is a kind of data pipeline that moves source data into storage and prepares it for analysis along the way. When loading data into your lake, you will likely use an ETL tool.
[CTA_MODULE]

