The data lake segment is evolving fast. Hadoop’s distributed file system (HDFS) was a great start, but it has largely been replaced by inexpensive object storage as the foundation for data lakes. That works well until you have to figure out how to ensure data quality and governance. And then there’s the question of completeness: data lakes have traditionally not captured the variety of “small data” coming from enterprise operational data sources.
That’s why we are excited to announce a partnership with the team at Databricks as a launch partner of their Data Ingestion Network, which simplifies loading data into Delta Lake, the open-source technology for building a reliable and fast lakehouse.
First, let’s start with a simple question: What is a lakehouse?
As coined and defined by Databricks, a lakehouse has the following key features:
- Storage is decoupled from compute
- Open-source
- Support for diverse data types ranging from unstructured to structured data
- Support for diverse workloads
- ACID transaction support
- Ability to ingest data via both stream and batch
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. It was designed to bring reliability, performance, and life-cycle management to data lakes. It’s a core component of the Databricks Unified Data Service and helps companies build data lakes that are not only reliable, but also adhere to compliance and security policies.
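To make that concrete, here is a minimal sketch of what ACID writes, in-place updates, and historical queries look like against a Delta table from PySpark. It assumes a local Spark session with the delta-spark package installed; the /tmp/events path, column names, and values are hypothetical.

```python
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Hypothetical local setup: a Spark session with the Delta Lake extensions enabled.
builder = (
    SparkSession.builder.appName("delta-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/events"  # hypothetical table location on local disk or object storage

# Batch write: the commit is atomic, so readers never see a half-written table.
events = spark.createDataFrame(
    [(1, "signup"), (2, "purchase")], ["user_id", "event_type"]
)
events.write.format("delta").mode("overwrite").save(path)

# DML on the lake: update rows in place instead of rewriting files by hand.
table = DeltaTable.forPath(spark, path)
table.update(condition="user_id = 2", set={"event_type": "'refund'"})

# Time travel: read the table as of an earlier version for historical queries.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
```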
These are important concepts to us here at Fivetran because we believe that every company should have a single database that records every fact, about every event, that has ever occurred in the business. When companies do this, they can generate valuable analytics that are accurate and consistent for BI applications and reports. While we have traditionally worked with leading data warehouse providers to deliver on this goal, we believe this ethos also applies to the world of data science and data lakes.
There are two core problems that many teams experience with data lakes:
- Issues with data quality due to a lack of control over ingested data
- Most data lakes lack the key operational data that provides overall business context
The first problem exists because data lakes were never built to handle requirements such as historical queries, data validation, reprocessing, or updates. The second follows from it: because of these limitations, data lakes never became the single environment where all the data could live for both data science and business intelligence reporting. Key operational data from modern sources such as Salesforce, NetSuite, Google Ads, Marketo, Zendesk, Postgres, and others was often excluded and existed only within a data warehouse.
To work around this, a few anti-patterns emerged, largely focused on solving these system problems rather than on how to extract value from data:
- Data quality: To handle data validation, reprocessing, and updates, teams generally adopt a lambda architecture so the business can guarantee some level of data quality.
- Data completeness: When IoT, event, or log data needs to be joined with business data, teams either run a Spark job and offload the result to the data warehouse, or build ad hoc pipelines that upload spreadsheets or CSV files to the data lake. Either way, you are left with an incomplete view of the customer.
Our partnership with Databricks helps solve both problems in a few key ways:
- Delta Lake largely eliminates the core challenges of data quality with features like ACID transactions, DML support, and schema enforcement (see the sketch after this list).
- Fivetran solves the data completeness challenge with zero-configuration, automated data integration, so you can put your data pipelines from these modern systems on autopilot, no matter how a source’s schema or API changes.
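As a rough illustration of the first point, here is a sketch of how schema enforcement and MERGE-based upserts look in PySpark against a Delta table. It reuses the same hypothetical Delta-enabled session and /tmp/events table as the earlier sketch; the staged updates and column names are made up for illustration.

```python
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

# Same hypothetical Delta-enabled session and table as in the earlier sketch.
builder = (
    SparkSession.builder.appName("delta-quality-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
path = "/tmp/events"  # hypothetical table location

# Schema enforcement: appending data whose schema doesn't match the table is
# rejected, so malformed records can't silently corrupt downstream analytics.
bad_batch = spark.createDataFrame(
    [(3, "click", "oops")], ["user_id", "event_type", "unexpected_col"]
)
try:
    bad_batch.write.format("delta").mode("append").save(path)
except AnalysisException as err:
    print("Rejected by schema enforcement:", err)

# DML support: a MERGE applies updates and inserts in one ACID transaction,
# the pattern an automated pipeline uses to keep the lake in sync with a source.
updates = spark.createDataFrame(
    [(2, "purchase"), (3, "signup")], ["user_id", "event_type"]
)
(
    DeltaTable.forPath(spark, path).alias("t")
    .merge(updates.alias("u"), "t.user_id = u.user_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```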
All in all, customers end up as the winners when they can focus on extracting value from data in one centralized location instead of building and maintaining multiple systems. Teams that used to spend their time architecting and maintaining complex workarounds are now empowered to unify all of their data and analytics needs in a reliable data lake fed by automated data pipelines. When everything is unified, organizations can make smarter decisions while saving crucial development time and money.