Are data lakes the right approach for your organization?
A data lake is a permanent repository of your company's data, in an open-source file format like Parquet and a blob-store like S3. A data lake is similar to a data warehouse, but because all the underlying components are open-source or commodity infrastructure, they have less vendor lock-in.
It seems like every AWS solutions architect is recommending data lakes, every systems integrator is pitching them, and there's a data lake in the center of every infrastructure diagram. Data lakes are so hot right now:
There are good reasons to adopt a data lake, but there are also some bad reasons that you'll hear from solutions architects and systems integrators. Let's try to cut through the confusion: What are the right and wrong reasons to adopt a data lake?
The enterprise data warehouse is a great idea, but when you use a proprietary data store as the definitive long-term repository of all your company's data, you are locked into that vendor. By making a data lake your system of record, you protect yourself from vendor lock-in.
Your primary data store is probably a traditional SQL data warehouse like Snowflake, Redshift or BigQuery. But what if you want to use a non-SQL system like Spark? Maybe you're tired of writing SQL queries with 20
with clauses, and you want to try out Azure Data Explorer. If you store all your data in a vendor-neutral data lake, you can easily support multiple destinations.
It seems like every five years, a faster, cheaper, easier data warehouse comes out. Data warehouse migrations are a never-ending fact of life. At Fivetran, we have customers who run three data warehouses concurrently: the new data warehouse, the old data warehouse and the really old data warehouse. It's totally reasonable to run an old data warehouse for years after it's been deprecated. Maybe the finance team has some really complicated reports in the old DWH. If it isn't broken, why fix it? A data lake makes it easy to send the exact same data to multiple data warehouses.
There's a new generation of data warehouses, notably Snowflake and BigQuery, that store data separately from the compute nodes. Modern data warehouses can do this:
What you're seeing is a query running against a warehouse that is suspended. In about one second after we run the query, new compute nodes are provisioned to run the query: You can see the "warehouse on" indicator turn green on the right.
Some solutions architects have proposed data lakes as a way of "separating compute from storage" in a traditional data warehouse. But they're missing the point: You want the ability to scale compute easily and on demand. A data lake isn't going to give you this; what you need is a data warehouse that can provision and suspend capacity whenever you need it.
Data lakes are often proposed as a "stage of curation." The idea is that you store your raw data from your apps, databases and event streams in the data lake with minimal processing. Then you transform this data into a "curated" form that lives in the data warehouse, usually a dimensional schema.
Defining stages of curation is a great idea, but data lakes are a bad way to accomplish it. Data lakes physically separate "raw" from "curated" data. Instead, you should logically separate your raw data from your curated data by simply having two schemas in your data warehouse. Having both the raw and the curated data in the same data warehouse makes it much easier to write transformations: They're just plain SQL queries. It also gives you flexibility in defining the border between raw and curated data: If you have a relatively clean data set, you can start a new project with queries directly against the raw data. Later, once the project has proven value, you can migrate to a more organized dimensional schema.
If you're going to get all your data in one place, you're probably going to encounter non-relational, semi-structured data like JSON or XML. You will hear people claim that data warehouses can't store semi-structured data, so you need a data lake to store all your data.
This is nonsense. Modern data warehouses have first-class support for semi-structured data. Snowflake has the VARIANT type, and BigQuery can flatten nested JSON with the UNNEST operator. If your data warehouse can't support semi-structured data, the solution is a better data warehouse, not a data lake.
Sign up for the beta of the Fivetran fully managed data lake, which simultaneously replicates all your data sources to your data lake and your data warehouses: