Most problems associated with the EDW are actually problems with ETL. The foundation of a successful data warehouse is proper separation of concerns.
The enterprise data warehouse (EDW) has been defined, redefined, loved, hated and passionately debated for almost 50 years. Yet the core concept is simple and wonderful: a single database that records everything that happens in your business.
In practice, the EDW is not always wonderful. You will hear claims that the data warehouse is an archaic, soon-to-be replaced technology, for the following reasons:
The extract-transform-load (ETL) process introduces hours of latency, so your data warehouse can't serve real-time use cases.
In multi-region environments, data gravity makes it infeasible to move all your data to a single location.
It takes months to build a useable data warehouse.
Your EDW will be too rigid to adapt to the changing needs of your business.
EDWs don't work with non-relational data sources like APIs and semi-structured events.
These problems are real, but they’re not actually problems with the data warehouse concept. They’re problems with ETL. In a typical EDW, an ETL pipeline will periodically extract the current data from each data source, transform it into a user-friendly schema, and load it into the data warehouse:
All the common problems with data warehouses stem from this approach:
The extract phase pulls a snapshot from each source on every update. This takes a long time, so the overall ETL pipeline will have hours of latency (problem 1). It also generates a tremendous amount of data that needs to be moved, so your data warehouse needs to be physically close to your sources (problem 2).
The transformations that convert the source schema into the EDW schema are very complex and take months to develop (problem 3).
The EDW schema is designed up front, and is very costly to change as the needs of your business evolve (problem 4).
Non-relational data sources aren't present in this workflow, because there is no way to snapshot them (problem 5).
Several alternatives to the EDW have been proposed to avoid the problems of ETL. Unfortunately, each of them is fatally flawed.
BI tools like Tableau and PowerBI offer the option to connect directly to data sources like Oracle and Salesforce. User queries like "How many new customers signed up in January?" are translated directly into SQL queries or API calls against the source. These queries intrinsically have zero latency and minimize data movement. Query federation would be wonderful — if it worked. Unfortunately, most real-world data sources are simply too slow to support federated querying against realistic data sets.
Virtualization is a variation of the query-federation concept. Instead of connecting your BI tool directly to each data source, you create a "virtual data warehouse" that exposes a standard SQL interface. The "virtual data warehouse" translates your queries at runtime into native queries against each data source. Data virtualization is a beautiful concept, but in practice it only works on high-bandwidth data sources like HDFS or S3. Row-store databases like Oracle and apps like Salesforce will never be fast enough for data virtualization to work on large data sets.
Hybrid Transactional/Analytical Databases (HTAP)
HTAP databases are designed to support both high-frequency transactions and high-data-volume scans for analytics, in the same system. The latency of these systems is intrinsically zero, since you're using a single database for both workloads. HTAP databases are fascinating pieces of technology, and have been the subject of many PhD theses. Unfortunately, in the real world they are rarely used, because it's not realistic to force all your tools to run on a single HTAP database.
The foundation of a successful data warehouse is proper separation of concerns. Your plan should have two stages:
Replicating all your business data into the EDW.
Transforming that data into a user-friendly schema.
This architecture solves the problems we identified with the traditional ETL-based approach:
Instead of an extract phase that pulls snapshots, we have a replication phase that uses the native changelog mechanism of each source. Using changelogs eliminates the latency of ETL (solves problem 1), and solves the data gravity problem by moving 99% less data (solves problem 2).
Transformation is performed inside the data warehouse using SQL queries. You can create SQL queries in minutes instead of months (solves problem 3).
The transformation process is non-destructive, so you can build your EDW schema incrementally and evolve it as the needs of the business change (solves problem 4).
You can support non-relational data sources by using the native changelog mechanism of each source (solves problem 5).
The hardest part of this approach is building an accurate replication process. The changelog of each data source is complex, and often requires dynamically generated API calls or queries to get additional context that's not present in the changelog. Fortunately, we can help you with that: Fivetran is fully managed, cloud-native data infrastructure that will manage the top half of the above diagram for you. Sign up for a personalized demo of our service or start your free trial today.
Update your browser to view this website correctly. Update my browser now