ETL, long a mainstay of data integration, is labor-intensive, brittle, complex — and ripe to be supplanted by ELT.
The theory of disruptive innovation describes how new products and services displace incumbent market leaders. The key to disruption is leveraging new technologies to create products that are simpler to use and more accessible than incumbents, creating new markets.
Disruptive products initially compete against nonconsumption (i.e. the inability of customers to purchase and use a product), empowering customers to do new things. Over time, they grow in capability until they can directly challenge and replace established incumbent products that address the same need.
Ultimately, they overtake incumbents, radically expanding the market and accounting for greater sales and market share.
The progression of modern computing illustrates disruption well:
Vacuum tube mainframes, used exclusively by universities, government labs, and large corporations, were disrupted by transistor-based minicomputers, which were smaller, cheaper, simpler to use, and accessible to smaller organizations.
Minicomputers were disrupted by microcomputers, namely personal computers (PCs) and commodity servers, which leveraged advancements in miniaturization, user interfaces, and networking to make computing and the internet accessible to small businesses, households, and individuals.
PCs are currently being disrupted by mobile devices that are cheaper, handier, and consume less electricity. Desktops lost out to laptops years ago. Smartphones have largely replaced laptops for on-the-go web connectivity, and newer laptop designs cling to relevance by imitating smartphones in portability, chip architecture, battery life, and internet connectivity. More importantly, smartphones bring internet connectivity and digital products to customers who can’t afford PCs, especially in developing countries.
A similar process of change is underway in data integration. Extract-Transform-Load (ETL) is a 1970s legacy approach to data integration that is predicated on technological conditions that no longer exist, namely high costs of computation, storage and network bandwidth. It fully predates the existence of the modern cloud.
In ETL, data is extracted from sources, transformed into data models for dashboards, reports, and other analytics uses, and loaded to a data warehouse. In order to transform data in ETL, an organization requires a custom software solution designed and built by engineers, involving complex workflow orchestration and scripting.
The central tradeoff made by ETL is to preserve technological resources at the expense of labor. Transforming data before loading it limits the computation and storage needs of the pipeline and destination. The downside is that the entire data pipeline must be designed and tailor-built around extracting every data model at the source, and producing every data model used by analysts in the destination. This means the pipeline must be redesigned and rebuilt every time either downstream or upstream data models change. In short, ETL is labor-intensive, brittle and complex. It is a bespoke technological solution that cannot be generalized, as well as an IT-centric activity that can easily become bottlenecked for lack of IT and engineering resources.
Today, ETL is most characteristic of two kinds of companies:
Tech-intensive companies that build and manage their own products as data sources, and therefore are responsible front-to-back for both their data source and data integration setup.
Large, established companies that can throw armies of IT professionals and data engineers at technical challenges.
The bespoke nature of an ETL pipeline means it can be incredibly powerful if a company has the budget, talent and infrastructure to sustain a complex data pipeline with minimal downtime. However, this barrier to entry is practically insurmountable for smaller companies.
Larger companies can use their resources to partially overcome the inherent disadvantages of ETL, but ETL is fundamentally a slower and more complicated system to use. As agility and responsiveness become more and more important, even large organizations will consider an alternative method of data integration that leverages modern technology, called ELT.
Extract-Load-Transform is a reimagining of data integration that decouples extraction and transformation, moving data into a data warehouse before applying business logic to transform it into data models for analytics. A signature benefit of ELT is that it is analyst-centric. Analysts, not engineers, perform transformations within the data warehouse environment using SQL-based tools. This enables shorter turnaround times for all analytics projects, such as reports and dashboards, and timelier delivery of insights.
Another benefit of staging transformations in the destination rather than the pipeline is that a third-party provider can produce a general extract-and-load solution for all users of common data sources. External data pipeline providers can develop highly specialized expertise around the idiosyncrasies of specific data sources so that their customers don’t have to. This allows companies to outsource and automate their data pipeline.
Outsourcing and automation relieve engineers of projects that are time-consuming, complicated, failure-prone, and a constant source of bottlenecks and frustration. Moreover, outsourcing frees engineers to pursue higher-value projects related to data science and product improvements.
ELT is enabled by the plummeting cost of computation, storage and internet bandwidth, which has culminated in modern cloud-based services and infrastruc ture. This presents both a challenge and an opportunity. The challenge is the explosion in the sheer volume, variety and complexity of data. The opportunity is cheap, powerful and scalable cloud-based data pipelines and infrastructure that enable a company to substitute technology for labor.
The basic use case where automated ELT shines is ingesting data from operational systems connected to common types of databases and common third-party applications. This approach foregoes configurability and customization for ease-of-use. Ease-of-use means that companies can pursue data integration with far fewer resources than with ETL. Automated ELT makes data integration accessible to early stage startups with small teams of analysts, rather than larger organizations with huge teams of data engineers.
We have discussed the differences between ETL and ELT elsewhere in detail, but in short, the differences are:
|Extract, transform, load||Extract, load, transform|
|Integrate summarized or subsetted|
|Integrate all raw data|
|Loading and transformation tightly|
|Loading and transformation decoupled|
|Longer time to load data||Shorter time to load data|
|Transformation failures stop pipeline||Transformation failures do not stop pipeline|
|Predict use cases and design data |
models beforehand or else fully
revise data pipeline
|Create new use cases and design |
data models any time
|Constant building and maintenance||Automated|
|Conserves computation and storage||Conserves labor|
|Use scripting languages for |
|Use SQL for transformations|
|Engineering/IT-centric; expert system||Analyst-centric; accessible|
to non-technical users
|Cloud-based or on-premise||Almost strictly cloud-based|
In multiple ways, ELT is more affordable, easier to use, and more practical. The importance of automated ELT will grow as:
ELT providers continue to build and improve connections to more sources
ELT providers continue to improve performance, security, reliability and scalability
ELT providers add more features to enhance analytics, such as the ability to examine the past values of fields (i.e. history tables)
ELT providers add stronger capabilities for custom integrations and more features for configurability and tuning
Data warehouse-based transformation tools gain usage and popularity
Firms that offer turnkey data-driven consulting services rely more heavily on automated ELT
Large companies continue to move operations to the cloud, using more commodity SaaS tools and cloud operational databases as data sources and cloud data warehouses as destinations
As ELT platforms continue to mature and the cost of cloud infrastructure continues to decline, most firms will eventually find no advantage to using ETL over ELT.
Like mainframes and minicomputers, ETL is unlikely to ever be fully extinct. There will always be highly specialized use cases, like companies that build and manage their own products as data sources. Even in those cases, ELT can powerfully complement a custom ETL pipeline wherever third-party SaaS programs are concerned.