ETL and ELT are well-known concepts, but they do not describe the most common agile analytics workflow. In practice, data teams often omit modeled transformations and perform extraction, loading and analysis (ELA). Accessible SQL-based data warehouse (DWH) solutions such as Redshift, BigQuery and Snowflake make this possible by performing large, complex transformations on the fly. Data teams that set up a modern cloud DWH and are unconstrained by ETL or ELT can get meaningful insights immediately.
The term “agile” is often associated with ELT, but it probably describes ELA better. Transformations in the sense of formal, Kimball-style dimensional modeling can take a long time to design and implement. Bypassing the transformation stage allows the BI team to pursue a more agile workflow, at least early in the growth of an organization. It is important not to conflate ELT and ELA, as they entail distinct tradeoffs.
This article compares the ETL, ELT and ELA approaches using code examples, focusing primarily on transformation and analysis. Extraction and loading are outside the article's scope; in the modern context, they are best outsourced. The end goal is to:
- Identify the different behaviours in the wild
- Consider how they can be used for separating concerns
- Examine how they affect data quality and code maintenance
There are marked differences in the pacing and distribution of effort between ELT and ELA. We can be certain that ELT will win the marathon, and ELA will win the sprint — the real question is when it is best to pass the baton.
ETL, ELT and ELA compared by example
We will use sample code and a simple data model with business rules and two questions to further our discussion. ETL, ELT and ELA are all discussed with commentary in relation to the code. It is worth noting that the verbose nature of the ETL code led us to move much of it, along with its discussion, to the appendix, where it can be referenced for completeness.
Sample data
Below is a common quasi-real-world data set — the shape and errors are based on examples commonly seen in business. It includes two tables, users and line_items, related by user_id. The model has some intentional flaws, which mean that the source system data requires work before it lines up with business logic. The data set is simplified in a way that favors the ELA approach: it has no integrity errors, null values or duplication, all of which would be expected in a real data set.
The three intentional flaws highlight the risks of reshaping all data inline.
- Confusing column names (is_paid): Names that act as a “false friend” are common and cause trouble in the real world. In our case, “is_paid” means that the checkout is completed, but does not mean the order had any monetary value. A better alternative: "is_completed".
- Obscured business logic: It is possible to get an answer that is plausible and seemingly complete given a particular question, only to find that there were hidden rules — these are often in distant tables or hidden enums. In this example, when business stakeholders ask for a breakdown by “locale,” they mean the locale of the user, not of the line items (as they may differ). This creates the need for a join and some thought every time "locale" is involved, which would be in most queries.
- Complex business logic (revenue): Revenue is often calculated using a number of columns, which can sometimes change. In this data set, revenue is defined as "The total gross value in an order which has been put fully through the checkout process and is not cancelled."
Sample data layout
| raw_data.users | raw_data.line_items |
| --- | --- |
| user_id | id |
| locale (where users signed up) | product_id |
| city | order_id |
| created_at | user_id |
| | created_at |
| | paid_at |
| | locale (where users completed purchase) |
| | is_paid |
| | is_cancelled |
| | unit_cost |
| | quantity |
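For reference, the raw tables could be declared roughly as follows. This is a hypothetical sketch: only the column names come from the layout above, so the types are assumptions.

```sql
-- Hypothetical DDL for the raw tables; column types are assumed for illustration.
CREATE TABLE raw_data.users (
    user_id    BIGINT,
    locale     TEXT,        -- where the user signed up
    city       TEXT,
    created_at TIMESTAMP
);

CREATE TABLE raw_data.line_items (
    id           BIGINT,
    product_id   BIGINT,
    order_id     BIGINT,
    user_id      BIGINT,    -- relates to raw_data.users
    created_at   TIMESTAMP,
    paid_at      TIMESTAMP,
    locale       TEXT,      -- where the user completed the purchase
    is_paid      BOOLEAN,   -- really means "checkout completed"
    is_cancelled BOOLEAN,
    unit_cost    NUMERIC,
    quantity     INTEGER
);
```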
Questions
Q1: Top 10 markets by paid_revenue
Q2: Customers from the locale "de" who spent more than 100 in their first paid order
Method
Given the challenge of the above thought exercise, the ETL(A) and ELT(A) approaches will both have two steps. The first is to create transformed tables in our DWH, modeled and enriched following a simple Kimball method. The second step is the same for both approaches and involves using SQL to run the final analysis off the modeled tables. For ELA, both questions will be answered in a single step, with all cleaning happening inline.
Analytics on modeled data – ETL & ELT analytics element
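The original queries are not reproduced here; the following is a minimal sketch of what analysis on modeled data might look like, assuming a hypothetical analytics.orders table with the user's locale denormalized onto it and precalculated order_revenue and is_first_paid_order columns.

```sql
-- Q1: Top 10 markets by paid revenue, straight off the modeled table
SELECT locale, SUM(order_revenue) AS paid_revenue
FROM analytics.orders
GROUP BY locale
ORDER BY paid_revenue DESC
LIMIT 10;

-- Q2: Customers from "de" who spent more than 100 in their first paid order
SELECT user_id, order_revenue
FROM analytics.orders
WHERE locale = 'de'
  AND is_first_paid_order
  AND order_revenue > 100;
```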
If the data is already modeled through ETL or ELT, the SQL for answering the two questions will look like the above. Modeling provides the following advantages:
- SQL required is exceptionally short – 7 and 6 lines of code, respectively
- Close to error-proof even for the most inexperienced SQL user
- Joins are avoided for simple queries (and those that remain are simple)
- Common utility columns are available thanks to precalculation
- Naming is clear and self-evident
Good modeling allows greater confidence in the data and gives analysts the opportunity to focus on insights rather than navigating SQL and data cleaning. However, these neat output queries come with a cost in the form of the transformation layer.
ELT transformation
ELT transformations are typically written exclusively in SQL. This offers a number of advantages over ETL (which can be seen by example in the appendix).
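The original 51-line transformation is not reproduced here; the sketch below is a condensed, Postgres-flavored approximation that builds the hypothetical analytics.orders table used in the queries above.

```sql
-- Condensed sketch of an ELT transformation: model orders from the raw data
CREATE TABLE analytics.orders AS
WITH order_totals AS (
    SELECT
        order_id,
        user_id,
        MIN(paid_at)              AS paid_at,
        BOOL_AND(is_paid)         AS is_completed,   -- rename the "false friend" column
        BOOL_OR(is_cancelled)     AS is_cancelled,
        SUM(unit_cost * quantity) AS gross_value
    FROM raw_data.line_items
    GROUP BY order_id, user_id
),

paid_orders AS (
    SELECT
        *,
        gross_value AS order_revenue,                -- revenue rule defined once, centrally
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY paid_at) = 1 AS is_first_paid_order
    FROM order_totals
    WHERE is_completed AND NOT is_cancelled          -- only paid, non-cancelled orders carry revenue
)

SELECT
    po.*,
    u.locale,                                        -- the business "locale" is the user's locale
    u.city
FROM paid_orders AS po
JOIN raw_data.users AS u USING (user_id);
```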
The full ELT SQL runs to 51 lines, less than half the length of the 117-line ETL (Python) equivalent that can be found in the appendix. The concerns of processing power and memory usage are moved to the DB systems. This plays to the strengths of cloud-based providers, particularly dynamically scaling compute solutions like BigQuery and Snowflake.
Transformation, in either ETL or ELT, needs planning and has an associated time cost. Although an actual rebuild might be significantly faster with modern database solutions, the human element of implementing new business logic in the transformation/modeling layer is slow. Furthermore, only a certain percentage of possible needs will be designed into any given model, and changes will always be needed. A strict transformation layer can block analysts from getting time-sensitive insights.
ELA
Extract-load-analyze is hard to beat over a short distance. It is by far the most agile approach to analytics and saves on development time, lines of code, and planning over a small number of queries.
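Again, the original queries are not reproduced here; the following condensed sketch shows how the two questions might be answered directly against the raw tables, with all business rules applied inline.

```sql
-- Q1: Top 10 markets by paid revenue, cleaning the data inline
SELECT
    u.locale                        AS market,        -- the business "locale" is the user's locale
    SUM(li.unit_cost * li.quantity) AS paid_revenue   -- revenue rule applied inline
FROM raw_data.line_items AS li
JOIN raw_data.users      AS u ON u.user_id = li.user_id
WHERE li.is_paid                                      -- remember: "paid" really means "completed"
  AND NOT li.is_cancelled
GROUP BY u.locale
ORDER BY paid_revenue DESC
LIMIT 10;

-- Q2: Customers from "de" who spent more than 100 in their first paid order
WITH order_revenue AS (
    SELECT
        user_id,
        order_id,
        MIN(paid_at)              AS paid_at,
        SUM(unit_cost * quantity) AS revenue          -- the same revenue rule, repeated
    FROM raw_data.line_items
    WHERE is_paid AND NOT is_cancelled
    GROUP BY user_id, order_id
),
first_orders AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY paid_at) AS order_rank
    FROM order_revenue
)
SELECT u.user_id, u.city, fo.revenue AS first_order_revenue
FROM first_orders AS fo
JOIN raw_data.users AS u ON u.user_id = fo.user_id
WHERE fo.order_rank = 1
  AND fo.revenue > 100
  AND u.locale = 'de';
```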
The ELA code answers our questions in 11 and 23 lines, respectively, compared with the 7 and 6 lines of the analytics sections associated with ETL and ELT. However, extract-load-analyze skips the 117 and 51 lines of preparation associated with ETL and ELT, respectively, saving a total of 30 lines over the ELT approach. Especially if combined with SaaS extraction and loading, this can mean a half-day turnaround from DWH setup to meaningful insights. Still, as with anything, ELA has its drawbacks:
- Repetition: Already in the queries above there is repetition of meaningful business logic: the order revenue calculation and its paid/cancelled filter. DRY coding style is as relevant now as it was when Hunt and Thomas first coined it. Across a larger number of disparate queries, repeating logic leads to three problems:
  - Difficult migration if definitions change
  - Differing opinions across reports due to human error or failed migration
  - A larger code base, as the lines initially saved by skipping transformation are lost to repetition
- Decentralisation and Quality: When columns and their contents are left as is, the opportunity to improve quality and clarity is lost. Each new query requires thought and investigation to ensure data quality, as these learnings are not centralized and solidified.
- Increased Complexity: Data has become more accessible overall, but the SQL needed to use it has become more complex. The inexperienced are more able to get to the data but less able to use it.
Synopsis
Unsurprisingly, both ELT and ELA have advantages and disadvantages. Differences in cost and efficiency are largely a matter of where and when rather than which is better, as the ELT vs. ELA distinction is more a matter of shunting the same work to a different place. ELA is no doubt the fastest and cheapest way to create a reporting structure for a startup's Series A funding round. It does not require the participation of data architects and engineers, only SQL-savvy junior team members. Reports can be delivered immediately rather than after three to six months of modeling. However, there will come a point when quality and scalability concerns tip the cost-benefit balance back towards ELT. When this happens, cost and time will be paid back with interest to repair a flawed structure. Balancing the two worlds is no doubt best — indeed, I use a hybrid approach to feed the benefits of agility from ELA into the stability of ELT, but that's a story for another article.
Appendix
ETL transformation
The ETL is presented as Python pseudocode that loads data from an imagined Postgres DB. Python was chosen as it is arguably the most popular language of data, and by some measures the most popular language overall.
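The full 117-line script is not reproduced here; the following is a condensed, hypothetical sketch of the shape such a script tends to take, assuming psycopg2 connections to the source database and the warehouse, and a hypothetical analytics.order_lines target table.

```python
# Condensed, hypothetical ETL sketch: extract line items from a source Postgres DB,
# apply the business rules in Python, and load the result into the warehouse.
import psycopg2
import psycopg2.extras

BATCH_SIZE = 10_000


def extract_user_locales(source_conn):
    """Pull the user_id -> locale mapping into memory for the transform step."""
    with source_conn.cursor() as cur:
        cur.execute("SELECT user_id, locale FROM raw_data.users")
        return dict(cur.fetchall())


def transform(row, user_locales):
    """Apply the business rules to a single line item."""
    order_id, user_id, unit_cost, quantity, is_paid, is_cancelled, paid_at = row
    revenue = unit_cost * quantity if is_paid and not is_cancelled else 0
    return (
        order_id,
        user_id,
        user_locales.get(user_id),  # the business "locale" is the user's locale
        paid_at,
        is_paid,                    # loaded as is_completed, the clearer name
        is_cancelled,
        revenue,
    )


def run(source_conn, target_conn):
    user_locales = extract_user_locales(source_conn)
    # A named (server-side) cursor avoids pulling the whole table into memory.
    with source_conn.cursor(name="line_items_cur") as src, target_conn.cursor() as dst:
        src.itersize = BATCH_SIZE
        src.execute(
            """
            SELECT order_id, user_id, unit_cost, quantity,
                   is_paid, is_cancelled, paid_at
            FROM raw_data.line_items
            """
        )
        while True:
            batch = src.fetchmany(BATCH_SIZE)
            if not batch:
                break
            rows = [transform(r, user_locales) for r in batch]
            psycopg2.extras.execute_values(
                dst,
                """
                INSERT INTO analytics.order_lines
                    (order_id, user_id, locale, paid_at,
                     is_completed, is_cancelled, revenue)
                VALUES %s
                """,
                rows,
            )
    target_conn.commit()  # flush all batches in a single transaction
```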
The full Python code has many concerns and is by no means trivial, even though the load and, in part, the extract methods are glossed over. Despite these omissions, it runs to 117 lines. We can see a number of very distinct skill sets intertwined.
- Performance and hardware: Row batching, management of the connection cursor, and the gotchas associated with flushing.
- Python & SQL mix: Two separate languages are required for this ETL process. The mixing of concerns that occurs, and the complexity associated with ETL in-house, are part of the reason why it is quickly becoming a legacy approach.