Data modeling and transformation sometimes get in the way of agility.
ETL and ELT are well-known concepts, but they do not describe the most common agile analytics workflow. In practice, data teams often omit modeled transformations and perform extraction, loading and analysis (ELA). Accessible SQL-based data warehouse (DWH) solutions such as Redshift, BigQuery and Snowflake make this possible by performing large, complex transformations on the fly. Data teams that set up a modern cloud DWH and are unconstrained by ETL or ELT can get meaningful insights immediately.
The term “agile” is often associated with ELT, but it probably describes ELA better. Transformations in the sense of formal, Kimball-style dimensional modelling can take a long time to design and implement. Bypassing the transformation stage allows the BI team to pursue a more agile workflow, at least early in the growth of an organization. It is important not to conflate ELT and ELA, as they entail distinct tradeoffs.
This article compares the ETL, ELT and ELA approaches through code examples, focusing on transformation and analysis. Extraction and loading are outside the article's scope; in the modern context, they are best outsourced. The end goal is to:
Identify the different behaviours in the wild
Consider how they can be used for separating concerns
Examine how they affect data quality and code maintenance
There are marked differences in the pacing and distribution of effort between ELT and ELA. We can be certain that ELT will win the marathon, and ELA will win the sprint — the real question is when it is best to pass the baton.
We will use sample code and a simple data model, with business rules and two questions, to further our discussion. ETL, ELT and ELA are all discussed with commentary in relation to the code. It is worth noting that, during the writing of this article, the verbose nature of the ETL code led us to move much of it, along with its discussion, to the appendix, where it can be referenced for completeness.
Below is a common quasi-real world data set — the shape and errors are based on examples commonly seen in business. It includes two tables, users and line_items, related by user_id. The model has some intentional flaws, which mean that the source system data requires work before it lines up with business logic. The data set is simplified, which lends itself to the ELA approach. The data set has no integrity errors, null values or duplication, all of which would be expected in a real data set.
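For orientation, a minimal sketch of the two raw tables is shown below. The column names are taken from the queries later in this article and the schema name raw_data matches them; the data types are assumptions.
-- Sketch of the assumed raw tables (column names come from the queries below; types are assumptions)
CREATE TABLE raw_data.users (
id BIGINT,
locale VARCHAR(8), -- locale of the user, e.g. 'de'
created_at TIMESTAMP -- registration timestamp
);
CREATE TABLE raw_data.line_items (
id BIGINT,
user_id BIGINT, -- relates to raw_data.users.id
order_id BIGINT,
product_id BIGINT,
locale VARCHAR(8), -- locale of the line item, which may differ from the user's
is_paid BOOLEAN, -- the "false friend" discussed below: checkout completed, not necessarily paid money
is_cancelled BOOLEAN,
is_free BOOLEAN,
unit_cost NUMERIC(12,2),
quantity INTEGER,
created_at TIMESTAMP,
paid_at TIMESTAMP
);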
The three intentional flaws highlight the risks of reshaping all data inline.
Confusing column names (is_paid): names that act as a “false friend” are common and cause trouble in the real world. In our case, “is_paid” means that the checkout is completed, but does not mean the order had any monetary value. A better alternative: "is_completed".
Obscured business logic (locale): It is possible to get an answer that is plausible and seemingly complete for a given question, only to find that there were hidden rules, often buried in distant tables or obscure enums. In this example, when business stakeholders ask for a breakdown by "locale," they mean the locale of the user, not of the line items (the two may differ). This creates the need for a join, and some thought, every time "locale" is involved, which is most queries (a minimal example follows this list of flaws).
Complex business logic (revenue): Revenue is often calculated using a number of columns, which can sometimes change. In this data set, revenue is defined as "The total gross value in an order which has been put fully through the checkout process and is not cancelled."
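To make the flaws concrete, the snippet below shows the boilerplate that every ad-hoc query against the raw tables has to repeat: a join to pick up the user's locale and a CASE expression encoding the revenue rule. It is illustrative only and uses the assumed column names from the sketch above.
-- Illustrative only: the recurring boilerplate the three flaws impose on raw-data queries
SELECT
u.locale, -- the user's locale, not l.locale
COALESCE(CASE
WHEN l.is_paid AND NOT l.is_cancelled -- "is_paid" really means the checkout was completed
THEN l.quantity * l.unit_cost
END,0) AS gross_revenue -- the revenue rule, restated inline
FROM raw_data.line_items AS l
LEFT JOIN raw_data.users AS u ON u.id = l.user_id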
Q1: Top 10 markets by paid_revenue
Q2: Customers from the locale "de" who spent more than 100 in their first paid order
Given the above thought exercise, the ETL(A) and ELT(A) approaches will both take two steps. The first is to create transformed tables in our DWH, modeled and enriched following a simple Kimball method. The second step is the same for both approaches: using SQL to run the final analysis off the modeled tables. For ELA, both questions are answered in a single step, with all cleaning happening inline.
-- Q1: Top 10 markets by paid_revenue
SELECT
locale,
SUM(gross_revenue) AS total_revenue
FROM facts.line_items
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
----------- 7 Lines
-- Q2: Customers from "de" who spent more than 100 in their first paid order
SELECT
user_id,
SUM(gross_revenue) AS order_revenue
FROM facts.line_items
WHERE locale = 'de' AND is_first_paid_order
GROUP BY 1
HAVING SUM(gross_revenue) > 100
------------ 7 Lines
If the data is already modeled through ETL or ELT, the SQL for answering the two questions will look like the above. Modeling provides the following advantages:
The SQL required is exceptionally short: seven lines per question
Close to error-proof, even for the most inexperienced SQL user
Joins are avoided for simple queries (and remain simple where they are needed)
Common utility columns are available thanks to precalculation
Naming is clear and self-evident
Good modeling allows greater confidence in the data and gives analysts the opportunity to focus on insights rather than navigating SQL and data cleaning. However, these neat output queries come with a cost in the form of the transformation layer.
ELT transformations are typically written exclusively in SQL. This offers a number of advantages over ETL (which can be seen by example in the appendix).
-- Create facts.line_items to cover the most important patterns associated with revenue etc.
CREATE TABLE facts.line_items
AS (
WITH line_items_with_users AS (
SELECT
l.id AS line_item_id,
l.*,
u.locale AS user_locale,
l.locale AS order_locale,
---
u.created_at AS registration_cohort_at,
---
l.is_paid AS is_completed,
---
COALESCE(l.quantity * l.unit_cost,0) AS gross_price,
COALESCE(CASE
WHEN l.is_paid AND NOT l.is_cancelled THEN l.quantity * l.unit_cost
END,0) AS gross_revenue
FROM raw_data.line_items AS l
LEFT JOIN raw_data.users AS u ON u.id = l.user_id
),
line_items_ranked AS (
SELECT
*,
SUM(gross_revenue) OVER (PARTITION BY order_id) AS order_gross_revenue,
COUNT(*) OVER (PARTITION BY order_id) AS line_items_in_order,
COUNT(*) OVER (
PARTITION BY order_id
ORDER BY paid_at ASC, line_item_id ASC
ROWS UNBOUNDED PRECEDING
) AS order_line_items_ranking,
COUNT(*) OVER (
PARTITION BY user_id
ORDER BY paid_at ASC, line_item_id ASC
ROWS UNBOUNDED PRECEDING
) AS user_line_item_ranking,
SUM((gross_revenue > 0)::INT) OVER (
PARTITION BY user_id
ORDER BY paid_at ASC, line_item_id ASC
ROWS UNBOUNDED PRECEDING
) AS paid_user_line_item_ranking
FROM line_items_with_users
),
orders_ranked AS (
SELECT
*,
SUM((order_line_items_ranking = 1)::INT) OVER (
PARTITION BY user_id
ORDER BY user_line_item_ranking ASC
ROWS UNBOUNDED PRECEDING
) AS user_order_ranking,
SUM((order_line_items_ranking = 1 AND order_gross_revenue > 0)::INT) OVER (
PARTITION BY user_id
ORDER BY user_line_item_ranking ASC
ROWS UNBOUNDED PRECEDING
) AS paid_user_order_ranking
FROM line_items_ranked
)
SELECT
line_item_id,
user_id,
product_id,
order_id,
user_locale AS locale,
order_locale,
---
created_at,
paid_at,
registration_cohort_at,
---
is_completed,
is_cancelled,
is_free,
(user_order_ranking = 1) AS is_first_order,
(order_gross_revenue > 0 AND paid_user_order_ranking = 1) AS is_first_paid_order,
---
unit_cost,
quantity,
gross_price,
gross_revenue,
--- ranking utils
line_items_in_order,
user_line_item_ranking,
paid_user_line_item_ranking,
user_order_ranking,
paid_user_order_ranking
FROM orders_ranked
);
-------- 87 lines (incl. column listing)
The code above is shorter than the ETL (Python) equivalent in the appendix (87 lines of SQL against close to 100 lines of Python, and the Python only sketches the load step), and it keeps the whole transformation in a single language and system. The concerns of processing power and memory usage are moved to the DB systems. This plays to the strengths of cloud-based providers, particularly dynamically scaling compute solutions like BigQuery and Snowflake.
Transformation, in either ETL or ELT, needs planning and has an associated time cost. Although an actual rebuild might be significantly faster with modern database solutions, the human element of implementing new business logic in the transformation/modeling layer is slow. Furthermore, only a certain percentage of possible needs will be designed into any given model, and changes will always be needed. A strict transformation layer can block analysts from getting time-sensitive insights.
Extract-load-analyze is hard to beat over a short distance. It is by far the most agile approach to analytics and saves on development time, lines of code, and planning over a small number of queries.
-- Q1: Top 10 markets by paid_revenue
SELECT
u.locale,
COALESCE(SUM(CASE
WHEN l.is_paid AND NOT l.is_cancelled
THEN l.quantity * l.unit_cost
END),0) AS total_revenue
FROM raw_data.line_items AS l
LEFT JOIN raw_data.users AS u ON u.id = l.user_id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
------------------- 11 Lines
-- Q2: Customers from "de" who spent more than 100 in their first paid order
WITH paid_de_user_order AS (
SELECT
l.user_id,
l.order_id,
MIN(l.paid_at) AS order_paid_at,
COALESCE(SUM(CASE
WHEN l.is_paid AND NOT l.is_cancelled
THEN l.quantity * l.unit_cost
END),0) AS order_revenue
FROM raw_data.line_items AS l
LEFT JOIN raw_data.users AS u ON u.id = l.user_id
WHERE u.locale = 'de'
GROUP BY 1,2
),
orders_ranked AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_paid_at) AS order_ranking
FROM paid_de_user_order
WHERE order_revenue > 0
)
SELECT user_id, order_revenue FROM orders_ranked
WHERE order_ranking = 1 AND order_revenue > 100
-------------------- 23 Lines
The ELA code answers our questions in 11 and 23 lines, respectively, versus the seven lines each needed once the data has been modeled. However, extract-load-analyze skips the preparation entirely: roughly 100 lines of Python for ETL, or 87 lines of SQL for ELT. Over these two questions alone, that leaves ELA more than 60 lines ahead of the ELT approach. Especially if combined with SaaS extraction and loading, this can mean a half-day turnaround from DWH setup to meaningful insights. Still, as with anything, ELA has its drawbacks:
Repetition: Already above there is repetition of meaningful business logic: total_revenue and order_revenue. The DRY principle is as relevant now as it was when Hunt and Thomas first coined it. Across a larger number of disparate queries, repeating logic leads to three problems (a sketch of one mitigation follows this list):
- Difficult migration if definitions change
- Differing opinions across reports due to human error or a failed migration
- A larger code base, as the lines initially saved by skipping transformation are lost to repetition
Decentralisation and Quality: When columns and their contents are left as is, the opportunity to improve quality and clarity is lost. Each new query requires thought and investigation to ensure data quality, as these learnings are not centralized and solidified.
Increased Complexity: The absolute accessibility of the data has gone up, but so has the complexity of the SQL needed to use it correctly. Inexperienced users are more able to reach the data but less able to use it well.
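As a hedged illustration of how the repetition drawback can be softened without a full model, the sketch below centralizes the revenue rule and the user locale in a single view. The view name analysis.line_item_revenue is ours, not part of the example data model; it is a middle ground, not the hybrid approach discussed in the conclusion.
-- Sketch only: a lightweight view that centralizes the trickiest definitions
CREATE VIEW analysis.line_item_revenue AS
SELECT
l.*,
u.locale AS user_locale,
COALESCE(CASE
WHEN l.is_paid AND NOT l.is_cancelled
THEN l.quantity * l.unit_cost
END,0) AS gross_revenue
FROM raw_data.line_items AS l
LEFT JOIN raw_data.users AS u ON u.id = l.user_id;
If the revenue definition later changes, it changes in one place, while ad-hoc queries stay close to the short ELA examples above.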
Unsurprisingly, both ELT and ELA have advantages and disadvantages. Differences in cost and efficiency are largely a matter of where and when rather than which is better, as the ELT vs. ELA distinction is more a matter of shunting the same work to a different place. ELA is no doubt the fastest and cheapest way to create a reporting structure for a startup Series A funding round. It does not require the participation of data architects and engineers, only SQL-savvy junior team members. Reports can be delivered immediately rather than after three to six months of modeling. However, there will come a point when quality and scalability concerns tip the cost benefit back towards ELT. When this happens, cost and time will be paid back with interest to repair a flawed structure. Balancing the two worlds is no doubt best — indeed, I use a hybrid approach to feed the benefits of agility from ELA into the stability of ELT, but that's a story for another article.
ETL is presented as Python pseudocode that reads from an imagined Postgres DB, transforms the rows in Python, and batch-inserts the results. Python was chosen as it is arguably the most popular language of data, and by some measures the most popular language overall.
## Class that will manage a Load and Transform from raw_data.line_items
## Pseudo Code
import psycopg2
class RankOrder():
def __init__(self):
self.users = {}
self.column_names = []
self.current_order_id = None
self.current_order = []
def query(self,cursor):
### Involves Python & SQL mix
query = """
SELECT
l.id AS line_item_id,
l.*,
u.locale AS locale,
l.locale AS order_locale,
----
u.created_at AS registration_cohort_at,
--
l.is_paid AS is_completed,
---
COALESCE(l.quantity * l.unit_cost,0) AS gross_price,
COALESCE(CASE
WHEN l.is_paid AND NOT l.is_cancelled
THEN l.quantity * l.unit_cost
END,0) AS gross_revenue
FROM raw_data.line_items AS l
LEFT JOIN raw_data.users AS u ON u.id = l.user_id
ORDER BY l.user_id, l.order_id, l.paid_at
"""
cursor.execute(query)
self.column_names = [desc[0] for desc in cursor.description]
return cursor
def get_column(self,row,column_name):
index = self.column_names.index(column_name)
return row[index]
def process_orders(self,order):
line_items_in_order = len(order)
if line_items_in_order == 0:
return
user_id = self.get_column(order[0],'user_id')
user_tally = self.users[user_id]
order_gross_revenue = sum([ self.get_column(line_item,'gross_revenue') for line_item in order])
output_orders = []
for i, line_item in enumerate(order):
order_line_items_ranking = i + 1  # rankings are 1-based, matching the SQL version
is_paid_line = self.get_column(line_item,'gross_revenue') > 0
user_line_item_ranking = i + user_tally["user_line_item_ranking"]
paid_user_line_item_ranking = i + user_tally["paid_user_line_item_ranking"] if is_paid_line else None
is_first_order = user_line_item_ranking <= line_items_in_order
is_first_paid_order = is_paid_line and paid_user_line_item_ranking <= line_items_in_order
new_cols = (
is_first_order, is_first_paid_order,
line_items_in_order, order_line_items_ranking,
user_tally["user_order_ranking"], user_tally["paid_user_order_ranking"],
order_gross_revenue
)
output_orders.append(line_item + new_cols)
self.batch_insert(output_orders)
########## keep the tally up to date
self.users[user_id]['user_order_ranking'] += 1
self.users[user_id]['user_line_item_ranking'] += line_items_in_order
self.users[user_id]['paid_user_line_item_ranking'] += sum(
1 for line_item in order if self.get_column(line_item,'gross_revenue') > 0
)
if order_gross_revenue > 0:
self.users[user_id]['paid_user_order_ranking'] += 1
return
def batch_insert(self,output_orders):
###### pseudo code to insert the enriched rows into the target table
pass
def process_row(self,row):
order_id = self.get_column(row,'order_id')
user_id = self.get_column(row,'user_id')
if user_id not in self.users:
self.users[user_id] = {
'user_order_ranking':1,
'paid_user_order_ranking':1,
'user_line_item_ranking':1,
'paid_user_line_item_ranking':1
}
if self.current_order_id != order_id:
self.process_orders(self.current_order)
self.current_order_id = order_id
self.current_order = []
self.current_order.append(row)
def run(self):
## hardware batching and scalability concerns
with psycopg2.connect('MY_CONNECTION_STRING') as conn:
with conn.cursor(name='named_cursor_for_large_result') as cursor:
cursor.itersize = 20000
cursor = self.query(cursor)
for row in cursor:
self.process_row(row)
## flush - batching gotcha
self.process_orders(self.current_order)
The Python code above juggles many concerns and is by no means trivial, even though the load step and parts of the extract are glossed over. Despite these omissions, it runs to close to 100 lines of code. A number of very distinct skill sets are intertwined.
Performance and hardware: row batching, management of the connection cursor, and the gotchas associated with flushing.
Python & SQL mix: two separate languages are required for this ETL process. The mixing of concerns, and the complexity of maintaining ETL in-house, are part of the reason it is quickly becoming a legacy approach.