23 Jul 2019 | Data Strategy

Extract, Load and Analyze – the Agile Way

Michael O'Toole
Michael O'Toole
Extract, Load and Analyze  – the Agile Way
Data modeling and transformation sometimes get in the way of agility.

ETL and ELT are well-known concepts, but they do not describe the most common agile analytics workflow. In practice, data teams often omit modeled transformations and perform extraction, loading and analysis (ELA). Accessible SQL-based data warehouse (DWH) solutions such as Redshift, BigQuery and Snowflake make this possible by performing large, complex transformations on the fly. Data teams that set up a modern cloud DWH and are unconstrained by ETL or ELT can get meaningful insights immediately.

The term “agile” is often associated with ELT, but it probably describes ELA better. Transformations in the sense of formal, Kimball-style dimensional modelling can take a long time to design and implement. Bypassing the transformation stage allows the BI team to pursue a more agile workflow, at least early in the growth of an organization. It is important not to conflate ELT and ELA, as they entail distinct tradeoffs.

This article seeks to compare the approaches of ETL, ELT and ELA, by use of coded examples — primarily transformation and analysis. Extraction and loading are outside of the article's scope; in the modern context, they are best outsourced. The end goal is to:

  • Identify the different behaviours in the wild
  • Consider how they can be used for separating concerns
  • How they affect data quality and code maintenance

There are marked differences in the pacing and distribution of effort between ELT and ELA. We can be certain that ELT will win the marathon, and ELA will win the sprint — the real question is when it is best to pass the baton.

ETL, ELT & ELA Compared by Example

We will use sample code and a simple data model with business rules and two questions to further our discussion. ETL, ELT & ELA are all discussed with commentary in relation to the code. It is worth noting that during the writing of this article, the verbose nature of the ETL led us to move much of its code and discussion to the appendix, where it can be referenced for completeness.

Sample Data

Below is a common quasi-real world data set — the shape and errors are based on examples commonly seen in business. It includes two tables, users and line_items, related by user_id. The model has some intentional flaws, which mean that the source system data requires work before it lines up with business logic. The data set is simplified, which lends itself to the ELA approach. The data set has no integrity errors, null values or duplication, all of which would be expected in a real data set.

The three intentional flaws highlight the risks of reshaping all data inline.

  • Confusing column names (is_paid): names that act as a “false friend” are common and cause trouble in the real world. In our case, “is_paid” means that the checkout is completed, but does not mean the order had any monetary value. A better alternative: "is_completed".
  • Obscured business logic: It is possible to get an answer that is plausible and seemingly complete given a particular question, only to find that there were hidden rules — these are often in distant tables or hidden enurums. In this example, when business stakeholders wish for a breakdown by “locale,” they mean the locale of the user, not the line items (as they may differ). This creates the need for a join and some thought every time "locale" is involved, which would be in most queries.
  • Complex business logic (revenue): Revenue is often calculated using a number of columns, which can sometimes change. In this data set, revenue is defined as "The total gross value in an order which has been put fully through the checkout process and is not cancelled."

Sample Data Layout

raw_data.users raw_data.line_items
* user_id * id
* locale (where users signed up) * product_id
* city * order_id
* created_at * user_id
* created_at
* paid_at
* locale (where users completed purchase)
* is_paid
* is_cancelled
* unit_cost
* quantity

Questions

Q1: Top 10 markets by paid_revenue Q2: Customers from the locale "de" who spent more than 100 in their first paid order

Method

Given the challenge of the above thought exercise the ETL(A) & ELT(A) approaches will both have two steps. The first is to create transformed tables in our DWH, which are modeled and enriched by a simple Kimball method. The second step is the same for both approaches and involves using SQL to run the final analysis off the modeled tables. Both questions for ELA will be completed in one step, with all cleaning happening inline.

Analytics on Modeled Data – ETL & ELT Analytics Element

-- Q1: Top 10 markets by paid_revenue
SELECT
	locale,
	SUM(gross_revenue) AS total_revenue
FROM facts.line_items
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
----------- 7 Lines
-- Q2: Customers "de" who spent more than 100 in their first paid order
SELECT
	user_id,
	SUM(gross_revenue) AS order_revenue
FROM facts.line_items
WHERE locale = 'de' AND is_first_paid_order
HAVING order_revenue > 100
------------ 6 Lines

If the data is already modeled through ETL or ELT, the SQL for answering the two questions will look like the above. Modeling provides the following advantages:

  • SQL required is exceptionally short – 7 and 6 lines of code, respectively
  • Close to error-proof even for the most inexperienced SQL user
    • Joins are avoided for simple queries (and would otherwise be simple)
    • Common utility columns are available due to precalculatation
  • Naming is clear and self-evident

Good modeling allows greater confidence in the data and gives analysts the opportunity to focus on insights rather than navigating SQL and data cleaning. However, these neat output queries come with a cost in the form of the transformation layer.

ELT Transformation

ELT transformations are typically written exclusively in SQL. This offers a number of advantages over ETL (which can be seen by example in the appendix).

-- Create a facts.line_items to cover most important patterns associated with revenue etc
CREATE TABLE facts.line_items
AS (
WITH line_items_with_users AS (
	SELECT
		l.id AS line_item_id,
		l.*,
		u.locale AS locale,
		l.locale AS order_locale,
		---
		u.created_at AS registration_cohort_at,
		---
		l.is_paid AS is_completed,
		---
		COALESCE(l.quantity * l.unit_cost,0) AS gross_price,
		COALESCE(CASE
			WHEN l.is_paid AND NOT l.is_cancelled THEN gross_price
		END,0) AS gross_revenue,
		SUM(gross_revenue) OVER (
			PARTITION BY l.order
			ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
			ORDER BY l.paid_at ASC, l.id ASC
		) AS order_gross_revenue,
		SUM(1) OVER (
			PARTITION BY l.order
			ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
			ORDER BY l.paid_at ASC, l.id ASC
		) AS line_items_in_order,
		SUM(1) OVER (
			PARTITION BY l.order
			ROWS UNBOUNDED PRECEDING ORDER BY l.paid_at ASC, l.id ASC
		) AS order_line_items_ranking,
		SUM(1) OVER (
			PARTITION BY l.user_id
			ROWS UNBOUNDED PRECEDING ORDER BY l.paid_at ASC, l.id ASC
		) AS user_line_item_ranking,
		SUM( (gross_revenue > 0)::SMALLINT ) OVER (
			PARTITION BY l.user_id
			ROWS UNBOUNDED PRECEDING ORDER BY l.paid_at ASC, l.id ASC
		) AS paid_user_line_item_ranking,
	FROM raw_data.line_items AS l
	LEFT JOIN raw_data.users AS u ON u.id = l.user_id
)
SELECT
	line_item_id,
	user_id,
	product_id,
	order_id,
	locale,
	order_locale,
	--- 
	created_at,
	paid_at,
	registration_cohort_at,
	---
	is_completed,
	is_cancelled,
	is_free,
	COALESCE(user_line_item_ranking <= line_items_in_order,FALSE) AS is_first_order,
	(order_gross_revenue > 0 AND is_first_order) AS is_first_paid_order,
	---
	unit_cost,
	quantity,
	gross_price,
	gross_revenue,
	--- ranking utils
	line_items_in_order,
	user_line_item_ranking,
	paid_user_line_item_ranking,
	SUM((order_line_items_ranking = 1)::SMALLINT) OVER (
		PARTITION BY user_id, line_item_id
		ROWS UNBOUNDED PRECEDING ORDER BY user_line_item_ranking ASC
	) AS user_order_ranking,
	SUM((order_line_items_ranking = 1 AND order_gross_revenue > 0)::SMALLINT) OVER (
		PARTITION BY user_id, line_item_id
		ROWS UNBOUNDED PRECEDING ORDER BY user_line_item_ranking ASC
	) AS paid_user_order_ranking,
FROM line_items_with_users
);
-------- 51 line  71 lines (incl column listing)

The code above is less than half the length (51 vs. 117 lines) of the ETL (Python) equivalent that can be found in the appendix. The concerns of processing power and memory usage are moved to the DB systems. This plays to the strengths of cloud-based providers, particularly to dynamically scaling compute solutions like BigQuery and Snowflake.

Transformation, in either ETL or ELT, needs planning and has an associated time cost. Although an actual rebuild might be significantly faster with modern database solutions, the human element of implementing new business logic in the transformation/modeling layer is slow. Furthermore, only a certain percentage of possible needs will be designed into any given model, and changes will always be needed. A strict transformation layer can block analysts from getting time-sensitive insights.

ELA

Extract-load-analyze is hard to beat over a short distance. It is by far the most agile approach to analytics and saves on development time, lines of code, and planning over a small number of queries.


-- Q1: Top 10 markets by paid_revenue SELECT u.locale, COALESCE(SUM(CASE WHEN l.is_paid AND NOT l.is_canelled THEN l.quantity * l.unit_cost END),0) AS total_revenue FROM raw_data.line_items AS l LEFT JOIN raw_data.users AS u ON u.id = l.user_id GROUP BY 1 ORDER BY 2 DESC LIMIT 10 ------------------- 11 Lines -- Q2: Customers "de" who spent more than 100 in their first paid order WITH paid_de_user_order AS ( SELECT l.user_id, l.order_id, MIN(l.paid_at) AS order_paid_at, COALESCE(SUM(CASE WHEN l.is_paid AND NOT l.is_cancelled THEN l.quantity * l.unit_cost END),0) AS order_revenue FROM raw_data.line_items AS l LEFT JOIN raw_data.users AS u ON u.id = l.user_id WHERE u.locale = 'de' GROUP BY 1,2 HAVING order_revenue > 100 ), orders_ranked AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_paid_at) AS order_ranking FROM paid_de_user_order ) SELECT user_id,order_revenue FROM orders_ranked WHERE order_ranking = 1 -------------------- 23 Lines

The ELA code answers our questions in 11 and 23 lines, respectively, which is double the 7 and 6 lines of the analytics sections associated with ETL and ELT. However, extract-load-analyze skips 117 and 51 lines of preparation associated with ETL and ELT, respectively. This saves a total of 30 lines over the ELT approach. Especially if combined with SaaS extraction and loading, this can mean a half day turnaround from DWH setup to meaningful insights. Still, as with anything, ELA has its drawbacks:

  • Repetition: Already above there is repetition of meaningful business logic: total_revenue and order_revenue. DRY coding style is as relevant now as it was when Hunt and Thomas first coined it. Across a larger number of disparate queries, repeating logic leads to three problems:
    • Difficult migration if definitions change
    • Differing opinions across reports due to human error or failed migration
    • Larger code base as lines initially saved on skipping transformation are lost via repetition
  • Decentralisation and Quality: When columns and their contents are left as is, the opportunity to improve quality and clarity is lost. Each new query requires thought and investigation to ensure data quality, as these learning are not centralized and solidified.
  • Increased Complexity: The absolute level of accessibility of data has gone up, but so has the complexity of SQL. The inexperienced are more able to get to data but less able to use it.

Synopsis

Unsurprisingly, both ELT and ELA have advantages and disadvantages. Differences in cost and efficiency are largely a matter of where and when rather than which is better, as the ELT vs. ELA distinction is more a matter of shunting the same work to a different place. ELA is no doubt the fastest and cheapest way to create a reporting structure for a startup Series A funding round. It does not require the participation of data architects and engineers, only SQL-savvy junior team members. Reports can be delivered immediately rather than after three to six months of modeling. However, there will come a point when quality and scalability concerns tip the cost benefit back towards ELT. When this happens, cost and time will be paid back with interest to repair a flawed structure. Balancing the two worlds is no doubt best — indeed, I use a hybrid approach to feed the benefits of agility from ELA into the stability of ELT, but that's a story for another article.

Appendix

ETL Transformation

ETL is presented Python pseudocode serving to load data from an imagined Postgres DB. Python was chosen as it certainly the most popular language of data, and by some measures the most popular language overall.


## Class that will manage a Load and Transform from raw_data.line_items ## Pseudo Code import psycopg2 class RankOrder(): def __init__(): self.current_order_id = None self.current_order = [] def query(self,cursor): ### Involves Python & SQL mix query = """ SELECT l.id AS line_item_id, l.*, u.locale AS locale, l.locale AS order_locale, ---- u.created_at AS registration_cohort_at, -- l.is_paid AS is_completed, --- COALESCE(l.quantity * l.unit_cost,0) AS gross_price, COALESCE(CASE WHEN l.is_paid AND NOT l.is_cancelled THEN gross_price END,0) AS gross_revenue FROM raw_data.line_items AS l LEFT JOIN raw_data.users AS u ON u.id = l.user_id ORDER BY l.user_id, l.order_id, l.paid_at """ self.cursor.execute(query) self.column_names = [desc[0] for desc in cursor.description] return cursor def get_column(self,row,column_name): index = self.column_names.index(column_name) return row[index] def process_orders(self,order): line_items_in_order = len(order) if(line_items_in_order = 0): return user_id = self.get_column(order[0],'user_id') user_tally = self.users[user_id] order_gross_revenue = sum([ self.get_column(line_item,'gross_revenue') for line_item in order]) output_orders = [] for i, line_item in enumerate(order): order_line_items_ranking = i is_paid_line = self.get_column(line_item,'gross_revenue') > 0 user_line_item_ranking = i + user_tally["user_line_item_ranking"] paid_user_line_item_ranking = i + user_tally["paid_user_line_item_ranking"] if is_paid_line else None is_first_order = user_line_item_ranking <= line_items_in_order or False is_first_paid_order = (is_paid_line and paid_user_line_item_ranking <= line_items_in_order) or False new_cols = ( is_first_order, is_first_paid_order, line_items_in_order, order_line_items_ranking, user_tally["user_order_ranking"], user_tally["paid_user_order_ranking"], order_gross_revenue ) output_orders.append(line_item + new_cols) self.batch_insert(output_orders) ########## keep the tally up to date self.users[user_id]['user_order_ranking'] += 1 self.users[user_id]['user_line_item_ranking'] += lines_in_order if is_paid_line: self.users[user_id]['paid_user_line_item_ranking'] += lines_in_order if order_gross_revenue > 0 self.users[user_id]['paid_order_ranking'] += 1 return def batch_insert(self,output_orders): ###### sudo code to insert rows to table def process_row(self,row): order_id = self.get_column(row,'order_id') user_id = self.get_column(row,'user_id') is_paid_line = self.get_column(line_item,'gross_revenue') > 0 if self.users[user_id] is None: self.users[user_id] = { 'user_order_ranking':1, 'paid_user_order_ranking':1, 'user_line_item_ranking':1, 'paid_user_line_item_ranking':1 } if self.current_order_id != order_id: self.process_orders(self.current_order) lines_in_order = len(self.current_order); self.current_order_id = order_id self.current_order = [] self.current_order.append(row) def run(self): ## hardware batching and scalability concerns with psycopg2.connect('MY_CONNECTION_STRING') as conn: with conn.cursor(name='named_cursor_for_large_result') as cursor: cursor.itersize = 20000 cursor = self.query(cursor) for row in cursor: self.process_row(row) ## flush - batching gotcha self.process_orders(self.current_order)

The above Python code has many concerns and is by no means trivial, even though the load and in part the extract elements/methods are glossed over. Despite these omissions, there are 117 lines of code. We can see a number of very distinct skill sets intertwined. * Performance and hardware: Row batching and the management of connection cursor and gotchyas associated with flushing. * Python & SQL mix: Two separate languages are required for this ETL process. The mixing of concerns that occurs, and the complexity associated with ETL in-house, are part of the reason why it is quickly becoming a legacy approach.

Are You A Data Expert?

Start a free trial today.

Discover the smartest solution for data-driven results.