-- Q1: Top 10 markets by paid_revenue
SELECT locale,
  SUM(gross_revenue) AS total_revenue
FROM facts.line_items
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
----------- 7 Lines

-- Q2: Customers "de" who spent more than 100 in their first paid order
SELECT user_id, SUM(gross_revenue) AS order_revenue
FROM facts.line_items
WHERE locale = 'de' AND is_first_paid_order
GROUP BY 1
HAVING order_revenue > 100
------------ 6 Lines
If the data is already modeled through ETL or ELT, the SQL for answering the two questions will look like the above. Modeling provides the following advantages:
- Confidence: good modeling allows greater confidence in the data.
- Focus: analysts get the opportunity to concentrate on insights rather than navigating SQL and data cleaning.

However, these neat output queries come with a cost in the form of the transformation layer.
ELT transformations are typically written exclusively in SQL. This offers a number of advantages over ETL (which can be seen by example in the appendix).
-- Create facts.line_items to cover the most important patterns associated with revenue etc.
CREATE TABLE facts.line_items AS (
  WITH line_items_with_users AS (
    SELECT
      l.id AS line_item_id,
      l.*,
      u.locale AS locale,
      l.locale AS order_locale,
      ---
      u.created_at AS registration_cohort_at,
      ---
      l.is_paid AS is_completed,
      ---
      COALESCE(l.quantity * l.unit_cost, 0) AS gross_price,
      COALESCE(CASE WHEN l.is_paid AND NOT l.is_cancelled THEN gross_price END, 0) AS gross_revenue,
      SUM(gross_revenue) OVER (
        PARTITION BY l.order_id
        ORDER BY l.paid_at ASC, l.id ASC
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
      ) AS order_gross_revenue,
      SUM(1) OVER (
        PARTITION BY l.order_id
        ORDER BY l.paid_at ASC, l.id ASC
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
      ) AS line_items_in_order,
      SUM(1) OVER (
        PARTITION BY l.order_id
        ORDER BY l.paid_at ASC, l.id ASC
        ROWS UNBOUNDED PRECEDING
      ) AS order_line_items_ranking,
      SUM(1) OVER (
        PARTITION BY l.user_id
        ORDER BY l.paid_at ASC, l.id ASC
        ROWS UNBOUNDED PRECEDING
      ) AS user_line_item_ranking,
      SUM((gross_revenue > 0)::SMALLINT) OVER (
        PARTITION BY l.user_id
        ORDER BY l.paid_at ASC, l.id ASC
        ROWS UNBOUNDED PRECEDING
      ) AS paid_user_line_item_ranking
    FROM raw_data.line_items AS l
    LEFT JOIN raw_data.users AS u
      ON u.id = l.user_id
  )
  SELECT
    line_item_id,
    user_id,
    product_id,
    order_id,
    locale,
    order_locale,
    ---
    created_at,
    paid_at,
    registration_cohort_at,
    ---
    is_completed,
    is_cancelled,
    is_free,
    COALESCE(user_line_item_ranking <= line_items_in_order, FALSE) AS is_first_order,
    (order_gross_revenue > 0 AND is_first_order) AS is_first_paid_order,
    ---
    unit_cost,
    quantity,
    gross_price,
    gross_revenue,
    --- ranking utils
    line_items_in_order,
    user_line_item_ranking,
    paid_user_line_item_ranking,
    SUM((order_line_items_ranking = 1)::SMALLINT) OVER (
      PARTITION BY user_id
      ORDER BY user_line_item_ranking ASC
      ROWS UNBOUNDED PRECEDING
    ) AS user_order_ranking,
    SUM((order_line_items_ranking = 1 AND order_gross_revenue > 0)::SMALLINT) OVER (
      PARTITION BY user_id
      ORDER BY user_line_item_ranking ASC
      ROWS UNBOUNDED PRECEDING
    ) AS paid_user_order_ranking
  FROM line_items_with_users
);
-------- 51 lines (71 incl. column listing)
The code above is less than half the length (51 vs. 117 lines) of the equivalent ETL (Python) code found in the appendix. The concerns of processing power and memory usage move to the database systems, which plays to the strengths of cloud-based providers, particularly dynamically scaling compute solutions such as BigQuery and Snowflake.
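Because the transformation is plain SQL, its scheduling can also live inside the warehouse rather than in an external orchestrator. Below is a rough sketch only, assuming Snowflake; the task name, the transform_wh warehouse, and the placeholder SELECT are all hypothetical, standing in for the full CREATE TABLE statement above. BigQuery scheduled queries can fill the same role.

-- Sketch: assumes Snowflake; task and warehouse names are hypothetical
CREATE OR REPLACE TASK rebuild_facts_line_items
  WAREHOUSE = transform_wh
  SCHEDULE = 'USING CRON 0 2 * * * UTC'  -- rebuild nightly at 02:00 UTC
AS
  CREATE OR REPLACE TABLE facts.line_items AS (
    -- the full line_items_with_users model from above goes here
    SELECT * FROM raw_data.line_items
  );

ALTER TASK rebuild_facts_line_items RESUME;  -- tasks are created suspended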
Transformation, in either ETL or ELT, needs planning and has an associated time cost. Although an actual rebuild might be significantly faster with modern database solutions, the human element of implementing new business logic in the transformation/modeling layer is slow. Furthermore, only a certain percentage of possible needs will be designed into any given model, and changes will always be needed. A strict transformation layer can block analysts from getting time-sensitive insights.
ELA
Extract-load-analyze is hard to beat over a short distance. It is by far the most agile approach to analytics and saves on development time, lines of code, and planning over a small number of queries.
-- Q1: Top 10 markets by paid_revenue
SELECT u.locale,
  COALESCE(SUM(CASE WHEN l.is_paid AND NOT l.is_cancelled
    THEN l.quantity * l.unit_cost END), 0) AS total_revenue
FROM raw_data.line_items AS l
LEFT JOIN raw_data.users AS u
  ON u.id = l.user_id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
------------------- 11 Lines

-- Q2: Customers "de" who spent more than 100 in their first paid order
WITH paid_de_user_order AS (
  SELECT l.user_id,
    l.order_id,
    MIN(l.paid_at) AS order_paid_at,
    COALESCE(SUM(CASE WHEN l.is_paid AND NOT l.is_cancelled
      THEN l.quantity * l.unit_cost END), 0) AS order_revenue
  FROM raw_data.line_items AS l
  LEFT JOIN raw_data.users AS u
    ON u.id = l.user_id
  WHERE u.locale = 'de'
  GROUP BY 1, 2
  HAVING order_revenue > 100
),
orders_ranked AS (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_paid_at) AS order_ranking
  FROM paid_de_user_order
)
SELECT user_id, order_revenue
FROM orders_ranked
WHERE order_ranking = 1
-------------------- 23 Lines
The ELA code answers our questions in 11 and 23 lines respectively, versus the 7 and 6 lines needed once the data has been modeled. However, extract-load-analyze skips the 117 and 51 lines of preparation associated with ETL and ELT. Counting only the ELT comparison, that is 51 + 7 + 6 = 64 lines against 11 + 23 = 34, a saving of 30 lines. Especially if combined with SaaS extraction and loading, this can mean a half-day turnaround from DWH setup to meaningful insights. Still, as with anything, ELA has its drawbacks:
- Repetition: Already above there is repetition of meaningful business logic: total_revenue and order_revenue are each derived from scratch. DRY coding style is as relevant now as it was when Hunt and Thomas coined the term. Across a larger number of disparate queries, repeated logic leads to three problems (see the sketch after this list):
  - Difficult migration if definitions change
  - Differing numbers across reports due to human error or failed migration
  - A larger code base, as the lines initially saved by skipping transformation are lost to repetition
- Decentralisation and Quality: When columns and their contents are left as-is, the opportunity to improve quality and clarity is lost. Each new query requires thought and investigation to ensure data quality, as these learnings are never centralized and solidified.
- Increased Complexity: The absolute accessibility of the data has gone up, but so has the complexity of the SQL needed to work with it. The inexperienced are more able to get to the data but less able to use it.
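A lightweight way to soften the repetition and decentralisation drawbacks, without committing to a full transformation layer, is to centralize the shared business logic in a view that ad hoc queries build on. The sketch below is illustrative only; the analytics.line_item_revenue name is hypothetical, and the view fixes a single definition of gross_revenue rather than reproducing the full facts.line_items model.

-- Sketch: a hypothetical shared view, not the full ELT model
CREATE VIEW analytics.line_item_revenue AS
SELECT
  l.*,
  u.locale AS user_locale,
  COALESCE(CASE WHEN l.is_paid AND NOT l.is_cancelled
    THEN l.quantity * l.unit_cost END, 0) AS gross_revenue
FROM raw_data.line_items AS l
LEFT JOIN raw_data.users AS u
  ON u.id = l.user_id;

-- Ad hoc queries then reuse one definition of revenue instead of re-deriving it
SELECT user_locale, SUM(gross_revenue) AS total_revenue
FROM analytics.line_item_revenue
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;

If the definition of revenue changes, only the view needs to be migrated, which addresses the first two problems listed under repetition.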
Synopsis
Unsurprisingly, both ELT and ELA have advantages and disadvantages. Differences in cost and efficiency are largely a matter of where and when rather than which is better, as the ELT vs. ELA distinction mostly shunts the same work to a different place. ELA is without doubt the fastest and cheapest way to stand up a reporting structure for a startup's Series A funding round: it does not require data architects and engineers, only SQL-savvy junior team members, and reports can be delivered immediately rather than after three to six months of modeling. However, there will come a point when quality and scalability concerns tip the cost-benefit balance back towards ELT, and at that point the time and money saved earlier will be paid back with interest to repair a flawed structure. Balancing the two worlds is no doubt best; indeed, I use a hybrid approach to feed the agility of ELA into the stability of ELT, but that's a story for another article.
Appendix
ETL transformation
The ETL example is presented as Python pseudocode that loads and transforms data from an imagined Postgres DB. Python was chosen as it is certainly the most popular language of data and, by some measures, the most popular language overall.
## Class that will manage a Load and Transform from raw_data.line_items
## Pseudo code
import psycopg2


class RankOrder():

    def __init__(self):
        self.current_order_id = None
        self.current_order = []
        self.users = {}

    def query(self, cursor):
        ### Involves a Python & SQL mix
        query = """
            SELECT
              l.id AS line_item_id,
              l.*,
              u.locale AS locale,
              l.locale AS order_locale,
              ---
              u.created_at AS registration_cohort_at,
              ---
              l.is_paid AS is_completed,
              ---
              COALESCE(l.quantity * l.unit_cost,0) AS gross_price,
              COALESCE(CASE WHEN l.is_paid AND NOT l.is_cancelled THEN gross_price END,0) AS gross_revenue
            FROM raw_data.line_items AS l
            LEFT JOIN raw_data.users AS u
              ON u.id = l.user_id
            ORDER BY l.user_id, l.order_id, l.paid_at
        """
        cursor.execute(query)
        self.column_names = [desc[0] for desc in cursor.description]
        return cursor

    def get_column(self, row, column_name):
        index = self.column_names.index(column_name)
        return row[index]

    def process_orders(self, order):
        line_items_in_order = len(order)
        if line_items_in_order == 0:
            return
        user_id = self.get_column(order[0], 'user_id')
        user_tally = self.users[user_id]
        order_gross_revenue = sum([
            self.get_column(line_item, 'gross_revenue')
            for line_item in order])
        output_orders = []
        for i, line_item in enumerate(order):
            order_line_items_ranking = i + 1
            is_paid_line = self.get_column(line_item, 'gross_revenue') > 0
            user_line_item_ranking = i + user_tally["user_line_item_ranking"]
            paid_user_line_item_ranking = i + user_tally["paid_user_line_item_ranking"] if is_paid_line else None
            is_first_order = user_line_item_ranking <= line_items_in_order or False
            is_first_paid_order = (is_paid_line and paid_user_line_item_ranking <= line_items_in_order) or False
            new_cols = (
                is_first_order,
                is_first_paid_order,
                line_items_in_order,
                order_line_items_ranking,
                user_tally["user_order_ranking"],
                user_tally["paid_user_order_ranking"],
                order_gross_revenue
            )
            output_orders.append(line_item + new_cols)
        self.batch_insert(output_orders)
        ########## keep the tally up to date
        self.users[user_id]['user_order_ranking'] += 1
        self.users[user_id]['user_line_item_ranking'] += line_items_in_order
        if is_paid_line:
            self.users[user_id]['paid_user_line_item_ranking'] += line_items_in_order
        if order_gross_revenue > 0:
            self.users[user_id]['paid_user_order_ranking'] += 1
        return

    def batch_insert(self, output_orders):
        ###### pseudo code to insert rows into the facts table
        pass

    def process_row(self, row):
        order_id = self.get_column(row, 'order_id')
        user_id = self.get_column(row, 'user_id')
        if user_id not in self.users:
            self.users[user_id] = {
                'user_order_ranking': 1,
                'paid_user_order_ranking': 1,
                'user_line_item_ranking': 1,
                'paid_user_line_item_ranking': 1
            }
        if self.current_order_id != order_id:
            self.process_orders(self.current_order)
            self.current_order_id = order_id
            self.current_order = []
        self.current_order.append(row)

    def run(self):
        ## hardware batching and scalability concerns
        with psycopg2.connect('MY_CONNECTION_STRING') as conn:
            with conn.cursor(name='named_cursor_for_large_result') as cursor:
                cursor.itersize = 20000
                cursor = self.query(cursor)
                for row in cursor:
                    self.process_row(row)
        ## flush - batching gotcha
        self.process_orders(self.current_order)
The above Python code has many concerns and is by no means trivial, even though the load and, in part, the extract elements are glossed over. Despite these omissions, it runs to 117 lines of code, and a number of very distinct skill sets are intertwined:
- Performance and hardware: Row batching, management of the named connection cursor, and the gotchas associated with flushing the final batch.
- Python & SQL mix: Two separate languages are required for this ETL process. The mixing of concerns, and the complexity of maintaining ETL in-house, are part of the reason it is quickly becoming a legacy approach.