Every data professional should have a high degree of competence with SQL. A long-time mainstay of database operations, the various dialects of SQL feature many capabilities that may not be obvious.
This article follows part 1 of the SQL Essentials guide. In part 2 of SQL Essentials, we will discuss the following topics:
- CASE shortening, exclusivity and order
- Using LIKE better
- SUMMING booleans
- Use window functions to deduplicate
- Ignoring NULLS in a window function
- Marking past events
- Knowing your dialects (of SQL)
CASE shortening, exclusivity and order
CASE statements are a staple of every SQL dialect and pop up whenever IF, THEN logic is required. However, they can often be ungainly and analysts often ignore the benefits of their implicit defaults.
One neat element that is often overlooked is that the ELSE clause is optional and defaults to ELSE NULL. Leaving out the ELSE in your CASE statement is always cleaner and more readable, and helps explicitly map every condition to an appropriate value. Consider the following:
SELECT CASE WHEN state_enum = 1 THEN 'started' WHEN state_enum < 3 THEN 'pending' ELSE complete END as order_state FROM orders
This simple CASE maps an enumeration to human-readable text. We can assume that state_enum = 3 is the completed state. However, quite often a new state may be added as a company grows - for example state_enum = 4 -> returned. Were this to happen, returned state orders would be swallowed into complete and we would be none-the-wiser to the error.
CASE WHEN state_enum = 1 THEN 'started' WHEN state_enum < 3 THEN 'pending' WHEN state_enum = 4 THEN 'complete' END as order_state
With the above snippet, all enumerations are explicitly handled, and any new enumerations slipping in would turn up NULL alerting us to a change.
We could extend this one further by using CASE in the style of a SWITCH statement as such.
CASE state_enum WHEN 1 THEN 'started' WHEN 2 THEN 'pending' WHEN 3 THEN 'pending' WHEN 4 THEN 'complete' END as order_state
Here, each evaluation is implicitly "=" and so does not suit every circumstance. However, it is still a great tool to have in the toolbox. Redshift even boasts a DECODE function that goes one step further:
DECODE(state_enum, 1, 'started', 2, 'pending', 3, 'pending', 4, 'complete', ) AS order_state
Another property of the noble CASE statement is that it is exclusive. This means that only those rows failing to meet a WHEN clause will make it to the next, and that those rows which meet a WHEN will never been seen by later clauses. In practice, this is familiar territory for those with experience in any programming language (cf. if statements). However, for others its a point easily missed. A simple and common example might be creating age buckets ( eg 00 - 20, 21-55, 56-99 ).
CASE WHEN age < 21 THEN '00 - 20' WHEN age >= 21 AND age < 56 THEN '21 - 56' WHEN age BETWEEN 56 AND 99 THEN '56 - 99' ELSE 'invalid-age' END as age_buckets
The above probably produces the correct answer, but it is more prone to error and less readable than simply leaning on the exclusivity property of the CASE
CASE WHEN age < 0 OR age > 99 THEN 'invalid-age' -- nice to handle invalid cases first WHEN age < 21 THEN '00 - 20' WHEN age < 56 THEN '21 - 56' WHEN age < 100 THEN '56 - 99' END as age_buckets
The above is a simple cleanup that presents a more readable controllable solution to the same problem. With the concept of a scale as we have above it's easy to understand how each range excludes all other ranges, but considering the same principal in more complex scenarios can help simplify code and reduce errors.
CASE WHEN order_price > 0 AND NOT is_cancelled AND is_complete THEN 'completed_order' WHEN order_price = 0 AND NOT is_cancelled AND is_complete THEN 'free_order' WHEN NOT is_complete THEN 'pending' WHEN is_cancelled THEN 'cancelled' END as order_state
CASE WHEN is_cancelled THEN 'cancelled' WHEN NOT is_complete THEN 'pending' WHEN order_price > 0 THEN 'completed_order' WHEN order_price = 0 THEN 'free_order' END as order_state
We can see that order matters and when done correctly can control errors and reduce lines of code. Its worth noting that there exists a double-edged sword here if you are unfamiliar with the quality of your data as over or under catching can have unpredictable results.
On a final note, since CASE statements only execute until a match is made, we can often improve performance by pushing compute-intensive operations (Like REGEX & LIKE) further down the list.
Using LIKE better
Well known is LIKE and its wildcard %, however, the _ operator is often overlooked. The % operator captures any number of characters whereas _ allows only one wild card. Not knowing about the later may be fine for quite some time as % can solve many issues, but this knowledge gap can create difficult to understand errors and unneeded performance issues.
The LIKE operator (similar to any string search function) is a compute-heavy resource compared to equality and IN operators. Though often not a problem at a small scale it can explode the run time of queries as data volumes grow. This is particularly when true when a non-terminal % is used, as the entire string must be scanned before the evaluation can be resolved. For example, a statement like a% must only check the first character of any VARCHAR to know if it should return TRUE or FALSE whereas %a will only finish once the whole string is read. However, there are use cases where the preceding letters are wild, and without the use of _, % becomes our only resort. A real-life pattern Ive often encountered is checking if a referrer URL is from facebook. Facebook unkindly has a number of subdomains (www.facebook, m.facebook etc.). Using
CASE WHEN referer_url LIKE '%.facebook' THEN 'facebook' END AS source
Can incur significant performance problems especially since the referer_url column is no doubt quite a long VARCHAR. Since there are only a max 3** characters in facebook subdomains using our underscore we can vastly reduce the scanning work needed by our DB with the following:
CASE WHEN referer_url LIKE '_.facebook' THEN 'facebook' WHEN referer_url LIKE '__.facebook' THEN 'facebook' WHEN referer_url LIKE '___.facebook' THEN 'facebook' END AS source
Though performance issues can often be solved with more money and a higher-performance system not knowing _ can still have serious consequences.
Imagine a column called domain_language containing both the domain of a user as well as their language. Values might include:
Domain_language - sample values:
- de_de
- de_en
- fr_fr
- com_en
- co.uk_en
- co_en
- co_fr
If we tried to find all users registered on our .co domain where might try the following and be surprised by the incorrect results
SELECT * FROM users WHERE domain_language LIKE 'co_%'
This will return users for com and co.uk as well as co because of underscores wildcard status. In the wild, this doesn't occur every other day, but when it does it can be hard to track down or might go unnoticed entirely. The above is typically corrected by escaping the _ character as follows
SELECT * FROM users WHERE domain_language LIKE 'co\\_%'
SUMMING Booleans
It's quite common to want to find all the users who are active or have a feature turned on. The most common approach I see is below, which although perfectly functional can take a lot of space and be hard to read in large more real-world queries.
SELECT f.month, SUM(1) AS users, SUM(CASE WHEN u.is_active = TRUE THEN 1 ELSE 0 END) AS active_users FROM facts.users AS u GROUP BY 1
There are a number of syntactic changes that can help with your SQL golf.
- u.is_active = TRUE is the same as u.is_active
- CASE WHEN u.is_active THEN 1 END will often suffice although runs the risk of NULL results
- BOOLEAN cast neatly to INTs particularly in PgSQL derived syntaxes.
To wit:
SELECT f.month, SUM(1) AS users, --- PgSQL / Redshift syntax SUM(u.is_active::INT) AS active_users --- ANSI CASTING SUM(CAST(u.is_active AS INT)) AS active_users FROM facts.users AS u GROUP BY 1
*Note a Boolean column in MYSQL will autocast to INT - which is still no excuse to use it as a DWH
Use window functions to deduplicate
Window functions, also called analytical functions, are essential for any data engineer or analyst. If you don't know how they function, learn them. If your database doesn't support them, consider trading it in for a database more suited to function as a data warehouse. (If such a trade-in is impossible, some enterprising professionals may consider trading their employer in - but commenting further on that topic is beyond my remit!). Unfortunately, window functions can be hard to wrap your head around at first. The good news, however, is that window functions open up a new world of possibilities. Window functions are placed in line in your select expression list but behave more like columns appended to the table after all other execution has completed. To recreate window function outputs without window functions themselves often involves subqueries and self joins. Window functions significantly shorten this syntax and often provide functionality that cant otherwise be replicated.
A powerful use case in the context of Data engineering/data modeling is their role in deduplication - particularly of functional duplicates, which share ids but not all properties. This kind of duplicate role often occurs in the context of event logs and tracking data.
Start for free
Join the thousands of companies using Fivetran to centralize and transform their data.