SQL Essentials Part 2 | Blog

Every data professional should have a high degree of competence with SQL. A long-time mainstay of database operations, the various dialects of SQL feature many capabilities that may not be obvious.

This article follows part 1 of the SQL Essentials guide. In part 2 of SQL Essentials, we will discuss the following topics:

CASE shortening, exclusivity and order
Using LIKE better
SUMMING booleans
Use window functions to deduplicate
Ignoring NULLS in a window function
Marking past events
Knowing your dialects (of SQL)

CASE shortening, exclusivity and order

CASE statements are a staple of every SQL dialect and pop up whenever IF, THEN logic is required. However, they can often be ungainly and analysts often ignore the benefits of their implicit defaults.

One neat element that is often overlooked is that the ELSE clause is optional and defaults to ELSE NULL. Leaving out the ELSE in your CASE statement is always cleaner and more readable, and helps explicitly map every condition to an appropriate value. Consider the following:

SELECT CASE WHEN state_enum = 1 THEN 'started' WHEN state_enum < 3 THEN 'pending' ELSE complete END as order_state FROM orders

This simple CASE maps an enumeration to human-readable text. We can assume that state_enum = 3 is the completed state. However, quite often a new state may be added as a company grows - for example state_enum = 4 -> returned. Were this to happen, returned state orders would be swallowed into complete and we would be none-the-wiser to the error.

CASE WHEN state_enum = 1 THEN 'started' WHEN state_enum < 3 THEN 'pending' WHEN state_enum = 4 THEN 'complete' END as order_state

With the above snippet, all enumerations are explicitly handled, and any new enumerations slipping in would turn up NULL alerting us to a change.

We could extend this one further by using CASE in the style of a SWITCH statement as such.

CASE state_enum WHEN 1 THEN 'started' WHEN 2 THEN 'pending' WHEN 3 THEN 'pending' WHEN 4 THEN 'complete' END as order_state

Here, each evaluation is implicitly "=" and so does not suit every circumstance. However, it is still a great tool to have in the toolbox. Redshift even boasts a DECODE function that goes one step further:

DECODE(state_enum, 1, 'started', 2, 'pending', 3, 'pending', 4, 'complete', ) AS order_state

Another property of the noble CASE statement is that it is exclusive. This means that only those rows failing to meet a WHEN clause will make it to the next, and that those rows which meet a WHEN will never been seen by later clauses. In practice, this is familiar territory for those with experience in any programming language (cf. if statements). However, for others its a point easily missed. A simple and common example might be creating age buckets ( eg 00 - 20, 21-55, 56-99 ).

CASE WHEN age < 21 THEN '00 - 20' WHEN age >= 21 AND age < 56 THEN '21 - 56' WHEN age BETWEEN 56 AND 99 THEN '56 - 99' ELSE 'invalid-age' END as age_buckets

The above probably produces the correct answer, but it is more prone to error and less readable than simply leaning on the exclusivity property of the CASE

CASE WHEN age < 0 OR age > 99 THEN 'invalid-age' -- nice to handle invalid cases first WHEN age < 21 THEN '00 - 20' WHEN age < 56 THEN '21 - 56' WHEN age < 100 THEN '56 - 99' END as age_buckets

The above is a simple cleanup that presents a more readable controllable solution to the same problem. With the concept of a scale as we have above it's easy to understand how each range excludes all other ranges, but considering the same principal in more complex scenarios can help simplify code and reduce errors.

CASE WHEN order_price > 0 AND NOT is_cancelled AND is_complete THEN 'completed_order' WHEN order_price = 0 AND NOT is_cancelled AND is_complete THEN 'free_order' WHEN NOT is_complete THEN 'pending' WHEN is_cancelled THEN 'cancelled' END as order_state

CASE WHEN is_cancelled THEN 'cancelled' WHEN NOT is_complete THEN 'pending' WHEN order_price > 0 THEN 'completed_order' WHEN order_price = 0 THEN 'free_order' END as order_state

We can see that order matters and when done correctly can control errors and reduce lines of code. Its worth noting that there exists a double-edged sword here if you are unfamiliar with the quality of your data as over or under catching can have unpredictable results.

On a final note, since CASE statements only execute until a match is made, we can often improve performance by pushing compute-intensive operations (Like REGEX & LIKE) further down the list.

Using LIKE better

Well known is LIKE and its wildcard %, however, the _ operator is often overlooked. The % operator captures any number of characters whereas _ allows only one wild card. Not knowing about the later may be fine for quite some time as % can solve many issues, but this knowledge gap can create difficult to understand errors and unneeded performance issues.

The LIKE operator (similar to any string search function) is a compute-heavy resource compared to equality and IN operators. Though often not a problem at a small scale it can explode the run time of queries as data volumes grow. This is particularly when true when a non-terminal % is used, as the entire string must be scanned before the evaluation can be resolved. For example, a statement like a% must only check the first character of any VARCHAR to know if it should return TRUE or FALSE whereas %a will only finish once the whole string is read. However, there are use cases where the preceding letters are wild, and without the use of _, % becomes our only resort. A real-life pattern Ive often encountered is checking if a referrer URL is from facebook. Facebook unkindly has a number of subdomains (www.facebook, m.facebook etc.). Using

CASE WHEN referer_url LIKE '%.facebook' THEN 'facebook' END AS source

Can incur significant performance problems especially since the referer_url column is no doubt quite a long VARCHAR. Since there are only a max 3** characters in facebook subdomains using our underscore we can vastly reduce the scanning work needed by our DB with the following:

CASE WHEN referer_url LIKE '_.facebook' THEN 'facebook' WHEN referer_url LIKE '__.facebook' THEN 'facebook' WHEN referer_url LIKE '___.facebook' THEN 'facebook' END AS source

Though performance issues can often be solved with more money and a higher-performance system not knowing _ can still have serious consequences.

Imagine a column called domain_language containing both the domain of a user as well as their language. Values might include:

Domain_language - sample values:

de_de
de_en
fr_fr
com_en
co.uk_en
co_en
co_fr

If we tried to find all users registered on our .co domain where might try the following and be surprised by the incorrect results

SELECT * FROM users WHERE domain_language LIKE 'co_%'

This will return users for com and co.uk as well as co because of underscores wildcard status. In the wild, this doesn't occur every other day, but when it does it can be hard to track down or might go unnoticed entirely. The above is typically corrected by escaping the _ character as follows

SELECT * FROM users WHERE domain_language LIKE 'co\\_%'

SUMMING Booleans

It's quite common to want to find all the users who are active or have a feature turned on. The most common approach I see is below, which although perfectly functional can take a lot of space and be hard to read in large more real-world queries.

SELECT f.month, SUM(1) AS users, SUM(CASE WHEN u.is_active = TRUE THEN 1 ELSE 0 END) AS active_users FROM facts.users AS u GROUP BY 1

There are a number of syntactic changes that can help with your SQL golf.

u.is_active = TRUE is the same as u.is_active
CASE WHEN u.is_active THEN 1 END will often suffice although runs the risk of NULL results
BOOLEAN cast neatly to INTs particularly in PgSQL derived syntaxes.

To wit:

SELECT f.month, SUM(1) AS users, --- PgSQL / Redshift syntax SUM(u.is_active::INT) AS active_users --- ANSI CASTING SUM(CAST(u.is_active AS INT)) AS active_users FROM facts.users AS u GROUP BY 1

*Note a Boolean column in MYSQL will autocast to INT - which is still no excuse to use it as a DWH

Use window functions to deduplicate

Window functions, also called analytical functions, are essential for any data engineer or analyst. If you don't know how they function, learn them. If your database doesn't support them, consider trading it in for a database more suited to function as a data warehouse. (If such a trade-in is impossible, some enterprising professionals may consider trading their employer in - but commenting further on that topic is beyond my remit!). Unfortunately, window functions can be hard to wrap your head around at first. The good news, however, is that window functions open up a new world of possibilities. Window functions are placed in line in your select expression list but behave more like columns appended to the table after all other execution has completed. To recreate window function outputs without window functions themselves often involves subqueries and self joins. Window functions significantly shorten this syntax and often provide functionality that cant otherwise be replicated.

A powerful use case in the context of Data engineering/data modeling is their role in deduplication - particularly of functional duplicates, which share ids but not all properties. This kind of duplicate role often occurs in the context of event logs and tracking data.

Event	Order_id	event_id	event_at	gross_total	browser
order_completed	100	1	2019-01-01	1000	Chrome
order_completed	100	1	2019-01-02	1000	Safari

In the example given above, we see an erroneous row. Someone completed an order, but a mysterious bug has registered the event twice in my database. Often times, I am relatively uninterested in the correct row - ie which browser the user was really using. However, it may be vital to ensure there are no duplicate records. If I am to join this table to others in my data processing, the duplicate row can become two, then four, then eight, then sixteen duplicate rows - dramatically distorting the amount of revenue which was really generated. If, then, we wish to deduplicate this table, standard SQL syntax (like GROUP BY and DISTINCT) will not do the trick.

WITH prep AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id) AS ranking FROM events ) SELECT -- * EXCEPT(column) is a beautiful design pattern offered to us by BigQuery * EXCEPT(ranking) FROM prep WHERE ranking = 1

Snowflake has an even nicer version of the same with their keyword QUALIFY (Which is unfortunately outside the ANSI standard).

SELECT * FROM events QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id) = 1

A pattern such as above neatly ensures that there will be no duplicates on order_id. Note that this fix is really a band-aid that obscures the origin of the problem, and it is always better to handle the true duplication errors at their source.

IGNORE NULLs in the WINDOW

Window functions are a core asset that just keeps giving. A small nuance which can give rise to extremely useful patterns is that of IGNORE / RESPECT NULLS. IGNORE NULLs is of relevance particularly with LEAD and LAG functions. In simple terms the IGNORE NULL options will cause a LAG to continue looking back in the window until it eventually finds a value, as opposed to simply taking the previous value as is the default case (RESPECT NULLS). This is probably best expressed by way of an example:

date_id	event	order_id
1	login	NULL
1	page_view	NULL
1	order	123
2	page_view	NULL
3	page_view	NULL
3	login	NULL
3	order	456
4	page_view	NULL

Let's say the above data shows a user navigating our site. Perhaps we would like to know a user's last order_id at any given time. IGNORE NULLS will do this for us perfectly.

SELECT *, LAG(order_id) IGNORE NULLS OVER (ORDER BY date_id) AS last_order_id, LAG(order_id) OVER (ORDER BY date_id) AS last_events_order_id FROM raw_data.events

Giving us:

date_id	event	order_id	last_order_id	last_events_order_id
1	login	NULL	NULL	NULL
1	page_view	NULL	NULL	NULL
1	order	123	NULL	NULL
2	page_view	NULL	123	123
3	page_view	NULL	123	NULL
3	login	NULL	123	NULL
3	order	456	123	NULL
4	page_view	NULL	456	456

Having IGNORE NULLS in your toolkit opens up a world of important patterns including:

Creating and calculating breakpoints - for example in event sessionization
Diffing values in a sequence - eg. days since last paid purchase

Marking past events

A common use case is to find if a user has ever passed through a given state. In a case like event tracking, a user may have interacted with a particular conversion event in a number of ways (once, twice, or not at all). Often, we only care if they have performed an action at least once, such as reached a stage in a conversion funnel. A typical way to handle this might be the following:

SELECT e.user_id, SUM(CAST(( e.event = 'order_completed' ) AS INT) ) >= 1 AS is_customer FROM raw_data.events

Although a neat trick, Redshift (BOOL_OR), Snowflake (BOOL_OR_AGG), and Bigquery (LOGICAL_OR) provide better ways of handling this common scenario. The AND and BIT versions of the above are also worth having in your toolbag. To wit:

SELECT e.user_id, BOOL_OR(e.event = 'order_completed') AS is_customer FROM raw_data.events

Know your dialects

SQL is a language, and languages diverge by nature. Although the core concepts of SQL are the same throughout dialects, various bits of syntax - along with certain functionalities - differ from one SQL sub-language to the next. Most syntaxes are fairly similar and can solve the same problems, but the specific approach to solving a given problem can vary. You are advised to know your dialect well and know what separates it from the others. At least know your dialect and database name, and, if possible, its parent languages.

Working in BI you will typically only encounter one DWH for your daily work. In that regard, its worth bookmarking its documentation, particularly the function reference section (presented below for your convenience). Being able to quickly discern if SQL found in articles or StackOverflow applies to your system is hugely valuable.

Function Reference

Redshift - https://docs.aws.amazon.com/redshift/latest/dg/c_SQL_functions.html
BigQuery - https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators
Snowflake - https://docs.snowflake.net/manuals/sql-reference-functions.html
Postgres - https://www.postgresql.org/docs/8.4/functions.html

Many big DWH players base their drivers and syntax (to a greater or lesser extent) on Postgres, due to its wide compatibility with existing BI tools, open-source nature and generally competent syntax for performing analytical tasks. Google BigQuery** and Snowflake syntaxes are built from the ground up, and are compliant with ANSI SQL syntax. Redshift is tightly based on the Postgres dialect of SQL (namely, PostgreSQL 8.0.2) meaning that 99% of the time, differences between the two are invisible. Amazon, however, has a helpful reference of the core differences here. https://docs.aws.amazon.com/redshift/latest/dg/c_redshift-and-postgres-sql.html

** BigQuery has both a legacy and standard SQL syntax. Unsurprisingly, you should probably stick to the Standard which was introduced later, and which behaves in a more standard SQL manner.

A salient difference I've always noticed as a pain point between syntaxes is CAST between types and COALESCE. Below is a mini code example.

MySQL

SELECT ... CONCAT(CAST(is_active AS CHAR),user_name) AS pseudo_key,

PgSQL / Redshift

SELECT ... is_active::INT::CHAR || username AS pseudo_key

BigQuery

SELECT ... CONCAT(CAST(is_active AS string), username) AS pseudo_key

Conclusion

The world of SQL is vast and the product of decades of continued evolution. For as accessible as the basics of the language are, there is always more to learn and more to strive for with respect to performance and readability. Keep practicing!

SQL essentials: Part 2