Clean Data With Fivetran Transformations
Cleaning is a standard use case for a transformation layer in a data pipeline. The principle of “garbage in, garbage out” tells us that better data trumps more sophisticated algorithms.
To make sense of your data, you may find yourself extracting or combining elements of strings, removing duplicate records, fixing typos and inconsistent capitalization, forcing data types, or flagging or filling in values.
If you aren’t familiar with how to use Fivetran Transformations, refer to this post first.
Extracting Leading and Trailing Elements of Strings: LEFT(), RIGHT(), and SUBSTR()
Suppose you know that the first four characters of a serial number correspond to certain factories. Let’s say you’re also interested in the two trailing characters of a serial number, as every item ending in “00” is a “special.” In most SQL dialects, you can extract leading characters using LEFT() and trailing characters using RIGHT():
SELECT LEFT(553744211, 4) as factory
SELECT RIGHT(553744200, 2) as special
The LEFT() and RIGHT() functions will automatically convert your arguments into strings.
Standard SQL on Google BigQuery doesn’t support the above usages of LEFT() and RIGHT(). Use a more general function, SUBSTR(), instead.
For leading characters:
SUBSTR([string], [starting position], [characters from starting position])
Since you specify the starting position on the string, you can extract elements from the middle of a string as well using SUBSTR().
For trailing characters:
SUBSTR([string], -[number of characters from end])
Note that SUBSTR() requires you to CAST the field as a string (more on that later):
SELECT SUBSTR(553744211, 0, 4) as factory
SELECT SUBSTR(553744200, -2) as special
Taking Care of Whitespace and Other Leading/Trailing Characters: RTRIM(), LTRIM(), and TRIM()
You can use LTRIM() to remove leading white spaces; RTRIM for trailing white spaces; and for both leading and trailing white spaces, TRIM(). You can enter two arguments for TRIM():
TRIM([string], [character(s) to remove from trailing and leading ends])
The second argument can contain multiple characters, each of which will be trimmed. For instance:
SELECT TRIM("???**** example text $$$$", "?$* ")
Unfortunately, none of the TRIM() functions will take care of double whitespace within a string, such as in “example text.” In order to remove duplicate whitespaces within a string (and more!), you will need the REPLACE() function.
Replacing values: REPLACE()
To replace characters, use the REPLACE() function like so:
SELECT REPLACE("Brave", "B", "c")
REPLACE functions, like many others, can be nested:
SELECT REPLACE(REPLACE("Brave", "B", "c"), "v", "z")
You can use REPLACE() to replace whitespaces, too:
SELECT REPLACE(“example text”, “ “, “ “)
You might have noticed that REPLACE() is case-sensitive. If you want to change the case of a string, it’s obviously impractical to nest REPLACE() functions 26 layers deep for every letter.
Changing capitalization: LOWER() and UPPER()
Use LOWER() and UPPER() to lowercase and uppercase strings, respectively.
SELECT UPPER(“eXaMpLe TeXt”)
SELECT LOWER(“eXaMpLe TeXt”)
Most SQL dialects do not, unfortunately, have native functions to convert strings to title- or proper- case.
An Important Logical Function: CASE Statements
The CASE statement is the SQL equivalent to an IF-THEN-ELSE statement. Use this to put things into buckets and perform other rule-based operations.
SELECT SerialNumber, CASE WHEN SerialNumber < 210000 THEN 1943 WHEN SerialNumber >= 210000 AND SerialNumber < 450000 THEN 1944 ELSE 1945 END as YearOfManufacture
Forcing Data Types: CAST()
The CAST() function allows you to alter data types. You will periodically encounter numbers or dates represented as strings:
SELECT CAST(“1337” as INT)
SELECT CAST(“2011-06-14” as DATE)
Sometimes, you will encounter dirty columns with strings called “null” or blank spaces. You can combine CASE statements with CAST to handle these:
SELECT DirtyColumn, CASE WHEN DirtyColumn = “” OR LOWER(DirtyColumn) = “null” THEN NULL ELSE CAST(DirtyColumn AS INT64) END as CleanColumn
Combining Strings: CONCAT
Suppose you wanted to combine strings from two columns.
SELECT CONCAT(“Audie”, “ “, “Murphy”)
Note that CONCAT() does not automatically apply spaces or other delimiters. If you want your strings separated by something, you will have to supply it directly to the function.
Setting Default Values: COALESCE
Use COALESCE to return the first non-NULL value in a list. Suppose you have two columns
tax that you have to sum together, but
tax is sometimes NULL. The usage for COALESCE would look like so:
SELECT Item, Price, Tax, Price + COALESCE(tax, 0) as total_cost FROM Sales.records
The function will check whether “tax” is NULL, and if so, choose 0 as the value. If not, it will use the numeric value of “tax.” We previously addressed how to handle this scenario here.
These functions can be combined with each other to handle many sorts of messy data. There are more advanced cleaning methods to pursue using SQL as well, including fuzzy matching, splitting strings into columns, and title-casing.