20 Jun 2019 | Analyst Recipe

Clean Data With Fivetran Transformations

Charles Wang
Charles Wang
Clean Data With Fivetran Transformations
Fivetran Transformations makes cleaning straightforward. Check out the following methods.

Cleaning is a standard use case for a transformation layer in a data pipeline. The principle of “garbage in, garbage out” tells us that better data trumps more sophisticated algorithms.

To make sense of your data, you may find yourself extracting or combining elements of strings, removing duplicate records, fixing typos and inconsistent capitalization, forcing data types, or flagging or filling in values.

If you aren’t familiar with how to use Fivetran Transformations, refer to this post first.

Extracting Leading and Trailing Elements of Strings: LEFT(), RIGHT(), and SUBSTR()

Suppose you know that the first four characters of a serial number correspond to certain factories. Let’s say you’re also interested in the two trailing characters of a serial number, as every item ending in “00” is a “special.” In most SQL dialects, you can extract leading characters using LEFT() and trailing characters using RIGHT():

SELECT LEFT(553744211,  4) as factory

“5537”

SELECT RIGHT(553744200, 2) as special

“00”

The LEFT() and RIGHT() functions will automatically convert your arguments into strings.

Standard SQL on Google BigQuery doesn’t support the above usages of LEFT() and RIGHT(). Use a more general function, SUBSTR(), instead.

For leading characters:

SUBSTR([string], [starting position], [characters from starting position])

Since you specify the starting position on the string, you can extract elements from the middle of a string as well using SUBSTR().

For trailing characters:

SUBSTR([string], -[number of characters from end])

Note that SUBSTR() requires you to CAST the field as a string (more on that later):

SELECT SUBSTR(553744211, 0,  4) as factory

“5537”

SELECT SUBSTR(553744200, -2) as special

“00”

Taking Care of Whitespace and Other Leading/Trailing Characters: RTRIM(), LTRIM(), and TRIM()

You can use LTRIM() to remove leading white spaces; RTRIM for trailing white spaces; and for both leading and trailing white spaces, TRIM(). You can enter two arguments for TRIM():

TRIM([string], [character(s) to remove from trailing and leading ends])

The second argument can contain multiple characters, each of which will be trimmed. For instance:

SELECT TRIM("???****  example text   $$$$", "?$* ")

example text

Unfortunately, none of the TRIM() functions will take care of double whitespace within a string, such as in “example text.” In order to remove duplicate whitespaces within a string (and more!), you will need the REPLACE() function.

Replacing values: REPLACE()

To replace characters, use the REPLACE() function like so:

SELECT REPLACE("Brave", "B", "c")

crave

REPLACE functions, like many others, can be nested:

SELECT REPLACE(REPLACE("Brave", "B", "c"), "v", "z")

craze

You can use REPLACE() to replace whitespaces, too:

SELECT REPLACE(“example  text”, “  “, “ “)

example text

You might have noticed that REPLACE() is case-sensitive. If you want to change the case of a string, it’s obviously impractical to nest REPLACE() functions 26 layers deep for every letter.

Changing capitalization: LOWER() and UPPER()

Use LOWER() and UPPER() to lowercase and uppercase strings, respectively.

SELECT UPPER(“eXaMpLe TeXt”)

EXAMPLE TEXT

SELECT LOWER(“eXaMpLe TeXt”)

example text

Most SQL dialects do not, unfortunately, have native functions to convert strings to title- or proper- case.

An Important Logical Function: CASE Statements

The CASE statement is the SQL equivalent to an IF-THEN-ELSE statement. Use this to put things into buckets and perform other rule-based operations.

SELECT
SerialNumber,
CASE
	WHEN SerialNumber < 210000 THEN 1943
WHEN SerialNumber >= 210000 AND SerialNumber < 450000 THEN 1944
	ELSE 1945
END as YearOfManufacture

Forcing Data Types: CAST()

The CAST() function allows you to alter data types. You will periodically encounter numbers or dates represented as strings:

SELECT CAST(“1337” as INT)

1337

SELECT CAST(“2011-06-14” as DATE)

2011-06-14

Sometimes, you will encounter dirty columns with strings called “null” or blank spaces. You can combine CASE statements with CAST to handle these:

SELECT
DirtyColumn,
CASE
	WHEN DirtyColumn  = “” OR LOWER(DirtyColumn) = “null” THEN NULL
	ELSE CAST(DirtyColumn AS INT64)
END as CleanColumn

Combining Strings: CONCAT

Suppose you wanted to combine strings from two columns.

SELECT CONCAT(“Audie”, “ “, “Murphy”)

Audie Murphy

Note that CONCAT() does not automatically apply spaces or other delimiters. If you want your strings separated by something, you will have to supply it directly to the function.

Setting Default Values: COALESCE

Use COALESCE to return the first non-NULL value in a list. Suppose you have two columns price and tax that you have to sum together, but tax is sometimes NULL. The usage for COALESCE would look like so:

SELECT
Item,
Price,
Tax,
Price + COALESCE(tax, 0) as total_cost
FROM
Sales.records

The function will check whether “tax” is NULL, and if so, choose 0 as the value. If not, it will use the numeric value of “tax.” We previously addressed how to handle this scenario here.

Going Forward

These functions can be combined with each other to handle many sorts of messy data. There are more advanced cleaning methods to pursue using SQL as well, including fuzzy matching, splitting strings into columns, and title-casing.

Fivetran can help you get a head start on all things analytics-related. Request a demo of Fivetran or start your free trial today.

Are You A Data Expert?

Start a free trial today.

Discover the smartest solution for data-driven results.
We have detecting that you are using an adblocking plugin in your browser. We don't show ads, but we rely on advertising services, so it might restrict you from completing important functions or seeing important content. Please make sure you whitelist our website in your adblocking plugin.
Fivetran uses cookies to enhance your user experience and improve the quality of our website. Unless you disable cookies, you consent to the placement and use of cookies as described in our Privacy Policy by continuing to use this website.
Adblock Detection