02 Jun 2019 | Analyst Recipe

Get Started With the “T” in ELT

Charles Wang
Charles Wang
Get Started With the “T” in ELT
Use Fivetran Transformations to prepare your data for reporting.

With the release of Fivetran Transformations, Fivetran is officially a full-fledged ELT tool. With Fivetran Transformations, you can extract, load and transform in the same environment, allowing you to leverage scheduling and version control without having to jury-rig or configure other tools and technologies.

There are two main purposes for transforming data. One is for legibility and organizational coherence, similar to the approach Looker takes to models. The other is to enhance the performance and cost characteristics of your data warehouse.

Likewise, there are two methods for transforming data. The first method, the view, is not stored on disk. Every time someone accesses the view, the system runs the corresponding query to create it. In an ideal world with zero latency and unlimited computational resources, all transformations would simply be views.

Materialized views, on the other hand, are stored on disk because views generated on the fly from a large table or complex query can be extremely non-performant. In fact, the materialized views approach is sometimes necessary to get any performance at all. For instance, Resources exceeded errors on BigQuery can easily crop up if you try to surface tables with hundreds of millions of rows.

The Fivetran transformation layer is an implementation of the materialized views approach. It constructs new tables and updates them on a schedule of your choosing. The interface looks like this:

interface

Note that your SQL script should contain the create or replace table syntax, as well as the name of the schema and the table that the transformation will populate.

An Overview of Transformations

The transformations you will perform can be sorted into the following categories:

Revising

  1. Cleaning: altering values for formatting compatibility
  2. Format revision/conversion: replacing incompatible characters, converting units, converting date formatting
  3. Key restructuring: altering values so that they can be used as keys across tables
  4. Deduplication: identifying and removing duplicate records.
  5. Data validation: evaluate the validity of a record by the completeness of the data

Computing

  1. Derivation: performing simple cross-column calculations
  2. Summarization: using aggregate functions to get summary values
  3. Pivoting: turning row values into columns
  4. Sorting or ordering and indexing to improve search performance

Separating

  1. Splitting: splitting a single column into multiple columns — used for fields containing delimited values, or for converting a column with multiple possible categorical values into dummy variables for regression analysis
  2. Filtering: selecting only certain rows or columns

Combining

  1. Joining: linking data across sources
  2. Merging/appending/union: combining records from multiple sources
  3. Integration: reconcile names and values for the same data element across different sources

Some Samples

Suppose you had a table original.activity_table with the following columns:

column type
first_name STRING
activity STRING
height FLOAT
weight INTEGER
email STRING
score INTEGER

Multiple people can have the same first name and each person can participate in multiple activities.

What if you wanted to clean up activity values and convert the relevant figures to metric?

CREATE or REPLACE table transformed.activity_table AS

SELECT
-- 1. Cleaning: let’s remove leading and trailing white spaces and inconsistent capitalization:
LOWER(TRIM(activity)) as activity,
-- 2. Format conversion: let’s change height from inches to cm and weight from lbs to kg
height * 2.54 as height,
weight / 2.205 as weigh
FROM
original.test_table

What if you wanted to identify and extract unique participants and create unique identifiers for them? What if you wanted to exclude records that can’t be verified because they lack emails?

CREATE OR REPLACE table transformed.user_table AS

WITH users_grouped AS
(SELECT
first_name,
email
FROM
original.test_table
WHERE
-- 5. Exclude records with null values in the “email” column
-- 11. This is an example of filtering, too
email is not NULL
-- 4. Deduplicate by grouping
GROUP BY
first_name,
email)

SELECT
first_name,
email,
-- 3. Key restructuring: let’s create an identifier by using first_name and assigning an index number based on sorting by email address
CONCAT(first_name, '-' , CAST(row_number() over (PARTITION by first_name ORDER BY email) as string)) as identifier
FROM
users_grouped

Suppose we use the same table from the last example, only without any duplicate persons and with all units in metric:

column type
first_name STRING
activity STRING
height FLOAT
weight INTEGER
score INTEGER

What if we want to compute BMI?

CREATE OR REPLACE table transformed.bmi_table AS

SELECT
first_name,
activity,
height,
weight,
-- 6. Perform a simple calculation involving more than one column
weight / POWER(height, 2) as bmi,
score
FROM
original.activity_table

What about getting some summary statistics for the numeric values?

CREATE OR REPLACE table transformed.summary_table AS

-- 7. Get min, max, and arithmetic mean. Note that you can use GROUP BY here to segment the data, too
-- 13. Merging records, too!
SELECT
‘score’ as attribute,
MIN(score) as minimum,
AVG(score) as mean,
MAX(score) as maximum,
STDDEV(score) as standard_deviation
FROM
original.activity_table
UNION
SELECT
‘height’ as attribute,
MIN(height) as minimum,
AVG(height) as mean,
MAX(height) as maximum,
STDDEV(score) as standard_deviation
FROM
original.activity_table
UNION
SELECT
‘weight’ as attribute,
MIN(weight) as minimum,
AVG(weight) as mean,
MAX(weight) as maximum,
STDDEV(score) as standard_deviation
FROM
original.activity_table

Percentiles are slightly trickier:

-- 7. More summary stats
CREATE OR REPLACE table transformed.percentile_table AS

SELECT
‘score’ as attribute,
PERCENTILE_CONT(score, 0.25) OVER() as first_quartile,
PERCENTILE_CONT(score, 0.50) OVER() as median,
PERCENTILE_CONT(score, 0.75) OVER() as third_quartile
FROM
salesforce.opportunity

LIMIT 1
-- ...
-- From here on, you would simply do what you did in the last example in terms of UNION

How about joining the percentile tables to the rest of the summary tables after you have created the two from above? You could do this:

CREATE OR REPLACE table transformed.full_summary_table AS

SELECT
attribute,
minimum,
first_quartile,
mean,
median,
third_quartile,
maximum,
standard_deviation,
-- 6. Another example here of derivation, using arithmetic to get the interquartile range
third_quartile - first_quartile as iqr
FROM
transformed.summary_table
-- 12. Join it!
LEFT JOIN
transformed.percentile_table on percentile_table.attribute = summary_table.attribute

Note that you can trigger this to activate after the two prerequisite tables are populated.

What if you wanted to turn some of your row values into columns? Going from “long” to “wide” is a common need in data analysis.

You may want to see the different types of activities as columns. For this, you will have to enumerate each value you want to turn into a column. Let’s say the only possible values are tennis, boxing, soccer, and hockey:

-- 8. Pivoting
CREATE OR REPLACE table transformed.scores_pivoted AS

SELECT
first_name,
activity,
score
FROM
original.activity_table
PIVOT(SUM(score) for activity in (‘tennis’, ‘boxing’, ‘soccer’, ‘hockey’)) as p

If your platform doesn’t have PIVOT, you could try something like this instead:

-- 8. Pivoting the hard way
CREATE OR REPLACE table transformed.scores_pivoted AS

SELECT
first_name,
SUM(IF(activity = ‘tennis’, score, NULL)) tennis,
SUM(IF(activity = ‘boxing’, score, NULL)) boxing,
SUM(IF(activity = ‘soccer’, score, NULL)) soccer,
SUM(IF(activity = ‘hockey’, score, NULL)) hockey
FROM
original.activity_table
GROUP BY
First_name

What if you have a column that contains multiple values? Let’s say you have a bunch of personal profiles with nicknames in a table called original.names:

column type
id INTEGER
first_name STRING
last_name STRING
nicknames STRING

The nicknames are strings, and the values are comma-separated. You want to create a new table with a one-to-many mapping with the following columns (note that this only works in Snowflake at the time of writing this):

column type
id INTEGER
nickname STRING
-- 10. Split a column containing delimited values
CREATE OR REPLACE table transformed.nicknames AS

SELECT
Id,
n.value::string as name
FROM
original.names LATERAL FLATTEN(INPUT=>SPLIT(nicknames, ‘,’)) n;

Beyond EL

With a little creativity, there is plenty you can do to more easily organize and optimize your data for reporting and dashboards. To learn more about Fivetran Transformations, sign up for a personalized demo or get started today by signing up for a free trial.

Are You A Data Expert?

Start a free trial today.

Discover the smartest solution for data-driven results.
We have detecting that you are using an adblocking plugin in your browser. We don't show ads, but we rely on advertising services, so it might restrict you from completing important functions or seeing important content. Please make sure you whitelist our website in your adblocking plugin.
Fivetran uses cookies to enhance your user experience and improve the quality of our website. Unless you disable cookies, you consent to the placement and use of cookies as described in our Privacy Policy by continuing to use this website.
Adblock Detection