Get started with the “T” in ELT

Use Fivetran Transformations to prepare your data for reporting.
June 2, 2019

With the release of Fivetran Transformations, Fivetran is officially a full-fledged ELT tool. With Fivetran Transformations, you can extract, load and transform in the same environment, allowing you to leverage scheduling and version control without having to jury-rig or configure other tools and technologies.

There are two main purposes for transforming data. One is for legibility and organizational coherence, similar to the approach Looker takes to models. The other is to enhance the performance and cost characteristics of your data warehouse.

Likewise, there are two methods for transforming data. The first method, the view, is not stored on disk. Every time someone accesses the view, the system runs the corresponding query to create it. In an ideal world with zero latency and unlimited computational resources, all transformations would simply be views.

Materialized views, on the other hand, are stored on disk because views generated on the fly from a large table or complex query can be extremely non-performant. In fact, the materialized views approach is sometimes necessary to get any performance at all. For instance, Resources exceeded errors on BigQuery can easily crop up if you try to surface tables with hundreds of millions of rows.

The Fivetran transformation layer is an implementation of the materialized views approach. It constructs new tables and updates them on a schedule of your choosing. The interface looks like this:

Note that your SQL script should contain the create or replace table syntax, as well as the name of the schema and the table that the transformation will populate.

An overview of transformations

The transformations you will perform can be sorted into the following categories:

Revising

  1. Cleaning: altering values for formatting compatibility
  2. Format revision/conversion: replacing incompatible characters, converting units, converting date formatting
  3. Key restructuring: altering values so that they can be used as keys across tables
  4. Deduplication: identifying and removing duplicate records.
  5. Data validation: evaluate the validity of a record by the completeness of the data

Computing

  1. Derivation: performing simple cross-column calculations
  2. Summarization: using aggregate functions to get summary values
  3. Pivoting: turning row values into columns
  4. Sorting or ordering and indexing to improve search performance

Separating

  1. Splitting: splitting a single column into multiple columns — used for fields containing delimited values, or for converting a column with multiple possible categorical values into dummy variables for regression analysis
  2. Filtering: selecting only certain rows or columns

Combining

  1. Joining: linking data across sources
  2. Merging/appending/union: combining records from multiple sources
  3. Integration: reconcile names and values for the same data element across different sources

Some samples

Suppose you had a table original.activity_table with the following columns:

column type
first_name STRING
activity STRING
height FLOAT
weight INTEGER
email STRING
score INTEGER

Multiple people can have the same first name and each person can participate in multiple activities.

What if you wanted to clean up activity values and convert the relevant figures to metric?

CREATE or REPLACE table transformed.activity_table AS SELECT -- 1. Cleaning: let’s remove leading and trailing white spaces and inconsistent capitalization: LOWER(TRIM(activity)) as activity, -- 2. Format conversion: let’s change height from inches to cm and weight from lbs to kg height * 2.54 as height, weight / 2.205 as weigh FROM original.test_table

What if you wanted to identify and extract unique participants and create unique identifiers for them? What if you wanted to exclude records that can’t be verified because they lack emails?

CREATE OR REPLACE table transformed.user_table AS WITH users_grouped AS (SELECT first_name, email FROM original.test_table WHERE -- 5. Exclude records with null values in the “email” column -- 11. This is an example of filtering, too email is not NULL -- 4. Deduplicate by grouping GROUP BY first_name, email) SELECT first_name, email, -- 3. Key restructuring: let’s create an identifier by using first_name and assigning an index number based on sorting by email address CONCAT(first_name, '-' , CAST(row_number() over (PARTITION by first_name ORDER BY email) as string)) as identifier FROM users_grouped

Suppose we use the same table from the last example, only without any duplicate persons and with all units in metric:

column type
first_name STRING
activity STRING
height FLOAT
weight INTEGER
score INTEGER

What if we want to compute BMI?

CREATE OR REPLACE table transformed.bmi_table AS SELECT first_name, activity, height, weight, -- 6. Perform a simple calculation involving more than one column weight / POWER(height, 2) as bmi, score FROM original.activity_table

What about getting some summary statistics for the numeric values?

CREATE OR REPLACE table transformed.summary_table AS -- 7. Get min, max, and arithmetic mean. Note that you can use GROUP BY here to segment the data, too -- 13. Merging records, too! SELECT ‘score’ as attribute, MIN(score) as minimum, AVG(score) as mean, MAX(score) as maximum, STDDEV(score) as standard_deviation FROM original.activity_table UNION SELECT ‘height’ as attribute, MIN(height) as minimum, AVG(height) as mean, MAX(height) as maximum, STDDEV(score) as standard_deviation FROM original.activity_table UNION SELECT ‘weight’ as attribute, MIN(weight) as minimum, AVG(weight) as mean, MAX(weight) as maximum, STDDEV(score) as standard_deviation FROM original.activity_table

Percentiles are slightly trickier:

-- 7. More summary stats CREATE OR REPLACE table transformed.percentile_table AS SELECT ‘score’ as attribute, PERCENTILE_CONT(score, 0.25) OVER() as first_quartile, PERCENTILE_CONT(score, 0.50) OVER() as median, PERCENTILE_CONT(score, 0.75) OVER() as third_quartile FROM salesforce.opportunity LIMIT 1 -- ... -- From here on, you would simply do what you did in the last example in terms of UNION

How about joining the percentile tables to the rest of the summary tables after you have created the two from above? You could do this:

CREATE OR REPLACE table transformed.full_summary_table AS SELECT attribute, minimum, first_quartile, mean, median, third_quartile, maximum, standard_deviation, -- 6. Another example here of derivation, using arithmetic to get the interquartile range third_quartile - first_quartile as iqr FROM transformed.summary_table -- 12. Join it! LEFT JOIN transformed.percentile_table on percentile_table.attribute = summary_table.attribute

Note that you can trigger this to activate after the two prerequisite tables are populated.

What if you wanted to turn some of your row values into columns? Going from “long” to “wide” is a common need in data analysis.

You may want to see the different types of activities as columns. For this, you will have to enumerate each value you want to turn into a column. Let’s say the only possible values are tennis, boxing, soccer, and hockey:

-- 8. Pivoting CREATE OR REPLACE table transformed.scores_pivoted AS SELECT first_name, activity, score FROM original.activity_table PIVOT(SUM(score) for activity in (‘tennis’, ‘boxing’, ‘soccer’, ‘hockey’)) as p

If your platform doesn’t have PIVOT, you could try something like this instead:

-- 8. Pivoting the hard way CREATE OR REPLACE table transformed.scores_pivoted AS SELECT first_name, SUM(IF(activity = ‘tennis’, score, NULL)) tennis, SUM(IF(activity = ‘boxing’, score, NULL)) boxing, SUM(IF(activity = ‘soccer’, score, NULL)) soccer, SUM(IF(activity = ‘hockey’, score, NULL)) hockey FROM original.activity_table GROUP BY First_name

What if you have a column that contains multiple values? Let’s say you have a bunch of personal profiles with nicknames in a table called original.names:

column type
id INTEGER
first_name STRING
last_name STRING
nicknames STRING

The nicknames are strings, and the values are comma-separated. You want to create a new table with a one-to-many mapping with the following columns (note that this only works in Snowflake at the time of writing this):

column type
id INTEGER
nickname STRING
-- 10. Split a column containing delimited values CREATE OR REPLACE table transformed.nicknames AS SELECT Id, n.value::string as name FROM original.names LATERAL FLATTEN(INPUT=>SPLIT(nicknames, ‘,’)) n;

Beyond EL

With a little creativity, there is plenty you can do to more easily organize and optimize your data for reporting and dashboards. To learn more about Fivetran Transformations, sign up for a personalized demo or get started today by signing up for a free trial.

Start for free

Join the thousands of companies using Fivetran to centralize and transform their data.

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Product
Product

Get started with the “T” in ELT

Get started with the “T” in ELT

June 2, 2019
June 2, 2019
Get started with the “T” in ELT
Use Fivetran Transformations to prepare your data for reporting.

With the release of Fivetran Transformations, Fivetran is officially a full-fledged ELT tool. With Fivetran Transformations, you can extract, load and transform in the same environment, allowing you to leverage scheduling and version control without having to jury-rig or configure other tools and technologies.

There are two main purposes for transforming data. One is for legibility and organizational coherence, similar to the approach Looker takes to models. The other is to enhance the performance and cost characteristics of your data warehouse.

Likewise, there are two methods for transforming data. The first method, the view, is not stored on disk. Every time someone accesses the view, the system runs the corresponding query to create it. In an ideal world with zero latency and unlimited computational resources, all transformations would simply be views.

Materialized views, on the other hand, are stored on disk because views generated on the fly from a large table or complex query can be extremely non-performant. In fact, the materialized views approach is sometimes necessary to get any performance at all. For instance, Resources exceeded errors on BigQuery can easily crop up if you try to surface tables with hundreds of millions of rows.

The Fivetran transformation layer is an implementation of the materialized views approach. It constructs new tables and updates them on a schedule of your choosing. The interface looks like this:

Note that your SQL script should contain the create or replace table syntax, as well as the name of the schema and the table that the transformation will populate.

An overview of transformations

The transformations you will perform can be sorted into the following categories:

Revising

  1. Cleaning: altering values for formatting compatibility
  2. Format revision/conversion: replacing incompatible characters, converting units, converting date formatting
  3. Key restructuring: altering values so that they can be used as keys across tables
  4. Deduplication: identifying and removing duplicate records.
  5. Data validation: evaluate the validity of a record by the completeness of the data

Computing

  1. Derivation: performing simple cross-column calculations
  2. Summarization: using aggregate functions to get summary values
  3. Pivoting: turning row values into columns
  4. Sorting or ordering and indexing to improve search performance

Separating

  1. Splitting: splitting a single column into multiple columns — used for fields containing delimited values, or for converting a column with multiple possible categorical values into dummy variables for regression analysis
  2. Filtering: selecting only certain rows or columns

Combining

  1. Joining: linking data across sources
  2. Merging/appending/union: combining records from multiple sources
  3. Integration: reconcile names and values for the same data element across different sources

Some samples

Suppose you had a table original.activity_table with the following columns:

column type
first_name STRING
activity STRING
height FLOAT
weight INTEGER
email STRING
score INTEGER

Multiple people can have the same first name and each person can participate in multiple activities.

What if you wanted to clean up activity values and convert the relevant figures to metric?

CREATE or REPLACE table transformed.activity_table AS SELECT -- 1. Cleaning: let’s remove leading and trailing white spaces and inconsistent capitalization: LOWER(TRIM(activity)) as activity, -- 2. Format conversion: let’s change height from inches to cm and weight from lbs to kg height * 2.54 as height, weight / 2.205 as weigh FROM original.test_table

What if you wanted to identify and extract unique participants and create unique identifiers for them? What if you wanted to exclude records that can’t be verified because they lack emails?

CREATE OR REPLACE table transformed.user_table AS WITH users_grouped AS (SELECT first_name, email FROM original.test_table WHERE -- 5. Exclude records with null values in the “email” column -- 11. This is an example of filtering, too email is not NULL -- 4. Deduplicate by grouping GROUP BY first_name, email) SELECT first_name, email, -- 3. Key restructuring: let’s create an identifier by using first_name and assigning an index number based on sorting by email address CONCAT(first_name, '-' , CAST(row_number() over (PARTITION by first_name ORDER BY email) as string)) as identifier FROM users_grouped

Suppose we use the same table from the last example, only without any duplicate persons and with all units in metric:

column type
first_name STRING
activity STRING
height FLOAT
weight INTEGER
score INTEGER

What if we want to compute BMI?

CREATE OR REPLACE table transformed.bmi_table AS SELECT first_name, activity, height, weight, -- 6. Perform a simple calculation involving more than one column weight / POWER(height, 2) as bmi, score FROM original.activity_table

What about getting some summary statistics for the numeric values?

CREATE OR REPLACE table transformed.summary_table AS -- 7. Get min, max, and arithmetic mean. Note that you can use GROUP BY here to segment the data, too -- 13. Merging records, too! SELECT ‘score’ as attribute, MIN(score) as minimum, AVG(score) as mean, MAX(score) as maximum, STDDEV(score) as standard_deviation FROM original.activity_table UNION SELECT ‘height’ as attribute, MIN(height) as minimum, AVG(height) as mean, MAX(height) as maximum, STDDEV(score) as standard_deviation FROM original.activity_table UNION SELECT ‘weight’ as attribute, MIN(weight) as minimum, AVG(weight) as mean, MAX(weight) as maximum, STDDEV(score) as standard_deviation FROM original.activity_table

Percentiles are slightly trickier:

-- 7. More summary stats CREATE OR REPLACE table transformed.percentile_table AS SELECT ‘score’ as attribute, PERCENTILE_CONT(score, 0.25) OVER() as first_quartile, PERCENTILE_CONT(score, 0.50) OVER() as median, PERCENTILE_CONT(score, 0.75) OVER() as third_quartile FROM salesforce.opportunity LIMIT 1 -- ... -- From here on, you would simply do what you did in the last example in terms of UNION

How about joining the percentile tables to the rest of the summary tables after you have created the two from above? You could do this:

CREATE OR REPLACE table transformed.full_summary_table AS SELECT attribute, minimum, first_quartile, mean, median, third_quartile, maximum, standard_deviation, -- 6. Another example here of derivation, using arithmetic to get the interquartile range third_quartile - first_quartile as iqr FROM transformed.summary_table -- 12. Join it! LEFT JOIN transformed.percentile_table on percentile_table.attribute = summary_table.attribute

Note that you can trigger this to activate after the two prerequisite tables are populated.

What if you wanted to turn some of your row values into columns? Going from “long” to “wide” is a common need in data analysis.

You may want to see the different types of activities as columns. For this, you will have to enumerate each value you want to turn into a column. Let’s say the only possible values are tennis, boxing, soccer, and hockey:

-- 8. Pivoting CREATE OR REPLACE table transformed.scores_pivoted AS SELECT first_name, activity, score FROM original.activity_table PIVOT(SUM(score) for activity in (‘tennis’, ‘boxing’, ‘soccer’, ‘hockey’)) as p

If your platform doesn’t have PIVOT, you could try something like this instead:

-- 8. Pivoting the hard way CREATE OR REPLACE table transformed.scores_pivoted AS SELECT first_name, SUM(IF(activity = ‘tennis’, score, NULL)) tennis, SUM(IF(activity = ‘boxing’, score, NULL)) boxing, SUM(IF(activity = ‘soccer’, score, NULL)) soccer, SUM(IF(activity = ‘hockey’, score, NULL)) hockey FROM original.activity_table GROUP BY First_name

What if you have a column that contains multiple values? Let’s say you have a bunch of personal profiles with nicknames in a table called original.names:

column type
id INTEGER
first_name STRING
last_name STRING
nicknames STRING

The nicknames are strings, and the values are comma-separated. You want to create a new table with a one-to-many mapping with the following columns (note that this only works in Snowflake at the time of writing this):

column type
id INTEGER
nickname STRING
-- 10. Split a column containing delimited values CREATE OR REPLACE table transformed.nicknames AS SELECT Id, n.value::string as name FROM original.names LATERAL FLATTEN(INPUT=>SPLIT(nicknames, ‘,’)) n;

Beyond EL

With a little creativity, there is plenty you can do to more easily organize and optimize your data for reporting and dashboards. To learn more about Fivetran Transformations, sign up for a personalized demo or get started today by signing up for a free trial.

Topics
No items found.
Share

Related blog posts

No items found.
No items found.
No items found.

Start for free

Join the thousands of companies using Fivetran to centralize and transform their data.

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.