With the release of Fivetran Transformations, Fivetran is officially a full-fledged ELT tool. With Fivetran Transformations, you can extract, load and transform in the same environment, allowing you to leverage scheduling and version control without having to jury-rig or configure other tools and technologies.
There are two main purposes for transforming data. One is for legibility and organizational coherence, similar to the approach Looker takes to models. The other is to enhance the performance and cost characteristics of your data warehouse.
Likewise, there are two methods for transforming data. The first method, the view, is not stored on disk. Every time someone accesses the view, the system runs the corresponding query to create it. In an ideal world with zero latency and unlimited computational resources, all transformations would simply be views.
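A view, in other words, is just a stored query that the warehouse re-runs on every read. As a minimal sketch (the schema, table and column names here are illustrative, not from any real pipeline):

```sql
-- A view stores only the query text; the warehouse re-executes it on every read
CREATE VIEW reporting.recent_users AS
SELECT
  user_id,
  last_seen
FROM raw.users
WHERE last_seen > DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);
```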
Materialized views, on the other hand, are stored on disk because views generated on the fly from a large table or complex query can be extremely non-performant. In fact, the materialized views approach is sometimes necessary to get any performance at all. For instance, Resources exceeded errors on BigQuery can easily crop up if you try to surface tables with hundreds of millions of rows.
The Fivetran transformation layer is an implementation of the materialized views approach. It constructs new tables and updates them on a schedule of your choosing.
Note that your SQL script should use the CREATE OR REPLACE TABLE syntax and fully qualify the destination with the name of the schema and the table that the transformation will populate.
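As a sketch, a transformation script generally takes the following shape (the schema and table names here are placeholders):

```sql
-- The destination must be fully qualified as schema.table
CREATE OR REPLACE TABLE transformed.my_table AS
SELECT *
FROM original.my_source_table;
```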
An overview of transformations
The transformations you will perform can be sorted into the following categories:
Revising
Cleaning: altering values for formatting compatibility
Format revision/conversion: replacing incompatible characters, converting units, converting date formatting
Key restructuring: altering values so that they can be used as keys across tables
Deduplication: identifying and removing duplicate records
Data validation: evaluating the validity of a record based on the completeness of its data
Summarization: using aggregate functions to get summary values
Pivoting: turning row values into columns
Sorting or ordering: arranging and indexing records to improve search performance
Separating
Splitting: splitting a single column into multiple columns — used for fields containing delimited values, or for converting a column with multiple possible categorical values into dummy variables for regression analysis
Filtering: selecting only certain rows or columns
Combining
Joining: linking data across sources
Merging/appending/union: combining records from multiple sources
Integration: reconciling names and values for the same data element across different sources
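The dummy-variable case of splitting, mentioned above, can be sketched with CASE expressions. Assuming a table like the sample below, with a categorical activity column (only two of the possible values are shown):

```sql
-- Turn categorical values into 0/1 dummy variables for regression analysis
SELECT
  first_name,
  CASE WHEN activity = 'tennis' THEN 1 ELSE 0 END AS is_tennis,
  CASE WHEN activity = 'boxing' THEN 1 ELSE 0 END AS is_boxing
FROM original.activity_table;
```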
Some samples
Suppose you had a table original.activity_table with the following columns:
column      type
first_name  STRING
activity    STRING
height      FLOAT
weight      INTEGER
email       STRING
score       INTEGER
Multiple people can have the same first name and each person can participate in multiple activities.
What if you wanted to clean up activity values and convert the relevant figures to metric?
CREATE OR REPLACE TABLE transformed.activity_table AS
SELECT
  -- 1. Cleaning: let's remove leading and trailing white spaces and inconsistent capitalization
  LOWER(TRIM(activity)) AS activity,
  -- 2. Format conversion: let's change height from inches to cm and weight from lbs to kg
  height * 2.54 AS height,
  weight / 2.205 AS weight
FROM original.activity_table
What if you wanted to identify and extract unique participants and create unique identifiers for them? What if you wanted to exclude records that can’t be verified because they lack emails?
CREATE OR REPLACE TABLE transformed.user_table AS
WITH users_grouped AS (
  SELECT
    first_name,
    email
  FROM original.activity_table
  -- 5. Data validation: exclude records with null values in the "email" column
  -- 11. This is an example of filtering, too
  WHERE email IS NOT NULL
  -- 4. Deduplicate by grouping
  GROUP BY first_name, email
)
SELECT
  first_name,
  email,
  -- 3. Key restructuring: let's create an identifier by using first_name
  -- and assigning an index number based on sorting by email address
  CONCAT(first_name, '-', CAST(ROW_NUMBER() OVER (PARTITION BY first_name ORDER BY email) AS STRING)) AS identifier
FROM users_grouped
Suppose we use the same table from the last example, only without any duplicate persons and with all units in metric:
column      type
first_name  STRING
activity    STRING
height      FLOAT
weight      INTEGER
score       INTEGER
What if we want to compute BMI?
CREATE OR REPLACE TABLE transformed.bmi_table AS
SELECT
  first_name,
  activity,
  height,
  weight,
  -- 6. Perform a simple calculation involving more than one column
  -- (height is stored in cm, so convert to meters for BMI)
  weight / POWER(height / 100, 2) AS bmi,
  score
FROM original.activity_table
What about getting some summary statistics for the numeric values?
CREATE OR REPLACE TABLE transformed.summary_table AS
-- 7. Get min, max, and arithmetic mean. Note that you can use GROUP BY here to segment the data, too
-- 13. Merging records, too!
SELECT
  'score' AS attribute,
  MIN(score) AS minimum,
  AVG(score) AS mean,
  MAX(score) AS maximum,
  STDDEV(score) AS standard_deviation
FROM original.activity_table
UNION ALL
SELECT
  'height' AS attribute,
  MIN(height) AS minimum,
  AVG(height) AS mean,
  MAX(height) AS maximum,
  STDDEV(height) AS standard_deviation
FROM original.activity_table
UNION ALL
SELECT
  'weight' AS attribute,
  MIN(weight) AS minimum,
  AVG(weight) AS mean,
  MAX(weight) AS maximum,
  STDDEV(weight) AS standard_deviation
FROM original.activity_table
Percentiles are slightly trickier:
-- 7. More summary stats
CREATE OR REPLACE TABLE transformed.percentile_table AS
SELECT
  'score' AS attribute,
  PERCENTILE_CONT(score, 0.25) OVER() AS first_quartile,
  PERCENTILE_CONT(score, 0.50) OVER() AS median,
  PERCENTILE_CONT(score, 0.75) OVER() AS third_quartile
FROM original.activity_table
LIMIT 1
-- From here on, you would simply add the height and weight rows with UNION ALL,
-- as in the last example
How about joining the percentile tables to the rest of the summary tables after you have created the two from above? You could do this:
CREATE OR REPLACE TABLE transformed.full_summary_table AS
SELECT
  summary_table.attribute,
  minimum,
  first_quartile,
  mean,
  median,
  third_quartile,
  maximum,
  standard_deviation,
  -- 6. Another example here of derivation, using arithmetic to get the interquartile range
  third_quartile - first_quartile AS iqr
FROM transformed.summary_table
-- 12. Join it!
LEFT JOIN transformed.percentile_table
  ON percentile_table.attribute = summary_table.attribute
Note that you can schedule this transformation to run after the two prerequisite tables are populated.
What if you wanted to turn some of your row values into columns? Going from “long” to “wide” is a common need in data analysis.
You may want to see the different types of activities as columns. For this, you will have to enumerate each value you want to turn into a column. Let’s say the only possible values are tennis, boxing, soccer, and hockey:
-- 8. Pivoting
CREATE OR REPLACE TABLE transformed.scores_pivoted AS
SELECT *
FROM (
  SELECT first_name, activity, score
  FROM original.activity_table
)
PIVOT(SUM(score) FOR activity IN ('tennis', 'boxing', 'soccer', 'hockey')) AS p
If your platform doesn’t have PIVOT, you could try something like this instead:
-- 8. Pivoting the hard way
CREATE OR REPLACE TABLE transformed.scores_pivoted AS
SELECT
  first_name,
  SUM(IF(activity = 'tennis', score, NULL)) AS tennis,
  SUM(IF(activity = 'boxing', score, NULL)) AS boxing,
  SUM(IF(activity = 'soccer', score, NULL)) AS soccer,
  SUM(IF(activity = 'hockey', score, NULL)) AS hockey
FROM original.activity_table
GROUP BY first_name
What if you have a column that contains multiple values? Let’s say you have a bunch of personal profiles with nicknames in a table called original.names:
column      type
id          INTEGER
first_name  STRING
last_name   STRING
nicknames   STRING
The nicknames are strings, and the values are comma-separated. You want to create a new table with a one-to-many mapping with the following columns (note that this only works in Snowflake at the time of writing this):
column    type
id        INTEGER
nickname  STRING
-- 10. Split a column containing delimited values
CREATE OR REPLACE TABLE transformed.nicknames AS
SELECT
  id,
  n.value::string AS nickname
FROM original.names,
LATERAL FLATTEN(INPUT => SPLIT(nicknames, ',')) n;
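If your warehouse is BigQuery rather than Snowflake, a roughly equivalent sketch uses SPLIT and UNNEST (you may want to adjust the whitespace handling for your data):

```sql
-- 10. Split a column containing delimited values (BigQuery variant)
CREATE OR REPLACE TABLE transformed.nicknames AS
SELECT
  id,
  TRIM(nickname) AS nickname
FROM original.names,
UNNEST(SPLIT(nicknames, ',')) AS nickname;
```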
Beyond EL
With a little creativity, there is plenty you can do to more easily organize and optimize your data for reporting and dashboards. To learn more about Fivetran Transformations, sign up for a personalized demo or get started today by signing up for a free trial.