What Is Data Transformation?

Learn how and when to apply transformations in your data strategy.

By Dailey Kluck, Brandon Chen, August 25, 2020

Definition of Data Transformation

Data transformation in the context of analytics can refer to a broad variety of topics, ranging from data wrangling to modeling. The way that we’ll define data transformation here is about preparing your data for analytics, from manipulating extracted data to creating a final rollup table or data mart.

For definitions of other terms used in this post, please refer to our Definitive Glossary of Data Integration. For an in-depth treatment of the subject, read Data Transformation Explained.

Common Data Transformation Problems

The benefit of using data to inform your decisions is that you're working from real information rather than guesswork. However, this data can sometimes be hard to work with, from deciphering extracted data to handling values that break your modeling or reporting. Some common problems that transformation aims to address are:

  • Data deduplication: often needed when misfired events record the same action twice

  • Data cleanup: correcting incorrect or null values, and anonymizing or encrypting sensitive data

  • Data manipulation from extracted data: flattening JSON returned from APIs, data type mapping

  • Data manipulation in loading data: schema maintenance, updating data into appropriate rows, aggregating and cleaning “raw” data into a report-ready format
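To make these problems concrete, here is a minimal sketch in Python with pandas, using hypothetical event records (the field names and values are illustrative, not from any real source). It shows three of the tasks above in one pass: flattening nested JSON, deduplicating a misfired event, and cleaning up null and sensitive values.

```python
import hashlib

import pandas as pd

# Hypothetical raw event records, as an API might return them:
# nested JSON, a duplicate "misfired" event, and null values.
raw = [
    {"event_id": 1, "user": {"id": "a1", "email": "a@example.com"}, "amount": 9.99},
    {"event_id": 1, "user": {"id": "a1", "email": "a@example.com"}, "amount": 9.99},  # duplicate
    {"event_id": 2, "user": {"id": "b2", "email": None}, "amount": None},
]

# Flatten the nested JSON into columns ("user.id", "user.email").
df = pd.json_normalize(raw)

# Deduplicate events that were recorded twice.
df = df.drop_duplicates(subset="event_id")

# Clean up: fill a null amount and hash the email to anonymize it.
df["amount"] = df["amount"].fillna(0.0)
df["user.email"] = df["user.email"].apply(
    lambda e: hashlib.sha256(e.encode()).hexdigest() if e else None
)
```

The result is a flat, deduplicated table that is ready to load or analyze, with sensitive values replaced by one-way hashes.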

How to Do Data Transformation

For simplicity's sake, we'll consider two "places" data transformation can take place relative to your data warehouse: pre-load and post-load. Our post on ETL vs. ELT offers perspectives on when to apply each. When building a Modern Data Stack, be sure to consider where your business transformations will take place, as this will likely determine the agility of your projects and team. There are a couple of common ways to do this:

  • Python: This is usually used in combination with a library like pandas to emulate SQL for data manipulation.

  • SQL: Applied in the destination database, which can leverage the powerful capabilities of a cloud data warehouse like Snowflake, BigQuery or Redshift.
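The post-load SQL approach can be sketched with Python's built-in sqlite3 module standing in for a cloud warehouse such as Snowflake, BigQuery or Redshift (the table and column names here are invented for illustration). The pattern is the same in any SQL engine: load "raw" rows first, then aggregate them with SQL into a report-ready rollup table.

```python
import sqlite3

# An in-memory SQLite database stands in for the destination warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "east", 10.0), (2, "east", 5.0), (3, "west", 7.5)],
)

# Post-load transformation: aggregate the raw data into a rollup table.
conn.execute(
    """
    CREATE TABLE orders_by_region AS
    SELECT region, COUNT(*) AS order_count, SUM(amount) AS revenue
    FROM raw_orders
    GROUP BY region
    """
)

rollup = list(conn.execute("SELECT * FROM orders_by_region ORDER BY region"))
```

Because the transformation runs inside the database, the heavy lifting scales with the warehouse rather than with the machine issuing the query.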

If you’re interested in learning more about the benefits of either approach, George Fraser, our CEO, goes into more detail in his Towards Data Science blog post. In addition to the aforementioned methods, you may also use an open source tool like Airflow or dbt to orchestrate transformations.

Where Fivetran Fits In

Fivetran automatically addresses many of the common issues that data professionals face with transformations. In addition to scheduling extracts from your data sources, Fivetran automatically creates and manages schemas and data types appropriately for your destination. This creates tables that are immediately ready to query in your data warehouse or data lake.

In addition, Fivetran offers features such as column blocking and hashing to exclude or anonymize sensitive data. Fivetran also supports dbt through our extensive package library that can be readily applied to our supported data sources. Finally, for any post-load SQL transformations that you’re running, Fivetran Transformations can trigger your SQL code upon new data loads.

Start analyzing your data in minutes, not months

Launch any Fivetran connector instantly.