What Is Data Transformation?
Definition of Data Transformation
Data transformation in the context of analytics can refer to a broad variety of topics, ranging from data wrangling to modeling. Here, we'll define data transformation as preparing your data for analytics: everything from manipulating extracted data to creating a final rollup table or data mart.
For definitions of other terms used in this post, please refer to our Definitive Glossary of Data Integration.
Common Data Transformation Problems
The benefit of using data to inform your decisions is that you're working from real information rather than intuition. However, this data can be hard to work with — from deciphering extracted data to handling values that break your modeling and/or reporting. Some common problems that transformation aims to address are:
- Data deduplication: removing duplicate records, often caused by misfired events that record the same action twice
- Data cleanup: correcting incorrect or null values, and anonymizing or encrypting sensitive data
- Data manipulation from extracted data: flattening JSON returned from APIs, data type mapping
- Data manipulation in loading data: schema maintenance, upserting data into the appropriate rows, aggregating and cleaning “raw” data into a report-ready format
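To make the first three problems concrete, here is a minimal sketch in plain Python (standard library only) that deduplicates misfired events, flattens a nested JSON payload, and hashes a sensitive field. The event shape, field names, and the SHA-256 choice are illustrative assumptions, not a prescription.

```python
import hashlib
import json

# Hypothetical raw events as an API might return them: nested JSON,
# a duplicate event_id from a misfired event, and a null value.
raw = json.loads("""[
  {"event_id": 1, "user": {"id": 7, "email": "a@example.com"}, "amount": 20},
  {"event_id": 1, "user": {"id": 7, "email": "a@example.com"}, "amount": 20},
  {"event_id": 2, "user": {"id": 9, "email": null}, "amount": 35}
]""")

def transform(events):
    seen, rows = set(), []
    for e in events:
        if e["event_id"] in seen:   # deduplication: skip the misfired repeat
            continue
        seen.add(e["event_id"])
        email = e["user"]["email"]
        rows.append({
            # flattening: lift the nested user object into top-level columns
            "event_id": e["event_id"],
            "user_id": e["user"]["id"],
            # cleanup: anonymize the email by hashing; pass nulls through
            "email_hash": hashlib.sha256(email.encode()).hexdigest() if email else None,
            "amount": e["amount"],
        })
    return rows

rows = transform(raw)
```

After the transform, `rows` holds two flat, query-ready records instead of three nested ones, with the email replaced by an irreversible hash.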
How to Do Data Transformation
For simplicity's sake, we'll consider two "places" data transformation can take place relative to your data warehouse: pre-load and post-load. You can read our post on ETL vs ELT, which offers perspectives on when to apply each. When building a Modern Data Stack, be sure to consider where your business transformations will take place, as this will likely determine the agility of your projects and team. There are a couple of common ways to do this:
- Python: Usually used in combination with a library like pandas to emulate SQL for data manipulation
- SQL: Applied in the destination database, where it can leverage the powerful capabilities of a cloud data warehouse like Snowflake, BigQuery or Redshift
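As a sketch of the SQL approach, the snippet below uses an in-memory SQLite database as a stand-in for a cloud warehouse, running a post-load transformation that rolls "raw" event rows up into a report-ready summary table. The table and column names are hypothetical; in practice the same SQL pattern would run inside Snowflake, BigQuery or Redshift.

```python
import sqlite3

# In-memory SQLite stands in for the destination warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (event_date TEXT, user_id INT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [("2024-01-01", 7, 20.0), ("2024-01-01", 9, 35.0), ("2024-01-02", 7, 10.0)],
)

# Post-load transformation: aggregate raw data into a rollup table
# that reporting tools can query directly.
conn.execute("""
    CREATE TABLE daily_revenue AS
    SELECT event_date, COUNT(*) AS events, SUM(amount) AS revenue
    FROM raw_events
    GROUP BY event_date
""")
rollup = conn.execute("SELECT * FROM daily_revenue ORDER BY event_date").fetchall()
```

Because the aggregation runs where the data already lives, no rows leave the warehouse — which is the core argument for the post-load (ELT) approach.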
If you’re interested in learning more about the benefits of either approach, George Fraser, our CEO, goes into more detail in his Towards Data Science blog post. In addition to the aforementioned methods, you may also use an open-source tool like Airflow or dbt to orchestrate transformations.
Where Fivetran Fits In
Fivetran automatically addresses many of the common issues that data professionals face with transformations. In addition to scheduling extracts from your data sources, Fivetran automatically creates and manages schemas and data types appropriately for your destination. This creates tables that are immediately ready to query in your data warehouse or data lake.
In addition, Fivetran offers features such as column blocking and hashing to exclude or anonymize sensitive data. Fivetran also supports dbt through an extensive package library that can be readily applied to our supported data sources. Finally, for any post-load SQL transformations you’re running, Fivetran Transformations can trigger your SQL code when new data loads.