Learn how and when to apply transformations in your data strategy.
Data transformation in the context of analytics can refer to a broad variety of topics, ranging from data wrangling to modeling. The way that we’ll define data transformation here is about preparing your data for analytics, from manipulating extracted data to creating a final rollup table or data mart.
The benefit of using data to inform your decisions is that you’ll have real information to make decisions. However, this data can sometimes be hard to work with — from deciphering extracted data to values that break your modeling and/or reporting. Some common problems that transformation aims to address are:
Data deduplication: often happening with misfired events resulting in recording actions twice
Data cleanup: incorrect or null values recorded, anonymization or encryption of sensitive data
Data manipulation from extracted data: flattening JSON returned from APIs, data type mapping
Data manipulation in loading data: schema maintenance, updating data into appropriate rows, aggregating and cleaning “raw” data into a report-ready format
For simplicity's sake, we'll consider two "places" data transformation can take place relative to your data warehouse: pre-load and post-load. You can read more about our post on ETL vs ELT that offer perspectives on when to apply both. When building a Modern Data Stack, be sure to consider where your business transformations will take place, as this will likely determine the agility of your projects and team. There are a couple common ways to do this:
Python: This is usually used in combination with a library like pandas to emulate SQL for data manipulation.
SQL: Applied in the destination database which can leverage the powerful capabilities of a cloud data warehouse like Snowflake, BigQuery or Redshift
If you’re interested in learning more about the benefits of either approach, George Fraser (our CEO), goes into more detail in his Towards Data Science blog post. In addition to the aforementioned methods, you may also use an open source tool like Airflow or dbt to orchestrate transformations.
Fivetran automatically addresses many of the common issues that data professionals face with transformations. In addition to scheduling extracts from your data sources, Fivetran automatically creates and manages schemas and data types appropriately for your destination. This creates tables that are immediately ready to query in your data warehouse or data lake.
In addition, Fivetran offers features such as column blocking and hashing to exclude or anonymize sensitive data. Fivetran also supports dbt through our extensive package library that can be readily applied to our supported data sources. Finally, for any post-load SQL transformations that you’re running, Fivetran Transformations can trigger your SQL code upon new data loads.