Learn why you need data transformation tools and how to apply them in your data strategy.
Data is a business asset, but only if you can make use of it. Like ore in an underground lode, it’s worth much more after it’s mined, refined, and fashioned into something useful.
In a business setting, the ore is data that lives in business applications — Salesforce, Google Analytics, and the like. To maximize its value, you have to extract the data and load it into a repository from which you can derive business intelligence.
That’s what ELT software does: It extracts data from hundreds of different sources, loads it into a data warehouse or data lake, and transforms it into a shape suitable for analysis.
Everyone has a sense of what extraction and loading are, but their understanding of that third step, data transformation, may be a little nebulous. Let’s see if we can make it clearer.
Data transformation in the context of analytics can refer to a broad variety of topics, ranging from data wrangling to modeling. Here, we’ll define data transformation as preparing your data for analytics: everything from manipulating extracted data to creating a final rollup table or data mart.
As it applies to data analytics, transformation means restructuring data to fit a destination, converting fields as necessary, and enriching the raw data to make it more useful.
An example of restructuring might be taking data from a nested data structure, which is not fully normalized (in relational database terms), and representing it differently in a data warehouse that doesn’t support nested data.
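To make that concrete, here’s a minimal Python sketch of that kind of restructuring. It flattens a hypothetical nested record (the field names are invented for the example) into the flat columns a warehouse without nested-data support would expect:

```python
def flatten(record, parent_key="", sep="_"):
    """Recursively flatten nested dicts into a single flat dict,
    joining nested keys with a separator."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

# A made-up nested record, as an API might return it
nested = {
    "id": 42,
    "customer": {"name": "Acme", "address": {"city": "Oakland", "zip": "94607"}},
}
flat_row = flatten(nested)
print(flat_row)
# {'id': 42, 'customer_name': 'Acme',
#  'customer_address_city': 'Oakland', 'customer_address_zip': '94607'}
```

Each flattened key becomes a candidate column name in the destination table.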
An example of conversion might be taking a data source that uses floating point numbers at a certain level of precision and replicating it to a data warehouse that supports a different level of precision.
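A toy Python sketch of that kind of conversion, assuming a hypothetical destination column that stores only two decimal places:

```python
from decimal import Decimal, ROUND_HALF_UP

def to_destination_precision(value, places=2):
    """Quantize a source floating-point value to the destination's
    fixed number of decimal places (hypothetical destination rule)."""
    return Decimal(str(value)).quantize(Decimal(10) ** -places,
                                        rounding=ROUND_HALF_UP)

converted = to_destination_precision(19.98765)
print(converted)  # 19.99
```

Going through `Decimal` rather than rounding floats directly avoids introducing new binary floating-point error during the conversion.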
As for enriching, we’ll get to that in a moment.
The benefit of using data to inform your decisions is that you base those decisions on real information rather than intuition. However, this data can sometimes be hard to work with, whether you’re deciphering extracted data or wrestling with values that break your modeling and reporting. Some common problems that transformation aims to address are:
Data deduplication: various bugs and errors can erroneously duplicate records.
Data cleanup: incorrect or null values should be removed or corrected. Sensitive data should be anonymized or encrypted.
Data manipulation from extracted data: JSON returned from APIs should be flattened. Data types must be mapped to their counterparts in the destination.
Data manipulation in loading data: schemas must be updated as they change at the source. The appropriate records must be updated. “Raw” data needs to be aggregated and cleaned into a report-ready format.
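To make a couple of those steps concrete, here’s a toy Python sketch of deduplication and null cleanup on some made-up records:

```python
# Hypothetical extracted rows: one duplicate (e.g. from a retried
# sync) and one record with a null value.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 1, "email": "a@example.com"},  # duplicate record
    {"id": 2, "email": None},             # null value to drop
    {"id": 3, "email": "c@example.com"},
]

seen = set()
cleaned = []
for row in rows:
    key = (row["id"], row["email"])
    # Skip exact duplicates and rows with a null email
    if key in seen or row["email"] is None:
        continue
    seen.add(key)
    cleaned.append(row)

print([r["id"] for r in cleaned])  # [1, 3]
```

In practice you’d do this with SQL in the warehouse or with your pipeline tooling, but the logic is the same: define what makes a record a duplicate, and decide whether bad values are dropped or corrected.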
Once data has been loaded into your data warehouse, you can manipulate it in various ways to make it more useful. For instance:
You can use common database expressions to join data tables, in order to relate several sources together.
You can limit data to certain date ranges, or select data from only certain divisions.
You can consolidate multiple transactions into sums, counts, or averages.
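Here’s a small sketch of all three manipulations in SQL, using Python’s built-in sqlite3 module with an in-memory database standing in for a warehouse; the tables, columns, and values are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER,
                         amount REAL, order_date TEXT);
    CREATE TABLE customers (id INTEGER, region TEXT);
    INSERT INTO customers VALUES (1, 'West'), (2, 'East');
    INSERT INTO orders VALUES
        (10, 1, 100.0, '2023-01-05'),
        (11, 1, 50.0,  '2023-01-20'),
        (12, 2, 75.0,  '2023-02-02');
""")

# Join the two sources, limit to a date range, and consolidate
# transactions into per-region counts and sums.
query = """
    SELECT c.region, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.order_date BETWEEN '2023-01-01' AND '2023-01-31'
    GROUP BY c.region
"""
result = conn.execute(query).fetchall()
print(result)  # [('West', 2, 150.0)]
```

The same query would run, with minor dialect changes, against most cloud data warehouses.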
How do you transform data? You can use whatever tools you’re most familiar with. If the data is in a data warehouse, which is essentially a large, column-oriented relational database, you can use SQL, a language nearly every data professional knows. If you’re a developer, you might use R, an open source programming language for statistical computing and machine learning.
Many people use Excel for data transformation, but it’s not the best choice. Using Excel to do data transformation is like using a chainsaw for sculpture — you could do it, but the results might be pretty ugly.
By the way, the same advice applies to analytics tools. Don’t get your data just-so and then report on it using Excel. Get a cloud-based analytics tool to make beautiful, sophisticated reports. Reporting tools to consider include Tableau, Looker, Microsoft Power BI, Google Data Studio, Chartio, Mode, Sisense, and Grafana, among dozens of others.
Fivetran automatically addresses many of the common issues that data professionals face with transformations. In addition to scheduling extracts from your data sources, Fivetran automatically creates and manages schemas and data types appropriately for your destination. This creates tables that are immediately ready to query in your data warehouse or data lake.
In addition, Fivetran offers features such as column blocking and hashing to exclude or anonymize sensitive data. Fivetran also supports dbt through an extensive package library that can be readily applied to our supported data sources. Finally, for any post-load SQL transformations that you’re running, Fivetran Transformations can trigger your SQL code upon new data loads.