Data can transform a struggling business into a successful one. Learn what data transformation is, how it’s done, and why it’s essential for your analytics program.
Businesses analyze data to better understand their customers, identify new opportunities, and generally make better decisions. In order to analyze data, data must first be organized and structured in a manner that can be easily interpreted by analysts and formed into dashboards, reports, and predictive models.
Raw data is seldom structured or formatted in a way that is directly conducive to analysis. Transformation refers to any of the operations involved in turning raw data into analysis-ready data models. Data models are representations of reality that help users accomplish specific goals. For the purposes of analytics, businesses need KPIs and other metrics in order to quantify and understand what, and how, they are doing.
The main challenges concerning transformation are project turnaround time and accessibility. Traditionally, data transformation has been the purview of engineers well-versed in scripting languages, and has required lengthy project work cycles. Modern approaches to transformation leverage cloud-based tools and technologies that collectively form the modern data stack.
Data can be transformed in the following ways:
Cleaning: altering values for formatting compatibility
Format revision/conversion: replacing incompatible characters, converting units, converting date formatting
Key restructuring: altering values so that they can be used as keys across tables
Deduplication: identifying and removing duplicate records.
Data validation: evaluate the validity of a record by the completeness of the data
Derivation: performing simple cross-column calculations
Summarization: using aggregate functions to get summary values
Pivoting: turning row values into columns
Sorting, ordering and indexing: to improve search performance
Splitting: splitting a single column into multiple columns — used for fields containing delimited values, or for converting a column with multiple possible categorical values into dummy variables for regression analysis
Filtering: selecting only certain rows or columns
Joining: linking data across sources
Merging/appending/union: combining records from multiple sources
Integration: reconcile names and values for the same data element across different sources
The following example involves pivoting and summarizing. Imagine you have raw data in a table with a stream of activities:
The table above isn’t particularly amenable to being read or interpreted by an analyst, nor can it readily be represented by a trendline.
Suppose you want to create a chart illustrating purchases and returns by date. In order to turn the preceding data into a trendline, you will have to group the data by date and count the number of each activity. You may end up with a table like this:
By creating this new table, you have pivoted the original table by turning values from the “activity” column into columns of their own, under which you have summarized the data by counting instances.
Data can be transformed at two stages of the data integration process. It can take place before or after the data is loaded to its destination, typically a data warehouse.
In traditional ETL, data is transformed into analysis-ready data models before it is loaded to the destination. Combining transformation with loading into the same step can preserve storage and compute resources, but introduces a great deal of brittleness into the data engineering workflow. This also means that the software used for transformations is typically written using scripting languages such as Python and Java. In addition, transformations in ETL may require a great deal of complex orchestration using tools such as Airflow.
As a result, transformations under ETL tend to be the purview of expert users, specifically engineers and data scientists.
In the modern ELT workflow, raw data is transformed into analysis-ready data models after data is loaded. Data in a warehouse environment can be transformed using SQL. This makes transformations accessible to analysts and other SQL-literate members of your organization, rather than only engineers, data scientists, and other people with serious coding chops.
A suite of data integration tools, called a modern data stack, that leverages ELT looks like so and funnels data through the following stages in the order specified:
Sources – data from operational databases, SaaS apps, event tracking
Data pipeline – extracts data from sources and loads it into the data warehouse, sometimes normalizing it
Data warehouse – stores data in a relational database optimized for analytics
Data transformation tool – a SQL-based tool (such as dbt) that uses data from the source to create new data models within the data warehouse
Analytics tool – tools for generating reports and visualizations, such as business intelligence platforms
Every transformation is custom-built to produce the specific data models a particular organization needs for analytics. Using ELT to decouple extraction and loading from transformation allows you to automate and outsource the extraction and loading stages of the data integration process and use a dedicated SQL-based transformation tool once the data is warehoused.
Roughly speaking, transformed data models within the data warehouse can be views or materialized views. Views are not stored on disk. Every time someone accesses a view, the data warehouse runs a query to return the relevant data. This includes performing joins, aggregations, and other operations on the fly. In an ideal world with zero latency and unlimited computational resources, all transformations would simply be views.
Materialized views, on the other hand, are stored on disk because views generated on the fly from a large table or complex query can be extremely non-performant. In fact, data warehouses will sometimes completely choke if you try to surface tables on-the-fly with hundreds of millions of rows.
Look for transformation tools that are SQL-based and work in the data warehouse environment. Some features to look for in transformation tools include test-driven development, collaboration and version control. Some BI platforms also offer proprietary tools and languages to build transformations.
In order for outsourced, automated ELT to work at scale, the outputs of the extraction and loading pipelines must be standardized. The simplest way to standardize a data model is through normalization.
Normalization of raw data eliminates redundancies, duplicates, and derived values, and organizes tables from the data into the clearest, simplest possible set of interrelations. This helps analysts easily interpret the underlying data model of the source app, and then construct new analysis-ready data models accordingly. Since normalization is a kind of transformation, ELT is, strictly speaking, often EtLT rather than ELT.
Normalization, the small “t,” can be an extremely difficult and labor-intensive task. To correctly normalize the data from a source, you need a keen understanding of the source application’s underlying functionality and data model.
The best way to circumvent this challenge is to outsource extraction and loading to a team that has extensive experience with data engineering for that particular source. At Fivetran, we are ready to help you get started