Data can transform a struggling business into a successful one, but first, you need to transform the data. Here’s how.
Businesses analyze data to better understand their customers, identify new opportunities, and generally make better decisions. In order to analyze data, data must first be organized and structured in a manner that can be easily interpreted by analysts and formed into dashboards, reports, and predictive models.
Raw data is seldom structured or formatted in a way that is directly conducive to analysis. Transformation refers to any of the operations involved in turning raw data into analysis-ready data models. Data models are representations of reality that help users accomplish specific goals. For the purposes of analytics, businesses need KPIs and other metrics in order to quantify and understand what, and how, they are doing.
For example, in a business setting, the most valuable data often lives in business applications — Salesforce, Google Analytics, and the like. To maximize its value, you have to extract the data, load it into a repository, and transform it into a data model — from which you can derive business intelligence.
That’s what ELT software does: It extracts data from hundreds of different sources, loads it into a data warehouse or data lake, and transforms data from one format into another for use in analysis.
The main challenges concerning transformation are project turnaround time and accessibility. Traditionally, data transformation has been the purview of engineers well-versed in scripting languages, and has required lengthy project work cycles.
Modern approaches to transformation leverage cloud-based tools and technologies that collectively form the modern data stack.
Data can be transformed in the following ways:
Cleaning: altering values for formatting compatibility
Format revision/conversion: replacing incompatible characters, converting units, converting date formatting
Key restructuring: altering values so that they can be used as keys across tables
Deduplication: identifying and removing duplicate records.
Data validation: evaluate the validity of a record by the completeness of the data
Derivation: performing simple cross-column calculations
Summarization: using aggregate functions to get summary values
Pivoting: turning row values into columns
Sorting, ordering and indexing: to improve search performance
Splitting: splitting a single column into multiple columns — used for fields containing delimited values, or for converting a column with multiple possible categorical values into dummy variables for regression analysis
Filtering: selecting only certain rows or columns
Joining: linking data across sources
Merging/appending/union: combining records from multiple sources
Integration: reconcile names and values for the same data element across different sources
The following example involves pivoting and summarizing. Imagine you have raw data in a table with a stream of activities:
The table above isn’t particularly amenable to being read or interpreted by an analyst, nor can it readily be represented by a trendline.
Suppose you want to create a chart illustrating purchases and returns by date. In order to turn the preceding data into a trendline, you will have to group the data by date and count the number of each activity. You may end up with a table like this:
By creating this new table, you have pivoted the original table by turning values from the “activity” column into columns of their own, under which you have summarized the data by counting instances.
Data can be transformed at two stages of the data integration process. It can take place before or after the data is loaded to its destination, typically a data warehouse.
In traditional ETL, data is transformed into analysis-ready data models before it is loaded to the destination. Combining transformation with loading into the same step can preserve storage and compute resources, but introduces a great deal of brittleness into the data engineering workflow. This also means that the software used for transformations is typically written using scripting languages such as Python and Java. In addition, transformations in ETL may require a great deal of complex orchestration using tools such as Airflow.
As a result, transformations under ETL tend to be the purview of expert users, specifically engineers and data scientists.
In the modern ELT workflow, raw data is transformed into analysis-ready data models after data is loaded. Data in a warehouse environment can be transformed using SQL. This makes transformations accessible to analysts and other SQL-literate members of your organization, rather than only engineers, data scientists, and other people with serious coding chops.
For instance, suppose you’re a retailer with both physical and virtual storefronts. Your systems generate a lot of data, from inventory, point of sale, marketing, shipping, taxes, and more. You can replicate data from all of the systems that create it, then update it to a data warehouse for analysis — that’s the extract and load part.
How might you make use of the data?
You might want to create a dashboard showing timely key results and historical trends over time.
To maximize your revenue and your profits, you might generate prescriptive analytics reports. These might consider available resources, current and past performance, and desirable metrics, then suggest a strategy or specific actions.
You might even want to use the data to train a machine learning algorithm to handle tasks like just-in-time inventory and manufacturing.
Read more about the difference between ETL and ELT in our recent post on the topic.
A suite of data integration tools, called a modern data stack, that leverages ELT looks like so and funnels data through the following stages in the order specified:
Sources – data from operational databases, SaaS apps, event tracking
Data pipeline – extracts data from sources and loads it into the data warehouse, sometimes normalizing it
Data warehouse – stores data in a relational database optimized for analytics
Data transformation tool – a SQL-based tool (such as dbt) that uses data from the source to create new data models within the data warehouse
Analytics tool – tools for generating reports and visualizations, such as business intelligence platforms
When building a modern data stack, be sure to consider where your business transformations will take place — in a data warehouse or before loading the data — as this will likely determine the agility of your projects and team.
There are a couple of tools often used to do data transformation:
Python: This is usually used in combination with a data analysis library like pandas.
SQL: Used in a destination cloud data warehouse to leverage the powerful capabilities of the cloud, e.g. Snowflake, BigQuery or Redshift.
If you’re interested in learning more about the benefits of either approach, Fivetran CEO George Fraser goes into more detail in his blog post on Towards Data Science.
You can also do data transformation in R, an open source programming language for data processing, statistical computing and machine learning that many developers and data scientists are familiar with.
It is certainly possible to transform data in Excel as well. It’s not a good idea, though: Excel lacks the robustness of modern data transformation tools and can only handle small numbers of records. More importantly, using Excel for data transformation generally means you’re downloading data and removing it from data governance systems where it can be controlled, shared, and updated.
Every transformation is custom-built to produce the specific data models a particular organization needs for analytics. Using ELT to decouple extraction and loading from transformation allows you to automate and outsource the extraction and loading stages of the data integration process and use a dedicated SQL-based transformation tool once the data is warehoused.
Roughly speaking, transformed data models within the data warehouse can be views or materialized views. Views are not stored on disk. Every time someone accesses a view, the data warehouse runs a query to return the relevant data. This includes performing joins, aggregations, and other operations on the fly. In an ideal world with zero latency and unlimited computational resources, all transformations would simply be views.
Materialized views, on the other hand, are stored on disk because views generated on the fly from a large table or complex query can be extremely non-performant. In fact, data warehouses will sometimes completely choke if you try to surface tables on-the-fly with hundreds of millions of rows.
Look for transformation tools that are SQL-based and work in the data warehouse environment. Some features to look for in transformation tools include test-driven development, collaboration and version control. Some BI platforms also offer proprietary tools and languages to build transformations.
In order for outsourced, automated ELT to work at scale, the outputs of the extraction and loading pipelines must be standardized. The simplest way to standardize a data model is through normalization.
Normalization of raw data eliminates redundancies, duplicates, and derived values, and organizes tables from the data into the clearest, simplest possible set of interrelations. This helps analysts easily interpret the underlying data model of the source app, and then construct new analysis-ready data models accordingly. Since normalization is a kind of transformation, ELT is, strictly speaking, often EtLT rather than ELT.
Normalization, the small “t,” can be an extremely difficult and labor-intensive task. To correctly normalize the data from a source, you need a keen understanding of the source application’s underlying functionality and data model.
The best way to circumvent this challenge is to outsource extraction and loading to a team that has extensive experience with data engineering for that particular source. At Fivetran, we are ready to help you get started