Learn the how and what of analytics and data integration.
What is data analytics
How do you integrate data?
Should you build or buy a data analytics solution?
What are some business and technical considerations for choosing a data analytics tool, and how can you get started?
Let’s start with the first two questions.
Analytics is about using data to answer questions and solve practical problems. The following diagram is called a polar area chart or coxcomb diagram:
This particular example was created by Florence Nightingale during the Crimean War. Florence Nightingale was a trained statistician as well as a nurse. By carefully tabulating and visualizing morbidity and mortality data, Nightingale determined that it was disease, not combat injuries, that was overwhelmingly killing British soldiers in theater.
Beer aficionados-cum-statistics nerds might also know that William Sealy Gosset, the chief brewer at Guinness more than a hundred years ago, was also a trained statistician. He developed the Student’s t-test to ensure beer quality.
In the context of the modern business world, analytics has the following applications:
Ad hoc reporting – Decision makers and stakeholders within an organization will often ask for the answer to a specific question on a one-off or occasional basis. Examples include identifying last quarter’s best-performing product, or the breakdown of paid, referral, and organic traffic on a website.
Business intelligence – A more organized and systematic approach to analytics than ad hoc reporting. This includes reports and dashboards with data models and visualizations to guide business decisions and strategies.
Data as a product – Companies can make data available to third parties in the form of embedded dashboards, data streams, recommendations, and other data-driven products and services. A real-world example is Yelp, a consumer-facing product that aggregates huge amounts of data to repackage into information and recommendations for customers.
Artificial intelligence/machine learning – The most sophisticated use of analytics is to build systems that can use information to automate decisions, both internally, on behalf of the company, and externally, as a product for customers.
At the organizational level, analytics allows an organization to pursue the following objectives:
Democratizing access to data/data literacy – As employees use data more and more to make decisions, an organization will react more intelligently to changing circumstances. With the right BI tools, even non-technical team members can make decisions based on data.
Enhancing your products and services – Insights gained from analytics will help to improve your offerings and provide additional transparency and reporting options to your customers
Keeping your organization competitive – Data literacy allows you to make the most of finite resources and uncover new opportunities that would otherwise be invisible
Knowledge is power. It pays to know more than the competition.
The goal of analytics is gaining a big-picture, 360-degree view of your organization’s operations so you can see how the parts work together instead of having the different business units within your organization work only with siloed, isolated samples of your organization’s data. In order to do this, you need a central repository of record that contains all of your data and updates more-or-less in real-time. The process of moving data into this central repository is data integration.
Data is collected from the real world in a variety of ways, such as:
Digital activity recorded by software triggers, like clicks on a website or application
Digital documents and content, such as social media posts
Sensor inputs, like scans at a checkout line
Manual data entry, like forms that are sent and collected
This data is usually stored in cloud-based files and databases. For a business or other organization, that means you access this data through the following means:
Database logs and query results
The volume and variety of this cloud-based data we just described have grown explosively. The chart below represents the overall volume of the world’s data in zettabytes (one billion terabytes). The trend is clearly exponential.
Source: IDC Global Datasphere 2018
How this translates into the on-the-ground reality for companies looks like this:
This graphic shows how many apps a typical company in each size bracket uses. Even small operations use dozens of data sources, and the range of apps very quickly grows alongside headcount. All of these tools leave digital footprints that can give you a more complete picture of an organization’s activities.
There are four basic steps to data integration:
Data starts off in files, operational databases, and other “sources.”
Data is extracted from sources to a destination, typically a data warehouse.
Data is cleaned, modeled and prepared for its final use.
Data is used for analytics and building data products such as artificial intelligence applications.
Non-scalable approaches to data integration also exist, particularly manually gathering and stitching data together, but such approaches are very labor-intensive and inadvisable. Rather, data integration is best performed using a suite of technological tools called a data stack.
The traditional architecture of a data stack, dating back to the 1970s, is ETL.
ETL consists of the following sequence of steps:
First, various data sources create and store data.
Then, the data pipeline extracts and transforms the data, turning it into models that analysts use to build reports and dashboards.
Then, data is loaded into a destination, typically a data warehouse.
The data is accessed through a business intelligence tool connected to the data warehouse. With the proper expertise and creativity, a team can also build machine learning or artificial intelligence applications using the data warehouse.
The project workflow for handling ETL is represented by this image
First, identify sources.
Then, scope the exact needs of the analytics work.
Then, define the schema. This means organizing the data into tables that analysts will be able to use.
Then, build the actual pipeline.
Finally, turn the data into analytics and extract insights.
The problem with the ETL approach is that the workflow is quite brittle. Extraction and transformation are tightly coupled with each other, which means that if one fails, so does the other. Staging transformation before loading introduces two failure states that prevent the data from correctly loading into its destination:
The schema changes at the source, so transformations break.
The specific data models that analysts or business users need change, making the ETL pipeline obsolete.
These conditions should, and will, happen repeatedly. Apps frequently update and data professionals should be constantly interrogating the data to learn new things.
In short, ETL pipelines demand constant maintenance and inevitably cause downtime.
So why is such a fragile workflow the norm in data analytics? Fundamentally, ETL is a method for preserving machine resources like computation and storage at the expense of human labor. It’s a product of a time with very different technological constraints. When ETL was invented in the 1970s punch-card computers were still common.
The chart below illustrates a version of Moore’s law, in which the cost of computation plummets in a matter of decades.
The cost of storage has likewise fallen, with a gigabyte costing a million just 40 years or so ago to just cents today.
Likewise, the cost of internet transit has crashed as well.
The modern-day alternative to ETL that leverages these cost savings is ELT.
By decoupling extraction and transformation, ELT removes the brittleness from extraction and loading, ensuring that data will populate the destination regardless of whether transformations are valid or not.
The ELT workflow features a shorter cycle for failure recovery:
Identify desired data sources.
Perform automated extraction and loading.
Scope the exact analytics needs the project is meant to solve.
Create data models by building transformations.
Conduct actual analytics work and extract insights.
Instead of transforming before loading, the system loads data more-or-less directly from the source to the destination. Then, with the data already in hand, analysts and data engineers can build transformations as needed, then surface the resulting data models in a business intelligence tool.
Extraction and loading are now separated from transformation, and no longer require custom configuration. This means that extraction and loading can be:
Outsourced to an outside party
Scaled up and down as needed over the cloud
We will further discuss the ramifications of outsourcing and automating in Part 2.