Automated data integration can help you jump-start your machine learning efforts.
It’s an oft-cited factoid that data scientists spend only 20% of their time doing actual data science work, while the rest of their time is spent on doing what is often delightfully referred to as “data munging” — that is, obtaining, cleaning, and preparing data for analysis. Regardless of how true that statistic actually is, most data scientists I know agree with the sentiment — they spend a lot of their time on tasks that do not actually require much science.
Grossly generalizing, the traditional data science workflow involves a lot of custom data extraction and cleaning work. A data scientist comes up with an idea, goes on the hunt for data, talks with a few other people about where those data might be stored and what their quirks might be, iterates a few times “cleaning” the data and discovering heretofore unknown quirks, and only after all of that process are they able to actually start working on their statistical or machine learning (ML) model*. Then, once they have an ML model that works sufficiently well, they have to try to get that model somewhere so that the predictions it produces can actually be used. In this process, the ML component, the actual “data science” component, is the least time-consuming of the whole process!
And this is a bummer! If you’re a data scientist, you want to spend your time sciencing the data, not doing tedious data-cleaning work. And if you’re a person who hires and employs data scientists, you want them spending their time on the high-value work that only data scientists can do, not re-writing the same core business definitions over and over.
In this blog post, I’m going to talk about how we as data scientists can make use of a data infrastructure paradigm often called the Modern Data Stack to work more efficiently and ship more ML models, faster. By building our data infrastructure on top of the modern data stack, we can make our data science more efficient, our machine learning more robust, and our research cycle more rapid.
Note: For the rest of this blog post, I’m going to be talking about data science in the context of building and shipping ML models, although I believe that most of this blog post will generalize nicely to other types of non-production data science work as well.
I’m a firm believer in practicing lean data science: that is, applying the lean-startup methods of short iteration times and testing early and often to in-house data science projects. I believe that by being disciplined in this way we can avoid locking ourselves in the data-science ivory tower and make sure that we’re continuously delivering value.
In general, this means shipping more ML models, faster. We want to empower our data scientists to quickly experiment with new ML models in new contexts and then put those models into production. We want our data scientists to be able to easily improve ML models that are already running in production. And we want to be able to do all of that as cheaply as possible. Easy, right?
The goal of the infrastructure we’re going to review in this post is to meet these needs. We want to:
empower data science teams to ship more ML models
let data scientists spend more time on ML modeling and less time on everything else
avoid data quality pitfalls that often slow down and frustrate data science teams
We can achieve all of these goals by building our data science infrastructure on top of the Modern Data Stack.
The modern data stack consists of:
3rd-party ingestion, handled by a service like Fivetran
A cloud data warehouse/data lake like Snowflake, BigQuery, Redshift, or Databricks
An in-warehouse data modeling layer like dbt
A BI tool for surfacing those insights to the business
This has become the standard, baseline recommendation for the majority of companies looking to upgrade their data game (though, of course, there are companies with specialty use cases that will need variations on this setup). There is a great deal of further reading about the advantages of a system like this (e.g., here, here, and here) so definitely check those out if you’re not familiar.
One thing to note is that in the ideal case, the third-party ingestion works in an append-only manner so that changes to data are preserved rather than over-written. Tools like Fivetran often support this depending on the data source, but where possible it’s best to work towards a system that preserves all history of the objects that are important to your business, not just their current state (we’ll discuss why that can be important a bit later).
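To make the append-only idea concrete, here is a minimal sketch of how preserved history lets us reconstruct what the data looked like at an earlier point in time. Everything in it is an assumption for illustration: an in-memory SQLite database stands in for the warehouse, and the `subscription_history` table, its columns, and the `state_as_of` helper are hypothetical names, not something Fivetran or any particular tool produces.

```python
import sqlite3

# In-memory SQLite stands in for the cloud warehouse (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE subscription_history (
        customer_id INTEGER,
        plan        TEXT,
        synced_at   TEXT  -- ISO-8601 date on which this row version was loaded
    )
    """
)

# Append-only loading: every change arrives as a new row; nothing is updated in place.
conn.executemany(
    "INSERT INTO subscription_history VALUES (?, ?, ?)",
    [
        (1, "free", "2021-01-01"),
        (1, "pro", "2021-03-15"),
        (2, "pro", "2021-02-01"),
        (2, "cancelled", "2021-04-10"),
    ],
)


def state_as_of(conn, as_of):
    """Return {customer_id: plan} as it looked on the given ISO date."""
    rows = conn.execute(
        """
        SELECT h.customer_id, h.plan
        FROM subscription_history AS h
        JOIN (
            -- Latest row version per customer that was loaded on or before as_of.
            SELECT customer_id, MAX(synced_at) AS synced_at
            FROM subscription_history
            WHERE synced_at <= ?
            GROUP BY customer_id
        ) AS latest
          ON latest.customer_id = h.customer_id
         AND latest.synced_at = h.synced_at
        """,
        (as_of,),
    ).fetchall()
    return dict(rows)


past = state_as_of(conn, "2021-03-01")
current = state_as_of(conn, "2021-12-31")
```

With a table that only stored the current state, `past` could not be recovered: customer 2 would simply appear as cancelled, with no record that they were once on the pro plan.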
Our goal for data science infrastructure is for a data scientist to be able to ship a new ML model simply by:
Writing a simple query
Training an ML model
Publishing predictions to somewhere they can be used and monitored
Ideally, steps one and three should be easy and seamless, so that data scientists can spend the bulk of their time adding value by working on the details of the ML model.
So how does the Modern Data Science Stack (MDSS) help us with this goal? Here’s what the MDSS looks like once we include our ML applications:
If we use our data warehouse as the source of truth for our data modeling efforts and as the receptacle for our predictions, we get ML model-building with much less munging and ML model monitoring without any custom software.
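As a rough illustration of what monitoring without custom software can mean, the sketch below treats monitoring as nothing more than a query that joins stored predictions to observed outcomes. The setup is assumed, not prescribed: SQLite stands in for the warehouse, and the `churn_predictions` and `churn_outcomes` tables, their columns, and the `model_version` label are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; all table names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE churn_predictions (
        customer_id INTEGER, model_version TEXT, predicted_churn INTEGER
    );
    CREATE TABLE churn_outcomes (customer_id INTEGER, churned INTEGER);

    -- Predictions written back by the ML application.
    INSERT INTO churn_predictions VALUES
        (1, 'v1', 0), (2, 'v1', 1), (3, 'v1', 1), (4, 'v1', 0);
    -- What actually happened, as modeled elsewhere in the warehouse.
    INSERT INTO churn_outcomes VALUES (1, 0), (2, 1), (3, 0), (4, 0);
    """
)

# Monitoring is an ordinary query: share of predictions that matched the outcome,
# grouped by model version.
accuracy_by_version = dict(
    conn.execute(
        """
        SELECT p.model_version,
               AVG(CASE WHEN p.predicted_churn = o.churned THEN 1.0 ELSE 0.0 END)
        FROM churn_predictions AS p
        JOIN churn_outcomes AS o USING (customer_id)
        GROUP BY p.model_version
        """
    ).fetchall()
)
```

Because the result is just a query over warehouse tables, the same BI tool the rest of the business already uses could chart it over time.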
With core business concepts defined once in the warehouse, the data scientist can make use of them in their ML models without having to figure out the definitions on their own. Even better, as other team members contribute to improving the data quality in the warehouse (finding and squashing bugs, refining business logic, etc.) those data quality improvements will flow directly into the machine learning models. Another win!
Once we have this infrastructure set up and working, the ML modeling workflow gets streamlined so that we can spend more time modeling and less time doing data wrangling. Instead of spending the first days or weeks of a project cleaning and organizing data, we can quickly assemble the data that we need and start on machine learning.
This is a huge win that few people appreciate — first and foremost, we can reduce the data-munging burden on data scientists by defining all of our core business concepts in the data warehouse. Answers to questions like “do we consider a customer churned when their credit card fails?” and “how do we calculate net revenue?” can be pre-calculated and readily available for ML modeling, saving data scientists tons of time. And, we can be sure that we’re using the same definitions for critical business concepts as everyone else in the organization.
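Here is a small sketch of what a pre-calculated business concept might look like. It assumes, purely for illustration, that net revenue means gross amount minus refunds and that a failed card counts as churn; your organization’s real definitions will differ. SQLite stands in for the warehouse, a plain view stands in for a model in a tool like dbt, and the `payments` and `customer_metrics` names are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; names and definitions are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE payments (
        customer_id     INTEGER,
        gross_amount    REAL,
        refunded_amount REAL,
        card_failed     INTEGER  -- 1 if the charge failed, else 0
    );
    INSERT INTO payments VALUES
        (1, 100.0, 0.0, 0),
        (1, 100.0, 20.0, 0),
        (2, 50.0, 0.0, 1);

    -- The shared definition: maintained once, read by BI dashboards and ML alike.
    CREATE VIEW customer_metrics AS
    SELECT
        customer_id,
        SUM(gross_amount - refunded_amount) AS net_revenue,
        MAX(card_failed)                    AS is_churned
    FROM payments
    GROUP BY customer_id;
    """
)

# The data scientist's "simple query" only selects; it does not re-derive the logic.
rows = conn.execute(
    "SELECT customer_id, net_revenue, is_churned "
    "FROM customer_metrics ORDER BY customer_id"
).fetchall()
```

If the team later decides that refunds should be handled differently, only the view changes, and every downstream query picks up the new definition.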
So the ML modeling workflow looks like:
Write a simple query to select the data that we want from the tables that are available
Train the ML model offline, just like normal
Deploy the trained ML model to make predictions
Feed the trained ML model with data using the same query from step 1
Write predictions back into the data warehouse for ongoing monitoring or use in other applications
With a little bit of infrastructure, we can make steps one, three, four, and five super simple so that we can spend the majority of our time on step two, where we can actually add the most value as data scientists.
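The five steps above can be sketched end to end in a few dozen lines. This is a toy under stated assumptions, not a production recipe: SQLite stands in for the warehouse, the `customer_features` and `churn_predictions` tables are hypothetical, the "model" is a deliberately trivial one-feature threshold rule, and "deployment" is reduced to a JSON round trip.

```python
import json
import sqlite3

# In-memory SQLite stands in for the warehouse; all table names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer_features (
        customer_id INTEGER, days_since_last_login INTEGER, churned INTEGER
    );
    -- Customer 5 is new: no churn label yet, so it is scored but not trained on.
    INSERT INTO customer_features VALUES
        (1, 2, 0), (2, 40, 1), (3, 5, 0), (4, 60, 1), (5, 45, NULL);
    CREATE TABLE churn_predictions (customer_id INTEGER, predicted_churn INTEGER);
    """
)

# Step 1: one simple query, reused for both training and scoring.
FEATURE_QUERY = """
    SELECT customer_id, days_since_last_login, churned
    FROM customer_features
    ORDER BY customer_id
"""


def fetch_rows(conn):
    return conn.execute(FEATURE_QUERY).fetchall()


def train(rows):
    """Step 2: pick the login-gap threshold with the best training accuracy."""
    labelled = [(x, y) for _, x, y in rows if y is not None]
    candidates = sorted({x for x, _ in labelled})

    def accuracy(threshold):
        hits = sum((x >= threshold) == bool(y) for x, y in labelled)
        return hits / len(labelled)

    return {"threshold": max(candidates, key=accuracy)}


def predict(model, rows):
    return [(cid, int(x >= model["threshold"])) for cid, x, _ in rows]


model = train(fetch_rows(conn))

# Step 3: "deploy" by serializing the trained model and loading it elsewhere.
deployed = json.loads(json.dumps(model))

# Step 4: feed the deployed model using the same query as step 1.
predictions = predict(deployed, fetch_rows(conn))

# Step 5: write predictions back to the warehouse for monitoring and reuse.
conn.executemany("INSERT INTO churn_predictions VALUES (?, ?)", predictions)
conn.commit()

written = conn.execute(
    "SELECT customer_id, predicted_churn FROM churn_predictions ORDER BY customer_id"
).fetchall()
```

The point of the sketch is the shape rather than the model: only `train` and `predict` would change if we swapped in a real ML library, while the query and the write-back stay the same.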
Critically, by making use of the common infrastructure already used and supported by the rest of the data team, we can worry much less about all of the parts of data pipelining that can be a real headache if we’re trying to work in a silo.
By relying on an off-the-shelf tool like Fivetran for data extraction and loading, we get a pipeline that Just Works™, data that’s consistent and easy to reason about, and, at least in some cases, the change-data-capture we crave, all right out of the box.
Assuming we’ve got a data engineering or warehousing team maintaining the data warehouse or data lake, we can sleep easy knowing they’re making sure the pipes are hooked up and running consistently. As data scientists, we want to take advantage of the infrastructure that already exists for common purposes so we don’t have to build or maintain our own infrastructure.
There are further benefits that we haven’t covered in detail, too! If we follow the pattern laid out above, we also get ML model inspectability, reproducibility, and generalizability. We will discuss those topics in another post.
Until then, remember how the modern data stack can help your entire organization work more efficiently. If you have questions/thoughts/ideas about how to put this into practice, I’d love to hear them. Please reach out.
* Unfortunately, in the data world the term “model” is overloaded and can refer either to a “data model” (the schema and semantic meaning of a relation in a database) or to a statistical model or machine learning model used for in-depth analysis or prediction. In this blog post I will refer to the former explicitly as a “data model” and the latter as an “ML model”.