Avoiding a common pitfall in data science by enabling history mode.
In my last blog post on the modern data science stack, I discussed ways to use the modern data stack as a foundation for doing data science. Today’s blog post is centered around the idea that we can (and should) use our cloud data warehouse for more than just business intelligence (BI) — if set up and managed properly, we can use our centralized data store to power common data science and machine learning tasks without having to invest in large amounts of costly custom software.
The most common problem that plagues machine-learning and predictions efforts in this environment is the issue of untracked slowly changing dimensions. In this situation, the data are stateful and changes to the underlying values aren’t tracked, making it impossible to recreate the state-of-the-world as of some time in the past. If not handled correctly, this problem can cause your machine-learning models to have very poor real-world predictive accuracy.
Let’s imagine we have a subscription business that bills customers monthly. We want to build a model that predicts which of our customers are going to churn before their next order so that we can intervene by offering them a discount to keep from them churning.
It turns out that, like many data science problems in the real world, plausibly the hardest part of this project is getting the problem set up correctly, before we even get to the actual machine-learning and prediction piece. This problem exemplifies a few pitfalls which are both extremely common and also very subtle. Many data scientists (including yours truly) have fallen prey to these issues in practice — in this blog post we are going to use this example to examine some of those data perils and how they impact our data science projects.
The first way we might approach this problem is to obtain our dataset from our data warehouse subscriptions and customers table that looks something like this:
This table has one row per subscription with information about when they churned and related customer information with subscription details. You can imagine this table to have hundreds of columns, ideally all features that have already been created and stored in the data warehouse.
So, naively, the problem seems simple enough— we will take this dataset and build a classifier that predicts the “unsub” column using all of the other columns to the right. For the discussion, the choice of classifier actually does not concern us too much —we could use a logistic regression, Xgboost, a deep learning model or whatever the machine learning flavor of the week is.
Unfortunately, no matter which classification algorithm we choose to work with, if we follow this path we are going to get bad results. The way we’ve set up this problem is wrong!
The incorrect setup is due to the fact that this model doesn’t line up well with the prediction problem our business is really facing. This setup is more oriented toward a more academic survival-style analysis, but it does not match the problem we will be facing with live data in production. We need to make a prediction every month among the subscriptions that are active as of that month. So, we will end up making repeated predictions for each subscription over time.
Since the setup described above only makes one prediction per subscription with all data as of the day the data were pulled, it means that our model would not match up with our real-world use case, and any performance metrics (e.g., holdout precision and recall) that we attempt to calculate from our model will not tell us anything about how the model will perform once we take it live (and we’re throwing away lots of useful observations for training!).
We want our model to reflect the prediction problem we’re going to actually face — so we can create a data set that looks more like the data we’ll see in practice. What we’re going to do is to reorganize the data set we created above so that it has one row per subscription per month instead of just one-row per subscription.
I have selected a few example subscriptions from the table above to show how the re-formatted data look. Since this is what the data will look like when we make our predictions (since we are predicting every month), this is what we want the data to look like when we train our model.
This is the first pitfall that data scientists sometimes fall into —it is important to think critically about the real-world phenomenon that’s being modeled and make sure your model-training process matches that phenomenon as closely as possible.
Unfortunately, that was the easiest pitfall to identify. The next problem is more nefarious and more subtle: it’s the problem of slowly changing dimensions. The issue is that what we want is for the data we use to train the model to be like what the model would have seen if the model were live in production — but that means that we need to know the “history” of all of the columns as of the time when the prediction would have been made (as of month 1, as of month 2), etc.
The astute reader may already have noticed that the ltv column in this table is suspect — that column appears to be showing the cumulative “lifetime value” (LTV) for each customer up until today — so if we use it as a predictor, it will mislead the model because it “knows too much” about the future. When we are making predictions in “production”, that column will be substantially less informative than it would be in training since as it works today, it encodes information about whether or not that customer has unsubscribed or not (customers with higher LTV are less likely to have unsubscribed, since that is how a customer gets to have a high LTV!).
The ltv example is pretty obvious, and it is pretty easy to imagine how to fix that calculation so that it correctly reflects a customers lifetime value as the sum of that customer’s payments only up to that point. Unfortunately, other columns are not nearly so obvious but can be just as misleading. In the data above, I’ve imagined a situation in which the business has a policy of scrubbing personally-identifying-information from subscribers who have canceled their subscriptions — that means that the production application has overwritten the email address with a NULL value, and so our “email domain” field has NULL values for every customer that has churned.
This is obviously going to be problematic for our machine learning model! During the training process, our model will identify that the email domain is highly predictive of churn, and so will generate very accurate predictions on our validation data. Unfortunately, this model will not generalize at all and we will see disastrous results if we attempt to use it in production.
And this example is among the most obvious of this type of pitfall, and I use it for illustrative purposes, but really any “slowly changing dimension” can suffer from this problem. In the example above, what happens if a customer moves? Are we correctly changing the geo_state column? If they change their subscription, are we changing the subscription_amt column?
This problem can actually be extremely difficult and error-prone to handle in practice with most data storage systems that are in common use. It’s important to note that this problem isn’t limited to data scientists working off of data warehouses — any data science project based off of any data source could potentially fall victim to the problem of slowly-changing dimensions.
The data structure that we want in order to be able to easily set up this analysis is a log rather than a table. That is, we want to track every change that’s made to the different values in the database and store those in a long time series such that we can “replay” the state of the data at any point in time (if you take the last logged change for every record that was made before a point in time, that should give you the state of the database at that point in time). This is not git for data, but rather it makes use of the table-log-duality to both track all changes and present the “current state” of the data for when that is most useful.
Depending on your data infrastructure, collecting all of the changes to your slowly-changing dimensions may seem impossible. However, Fivetran recently released a new feature called “history mode” that will automatically record and capture changes to slowly changing dimensions so that they can be stored in the log format we want above.
Once you have these data available as a log, compiling the dataset we need for this model is straightforward —we will write a loop to calculate all of the features we want to use in our machine learning model as of every month that we have history for —since our history-model log table has a field that indicates the last time the record was changed, we can re-calculate our metrics using only data where the last-changed-at field is less than the as-of prediction time.
This solves the slowly-changing dimension problem and will help ensure that our ML model will generalize nicely from training to production.