# Machine Learning in One Lesson

*Note: This blog post assumes a working knowledge of Looker and LookML. Here’s a tutorial.*

Our goal at Fivetran is to make data science more accessible by automating the grunt work of data engineering. Our counterparts at Google and Looker have made it even easier to ascend the hierarchy of data needs by providing machine learning through BigQuery and LookML.

As Abhishek Kashyap noted at Cloud Next ‘18, data scientists are scarce and moving data at scale in and out of BigQuery (or any other data warehouse) is a pain. BigQuery ML allows you to build machine learning models in SQL and keep them in the same environment as the original data. Likewise, Looker allows you to keep all of your analytics code on one platform.

While you won’t be able to run convolutional neural networks or other forms of deep learning using BigQuery ML, you can create simple prediction (linear regression) and classification (binary and multiclass logistic regression) models, as well as k-means clustering models. Regression models are commonly used in both industry and academia because of their interpretability and fast execution.

## The Basic Machine Learning Workflow

Your basic machine learning workflow looks something like this:

1. Collect data
2. Generate a hypothesis and define features
3. Prepare a training set
4. Assemble a model
5. Validate the model
6. Export the model

The majority of a data scientist’s time is spent acquiring and wrangling data; this includes both **steps 1** and **3**. With Fivetran, BigQuery ML and LookML in the mix, **steps 1** and **3** become greatly simplified. **Step 6** becomes wholly unnecessary, as you can build, test and save your models on the same platform as the data.

Let’s walk through an example.

## Linear Regression

### Bivariate Analysis

It can be hard to come up with a hypothesis in a vacuum, and “common sense” will only take you so far. Your goal is to see how some important metric, called the **dependent variable** or **target** — say, revenue, renewal rates, page views or sign-ups — appears to relate to other factors, called **independent variables** or **features**, so that you know what to influence in order to achieve a positive outcome.

Choose an **explore** and comb through it for associations between variables. The most straightforward approach is to compare every pair of variables that plausibly shares some kind of relationship. Scatterplots are the simplest way to visually explore these relationships:

*Positive, zero and negative correlations, courtesy of Free Code Camp.*

In a Looker Explore, you can add trendlines to your visualizations by using the `EDIT` button in the upper right of the `VISUALIZATION` dropdown, selecting the `Y` tab, and scrolling down. Note that in order to draw trendlines through scatterplots, you must have the appropriate admin privileges. One of the assumptions of multiple linear regression is that the associations between the target and features are, in fact, linear, so it helps to confirm that they can be reasonably described using straight trendlines.

As another basic sniff test, you can get correlation coefficients using table calculations in your explore. Click the `Calculations` button in the `DATA` dropdown and use the `correl` function. The rule of thumb for correlations is that:

| Size of Correlation | Interpretation |
| --- | --- |
| 0.90 to 1.00 (-0.90 to -1.00) | Very high positive (negative) correlation |
| 0.70 to 0.90 (-0.70 to -0.90) | High positive (negative) correlation |
| 0.50 to 0.70 (-0.50 to -0.70) | Moderate positive (negative) correlation |
| 0.30 to 0.50 (-0.30 to -0.50) | Low positive (negative) correlation |
| 0.00 to 0.30 (0.00 to -0.30) | Minimal correlation |

Make note of which variables are most strongly correlated with your desired target. Feel free to ignore those with minimal correlation.
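As a sketch of what the `correl` table calculation is computing, here is the Pearson correlation coefficient in plain Python, along with the rule-of-thumb interpretation from the table above (the data values are made up for illustration):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient -- the statistic `correl` returns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def strength(r):
    """Classify a coefficient using the rule-of-thumb table above."""
    a = abs(r)
    if a >= 0.90:
        return "very high"
    if a >= 0.70:
        return "high"
    if a >= 0.50:
        return "moderate"
    if a >= 0.30:
        return "low"
    return "minimal"

# Hypothetical marketing data: spend vs. won revenue
cost = [100.0, 200.0, 300.0, 400.0, 500.0]
won = [120.0, 210.0, 330.0, 390.0, 520.0]
r = pearson_r(cost, won)
```

A pair like this, with `r` above 0.90, would be a strong candidate feature; a pair scoring under 0.30 can usually be set aside.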

Scatterplots only work if you’re comparing numeric variables. If a variable is categorical but ordinal, a table or column chart with the appropriate sorting might work instead.

### Generate a Hypothesis, Define Features, and Build Your Model

Once you have examined the correlations and determined a list of prospective features, you should be ready to begin.

Your goal is to produce a model with the form:

`Y = a + b1*X1 + b2*X2 + ... + bp*Xp`

Where `Y` is the **target**, `[X1, X2, …, Xp]` are the **features**, `[b1, b2, …, bp]` are the **weights** or coefficients, and `a` is the intercept.
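Once fitted, the model is just a weighted sum. As a minimal sketch in Python (the intercept and weights below are made-up numbers, not output from a real model):

```python
def predict(intercept, weights, features):
    """Evaluate Y = a + b1*X1 + ... + bp*Xp for a single record."""
    return intercept + sum(b * x for b, x in zip(weights, features))

# Hypothetical fit: a = 10, weights [b1, b2] = [2.0, -0.5],
# scored against one record with features [X1, X2] = [3.0, 4.0]
y = predict(10.0, [2.0, -0.5], [3.0, 4.0])
```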

Go into **development mode** in Looker and create a new **view** for your data model. You will need to set up a view that includes your regression model inputs, training information, weights and feature info. You will also need to go into the **model** and write a join between the explore you are using and the prediction explore (more on that in a bit) based on whatever unique identifier applies to the individual records you are trying to predict the values of.

**A cautionary note:** In my case, the explore I was referring to was derived from an extremely complex query (composed of joins between over a dozen tables), and attempting to run calculations on it using BigQuery caused a `Resources exceeded during query execution` error. My workaround was to copy the data from the derived table into its own temporary table, and then to turn that into an **explore** in its own right.

I organized my **view** like so. Your views, explores and columns will obviously have different names, but note how everything is organized:

```
# List your features in this view.
# Use the "explore_source" parameter to refer to the explore containing the features and target
view: marketing_regression_input {
  derived_table: {
    explore_source: marketing_activity_source {
      column: account_name {}
      column: cost {}
      column: count_records {}
      column: count_blogs {}
      column: count_people {}
      column: total_won_amount {}
      filters: {
        field: marketing_activity_source.account_name
        value: "-NULL"
      }
      filters: {
        field: marketing_activity_source.is_won
        value: "Yes"
      }
    }
  }
}

# This is the actual model. Note where it refers back to the "marketing_regression_input" view
view: marketing_regression {
  derived_table: {
    datagroup_trigger: eod_last_day_of_month
    sql_create:
      CREATE OR REPLACE MODEL ${SQL_TABLE_NAME}
      OPTIONS(model_type='linear_reg'
        , input_label_cols=['total_won_amount']
      ) AS
      SELECT
        * EXCEPT(account_name)
      FROM ${marketing_regression_input.SQL_TABLE_NAME};;
  }
}

# These are all statistics that show you the quality of your model: r-squared (r2_score)
# and various error averages, or differences between predicted and actual values.
# These, too, refer back to the "marketing_regression_input" view.
# By instantiating them as "explores" here, you will be able to access them for visualizations
# without having to cite them in the "model." The downside is that they can be accessed via
# any of your data models in the "Explore" menu.
explore: marketing_regression_evaluation {}
view: marketing_regression_evaluation {
  derived_table: {
    sql: SELECT * FROM ml.EVALUATE(
      MODEL ${marketing_regression.SQL_TABLE_NAME},
      (SELECT * FROM ${marketing_regression_input.SQL_TABLE_NAME}));;
  }
  dimension: mean_absolute_error {type: number}
  dimension: mean_squared_error {type: number}
  dimension: mean_squared_log_error {type: number}
  dimension: median_absolute_error {type: number}
  dimension: r2_score {type: number}
  dimension: explained_variance {type: number}
}

# This allows you to compare your predicted values with the real ones.
# You should join this with the explore_source in the model.
explore: marketing_regression_prediction {}
view: marketing_regression_prediction {
  derived_table: {
    sql: SELECT * FROM ml.PREDICT(
      MODEL ${marketing_regression.SQL_TABLE_NAME},
      (SELECT * FROM ${marketing_regression_input.SQL_TABLE_NAME}));;
  }
  dimension: predicted_total_won_amount {type: number}
  dimension: account_name {hidden: yes}
  dimension: total_won_amount {type: number}
  dimension: residual {
    type: number
    sql: ${predicted_total_won_amount} - ${total_won_amount} ;;
  }
  dimension: residual_percent {
    type: number
    value_format_name: percent_1
    sql: 1.0 * ${residual}/NULLIF(${total_won_amount}, 0) ;;
  }
}

# This shows the intercept and coefficients for each feature in your regression equation.
# Unfortunately, BigQuery ML doesn't provide p-values.
# Consider standardizing if your features use very different scales.
# If you don't want to standardize, omit the parameter ", STRUCT(true as standardize)".
explore: marketing_regression_weights {}
view: marketing_regression_weights {
  derived_table: {
    sql: SELECT * FROM ml.WEIGHTS(
      MODEL ${marketing_regression.SQL_TABLE_NAME}, STRUCT(true as standardize));;
  }
  dimension: processed_input {type: string}
  dimension: weight {type: number}
}

# This provides summary statistics for each feature you have selected
explore: marketing_regression_feature_info {}
view: marketing_regression_feature_info {
  derived_table: {
    sql: SELECT * FROM ml.FEATURE_INFO(
      MODEL ${marketing_regression.SQL_TABLE_NAME});;
  }
  dimension: input {type: string}
  dimension: min {type: number}
  dimension: max {type: number}
  dimension: mean {type: number}
  dimension: stddev {type: number}
  dimension: null_count {type: number}
}
```

## Selecting Features

There are several ways to select features. Since the features are all encoded in `TRAINING/TESTING INPUTS`, you can comment out any features you don’t want to include. I chose to start with the feature most strongly correlated with my target and commented out everything else. This process is known as **forward selection**: you start with an empty equation and gradually introduce features, noting how much each contributes (or not) to the R-squared, or quality-of-fit, of the equation.

Normally, there is nothing preventing you from overfitting a model by simply including more features and thus artificially raising the R-squared score. BigQuery ML does us the favor of using “predicted” or “adjusted” R-squared (the documentation doesn’t specify; I’m open to being corrected), so you may notice that adding more features might actually result in a poorer-fitting model. Use this to your advantage! Pick the combination of features that returns the highest R-squared score.
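If the score BigQuery ML reports behaves like the classic adjusted R-squared, the penalty for extra features looks like this sketch (the R-squared values, row count and feature counts are hypothetical):

```python
def adjusted_r2(r2, n, p):
    """Classic adjusted R-squared: penalizes the raw score (r2) for the
    number of features (p) relative to the number of training rows (n)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# A tiny bump in raw R-squared can still lose ground overall once
# three weak features are added (hypothetical numbers):
base = adjusted_r2(0.800, n=50, p=3)
bloated = adjusted_r2(0.805, n=50, p=6)
```

Here `bloated` comes out lower than `base` despite the higher raw score, which is exactly the "more features made it worse" effect described above.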

## Evaluating Your Predictions

To get a visual sense of whether your predictions are reasonably in line with the actual data and whether they systematically overestimate or underestimate, you can visualize the explore source that you joined together with your prediction explore in your model earlier. You should be able to find it under the Explore menu; use the prediction as a dimension and the actual values as a measure.

In the example above, the predictions are generally fair but appear to sometimes seriously underestimate the “total won amount,” especially when the values are on the lower end of the range. You can place this visualization, as well as any of the summary numbers and tables from the other **explores**, in a dashboard.
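The `residual_percent` dimension in the LookML above is the per-record version of this check. The same calculation in plain Python, including the divide-by-zero guard that `NULLIF` provides in SQL, looks like:

```python
def residual_percent(predicted, actual):
    """(predicted - actual) / actual. Returning None mirrors the SQL NULL
    that NULLIF(actual, 0) produces when the actual value is zero."""
    if actual == 0:
        return None
    return (predicted - actual) / actual

# A negative residual means the model underestimated the actual value;
# e.g. predicting 80 against an actual 100 gives -20%.
```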

Consult the “weights” to get a sense of the magnitude of influence that each feature has over your target. As mentioned in the LookML code sample, you should consider standardizing the model weights if the scales are very different from each other. You’ll be able to populate the model of form `Y = a + b1*X1 + b2*X2 + ... + bp*Xp` from earlier.
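Conceptually, what the `STRUCT(true as standardize)` option does is put every feature on a common scale so their weights become comparable. A z-score sketch in plain Python (sample data made up):

```python
import math

def standardize(xs):
    """Z-score a feature: subtract the mean and divide by the (population)
    standard deviation, leaving it with mean 0 and unit variance."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / std for x in xs]

# A feature measured in the thousands and one measured in single digits
# land on the same scale after standardization:
z = standardize([10.0, 20.0, 30.0])
```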

The numerical predictions you’ve created can be used on an ongoing basis to automate various actions whenever a new record shows up, provided the data for features are available. You’re limited only by your imagination.

## Next Steps in Machine Learning

What we’ve shown here is only the beginning of the story. The linear regression model we have demonstrated is meant for making numerical predictions. With binary and multi-class logistic regression, you have supervised learning and can predict what categories a record belongs to; with k-means clustering, you have unsupervised learning and can detect patterns by grouping records into clusters with similar characteristics.

Learn more about what Looker, BigQuery and Fivetran can do for your data team with a free demo.