Use SQL to group data so that you can spot trends, uncover opportunities, and see where and how your business generates value.
This is a guest post by Eric Feng of Numeracy, a company specializing in lightweight business intelligence tools. Before he joined Numeracy, Eric worked at Dropbox and Sentry in developer relations, engineering and product.
Bucketing, also known as binning, is useful for finding groupings in continuous data (particularly numbers and timestamps). While it’s often used to generate histograms, bucketing can also be used to group rows by business-defined rules. I’ll walk through simple bucketing of various data types as well as custom buckets. Finally, to illustrate how these types of bucketing might be used, I’ll walk through an example with Numeracy’s own schemas and mock data on when our users connect a database.
The most common and naive method of bucketing is truncation. trunc cuts off the fractional part of a number, rounding it toward zero. As a bonus, if you’re trying to bucket values to the nearest 10, trunc takes a second argument for the decimal place you want to round to.
select trunc(1.4); -- returns 1
select trunc(14, -1); -- returns 10, the equivalent of trunc(14 / 10) * 10
Like trunc for numbers, date_trunc is used to bucket dates together. This is particularly useful for tracking user activity, where you’d like to smooth out weekend dips by bucketing weeks together. It accepts a wide variety of fields to truncate against, from microseconds to millennia.
select date_trunc('month', '2018-11-13'::timestamp); -- returns 2018-11-01 00:00:00
select date_trunc('week', '2018-11-13'::timestamp); -- returns 2018-11-12 00:00:00
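For instance, a weekly activity query might look like the following (a sketch against a hypothetical events table with created and org_id columns):

```sql
select date_trunc('week', created) as week,
       count(distinct org_id) as active_orgs
from events
group by 1
order by 1;
```

Because every timestamp within a given week collapses to the same Monday, weekend dips disappear into the weekly totals.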
To bucket time intervals, you can use either date_trunc or trunc. date_trunc accepts intervals, but will only truncate up to an hour.
select date_trunc('hour', '97 minutes'::interval); -- returns 01:00:00
date_trunc cannot truncate for months and years because they are irregular intervals. Be careful: instead of throwing an exception or returning null, date_trunc quietly returns a 0 interval.
select date_trunc('month', '31 days'::interval); -- returns 00:00:00
Surprisingly, date_trunc returns a 0 interval for day and week as well, despite those being regular intervals. C’est la vie.
select date_trunc('day', '25 hours'::interval); -- returns 00:00:00
If you want to bucket by intervals other than second, minute, or hour, your second option is to convert the interval into a number (namely epoch, or seconds since 1970), divide by the desired interval, and then trunc the result.
select extract(epoch from '60 seconds'::interval); -- returns 60
select trunc(extract(epoch from '139 seconds'::interval) / 60); -- returns 2
Once again, it’s important to note that extract does not work this way for other fields (like hour or day): it pulls that field out of the interval as a value, rather than converting the interval into that unit of time. If you try to extract irregular intervals, results can vary in unexpected ways. I’d recommend you stick with epoch.
select extract(minute from '61 minutes'::interval); -- returns 1
select extract(month from '31 days'::interval); -- returns 0
select extract(month from '1 month'::interval); -- returns 1
select extract(year from '1 year'::interval); -- returns 1
select extract(day from '1 year'::interval); -- returns 0
select extract(month from '1 year'::interval); -- returns 0
You can create custom buckets with the Postgres width_bucket function. The function treats the array elements as the boundaries between buckets and returns the number of the bucket the value falls into, starting at 0 for values below the first boundary. That’s why it’s important to always have your buckets sorted in ascending order (easily done with Postgres’ intarray extension). For the array {10, 20, 30}, you effectively get the four ranges <10, 10-20, 20-30, and >30.
select width_bucket(4, '{10, 20, 30}'); -- returns 0
select width_bucket(24, '{10, 20, 30}'); -- returns 2
select width_bucket(60, '{10, 20, 30}'); -- returns 3
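Since the thresholds must be in ascending order, the intarray extension’s sort function is handy when you can’t guarantee the array is already sorted (a sketch, assuming the extension is installed):

```sql
create extension if not exists intarray;

select width_bucket(24, sort('{30, 10, 20}'::int[])); -- returns 2
```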
While it’s possible to manually duplicate this logic with a CASE expression, doing so is much more verbose. width_bucket also has the advantage of taking its boundaries as values, allowing you to generate or even store bucket boundaries as SQL values, instead of needing to hard-code them in your SQL statements.
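To see why width_bucket is preferable, here’s roughly what the equivalent CASE expression looks like for the {10, 20, 30} boundaries:

```sql
select case
         when x < 10 then 0
         when x < 20 then 1
         when x < 30 then 2
         else 3
       end as bucket
from (values (4), (24), (60)) as t(x); -- returns 0, 2, 3
```

Every new boundary means another WHEN branch, whereas width_bucket only needs another array element.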
Let’s look at our example: tracking when Numeracy users first connect a database.
First, here’s how we might pull the “time to connect” from raw event tables with 'organization.created' and 'database.connected' event types:
with conn_event as (
  select org_id, min(created) as created
  from events
  where type = 'database.connected'
  group by 1
),
org_event as (
  select org_id, min(created) as created
  from events
  where type = 'organization.created'
  group by 1
),
time_to_connect as (
  select extract(epoch from conn_event.created - org_event.created) as seconds
  from org_event
  left join conn_event
    on org_event.org_id = conn_event.org_id
)
We’ll assume all of the following statements are prefixed with these CTEs. For a first pass, let’s group by day.
select trunc(seconds / 3600 / 24) as day, count(1) as org_count
from time_to_connect
group by 1
order by 1
If you run this, you can see almost everyone connects within a week. Most, in fact, connect within a day of signing up. Our intuition tells us that the users connecting on the first day actually connect in minutes. However, grouping by minute would produce hundreds of thousands of “buckets” from the long tail of users who connect months later.
select trunc(seconds / 60) as minutes,
count(1) as org_count
from time_to_connect
where trunc(seconds / 60) <= 1440
group by 1
order by 1
In most businesses, you want to define “personas”: types of users who find your software valuable. In Numeracy’s case, it’s easy to see three distinct buckets: users who connect in the first 15 minutes, in the first week, or later. Having defined and reached consensus on these, I might want to categorize these users in my queries and examine other attributes. For this, we’ll use width_bucket.
select width_bucket(seconds / 60, '{15, 10080}') as bucket,
count(1) as org_count
from time_to_connect
group by 1
In this case, I’ve passed the array {15, 10080} to the width_bucket function. If a user connects in less than 15 minutes, width_bucket returns 0. If between 15 minutes and 10,080 minutes (one week), it returns 1. If more than a week, 2.
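To make the results easier to read, you can map the bucket numbers back to persona labels (the label names here are my own, a sketch):

```sql
select case width_bucket(seconds / 60, '{15, 10080}')
         when 0 then 'instant'
         when 1 then 'first week'
         else 'long tail'
       end as persona,
       count(1) as org_count
from time_to_connect
group by 1;
```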
In Numeracy’s case, if you run the queries above, you’ll see 60% of organizations that will connect a database will do so in the first 15 minutes. 30% connect in the next 7 days. The final 10% trickle in over near-infinite time afterwards.
Now that we’ve bucketed our users, we can answer questions like, “Which bucket converts at the highest rate?” In Numeracy’s case, our initial intuition told us that users who instantly connected were more likely to convert. However, our highest conversion rates were actually amongst the second group: users who returned within a week. In retrospect, this makes sense: people with database credentials on hand are more likely to be on ops teams, whereas Numeracy is built for business and product teams.
The goal of bucketing is to group user behaviors into segments that can inform your marketing efforts and product design. By using buckets to organize data in different ways, you can detect trends that might otherwise have gone unnoticed, determine where your business currently generates the most value, and find unexploited opportunities.