28 Nov 2018 | Analyst Recipe

Guest Post: Bucketing in SQL

Use SQL to group data together so that you can see where your business generates value, spot trends, and find opportunities.
Eric Feng
Eric Feng
Guest Post: Bucketing in SQL

Guest Post: Bucketing in SQL

The following is a guest post by Eric at Numeracy, a company specializing in lightweight business intelligence tools. Before Numeracy, Eric worked at Dropbox and Sentry in developer relations, engineering, and product.

Bucketing, also known as binning, is useful to find groupings in continuous data (particularly numbers and timestamps). While it’s often used to generate histograms, bucketing can also be used to group rows by business-defined rules. I’ll walk through the simple bucketing various data types as well as custom buckets. Finally, to illustrate how these types of bucketing might be used, I’m going to walk through an example with Numeracy’s own schemas and mock data on when our users connect a database.

Bucketing Numbers

The most common and naive method of bucketing is through truncating. trunc rounds floats down to the nearest integer. As a bonus, if you’re trying to bucket values to the nearest 10, trunc takes a second argument for the decimal point you want to round to.

    select trunc(1.4); -- returns 1
    select trunc(14, -1); -- returns 10, the equivalent of trunc(14 / 10) * 10

Try it out on Numeracy

Bucketing Time

Like trunc for numbers, date_trunc is used to bucket dates together. This is particularly useful for tracking user activity, where you’d like to smooth out weekend dips by bucketing weeks together. It accepts a wide variety of fields to truncate against, from microseconds to millennia.

    select date_trunc('month', '2018-11-13'::timestamp); -- returns 2018-11-01 00:00:00
    select date_trunc('week', '2018-11-13'::timestamp); -- returns 2018-11-12 00:00:00

Try it out on Numeracy

Bucketing Intervals

To bucket time intervals, you can use either date_trunc or trunc. date_trunc accepts intervals, but will only truncate up to an hour.

    select date_trunc('hour', '97 minutes'::interval); -- returns 01:00:00

Try it out on Numeracy

date_trunc cannot truncate for months and years because they are irregular intervals. Be careful: instead of throwing an exception or returning null, date_trunc quietly returns a 0 interval.

    select date_trunc('month', '31 days'::interval); -- returns 00:00:00

Try it out on Numeracy

Surprisingly, date_trunc returns a 0 interval for day and week as well, despite being regular intervals. C’est la vie.

    select date_trunc('day', '25 hours'::interval); -- returns 00:00:00

Try it out on Numeracy

If you want to bucket by intervals other than second, minute, or hour, your second option is to change the interval into a number (namely epoch or seconds since 1970), divide by the desired interval, and then trunc the result.

    select extract(epoch from '60 seconds'::interval); -- returns 60
    select trunc(extract(epoch from '139 seconds'::interval) / 60); -- returns 2

Try it out on Numeracy

Once again, it’s important to note that other fields (like hour or day) do not work this way. It instead extracts the field as a value, and does not convert the value into that unit of time. In the event you try and extract irregular intervals, results can vary in unexpected ways. I’d recommend you stick with epoch.

    select extract(minute from '61 minutes'::interval); -- returns 1
    select extract(month from '31 days'::interval); -- returns 0
    select extract(month from '1 month'::interval); -- returns 1
    select extract(year from '1 year'::interval); -- returns 1
    select extract(day from '1 year'::interval); -- returns 0
    select extract(month from '1 year'::interval); -- returns 0

Try it out on Numeracy

Custom Buckets

You can create custom bucket with the Postgres width_bucket function. The function scans through the array, using the listed values as upper bounds, and returns the last index. That’s why it’s important to always have your buckets sorted by ascending values (easily done with Postgres’ intarray extension). For the array {10, 20, 30}, you effectively get the four ranges <10, 10-20, 20-30, and >30.

    select width_bucket(4, '{10, 20, 30}'); -- returns 0
    select width_bucket(24, '{10, 20, 30}'); -- returns 2
    select width_bucket(60, '{10, 20, 30}'); -- returns 4

Try it out on Numeracy

While it’s possible to manually duplicate this logic with CASE expression, it’s much more verbose. width_bucket also has the advantage of being able to use values to bucket, allowing you to generate or even store buckets as SQL values, instead of needing to hard code them in your SQL statements.

Numeracy Example

Let’s look at our example: tracking when Numeracy users first connect a database.

First, here’s how we might pull the “time to connect” from raw event tables with ‘organization.created’ and ‘database.connected’ event types:

    with conn_event as (
      select org_id, min(created) as created
      from events
      where type = 'database.connected'
      group by 1
    ),
    org_event as (
      select org_id, min(created) as created
      from events
      where type = 'organization.created'
      group by 1
    ),
    time_to_connect as (
      select extract(epoch from conn_event.created - org_event.created) as seconds
      from org_event
      left join conn_event
      on org_event.org_id = conn_event.org_id
    )

We’ll assume all of the following statements to be prefixed with these CTEs. For a first pass, let’s group by day.

    select trunc(seconds / 3600 / 24) as day, count(1) as org_count
    from time_to_connect
    group by 1
    order by 1

Try it out on Numeracy

If you run this, you can see almost everyone connects within a week. Most, in fact, connect within a day of signing up. Our intuition tells us that the users connecting on the first day actually connect in minutes. However, grouping by minute would produce hundreds of thousands of “buckets” from the long tail of users who connect months later.

    select trunc(seconds / 60) as minutes,
      count(1) as org_count
    from time_to_connect
    where trunc(seconds / 60) <= 1440
    group by 1
    order by 1

Try it out on Numeracy

Segmenting Users Into Custom Buckets

In most businesses, you want to define “personas”: types of users who find your software valuable. In Numeracy’s case, it’s easy to see three distinct buckets: users who connect in the first 15 minutes, in the first week, or later. Having defined and reached consensus on these, I might want to categorize these users in my queries and examine other attributes. For this, we’ll use width_bucket.

    select width_bucket(seconds / 60, '{15, 10080}') as bucket,
      count(1) as org_count
    from time_to_connect
    group by 1

Try it out on Numeracy

In this case, I’ve passed the array {15, 10080} to the width_bucket function. If a user connects in less than 15 minutes, width_bucket returns 0. If between 15 minutes to 10,080 minutes (or 1 week), it returns 1. If more than a week, 2.

In Numeracy’s case, if you run the queries above, you’ll see 60% of organizations that will connect a database will do so in the first 15 minutes. 30% connect in the next 7 days. The final 10% trickle in over near-infinite time afterwards.

Now that we’ve bucketed our users, we can answer questions like, “Which bucket converts at the highest rate?” In Numeracy’s case, our initial intuition told us that users who instantly connected were more likely to convert. However, our highest conversion rates were actually amongst the second group: users who returned within a week. In retrospect, this makes sense. people with database credentials on hand are more likely ops teams, whereas Numeracy is built for business or product teams.

The goal of bucketing is to group user behaviors into segments that can inform your marketing efforts and product design. By using buckets to organize data in different ways, you can detect trends that might otherwise have gone unnoticed, determine where your business currently generates the most value, and find unexploited opportunities.

Are You A Data Expert?

Get started with a free trial today.

Discover the smartest solution for data-driven results.
Offline. No network available. Check your connection and try again.