Create Retention Curves
Retention curves (opens in a new tab) allow you to see how many users are still active at a given point in time and compare different cohorts of users to see how well your product retains engagement. This article doesn't dedicate too much time to how to analyze retention curves since there is plenty of content dedicated to this out there.
The approach is very similar to the general approach to data modeling: you'll prepare a set of Measures that define the percent of users that are still active in a month and add attributes that help you understand the variation. See the article Data Modeling Best Practices
For retention curves, it's oftentimes useful to add experiments as attributes: did you try something in the product to retain users more effectively? Maybe a new onboarding flow, a new user flow?
For this walk-through, we're going to look at the user engagement on the tech news aggregation website Hacker News. The dataset is available on BigQuery's public datasets and is kept up to date.
Prepare curves in your data warehouse
As with all tools you build in Glean, retention curves will be as powerful as the data model you produce for them. For large user tables, it probably makes sense to prepare this data in your data warehouse, but it's also possible to calculate retention curves on the fly using a SQL-based data model. See Add Data Model
To prepare our curves, we are going to track whether a user is active or not in any given month. So that all of our users are being compared fairly, we'll also track the number of months since a user signed up - this will be the x-axis of our retention curve.
You should be targeting a data model that looks something like the following table:
userid | user signup month | activity month | months since signup | is_active |
---|---|---|---|---|
carlos@glean | August 1, 2020 | August 1, 2020 | 0 | 1 |
carlos@glean | August 1, 2020 | September 1, 2020 | 1 | 1 |
carlos@glean | August 1, 2020 | October 1, 2020 | 2 | 0 |
carlos@glean | August 1, 2020 | November 1, 2020 | 3 | 0 |
nathaniel@glean | July 1, 2020 | July 1, 2020 | 0 | 1 |
nathaniel@glean | July 1, 2020 | August 1, 2020 | 1 | 1 |
nathaniel@glean | July 1, 2020 | September 1, 2020 | 2 | 1 |
nathaniel@glean | July 1, 2020 | October 1, 2020 | 3 | 1 |
nathaniel@glean | July 1, 2020 | November 1, 2020 | 4 | 1 |
For the Hacker News dataset the query looks something like this (using BigQuery syntax):
with all_months as (
select
distinct timestamp_trunc(timestamp, month) as month
from `bigquery-public-data.hacker_news.full`
),
active_months as (
select
distinct `by` as userid
, timestamp_trunc(timestamp, month) as month
from`bigquery-public-data.hacker_news.full`
),
users as (
select
`by` as userid
, min(timestamp_trunc(timestamp, month)) as first_month
, max(timestamp_trunc(timestamp, month)) as last_month
from `bigquery-public-data.hacker_news.full`
group BY userid
)
select
u.userid
, m.month
, first_month as cohort_month
, cast(first_month as string) as cohort_month_str
, date_diff(DATE(m.month), DATE(u.first_month), MONTH) as month_number
-- could add cohort
-- , experiment_cohort as experiment
, if(m.month <= u.last_month, 1.0, 0.0) as unbounded_active
, if(a.month IS NOT NULL, 1.0, 0.0) as active
from users as u
cross join all_months as m
left join active_months as a
on a.userid = u.userid and a.month = m.month
where m.month >= u.first_month
;
A few notes on the resulting columns:
- userid - self-explanatory, this is string representing our user
- month - this is a date representation of the month in which an activity happens
- cohort_month - this is the month a user joined the site
- cohort_month_str - this is a bit of a workaround: you will cast your cohort month to a string so you can use it as an attribute and cut by it in Glean
- month_number - this is the number of months since a user joined the platform, this will be the x axis of our curve
- experiment - this is a randomly assigned experiment for each user. Note that in this column and for this model, each user can only be assigned to a single experimental cohort.
- active - either 1 or 0 for each user in a month. A value of 1 means the user was active in that month and zero means inactive.
- unbounded_active - either 1 or 0 for each user in a month. A value of 1 means that a user was either active in that month or in any month after. This will help us answer the question "of users in this month, what percent of users abandoned us and have not returned since"
Create the model in Glean
-
From the data exploration page, click Add Data Model
-
Either select your prepared retention table from your data warehouse or click
SQL Query
and enter a SQL query similar to the one above as the basis for your model. ClickCreate Model
-
Set up your data model in Glean:
-
Make sure that
month
is selected as your primary date, by selecting the gear next to it. -
Add
cohort_month_str
,month_number
as attributes along with any experimental cohort attributes you prepared in your data. -
Add two columns as Measures:
- Unbounded Active Retention - is defined as the
average
of columnunbounded_active
- Active Retention - is defined as the
average
of columnactive
- Active Users (Optional) - if you want to create a retention pivot, you should also add a measure for unique active users in a month:
COUNT(DISTINCT IF(active = 1, userid, null))
- Unbounded Active Retention - is defined as the
-
-
Click
Publish
Create retention curves
- Change the chart type to a
line
- Select
month_number
as the x-axis variable - Select
Unbounded Active Retention
as your measure - Select some
cohort_month_str
as breakouts to see how different cohort months retain users in your product. Alternatively, you can filter for a specific cohort month and breakout by an experimental variable
Note that not breaking out (or filtering) by cohort_month_str
has some
implications for your analysis. This is because not every cohort month has an
equal number of future months. To correct for this, it's possible to use a
Kaplan-Meier estimator which will be left as a separate exercise.
Create retention table
- Make sure you add the Active Users measure specified above.
- Select a
pivot table
as your visualization type from the chart dropdown on the top left. - Select
cohort
as your rows. - Select
month_number
as your columns. - Click into the
control
tab on the right-hand side and make sure you sort rows bycohort
descending and sort the columns bymonth_number
ascending. - As the measure, select
Active Users
as the calculation, selectPercentage of Row Max
and finally as the Color scale, selectPercentage Color Scale
.