I started my first SaaS product while doing consulting. The SaaS revenue was building, but it didn’t compare to my hourly rate. Stupidly, I found it hard to say no to high-paid hourly work.
Why? I hadn’t seen a cohort analysis. I didn’t understand the value of slow-and-steady subscription growth over a long period of time. I didn’t understand that a little bit of onboarding help often leads to almost no revenue effort in later periods. Such a noob!
In this post, I’ll show how to generate a cohort analysis of your customer base from your Stripe invoices. Don’t sweat it if this feels like a lot of code. In the end, I’ll share a Python script that generates a Stripe cohort heatmap with one line of code.
I’ll be using Jupyter Notebooks and a couple of Python packages. If you are new to Python, I suggest installing Jupyter Notebooks via Anaconda. This will install Pandas — the Python data analysis library — as well. Jupyter Notebooks gives you an interactive way to explore your data and share your analysis.
I’m going to follow Greg Reda’s seminal approach to cohort analysis and apply this to your Stripe invoices.
We’ll build cohorts from Stripe invoices. We can easily export your Stripe invoices into a Pandas dataframe with PetalData. Copy & Paste the following into your notebook:
<a href="https://medium.com/media/46e8254e46b7abe039043f01fa9937df/href">https://medium.com/media/46e8254e46b7abe039043f01fa9937df/href</a>
If you have thousands of invoices (I hope you do!), it takes a bit of time to download. Grab your favorite beverage while the download executes.
The Pandas dataframe created by Petaldata has a created column, which is the time the invoice was created. Since most subscription services are monthly, we’ll do monthly cohorts. The code below names your cohorts in a format like 2019-05 (that’s May 2019).
df['invoice_period'] = df.created.apply(lambda x: x.strftime('%Y-%m'))
Your customers came into existence when they received their first invoice. We’ll assign them to their cohort group based on the date of that invoice.
df.set_index('customer_email', inplace=True)
df['cohort_group'] = df.groupby(level=0)['created'].min().apply(lambda x: x.strftime('%Y-%m'))
df.reset_index(inplace=True)
Now we aggregate by the cohort group and invoice period. Our first-level index will be the month the customer signed up (ex: 2018-01) and our 2nd-level index will be each invoice period (for example, each month from Jan 2018 onward):
grouped = df.groupby(['cohort_group', 'invoice_period'])
# count the unique customers, invoices, and total revenue per group + period
cohorts = grouped.agg({'customer': pd.Series.nunique,
'created': "count",
'amount_due': np.sum})
# make the column names more meaningful
cohorts.rename(columns={'customer': 'total_customers',
'created': 'total_invoices'}, inplace=True)
cohorts.head()
Which will generate output like:
To compare different cohorts (ex: how do March 2018 and March 2019 signups compare in month two?), we need to label each row with the number of months from first purchase:
def cohort_period(df):
df['cohort_period'] = np.arange(len(df)) + 1
return df
cohorts = cohorts.groupby(level=0).apply(cohort_period)
cohorts.head()
You’ll see the new cohort_period column:
Visualizing cohorts over time can be noisy. To make it easier to understand, we’ll use a Seaborn heatmap. The functions below are derived from Greg’s blog post and allow us to easily generate a heatmap from multiple columns.
Add these to your notebook:
def cohort_heatmap(cohorts,col,title):
size = group_size(cohorts,col)
over_time = cohort_over_time(cohorts,size,col)
_cohort_heatmap(over_time,title)
def group_size(cohorts,col):
# reindex the DataFrame
cohorts.reset_index(inplace=True)
cohorts.set_index(['cohort_group', 'cohort_period'], inplace=True)
# create a Series holding the total size of each cohort_group
return cohorts[col].groupby(level=0).first()
def cohort_over_time(cohorts,group_size,col):
return cohorts[col].unstack(0).divide(group_size, axis=1)
def _cohort_heatmap(cohort_over_time,title):
plt.figure(figsize=(24, 16))
plt.title(title)
sns.heatmap(cohort_over_time.T, mask=cohort_over_time.T.isnull(), annot=True, fmt='.0%');
It’s all come to this! Just run the following to view customer retention over time as a heatmap:
cohort_heatmap(cohorts,"total_customers","Customer Retention")
Which generates:
Do you have the holy grail of SaaS, negative churn? This means your revenue per-cohort grows over time even with churn. Let’s see:
cohort_heatmap(cohorts,"amount_due","Revenue Retention")
I’ve created a GitHub repo with a Python script that does all of the steps above. To use:
git clone [email protected]:petaldata/stripe-cohort-analysis.git
STRIPE_API_KEY=[YOUR API KEY] python stripe_cohort_analysis.py
The script will generate two heatmaps (saved as PNGs in the current directory) of your customer cohorts over time.