Stripe Cohort Analysis with Python and Pandas

I started my first SaaS product while doing consulting. The SaaS revenue was building, but it didn’t compare to my hourly rate. Stupidly, I found it hard to say no to high-paid hourly work.

Why? I hadn’t seen a cohort analysis. I didn’t understand the value of slow-and-steady subscription growth over a long period of time. I didn’t understand that a little bit of onboarding help often leads to almost no revenue effort in later periods. Such a noob!

In this post, I’ll show how to generate a cohort analysis of your customer base from your Stripe invoices. Don’t sweat it if this feels like a lot of code. In the end, I’ll share a Python script that generates a Stripe cohort heatmap with one line of code.

Tools

I’ll be using Jupyter Notebooks and a couple of Python packages. If you are new to Python, I suggest installing Jupyter Notebooks via Anaconda. This will install Pandas — the Python data analysis library — as well. Jupyter Notebooks gives you an interactive way to explore your data and share your analysis.

Approach

I’m going to follow Greg Reda’s seminal approach to cohort analysis and apply this to your Stripe invoices.

1. Export Stripe Invoices

We’ll build cohorts from Stripe invoices. We can easily export your Stripe invoices into a Pandas dataframe with PetalData. Copy & Paste the following into your notebook:

<a href="https://medium.com/media/46e8254e46b7abe039043f01fa9937df/href">https://medium.com/media/46e8254e46b7abe039043f01fa9937df/href</a>

If you have thousands of invoices (I hope you do!), it takes a bit of time to download. Grab your favorite beverage while the download executes.

2. Create an `invoice_period` column based on the `created` column

The Pandas dataframe created by Petaldata has a created column, which is the time the invoice was created. Since most subscription services are monthly, we’ll do monthly cohorts. The code below names your cohorts in a format like 2019-05 (that’s May 2019).

df['invoice_period'] = df.created.apply(lambda x: x.strftime('%Y-%m'))

3. Create a `cohort_group` column based on the customer’s first invoice

Your customers came into existence when they received their first invoice. We’ll assign them to their cohort group based on the date of that invoice.

df.set_index('customer_email', inplace=True)

df['cohort_group'] = df.groupby(level=0)['created'].min().apply(lambda x: x.strftime('%Y-%m'))
df.reset_index(inplace=True)

4. Rollup data by cohort_group & invoice_period

Now we aggregate by the cohort group and invoice period. Our first-level index will be the month the customer signed up (ex: 2018-01) and our 2nd-level index will be each invoice period (for example, each month from Jan 2018 onward):

grouped = df.groupby(['cohort_group', 'invoice_period'])

# count the unique customers, invoices, and total revenue per group + period
cohorts = grouped.agg({'customer': pd.Series.nunique,
                       'created': "count",
                       'amount_due': np.sum})

# make the column names more meaningful
cohorts.rename(columns={'customer': 'total_customers',
                        'created': 'total_invoices'}, inplace=True)
cohorts.head()

Which will generate output like:

5. Label the cohort_period for each cohort_group

To compare different cohorts (ex: how do March 2018 and March 2019 signups compare in month two?), we need to label each row with the number of months from first purchase:

def cohort_period(df):
    df['cohort_period'] = np.arange(len(df)) + 1
    return df

cohorts = cohorts.groupby(level=0).apply(cohort_period)
cohorts.head()

You’ll see the new cohort_period column:

6. Create functions to generate a heatmap

Visualizing cohorts over time can be noisy. To make it easier to understand, we’ll use a Seaborn heatmap. The functions below are derived from Greg’s blog post and allow us to easily generate a heatmap from multiple columns.

Add these to your notebook:

def cohort_heatmap(cohorts,col,title):
    size = group_size(cohorts,col)
    over_time = cohort_over_time(cohorts,size,col)
    _cohort_heatmap(over_time,title)
    
def group_size(cohorts,col):
    # reindex the DataFrame
    cohorts.reset_index(inplace=True)
    cohorts.set_index(['cohort_group', 'cohort_period'], inplace=True)

# create a Series holding the total size of each cohort_group
    return cohorts[col].groupby(level=0).first()

def cohort_over_time(cohorts,group_size,col):
    return cohorts[col].unstack(0).divide(group_size, axis=1)

def _cohort_heatmap(cohort_over_time,title):
    plt.figure(figsize=(24, 16))
    plt.title(title)
    sns.heatmap(cohort_over_time.T, mask=cohort_over_time.T.isnull(), annot=True, fmt='.0%');

7. Customer Retention

It’s all come to this! Just run the following to view customer retention over time as a heatmap:

cohort_heatmap(cohorts,"total_customers","Customer Retention")

Which generates:

8. Revenue Retention

Do you have the holy grail of SaaS, negative churn? This means your revenue per-cohort grows over time even with churn. Let’s see:

cohort_heatmap(cohorts,"amount_due","Revenue Retention")

Where else can you take this?

Use absolute numbers vs. percentages
Filter by account_country
Filter by custom_fields
Filter by customer sizes (ex: retention of your largest customers)
Join with Hubspot data (ex: retention by customers who signed up via organic search)

Try it now!

I’ve created a GitHub repo with a Python script that does all of the steps above. To use:

git clone [email protected]:petaldata/stripe-cohort-analysis.git
STRIPE_API_KEY=[YOUR API KEY] python stripe_cohort_analysis.py

The script will generate two heatmaps (saved as PNGs in the current directory) of your customer cohorts over time.