I started my first SaaS product while doing consulting. The SaaS revenue was building, but it didn’t compare to my hourly rate. Stupidly, I found it hard to say no to high-paid hourly work. I didn’t understand the value of slow-and-steady subscription growth over a long period of time. I didn’t understand that a little bit of onboarding help often leads to almost no revenue effort in later periods. Such a noob! Why? I hadn’t seen a cohort analysis . In this post, I’ll show how to generate a cohort analysis of your customer base from your invoices. Don’t sweat it if this feels like a lot of code. Stripe In the end, I’ll share a Python script that generates a Stripe cohort heatmap with one line of code. Tools I’ll be using and a couple of Python packages. If you are new to Python, I suggest . This will install — Python data analysis library — as well. Jupyter Notebooks gives you an interactive way to explore your data and share your analysis. Jupyter Notebooks installing Jupyter Notebooks via Anaconda Pandas the Approach I’m going to follow Greg Reda’s and apply this to your Stripe invoices. seminal approach to cohort analysis 1. Export Stripe Invoices We’ll build cohorts from Stripe invoices. We can easily export your Stripe invoices into a Pandas dataframe with . Copy & Paste the following into your notebook: PetalData <a href="https://medium.com/media/46e8254e46b7abe039043f01fa9937df/href">https://medium.com/media/46e8254e46b7abe039043f01fa9937df/href</a> If you have thousands of invoices (I hope you do!), it takes a bit of time to download. Grab your favorite beverage while the download executes. 2. Create an `invoice_period` column based on the `created` column The Pandas dataframe created by Petaldata has a created column, which is the time the invoice was created. Since most subscription services are monthly, we’ll do monthly cohorts. The code below names your cohorts in a format like 2019-05 (that’s May 2019). df['invoice_period'] = df.created.apply(lambda x: x.strftime('%Y-%m')) 3. Create a `cohort_group` column based on the customer’s first invoice Your customers came into existence when they received their first invoice. We’ll assign them to their cohort group based on the date of that invoice. df.set_index('customer_email', inplace=True)

df['cohort_group'] = df.groupby(level=0)['created'].min().apply(lambda x: x.strftime('%Y-%m'))
df.reset_index(inplace=True) 4. Rollup data by cohort_group & invoice_period Now we aggregate by the cohort group and invoice period. Our first-level index will be the month the customer signed up (ex: 2018-01) and our 2nd-level index will be each invoice period (for example, each month from Jan 2018 onward): grouped = df.groupby(['cohort_group', 'invoice_period'])

# count the unique customers, invoices, and total revenue per group + period
cohorts = grouped.agg({'customer': pd.Series.nunique,
                       'created': "count",
                       'amount_due': np.sum})

# make the column names more meaningful
cohorts.rename(columns={'customer': 'total_customers',
                        'created': 'total_invoices'}, inplace=True)
cohorts.head() Which will generate output like: 5. Label the cohort_period for each cohort_group To compare different cohorts (ex: how do March 2018 and March 2019 signups compare in month two?), we need to label each row with the number of months from first purchase: def cohort_period(df):
    df['cohort_period'] = np.arange(len(df)) + 1
    return df

cohorts = cohorts.groupby(level=0).apply(cohort_period)
cohorts.head() You’ll see the new cohort_period column: 6. Create functions to generate a heatmap Visualizing cohorts over time can be noisy. To make it easier to understand, we’ll use a heatmap. The functions below are derived from Greg’s and allow us to easily generate a heatmap from multiple columns. Seaborn blog post Add these to your notebook: def cohort_heatmap(cohorts,col,title):
    size = group_size(cohorts,col)
    over_time = cohort_over_time(cohorts,size,col)
    _cohort_heatmap(over_time,title)
    
def group_size(cohorts,col):
    # reindex the DataFrame
    cohorts.reset_index(inplace=True)
    cohorts.set_index(['cohort_group', 'cohort_period'], inplace=True)

# create a Series holding the total size of each cohort_group
    return cohorts[col].groupby(level=0).first()

def cohort_over_time(cohorts,group_size,col):
    return cohorts[col].unstack(0).divide(group_size, axis=1)

def _cohort_heatmap(cohort_over_time,title):
    plt.figure(figsize=(24, 16))
    plt.title(title)
    sns.heatmap(cohort_over_time.T, mask=cohort_over_time.T.isnull(), annot=True, fmt='.0%'); 7. Customer Retention It’s all come to this! Just run the following to view customer retention over time as a heatmap: cohort_heatmap(cohorts,"total_customers","Customer Retention") Which generates: 8. Revenue Retention Do you have the holy grail of SaaS, ? This means your revenue per-cohort grows over time even with churn. Let’s see: negative churn cohort_heatmap(cohorts,"amount_due","Revenue Retention") Where else can you take this? Use absolute numbers vs. percentages Filter by account_country Filter by custom_fields Filter by customer sizes (ex: retention of your largest customers) Join with (ex: retention by customers who signed up via organic search) Hubspot data Try it now! I’ve created a with a Python script that does all of the steps above. To use: GitHub repo git clone git@github.com:petaldata/stripe-cohort-analysis.git
STRIPE_API_KEY=[YOUR API KEY] python stripe_cohort_analysis.py The script will generate two heatmaps (saved as PNGs in the current directory) of your customer cohorts over time.

Stripe

Stripe Cohort Analysis with Python and Pandas

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

A tour of Elixir performance & monitoring tools

What Are Convolution Neural Networks? [ELI5]

The Noonification: Have U Been Pwned? (1/12/2023)

Goldman Sachs, Data Lineage, and Harry Potter Spells

People are still crazy about Python after twenty-five years

10 Questions to Consider when Setting up a Corporate A.I project

A tour of Elixir performance & monitoring tools

What Are Convolution Neural Networks? [ELI5]

The Noonification: Have U Been Pwned? (1/12/2023)

Goldman Sachs, Data Lineage, and Harry Potter Spells

People are still crazy about Python after twenty-five years

10 Questions to Consider when Setting up a Corporate A.I project

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps