Nowadays, easy access to data is table-stakes for high-performing companies.
Easy access doesn't come for free, though: it requires investment and a careful selection of tools. For young companies like us, the question is how much? And when do you make that investment?
Having grown to ten people, several without engineering backgrounds but with strong data needs, we decided 2022 was going to be that time.
Two weeks of experimentation and a couple of company workshops later, we're super happy with what we've built. If you're a company with similar needs and care about data access for all, follow this guide and we guarantee you'll get a great result.
With data warehouse solutions (
In general, you'll want to:
Finding complementary tools is difficult, and we spoke to data engineers from several companies to find what was working for them (and what wasn't). We arrived at the following stack:
On the left, you'll see data sources (only a sample, we have many more than this), tools where we generate and work with data. One of them is Product, the Postgres database that powers incident.io, while
Once in BigQuery, we clean and transform the data using a tool called
Finally,
That's it, the entire data stack. We think these tools are all we'll need to grow incident.io into a much larger company, and given one of our goals (always, for everything!) is simplicity, we're happy the list is so small.
Let's take a more detailed look at how this works.
Modern companies produce a huge amount of data across an increasing number of tools.
While each tool provides a custom view of its data, I'm a
It used to be that companies would write their own hacky scripts to perform this extraction - I've had terrible incidents caused by ETL database triggers in the past, and even
From experience then, believe me when I say you don't want to build these. Thankfully, ETL products like
In both Fivetran and Stitch, the process is the same:
Here's a view of the connectors we have configured in Fivetran:
It's worth noting the variety of sources. If you're building a SaaS product, one of the most crucial data sources will be whatever database powers your product.
organisations
, a row for each organization using the product
users
, each user for an organization
incidents
, every incident managed in the product
This data literally is our product, and your equivalent is likely the most high-value data asset to push into your warehouse.
Your database will limit how you can replicate that data, but Fivetran provides connectors that cover most circumstances - as an example, our
Fivetran will automatically manage the warehouse schema on your behalf, creating tables within the destination dataset for each table in the source. As an example, here's a view of the BigQuery datasets from our production database:
Other data sources like your CRM will be powered by API access to that tool. For those connectors, Fivetran will paginate through API resources using a cursor that is set each run, taking those responses and pushing them into the destination.
Expect this data to be of low quality, or awkward to use without some additional effort. Translating API resources to a data warehouse format needs some human effort, and that's where our transformation step comes in.
Until now, you were probably wondering why we have two ETL tools. Fivetran looks great, can't we just use that?
The answer is we'd like to, as Fivetran has been exceptional where we've been able to use it, and the pricing is amazing - I think our bill is about $100/month for all these connectors, which is a bargain for the value we get from it.
We're limited by connector availability, though. Stitch is a similar product to Fivetran but is built on the open-source
With Singer as the underlying technology, Stitch has a more comprehensive offering of connectors than Fivetran. Of the two tools, only Stitch has support for
Inconvenience aside, I don't think this is a bad strategy. The goal of ETL tools is to get your data into the warehouse, and provided that happens in a reliable fashion, it shouldn't matter too much how it got there.
For that reason, our policy is “prefer Fivetran, but Stitch otherwise”. It works for us!
Now we have our data in the warehouse, we need to transform it into an appropriate form for querying and analysis. That means turning raw source data into recognizable business concepts, along with tidying artifacts of the ETL process (removing deleted rows, etc).
There is an obviously correct choice for tooling at this stage, which is
While the tool choice is obvious, how to use dbt is going to be more controversial. There's a load of
We've arrived at a structure that minimises the number of dbt models (and their schema files, which is the real maintenance burden) and categorises tables by the quality and intended usage.
What follows is taken from our “Intro to data” tutorial:
Not all data is created equal, and it's useful to understand the different types of model you can query from Metabase, so you can make best use of it.
Firstly, the data we expect people to be querying is what we produce from dbt. All tables produced by dbt are located in the All (dbt)
dataset in Metabase.
This dataset contains many tables, and each table will be either:
In general, prefer to use (in this order):
Normally described as 'marts', these tables represent key business data and have been engineered for ease of use when querying.
An example of a dimension table is dim_organisations
. In general, dimension tables have a single row per entity they describe (ie, an organisation) and have a large number of columns that can be used to filter and group those entities.
Our dim_organisations
table has columns like:
id
and name
, identifying the organisationin_setup_flow
, whether they have completed setupis_paying
, if there is an active Stripe subscriptionlead_id
, the ID of the Lead in Close (CRM)Note that most of the hard work to enrich an organisation with data from other sources (such as Close) has already happened, making it easy to filter on any of the available dimensions without complex joins.
Fact tables are complementary to dimension tables, and can be seen as a list of things that happened (also referred to as an 'event stream'). Where dimension tables have one row per entity, you'll find many rows (events) for each of those entities in a fact table, and you'll be expected to join against the relevant dimension table to perform filtering.
As an example, we have an fct_organisation_statistics
table. This contains a row for each day an organisation has existed, along with a number of measurements like total Slack users, number of incidents, count of users who have created incidents, etc.
These are the highest quality data tables we offer, and you should prefer to use them when they exist.
Source data, which is what Fivetran and Stitch generate, is unstable and often awkward to use.
As an example, the source Close data has no concept of opportunities, as those exist within the lead data model and aren't modelled as first-class objects.
Instead of having people query the awkward source tables, we choose to build staging tables from the source data which:
is_
or has_
)And whatever else is needed to turn the source data into the 'ideal' data for querying.
Going back to our Close example, we have the following tables:
stg_close__leads
which cleans the source lead data and makes custom fields easy to consumestg_close__opportunities
which is built entirely from the nested opportunity data in the leads source, presented as you would want to query itFallback to staging data when there isn't a dimension or fact table that can better serve your use case.
Those last few sections are from our data tutorial, and is how we teach choosing the right data tables.
It's normal that this mirrors how we structure our dbt repository:
dbt/models
├── marts
│ └── core
│ ├── core.yml
│ ├── dim_incidents.sql
│ ├── dim_organisations.sql
│ ├── dim_users.sql
│ ├── fct_organisation_statistics.sql
│ └── fct_user_incident_graph.sql
└── staging
├── close
│ ├── README.md
│ ├── src_close.yml
│ ├── stg_close.yml
│ ├── stg_close__activities.sql
│ ├── stg_close__leads.sql
│ └── stg_close__opportunities.sql
└── product
├── README.md
├── src_product.yml
├── stg_product.yml
├── stg_product__actions.sql
├── stg_product__announcement_rules.sql
├── ...
└── stg_product__workflows.sql
While I was speed-running dbt tutorials, I would have loved to see this structure, to lend some confidence we were headed in the right direction.
With that caveat that I'm only an amateur dbt'ist, it's worth noting some decisions that went into this structure:
You'll see mentions of
We don't produce marts from staging models unless we join them across schemas (as with the dimension tables) or perform complex transformations on them (fact tables). Again, this is to reduce the number of dbt schemas.
We only have core marts right now. Until we add more marts, it makes little sense implementing a more complex structure (such as grouping marts under business units), better to adapt as we grow.
In terms of developing against dbt, each engineer has a separate BigQuery dataset (ie. dbt_lawrence
) that they target in local development for testing. Running a full build (dbt build
) is quick right now, making local development enjoyable.
While this setup makes trade-offs appropriate for incident.io's context, I think anyone setting up a data stack could adopt this with little issue, and it's easy to tweak things once you're more familiar with the tooling.
Now our data is cleaned, transformed and shaped into an ideal format for analysis, it's time we pick a visualisation tool.
This choice is crucial, as whatever we pick will be how people interact with our data. One of our key goals for this stack was for non-technically trained staff to benefit, and past experience with tools like Looker suggested that was possible only with significant help from a BI team, which we wanted to avoid.
We picked
Whether a tool is intuitive is subjective, but Metabase certainly feels that way to me. As an example, let's try and chart the number of incident timeline items created per month and see what that process looks like:
Things to notice:
All of this helps immensely with onboarding, especially for non-technical staff.
Metabase can really help lower the barrier to accessing your data, but it's limited by how well it knows the schema.
One example is visible in the previous screen recording, where joining Stg Product Timeline Items with Stg Product Organisations prefilled the join key (Organisation ID). Metabase can only default this if it knows Organisation ID on Stg Product Timeline Items is a foreign key of Stg Product Organisations, and which field it references.
As it happens, our dbt schemas already know this information, through column tests. Here's the schema for the organisation_id
field of Stg Product Timeline Items:
---
models:
- name: stg_product__timeline_items
columns:
- name: organisation_id
description: "Organisation ID"
tests:
- not_null
- relationships:
to: ref('dim_organisations')
field: organisation_id
Whenever we have a column that specifies a relationship test, we can infer a foreign-key relationship against the parent table.
So how do we get this into Metabase? There's a tool called
Inferring foreign keys is just one use of this tool, though - you can manually specific column types too.
As example of where this can be useful, annotating columns as a 'name' allows Metabase to power typeaheads for that entities primary key by the name value, such as using a customer name in a customer ID filter.
Here's an example of setting the type:
columns:
- name: name
decription: "Name of the organisation"
meta:
metabase.semantic_type: type/Name
- name: archived_at
decription: "When the resource was archived"
meta:
metabase.semantic_type: type/DeletionTimestamp
- name: opportunity_annual_value
meta:
metabase.semantic_type: type/Currency
description: "If an opportunity event, annual value in in normal currency denomination"
That covers the entire stack. As a round-up:
It took about two weeks to figure all this out and get the bulk of our data sources exposed using dbt, and exposed in Metabase. That was a single engineer who hadn't ever used dbt before, and having to make a lot of these decisions on the fly. I expect someone following this guide could do it much faster.
Once we had the stack up and running, we ran two workshops aiming at different audiences:
Intro to dbt, for engineers who will build dbt models.
We can't hire a full-time BI team just yet, and think we can get by with engineers surfacing more complex data features to Metabase by defining them in dbt. This session covered our dbt setup and we mobbed on adding a feature to a dbt model, manually running the schema sync into Metabase.
Intro to data, for the entire company.
Everyone is expected to use data in their work, so everyone needs to be familiar with Metabase. This session included a rundown of the stack, then working as a group to visualize a few key business metrics.
Finally, we took a list of questions we've been dying to have answered and divided it among us, with people trying to answer them in pairs.
Both workshops went great, but the entire company session was my favorite. Two weeks is an eternity in a company like ours, and most of my decisions had been made with clear goals but little experience.
It was a big moment, then, when someone with no SQL experience came to show me a conversion funnel for our product tutorial, and it was exactly right:
Data is a currency at most companies, and it's extremely important that everyone has equal access to it. It's easy – especially when companies begin with highly technical teams – to end up with a few people who know all the answers, undermining the ownership and autonomy of those outside that group.
I'm really happy with our current setup, primarily because it provides equality of access that I've not seen before.
Finally, it's worth reiterating a point that can easily go unnoticed: this stack makes it easy to consolidate data from various sources into a single warehouse, which means you can combine it to unlock significantly more value than the data offers when separate.
Connecting this data means your product activity can be joined with your Sales tools, and your customer success reports can take into account things like how often a customer reached out on Intercom, or if they're in arrears in Stripe.
What this post describes are foundations for capitalizing on this industry-wide change. If you're a small company looking to punch above your weight, you'd be mad not to take that opportunity.
Also published here.