paint-brush
How We Use dbt (Client) In Our Data Teamby@4sushi
707 reads
707 reads

How We Use dbt (Client) In Our Data Team

by MartinAugust 1st, 2022
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

Data analysts mainly use SQL scripts to build their analysis. Each script will create a table into BigQuery (with drop/create or incremental strategy) Analysis is split into 3 parts: cleaning raw data cross data and building some common metrics build visualisation tables that we can plug to the dashboards (here we use Data Studio) Data analysts are not able to run directly dbt commands, but only makefile commands. We implemented a “magic method” that run all models that wasn’t tested since the last test execution. By doing this, we are sure that data analysts have tested the code before creating a PR.

Companies Mentioned

Mention Thumbnail
Mention Thumbnail
featured image - How We Use dbt (Client) In Our Data Team
Martin HackerNoon profile picture


Note: Here is not really an article, but more some notes about how we use dbt in our team. We only focus on the data analysis part. Some tools/services are not mentioned here.


Our team


Our technical stack

  • BigQuery: data storage + development platform (on SQL workspace)
  • Airflow on Google Cloud Composer: code execution platform, job scheduling, monitoring
  • dbt client: SQL scripts execution dependencies, documentation, and testing
  • Github: code versioning, CICD (deployment, code quality, tests)
  • Google Data Studio: dashboards platform
  • Visual Studio Code (or similar): develop code in local, execute tests


Google Cloud Composer and Google BigQuery are part of Google Cloud Platform (GCP) services. We handle permissions/access by using Identity and Access Management (IAM). We have 2 projects, one for staging and one for production.


Our architecture for analysis

Data analysts mainly use SQL scripts to build their analysis. We run every day several hundred SQL scripts. Each script will create a table into BigQuery (with drop/create or incremental strategy).


The analysis is split into 3 parts:

  • cleaning raw data

  • cross data and building some common metrics

  • build visualization tables that we can plug into the dashboards (here we use Data Studio)



The development process


What is compile step?

In dbt, you will use the Jinja template inside your SQL code. To evaluate that Jinja template and have only SQL code that you can run on the BigQuery SQL workspace, you need to compile your code


Run and test models

Inside dbt project, each SQL script represents a model. When you start to have a lot of models, you need to be careful when you run the whole workflow (dbt run without filters). You can quickly consume a lot of resources.


To avoid this, we create custom commands in a makefile. Data analysts are not able to run directly dbt commands, but only makefile commands. It's also easier and less technical for them, they don't have to know dbt commands with all possible parameters.


We implemented a “magic method” that runs all models that weren’t tested since the last test execution. That also formats the code and some other stuff. To do it, we store the md5 checksum of each models.


By doing this, we are sure that data analysts have tested the code before creating a PR. We also add control on Github CICD (Github actions), to be sure all models are well tested, otherwise the PR is rejected.


Documentation format

For each model, we create a .yml file that contains the documentation of the model (plus in some cases, tests).


This is an example of the documentation file:


models:
  - name: user
    description: User of the database (client)
    columns:
      - name: name
        description: First name of the user
        meta:
          sensitive: True
      - name: age
        description: Age of the user
        meta:
          sensitive: True


The problem here is that it’s quite hard to be sure that the documentation is up to date with the current model. We develop some utils functions in python that will check the model and compare it with the documentation, it will print differences.


It will also automatically generate the documentation file with pre-filled columns. We integrate a check in our Github CICD (Github actions) to be sure that the documentation is up to date, otherwise the PR can’t be merged.


Our current feeling about dbt

We start to use dbt client for 2 reasons:

  • Handle SQL script dependencies. Before that, we were executing scripts one by one directly on Airflow
  • Have documentation inside the code, and have a nice web UI to explore this documentation. Before that, we were using google spreadsheet (similar to Excel files) to document the code


Now we are very happy because we solve those 2 problems without implementing a “homemade solution”. dbt starts to be a standard for this kind of data analysis workflow, and it is always better for the lifecycle of the project to use a standard solution.


However, it is important to talk about the less positive points. dbt is a young product, and we saw it when we try to install it on the Google Composer platform. It was very problematic, in fact, the python dbt library has a lot of dependencies, so we had some conflicts with other default Composer libraries. This is not a positive point for the compatibility of the library, and it can very quickly become a constraint. We fix it by creating a virtual env inside the Airflow task, before running dbt.


We also realized that most of the development is focused on dbt Cloud, this is normal because it’s a paid solution. But when you use dbt client, it can be frustrating. As an example, we had some difficulties deploying the documentation. Normally, it was supposed to be a static website. But the way how they implement it, you need a web server.


To conclude, migration to dbt is not so easy if you don’t start a project from scratch. There are many things to consider. Feel free if you have any technical questions to contact me.