How We Use dbt (Client) In Our Data Team

by Martin (@4sushi), August 1st, 2022

Too Long; Didn't Read

Data analysts mainly use SQL scripts to build their analyses. Each script creates a table in BigQuery (with a drop/create or incremental strategy). The analysis is split into three parts: cleaning raw data, crossing data to build common metrics, and building visualization tables that plug into the dashboards (here we use Data Studio). Data analysts cannot run dbt commands directly, only makefile commands. We implemented a “magic method” that runs all models that haven't been tested since the last test execution, so we are sure that data analysts have tested their code before creating a PR.



Note: this is not really an article, but rather some notes about how we use dbt in our team. We focus only on the data analysis part; some tools/services are not mentioned here.


Our team


Our technical stack

  • BigQuery: data storage + development platform (SQL workspace)
  • Airflow on Google Cloud Composer: code execution platform, job scheduling, monitoring
  • dbt (client): SQL script execution, dependencies, documentation, and testing
  • GitHub: code versioning, CI/CD (deployment, code quality, tests)
  • Google Data Studio: dashboards platform
  • Visual Studio Code (or similar): local development and test execution


Google Cloud Composer and Google BigQuery are part of Google Cloud Platform (GCP). We handle permissions and access with Identity and Access Management (IAM), and we have two projects: one for staging and one for production.


Our architecture for analysis

Data analysts mainly use SQL scripts to build their analyses. We run several hundred SQL scripts every day. Each script creates a table in BigQuery (with either a drop/create or an incremental strategy).


The analysis is split into three parts:

  • cleaning raw data

  • crossing data and building some common metrics

  • building visualization tables that we can plug into the dashboards (here we use Data Studio)



The development process


What is the compile step?

In dbt, you use Jinja templating inside your SQL code. To evaluate the Jinja and get plain SQL that you can run in the BigQuery SQL workspace, you need to compile your code (with the dbt compile command); for example, a ref() to another model is replaced by the fully qualified name of the corresponding BigQuery table.


Run and test models

Inside a dbt project, each SQL script represents a model. When you start to have a lot of models, you need to be careful when you run the whole workflow (dbt run without filters): you can quickly consume a lot of resources.


To avoid this, we created custom commands in a makefile. Data analysts cannot run dbt commands directly, only makefile commands. It is also easier and less technical for them: they don't have to know the dbt commands and all their possible parameters.


We implemented a “magic method” that runs all models that haven't been tested since the last test execution. It also formats the code, among other things. To do this, we store the md5 checksum of each model.
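
For illustration, here is a minimal Python sketch of this kind of checksum check, which a makefile target could call; the models directory layout and the .tested_models.json state file are assumptions, not our exact implementation:

# Minimal sketch: detect which dbt models changed since the last recorded
# test run, by comparing md5 checksums of the model SQL files against a
# stored state file. Paths and file names are assumptions.
import hashlib
import json
from pathlib import Path

MODELS_DIR = Path("models")               # dbt models directory (assumed default layout)
STATE_FILE = Path(".tested_models.json")  # checksums of already-tested models (hypothetical)


def current_checksums() -> dict:
    return {
        str(path): hashlib.md5(path.read_bytes()).hexdigest()
        for path in MODELS_DIR.rglob("*.sql")
    }


def models_to_test() -> list:
    tested = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    now = current_checksums()
    # A model must be (re)tested if it is new or its checksum changed since the last test run.
    return [path for path, checksum in now.items() if tested.get(path) != checksum]


def mark_as_tested() -> None:
    # Called once `dbt test` has succeeded on the selected models.
    STATE_FILE.write_text(json.dumps(current_checksums(), indent=2))


if __name__ == "__main__":
    print("Models to run and test:", models_to_test())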


By doing this, we are sure that data analysts have tested their code before creating a PR. We also added a check in our GitHub CI/CD (GitHub Actions) to make sure all models are properly tested; otherwise the PR is rejected.


Documentation format

For each model, we create a .yml file that contains the documentation of the model (plus, in some cases, tests).


This is an example of the documentation file:


version: 2

models:
  - name: user
    description: User of the database (client)
    columns:
      - name: name
        description: First name of the user
        meta:
          sensitive: True
      - name: age
        description: Age of the user
        meta:
          sensitive: True

The problem here is that it's quite hard to be sure that the documentation is up to date with the current model. We developed some utility functions in Python that check the model, compare it with the documentation, and print the differences.


They can also automatically generate the documentation file with pre-filled columns. We integrated a check in our GitHub CI/CD (GitHub Actions) to make sure that the documentation is up to date; otherwise the PR can't be merged.
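
As an illustration, here is a simplified Python sketch of such a check, assuming the google-cloud-bigquery and PyYAML libraries; the project, dataset, and file names are hypothetical:

# Simplified sketch: compare the columns of a BigQuery table with the columns
# documented in the model's .yml file, and print the differences.
# Project, dataset, and file names below are assumptions for the example.
import yaml
from google.cloud import bigquery


def documented_columns(doc_path: str, model_name: str) -> set:
    with open(doc_path) as f:
        doc = yaml.safe_load(f)
    model = next(m for m in doc["models"] if m["name"] == model_name)
    return {c["name"] for c in model.get("columns", [])}


def table_columns(project: str, dataset: str, table: str) -> set:
    client = bigquery.Client(project=project)
    schema = client.get_table(f"{project}.{dataset}.{table}").schema
    return {field.name for field in schema}


def check_documentation(project: str, dataset: str, model_name: str, doc_path: str) -> bool:
    in_doc = documented_columns(doc_path, model_name)
    in_table = table_columns(project, dataset, model_name)
    missing_in_doc = in_table - in_doc
    missing_in_table = in_doc - in_table
    if missing_in_doc:
        print(f"Columns missing in the documentation: {sorted(missing_in_doc)}")
    if missing_in_table:
        print(f"Documented columns missing in the table: {sorted(missing_in_table)}")
    return not missing_in_doc and not missing_in_table


if __name__ == "__main__":
    check_documentation("my-staging-project", "analytics", "user", "models/user.yml")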


Our current feeling about dbt

We started to use the dbt client for two reasons:

  • To handle SQL script dependencies. Before that, we were executing scripts one by one directly in Airflow
  • To have the documentation inside the code, with a nice web UI to explore it. Before that, we were using Google Spreadsheets (similar to Excel files) to document the code


Now we are very happy because we solved those two problems without implementing a homemade solution. dbt is becoming a standard for this kind of data analysis workflow, and it is always better for the lifecycle of a project to use a standard solution.


However, it is important to talk about the less positive points. dbt is a young product, and we saw it when we tried to install it on Google Cloud Composer. It was very problematic: the dbt Python library has a lot of dependencies, so we had conflicts with other default Composer libraries. This is not great for the compatibility of the library, and it can quickly become a constraint. We fixed it by creating a virtual environment inside the Airflow task before running dbt, as sketched below.
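
To give an idea, here is a simplified sketch of that workaround as an Airflow BashOperator; the dbt version, project path, and DAG/task names are assumptions, not our exact setup:

# Simplified sketch: build an isolated virtualenv inside the Airflow task,
# install dbt there, and run it, so dbt's dependencies do not clash with the
# libraries pre-installed on Cloud Composer. Versions and paths are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_daily_run",  # hypothetical DAG name
    start_date=datetime(2022, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_dbt = BashOperator(
        task_id="run_dbt_models",
        bash_command=(
            "python3 -m venv /tmp/dbt_venv && "
            "/tmp/dbt_venv/bin/pip install --quiet dbt-bigquery==1.2.0 && "
            "cd /home/airflow/gcs/data/dbt_project && "
            "/tmp/dbt_venv/bin/dbt run --profiles-dir ."
        ),
    )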


We also realized that most of the development effort is focused on dbt Cloud, which is understandable because it's a paid solution. But when you use the dbt client, it can be frustrating. As an example, we had some difficulties deploying the documentation. In theory, it is a static website, but the way it is implemented, you need a web server.
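
For example, a minimal way to expose the generated documentation, assuming dbt docs generate has produced the default target/ directory, is to serve it with a small Python HTTP server (directory and port are assumptions):

# Minimal sketch: serve the documentation generated by `dbt docs generate`.
# Opening target/index.html directly from the filesystem usually fails because
# the page needs to load manifest.json and catalog.json over HTTP.
import functools
import http.server

DOCS_DIR = "target"  # default output directory of `dbt docs generate`
PORT = 8080

handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=DOCS_DIR)
with http.server.ThreadingHTTPServer(("0.0.0.0", PORT), handler) as server:
    print(f"Serving dbt docs on http://localhost:{PORT}")
    server.serve_forever()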


To conclude, migrating to dbt is not so easy if you are not starting a project from scratch; there are many things to consider. Feel free to contact me if you have any technical questions.