Lessons From The Night I Met dbt on Databricks

Written by patrickokare | Published 2025/11/28
Tech Story Tags: etl | medallion-architecture | data-engineering | data-build-tool | databricks | dbt | sales-model | measuring-customer-sentiment

TL;DR: The Medallion Architecture turns messy e-commerce data into business-ready insights, layer by layer. This walkthrough implements it with dbt (Data Build Tool) and Databricks.

It was late, the kind of quiet evening when every human questions the meaning of life, and their pipelines, of course :). I’d been debugging a transformation script that broke for the fifth time that week. The problem wasn’t the data; it was the process.

That’s when I stumbled upon dbt (Data Build Tool): not a new ETL tool, but a framework that thinks like a software engineer. SQL models, modularity, testing, and lineage. Suddenly, transformation felt… elegant.

This article tells the story of how I implemented the Medallion Architecture using dbt + Databricks, transforming messy e-commerce data into business-ready insights.

Introduction — What is dbt, Really?

If you’ve ever written SQL transformations in a warehouse and wondered why everything feels disconnected — dbt is the missing link.

dbt is a data transformation framework that turns your warehouse into a software development environment. You define models as .sql files, manage dependencies with ref(), run automated tests, and version your transformations — all in Git.

Think of dbt as:

SQL + Jinja + CI/CD = Reproducible Data Models

In essence, Bronze, Silver, and Gold aren’t just a naming convention; they’re a design philosophy. dbt lets you build, test, and document each layer with confidence.

Where Everything Begins

Like every good story, this one starts with chaos.

The Bronze layer is where raw data lands — unfiltered, imperfect, and full of potential.

In my project, the Bronze models ingested four datasets: orders, products, users, and reviews.

-- models/bronze/bronze_orders.sql
{{ config(materialized='table') }}
SELECT 
    order_id,
    user_id,
    product_id,
    quantity,
    unit_price,
    order_date,
    current_timestamp() AS ingestion_timestamp
FROM {{ source('raw', 'orders') }}

This stage is all about fidelity over perfection — preserving every record from the source system. If data were gold ore, the Bronze layer is the mining site.
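One detail worth calling out: for the source('raw', 'orders') reference above to compile, the raw tables have to be declared as dbt sources. Here is a minimal sketch of that declaration (the file name and schema are assumptions for illustration, not lifted from the project):

# models/bronze/_sources.yml (illustrative file name)
version: 2

sources:
  - name: raw
    schema: raw          # assumed landing schema for the raw tables
    tables:
      - name: orders
      - name: products
      - name: users
      - name: reviews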

Refinery

The Silver layer is where the alchemy begins. Here, the goal is to clean, standardize, and enrich — turning raw ore into polished data assets.

-- models/silver/silver_orders.sql
{{ config(materialized='table') }}
SELECT
    o.order_id,
    o.user_id,
    o.product_id,
    o.quantity,
    o.unit_price,
    o.order_date,  -- carried through so the Gold layer can aggregate by day
    (o.quantity * o.unit_price) AS order_amount,
    u.country,
    p.category,
    p.brand
FROM {{ ref('bronze_orders') }} o
JOIN {{ ref('bronze_users') }} u ON o.user_id = u.user_id
JOIN {{ ref('bronze_products') }} p ON o.product_id = p.product_id

This layer introduces business logic — standardizing currency, validating foreign keys, and computing new metrics.
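Foreign-key validation, for example, doesn’t need hand-written SQL; dbt’s built-in relationships test can assert that every order points to a real user and product. A sketch of how that could look (the column names come from the models above; the file layout is an assumption):

# models/silver/_silver.yml (excerpt, assumed layout)
models:
  - name: silver_orders
    columns:
      - name: user_id
        tests:
          - relationships:
              to: ref('bronze_users')
              field: user_id
      - name: product_id
        tests:
          - relationships:
              to: ref('bronze_products')
              field: product_id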

I even used a simple unit test in dbt to validate calculations:

# models/silver/_silver.yml
unit_tests:
  - name: test_order_amount_calc
    model: silver_orders
    given:
      - input: ref('bronze_orders')
        format: csv
        rows: |
          order_id,user_id,product_id,quantity,unit_price
          1,10,100,10,100.00
          2,20,200,2,5.00
      - input: ref('bronze_users')
        format: csv
        rows: |
          user_id,country
          10,US
          20,GB
      - input: ref('bronze_products')
        format: csv
        rows: |
          product_id,category,brand
          100,Electronics,Acme
          200,Books,Acme
    expect:
      format: csv
      rows: |
        order_id,order_amount
        1,1000.00
        2,10.00

Testing is where dbt shines — imagine asserting your data logic before it breaks production.
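Assuming a dbt version with unit test support (1.8 or later), the check above runs alongside any data tests attached to the model:

dbt test --select silver_orders

Or, to build the model and run its tests in one dependency-aware pass, dbt build does both.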

Decisions, Decisions, and More Decision Making

By the time data reaches the Gold layer, it has survived the turbulence of the Bronze lake and the refinements of the Silver refinery.

Now, it’s ready to speak business.

In the Medallion Architecture, the Gold layer is the semantic and analytics layer where data is no longer raw or transactional but aggregated, curated, and meaningful. It’s here that analysts, decision-makers, and BI tools finally meet the story your data has been trying to tell.

The Purpose of the Gold Layer

While the Silver layer focuses on conformance and cleansing, the Gold layer is where business value is defined.

This layer answers questions like:

  • “How much revenue did we generate yesterday?”
  • “Which products or categories are trending?”
  • “What’s the overall customer satisfaction score this week?”

These are not raw metrics — they are insights, built from Silver tables that have already handled all the data quality, relationships, and enrichments.

Building the Gold Sales Model

My first Gold model, gold_sales__daily.sql, takes transactional-level data from silver_orders and aggregates it into daily revenue facts.

-- models/gold/gold_sales__daily.sql
{{ config(materialized='view') }}
SELECT
    order_date,
    SUM(order_amount) AS total_sales,
    COUNT(DISTINCT order_id) AS total_orders
FROM {{ ref('silver_orders') }}
GROUP BY order_date

Each record in this model represents the business pulse for a single day: total revenue and order count. It’s perfect for trend analysis, daily dashboards, and KPI tracking in tools like Power BI, Preset, or Tableau.

Under the hood, dbt automatically understands dependencies via the ref('silver_orders') function — ensuring this model only runs after silver_orders has successfully built.

Measuring Customer Sentiment

The second Gold model, gold_avg_rating__daily.sql, captures how customers feel about their purchases — a perfect complement to sales metrics.

-- models/gold/gold_avg_rating__daily.sql
{{ config(materialized='view') }}
SELECT
    review_date,
    AVG(rating) AS avg_rating,
    COUNT(review_id) AS total_reviews
FROM {{ ref('silver_reviews') }}
GROUP BY review_date

Here, I’m transforming individual review events into a daily sentiment summary, providing an at-a-glance view of customer satisfaction trends.

Testing the Gold Layer

Like the Silver layer, the Gold layer includes unit tests to verify that business logic holds even as upstream data changes.

# models/gold/_gold.yml
version: 2
unit_tests:
  - name: test_gold_avg_rating
    model: gold_avg_rating__daily
    given:
      - input: ref('silver_reviews')
        format: csv
        rows: |
          review_id,review_date,rating
          1,2018-01-04,5
          2,2018-01-04,4
          3,2020-01-04,3
    expect:
      format: csv
      rows: |
        review_date,avg_rating,total_reviews
        2018-01-04,4.5,2
        2020-01-04,3,1
This test asserts that when reviews are averaged by date, the resulting daily ratings are correct — proof that the business logic in the model works as intended.

From Pipelines to Insights

At this point, the Gold layer feeds directly into dashboards and data applications. The same datasets that began as chaotic CSVs in Bronze are now refined insights driving executive reports, automated KPIs, and machine learning pipelines.

For example:

  • The Sales model powers revenue and growth dashboards.
  • The Average Rating model informs product feedback analysis and NPS tracking.

With dbt, the entire journey from ingestion to insight is transparent, testable, and reproducible — the holy trinity of modern data engineering. The Gold layer is where data storytelling truly happens.

Your transformations are no longer just technical; they are narrative. Each SQL statement contributes to a chapter in the business story, one that explains not just what happened, but why it matters.

How dbt Keeps It All Together

By the time I reached this stage, I realized that dbt wasn’t just transforming data; it was orchestrating the whole project. Under the hood, dbt runs on an elegant philosophy: if you can describe your logic clearly, dbt can orchestrate it reliably. This chapter is about the machinery — the quiet genius that holds the Medallion Architecture together.

The Blueprint: dbt_project.yml

Every dbt project starts with a heart — a file called dbt_project.yml.

It’s the project’s DNA, describing where models live, how they’re built, and which schema or materialization rules to apply.

Here’s what mine looks like:

# dbt_project.yml
name: 'dbt_databricks_project'
version: '1.0.0'
profile: 'default'
model-paths: ["models"]
macro-paths: ["macros"]
test-paths: ["tests"]
snapshot-paths: ["snapshots"]

models:
  dbt_databricks_project:
    bronze:
      +schema: bronze
      +materialized: table
    silver:
      +schema: silver
      +materialized: table
    gold:
      +schema: gold
      +materialized: view

This single file tells dbt:

“All my Bronze models should be built as tables,

all Silver models as tables,

and all Gold models as views (or tables, for performance) in their respective schemas.”

Simple, declarative, and consistent. That’s one of dbt’s core superpowers — configuration over chaos.

Dependencies That Build Themselves

In a traditional pipeline, you’d write scripts to manually define job dependencies. In dbt, this logic is handled elegantly with one word: ref().

SELECT * FROM {{ ref('silver_orders') }}

That single Jinja function call tells dbt:

“Hey, this model depends on silver_orders. Build that first.”

From there, dbt automatically constructs a directed acyclic graph (DAG) — the same kind of dependency tree you’d see in tools like Airflow or Dagster.
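You can inspect that graph without running anything; dbt ls with the + operator lists a model together with every ancestor it will pull in (using the Gold model from earlier as an example):

dbt ls --select +gold_sales__daily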

When you run:

dbt run --select +gold

dbt doesn’t just execute a SQL script; the + graph operator tells it to resolve the full upstream dependency chain:

  1. First, it builds Bronze (raw tables)
  2. Then Silver (cleansed tables)
  3. Finally, Gold (aggregated insights)

All in the correct order, automatically.

Packages and Modularity

dbt encourages modularity through packages, just like Python or Node.

# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.3.1

The dbt_utils package brings in a toolbox of reusable macros — functions that simplify things like testing, joins, and common data patterns. This modular design means you never reinvent the wheel. Need to check if a column is unique or not null? One line of YAML does it. dbt treats reusability as a design principle, not an afterthought.
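Installing them is a single command; dbt deps reads packages.yml and pulls down everything it lists:

dbt deps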

Testing Assumptions, Not Just Data

Data tests in dbt aren’t unit tests on code — they’re trust checks for your data.

They assert truths like:

  1. Every order_id should be unique.
  2. No user_id should be null.
  3. Ratings should only fall between 1 and 5.

These can be written inline in your YAML schema files or as custom tests using macros.
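Here is roughly how those three assertions could be declared (column names are taken from the earlier models; the range check uses dbt_utils.accepted_range from the package installed above, and the file layout is an assumption):

# models/_schema.yml (illustrative excerpt)
models:
  - name: silver_orders
    columns:
      - name: order_id
        tests:
          - unique
      - name: user_id
        tests:
          - not_null
  - name: silver_reviews
    columns:
      - name: rating
        tests:
          - dbt_utils.accepted_range:
              min_value: 1
              max_value: 5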

When you run:

dbt test

dbt runs every test in parallel, ensuring your pipelines are safe, validated, and production-ready.

Documentation and Lineage

Then comes my favourite feature — dbt Docs.

With a single command:

dbt docs generate
dbt docs serve

dbt builds a self-documenting data catalogue with column descriptions, data types, and source links, along with a beautiful lineage graph showing how Bronze flows into Silver, and Silver into Gold.

It’s not just documentation; it’s transparency — the kind that builds confidence between engineers, analysts, and executives.

When all these pieces come together — configurations, references, packages, tests, and docs — dbt orchestrates a seamless symphony. You’re not writing isolated SQL scripts anymore; you’re building a data ecosystem where every model knows its place and purpose.

So when you run:

dbt run --select +silver

dbt knows exactly what to do.

It compiles, orders, and executes the selected models and every upstream model they depend on, in the right sequence; no Airflow DAG required.

And the best part?

You can version it, test it, and redeploy it — just like code.

In Essence

dbt keeps the Medallion Architecture alive by automating discipline:

Every model has a schema, purpose, and test.

Every dependency is explicit.

Every layer — Bronze, Silver, and Gold — is reproducible from scratch.

It’s the invisible scaffolding that ensures your Lakehouse doesn’t collapse under complexity. In a world where pipelines often break in silence, dbt whispers clarity.

Core Lessons from the Lakehouse

Building this project offered several practical insights into implementing dbt and the Medallion Architecture effectively within a modern data platform.

1. Start with Clarity, Not Code

Successful data projects begin with a clear architectural vision, not SQL syntax.

Defining data flow from raw ingestion to curated output ensures every model has a specific purpose within the pipeline. Planning dependencies, naming conventions, and transformation logic upfront reduces complexity and rework later.

2. Data Quality Is Non-Negotiable

Reliable analytics depend on trusted data.

Integrating dbt tests into each layer (Bronze, Silver, Gold) enforces data validation and prevents downstream issues. Tests for nulls, duplicates, and calculation accuracy (e.g., order amount and average rating) help maintain integrity as data scales. In dbt, testing becomes part of the engineering process rather than an afterthought.

3. Documentation Drives Alignment

Documentation is as critical as modelling.

dbt automatically generates lineage, descriptions, and schema visibility through dbt docs generate, enabling transparency across teams. Clear documentation accelerates onboarding, simplifies debugging, and improves collaboration between engineering and analytics functions.

4. Apply Software Engineering Principles

dbt encourages data teams to think like software engineers: modular design, version control, and CI/CD integration.

Breaking logic into reusable components, leveraging packages such as dbt_utils, and maintaining models in Git repositories results in scalable, maintainable pipelines. Consistency and automation improve reliability and simplify operational governance.
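To make that concrete, here is a rough sketch of what the CI side could look like, assuming GitHub Actions, the dbt-databricks adapter, and a workspace token stored as a repository secret (names and structure are illustrative, not from this project):

# .github/workflows/dbt_ci.yml (illustrative sketch)
name: dbt-ci
on: [pull_request]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dbt with the Databricks adapter
        run: pip install dbt-databricks
      - name: Install dbt packages
        run: dbt deps
      - name: Build and test all models in DAG order
        run: dbt build
        env:
          DBT_PROFILES_DIR: .                                  # assumes a profiles.yml checked into the repo
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}    # assumed secret name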

5. Maintain Simplicity

Effective data architecture prioritizes readability and structure over complexity. dbt’s declarative configuration, ref() dependencies, and schema-based organization make pipelines self-describing and predictable. Simplicity ensures long-term maintainability and reduces friction when adapting to business changes.

6. Adopt a Layered Mindset

The Medallion Architecture emphasizes progression and control:

  • Bronze preserves raw data.
  • Silver cleanses, standardizes, and enriches.
  • Gold aggregates for analytics and reporting.

Viewing the pipeline as an evolving system, not isolated scripts, enforces discipline and traceability throughout the data lifecycle.

7. Build for Trust

Ultimately, modern data engineering is about trust in process, models, and outcomes. dbt and the Lakehouse pattern provide the structure to make that trust measurable through testing, documentation, and transparency.

Every model should answer a clear business question and pass objective validation.

In Summary

This project reinforced that scalable data platforms depend on design, discipline, and documentation.

By combining dbt’s engineering-driven approach with the Medallion Architecture, teams can build reproducible, testable, and insight-ready data systems that evolve reliably over time.

Why This Matters to You as a Developer

In the end, dbt isn’t just another data tool; it’s a movement.

It bridges the gap between analytics and engineering, empowering teams to build data like software: tested, versioned, and documented.


Written by patrickokare | I specialize in steering transformative cross-regional data analytics initiatives and data modernization projects.
Published by HackerNoon on 2025/11/28