What the Heck Is SDF?

Written by progrockrec | Published 2023/10/24
Tech Story Tags: programming | dbt | data-engineering | sql | what-is-sdf | semantic-data-fabric | data | compiler-and-build-system

TL;DR: SDF is a Rust-based alternative to dbt, built on top of DataFusion, which is part of the Apache Arrow project. It allows you to annotate your SQL sources with metadata ranging from simple types and classifiers (PII) to table visibility and privacy policies.

Introduction

The year 2023 saw significant innovation, adoption, and competition in data and infrastructure tooling. HashiCorp ignited a substantial controversy when it moved its products from the Mozilla Public License to the less open-source-friendly Business Source License.

This quickly led to the emergence of Terraform forks, such as OpenTofu.

We also saw increased competition among established players such as Fivetran and dbt, companies that had often collaborated to provide solutions.

This brings us to the topic of this blog post—the Rust-based alternative to dbt, known as SDF (Semantic Data Fabric). So, what exactly is SDF?

SDF Overview

The origins of SDF can be traced back to Meta. It is written in Rust and ships as a single compact binary with no packages to manage, so once it is downloaded, you're ready to get started. SDF is built on DataFusion, which is part of the Apache Arrow project. With SDF, you'll primarily be working with YAML and SQL, both of which should be familiar to any data engineer.

At its core, SDF is a compiler and build system that applies static analysis to SQL code. It can then generate a robust dependency graph for a clearer view of your data assets. This feature enables easier detection of potential problems and optimization issues.
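To make the dependency-graph idea concrete, here is a minimal Python sketch. It is not how SDF works internally: SDF uses a real SQL compiler, whereas this toy uses a naive regular expression to find table references. The model names and SQL are hypothetical and exist only for illustration.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical models (name -> SQL text); not taken from any real SDF project.
MODELS = {
    "raw_orders": "SELECT 1 AS order_id, 10 AS customer_id, 25.0 AS amount",
    "raw_customers": "SELECT 10 AS customer_id, 'Ada' AS name",
    "orders_enriched": (
        "SELECT o.order_id, c.name, o.amount "
        "FROM raw_orders o JOIN raw_customers c ON o.customer_id = c.customer_id"
    ),
    "revenue_by_customer": (
        "SELECT name, SUM(amount) AS revenue FROM orders_enriched GROUP BY name"
    ),
}

# Naive stand-in for real SQL parsing: take the identifier after FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)


def build_graph(models: dict[str, str]) -> dict[str, set[str]]:
    """Map each model to the set of known models it reads from."""
    return {
        name: {ref for ref in TABLE_REF.findall(sql) if ref in models}
        for name, sql in models.items()
    }


def build_order(models: dict[str, str]) -> list[str]:
    """Return a build order in which every model follows its dependencies."""
    return list(TopologicalSorter(build_graph(models)).static_order())


if __name__ == "__main__":
    print(build_graph(MODELS))
    print(build_order(MODELS))
```

Because the graph is derived from the SQL itself, nobody has to declare dependencies by hand. The standard library's `TopologicalSorter` also raises a `CycleError` if two models depend on each other, which is one example of the kind of problem a static pass can surface before anything runs.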

One of SDF's unique features is the capacity to annotate your SQL sources with metadata. This can range from simple types and classifiers (like PII) to table visibility and privacy policies. Rules evaluated against this metadata are what SDF refers to as Checks. Here are some examples of the types of Checks you can perform:

  • Data Privacy Check: Ensures all personally identifiable information (PII) is suitably anonymized.
  • Data Ownership Check: Confirms that every table has an owner—a requirement of GDPR.
  • Data Quality Check: Prevents calculations that mix different currency types (for example, preventing £ + $).

The resulting information schema extends beyond standard features. You can use SQL to explore the information schema and even write your own Checks to integrate into your workflows.
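The idea of a Check as a SQL query over metadata can be sketched in a few lines of Python using the built-in sqlite3 module. This is only an illustration under stated assumptions: the `tables_meta` and `columns_meta` tables, their columns, and the sample rows are all hypothetical and are not SDF's actual information schema. Each check is a query that returns the offending rows, so an empty result means the check passes.

```python
import sqlite3


def make_catalog() -> sqlite3.Connection:
    """Build a toy metadata catalog (hypothetical schema, not SDF's)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tables_meta (table_name TEXT, owner TEXT);
        CREATE TABLE columns_meta (
            table_name  TEXT,
            column_name TEXT,
            classifier  TEXT,    -- e.g. 'PII', or NULL when unclassified
            is_masked   INTEGER  -- 1 if the column is anonymized
        );
        INSERT INTO tables_meta VALUES ('users', 'data-platform'), ('orders', NULL);
        INSERT INTO columns_meta VALUES
            ('users',  'user_id', NULL,  0),
            ('users',  'email',   'PII', 0),
            ('users',  'phone',   'PII', 1),
            ('orders', 'amount',  NULL,  0);
        """
    )
    return conn


# Each check selects violations; zero rows returned means the check passes.
CHECKS = {
    "privacy": (
        "SELECT table_name, column_name FROM columns_meta "
        "WHERE classifier = 'PII' AND is_masked = 0 "
        "ORDER BY table_name, column_name"
    ),
    "ownership": (
        "SELECT table_name FROM tables_meta WHERE owner IS NULL ORDER BY table_name"
    ),
}


def run_checks(conn: sqlite3.Connection, checks: dict[str, str]) -> dict[str, list]:
    """Run every check and return the violating rows per check name."""
    return {name: conn.execute(sql).fetchall() for name, sql in checks.items()}


if __name__ == "__main__":
    for name, rows in run_checks(make_catalog(), CHECKS).items():
        print(name, "PASS" if not rows else f"FAIL {rows}")
```

Expressing a check as "select the violations" keeps the rule and its evidence together: a failing check tells you exactly which tables or columns need attention.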

Additional Features

The SDF engine includes a multi-dialect SQL compiler, a static analyzer, a dependency manager, and a build cache. The following images illustrate the deploy, describe, and lineage commands.

These features are available in the SDF CLI, and a cloud version offers an interactive interface for exploring the catalog and lineage. You can zoom in and out, click on a node, and delve deeper. Here is a zoomed-out image:

As previously mentioned, the metadata and configurations for SDF are written in YAML and together form an SDF Workspace. SDF goes beyond YAML's typical configuration role by letting you specify data asset definitions alongside the configuration, in what it calls definition blocks.

These blocks can be used to define or enrich tables, functions, classifiers, etc. Borrowing an example from their documentation, here's an illustrative SDF YAML workspace.

These classifiers can conveniently tag your columns and their descendants throughout a project. Coupled with available functions, you gain substantial control and visibility over your data models. The compilation step and helpful error messages contribute to a comprehensive environment.
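The way a classifier follows a column to its descendants can be sketched as a simple graph traversal. The Python below is a hedged illustration, not SDF's implementation: the lineage map, column names, and the `propagate` helper are all hypothetical. It assumes column-level lineage is already known and pushes each tag downstream until nothing new is learned.

```python
from collections import deque

# Hypothetical column-level lineage: upstream column -> columns derived from it.
LINEAGE = {
    "raw_users.email": ["stg_users.email"],
    "stg_users.email": ["dim_users.email_domain", "dim_users.contact"],
    "raw_users.user_id": ["stg_users.user_id"],
    "stg_users.user_id": ["dim_users.user_id"],
}

# Tags applied by hand at the source (hypothetical).
SOURCE_TAGS = {"raw_users.email": {"PII"}}


def propagate(lineage: dict[str, list[str]], source_tags: dict[str, set[str]]) -> dict[str, set[str]]:
    """Copy every tag from a column to all columns derived from it."""
    tags = {col: set(labels) for col, labels in source_tags.items()}
    queue = deque(tags)
    while queue:
        col = queue.popleft()
        for child in lineage.get(col, []):
            child_tags = tags.setdefault(child, set())
            missing = tags[col] - child_tags
            if missing:
                # Only revisit a column when it gained a tag, so cycles terminate.
                child_tags |= missing
                queue.append(child)
    return tags


if __name__ == "__main__":
    for column, labels in sorted(propagate(LINEAGE, SOURCE_TAGS).items()):
        print(column, sorted(labels))
```

In this sketch, tagging `raw_users.email` once is enough for every column derived from it to carry the PII label, while the `user_id` columns stay untagged.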

Summary

This project's compact nature is impressive; you can install it locally and experiment with minimal effort. It takes a different approach from dbt, which may meet resistance from some teams. It also remains unclear how it would perform in large, complex environments where thousands of modules need to be deployed.

Nevertheless, the Checks feature could be the solution to this concern. SDF is undeniably a valuable project that can meet many use cases. If you're weary of dbt or considering something akin to dbt, then SDF should probably be on your shortlist.

You can read more "What the heck" articles at the following links:

  • What The Heck Is DuckDB?
  • What the Heck Is Malloy?
  • What the Heck is PRQL?
  • What the Heck is GlareDB?
  • What the Heck is SeaTunnel?
  • What the Heck is LanceDB?


Written by progrockrec | Technology and blockchain developer and enthusiast as well as a prolific musician.
Published by HackerNoon on 2023/10/24