In digital healthcare data platforms, data quality is no longer a nice-to-have; it is a hard requirement. Business decisions, regulatory reporting, machine learning models, and executive dashboards all depend on one thing: trustworthy data.
Yet, many data engineering teams still treat data quality as an afterthought, validating data only after it has already propagated downstream.
Databricks introduced a powerful shift in this mindset through Declarative Pipelines using Delta Live Tables (DLT).
Instead of writing complex validation logic manually, engineers can now declare what good data looks like and let the platform enforce, monitor, and govern it automatically.
This blog explores how declarative data quality works in Databricks, why it matters, and how to design production-grade pipelines using this approach.
The Traditional Problem with Data Quality
In traditional healthcare ETL pipelines, data quality is usually handled using:
- Custom IF conditions
- Separate validation jobs
- Manual logging tables
- Post-load reconciliation queries
While this approach may work initially, it quickly breaks down at scale:
- Validation logic becomes scattered across notebooks
- Failures are hard to trace back to the root cause
- Metrics are inconsistent across pipelines
- Reprocessing bad data becomes complex
Most importantly, bad data often reaches downstream systems silently, where the impact is far more expensive.
Declarative pipelines solve this problem by making data quality a first-class citizen of the pipeline itself.
What Is Declarative Data Quality?
Declarative data quality means defining rules and expectations, not procedural logic.
Instead of saying:
“Check if the amount is positive and then drop the record.”
You say:
“The amount must always be greater than zero.”
In Databricks, this is implemented using Delta Live Tables (DLT) Expectations.
Expectations allow you to attach data quality rules directly to tables, making the pipeline:
- Self-documenting
- Consistent
- Easier to maintain
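For example, a rule can be declared right on the table definition. Here is a minimal sketch, with illustrative table and rule names:
import dlt

# The rule lives next to the table it governs and is evaluated on every update.
@dlt.table(comment="Orders with a declarative quality rule attached")
@dlt.expect("amount_positive", "amount > 0")  # monitor only: violations are counted, rows still pass
def orders_checked():
    return dlt.read("raw_orders")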
Delta Live Tables and Expectations
Delta Live Tables provide a declarative framework to build batch and streaming pipelines. Data quality is enforced using Expectations, which are evaluated automatically during pipeline execution.
DLT supports three expectation behaviors:
1. Expect (Monitor Only)
This mode tracks data quality issues but allows all records to pass.
Use cases:
- Monitoring upstream data health
- Gradual rollout of quality rules
@dlt.expect("valid_date", "order_date IS NOT NULL")
2. Expect or Drop
Records that violate the rule are automatically removed from the dataset.
Use cases:
- Removing invalid or corrupt records
- Enforcing cleanliness in curated layers
@dlt.expect_or_drop("amount_positive", "amount > 0")
3. Expect or Fail
The pipeline fails immediately if the rule is violated.
Use cases:
- Business-critical constraints
- Regulatory or financial data
@dlt.expect_or_fail("order_id_present", "order_id IS NOT NULL")
This clear separation allows teams to apply the right level of strictness at the right stage.
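When a table needs several rules of the same severity, DLT also lets you group them in a dictionary and apply them together with the expect_all family of decorators. A minimal sketch, with assumed rule and table names:
import dlt

# All rules in one place; any row violating any rule is dropped.
silver_rules = {
    "amount_positive": "amount > 0",
    "customer_present": "customer_id IS NOT NULL",
    "order_date_present": "order_date IS NOT NULL",
}

@dlt.table
@dlt.expect_all_or_drop(silver_rules)
def sales_validated():
    return dlt.read("bronze_sales")
expect_all and expect_all_or_fail work the same way for monitor-only and fail-fast rules, which keeps the severity decision explicit and in one place.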
Data Quality in the Medallion Architecture
Declarative data quality works best when combined with the Bronze–Silver–Gold (Medallion) Architecture.
Bronze Layer – Raw Data
The Bronze layer focuses on ingestion reliability, not correctness.
- Schema-on-read
- Minimal validation
- Preserve raw data
Declarative expectations are usually avoided here, except for basic technical checks.
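As an illustration, a Bronze table might ingest raw files with Auto Loader and carry only a monitor-only technical check. This is a sketch; the table name, landing path, and rule are assumptions:
import dlt

@dlt.table
@dlt.expect("no_rescued_data", "_rescued_data IS NULL")  # monitor only: flags schema drift, drops nothing
def bronze_orders():
    return (
        spark.readStream.format("cloudFiles")    # Auto Loader incremental file ingestion
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/landing/orders/")   # hypothetical landing path
    )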
Silver Layer – Validated and Cleaned Data
The Silver layer is where most data quality rules live.
Typical rules include:
- Non-null checks
- Range validations
- Referential integrity
- Deduplication
Example:
@dlt.table
@dlt.expect_or_drop("valid_amount", "amount > 0")
@dlt.expect("customer_present", "customer_id IS NOT NULL")
def silver_sales():
    return dlt.read("bronze_sales")
This ensures only trusted data flows forward, while still maintaining visibility into quality issues.
Gold Layer – Business-Ready Data
The Gold layer serves analytics, reporting, and machine learning.
Here, expectations are strict:
- Business keys must exist
- Aggregations must be consistent
- No tolerance for invalid records
Fail-fast expectations are commonly used to protect consumers.
Built-In Observability and Metrics
One of the biggest advantages of declarative data quality in Databricks is automatic observability.
For every expectation, Databricks captures:
- Total records processed
- Passed and failed record counts
- Dropped records
- Failure reasons
These metrics are available through:
- Delta Live Tables UI
- Event log tables
- Databricks system tables
This eliminates the need for custom monitoring frameworks and significantly improves auditability.
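For example, per-expectation pass and fail counts can be pulled straight from the event log. The sketch below assumes the pipeline's event log is available as a table (the table name is hypothetical) and follows the documented flow_progress schema; verify the exact path in your workspace:
# Run from a notebook or SQL warehouse, not inside the pipeline itself.
metrics = spark.sql("""
    SELECT
      row_exp.dataset               AS dataset,
      row_exp.name                  AS expectation,
      SUM(row_exp.passed_records)   AS passed_records,
      SUM(row_exp.failed_records)   AS failed_records
    FROM (
      SELECT explode(
        from_json(
          details:flow_progress.data_quality.expectations,
          'array<struct<name:string, dataset:string, passed_records:bigint, failed_records:bigint>>'
        )
      ) AS row_exp
      FROM my_catalog.my_schema.pipeline_event_log   -- hypothetical event log table
      WHERE event_type = 'flow_progress'
    )
    GROUP BY row_exp.dataset, row_exp.name
""")
metrics.show()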
Quarantine Pattern: Don’t Lose Bad Data
Dropping bad records is not always enough. In regulated or enterprise environments, teams often need to retain invalid data for analysis and reprocessing.
A common pattern is to write failed records to a quarantine table:
@dlt.table
def quarantine_sales():
    return dlt.read("bronze_sales") \
        .filter("amount <= 0 OR customer_id IS NULL")
Benefits of this approach:
- Root-cause analysis
- SLA and vendor issue tracking
- Reprocessing after fixes
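A variation that scales well is to keep the rules in one dictionary and derive both the clean table and the quarantine filter from it, so the two can never drift apart. A minimal sketch with assumed table and rule names:
import dlt

rules = {
    "amount_positive": "amount > 0",
    "customer_present": "customer_id IS NOT NULL",
}
# Invert the rules to build the quarantine filter from the same definitions.
quarantine_filter = " OR ".join(f"NOT ({rule})" for rule in rules.values())

@dlt.table
@dlt.expect_all_or_drop(rules)
def silver_sales_clean():
    return dlt.read("bronze_sales")

@dlt.table
def silver_sales_rejected():
    return dlt.read("bronze_sales").filter(quarantine_filter)
One caveat: SQL NULL semantics can make a row fail a rule without matching the rule's simple negation, so null-sensitive rules may need explicit IS NULL handling in the quarantine filter.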
Why Declarative Data Quality Scales Better
| Traditional ETL | Declarative Pipelines |
|---|---|
| Manual validation code | Built-in expectations |
| Hard to audit | Automatic metrics |
| Complex error handling | Clear rule enforcement |
| Reactive fixes | Preventive design |
Declarative pipelines reduce code complexity while increasing reliability — a rare but valuable combination.
Common Mistakes to Avoid
- Applying strict rules in the bronze layer
- Using expect_or_fail everywhere
- Ignoring quarantine tables
- Treating data quality as a one-time setup
Declarative quality works best when rules evolve with the data and business requirements.
Sample Data and Expected Output
To make declarative data quality more concrete, let’s walk through a simple end-to-end example using sample data and see how expectations affect the output at each layer.
Sample Input Data (Bronze Layer)
Assume this is raw sales data ingested from a source system into the Bronze table.
| order_id | customer_id | amount | order_date |
|---|---|---|---|
| 101 | C001 | 250 | 2024-11-01 |
| 102 | C002 | -50 | 2024-11-01 |
| 103 | NULL | 120 | 2024-11-02 |
| 104 | C003 | 0 | 2024-11-02 |
| NULL | C004 | 300 | 2024-11-03 |
At this stage:
- No records are rejected
- Data is stored as-is for traceability
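To keep the walkthrough reproducible end to end, the Bronze table can be stubbed out with the five sample rows above; in a real pipeline this would be an ingestion step rather than a literal DataFrame, so treat the sketch below as a stand-in:
import dlt

@dlt.table
def bronze_sales():
    # Literal stand-in for the raw ingested data shown in the table above.
    rows = [
        (101, "C001", 250, "2024-11-01"),
        (102, "C002", -50, "2024-11-01"),
        (103, None, 120, "2024-11-02"),
        (104, "C003", 0, "2024-11-02"),
        (None, "C004", 300, "2024-11-03"),
    ]
    return spark.createDataFrame(
        rows, "order_id INT, customer_id STRING, amount INT, order_date STRING"
    )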
Data Quality Rules Applied (Silver Layer)
In the Silver layer, we apply declarative expectations:
- amount > 0 → expect_or_drop
- customer_id IS NOT NULL → expect (monitor only)
@dlt.table
@dlt.expect_or_drop("amount_positive", "amount > 0")
@dlt.expect("customer_not_null", "customer_id IS NOT NULL")
def silver_sales():
    return dlt.read("bronze_sales")
Silver Output Table
| order_id | customer_id | amount | quality_status |
|---|---|---|---|
| 101 | C001 | 250 | PASS |
| 103 | NULL | 120 | WARN |
| NULL | C004 | 300 | WARN |
Dropped Records:
- Order 102 (amount = -50)
- Order 104 (amount = 0)
DLT automatically records how many rows were dropped and which rule caused it.
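One note on the quality_status column shown above: DLT records expectation results in its metrics rather than adding a column to the data, so a row-level flag like this has to be derived explicitly. A minimal sketch that reproduces the flag, assuming the same rules:
import dlt
from pyspark.sql import functions as F

@dlt.table
@dlt.expect_or_drop("amount_positive", "amount > 0")
def silver_sales_flagged():
    return (
        dlt.read("bronze_sales")
        .withColumn(
            "quality_status",
            F.when(
                F.col("customer_id").isNotNull() & F.col("order_id").isNotNull(), "PASS"
            ).otherwise("WARN"),  # monitor-only conditions surfaced as a visible flag
        )
    )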
Quarantine Table Output
Instead of losing dropped data, we capture it in a quarantine table.
@dlt.table
def silver_sales_quarantine():
    return dlt.read("bronze_sales") \
        .filter("amount <= 0")
Quarantine Output
| order_id | customer_id | amount | order_date | Reason |
|---|---|---|---|---|
| 102 | C002 | -50 | 2024-11-01 | Invalid amount |
| 104 | C003 | 0 | 2024-11-02 | Invalid amount |
This table is useful for:
- Root-cause analysis
- Vendor or upstream system feedback
- Reprocessing after fixes
Business Rules Applied (Gold Layer)
In the Gold layer, strict business rules are enforced:
- order_id IS NOT NULL → expect_or_fail
from pyspark.sql import functions as F

# Fail fast at row level, where order_id still exists, then aggregate for Gold.
@dlt.view
@dlt.expect_or_fail("order_id_present", "order_id IS NOT NULL")
def validated_sales():
    return dlt.read("silver_sales")

@dlt.table
def gold_sales():
    return (dlt.read("validated_sales")
            .groupBy("customer_id")
            .agg(F.sum("amount").alias("total_spend")))
Gold Output Table
| customer_id | total_spend |
|---|---|
| C001 | 250 |
Pipeline Failure Triggered:
- Record with order_id = NULL causes the pipeline to fail
This protects downstream consumers by preventing incorrect aggregations.
What DLT Captures Automatically
For this example, Databricks automatically tracks:
- Total records ingested: 5
- Records dropped in Silver: 2
- Expectation violations per rule
- Pipeline failure reason in Gold
All metrics are visible in the DLT UI and event logs, with zero custom code.
Final Thoughts
This simple example demonstrates the real power of declarative data quality:
- Rules are clear and self-documenting
- Bad data is controlled, not hidden
- Outputs are predictable and auditable
Declarative pipelines ensure that every downstream dataset is built on explicit trust guarantees, making them ideal for production-grade data platforms.
