Debezium Introduction: Another Change Data Capture Tool
@suksantSuksant (Hui) Sae Lor
Data Leader with an Infinite Sleep Debt
Building an enterprise data warehouse can be either relatively straightforward or very sophisticated. It depends on many factors, such as the conceptual data model complexity and the variety of source systems. In many cases, applying the Change Data Capture (CDC)
approach can make the data integration simpler. Fortunately, there are plenty of CDC tools available in the market, many of which are easy-to-use and affordable, while others are cumbersome and expensive (for what it is).
What I am interested in doing is to move the data from a SQL Server to Google BigQuery
without too much hassle like changing the firewall rules. Many enterprise solutions can achieve this out of the box, such as Alooma
and Qlik Replicate
. The selection process should take into account at least
- Environment - A self-hosted or a fully managed solution. If it is the latter, a more thorough consideration of regulatory and compliance issues is required.
- Connectors - The availability of the connectors to source and target systems.
- Quality Attributes - Performance, robustness, and reliability of the solution.
- Pricing - A free, a connector-based, or a consumption-based pricing model.
I have recently looked at an open-source tool, Debezium
and found it fascinating, especially the Debezium Server
, which is an incubating feature.
To test this out, I use the Stack Overflow data (~10GB) provided by Brent Ozar
with a simple setup as follows.
Configure a SQL Server
- Restore the database using MDF, NDF, and LDF files. At present, Cloud SQL does not support CDC so to test this, either use Azure SQL Managed Instance or a Microsoft SQL Server Docker Image (the container must run with an environment variable 'MSSQL_AGENT_ENABLED=True' to enable the SQL Server Agent).
- Enable CDC on the database and all tables.
- Start the CDC job on the SQL Server.
Set Up a Debezium Server
- Download and extract the Debezium Server from the Maven Central Repository.
- Start a ZooKeeper server.
- Start a Kafka server.
- Create a file `conf/application.properties` as follows:
Create and Start the Processing PipelineCreate a BigQuery dataset and tables (must include column `op` and `ts_ms`).Create a Pub/Sub topic for each table and one for the database.Create an HTTP function to listen to Pub/Sub push subscriptions. This test streams only the latest data, the operation, and the timestamp into BigQuery.
# -*- coding: utf-8 -*-
This HTTP function is responsible for:
- Parsing the messages published by the Debezium Server.
- Inserting the records to relevant BigQuery tables.
This HTTP function is not responsible for:
- Handling or logging errors.
- Deduplicating data.
- Validating the input.
- Checking for dupliations.
from google.api_core import retry
from google.cloud import bigquery
BQ = bigquery.Client()
'''This function is executed whenever the endpoint is called'''
request_json = request.get_json(silent=True)
data = request_json["message"]["data"]
data = base64.b64decode(data).decode('utf-8')
data = json.loads(data)
payload = data["payload"]
record = payload["after"]
record["op"] = payload["op"]
record["ts_ms"] = payload["ts_ms"]
dataset_id = payload["source"]["db"]
table_id = payload["source"]["table"]
table = BQ.dataset(dataset_id).table(table_id)
_ = BQ.insert_rows_json(table=table, json_rows=[record])
- Create a subscription for each topic with the HTTP function above as a push endpoint.
- Run ./run.sh to start the CDC process.
The initial load of all the tables took quite a while to complete as it was configured to run sequentially. The HTTP function above was written for a quick test and should not be used in production. To make it production-ready, it requires a better error handling, a recovery mechanism, and a deduplication method to guarantee high-quality delivery.
With the support for SQL Server coming soon, it might be more convenient for those who are already on GCP to use the Dataflow CDC Example
to stream data directly into BigQuery in a more scalable and controllable way. For example, the embedded connector can be run in a single-topic mode, publishing all updates for a database to a single Pub/Sub topic.
I would also like to try out a framework like PipelineWise
, which is based on Singer.io
specification and already have a few log-based connectors. It seems that the industry is shying away from a locked-in ecosystem and aggressive price points at a rapid pace. Debezium and many other open-source tools are still cumbersome, but it is evolving both in terms of quality and usability. When it is not feasible to use fully managed services, these can become great options.
Subscribe to get your daily round-up of top tech stories!