Engineers, whether they’re data, software, or DevOps, often strive to build or design the perfect solution. They tend to focus predominantly on technical aspects, sometimes overlooking business outcomes. This is a common oversight.
However, our main goal is to develop a product that addresses a business problem, not necessarily to create a flawless solution. Therefore, we need to continually revise our design to ensure we’re on the right track.
Incremental design yields a working system only at the end of the whole implementation. Iterative design, on the other hand, produces a functioning system at the end of every iteration.
A well-known illustration of this contrast is Henrik Kniberg's drawing of the skateboard-to-car progression, where each iteration delivers something usable.
In the realm of data engineering, consider the task of building a data pipeline. This pipeline needs to ingest data from a source, transform it, load it into a destination, and provide data access. To build the right solution, we need to address several questions:
What is the source (technology, format, etc.)?
How much data do we need to ingest (overall, regularly)?
What is the destination (technology, format, etc.)?
What are the business expectations from the pipeline (SLA, availability, latency, etc.)?
The first three questions are technical, while the fourth is business-oriented. The answers to these will guide our technical decisions, and they will vary with business priorities and the team's level of expertise.
Assuming we have a relational database as a source system and a data platform (data warehouse, data lake, lakehouse, etc.) as a target, we can start designing our pipeline. Our initial solution might be a batch pipeline that runs daily and loads data into our data platform.
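To make that first iteration concrete, here is a minimal sketch of such a daily batch load in Python, assuming pandas and SQLAlchemy; the connection strings and the orders table are hypothetical placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection strings, for illustration only.
SOURCE_URL = "postgresql://user:pass@source-db:5432/app"
WAREHOUSE_URL = "postgresql://user:pass@warehouse:5432/analytics"

def daily_batch_load(table: str, raw_schema: str = "raw") -> None:
    """Full load of one source table into the raw layer.

    Runs once a day, e.g., from cron or an orchestrator such as Airflow.
    """
    source = create_engine(SOURCE_URL)
    warehouse = create_engine(WAREHOUSE_URL)

    # Read the whole table; for large tables this would become a
    # partial load filtered by an updated_at watermark.
    df = pd.read_sql_table(table, source)

    # Overwrite the raw-layer copy: the simplest possible semantics.
    df.to_sql(table, warehouse, schema=raw_schema,
              if_exists="replace", index=False)

if __name__ == "__main__":
    daily_batch_load("orders")
```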
However, if we have past experience with lower-latency ingestion, or have used Change Data Capture (CDC) before, we might instead suggest a pipeline that ingests data hourly or in near real time.
However, implementing a streaming pipeline requires several tasks:
1. Making changes on the source database, which involves DBAs.
2. Choosing, configuring, and installing CDC software, which involves DevOps.
3. Defining a target system for CDC data (file system, object storage, message broker).
4. Writing a streaming pipeline that ingests CDC data into our data platform (into a raw layer).
5. Writing a streaming pipeline that transforms data and incrementally loads it into the next layers.
6. Covering all components with monitoring, alerting, logging, etc.
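To illustrate step 4, here is a rough sketch of a CDC-consuming ingestion loop, assuming the change events land in Kafka (read here with the kafka-python package) as Debezium-style JSON; the topic name and the write_to_raw_layer sink are hypothetical:

```python
import json
from kafka import KafkaConsumer  # kafka-python package

def write_to_raw_layer(event: dict) -> None:
    """Hypothetical sink: append one change event to the raw layer."""
    print(event)  # replace with an object-store or warehouse append

# Debezium-style topic naming: <server>.<schema>.<table> (an assumption).
consumer = KafkaConsumer(
    "source-db.public.orders",
    bootstrap_servers=["broker:9092"],
    value_deserializer=lambda v: json.loads(v) if v else None,
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    if event is None:  # tombstone record, nothing to ingest
        continue
    # A CDC event typically carries the operation type (insert/update/
    # delete) and the row state before and after the change.
    write_to_raw_layer(event)
```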
Depending on the company’s size and existing technologies, implementing such a solution could take several weeks to months.
To avoid building a solution that might not be needed, we should build our system incrementally and iteratively. Here is a possible approach:
1. Implement a batch pipeline that ingests data from the source database and loads it into our data platform (into a raw layer). Depending on the amount of data, this could be a full load or a partial load based on certain criteria.
2. Implement a batch pipeline that reads only the current data from the raw layer and performs truncate inserts into the further layers (sketched after this list). This approach spares us from dealing with updates, deletes, and merge operations, allowing faster implementation.
3. Cover all components with monitoring, alerting, logging, etc.
4. Enable CDC on the source database (see the PostgreSQL example after this list).
5. Define a target system for CDC data (file system, object storage, message broker).
6. Implement a streaming pipeline that ingests CDC data into our data platform and processes it incrementally into the further layers.
7. Cover all components with monitoring, alerting, logging, etc.
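The truncate-insert pattern from step 2 is what keeps the early phases fast: every run rebuilds the downstream table from the current raw snapshot, so no merge logic is needed. A minimal sketch, assuming SQLAlchemy and hypothetical raw.orders and staging.orders tables:

```python
from sqlalchemy import create_engine, text

WAREHOUSE_URL = "postgresql://user:pass@warehouse:5432/analytics"  # assumption

def truncate_insert(target: str, select_sql: str) -> None:
    """Rebuild a downstream table from the raw layer in one transaction.

    Because the whole table is replaced on each run, we never have to
    reconcile updates, deletes, or late-arriving rows with merge logic.
    """
    engine = create_engine(WAREHOUSE_URL)
    with engine.begin() as conn:  # commits on success, rolls back on error
        conn.execute(text(f"TRUNCATE TABLE {target}"))
        conn.execute(text(f"INSERT INTO {target} {select_sql}"))

# Example: rebuild the staging table from the current raw snapshot.
truncate_insert("staging.orders", "SELECT * FROM raw.orders")
```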
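Step 4 depends on the source database. As one assumed example, on PostgreSQL it roughly amounts to publishing the tables of interest for logical decoding; the replication slot itself is typically created by the CDC tool (e.g., Debezium):

```python
from sqlalchemy import create_engine, text

SOURCE_URL = "postgresql://user:pass@source-db:5432/app"  # assumption

# PostgreSQL also requires wal_level = logical in postgresql.conf
# (a server restart), which is where the DBAs come in.
engine = create_engine(SOURCE_URL)
with engine.begin() as conn:
    # Publish changes from the tables we want to capture.
    conn.execute(text("CREATE PUBLICATION cdc_pub FOR TABLE orders"))
    # Keep full row images so updates/deletes carry the old values.
    conn.execute(text("ALTER TABLE orders REPLICA IDENTITY FULL"))
```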
At each phase, we have a working system that solves a business problem, allowing the business to benefit from our implementation. We can stop at any phase or continue to implement more features as per business needs.