
System Design: An Iterative and Incremental Approach

by Alexey Artemov, February 2nd, 2024

Too Long; Didn't Read

In data engineering, consider the task of building a data pipeline. This pipeline needs to ingest data from a source, transform it, load it into a destination, and provide data access. To build the right solution, we need to address several questions. The answers to these will guide our technical decisions.

Engineers, whether they work in data, software, or DevOps, often strive to design the perfect solution. They tend to focus predominantly on technical aspects, sometimes overlooking business outcomes. This is a common oversight.


However, our main goal is to develop a product that addresses a business problem, not necessarily to create a flawless solution. Therefore, we need to continually revise our design to ensure we’re on the right track.


Incremental design delivers a working system only at the end of implementation. Iterative design, on the other hand, produces a functioning system at the end of each iteration.


A notable illustration of iterative design is Henrik Kniberg's well-known drawing, in which a product evolves from a skateboard into a car while remaining usable at every step. Sounds good?


In the realm of data engineering, consider the task of building a data pipeline. This pipeline needs to ingest data from a source, transform it, load it into a destination, and provide data access. To build the right solution, we need to address several questions:


  1. What is the source (technology, format, etc.)?


  2. How much data do we need to ingest (overall, regularly)?


  3. What is the destination (technology, format, etc.)?


  4. What are the business expectations from the pipeline (SLA, availability, latency, etc.)?


The first three questions are technical, while the fourth is business-oriented. The answers to these will guide our technical decisions. Depending on business priorities and the stakeholders' level of expertise, we might hear responses such as:


  • We need fresh data daily.


  • Fresh data should be available during working hours.


Assuming we have a relational database as a source system and a data platform (data warehouse, data lake, lakehouse, etc.) as a target, we can start designing our pipeline. Our initial solution might be a batch pipeline that runs daily and loads data into our data platform.
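
For illustration, here is a minimal sketch of such a daily batch pipeline in Python, assuming a PostgreSQL source and a SQL warehouse reachable through SQLAlchemy; the connection strings and table names are hypothetical, not part of the original design.

```python
# A minimal sketch of the daily batch pipeline, assuming a PostgreSQL source
# and a SQL warehouse; connection strings and table names are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

SOURCE_URL = "postgresql://user:pass@source-db:5432/app"  # hypothetical
TARGET_URL = "postgresql://user:pass@warehouse:5432/dwh"  # hypothetical


def run_daily_batch() -> None:
    source = create_engine(SOURCE_URL)
    target = create_engine(TARGET_URL)

    # Full load: read the whole table. For large tables this would become
    # a partial load, filtered by a date column or another business criterion.
    orders = pd.read_sql("SELECT * FROM orders", source)

    # Land the snapshot in the raw layer, replacing yesterday's copy.
    orders.to_sql("raw_orders", target, if_exists="replace", index=False)


if __name__ == "__main__":
    run_daily_batch()  # triggered once a day by a scheduler such as cron or Airflow
```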


But if we have prior experience with low-latency ingestion, or have used Change Data Capture (CDC) before, we might instead suggest a pipeline that ingests data hourly or in near real-time.


However, implementing a streaming pipeline requires several tasks (a configuration sketch follows this list):

  1. Enabling CDC on the source database, which may require:
    • Changes on the source database, involving DBAs
    • Choosing, configuring, and installing CDC software, involving DevOps

  2. Defining a target system for CDC data (file system, object storage, message broker).

  3. Writing a streaming pipeline that ingests CDC data into our data platform (to a raw layer).

  4. Writing a streaming pipeline that transforms data and incrementally loads it into the next layers.

  5. Covering all components with monitoring, alerting, logging, etc.
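
To make steps 1 and 2 concrete, here is a hedged sketch of registering a Debezium PostgreSQL connector with Kafka Connect over its REST API; the host names, credentials, and table list below are illustrative assumptions, not prescriptions.

```python
# Registering a Debezium PostgreSQL connector via the Kafka Connect REST API.
# Host names, credentials, and table names are illustrative assumptions.
import json
import requests

CONNECT_URL = "http://kafka-connect:8083/connectors"  # hypothetical host

connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "source-db",
        "database.port": "5432",
        "database.user": "cdc_user",          # created by the DBAs in step 1
        "database.password": "cdc_password",
        "database.dbname": "app",
        "topic.prefix": "app",                # events land in topic app.public.orders
        "table.include.list": "public.orders",
        "plugin.name": "pgoutput",            # requires wal_level=logical on the source
    },
}

resp = requests.post(
    CONNECT_URL,
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```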


Depending on the company’s size and existing technologies, implementing such a solution could take several weeks to months.


To avoid building a solution that might not be needed, we should build our system incrementally and iteratively. Here is a possible approach:

Phase 1:

  1. Implement a batch pipeline that ingests data from the source database and loads it to our data platform (to a raw layer). Depending on the amount of data, this could be a full load or a partial load based on certain criteria.


  2. Implement a batch pipeline that reads only the current (latest) data from the raw layer and performs truncate-and-insert loads into the further layers. This approach spares us from handling updates, deletes, and merge operations, allowing faster implementation (see the sketch after this list).


  3. Cover all components with monitoring, alerting, logging, etc.
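
Here is a minimal sketch of step 2, assuming a Postgres-compatible warehouse and illustrative table and column names; the raw snapshot may contain several versions of a row, so we keep only the latest one.

```python
# Truncate-insert rebuild of a core table from the raw layer (Phase 1, step 2).
# Assumes a Postgres-compatible warehouse; table and column names are illustrative.
from sqlalchemy import create_engine, text

TARGET_URL = "postgresql://user:pass@warehouse:5432/dwh"  # hypothetical


def rebuild_core_orders() -> None:
    engine = create_engine(TARGET_URL)
    # One transaction, so readers never observe an empty table.
    with engine.begin() as conn:
        conn.execute(text("TRUNCATE TABLE core_orders"))
        # Keep only the latest version of each order from the raw snapshot;
        # no update/delete/merge logic is needed.
        conn.execute(text("""
            INSERT INTO core_orders (order_id, status, amount, updated_at)
            SELECT DISTINCT ON (order_id)
                   order_id, status, amount, updated_at
            FROM raw_orders
            ORDER BY order_id, updated_at DESC
        """))
```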

Phase 2:

  1. Implement incremental data processing from the raw layer to further layers, which can reduce processing time and infrastructure costs.
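
A possible shape for this phase is a watermark-based upsert; the sketch below assumes a Postgres-compatible warehouse, an updated_at column in the raw layer, and a unique key on order_id, all of which are illustrative.

```python
# Watermark-based incremental load (Phase 2): upsert only rows that changed
# since the last run. Assumes a Postgres-compatible warehouse, an updated_at
# column, and a unique constraint on order_id; all names are illustrative.
from sqlalchemy import create_engine, text

TARGET_URL = "postgresql://user:pass@warehouse:5432/dwh"  # hypothetical


def process_increment() -> None:
    engine = create_engine(TARGET_URL)
    with engine.begin() as conn:
        # High-water mark: the newest timestamp already loaded downstream.
        watermark = conn.execute(
            text("SELECT COALESCE(MAX(updated_at), TIMESTAMP 'epoch') FROM core_orders")
        ).scalar_one()

        # Upsert only the changed rows instead of rebuilding the whole table.
        conn.execute(
            text("""
                INSERT INTO core_orders (order_id, status, amount, updated_at)
                SELECT DISTINCT ON (order_id)
                       order_id, status, amount, updated_at
                FROM raw_orders
                WHERE updated_at > :watermark
                ORDER BY order_id, updated_at DESC
                ON CONFLICT (order_id) DO UPDATE
                SET status = EXCLUDED.status,
                    amount = EXCLUDED.amount,
                    updated_at = EXCLUDED.updated_at
            """),
            {"watermark": watermark},
        )
```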

Phase 3:

  1. Enable CDC on the source database.


  2. Define a target system for CDC data (file system, object storage, message broker).


  3. Implement a streaming pipeline that ingests CDC data into our data platform and processes it incrementally to further layers (sketched after this list).


  4. Cover all components with monitoring, alerting, logging, etc.
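
One possible shape for step 3 is a Spark Structured Streaming job; the topic name matches the CDC sketch above, while the schema, storage paths, and trigger interval are illustrative assumptions (the Kafka source also requires the spark-sql-kafka package on the classpath).

```python
# Streaming ingestion of CDC events from Kafka into the raw layer (Phase 3, step 3).
# Requires the spark-sql-kafka package; topic, schema, and paths are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("orders-cdc-ingest").getOrCreate()

# Simplified payload schema; a real Debezium envelope also carries
# before/after images, an op code, and source metadata.
payload_schema = StructType([
    StructField("order_id", StringType()),
    StructField("status", StringType()),
    StructField("amount", StringType()),
    StructField("updated_at", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # hypothetical
    .option("subscribe", "app.public.orders")         # topic from the CDC setup
    .load()
    .select(from_json(col("value").cast("string"), payload_schema).alias("row"))
    .select("row.*")
)

query = (
    events.writeStream.format("parquet")
    .option("path", "s3://data-platform/raw/orders")          # raw layer (hypothetical)
    .option("checkpointLocation", "s3://data-platform/chk/orders")
    .trigger(processingTime="1 minute")                       # near real-time cadence
    .start()
)
query.awaitTermination()
```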


At each phase, we have a working system that solves a business problem, allowing the business to benefit from our implementation. We can stop at any phase or continue to implement more features as per business needs.