How to design scalable and maintainable ETLs

Written by veselin.davidov1 | Published 2018/10/04
Tech Story Tags: etl | programming | software-development | tutorial


People often underestimate ETL design and start with a simple idea, which then grows to include additional features and transformations until they end up with something quite messy. That causes headaches for the IT team, who sleep with fingers crossed that nothing will fail. Such an ETL typically ends up as one big, tangled job.

There are several requirements that help the maintenance of the ETL:

  • Ability to restart/resume the process
  • Structured information about the execution history and the problems
  • External configuration
  • API to monitor and control
  • Easy failure recovery

And there are a few steps we can follow to achieve all of the above.

ETL Atomicity

That’s just a fancy term for breaking our ETLs into small, independent jobs. It is the most important requirement for achieving the goals above. Simple jobs are easy to understand, easy to change and easy to maintain. I know most of us think we do this anyway, but in my experience at a bespoke software development company it is often neglected. Let’s look at the following example: a single job that downloads a file from FTP and then loads Tables 1, 2 and 3, one after the other.

It may look simple enough, but what if you need to execute it without loading Table 2? You have to recreate the job and deploy a new version. Or when loading Table 3 fails due to a DB crash, you restart everything instead of reloading just that table.

Imagine if that job is split into small idempotent jobs like:

  • Load from FTP
  • 3 load table jobs (which can be the same job reused with a parameter)

(Notice how the drop table and the load data steps are in the same job. We don’t want to split these into two, because the load step would no longer be idempotent: if we ran it twice we would double the rows.)
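
The original article illustrates this with a Pentaho transformation; purely as a rough Python sketch of the same idea (the table schema and the sample rows below are made up for illustration), an idempotent “load table” job parameterized by table name could look like this:

```python
import sqlite3


def load_table(conn, table_name, rows):
    """Idempotent load: drop/recreate the table and insert the data as one unit,
    so running the job twice never doubles the rows."""
    with conn:  # one transaction: either everything happens or nothing does
        conn.execute(f"DROP TABLE IF EXISTS {table_name}")
        conn.execute(f"CREATE TABLE {table_name} (id INTEGER, payload TEXT)")  # illustrative schema
        conn.executemany(
            f"INSERT INTO {table_name} (id, payload) VALUES (?, ?)", rows
        )


# The same job reused with a parameter, as described above:
conn = sqlite3.connect("warehouse.db")
for table in ("table_1", "table_2", "table_3"):
    load_table(conn, table, rows=[(1, "example row")])
```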

Configurable execution

Now that our ETL process consists of many small atomic jobs, we need a way to execute them. A simple solution would be to chain them all in one master job, but this isn’t very flexible and doesn’t allow us to configure it easily. A better approach is to create a master task that reads the required jobs from external configuration and executes them. This way we can modify the execution flow without redeploying the ETL and without touching the code.

A simple YAML file to store the configuration might be sufficient for a small project, but to make it flexible and extensible, a database can be used. In our example ETL it can be as simple as having two tables, with versioning used to keep a proper execution history.
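
The article doesn’t spell out the schema, so the table and column names below are assumptions; a minimal version of those two configuration tables, with the version used to keep a proper history, might look like this:

```python
import sqlite3

conn = sqlite3.connect("etl_meta.db")
with conn:
    # One row per deployed configuration version of an ETL.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS etl_version (
            version_id INTEGER PRIMARY KEY,
            etl_name   TEXT NOT NULL,
            created_at TEXT NOT NULL,
            is_active  INTEGER NOT NULL DEFAULT 0
        )""")
    # The ordered list of jobs that make up that version of the ETL.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS etl_job (
            job_id     INTEGER PRIMARY KEY,
            version_id INTEGER NOT NULL REFERENCES etl_version(version_id),
            job_name   TEXT NOT NULL,  -- e.g. 'load_from_ftp', 'load_table'
            parameters TEXT,           -- e.g. the name of the table to load
            run_order  INTEGER NOT NULL
        )""")
```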

And our master job can be something like:
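
(The original shows the master job in Pentaho; below is a rough Python equivalent, assuming the configuration tables sketched above and a run_job helper that knows how to launch a single job.)

```python
def run_etl(conn, etl_name, run_job):
    """Read the jobs configured for the active version and execute them in order."""
    jobs = conn.execute(
        """SELECT j.job_name, j.parameters
             FROM etl_job j
             JOIN etl_version v ON v.version_id = j.version_id
            WHERE v.etl_name = ? AND v.is_active = 1
            ORDER BY j.run_order""",
        (etl_name,),
    ).fetchall()

    for job_name, parameters in jobs:
        run_job(job_name, parameters)  # launch one small, idempotent job
```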

Maintaining execution flow history

Searching the log files of a big ETL can be a nightmare. Since we already have a database dedicated to the ETL, it is easy to extend the master task to keep track of the execution flow. This way our jobs don’t have to worry about logging and failure handling (remember, they are small atomic pieces of code that do simple things). So, let’s add another couple of tables to that database: an ETL Execution Runs table and an ETL Run Jobs table.

Every time we run an ETL, we read the required jobs for the current version, add a record to the ETL Execution Runs table and add records for all scheduled jobs to the ETL Run Jobs table. This allows us (and our master task) to easily keep track of the execution. We also have information about previous ETL runs with their jobs, timings, statuses, etc. The master task takes care of these tables and controls the process. We can extend that by adding additional fields (or additional tables) for meta information like rows loaded, files processed, etc.
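
As a hedged sketch of how the master task might maintain those tables (the names etl_execution_run and etl_run_job, and their columns, are assumptions rather than the article’s exact schema):

```python
from datetime import datetime, timezone


def run_etl_tracked(conn, etl_name, jobs, run_job):
    """Record an execution run, pre-register every scheduled job, and track statuses."""
    now = lambda: datetime.now(timezone.utc).isoformat()

    with conn:
        run_id = conn.execute(
            "INSERT INTO etl_execution_run (etl_name, started_at, status) "
            "VALUES (?, ?, 'RUNNING')",
            (etl_name, now()),
        ).lastrowid
        for job_name, parameters in jobs:
            conn.execute(
                "INSERT INTO etl_run_job (run_id, job_name, parameters, status) "
                "VALUES (?, ?, ?, 'SCHEDULED')",
                (run_id, job_name, parameters),
            )

    status = "SUCCESS"
    for job_name, parameters in jobs:
        try:
            run_job(job_name, parameters)
            job_status = "SUCCESS"
        except Exception:
            job_status = "FAILED"
        with conn:
            conn.execute(
                "UPDATE etl_run_job SET status = ?, finished_at = ? "
                "WHERE run_id = ? AND job_name = ?",
                (job_status, now(), run_id, job_name),
            )
        if job_status == "FAILED":
            status = "FAILED"
            break  # leave the remaining jobs SCHEDULED so the run can be resumed

    with conn:
        conn.execute(
            "UPDATE etl_execution_run SET status = ?, finished_at = ? WHERE run_id = ?",
            (status, now(), run_id),
        )
```

Because the status of every job survives in the database, a failed run can later be resumed from the first job that is still marked SCHEDULED.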

Decoupling the ETL execution engine from the ETL

Now that we have our ETL as a structured list of small jobs, and a master task with a database to keep track of the execution, we can easily see that these two don’t need to be coupled together. Such a master task can run different ETLs consisting of different jobs and keep track of each of them. It is, in effect, a simple “ETL server”, but a few more things are needed to make it a real server.

  1. First, it requires a simple API to allow easier access to the information about the execution runs and to control them. It can be a simple CLI for the administrators or, even better, a web service to allow different user access and actions; a minimal sketch follows this list. In the end, it should be simple to see the status of the ETL and to resume/restart/stop it even without access to the server. The database has all the information needed to see which step failed, and we have made our steps idempotent, so they can safely be restarted.
  2. We need an easy way to deploy a new ETL. There are many possibilities but it comes down to supplying a package that contains the jobs and the configuration.
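
The article doesn’t prescribe a framework or endpoint shape; purely as an illustration, a tiny Flask service on top of the same metadata database (reusing the assumed etl_execution_run and etl_run_job tables) could expose status and resume endpoints like this:

```python
import sqlite3

from flask import Flask, jsonify

app = Flask(__name__)
DB_PATH = "etl_meta.db"  # the metadata database from the previous sections


@app.get("/runs/<int:run_id>")
def run_status(run_id):
    """Show the overall run status plus the status of each of its jobs."""
    conn = sqlite3.connect(DB_PATH)
    run = conn.execute(
        "SELECT etl_name, status, started_at, finished_at "
        "FROM etl_execution_run WHERE run_id = ?",
        (run_id,),
    ).fetchone()
    jobs = conn.execute(
        "SELECT job_name, status FROM etl_run_job WHERE run_id = ?", (run_id,)
    ).fetchall()
    conn.close()
    return jsonify({"run": run, "jobs": jobs})


@app.post("/runs/<int:run_id>/resume")
def resume_run(run_id):
    """Mark failed jobs as SCHEDULED again so the master task retries them."""
    conn = sqlite3.connect(DB_PATH)
    with conn:
        conn.execute(
            "UPDATE etl_run_job SET status = 'SCHEDULED' "
            "WHERE run_id = ? AND status = 'FAILED'",
            (run_id,),
        )
    conn.close()
    return jsonify({"resumed": run_id})
```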

A simple packaging would be a .zip file with all the jobs and a config.yml file in the root folder. When our server gets that etl.zip, it can deploy it, schedule it and start executing it. The ETL developers don’t need to know how the server works; they just need to describe the flow in the agreed format.
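
As a rough illustration of that packaging idea (the exact layout and config keys below are assumptions: a zip with the jobs plus a config.yml listing them in order, parsed here with PyYAML):

```python
import zipfile
from pathlib import Path

import yaml  # PyYAML


def deploy_etl(package_path, deploy_root="deployed_etls"):
    """Unpack an etl.zip and read the flow description from its config.yml."""
    target = Path(deploy_root) / Path(package_path).stem
    with zipfile.ZipFile(package_path) as package:
        package.extractall(target)

    # Expected shape (illustrative): {"name": "sales_etl",
    #   "jobs": [{"name": "load_from_ftp"},
    #            {"name": "load_table", "parameters": "table_1"}, ...]}
    config = yaml.safe_load((target / "config.yml").read_text())
    return config
```

From there the server would write the job list from config.yml into the configuration tables, schedule the new version and start executing it.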

Conclusion

Now that we have the basics, it’s up to us to extend that server to make it production-ready. After dealing with the same problems on different ETLs over and over, I was sure it could be done better. We started with exactly the steps above, and a very similar table structure, for one simple ETL and continued from there. In the end we had a production-ready server running multiple ETLs, with easy monitoring, failure recovery and everything else needed for enterprise-level customers. The examples here are made in Pentaho Data Integration, but the same basic principles apply to all tools.

Please do not hesitate to ask if you have any questions, and please share your recommendations!

