An ETL (and its not-so-distant cousin, the ELT) is a concept that is not usually taught in college, at least not in undergraduate courses. Yet ETLs play an important role in almost every company’s day-to-day operations.
ETLs are the pipelines that feed data into the business dashboards and algorithms that provide vital insights and metrics to managers. These pipelines range from very simple processes that extract data from operational databases into a data warehouse to complex analytical layers that support more involved transformations and applications.
The problem is that many IT graduates are never taught this important skill, even though they will more than likely use it every day if they pursue a career in data science or data engineering. Although data science work doesn’t technically include ETLs, much of what data scientists do ends up crossing over with data engineering.
We wanted to cover the basics of an ETL and good ETL design.
What does a good ETL look like? What are the key components? What makes a pipeline maintainable and healthy? These are all important questions data engineers and database managers need to ask before taking on new projects.
This will be the first of several posts. In this post, we will cover some basic concepts such as extracting and loading raw data and logging. These are important topics with a lot of nuances that we would like to start covering here.
We will also be offering a free webinar on February 23 at 10 AM PT to cover this topic. We would love to have you attend!
Operational Database
All of the data your ETL ingests has some source that represents real-life transactions of one sort or another. The data could represent patients visiting hospitals, people posting on social media, or customers purchasing products. Some might ask why you can’t just analyze the data in the operational database instead of going through the extra steps of processing it into another system.
The biggest issue is that running analytical processes that aggregate and analyze large data sets against the operational database will slow it down for customers. In addition, operational databases are heavily normalized, which makes them inefficient to query from an analytical standpoint. This is why a data warehouse of one form or another is developed.
Here is an example of one way to bulk insert data into SQL Server.
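The snippet below is only a minimal sketch in Python, using pyodbc to run a T-SQL BULK INSERT; the connection string, the table name (dbo.RawCustomers), and the file path are placeholder assumptions rather than a prescribed setup.

```python
import pyodbc  # ODBC wrapper commonly used to talk to SQL Server from Python


def bulk_insert_csv(conn_str: str, table: str, file_path: str) -> None:
    """Load a CSV file into a raw table via T-SQL BULK INSERT.

    FIRSTROW = 2 skips the header row. Note that the file path must be
    visible to the SQL Server instance itself, not just to this script.
    """
    sql = f"""
        BULK INSERT {table}
        FROM '{file_path}'
        WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', FIRSTROW = 2);
    """
    with pyodbc.connect(conn_str, autocommit=True) as conn:
        conn.execute(sql)


# Example usage (hypothetical server, database, table, and path):
# bulk_insert_csv(
#     "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
#     "DATABASE=RawDB;Trusted_Connection=yes;",
#     "dbo.RawCustomers",
#     r"C:\extracts\20170101_20160901_20161231_Customers.csv",
# )
```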
The first step in the process is to extract the data into some form of raw format.
Raw File Extracts
How the data is pulled from the operational database can vary. However, one of the most consistent approaches is to extract the data into a raw file such as a CSV or XML file. Extracting the data into a raw file has several benefits. First, compared with a direct link that constantly pulls data, a file extract limits the interactions outside systems have with the operational database, which avoids blocking important operational tasks. In addition, it creates a snapshot of what the data looked like at the time of the pull. This can be beneficial if you ever need to reload data because of an issue somewhere in the pipeline.
Having this back-up of the data is also beneficial in case someone accidentally deletes crucial records; then you will be very happy to have raw extracts that are easy to reload. These raw extracts should be tracked in a table that records which files have been loaded and where they exist, so that you can easily automate any reloading. The files should also follow a naming standard.
An easy example would be DateExtracted_DateRangeInTheFile_BusinessObject (e.g. 20170101_20160901_20161231_Customers.csv)
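As a rough illustration, an extract script might apply that naming standard along these lines; the query, the connection string, and the output directory are assumptions for the sake of the sketch.

```python
import csv
import datetime

import pyodbc


def extract_to_csv(conn_str: str, query: str, business_object: str,
                   range_start: str, range_end: str, out_dir: str) -> str:
    """Run a query and write the result set to a CSV named
    DateExtracted_DateRangeInTheFile_BusinessObject.csv."""
    extracted = datetime.date.today().strftime("%Y%m%d")
    file_name = f"{extracted}_{range_start}_{range_end}_{business_object}.csv"
    out_path = f"{out_dir}/{file_name}"

    with pyodbc.connect(conn_str) as conn:
        cursor = conn.execute(query)
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow([col[0] for col in cursor.description])  # header row
            writer.writerows(cursor.fetchall())
    return out_path


# e.g. extract_to_csv(conn_str, "SELECT * FROM dbo.Customers", "Customers",
#                     "20160901", "20161231", "C:/extracts")
# would produce C:/extracts/20170101_20160901_20161231_Customers.csv
# if it ran on 2017-01-01.
```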
Metadatabase And Logging
Tracking what is occurring inside your ETL process is a crucial step in developing an automated, maintainable, and robust system. How extensive your tracking and logging needs to be depends on how large your company is and how many ETLs/ELTs and other data transformations need to occur. Larger companies might have a standardized tool like Airflow to help manage DAGs and logging.
However, if you are a start-up or a non-tech company, a simplified logging system will probably be fine. The purpose of this system is to manage which files need to be loaded and which have been loaded, and to track stored procedure runs, errors, dependencies, and so on.
Having good insight into these aspects of your system will help you maintain it, improve it, and backfill any data that has been missed or loaded incorrectly. This means you need to create at least 3–4 tables that track what is running, how long each run takes, how much computation it uses, and so on. Especially as we push for more and more automated systems, this will be the key to keeping them maintainable.
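To make that concrete, here is a simplified sketch of what a file-tracking metadata table and its logging helpers could look like; the etl.FileLog table name and its columns are illustrative assumptions, not a prescribed schema.

```python
import datetime

import pyodbc

# Illustrative file-tracking table; adapt the schema and names to your needs.
CREATE_FILE_LOG = """
IF OBJECT_ID('etl.FileLog') IS NULL
CREATE TABLE etl.FileLog (
    FileLogId      INT IDENTITY(1,1) PRIMARY KEY,
    FileName       NVARCHAR(260)  NOT NULL,
    FilePath       NVARCHAR(1000) NOT NULL,
    BusinessObject NVARCHAR(100)  NOT NULL,
    RegisteredAt   DATETIME2      NOT NULL,
    LoadedAt       DATETIME2      NULL,
    LoadStatus     NVARCHAR(20)   NOT NULL  -- PENDING, LOADED, FAILED
);
"""


def register_file(conn: pyodbc.Connection, file_name: str,
                  file_path: str, business_object: str) -> None:
    """Insert a PENDING row for a newly extracted file.
    Assumes the connection is in autocommit mode or the caller commits."""
    conn.execute(
        "INSERT INTO etl.FileLog "
        "(FileName, FilePath, BusinessObject, RegisteredAt, LoadStatus) "
        "VALUES (?, ?, ?, ?, 'PENDING')",
        file_name, file_path, business_object, datetime.datetime.utcnow(),
    )


def mark_loaded(conn: pyodbc.Connection, file_name: str,
                status: str = "LOADED") -> None:
    """Flag a file as loaded (or failed) so reloads are easy to automate."""
    conn.execute(
        "UPDATE etl.FileLog SET LoadStatus = ?, LoadedAt = ? WHERE FileName = ?",
        status, datetime.datetime.utcnow(), file_name,
    )
```

Run-level and stored-procedure-level logging tables would follow the same pattern: one row per unit of work, with start time, end time, and status.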
Loading Raw Data Into Raw Tables
Once you have extracted the raw data from the operational databases and updated your metadata table with the new file information, you can start focusing on loading the data.
It can be tempting to add a lot of complex business logic to your initial loads into the raw layer. Doing so can seem to simplify the process and reduce the number of steps needed to load the data, so why add more steps? The truth is that adding too much logic too early can make it difficult to track where errors occur. Did the error occur in the extract, in the business logic, or somewhere else? In addition, it becomes more difficult to add new logic because you have to work out where the best place to put it will be.
All of these extra considerations make the overall system more difficult to maintain. That is why it is important to load the raw data as is. If there is de-duplication logic or mapping that needs to happen, it can happen in the staging portion of the pipeline.
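Putting the pieces together, a raw-load step might look roughly like the sketch below: it reads the pending files from the hypothetical metadata table above, bulk inserts each one as is, and records the outcome, deferring any de-duplication or mapping to staging. The helpers and the dbo.Raw_* naming convention are assumptions carried over from the earlier sketches.

```python
import pyodbc

# Hypothetical helpers from the sketches above:
# bulk_insert_csv(conn_str, table, file_path) and mark_loaded(conn, file_name, status)


def load_pending_files(conn_str: str) -> None:
    """Bulk load every PENDING file into its raw table, with no business logic."""
    with pyodbc.connect(conn_str, autocommit=True) as conn:
        rows = conn.execute(
            "SELECT FileName, FilePath, BusinessObject "
            "FROM etl.FileLog WHERE LoadStatus = 'PENDING'"
        ).fetchall()

        for file_name, file_path, business_object in rows:
            # Assume raw tables mirror the business object, e.g. dbo.Raw_Customers.
            target_table = f"dbo.Raw_{business_object}"
            try:
                bulk_insert_csv(conn_str, target_table, file_path)
                mark_loaded(conn, file_name, "LOADED")
            except pyodbc.Error:
                # Record the failure and move on; fixing and reloading is easy
                # because the raw extract and its log entry still exist.
                mark_loaded(conn, file_name, "FAILED")
```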
The next steps after loading the data into the raw database are QA and loading the data into the staging database. We will continue that discussion in our next post. If you want to learn more about ETL development and automation, sign up for our ETL webinar, where we will discuss the process of creating an ETL from end to end.