In this post, we move the entire data pipeline we’ve built so far onto Docker, a tool that enables us to package the solution and its dependencies into neat little things called container images. This will make the app portable, enable it to run consistently on different machines, and make it easier to operate.
This is the fifth post in my series, Towards Open Options Chains: A Data Pipeline for Collecting Options Data at Scale:
In this post, we’ll be using Docker. I understand that this might be a big jump for some: from running services through the shell to deploying containers. It is a different paradigm, and it requires you to learn a new toolkit. This is why I summarised the launch/teardown code in the penultimate section of Part IV: Building the DAG - it’s for readers who are happy with the pipeline we have built so far and don’t feel the need to venture further. That’s perfectly fine.
But if you’re ready to dive into Docker, make sure you have the prerequisites below and read on.
Let’s start with an example. Suppose that you’ve been collecting Lego for years now. You have accumulated a huge box of Lego bricks, and want to try your hand at creating your own design, say, a house. You’re pretty proficient, so you’re able to quickly assemble a house. With so many bricks collected over the years, you’re not in short supply, and use all manner of bricks - anything that makes the house look good.
You have a few options for the product. You could:

1. Ship a box of assorted bricks with no instructions
2. Ship the instructions only, and let the customer build from their own brick collection
3. Ship the exact bricks needed, together with step-by-step instructions
If you’ve ever built Lego, you’d know that the first two options are out of the question. Option 1 sounds plain silly. We need to know where exactly each brick goes! Option 2 is a slight improvement, but it’s not guaranteed that we can reproduce the product. What if the customer builds a section of the house, only to find that there are insufficient bricks to create a ceiling?
Option 3 is what we expect from a Lego product. We have everything we need to re-assemble the house. It doesn’t matter who is assembling it or where it is shipped from - it works!
Now take your understanding of option 3 - the neat package of everything you need to assemble a Lego design - and apply it to software applications. Option 3 is exactly what container images do: they put all the application code, dependencies, and configurations for your app into a single package. The instructions for assembling a container image are also available, so you can re-build the container from scratch and still end up with the exact image that was distributed. When the app is distributed as a container image and instantiated/run as a container, it works consistently, regardless of who runs it and on which machine, because it contains everything it needs to function.
This is what we plan to do with our data pipeline solution.
If we break our solution down into its components, we see only two:

1. PostgreSQL: the database that stores our options data
2. Apache Airflow: the orchestrator that schedules and runs our collection jobs
The dynamic DAG script we developed does not count as a tool on its own. Although it is a key logical piece of our solution, it is just code that Airflow uses.
To containerize our solution, we will run Postgres and Airflow containers and link them up. Of course, there are intricacies that we’ll need to handle along the way. This post will provide a detailed walkthrough.
First, we set up the folder structure. See the diagram below for a reference.
root
├── dags                        # Contains DAGs to be made accessible to Airflow
│   └── open_options_chains.py  # Our dynamic DAG script
├── db                          # Contains scripts for the Postgres instance
│   └── init-postgres.sh        # Shell script to set up Postgres
├── logs                        # Contains logs from the Airflow instance
├── pgdata                      # Contains data from the Postgres instance
├── scripts                     # Contains scripts to initialise Airflow and run its services
│   ├── init-entrypoint.sh      # Shell script to initialise Airflow
│   ├── scheduler-entrypoint.sh # Shell script to launch scheduler
│   └── webserver-entrypoint.sh # Shell script to launch webserver
├── .env                        # File containing environment variables
└── docker-compose.yml          # File specifying what services to run, along with their configs
There are quite a few items here. We’ll run through the config files first. The folders and their contents will be addressed as we explore the services defined for our app (let’s call it an app for simplicity).
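If you'd rather create this layout programmatically than by hand, a small Python sketch like the one below can scaffold it. The folder names are taken from the tree above; the script itself is just a convenience and not part of the original setup.

```python
from pathlib import Path

# Folders from the project tree above
FOLDERS = ["dags", "db", "logs", "pgdata", "scripts"]

def scaffold(root: str) -> list:
    """Create the project folders under `root` and return their paths."""
    created = []
    for name in FOLDERS:
        path = Path(root) / name
        path.mkdir(parents=True, exist_ok=True)  # no-op if it already exists
        created.append(str(path))
    return created
```

The `.env` and `docker-compose.yml` files are created separately, as described below.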
Docker Compose is a tool for defining and running multi-container Docker applications. In the file (below), we define several services (more on this later). All you need to know for now is that we need to explicitly state how each service should be configured, and which other services it must communicate with. I will reproduce the relevant sections of the configuration file as I describe the services.
version: '3.8'
services:
  postgres:
    image: postgres
    env_file:
      - .env
    volumes:
      - ./db:/docker-entrypoint-initdb.d/
      - ./pgdata:/var/lib/postgresql/data
    ports:
      - "5432:5432"
  airflow-init:
    image: apache/airflow
    entrypoint: ./scripts/init-entrypoint.sh
    env_file:
      - .env
    volumes:
      - ./scripts:/opt/airflow/scripts
    restart: on-failure:10
    depends_on:
      - postgres
  scheduler:
    image: apache/airflow
    entrypoint: ./scripts/scheduler-entrypoint.sh
    restart: on-failure:10
    depends_on:
      - postgres
    env_file:
      - .env
    volumes:
      - ./scripts:/opt/airflow/scripts
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
  webserver:
    image: apache/airflow
    entrypoint: ./scripts/webserver-entrypoint.sh
    restart: on-failure:10
    depends_on:
      - postgres
      - scheduler
    env_file:
      - .env
    volumes:
      - ./scripts:/opt/airflow/scripts
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
    ports:
      - "8081:8080"
This file (`.env`) defines variables that our containers can use. The contents should be as shown below. Remember to input your TD Ameritrade (TDA) API key.
# Postgres
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow
APP_DB_USER=openoptions
APP_DB_PASS=openoptions
APP_DB_NAME=optionsdata
# Airflow
APP_AIRFLOW_USERNAME=admin
APP_AIRFLOW_PASSWORD=password
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres/${POSTGRES_DB}
AIRFLOW__CORE__EXECUTOR=LocalExecutor
# TDA API
API_KEY=<your TDA API key here>
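As a quick sanity check, it can help to see how this KEY=VALUE format is read. The minimal parser below is purely illustrative - Docker Compose does its own parsing, including the `${VAR}` interpolation seen in the `AIRFLOW__CORE__SQL_ALCHEMY_CONN` line above:

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may contain '='
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env
```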
Now, we move on to the key element: services. The services that will be run as part of the app are:

- `postgres`: An instance for the Postgres database
- `airflow-init`: A short-lived instance to initialize Airflow
- `scheduler`: An instance for the Airflow scheduler
- `webserver`: An instance for the Airflow webserver

The lines below define our Postgres instance:
postgres:
  image: postgres
  env_file:
    - .env
  volumes:
    - ./db:/docker-entrypoint-initdb.d/
    - ./pgdata:/var/lib/postgresql/data
  ports:
    - "5432:5432"
- The `image` tag defines which pre-existing image on Docker Hub (see below) the container should be run from. In this case, we're using the official Postgres Docker image.
- The `env_file` tag points to the environment file we created earlier. This allows us to define variables in one place; our containers can then use them.
- The `volumes` tag indicates which local directories are mounted into the container. Any data generated in the specified folder inside the container is persisted in the local directory. For our app, we mount two volumes:
    - `./db` on our local machine, which contains the initialization scripts for the database, to `/docker-entrypoint-initdb.d/` inside the container. Scripts in the latter are run when the container is first launched.
    - `./pgdata` on our local machine, which persists Postgres data, to `/var/lib/postgresql/data` inside the container, where Postgres stores its data.
- The `ports` tag exposes a port to our local machine. The first port number is the port on our local machine, and the second is the port inside the container for the Postgres service.
Note: Docker Hub is an open online repository for creating, managing, and sharing container images with others.
This is a service that initializes Airflow and shuts down. If Airflow was previously initialized, this service will not change the existing settings, and will shut down as well.
airflow-init:
  image: apache/airflow
  entrypoint: ./scripts/init-entrypoint.sh
  env_file:
    - .env
  volumes:
    - ./scripts:/opt/airflow/scripts
  restart: on-failure:10
  depends_on:
    - postgres
- The `image` used is the official Airflow Docker image.
- The `entrypoint` tag specifies commands/executables that will always run when the container is launched. We will examine the script below.
- The `restart` tag specifies when to restart the container. We request that this service restart on failure, up to 10 times.
- The `depends_on` tag expresses the dependency between services, which affects the order in which they are started. The `airflow-init` service will start only after the Postgres instance has. Note that the Postgres service need not be ready for this service to start; this causes problems that our entrypoint script will resolve.
- The `profiles` tag helps to group services together. It comes in handy when we want to choose specific related services to run.
- The `env_file` and `volumes` tags do the same thing as they did for the Postgres service.
The entrypoint script (below) for this service does the following every time the `airflow-init` container is run:
#!/usr/bin/env bash

# Wait for Postgres service to be ready
until PGPASSWORD=$POSTGRES_PASSWORD psql -h "postgres" -U "$POSTGRES_USER" -c '\q'; do
  >&2 echo "PostgreSQL service unavailable. Retrying in 5 seconds..."
  sleep 5
done
>&2 echo "PostgreSQL service started successfully. Initialising Airflow..."

# Initialise database
airflow db init

# Create account
airflow users create -u "$APP_AIRFLOW_USERNAME" -p "$APP_AIRFLOW_PASSWORD" -f Firstname -l Lastname -r Admin -e [email protected]

# Add connection
airflow connections add 'postgres_optionsdata' \
    --conn-type 'postgres' \
    --conn-login "$APP_DB_USER" \
    --conn-password "$APP_DB_PASS" \
    --conn-host 'postgres' \
    --conn-port '5432' \
    --conn-schema "$APP_DB_NAME"
Initialising Airflow entails the following: (1) initialising Airflow's metadata database, (2) creating an admin user account with the credentials from the `.env` file, and (3) adding a connection to our options database.
If this service is run after Airflow has already been initialized, it will not affect the existing settings. After the script has run, there are no further terminal commands and this service will shut down on its own.
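For reference, the `AIRFLOW__CORE__SQL_ALCHEMY_CONN` value in `.env` follows the standard SQLAlchemy URL pattern. A tiny sketch of how it is assembled from the other variables (this helper is illustrative, not part of the pipeline):

```python
def build_conn_uri(user: str, password: str, host: str, db: str) -> str:
    """Assemble a SQLAlchemy-style Postgres connection URI,
    matching the ${VAR} interpolation Docker Compose performs."""
    return f"postgresql+psycopg2://{user}:{password}@{host}/{db}"
```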
The lines below define our Airflow Scheduler service:
scheduler:
  image: apache/airflow
  entrypoint: ./scripts/scheduler-entrypoint.sh
  restart: on-failure:10
  depends_on:
    - postgres
  env_file:
    - .env
  volumes:
    - ./dags:/opt/airflow/dags
    - ./scripts:/opt/airflow/scripts
    - ./logs:/opt/airflow/logs
- The `image` used is the official Airflow Docker image - the same as for the Airflow initialization service defined previously.
- The `entrypoint` tag points to this service's own entrypoint script. We'll discuss it along with the entrypoint script for the Webserver below.
- The `restart`, `depends_on`, `env_file` and `volumes` tags do the same thing as they did for the other services.
- With the `volumes` tag, we mount the DAGs, scripts, and logs folders from our local machine into the Docker container. This allows us to persist our DAGs, write scripts, and monitor logs on our local disk.

The Airflow Webserver service has pretty much the same configuration as the Scheduler service:
webserver:
  image: apache/airflow
  entrypoint: ./scripts/webserver-entrypoint.sh
  restart: on-failure:10
  depends_on:
    - postgres
    - scheduler
  env_file:
    - .env
  volumes:
    - ./scripts:/opt/airflow/scripts
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
  ports:
    - "8081:8080"
The `ports` tag specifies the port on our local machine (8081) that we can use to access the webserver (which listens on 8080 inside the container).
The Scheduler and Webserver services have entrypoint scripts similar to the Airflow initialization service's. Each script waits for Postgres to be ready, then launches its service; the Scheduler's version is shown below (the Webserver's runs `airflow webserver` instead):
#!/usr/bin/env bash

# Wait for Postgres service to be ready
until PGPASSWORD=$POSTGRES_PASSWORD psql -h "postgres" -U "$POSTGRES_USER" -c '\q'; do
  >&2 echo "PostgreSQL service unavailable. Retrying in 5 seconds..."
  sleep 5
done
>&2 echo "PostgreSQL service started successfully. Launching Airflow Scheduler..."

# Launch scheduler
airflow scheduler
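The `until ... psql` loop in these scripts is a generic wait-and-retry pattern. A Python equivalent, as a hypothetical helper rather than anything the pipeline uses, might look like this:

```python
import time

def wait_for_service(check, retries: int = 10, delay: float = 5.0) -> bool:
    """Call `check()` until it returns True, sleeping `delay` seconds
    between failed attempts. Returns False if all attempts fail."""
    for _ in range(retries):
        if check():
            return True
        time.sleep(delay)
    return False
```

In the bash scripts, the role of `check` is played by the `psql ... -c '\q'` probe, which exits non-zero until Postgres accepts connections.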
To allow the containers to write to the mounted volumes, we need to grant read/write permissions on the DAGs and logs folders in our main directory:
chmod -R 777 dags
chmod -R 777 logs
Before we can launch the app proper, we need to initialize Airflow. To do so, run the following in a terminal:
cd <root folder for app>
docker-compose up airflow-init
There will be some activity in the terminal as Docker Compose launches Postgres, and then the Airflow initialization service. Once the initialization is complete, the Airflow initialization service will shut down. However, the Postgres service will continue running. You can stop it with `docker-compose stop` if you want.
We can now launch the app for the first time with the following command:
docker-compose up -d
The `-d` flag runs the containers in detached mode, in the background, which allows you to close your terminal. In a browser, proceed to http://localhost:8081 to access the UI for your Airflow instance.
To shut down the app, ensure that the UI window is closed and use the following command:
docker-compose stop
This will stop the containers without deleting them. Airflow will remain initialized, and when you launch the app again with `docker-compose up`, all your data and settings will be intact.
If instead you happened to do a complete shutdown with `docker-compose down`, don't sweat it. This command stops the running containers, then removes the containers (not the images) and the networks defined for the app. Fortunately, because we persisted the data, scripts, logs, and DAGs on our local disk, re-launching the app with `docker-compose up` (1) will not re-initialize the app, and (2) will keep all your data and settings intact.
To completely reset the app, do a complete shutdown of the app, then delete everything inside the `logs` and `pgdata` folders on your local disk:
# Completely shut down app
docker-compose down
# Remove contents of mounted volumes
rm -rf logs/*
rm -rf pgdata/*
After the full reset, you’ll need to initialize Airflow again before running the app.
There are several ways to extract data from the app. First, we can export the tables to a CSV file. We need to (1) get the ID of the running Postgres container, (2) navigate to a folder for storing the CSV files, and (3) run a command in the container to save the data to a file. For (3), we run `psql -U airflow -d optionsdata -c "..."` inside the Postgres container using `docker exec -t`, and write the output to our target file with `> filename.csv`.
# Check Postgres container ID
docker ps
# Go to folder to store files
cd <directory-to-store-csv-files>
# Extract CSV file
docker exec -t <first-few-characters-of-docker-container> psql -U airflow -d optionsdata -c "COPY table_name to STDOUT WITH CSV HEADER" > "filename.csv"
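Because we asked for `WITH CSV HEADER`, the exported file starts with a header row and can be read back with Python's standard library. The column names below are placeholders, not the actual schema:

```python
import csv
import io

def load_rows(csv_text: str) -> list:
    """Read CSV text with a header row into a list of dicts,
    one per data row, keyed by column name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(reader)
```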
The second way to extract data is to connect directly to the Postgres instance. Notice that we used the `ports` tag in the Docker Compose file to expose a port so we could access Postgres from our host machine. We use `psycopg2` to establish the connection in Python.
import psycopg2 as pg2
# Connect to database
conn = pg2.connect(host='localhost', database='optionsdata', user='airflow', password='airflow', port='5432')
You can then query the data using the connection object `conn` as you would for any other database.
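Since `psycopg2` follows the Python DB-API, a small helper like the one below works with any DB-API connection. The query and parameters are illustrative; note that `psycopg2` uses `%s` placeholders, while other drivers (e.g. `sqlite3`) use `?`:

```python
def fetch_all(conn, query: str, params=()):
    """Run a parameterized read-only query on a DB-API connection
    and return all result rows."""
    cur = conn.cursor()
    try:
        cur.execute(query, params)
        return cur.fetchall()
    finally:
        cur.close()
```

Using parameterized queries (rather than string formatting) avoids SQL injection and quoting bugs.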
And there you have it - a data pipeline solution for collecting options data! In this post, we set up our app comprising a PostgreSQL instance and an Airflow instance using Docker Compose. If you’ve followed the series all the way, that’s awesome - you’d have replicated my solution on your own computer. And if you didn’t, you can always clone my Open Options Chains repository and follow the instructions in the Readme.
Credits for image: Kevin Ku on Unsplash