Towards Open Options Chains Part V: Containerizing the Pipeline by@chrischow


In this post, we move the entire data pipeline we’ve built so far onto Docker, a tool that enables us to package the solution and its dependencies into neat little things called container images. This will make the app portable, enable it to run consistently on different machines, and make it easier to operate.

The Series

This is the fifth post in my series, Towards Open Options Chains: A Data Pipeline for Collecting Options Data at Scale:

  1. Database Design
  2. Foundational ETL Code
  3. Getting Started with Airflow
  4. Building the DAG
  5. Containerizing the Pipeline

Pre-requisites

In this post, we’ll be using Docker. I understand that this might be a big jump for some: from running services through the shell to deploying containers. It is a different paradigm, and requires you to learn a new toolkit. This was why I summarised the launch / teardown code in the penultimate section in Part IV: Building the DAG - it’s for readers who are happy with the pipeline we have built so far and don’t feel the need to venture further. That’s perfectly fine.

But, if you’re ready to dive into Docker, make sure you have the pre-requisites below and read on.

  • Operating system: Linux - I use Ubuntu 20.04 on Windows Subsystem for Linux
  • Software: Docker and Docker-compose

About Containers

Let’s start with an example. Suppose that you’ve been collecting Lego for years now. You have accumulated a huge box of Lego bricks, and want to try your hand at creating your own design, say, a house. You’re pretty proficient, so you’re able to quickly assemble a house. With so many bricks collected over the years, you’re not in short supply, and use all manner of bricks - anything that makes the house look good.

You have a few options for the product. You could:

  1. Provide a rough manual on what goes where, and throw in whatever bricks you thought you used
  2. Provide a detailed manual, and throw in whatever bricks you thought you used
  3. Provide a detailed manual and the exact quantities of the specific bricks you used

If you’ve ever built Lego, you’d know that the first two options are out of the question. Option 1 sounds plain silly. We need to know where exactly each brick goes! Option 2 is a slight improvement, but it’s not guaranteed that we can reproduce the product. What if the customer builds a section of the house, only to find that there are insufficient bricks to create a ceiling?

Option 3 is what we expect from a Lego product. We have everything we need to re-assemble the house. It doesn’t matter who is assembling it or where it is shipped from - it works!

Now take your understanding of option 3 - the neat package of everything you need to assemble a Lego design - and apply it to software applications. Option 3 is exactly what container images do: they package all the application code, dependencies, and configurations for your app into a single unit. The instructions for assembling a container image are also available, so you can re-build the image from scratch and still end up with exactly the image that was distributed. When distributed as a container image and instantiated/run as a container, the app works consistently regardless of who runs it and on which machine, because it contains everything it needs to function.

This is what we plan to do with our data pipeline solution.

Containerized Version of Our Solution

If we break down the solution we’ve built so far into its components, we see only two:

  1. A PostgreSQL database
  2. Airflow

The dynamic DAG script we developed does not count as a tool on its own. Although it is a key logical piece of our solution, it is just code that Airflow uses.

To containerize our solution, we will run Postgres and Airflow containers and link them up. Of course, there are intricacies that we’ll need to handle along the way. This post will provide a detailed walkthrough.

Setting Up

Folder Structure

First, we set up the folder structure. See the diagram below for a reference.

root
├── dags                          # Contains DAGs to be made accessible to Airflow
│   └── open_options_chains.py    # Our dynamic DAG script
├── db                            # Contains scripts for the Postgres instance
│   └── init-postgres.sh          # Shell script to set up Postgres
├── logs                          # Contains logs from the Airflow instance
├── pgdata                        # Contains data from the Postgres instance
├── scripts                       # Contains scripts to initialise Airflow and run its services
│   ├── init-entrypoint.sh        # Shell script to initialise Airflow
│   ├── scheduler-entrypoint.sh   # Shell script to launch scheduler
│   └── webserver-entrypoint.sh   # Shell script to launch webserver
├── .env                          # File containing environment variables
└── docker-compose.yml            # File specifying what services to run, along with the configs

There are quite a few items here. We’ll run through the config files first. The folders and their contents will be addressed as we explore the services defined for our app (let’s call it an app for simplicity).
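If you’re starting from an empty directory, the structure above can be created with a couple of shell commands (the contents of the files come later in this post):

```shell
# Create the project skeleton for the app (run from the root folder)
mkdir -p dags db logs pgdata scripts

# Create empty config files, to be filled in below
touch .env docker-compose.yml
```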

Config: Docker Compose File

Docker Compose is a tool for defining and running multi-container Docker applications. In the file (below), we define several services (more on this later). All you need to know for now is that we need to explicitly state how each service should be configured, and which other services it must communicate with. I will reproduce the relevant sections of the configuration file as I describe the services.

version: '3.8'
services:
    postgres:
        image: postgres
        env_file:
            - .env
        volumes:
            - ./db:/docker-entrypoint-initdb.d/
            - ./pgdata:/var/lib/postgresql/data
        ports:
            - "5432:5432"
    airflow-init:
        image: apache/airflow
        entrypoint: ./scripts/init-entrypoint.sh
        env_file:
            - .env
        volumes:
            - ./scripts:/opt/airflow/scripts
        restart: on-failure:10
        depends_on:
            - postgres
    scheduler:
        image: apache/airflow
        entrypoint: ./scripts/scheduler-entrypoint.sh
        restart: on-failure:10
        depends_on:
            - postgres
        env_file:
            - .env
        volumes:
            - ./scripts:/opt/airflow/scripts
            - ./dags:/opt/airflow/dags
            - ./logs:/opt/airflow/logs
    webserver:
        image: apache/airflow
        entrypoint: ./scripts/webserver-entrypoint.sh
        restart: on-failure:10
        depends_on:
            - postgres
            - scheduler
        env_file:
            - .env
        volumes:
            - ./scripts:/opt/airflow/scripts
            - ./dags:/opt/airflow/dags
            - ./logs:/opt/airflow/logs
        ports:
            - "8081:8080"

Config: Environment File

This file (.env) defines variables that our containers can use. The contents should be as shown below. Do remember to input your TD Ameritrade (TDA) API key.

# Postgres
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow
APP_DB_USER=openoptions
APP_DB_PASS=openoptions
APP_DB_NAME=optionsdata

# Airflow
APP_AIRFLOW_USERNAME=admin
APP_AIRFLOW_PASSWORD=password
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres/${POSTGRES_DB}
AIRFLOW__CORE__EXECUTOR=LocalExecutor

# TDA API
API_KEY=<your TDA API key here>
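A missing variable in .env tends to surface only later as a cryptic container error, so a quick sanity check can save some debugging. The snippet below is a sketch: it parses only plain KEY=VALUE lines (no quoting or ${...} substitution) and checks that the app-specific keys are present.

```python
import os

# App-specific keys used by the services in docker-compose.yml
REQUIRED_KEYS = {
    "POSTGRES_USER", "POSTGRES_PASSWORD", "POSTGRES_DB",
    "APP_DB_USER", "APP_DB_PASS", "APP_DB_NAME",
    "APP_AIRFLOW_USERNAME", "APP_AIRFLOW_PASSWORD", "API_KEY",
}

def parse_env(text):
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def missing_keys(text):
    """Return the required keys that are absent from the .env contents."""
    return REQUIRED_KEYS - set(parse_env(text))

if __name__ == "__main__" and os.path.exists(".env"):
    missing = missing_keys(open(".env").read())
    print("Missing keys:", ", ".join(sorted(missing)) or "none")
```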

Services

Now, we move on to the key element: services. The services that will need to be run as part of the app are:

  1. postgres: An instance for the Postgres database
  2. airflow-init: A short-lived instance to initialize Airflow
  3. scheduler: An instance for the Airflow scheduler
  4. webserver: An instance for the Airflow webserver

Postgres Database

The lines below define our Postgres instance:

postgres:
    image: postgres
    env_file:
        - .env
    volumes:
        - ./db:/docker-entrypoint-initdb.d/
        - ./pgdata:/var/lib/postgresql/data
    ports:
        - "5432:5432"
  • The image tag defines which pre-existing image on Docker Hub (see below) the container should be run on. In this case, we’re using the official Postgres Docker image.
  • The env_file tag points to the environment file we created earlier. This allows us to define variables in one place. Our containers can then use them.
  • The volumes tag indicates which local directories are mounted into the container. Any data that is generated in the specified folder inside the container is persisted in the local directory. For our app, we mount two volumes:
    • ./db on our local machine, which contains the initialization scripts for the database, to /docker-entrypoint-initdb.d/ inside the container. The scripts in the latter are run when the container is launched
    • ./pgdata, on our local machine to persist Postgres data, to /var/lib/postgresql/data inside the container, where Postgres data is stored.
  • The ports tag exposes a port to our local machine. The first port number is the port on our local machine, and the second one is the port inside the container for the Postgres service.

Note: Docker Hub is an open online repository for creating, managing, and sharing container images with others.

Airflow Initialisation

This is a service that initializes Airflow and shuts down. If Airflow was previously initialized, this service will not change the existing settings, and will shut down as well.

airflow-init:
    image: apache/airflow
    entrypoint: ./scripts/init-entrypoint.sh
    env_file:
        - .env
    volumes:
        - ./scripts:/opt/airflow/scripts
    restart: on-failure:10
    depends_on:
        - postgres
  • The pre-existing image used is the official Airflow Docker image.
  • The entrypoint tag specifies commands/executables that will always run when the container is launched. We will examine the script below.
  • The restart tag specifies when to restart the container. We request that this service restart on failure, up to 10 times.
  • The depends_on tag expresses the dependency between services. This affects the order in which services are started. This airflow-init service will start only after the Postgres instance has. Note that the Postgres service need not be ready for this service to start, and this causes problems that our entrypoint script will resolve.
  • The env_file and volumes tags do the same thing as they did for the Postgres service.

The entrypoint script (below) for this service does the following every time the airflow-init container is run:

  1. Attempt to connect to Postgres in a loop
  2. If a connection can’t be established, print a message saying we’ll wait, then wait 5 seconds before trying again
  3. Once a connection is established, Postgres is ready, and we initialize Airflow
#!/usr/bin/env bash
# Wait for Postgres service to be ready
until PGPASSWORD=$POSTGRES_PASSWORD psql -h "postgres" -U "$POSTGRES_USER" -c '\q'; do
  >&2 echo "PostgreSQL service unavailable. Retrying in 5 seconds..."
  sleep 5
done

>&2 echo "PostgreSQL service started successfully. Initialising Airflow..."

# Initialise database
airflow db init

# Create account
airflow users create -u "$APP_AIRFLOW_USERNAME" -p "$APP_AIRFLOW_PASSWORD" -f Firstname -l Lastname -r Admin -e [email protected]

# Add connection (double quotes so the variables are expanded)
airflow connections add 'postgres_optionsdata' \
    --conn-type 'postgres' \
    --conn-login "$APP_DB_USER" \
    --conn-password "$APP_DB_PASS" \
    --conn-host 'postgres' \
    --conn-port '5432' \
    --conn-schema "$APP_DB_NAME"
Initialising Airflow entails the following:

  1. Initialise the Airflow database backend
  2. Create a new user based on the username and password combination specified in the .env file
  3. Add our Postgres connection - something that we performed via the Airflow UI previously

If this service is run after Airflow has already been initialized, it will not affect the existing settings. After the script has run, there are no further terminal commands and this service will shut down on its own.

Airflow Scheduler

The lines below define our Airflow Scheduler service:

scheduler:
    image: apache/airflow
    entrypoint: ./scripts/scheduler-entrypoint.sh
    restart: on-failure:10
    depends_on:
        - postgres
    env_file:
        - .env
    volumes:
        - ./dags:/opt/airflow/dags
        - ./scripts:/opt/airflow/scripts
        - ./logs:/opt/airflow/logs
  • The pre-existing image used is the official Airflow Docker image - the same as the Airflow initialization service defined previously.
  • This service has its own dedicated entrypoint script. We’ll discuss this along with the entrypoint script for the Webserver below.
  • The restart, depends_on, env_file and volumes tags do the same thing as they did for the other services.
  • For the volumes tag, we mount the DAGs, scripts, and logs folders from our local machine to the Docker container. This allows us to persist our DAGs, write scripts, and monitor logs on our local disk.

Airflow Webserver

The Airflow Webserver service has pretty much the same configuration as the Scheduler service:

webserver:
    image: apache/airflow
    entrypoint: ./scripts/webserver-entrypoint.sh
    restart: on-failure:10
    depends_on:
        - postgres
        - scheduler
    env_file:
        - .env
    volumes:
        - ./scripts:/opt/airflow/scripts
        - ./dags:/opt/airflow/dags
        - ./logs:/opt/airflow/logs
    ports:
        - "8081:8080"

The ports tag specifies the port on our local machine (8081) that we can use to access the webserver (8080 inside container).

Entrypoint Script for Scheduler and Webserver

The Scheduler and Webserver services have entrypoint scripts similar to the Airflow initialization service. The scripts do the following:

  1. Attempt to connect to Postgres in a loop
  2. If a connection can’t be established, print a message saying we’ll wait, then wait 5 seconds before trying again
  3. Once a connection is established, Postgres is ready, and we launch the Scheduler or Webserver respectively.
#!/usr/bin/env bash
# Wait for Postgres service to be ready
until PGPASSWORD=$POSTGRES_PASSWORD psql -h "postgres" -U "$POSTGRES_USER" -c '\q'; do
  >&2 echo "PostgreSQL service unavailable. Retrying in 5 seconds..."
  sleep 5
done
  
>&2 echo "PostgreSQL service started successfully. Launching Airflow Scheduler..."

# Launch scheduler
airflow scheduler

Providing Read and Write Access

To enable data to be written to the mounted volumes, we need to enable read/write permissions on the DAGs and logs folders in our main directory:

chmod -R 777 dags
chmod -R 777 logs

Operating the App

Setup

Before we can launch the app proper, we need to initialize Airflow. To do so, run the following in a terminal:

cd <root folder for app>
docker-compose up airflow-init

There will be some activity in the terminal as Docker Compose launches Postgres, and then the Airflow initialization service. Once the initialization is complete, the Airflow initialization service will shut down. However, the Postgres service will continue running. You can stop it with docker-compose stop if you want.

Launching the App

We can now launch the app for the first time with the following command:

docker-compose up -d

The -d flag is to run the containers in detached mode, in the background. This allows you to close your terminal. In a browser, proceed to http://localhost:8081 to access the UI for your Airflow instance.

Shutting Down the App

To shut down the app, ensure that the UI window is closed and use the following command:

docker-compose stop

This will stop the containers, and will not delete them. Airflow will remain initialized, and when you launch it again with docker-compose up, all your data and settings will be intact.

If you instead did a complete shutdown with docker-compose down, don’t sweat it. This command will stop the running containers, then remove the containers (not the images) and the networks defined for the app. Fortunately, because we persisted the data, scripts, logs, and DAGs on our local disk, re-launching the app with docker-compose up (1) will not re-initialize the app, and (2) will keep all your data and settings intact.

Full Reset - Take Caution!

To completely reset the app, do a complete shut down of the app, then delete everything inside the logs and pgdata folders on your local disk:

# Completely shut down app
docker-compose down

# Remove contents of mounted volumes
rm -rf logs/*
rm -rf pgdata/*

After the full reset, you’ll need to initialize Airflow again before running the app.

Extracting the Data

There are several ways to extract data from the app. First, we can export the tables to a CSV file. We would need to (1) extract the ID of the running Postgres container, (2) navigate to a folder for storing the CSV files, and (3) run a command in the container to save the data to a file. For (3), we run a command psql -U airflow -d optionsdata -c "..." inside the Postgres container using docker exec -t, and write the output to our target file ... > filename.csv.

# Check Postgres container ID
docker ps

# Go to folder to store files
cd <directory-to-store-csv-files>

# Extract CSV file
docker exec -t <first-few-characters-of-docker-container> psql -U airflow -d optionsdata -c "COPY table_name to STDOUT WITH CSV HEADER" > "filename.csv"
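Once exported, the file can be inspected with any CSV tool. As a quick check - a sketch, where filename.csv stands in for whatever file you exported above - Python’s built-in csv module can preview the header and count the rows:

```python
import csv
import io
import os

def summarize_csv(text):
    """Return (header, number of data rows) for CSV content."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])
    n_rows = sum(1 for _ in reader)
    return header, n_rows

if __name__ == "__main__" and os.path.exists("filename.csv"):
    with open("filename.csv", newline="") as f:
        header, n_rows = summarize_csv(f.read())
    print(f"{len(header)} columns, {n_rows} rows")
```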

The second way to extract data is to connect directly to the Postgres instance. Notice that we used the ports tag in the Docker Compose file to expose a port so we could access Postgres from our host machine. We use psycopg2 to establish the connection in Python.

import psycopg2 as pg2

# Connect to database
conn = pg2.connect(host='localhost', database='optionsdata', user='airflow', password='airflow', port='5432')

You can then query the data using the connection object conn as you would for any other database.
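For instance, a small helper can run a query through the connection and return the rows. This is a sketch: options_aapl is a hypothetical table name, so substitute one of the tables you created in Part I.

```python
def fetch_rows(conn, query, params=None):
    """Run a query on a psycopg2-style connection and return all rows."""
    with conn.cursor() as cur:
        cur.execute(query, params)
        return cur.fetchall()

# Usage with the conn object created above (options_aapl is hypothetical):
# rows = fetch_rows(conn, "SELECT * FROM options_aapl LIMIT 10")
```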

Summary

And there you have it - a data pipeline solution for collecting options data! In this post, we set up our app comprising a PostgreSQL instance and an Airflow instance using Docker Compose. If you’ve followed the series all the way, that’s awesome - you’d have replicated my solution on your own computer. And if you didn’t, you can always clone my Open Options Chains repository and follow the instructions in the Readme.


Credits for image: Kevin Ku on Unsplash

