In this post, we move the entire data pipeline we've built so far onto Docker, a tool that enables us to package the solution and its dependencies into neat units called container images. This will make the app portable, enable it to run consistently on different machines, and make it easier to operate.

## The Series

This is the fifth post in my series, *Towards Open Options Chains: A Data Pipeline for Collecting Options Data at Scale*:

1. Database Design
2. Foundational ETL Code
3. Getting Started with Airflow
4. Building the DAG
5. Containerizing the Pipeline

## Pre-requisites

In this post, we'll be using Docker. I understand that this might be a big jump for some: from running services through the shell to deploying containers. It is a different paradigm, and it requires you to learn a new toolkit. This is why I summarised the launch/teardown code in the penultimate section of Part IV: Building the DAG - it's for readers who are happy with the pipeline we have built so far and don't feel the need to venture further. That's perfectly fine.

But if you're ready to dive into Docker, make sure you have the pre-requisites below and read on.

- Operating system: Linux - I use Ubuntu 20.04 on Windows Subsystem for Linux
- Software: Docker and Docker Compose

## About Containers

Let's start with an example. Suppose you've been collecting Lego for years. You have accumulated a huge box of Lego bricks, and you want to try your hand at creating your own design - say, a house. You're pretty proficient, so you're able to quickly assemble a house. With so many bricks collected over the years, you're not in short supply, and you use all manner of bricks - anything that makes the house look good. Now suppose you want to share the design with others. You have a few options for the product.
You could:

1. Provide a rough manual on what goes where, and throw in whatever bricks you thought you used
2. Provide a detailed manual, and throw in whatever bricks you thought you used
3. Provide a detailed manual and the exact quantities of the specific bricks you used

If you've ever built Lego, you'd know that the first two options are out of the question. Option 1 sounds plain silly - we need to know exactly where each brick goes! Option 2 is a slight improvement, but there's no guarantee that the product can be reproduced. What if the customer builds a section of the house, only to find that there are insufficient bricks to create a ceiling?

Option 3 is what we expect from a Lego product. We have everything we need to re-assemble the house. It doesn't matter who is assembling it or where it is shipped from - it works!

Now take your understanding of option 3 - the neat package of everything you need to assemble a Lego design - and apply it to software applications. Option 3 is exactly what container images do: they put all the application code, dependencies, and configurations for your app into a single package. The instructions for assembling a container image are also available for you to re-build it from scratch, and you would still end up with the exact image that was distributed. When distributed as a container image and instantiated/run as a container, the app works consistently, regardless of who runs it and on which machine, because it contains everything it needs to function.

This is what we plan to do with our data pipeline solution.

## Containerized Version of Our Solution

If we break our solution down into its components, we see only two:

1. A PostgreSQL database
2. Airflow

The dynamic DAG script we developed does not count as a tool on its own. Although it is a key logical piece of our solution, it is just code that Airflow uses. To containerize our solution, we will run Postgres and Airflow containers and link them up.
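To make the "detailed manual plus exact bricks" idea concrete, this is what image-build instructions look like in practice: a Dockerfile. The one below is purely illustrative - in this post we won't write a Dockerfile ourselves, because we rely on pre-built official images - but every image we pull was built from instructions like these (file names and versions here are made up for the example):

```dockerfile
# The "manual" for building a container image.
# Base image: the foundation bricks we start from, pinned to an exact version
FROM python:3.9-slim

# Dependencies: exact packages and versions, like exact brick counts
COPY requirements.txt .
RUN pip install -r requirements.txt

# The application code itself
COPY app/ /app/

# How to run the assembled product
CMD ["python", "/app/main.py"]
```

Anyone who runs `docker build` on this file ends up with the same image, which is precisely the reproducibility property the Lego analogy describes.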
Of course, there are intricacies that we'll need to handle along the way. This post will provide a detailed walkthrough.

## Setting Up the Folder Structure

First, we set up the folder structure. See the diagram below for reference.

```
root
├── dags                        # Contains DAGs to be made accessible to Airflow
│   └── open_options_chains.py  # Our dynamic DAG script
├── db                          # Contains scripts for the Postgres instance
│   └── init-postgres.sh        # Shell script to set up Postgres
├── logs                        # Contains logs from the Airflow instance
├── pgdata                      # Contains data from the Postgres instance
├── scripts                     # Contains scripts to initialise Airflow and run its services
│   ├── init-entrypoint.sh      # Shell script to initialise Airflow
│   ├── scheduler-entrypoint.sh # Shell script to launch the scheduler
│   └── webserver-entrypoint.sh # Shell script to launch the webserver
├── .env                        # File containing environment variables
└── docker-compose.yml          # File specifying the services to run, along with their configs
```

There are quite a few items here. We'll run through the config files first. The folders and their contents will be addressed as we explore the services defined for our app (let's call it an app for simplicity).

## Config: Docker Compose File

Docker Compose is a tool for defining and running multi-container Docker applications. In the `docker-compose.yml` file (below), we define several services (more on this later). All you need to know for now is that we must explicitly state how each service should be configured, and which other services it needs to communicate with. I will reproduce the relevant sections of the configuration file as I describe the services.
```yaml
version: '3.8'
services:
  postgres:
    image: postgres
    env_file:
      - .env
    volumes:
      - ./db:/docker-entrypoint-initdb.d/
      - ./pgdata:/var/lib/postgresql/data
    ports:
      - "5432:5432"
  airflow-init:
    image: apache/airflow
    entrypoint: ./scripts/init-entrypoint.sh
    env_file:
      - .env
    volumes:
      - ./scripts:/opt/airflow/scripts
    restart: on-failure:10
    depends_on:
      - postgres
  scheduler:
    image: apache/airflow
    entrypoint: ./scripts/scheduler-entrypoint.sh
    restart: on-failure:10
    depends_on:
      - postgres
    env_file:
      - .env
    volumes:
      - ./scripts:/opt/airflow/scripts
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
  webserver:
    image: apache/airflow
    entrypoint: ./scripts/webserver-entrypoint.sh
    restart: on-failure:10
    depends_on:
      - postgres
      - scheduler
    env_file:
      - .env
    volumes:
      - ./scripts:/opt/airflow/scripts
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
    ports:
      - "8081:8080"
```

## Config: Environment File

This file (`.env`) defines variables that our containers can use. The contents should be as shown below. Do remember to input your TD Ameritrade (TDA) API key.

```shell
# Postgres
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflow
POSTGRES_DB=airflow
APP_DB_USER=openoptions
APP_DB_PASS=openoptions
APP_DB_NAME=optionsdata

# Airflow
APP_AIRFLOW_USERNAME=admin
APP_AIRFLOW_PASSWORD=password
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres/${POSTGRES_DB}
AIRFLOW__CORE__EXECUTOR=LocalExecutor

# TDA API
API_KEY=<your TDA API key here>
```

## Services

Now, we move on to the key element: services.
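Before looking at each service, one note on the `${VAR}` references in `.env` above: Docker Compose resolves them before the values reach the containers, so Airflow sees a fully expanded connection string. A rough Python illustration of that substitution (using the standard library's `string.Template`, which happens to share the `${VAR}` syntax - this is an analogy, not how Compose is implemented):

```python
from string import Template

# Values copied from the .env file above
env = {
    "POSTGRES_USER": "airflow",
    "POSTGRES_PASSWORD": "airflow",
    "POSTGRES_DB": "airflow",
}

# The raw value of AIRFLOW__CORE__SQL_ALCHEMY_CONN before interpolation
raw = "postgresql+psycopg2://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres/${POSTGRES_DB}"

# Docker Compose performs an equivalent substitution when reading .env
conn = Template(raw).substitute(env)
print(conn)  # postgresql+psycopg2://airflow:airflow@postgres/airflow
```

Note that the host is `postgres` - the service name, which Docker Compose makes resolvable as a hostname on the app's internal network.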
The services that will need to be run as part of the app are:

- `postgres`: An instance for the Postgres database
- `airflow-init`: A short-lived instance to initialize Airflow
- `scheduler`: An instance for the Airflow scheduler
- `webserver`: An instance for the Airflow webserver

### Postgres Database

The lines below define our Postgres instance:

```yaml
postgres:
  image: postgres
  env_file:
    - .env
  volumes:
    - ./db:/docker-entrypoint-initdb.d/
    - ./pgdata:/var/lib/postgresql/data
  ports:
    - "5432:5432"
```

The `image` tag defines which pre-existing image on Docker Hub (see the note below) the container should be run from. In this case, we're using the official Postgres Docker image.

The `env_file` tag points to the environment file we created earlier. This allows us to define variables in one place; our containers can then use them.

The `volumes` tag indicates which local directories are mounted into the container. Any data that is generated in the specified folder inside the container is persisted in the local directory. For our app, we mount two volumes:

- `./db` on our local machine, which contains the initialization scripts for the database, to `/docker-entrypoint-initdb.d/` inside the container. The scripts in the latter are run when the container is first launched.
- `./pgdata` on our local machine, to persist Postgres data, to `/var/lib/postgresql/data` inside the container, where Postgres stores its data.

The `ports` tag exposes a port to our local machine. The first port number is the port on our local machine, and the second is the port inside the container for the Postgres service.

Note: Docker Hub is an open online repository for creating, managing, and sharing container images with others.

### Airflow Initialisation

This is a service that initializes Airflow and then shuts down. If Airflow was previously initialized, this service will not change the existing settings, and will shut down as well.
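Before moving on to the Airflow services, a word on `db/init-postgres.sh`, whose contents aren't reproduced here. The official Postgres image runs any script placed in `/docker-entrypoint-initdb.d/` on the first start of an empty data directory, so a script like the sketch below could create the app's user and database from the `APP_DB_*` variables in `.env`. This is a hedged illustration, not the author's exact script - the SQL is written to a file here rather than piped to a live server:

```shell
#!/usr/bin/env bash
# Illustrative sketch of db/init-postgres.sh: create the app user and database.
set -e

# Fallback values mirror .env; inside the container they arrive via env_file
APP_DB_USER="${APP_DB_USER:-openoptions}"
APP_DB_PASS="${APP_DB_PASS:-openoptions}"
APP_DB_NAME="${APP_DB_NAME:-optionsdata}"

# Generate the SQL. In the real container script you would pipe this
# straight into: psql -U "$POSTGRES_USER"
cat <<EOF | tee /tmp/init-app-db.sql
CREATE USER ${APP_DB_USER} WITH PASSWORD '${APP_DB_PASS}';
CREATE DATABASE ${APP_DB_NAME} OWNER ${APP_DB_USER};
GRANT ALL PRIVILEGES ON DATABASE ${APP_DB_NAME} TO ${APP_DB_USER};
EOF
```

Because the data directory is mounted at `./pgdata`, this hook only fires on a brand-new instance - which is also why a "full reset" (covered later) requires emptying `pgdata`.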
The lines below define the `airflow-init` service:

```yaml
airflow-init:
  image: apache/airflow
  entrypoint: ./scripts/init-entrypoint.sh
  env_file:
    - .env
  volumes:
    - ./scripts:/opt/airflow/scripts
  restart: on-failure:10
  depends_on:
    - postgres
```

The pre-existing `image` used is the official Airflow Docker image.

The `entrypoint` tag specifies commands/executables that will always run when the container is launched. We will examine the script below.

The `restart` tag specifies when to restart the container. We request that this service restart on failure, up to 10 times.

The `depends_on` tag expresses the dependency between services. This affects the order in which services are started. The `airflow-init` service will start only after the Postgres instance has. Note that the Postgres service need not be ready for this service to start, and this causes problems that our entrypoint script will resolve.

(Not used in our file, but worth knowing: a `profiles` tag helps to group services together, which comes in handy when you want to run only a chosen subset of related services.)

The `env_file` and `volumes` tags do the same thing as they did for the Postgres service.

The entrypoint script (below) for the `airflow-init` service does the following every time the container is run:

1. Run a loop attempting to connect to Postgres
2. Check if a connection can be established
3. If a connection can't be established, print a message saying we'll wait, and wait 5 seconds before trying again
4. If a connection is established, it means that Postgres is ready, and we initialize Airflow

```shell
#!/usr/bin/env bash

# Wait for Postgres service to be ready
until PGPASSWORD=$POSTGRES_PASSWORD psql -h "postgres" -U "$POSTGRES_USER" -c '\q'; do
  >&2 echo "PostgreSQL service unavailable. Retrying in 5 seconds..."
  sleep 5
done

>&2 echo "PostgreSQL service started successfully. Initialising Airflow..."

# Initialise database
airflow db init

# Create account
airflow users create -u "$APP_AIRFLOW_USERNAME" -p "$APP_AIRFLOW_PASSWORD" \
    -f Firstname -l Lastname -r Admin -e admin@airflow.com

# Add connection (note the double quotes, so the variables are expanded)
airflow connections add 'postgres_optionsdata' \
    --conn-type 'postgres' \
    --conn-login "$APP_DB_USER" \
    --conn-password "$APP_DB_PASS" \
    --conn-host 'postgres' \
    --conn-port '5432' \
    --conn-schema "$APP_DB_NAME"
```

Initialising Airflow entails the following:

1. Initialise the Airflow database backend
2. Create a new user based on the username and password combination specified in the `.env` file
3. Add our Postgres connection - something that we previously performed via the Airflow UI

If this service is run after Airflow has already been initialized, it will not affect the existing settings. After the script has run, there are no further terminal commands, and the service will shut down on its own.

### Airflow Scheduler

The lines below define our Airflow Scheduler service:

```yaml
scheduler:
  image: apache/airflow
  entrypoint: ./scripts/scheduler-entrypoint.sh
  restart: on-failure:10
  depends_on:
    - postgres
  env_file:
    - .env
  volumes:
    - ./scripts:/opt/airflow/scripts
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
```

The pre-existing `image` used is the official Airflow Docker image - the same as for the Airflow initialization service defined previously.

This service has its own dedicated `entrypoint` script. We'll discuss it along with the entrypoint script for the Webserver below.

The `restart`, `depends_on`, and `env_file` tags do the same thing as they did for the other services.

For the `volumes` tag, we mount the DAGs, scripts, and logs folders from our local machine into the Docker container. This allows us to persist our DAGs, write scripts, and monitor logs on our local disk.
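One shell subtlety worth flagging in entrypoint scripts like `init-entrypoint.sh` above: in bash, single quotes suppress variable expansion, while double quotes allow it. Passing credentials such as `$APP_DB_USER` inside single quotes would hand Airflow the literal string `$APP_DB_USER` rather than its value. A quick demonstration:

```shell
APP_DB_USER=openoptions

# Single quotes: the variable is NOT expanded
echo 'login is $APP_DB_USER'    # prints: login is $APP_DB_USER

# Double quotes: the variable IS expanded
echo "login is $APP_DB_USER"    # prints: login is openoptions
```

Literal values like `'postgres'` or `'5432'` are fine in single quotes; only values that come from `.env` need the double-quoted form.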
### Airflow Webserver

The Airflow Webserver service has pretty much the same configuration as the Scheduler service:

```yaml
webserver:
  image: apache/airflow
  entrypoint: ./scripts/webserver-entrypoint.sh
  restart: on-failure:10
  depends_on:
    - postgres
    - scheduler
  env_file:
    - .env
  volumes:
    - ./scripts:/opt/airflow/scripts
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
  ports:
    - "8081:8080"
```

The `ports` tag specifies the port on our local machine (8081) that we can use to access the webserver (port 8080 inside the container).

### Entrypoint Script for Scheduler and Webserver

The Scheduler and Webserver services have entrypoint scripts similar to the Airflow initialization service's. The scripts do the following:

1. Run a loop attempting to connect to Postgres
2. Check if a connection can be established
3. If a connection can't be established, print a message saying we'll wait, and wait 5 seconds before trying again
4. If a connection is established, it means that Postgres is ready, and we launch the Scheduler or Webserver respectively

```shell
#!/usr/bin/env bash

# Wait for Postgres service to be ready
until PGPASSWORD=$POSTGRES_PASSWORD psql -h "postgres" -U "$POSTGRES_USER" -c '\q'; do
  >&2 echo "PostgreSQL service unavailable. Retrying in 5 seconds..."
  sleep 5
done

>&2 echo "PostgreSQL service started successfully. Launching Airflow Scheduler..."

# Launch scheduler
airflow scheduler
```

## Providing Read and Write Access

To enable data to be written to the mounted volumes, we need to enable read/write permissions on the DAGs and logs folders in our main directory:

```shell
chmod -R 777 dags
chmod -R 777 logs
```

## Operating the App

### Setup

Before we can launch the app proper, we need to initialize Airflow. To do so, run the following in a terminal:

```shell
cd <root folder for app>
docker-compose up airflow-init
```

There will be some activity in the terminal as Docker Compose launches Postgres, and then the Airflow initialization service. Once the initialization is complete, the Airflow initialization service will shut down.
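As an aside: the "wait for Postgres" loop used by all three Airflow services is a generic readiness-check pattern that is useful well beyond this app. A minimal Python sketch of the same idea follows - the function name and parameters are my own, not from the article's scripts:

```python
import socket
import time

def wait_for_port(host: str, port: int, retries: int = 10, delay: float = 5.0) -> bool:
    """Return True once a TCP connection to host:port succeeds,
    or False after exhausting all retries."""
    for _ in range(retries):
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(delay)
    return False

# Example: check a port we know is open, by listening on it ourselves
server = socket.socket()
server.bind(("127.0.0.1", 0))   # port 0 = let the OS pick a free port
server.listen(5)
free_port = server.getsockname()[1]
print(wait_for_port("127.0.0.1", free_port, retries=1, delay=0))  # True
```

Note that our entrypoint scripts use `psql` rather than a raw TCP probe, which is stricter in a good way: it confirms Postgres is actually accepting authenticated connections, not merely that the port is open.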
However, the Postgres service will continue running. You can stop it with `docker-compose stop` if you want.

### Launching the App

We can now launch the app for the first time with the following command:

```shell
docker-compose up -d
```

The `-d` flag runs the containers in detached mode, in the background. This allows you to close your terminal. In a browser, proceed to http://localhost:8081 to access the UI for your Airflow instance.

### Shutting Down the App

To shut down the app, ensure that the UI window is closed and use the following command:

```shell
docker-compose stop
```

This will stop the containers, and will not delete them. Airflow will remain initialized, and when you launch it again with `docker-compose up`, all your data and settings will be intact.

If you instead happened to do a complete shutdown with `docker-compose down`, don't sweat it. This command stops the running containers, removes the containers (not the images), and removes the networks defined for the app. Fortunately, because we persisted the data, scripts, logs, and DAGs on our local disk, re-launching the app with `docker-compose up` (1) will not re-initialize the app, and (2) will keep all your data and settings intact.

### Full Reset - Take Caution!

To completely reset the app, do a complete shutdown of the app, then delete everything inside the `logs` and `pgdata` folders on your local disk:

```shell
# Completely shut down app
docker-compose down

# Remove contents of mounted volumes
rm -rf logs/*
rm -rf pgdata/*
```

After the full reset, you'll need to initialize Airflow again before running the app.

## Extracting the Data

There are several ways to extract data from the app. First, we can export the tables to a CSV file. We need to (1) extract the ID of the running Postgres container, (2) navigate to a folder for storing the CSV files, and (3) run a command in the container to save the data to a file. For (3), we run a `psql -U airflow -d optionsdata -c "..."` command inside the Postgres container using `docker exec -t`, and write the output to our target file `filename.csv`.

```shell
# Check Postgres container ID
docker ps

# Go to folder to store files
cd <directory-to-store-csv-files>

# Extract CSV file
docker exec -t <first-few-characters-of-docker-container-id> psql -U airflow -d optionsdata -c "COPY table_name TO STDOUT WITH CSV HEADER" > "filename.csv"
```

The second way to extract data is to connect directly to the Postgres instance. Notice that we used the `ports` tag in the Docker Compose file to expose a port so we could access Postgres from our host machine. We use `psycopg2` to establish the connection in Python:

```python
import psycopg2 as pg2

# Connect to database
conn = pg2.connect(host='localhost', database='optionsdata',
                   user='airflow', password='airflow', port='5432')
```

You can then query the data using the `conn` connection object as you would for any other database.

## Summary

And there you have it - a data pipeline solution for collecting options data! In this post, we set up our app, comprising a PostgreSQL instance and an Airflow instance, using Docker Compose. If you've followed the series all the way, that's awesome - you'll have replicated my solution on your own computer. And if you haven't, you can always clone my Open Options Chains repository and follow the instructions in the Readme.

Credits for image: Kevin Ku on Unsplash