The benefits that come with using Docker containers are well known: they provide consistent and isolated environments so that applications can be deployed anywhere - locally, in dev / testing / prod environments, across all cloud providers, and on-premise - in a repeatable way.
The software engineering world has fully adopted Docker, and a lot of tools around Docker have changed the way we build and deploy software - testing, CI/CD, dependency management, versioning, monitoring, security. The popularity of Kubernetes as the new standard for container orchestration and infrastructure management follows from the popularity of Docker.
In the big data and Apache Spark scene, most applications still run directly on virtual machines without benefiting from containerization. The dominant cluster manager for Spark, Hadoop YARN, did not support Docker containers until recently (Hadoop 3.1 release), and even today the support for Docker remains “experimental and non-complete”.
Native support for Docker is in fact one of the main reasons companies choose to deploy Spark on top of Kubernetes instead of YARN.
In this article, we will illustrate the benefits of Docker for Apache Spark by going through the end-to-end development cycle used by many of our users at Data Mechanics.
1. Build your dependencies once, run everywhere
Your application will run the same way wherever you run it: on your laptop for dev / testing, or in any of your production environments.
In your Docker image, you can:
2. Make Spark more reliable and cost-effective in production.
Since you can control your entire environment when you build your Docker image, this means you don’t need any init scripts / bootstrap actions to execute at runtime.
These scripts have several issues:
3. Speed up your iteration cycle.
In Data Mechanics (a managed Spark-on-Kubernetes platform) our users end-to-end iteration cycle with Spark takes less than 30 seconds, from the time they make a code change in their IDE to the time this change runs as a Spark app on our platform.
This is mostly thanks to the fact that Docker caches previously built layers and that Kubernetes is really fast at starting / restarting Docker containers.
In this section, we’ll show you how to work with Spark and Docker, step-by-step. Example screenshots and code samples are taken from running a PySpark application on the Data Mechanics platform, but this example can be simply adapted to work on other environments.
1. Building the Docker Image
We’ll start from a local PySpark project with some dependencies, and a Dockerfile that will explain how to build a Docker image for this project. Once the Docker image is built, we can directly run it locally, or push to a Docker registry first then run it on the Data Mechanics cluster.
Thanks to Docker, there's no need to package third-party dependencies and custom modules in brittle zip files. You can just co-locate your main script and its dependencies.
In this example, we adopt a layout for our project that should feel familiar to most Python developers: a main script, a requirements file, and custom modules.
Let’s look at the Dockerfile. In this example:
2. Local Run
You can then build this image and run it locally. This means Spark will run in local mode; as a single container on your laptop. You will not be able to process large amounts of data, but this is useful if you just want to test your code correctness (maybe using a small subset of the real data), or run unit tests.
On an average laptop, this took 15 seconds. If you’re going to iterate a lot at this stage and want to optimize the speed further; then instead of re-building the image, you can simply mount your local files (from your working directory on your laptop) to your working directory in the image.
3. Run at Scale
The code looks correct, and we now want to run this image at scale (in a distributed way) on the full dataset. We’re going to push the image to our Docker registry and then provide the image name as a parameter in the REST API call to submit this Spark application to Data Mechanics.
It’s important to have a fast iteration cycle at this stage too - even though you ran the image locally - as you’ll need to verify the correctness and stability of your pipeline on the full dataset. It’s likely you will need to keep iterating on your code and on your infrastructure configurations.
On an average laptop, this step took 20 seconds to complete: 10 seconds to build and push the image, 10 seconds for the app to start on Data Mechanics - meaning the Spark driver starts executing our main.py code.
In less than 30 seconds, you’re able to iterate on your Spark code and run it on production data! And you can do this without leaving your IDE (you don’t need to paste your code to a notebook or an interactive shell). This is a game-changer for Spark developers, a 10x speed-up compared to industry average.
"I really love that we can use Docker to package our code on Data Mechanics. It’s a huge win for simplifying PySpark deployment. Based on the iterations I have completed it takes less than 30 seconds to deploy and run the app" - Max, Data Engineer at Weather2020.
This fast iteration cycle is thanks to Docker caching the previous layers of the image, and the Data Mechanics platform (deployed on Kubernetes) optimizing fast container launches and application restarts.
To run a Spark application on Data Mechanics, you should use one of our published Spark images as a base image (in the first line of your Dockerfile). We publish new images regularly, including whenever a new Spark version is released or when a security fix is available.
These Spark images contain:
They are freely available to everyone, and you can also use them to run Spark locally or on other platforms.
We’ve demonstrated how Docker can help you:
Running Spark on Kubernetes in a Docker native and cloud-agnostic way has many other benefits. For example, you can use the steps above to build a CI/CD pipeline that automates the building and deployment of Spark test applications whenever you submit a pull request to modify one of your production pipelines. We will cover this in a future blog post - so stay tuned!
Previously published at https://www.datamechanics.co/blog-post/spark-and-docker-your-spark-development-cycle-just-got-ten-times-faster