Tal Peretz

@talperetz24

Scaling Effectively: when Kubernetes met Celery

This is a story about software architecture, about a personal itch, and about scalability. And like any good tech story, it begins with a shaky architecture.

At Panorays, we help large enterprises measure the security posture of their suppliers. But I’m not going to get into the whole third-party security management extravaganza with you. We came to talk about our architecture and process.

In the beginning, there was Bash.
And scripts to manage VMs. A lot of scripts.

  • There was a VM instance for each company we assessed.
  • Every VM executed sequential batch jobs that imitate the whole reconnaissance phase of the hacker’s lifecycle.
  • Company-level parallelism was achieved by firing up more VMs.
  • We built an internal orchestration system with Cron and Bash (imagine how fun that was…).

Problems:

  • The parallelism was at the company level, and not at the job level.
  • The process wasn’t transparent.
  • Server utilization was low.
  • Every run was triggered manually.

The Rise of The Transporter

The Transporter is a dynamic workflow engine, built to create workflows and execute them as Kubernetes Jobs.

A container-based architecture makes The Transporter both flexible enough to configure jobs separately and efficient enough to scale.

It favors parallelism when possible, according to the workflow dependencies, and provides a REST API for a fully automated pipeline.

Standing on the Shoulders of Giants

The Transporter’s API is triggered automatically whenever a new company is added to the platform. The Transporter then deploys the jobs to Kubernetes in parallel or sequentially, according to a predefined workflow.

Overview

As with the original transporter, The Transporter follows a few simple rules:

The Rules

The Deal is the Deal

The Transporter’s deal is simple: you define the workflow, and The Transporter makes it happen.
But in order to define a workflow, we first need to define a job.

In our case, a job is the equivalent of running a Docker container.
A group of these jobs is a phase, and a phase can be sequential or parallel.
A workflow is a sequence of phases.
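
To make the vocabulary concrete, here is a minimal Python sketch of the job / phase / workflow model. The class names, the `parallel` flag, the image names, and the `execution_steps` helper are all assumptions made for illustration; they are not The Transporter's actual data model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Job:
    """One unit of work: running a single Docker image."""
    name: str
    image: str


@dataclass(frozen=True)
class Phase:
    """A group of jobs that run either one by one or all at once."""
    jobs: Tuple[Job, ...]
    parallel: bool = False


def execution_steps(workflow: List[Phase]) -> List[List[str]]:
    """Flatten a workflow into ordered steps.

    Each step is a list of job names that may run at the same time;
    the steps themselves run strictly one after another.
    """
    steps: List[List[str]] = []
    for phase in workflow:
        if phase.parallel:
            steps.append([job.name for job in phase.jobs])
        else:
            steps.extend([job.name] for job in phase.jobs)
    return steps


# Hypothetical workflow: one sequential phase followed by one parallel phase.
workflow = [
    Phase(jobs=(Job("discover-assets", "example/discover:1"),)),
    Phase(
        jobs=(Job("scan-ports", "example/ports:1"),
              Job("check-tls", "example/tls:1")),
        parallel=True,
    ),
]
print(execution_steps(workflow))
```

With this shape, the two scanning jobs land in the same step and can run side by side, while the discovery job always finishes first.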

Now we can enjoy parallelism while still following some rules.

Workflow Example

Never Make a Promise You Can’t Keep

The Transporter leverages a distributed task queue architecture.
In this architecture, tasks get transported to queues, and workers consume the tasks from the queues and perform them.
This architecture makes it possible to retry a failed task,
set a timeout, set a priority, and schedule tasks for later.
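
The queue-and-worker pattern can be sketched with nothing but the Python standard library. This is an in-process stand-in rather than a distributed queue: Celery applies the same idea across machines through a message broker. The retry budget, the `flaky_job` function, and the job names are hypothetical, and timeouts, priorities, and scheduling are left out to keep the sketch short.

```python
import queue
import threading

MAX_RETRIES = 2  # assumed retry budget, not a value from the post


def flaky_job(name, attempts_seen):
    """Hypothetical job that fails on its first attempt, then succeeds."""
    attempts_seen[name] = attempts_seen.get(name, 0) + 1
    if attempts_seen[name] == 1:
        raise RuntimeError(f"{name} failed")
    return f"{name} done"


def worker(tasks, results, attempts_seen):
    """Consume tasks until a None sentinel arrives; re-queue failures."""
    while True:
        item = tasks.get()
        if item is None:
            tasks.task_done()
            break
        name, retries_left = item
        try:
            results.append(flaky_job(name, attempts_seen))
        except RuntimeError:
            if retries_left > 0:
                # Put the task back on the queue with one retry fewer.
                tasks.put((name, retries_left - 1))
        finally:
            tasks.task_done()


def run_tasks(names, n_workers=2):
    """Queue every job, let n_workers consume them, return the outcome."""
    tasks = queue.Queue()
    results, attempts_seen = [], {}
    for name in names:
        tasks.put((name, MAX_RETRIES))
    threads = [
        threading.Thread(target=worker, args=(tasks, results, attempts_seen))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    tasks.join()              # wait for all tasks, including retries
    for _ in threads:
        tasks.put(None)       # one shutdown sentinel per worker
    for thread in threads:
        thread.join()
    return sorted(results), attempts_seen
```

Calling `run_tasks(["dns-scan", "port-scan"])` runs each job twice: the first attempt fails, the task goes back on the queue, and the second attempt succeeds. The `n_workers` argument caps how many jobs are in flight at once.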

We send notifications to alert on workflow start, success, and failure.

Distributed Task Queue Architecture

Under The Hood

Now we are ready to tie it all together.
The Transporter provides endpoints to manipulate a workflow.
Behind the scenes, the workflow gets translated to Celery tasks.
We use Celery chains and Celery groups to set dependencies.
These tasks get transported to queues based on the dependencies.
On the other side, Celery workers consume tasks from the queues
and deploy the corresponding Kubernetes Job.
The result: a workflow that is carried out according to its job dependencies.

We also added endpoints to control workers for convenience.
The number of running workers sets the limit for how many jobs can run concurrently.

Never Open the (Package) Container

In our new deployment process, security researchers build and push Docker images to Google Container Registry.
The Transporter transports the corresponding jobs according to the workflow and a ConfigMap that defines the version of each job.
Kubernetes is the engine that actually executes the underlying Docker containers.
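
One way to picture how a version map turns into something Kubernetes can run is the sketch below. The `versions` dictionary stands in for the ConfigMap data, and the registry path, job name, and helper name are hypothetical; only the overall `batch/v1` Job layout comes from Kubernetes itself.

```python
def build_job_manifest(job_name, k8s_name, versions, registry):
    """Build a Kubernetes Job manifest as a plain dictionary.

    `versions` maps a job name to the image tag it should run, playing
    the role of the ConfigMap mentioned above (an assumption about its
    shape, not a description of the real one).
    """
    image = f"{registry}/{job_name}:{versions[job_name]}"
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": k8s_name},
        "spec": {
            "backoffLimit": 0,  # assumed: leave retries to the task queue
            "template": {
                "spec": {
                    "containers": [{"name": job_name, "image": image}],
                    "restartPolicy": "Never",
                }
            },
        },
    }


# Hypothetical version map and registry path.
versions = {"scan-ports": "1.4.2"}
manifest = build_job_manifest(
    "scan-ports", "job-0001", versions, "gcr.io/example-project"
)
print(manifest["spec"]["template"]["spec"]["containers"][0]["image"])
```

Bumping the tag in the version map changes which image the next run pulls, without touching the workflow definition.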

Updating Jobs

No Names

At first, we named each Kubernetes Job after the original job, with a unique identifier appended at the end.

How Not to Name a Job

And this is how we discovered some Kubernetes naming limitations:

The first is the regex that validates a job name, which basically allows lowercase alphanumeric characters separated by dashes (underscores are not allowed).
The second is a maximum name length, which we discovered thanks to our security researchers, who appreciate long, overly detailed names.

Tip #1: use unique identifiers for names.
But we still wanted to know the original job name and the company name, which leads me to:
Tip #2: labels. Labels everywhere.
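
Both tips fit in a few lines of Python. This is a sketch under assumptions: the `job-` prefix, the label keys `original-job` and `company`, and the helper names are made up for illustration, and 63 characters is used as a conservative limit because it is the maximum length of a Kubernetes label value.

```python
import re
import uuid

MAX_LABEL_LENGTH = 63  # Kubernetes label values may be at most 63 characters


def to_label_value(text):
    """Squeeze free-form text into a valid Kubernetes label value.

    Label values may contain alphanumerics, '-', '_' and '.', and must
    start and end with an alphanumeric character.
    """
    cleaned = re.sub(r"[^A-Za-z0-9_.-]", "-", text)[:MAX_LABEL_LENGTH]
    return cleaned.strip("-_.")


def job_metadata(job_name, company):
    """Name the Job with a unique id and keep the readable parts as labels."""
    return {
        # "job-" plus 32 lowercase hex characters is always a valid name.
        "name": f"job-{uuid.uuid4().hex}",
        "labels": {
            "original-job": to_label_value(job_name),  # hypothetical key
            "company": to_label_value(company),        # hypothetical key
        },
    }


metadata = job_metadata("Extremely Detailed Sub-Domain Enumeration (v2)", "Acme Corp")
print(metadata["name"], metadata["labels"])
```

With the readable names stored as labels, the jobs of one company can still be found with a label selector such as `kubectl get jobs -l company=Acme-Corp`.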

Frameworks We Considered

Before implementing our own workflow engine we checked some existing solutions.

  • Airflow: Airflow is great. It works by rendering Python files into DAGs, each of which represents a workflow.
    If you have a static workflow that is determined before runtime, such as an ETL flow, I recommend trying to build a solution with Airflow. Airflow’s problem lies in dynamic workflows; check out this proper way to create dynamic workflows in airflow on Stack Overflow.
    The reason we decided not to use it was our need to generate dynamic workflows, which change based on our REST API requests.
  • Google Pub/Sub: Google’s pub/sub solution. We didn’t use it because it required a massive code change on the side of all the “jobs”.
  • You can check out this task queue post for more alternatives.

What’s Next?

UI: We want to add a UI to make it easy to monitor and troubleshoot active and finished workflows.

Generify: If we make The Transporter a bit more generic, maybe we could release it as open source.

Call to Action

If you have a use case that involves running batch jobs according to certain dependencies (e.g., data acquisition or a web crawling system) and you are interested in scaling with The Transporter, please comment or reach out to let me know.

If you enjoyed this post, feel free to hold down the clap button 👏🏽 and if you’re interested in posts to come, make sure to follow me on

Medium: https://medium.com/@talperetz24
Twitter: https://twitter.com/talperetz24
LinkedIn: https://www.linkedin.com/in/tal-per/
