This is a story about software architecture, about a personal itch, and about scalability. And like any good tech story, it begins with a shaky architecture.
At Panorays, we help large enterprises measure the security posture of their suppliers. But I’m not going to get into the whole third-party security management extravaganza with you. We came to talk about our architecture and process.
In the beginning, there was Bash.
And scripts to manage VMs. A lot of scripts.
- There was a VM instance for each company we assessed.
- Every VM executed sequential batch jobs that imitated the whole reconnaissance phase of the hacker’s lifecycle.
- Company-level parallelism was achieved by firing up more VMs.
- We built an internal orchestration system with Cron and Bash (imagine how fun that was…).
- The parallelism was at the company level, and not at the job level.
- The process wasn’t transparent.
- Server utilization was low.
- The process had to be triggered manually.
The Rise of The Transporter
The Transporter is a dynamic workflow engine, built to create workflows and execute them as Kubernetes Jobs.
A container-based architecture makes The Transporter both flexible enough to configure jobs separately and efficient enough to scale.
It favors parallelism whenever the workflow dependencies allow it, and it provides a REST API for a fully automated pipeline.
The Transporter’s API is triggered automatically whenever a new company is added to the platform. The Transporter then deploys the jobs to Kubernetes in parallel or sequentially, according to a predefined workflow.
As with the original transporter, The Transporter follows a few simple rules:
The Deal is the Deal
The Transporter’s deal is simple: you define the workflow, and The Transporter makes it happen.
But in order to define a workflow we first need to define a job.
In our case, a job is the equivalent of running a Docker container.
A group of these jobs is a phase, and a phase can be sequential or parallel.
A workflow is a sequence of phases.
Now we can enjoy parallelism while still following some rules.
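To make the vocabulary concrete, here is a minimal sketch of how jobs, phases, and workflows could be modeled. The class names, fields, job names, and image references are all hypothetical and only illustrate the structure described above; this is not The Transporter’s actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    """One unit of work: the equivalent of running a Docker container."""
    name: str
    image: str  # hypothetical image reference


@dataclass
class Phase:
    """A group of jobs that run either one after another or all at once."""
    jobs: List[Job]
    parallel: bool = False


@dataclass
class Workflow:
    """An ordered sequence of phases."""
    phases: List[Phase] = field(default_factory=list)

    def max_parallelism(self) -> int:
        """Largest number of jobs that could run at the same time."""
        widths = [len(p.jobs) if p.parallel else 1 for p in self.phases if p.jobs]
        return max(widths, default=0)


# Hypothetical workflow: one sequential phase, one parallel phase, one final phase.
workflow = Workflow(phases=[
    Phase(jobs=[Job("discover-assets", "example/discover:1.0")]),
    Phase(jobs=[Job("scan-web", "example/scan-web:1.0"),
                Job("scan-dns", "example/scan-dns:1.0")], parallel=True),
    Phase(jobs=[Job("report", "example/report:1.0")]),
])
```

Keeping the parallel flag on the phase rather than on the job means the ordering rules live in one place: phases always run in order, and only the jobs inside a parallel phase may overlap.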
Never Make a Promise You Can’t Keep
The Transporter leverages a distributed task queue architecture.
In this architecture, tasks get transported to queues, and workers consume the tasks from the queues and perform them.
This architecture makes it possible to retry a failed task, set a timeout, set a priority, and schedule tasks for later.
We send notifications to alert on workflow start, success, and failure.
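The idea of queued tasks with priorities and retries can be sketched with a toy, in-process queue. This is not Celery and not The Transporter’s code: the `ToyQueue` class, its methods, and the task names are made up for illustration, and timeouts and scheduling are left out to keep it short.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass(order=True)
class Task:
    priority: int                       # lower number is served first
    seq: int                            # tie-breaker that keeps FIFO order
    name: str = field(compare=False)
    func: Callable[[], None] = field(compare=False)
    retries_left: int = field(default=2, compare=False)


class ToyQueue:
    def __init__(self) -> None:
        self._heap: List[Task] = []
        self._counter = itertools.count()
        self.log: List[Tuple[str, str]] = []  # (task name, event)

    def submit(self, name: str, func: Callable[[], None],
               priority: int = 5, retries: int = 2) -> None:
        task = Task(priority, next(self._counter), name, func, retries)
        heapq.heappush(self._heap, task)

    def run_worker(self) -> None:
        """Consume tasks until the queue is empty, re-queuing failed ones."""
        while self._heap:
            task = heapq.heappop(self._heap)
            try:
                task.func()
                self.log.append((task.name, "success"))
            except Exception:
                if task.retries_left > 0:
                    task.retries_left -= 1
                    task.seq = next(self._counter)
                    heapq.heappush(self._heap, task)
                    self.log.append((task.name, "retry"))
                else:
                    self.log.append((task.name, "failure"))


attempts = {"n": 0}


def flaky() -> None:
    """Fails on the first call only, like a transient network error."""
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise RuntimeError("transient error")


def always_fails() -> None:
    raise RuntimeError("boom")


q = ToyQueue()
q.submit("low-priority", lambda: None, priority=9)
q.submit("flaky", flaky, priority=1, retries=2)
q.submit("broken", always_fails, priority=5, retries=1)
q.run_worker()
```

The `log` list plays the role of the notifications: every retry, success, and final failure leaves a record that could be forwarded to an alerting channel.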
Under The Hood
Now we are ready to tie it all together.
The Transporter provides endpoints to manipulate a workflow.
Behind the scenes, the workflow gets translated into Celery tasks.
We use Celery chains and Celery groups to set dependencies.
These tasks get transported to queues based on the dependencies.
On the other side, Celery workers consume tasks from the queues
and deploy the corresponding Kubernetes Job.
The result: a workflow that gets accomplished according to its job dependencies.
We also added endpoints to control workers for convenience.
The number of running workers sets the limit for how many jobs can run concurrently.
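The execution model can be approximated with the standard library alone. In the sketch below, running phases one after another stands in for a Celery chain, a parallel phase stands in for a Celery group, and the thread pool size stands in for the number of running workers. The workflow layout, `run_workflow`, and `fake_deploy` are hypothetical names; a real deployment would create a Kubernetes Job instead of returning a string.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical workflow: a list of phases, each with a mode and job names.
WORKFLOW: List[Dict[str, object]] = [
    {"mode": "sequential", "jobs": ["discover"]},
    {"mode": "parallel", "jobs": ["scan-web", "scan-dns", "scan-mail"]},
    {"mode": "sequential", "jobs": ["score", "report"]},
]


def run_workflow(workflow: List[Dict[str, object]],
                 deploy: Callable[[str], str],
                 max_workers: int = 2) -> List[List[str]]:
    """Run phases in order; jobs in a parallel phase share a bounded pool."""
    results: List[List[str]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for phase in workflow:
            jobs = list(phase["jobs"])
            if phase["mode"] == "parallel":
                # map() keeps input order and list() waits for every job,
                # so the next phase starts only after the whole group is done.
                phase_results = list(pool.map(deploy, jobs))
            else:
                phase_results = [deploy(job) for job in jobs]
            results.append(phase_results)
    return results


def fake_deploy(job_name: str) -> str:
    """Stand-in for creating a Kubernetes Job and waiting for it to finish."""
    return f"{job_name}: done"


results = run_workflow(WORKFLOW, fake_deploy, max_workers=2)
```

With `max_workers=2`, the three scan jobs never run more than two at a time, which mirrors how the worker count caps concurrency.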
Never Open the (Package) Container
Our new deployment process includes security researchers building and pushing Docker images to Google Container Registry.
The Transporter transports the corresponding jobs according to the workflow and a ConfigMap that defines the version of each job.
Kubernetes is the engine that actually executes the underlying Docker containers.
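As a rough illustration of version pinning, the lookup from job name to image could look like the sketch below. The mapping contents, the registry path, and the `image_for` helper are assumptions made for this example; the post does not show the real ConfigMap layout.

```python
from typing import Dict

# Hypothetical version mapping, e.g. what the data section of a ConfigMap
# might hold. Job names and version tags are made up.
JOB_VERSIONS: Dict[str, str] = {
    "scan-web": "1.4.2",
    "scan-dns": "0.9.0",
}

REGISTRY = "gcr.io/example-project"  # hypothetical registry path


def image_for(job_name: str, versions: Dict[str, str],
              registry: str = REGISTRY) -> str:
    """Build the full image reference for a job from its pinned version."""
    if job_name not in versions:
        # Fail loudly rather than silently falling back to an unpinned tag.
        raise KeyError(f"no version pinned for job {job_name!r}")
    return f"{registry}/{job_name}:{versions[job_name]}"


web_image = image_for("scan-web", JOB_VERSIONS)
```

Keeping versions in one mapping means that changing which version of a job runs is a configuration change, not a code change in the engine.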
At first we set the Kubernetes Job name to the original job name with a unique identifier appended at the end,
and this is how we discovered some Kubernetes naming limitations:
The first one is the regex that validates a job name, which basically allows only lowercase alphanumeric characters separated by dashes (underscores are rejected).
The second limitation is a maximum name length, which we discovered
thanks to our security researchers, who appreciate long, overly detailed names.
Tip #1: use unique identifiers for names.
But we still wanted to know the original job name and the company name, which leads me to:
Tip #2: labels. Labels everywhere.
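Both tips can be combined in a small sketch: generate a short name that is always valid, and carry the human-readable details in labels. The helper names and label keys are hypothetical. The 63-character limit and the allowed characters follow Kubernetes' rules for DNS labels and label values.

```python
import re
import uuid
from typing import Dict

MAX_LEN = 63  # length limit for DNS-label names and for label values


def unique_job_name(prefix: str = "job") -> str:
    """Short name: a fixed lowercase prefix plus a random hex identifier."""
    return f"{prefix}-{uuid.uuid4().hex[:12]}"


def to_label_value(text: str) -> str:
    """Coerce free text into a valid label value.

    Label values may contain alphanumerics, '-', '_' and '.', must start and
    end with an alphanumeric character, and are at most 63 characters long.
    """
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "-", text)[:MAX_LEN]
    return cleaned.strip("-_.")


def job_metadata(original_name: str, company: str) -> Dict[str, object]:
    """Metadata for a job: a generated name plus descriptive labels."""
    return {
        "name": unique_job_name(),
        "labels": {
            # Hypothetical label keys
            "original-job-name": to_label_value(original_name),
            "company": to_label_value(company),
        },
    }


meta = job_metadata("Subdomain Enumeration (very detailed, v2)!", "Acme Corp")
```

Labels also make it possible to select all jobs of a given company or of a given type later, which a generated name alone cannot do.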
Frameworks We Considered
Before implementing our own workflow engine we checked some existing solutions.
- Airflow: Airflow is great. It works by rendering Python files into DAGs, each of which represents a workflow.
If you have a static workflow that is determined before runtime, such as an ETL flow, I recommend trying to build a solution with Airflow. Airflow’s problem lies in dynamic workflows: check out this proper way to create dynamic workflows in Airflow on Stack Overflow.
The reason we decided not to use it was our need to generate dynamic workflows, which change based on our REST API requests.
- Google Pub/Sub: Google’s publish/subscribe messaging service. We didn’t use it because it would have required a massive code change on the side of every one of the “jobs”.
- You can check out this task queue post for more alternatives.
UI: We want to add a UI to make it easy to monitor and troubleshoot active and finished workflows.
Generify: If we make The Transporter a bit more generic, maybe we could release it as open source.
Call to Action
If you have a use case that involves running batch jobs according to certain dependencies (e.g., a data acquisition or web crawling system) and you are interested in scaling with The Transporter, please comment or reach out to let me know.
If you enjoyed this post, feel free to hold down the clap button 👏🏽 and if you’re interested in posts to come, make sure to follow me on