Moving from CircleCI to Buildkite: Everything You Need to Know

Written by komodor | Published 2021/07/28
Tech Story Tags: devops | cicd | circle-ci | buildkite | devops-tools | kubernetes | python | good-company

TLDR: Rona Hirsh is a DevOps/Full Stack Developer at Komodor, revolutionizing Kubernetes troubleshooting. He recently joined an early-stage startup that moved from CircleCI to Buildkite. In this post he explains the reasons behind the move and the benefits the team has seen from using Buildkite for its CI/CD operations. The move was based on a few core capabilities that were critical to Komodor's success: Python vs. YAML for configuration files, Docker layer caching performance, agent security and job concurrency, and cost.

As a DevOps engineer of quite a few years, I’ve had the opportunity to work with a variety of CI/CD tools, from Jenkins to GitLab, TFS, and CircleCI. But, as the saying goes, DevOps is about culture and processes, since the tooling is constantly changing.

Enter Buildkite. I recently started at an early-stage startup building Kubernetes-native tooling, and when I joined the team I discovered they had just migrated from CircleCI to Buildkite. After years of working with the same tooling, the transition was an interesting new adventure. I’m going to share some of the reasons behind the migration and the benefits we’ve seen from using Buildkite for our CI/CD operations.

We’ll start by saying that CircleCI and Buildkite serve the same purpose and are easily interchangeable for similar tasks. I have personally worked with CircleCI for years and have been extremely happy with its capabilities. Our choice to move to Buildkite came down to a better match with the Komodor use case and in no way reflects on the quality of CircleCI as a tool. It is equally great and has its own use cases it excels at (we can dive into those in another post).

In this post, I’ll review our primary considerations for moving to Buildkite from CircleCI, and some of the stuff we’re doing with it today.

To quickly note how they differentiate themselves: Buildkite describes itself as a platform for running fast, secure, and scalable continuous integration and delivery pipelines on your own infrastructure, whereas CircleCI positions itself as a continuous integration and delivery platform that automates the build, test, and deployment of software.

For us, the security aspect plays a critical role, as does running on our own infrastructure (I’ll get into how this applies below).

Why We Moved to Buildkite

For Komodor, the decision to move to Buildkite was based on a few core capabilities that were critical to our success:

  • Python vs. YAML for Configuration Files
  • Docker Layer Caching Performance
  • Buildkite Agent Security & Job Concurrency
  • Cost

Each of these individually may not have been enough of a factor to migrate away from our existing CI tooling, but together they provided us with enough of an improvement that we decided to take the plunge.

Python vs. YAML for Configuration Files

Anyone who has ever worked with YAML or configured software with YAML will say that, on the one hand, it is extremely powerful for declarative config management; on the other, it comes with its own complexity. A good talk to watch in that respect is YAML Considered Harmful by Philipp Krenn of Elastic.

As a Python shop, we wanted to explore ways to configure our pipelines beyond YAML alone, and Python - historically a favorite DevOps choice for building automation - seemed like a more native approach for our entire engineering team (rather than leaving operations alone to configure and optimize YAML files). By configuring with Python, all of our engineers can create their own configuration files as part of the development process, while significantly reducing the lines of code required.

See below sample code snippets for configuration in Python vs. YAML:

This is what a typical YAML config file would look like. Yes, it’s really that long.

version: 2.1

executors:
  common:
    working_directory: "~/base/mono"
    docker:
      - image: 634375685434.dkr.ecr.us-east-1.amazonaws.com/build-tools:latest
        aws_auth:
          aws_access_key_id: $AWS_ACCESS_KEY_ID
          aws_secret_access_key: $AWS_SECRET_ACCESS_KEY

commands:
  build-service:
    parameters:
      service:
        type: string
    steps:
      - setup_remote_docker:
          docker_layer_caching: true
      - restore_cache: # restores the saved dependency cache if the branch key template or requirements.txt files have not changed since the previous run
          key: v1-repo-{{ .Environment.CIRCLE_SHA1 }}
      - run: "komo ci build-and-push << parameters.service >>"

jobs:
  init-workspace:
    executor: common
    steps:
      - checkout
      - save_cache: # special step to save the dependency cache
          key: v1-repo-{{ .Environment.CIRCLE_SHA1 }}
          paths:
            - "."

  shared:
    executor: common
    steps:
      - build-service:
          service: "shared"

  test-authz:
    executor: common
    steps:
      - checkout
      - setup_remote_docker:
          docker_layer_caching: true
      - run: "yarn workspace authz-tests run start:ci"
  web:
    executor: common
    steps:
      - build-service:
          service: "web"
  admin:
    executor: common
    steps:
      - build-service:
          service: "admin"
  brain:
    executor: common
    steps:
      - build-service:
          service: "brain"
  cloudtrail-collector:
    executor: common
    steps:
      - build-service:
          service: "cloudtrail-collector"
  es-proxy:
    executor: common
    steps:
      - build-service:
          service: "es-proxy"
  github-collector:
    executor: common
    steps:
      - build-service:
          service: "github-collector"
  github-installer:
    executor: common
    steps:
      - build-service:
          service: "github-installer"
  komodor-service-api:
    executor: common
    steps:
      - build-service:
          service: "komodor-service-api"
  pager-duty-collector:
    executor: common
    steps:
      - build-service:
          service: "pager-duty-collector"
  k8s-events-collector:
    executor: common
    steps:
      - build-service:
          service: "k8s-events-collector"
  sentry-collector:
    executor: common
    steps:
      - build-service:
          service: "sentry-collector"
  hasura-actions:
    executor: common
    steps:
      - build-service:
          service: "hasura-actions"
  slack-collector:
    executor: common
    steps:
      - build-service:
          service: "slack-collector"
  opsgenie-collector:
    executor: common
    steps:
      - build-service:
          service: "opsgenie-collector"
  slack-installer:
    executor: common
    steps:
      - build-service:
          service: "slack-installer"
  setup:
    executor: common
    steps:
      - build-service:
          service: "setup"
  deploy:
    executor: common
    steps:
      - add_ssh_keys:
          fingerprints:
            - "3b:25:2e:68:8c:e9:81:29:47:7f:79:5f:d4:43:bd:a3"
      - checkout
      - run: >
          komo ci deploy
          shared
          brain
          web
          admin
          cloudtrail-collector
          es-proxy
          github-collector
          github-installer
          komodor-service-api
          pager-duty-collector
          k8s-events-collector
          sentry-collector
          opsgenie-collector
          slack-collector
          slack-installer
          hasura-actions
          setup

workflows:
  version: 2
  build-and-test:
    jobs:
      - init-workspace:
          context: aws
      - shared:
          context: aws
          requires:
            - init-workspace
      - brain:
          context: aws
          requires:
            - init-workspace
      - web:
          context: aws
          requires:
            - shared
      - admin:
          context: aws
          requires:
            - shared
      - cloudtrail-collector:
          context: aws
          requires:
            - shared
      - es-proxy:
          context: aws
          requires:
            - shared
      - github-collector:
          context: aws
          requires:
            - shared
      - github-installer:
          context: aws
          requires:
            - shared
      - komodor-service-api:
          context: aws
          requires:
            - shared
      - pager-duty-collector:
          context: aws
          requires:
            - shared
      - opsgenie-collector:
          context: aws
          requires:
            - shared
      - k8s-events-collector:
          context: aws
          requires:
            - shared
      - sentry-collector:
          context: aws
          requires:
            - shared
      - hasura-actions:
          context: aws
          requires:
            - shared
      - slack-collector:
          context: aws
          requires:
            - shared
      - slack-installer:
          context: aws
          requires:
            - shared
      - setup:
          context: aws
          requires:
            - shared
      - test-authz:
          context: aws
          requires:
            - shared
      - deploy:
          context: aws
          requires:
            - brain
            - web
            - admin
            - k8s-events-collector
            - cloudtrail-collector
            - github-collector
            - github-installer
            - komodor-service-api
            - pager-duty-collector
            - slack-collector
            - slack-installer
            - sentry-collector
            - hasura-actions
            - setup
          filters:
            branches:
              only: master

Note how the Python config file is SIGNIFICANTLY shorter.

#!/usr/bin/env python3
from os import environ
from typing import Optional, List

import yaml

services = [
    "slack-installer",
    "opsgenie-collector",
    "hasura-actions",
    "web",
    "admin",
    "brain",
    "github-collector",
    "github-installer",
    "pager-duty-collector",
    "k8s-events-collector",
    "sentry-collector",
    "slack-collector",
    "slack-message-sender",
    "rest-api",
    "datadog-app",
    "webhook-listener",
]

with open("umbrella/Chart.yaml") as f:
    dep = yaml.safe_load(f)["dependencies"]
    dep_names = [d["name"] for d in dep]
    for s in services:
        assert s in dep_names, f"Missing {s} chart dependency in umbrella/Chart.yaml"


def do_command(
    label: str,
    command: str,
    key: Optional[str] = None,
    concurrency_group: Optional[str] = None,
    concurrency: int = 1,
    soft_fail: bool = False,
    depends_on: Optional[List[str]] = None,
):
    print(f"  - command: {command}")
    print(f"    label: '{label}'")
    if key:
        print(f"    key: '{key}'")
    if concurrency_group:
        print(f"    concurrency_group: {concurrency_group}")
        print(f"    concurrency: {concurrency}")
    if soft_fail:
        print("    soft_fail:")
        print("        - exit_status: 1")
        print("        - exit_status: 2")
    if depends_on:
        print("    depends_on:")
        for dependency in depends_on:
            print(f'        - "{dependency}"')


def wait():
    print("  - wait")


def require_approval(reason: str):
    print(f"  - block: '{reason}'")
    print("    branches: master")
​
​
def pipeline():
    # begin the pipeline.yml file
    print("steps:")
    do_command(
        ":docker: build and push shared :docker:", "komo ci build-and-push shared"
    )
    wait()
    for svc in services:
        do_command(f":docker: {svc}", f"komo ci build-and-push {svc}")
    do_command(
        ":docker: e2e (testcafe)",
        "komo ci build-and-push e2e --dockerfile tests/e2e/Dockerfile",
    )
    if environ.get("BUILDKITE_BRANCH") == "master":
        wait()
        services_and_shared = services[:] + ["shared"]
        services_and_shared_as_string = " ".join(services_and_shared)
        production_arguments = '"-f values-production.yaml --timeout 20m"'
        staging_arguments = '"-f values-staging.yaml"'
        do_command(
            label=":docker: upload images :ecr:",
            command=f"komo ci upload-images {services_and_shared_as_string}",
            concurrency_group="ecr/tagging",
        )
        wait()
        do_command(
            label=":kubernetes: deploy staging :rocket:",
            command=f"komo ci deploy --cluster-name komodor-staging-eks --extra-arg {staging_arguments}",
            concurrency_group="staging/deploy",
            key="staging-deploy",
        )
        do_command(
            label=":rooster: E2E Test - Staging",
            command="komo test trigger testcafe --env staging",
            concurrency_group="staging/deploy",
            depends_on=["staging-deploy"],
        )
        # require_approval("Approve deploy to production manually")
        wait()
        do_command(
            label=":kubernetes: deploy production :rocket:",
            command=f"komo ci deploy --extra-arg {production_arguments}",
            concurrency_group="production/deploy",
            key="production-deploy",
        )
        # do_command(
        #    label=":rooster: E2E Test - Production",
        #    command="komo test trigger testcafe",
        #    concurrency_group="production/deploy",
        #    depends_on=["production-deploy"],
        # )


if __name__ == "__main__":
    pipeline()

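For completeness: a generator script like this is typically hooked up as a Buildkite dynamic pipeline, where the only step defined in Buildkite itself runs the script and pipes its YAML output to buildkite-agent pipeline upload. A minimal sketch, assuming the script above is committed as .buildkite/pipeline.py and marked executable:

steps:
  - label: ":pipeline: generate pipeline"
    command: ".buildkite/pipeline.py | buildkite-agent pipeline upload"

Since the script just prints YAML to stdout, you can also run it locally to preview the steps it would generate for a given branch.
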
Docker Layer Caching Performance

For those who are not familiar (those who are, feel free to skip this part): each Docker image consists of layers that need to be built as part of every build or deployment. Docker layer caching stores those layers instead of rebuilding them every time, so only the delta between builds is rebuilt while the unchanged layers are reused from cache. Virtually every CI uses this technique to significantly improve build performance.
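
On self-hosted Buildkite agents this works particularly well, because the agent’s local Docker daemon keeps its layer cache warm between builds. As a hedged sketch (the registry URL and image name below are placeholders, not our actual setup), a build step can also seed the cache explicitly by pulling the last pushed image and passing it to docker build --cache-from:

steps:
  - label: ":docker: build web"
    command: |
      # pull the previous image so its layers are available locally (ignore failure on the first build)
      docker pull 123456789012.dkr.ecr.us-east-1.amazonaws.com/web:latest || true
      # reuse unchanged layers from the pulled image; only the changed layers are rebuilt
      # note: with BuildKit enabled, the cached image must have been built with BUILDKIT_INLINE_CACHE=1
      docker build --cache-from 123456789012.dkr.ecr.us-east-1.amazonaws.com/web:latest \
        -t 123456789012.dkr.ecr.us-east-1.amazonaws.com/web:latest .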

After benchmarking our own builds (this will of course vary between organizations and the Docker images being used), we found a clear performance advantage with Buildkite: our build time dropped from ~1.5 minutes per build to around 20 seconds.

While both CircleCI and Buildkite offer Docker layer caching, it performed better for us with our own Docker images (which brings me to the next point). As a small startup that needs to move rapidly, even an incremental improvement in build time ultimately translates to big long-term gains.

Buildkite Agent Security & Job Concurrency

Buildkite agents are small, reliable, cross-platform build runners. This let us run our agents on a dedicated Kubernetes cluster hosted in our own AWS account, which was important for our use case: Buildkite’s self-hosted agents let you keep sensitive data behind your firewall, and when you host your own runners your code never leaves your servers. That was an added layer of build security we wanted to ensure, initially as a stealth-mode startup and still now that we have exited stealth.
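
To give a rough idea of what running agents in your own cluster can look like, here is a minimal, hypothetical Kubernetes Deployment for the official buildkite/agent image; the name, replica count, and secret reference are placeholders rather than our production setup:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: buildkite-agent
spec:
  replicas: 3                        # scale agents up or down as build volume changes
  selector:
    matchLabels:
      app: buildkite-agent
  template:
    metadata:
      labels:
        app: buildkite-agent
    spec:
      containers:
        - name: agent
          image: buildkite/agent:3
          env:
            - name: BUILDKITE_AGENT_TOKEN   # registers the agent with your Buildkite organization
              valueFrom:
                secretKeyRef:
                  name: buildkite-agent-token   # placeholder secret
                  key: token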

In addition to the security this makes possible, Buildkite’s self-hosted agents give you full control to parallelize your CI tasks at any scale. You can configure your build to run parallel jobs and get considerably faster results. For us this was an important consideration for future-proofing our CI operations, with a focus on being able to scale and grow with our tooling rather than having to migrate away as our operations grow.
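
For example, a Buildkite command step can declare parallelism to fan a single job out across many agents, and concurrency together with concurrency_group to cap how many jobs of a kind run at once. A small sketch (the test command is a placeholder; the deploy command mirrors the one in our pipeline script above):

steps:
  - label: ":test_tube: tests"
    command: "komo test run"          # placeholder command, split across 10 agents
    parallelism: 10

  - label: ":kubernetes: deploy staging :rocket:"
    command: "komo ci deploy --cluster-name komodor-staging-eks"
    concurrency: 1                    # never more than one staging deploy at a time
    concurrency_group: "staging/deploy"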

Last but Not Least: Cost

As an early-stage startup, our runway and time to market will make or break us, so speed and cost must always be top-of-mind when choosing our tooling and the technology partners that will help us reach our next phases of growth. What we liked about Buildkite, itself a fairly new player in the market looking to grow and scale its business, is that a single affordable plan gives you everything at a reasonable price.

With Buildkite you decide where and when to run the workers, so you have greater control over how much you ultimately pay. This transparency is also important for opex planning. While we will of course always be willing to pay for excellent tooling, and cost wasn’t a primary consideration, it certainly helped seal the final decision to move to Buildkite.

To wrap it up: Buildkite was the right choice for us as an early-stage startup looking to move fast without breaking too many things (or the bank). It fit natively into our engineering process and operations, and it felt like the right match for a Kubernetes-native product.

If you are at a similar crossroads, I hope this helped you think through some of the benefits and considerations of migrating to Buildkite (or any other CI/CD tool).
