What Is GitOps And Why Is It (Almost) Useless? Part 2

Written by chep | Published 2023/08/17
Tech Story Tags: programming | devops | kubernetes | software-development | gitops | gitlab | ci-cd-pipelines | what-is-gitops

TL;DR: The second part of the article explores GitOps in detail, focusing on the challenges of managing multiple environments, handling values and secrets, and comparing GitOps with CI Ops in practice. GitOps suggests having a single environment and pushes multiple stages out of its scope, while valuesFiles and valuesFrom are considered for environment-specific configuration. Managing secrets means avoiding plaintext storage, using tools like SOPS or Sealed Secrets, or employing external secret stores like HashiCorp Vault. The article compares GitOps and CI Ops in terms of security, rollback procedures, and managing multiple clusters, noting that GitOps introduces complexity and leaves many problems "out of GitOps scope." Ultimately, GitOps offers no substantial advantage over well-organized CI Ops while demanding several extra components and accounts, whereas CI Ops on a platform like GitLab offers a more streamlined approach with integrated features.

Hi Hackernoon!

Tech experts warmly welcomed my previous article on GitOps. Even a top-level expert from Weaveworks contacted me after reading it, which greatly surprised me. After some warm words, we agreed that they would wait for the second part of the article. Well, the time has come.

Let’s dive in!

Content Overview

  • The Multiple Environments Problem
  • Values From
  • Secrets Issue
  • CI Ops vs GitOps
  • So, Who Really Needs GitOps?
  • Conclusion

The Multiple Environments Problem

We've already dealt with the Flux workflow in the previous part. Now let's move on. Here is our Helm release.

Suppose we rolled out changes to the development environment, tested everything, and now want to roll out to stage. How can we do that? We have GitOps, so obviously, we want to maximize the Git approach (merge code from dev to stage). But with this Helm release configuration, we can't, because (for example) we have a different database in the stage environment (lines 20...23).
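The original article shows the release as a screenshot. Since it isn't reproduced here, a minimal sketch of such a Flux HelmRelease, with hypothetical names and a hypothetical database host baked into the values, might look like this:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: my-app
  namespace: dev
spec:
  interval: 5m
  chart:
    spec:
      chart: my-app
      version: "1.2.3"
      sourceRef:
        kind: HelmRepository
        name: my-charts
  values:
    # Environment-specific settings hardcoded in the release are
    # exactly what breaks a clean dev -> stage git merge:
    database:
      host: postgres.dev.example.com
      name: my_app_dev
```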

What does GitOps say about managing environments?

GitOps says,

"Why don't you just have one environment, and you won't have to do anything stupid". I, for one, am not sure how many teams have only one environment. GitOps ideologues don't seem to be sure about that either. So they say, "Well, if you really need multistage, do it somewhere outside of the GitOps scope. Within your CI/CD system, for example."

Values From

So we think, review the options, and find a few.

The first one is valuesFiles (lines 15, 16, 17).

We can describe separate values files for individual environments and plug them into the HelmRelease. But this scheme doesn't fit well with git merge either.
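A hedged sketch of what that looks like in the HelmRelease chart spec (file names are hypothetical):

```yaml
spec:
  chart:
    spec:
      chart: my-app
      version: "1.2.3"
      valuesFiles:
        - values.yaml
        - values-dev.yaml   # would have to become values-stage.yaml on merge
```

The environment name is wired into the manifest itself, so merging dev into stage would overwrite it.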

Next, we try valuesFrom.

Our environment namespace may contain one or more ConfigMaps and/or Secrets (lines 22...25). From them, the Helm controller can take values and customize our Helm chart. And I like that. I even make a separate Helm chart where I configure all microservices, link it to the application charts with the dependsOn key (line 16), and proudly call it a single point of platform configuration.
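Sketched in HelmRelease terms (names are hypothetical), it might look like this:

```yaml
spec:
  dependsOn:
    - name: platform-config   # the "single point of platform configuration"
  valuesFrom:
    - kind: ConfigMap
      name: env-values        # per-environment values live in the namespace
    - kind: Secret
      name: env-secrets
      optional: true
```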

And then, the following happens: I have several Kubernetes clusters. My colleagues come and say: "Let's write another config map, and there will be cluster-specific values (lines 26...28)."

For example, the cluster name, Ingress suffix, region, and access rights. And this config map would be created with Terraform, together with the cluster.

When this happens, I have to reach for a weightier argument and patiently explain to my colleagues the depth of their fall. Because in this particular case, our application configuration would depend on four repositories (the application chart repository, the platform setup repository, the Flux infrastructure repository, and the Terraform file repository) and two deployment systems (Flux and Terraform).

This is where I can give you some advice. If you want to work with valuesFrom, you should keep only one value in your variables and config maps: the name of the environment you are deploying to. Everything else should be done inside the Helm chart using templating mechanisms.
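A minimal sketch of that advice, with hypothetical names: the ConfigMap carries a single value, and the chart derives everything else from it through templating (for instance, a helper like `postgres.{{ .Values.environment }}.example.com` for the database host).

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: env-values
data:
  values.yaml: |        # the Helm controller reads this key by default
    environment: stage  # the ONLY externally supplied value
```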

Secrets Issue

Next point. We have a Helm release with a line that should not be there: the database password (line 23).

What does GitOps tell us about managing secrets?

It says,

Never store your passwords in clear text in a Git repository.

What can we say?

Thanks, Cap!

The next thing GitOps offers us is:

Let's generate your passwords randomly when you create your environment. And you will create all objects that require password protection from those passwords. That way, the password will never leave the environment, everything will be secured, and everyone will be happy.

Nice, sure. But how many people, for example, run their SQL database in Kubernetes? Or have Crossplane installed there? As a rule, we don't create heavy, persistent services inside (or from inside) a Kubernetes cluster.

Reading on, we find the following alternative approach:

Let's use tools like Mozilla SOPS or Bitnami Sealed Secrets, which work with public and private key pairs. The private key is written to the cluster. With the public key, we encrypt our passwords and commit them to a Git repository. At deployment time, a tool inside Kubernetes decrypts them and hands them to applications. And a specially trained Ops team member will write the private key to the cluster. Let him carry a printout of it in his jacket pocket and carefully enter it by hand.
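For illustration, a minimal SOPS configuration along these lines (a sketch; the age recipient key is a hypothetical placeholder) encrypts only the sensitive fields of Kubernetes Secret manifests:

```yaml
# .sops.yaml
creation_rules:
  - path_regex: .*\.secret\.yaml$
    encrypted_regex: ^(data|stringData)$   # leave metadata readable for diffs
    age: age1examplepublickeyplaceholder   # hypothetical recipient public key
```

Flux's kustomize-controller can decrypt such secrets in-cluster when it is given the corresponding private key, which is exactly the key the rest of this section is about.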

OK, that seems like the best option.

Now let's think about it. In The Limoncelli Test: 32 Questions for Your Sysadmin Team (old but still relevant), there is, for example, question 31: "Can a user's account be disabled on all systems in 1 hour?" Your Ops guy quits. And if your Kubernetes clusters are not under a single sign-on system, you must run across all clusters and delete his account everywhere.

The next question is, "Can you change all privileged (root) passwords in 1 hour?" The private key that your Ops guy effectively compromised by getting fired is exactly such a root password. And to change it, you have to do certain things:

  1. Extract the private key from the cluster
  2. Decrypt what we have in the Git repository with it
  3. Make new keys
  4. Write the private key to the cluster
  5. Encrypt the passwords
  6. Commit them to git
  7. Run it all through the release cycle and retire the old key

And you have to do it in a particular order because we have an asynchronous pull model. Any careless rollout can turn your environment into a pumpkin. And rollback won't help here.

A couple of words for fans of external secret stores like HashiCorp Vault. I haven't found a consensus on them, but from a common-sense point of view, an external secret store makes Git even less of a single source of truth and thus goes against the GitOps approach.

CI Ops vs GitOps

As part of the GitOps concept's PR, Weaveworks has often referred to classic CI Ops as an outdated approach and an anti-pattern. Why? Because with GitOps, we can separate the deployment stage, protect it, and manage it more cleanly.

Within CI Ops, the deployment stage is part of a monolithic pipeline, and because of this, several disadvantages supposedly arise.

Let's analyze these claims one by one.

But! To get a fair comparison, we must choose a "fighter" from the CI-systems side. Let's take GitLab, popular and familiar to many, which implements the concept of pipelines as code.

Your CI code is in the same repository as your application and is actually part of it. This allows for greater CI/CD transparency, self-documenting pipelines, and knowledge transfer within the team...

CI Ops vs GitOps: Security

The first claimed problem is poor security. Supposedly, within CI Ops the developers have full access to Kubernetes through the CI system, which is insecure, while within GitOps we have an asynchronous executor pool with no direct access to it, which greatly increases the system's security. Let's install Flux and see what the actual security looks like.

By default, Flux gives cluster administrator rights to all of its controllers.

Obviously, it's not very secure.

Let's try to improve this situation. I found an article about Flux multi-tenancy for multiuser setups. We create a tenant and see that a namespace and a ServiceAccount are generated.

Flux controllers get administrator rights to this namespace. Not all of them, though: just the Kustomize controller.
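Roughly what such a tenant boils down to (a hedged sketch of the generated objects; the tenant name is hypothetical):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: team-a
  namespace: team-a
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-reconciler
  namespace: team-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin     # admin rights, but scoped to this namespace only
subjects:
  - kind: ServiceAccount
    name: team-a
    namespace: team-a
```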

Earlier, the recommendation was to use Kyverno to restrict the Flux Helm controller. Now, they offer a somewhat strange mechanism of patching the controllers' Deployments so that reconciliation picks up the ServiceAccount from the namespace where the workload is deployed.

But let's just think about it, abstracting from the toolkit. We have two schemes: the synchronous push model and the asynchronous pull model. Which one is safer, assuming both have admin rights? That's a rhetorical question, because admin rights are admin rights.

But even if we restrict infrastructure rights, both schemes have roughly the same access to infrastructure, passwords, and persistent data. Both can do harm.

If you look at the problem a bit from the outside, you can see that in both cases you have a Git repository, and it contains both application code and infrastructure code. And probably the only way to stop abuse here is to control what gets into your Git. GitLab has a mechanism for merge request approvals (only in the paid version for now, but as we know, GitLab's paid features tend to move quickly to the community version).

This mechanism prevents you from merging your changes into the working branch of your repository until the right number of peers have reviewed your changes and confirmed that they are OK. This rather simple mechanism increases security many times over—much more than the difference between pull and push models.

CI Ops vs GitOps: Rollback Procedure

Another disadvantage that GitOps ideologues cite is the difficulty of rolling back in CI. Their guides say that the CI system is not meant to be the only source of truth, meaning that when working with a CI system, we cannot determine what we have in the cluster and what we should roll back to. But with GitOps, we have an infrastructure repository synchronized with the infrastructure, so we just do a git revert. And everything is fine.

Let's think. GitOps only covers the deployment stage. But since Git cannot be the only source of truth, we should think about rollback at the stage of building our artifacts. Besides, Flux has everything in place to make rollbacks difficult or impossible. For example, we can tag a Docker image carelessly (line 19). Also, we can use a Helm chart version range (line 11).
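Both traps fit in a few lines of a HelmRelease (a hedged sketch):

```yaml
spec:
  chart:
    spec:
      chart: my-app
      version: ">=1.0.0"   # open version range: a git revert may still
                           # resolve to the newest published chart
  values:
    image:
      tag: latest          # mutable tag: reverting the manifest does not
                           # revert the image actually running
```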

And if we use the Helm chart from the application repository, we may need to roll back the application repository to roll back our environment.

In addition, don't forget about possible valuesFrom dependencies in a separate repository, which we may also need to roll back.

Flux also has recommendations for organizing infrastructure repositories. For example, Monorepo: the infrastructure for all our environments lives in one branch, in different folders.

Or, for example, Repo per team: each team has its own folder in a shared infrastructure repository.

This is where you must be very careful when rolling back, so you don't delete something extra.

As for CI Ops, GitLab can be the single source of truth. GitLab has a so-called environment mechanism: we specify the name of an environment in a pipeline deploy step and see its state in the environment tracking.

It is immediately transparent which pipelines, from which branches, and from which commits produced each deployment.
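A minimal sketch of such a deploy job (names and the URL are hypothetical):

```yaml
deploy-stage:
  stage: deploy
  script:
    - helm upgrade --install my-app ./chart -n stage -f values-stage.yaml
  environment:
    name: stage                      # shows up in GitLab's environment tracking
    url: https://stage.example.com
```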

We also have a visual display of the pipelines, where we can see what rolled out and whether the stages passed, and we can roll back to the right point. This is decisively impossible when looking at the commit history of an infrastructure repository in the GitOps concept.

Alternatively, we can save the Docker image as an artifact in the pipeline, thus making the pipeline atomic. And then it really would be the only source of truth.

As a result, to roll back within the implementation described above, you would just need to restart the rollout job in the pipeline history in the GitLab visual interface.

CI Ops vs GitOps: The Problem of Multiple Clusters

According to the GitOps concept's developers, it is difficult to propagate changes to multiple clusters in CI Ops. Allegedly, in a CI system, we need to script, script, and script. But in beautiful GitOps, we just install Flux in a new cluster and point it at an infrastructure repository. One moment, and our application is already in the new cluster.

Let's start analyzing. Here is our HelmRelease custom resource in the infrastructure repository.

I have 2 questions when looking at this Helm release:

  • Where did this application come from? Where is its repository? Where are the build logs and test run results? From which branch and from which commit is it deployed?
  • Where is the application deployed to? How many Flux instances in which clusters watch this infrastructure repository and pull configuration from it?

"Out of GitOps scope" can be the answer to the first question. We just need to write a proper HelmRelease push script that will write all the information we need in the commit message.

For the second question, Flux offers a bootstrap mechanism with the creation of a dedicated infrastructure repository (flux-fleet) to manage Flux itself. Before installing Flux in a cluster, we first commit all the YAMLs describing the installation of the Flux controllers and the related custom resources, then apply them to the target cluster. When we need to connect a new infrastructure repository via a Git source and a Kustomization, we commit to the flux-fleet repository again.
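Connecting a new infrastructure repository from the flux-fleet repo then looks roughly like this (a sketch; URLs and names are hypothetical):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: infra-prod
  namespace: flux-system
spec:
  interval: 1m
  url: https://gitlab.example.com/platform/infra-prod.git
  ref:
    branch: main
  secretRef:
    name: infra-prod-token   # one more access account to manage
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra-prod
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: infra-prod
  path: ./clusters/prod
  prune: true
```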

So we have the entire list of our clusters and all the accesses in one repository. And it works fine until some manual operations happen.

For example, your Ops expert quits. Let's say there's a Kubernetes cluster somewhere where he installed Flux by hand, bypassing flux-fleet, and gave it access to your infrastructure repositories. If you don't revoke access and regenerate the keys, you may never know about it.

So I'd recommend writing down step zero on your checklist: create a flux-fleet repository and bootstrap Flux from it. Otherwise, you will have secret knowledge that may one day be lost with the employee.

What do we have in CI Ops? In GitLab, we can declare reusable pieces of code (lines 1...9). We can inherit from these pieces (lines 13, 19, 25) and declare additional variables (lines 15, 21, 27) when we create a new environment; in this case, the name of the cluster, which our scripts use to adjust their behavior. This way, we deploy to a new environment and get automatic tracking of it.
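Since the original screenshot isn't reproduced, here is a hedged sketch of the pattern: a hidden job template plus per-cluster jobs that extend it (names and the image tag are hypothetical):

```yaml
.deploy:                        # reusable piece of code (a hidden job)
  stage: deploy
  image: alpine/helm:3.12.0     # hypothetical image tag
  script:
    - helm upgrade --install my-app ./chart -n "$CI_ENVIRONMENT_NAME"

deploy-dev:
  extends: .deploy              # inherit the shared logic
  variables:
    CLUSTER_NAME: dev-cluster   # cluster-specific variable for the scripts
  environment:
    name: dev

deploy-stage:
  extends: .deploy
  variables:
    CLUSTER_NAME: stage-cluster
  environment:
    name: stage
```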

So, Who Really Needs GitOps?

There are a few cases where I think a GitOps approach can be justified:

  1. You don't have a CI/CD system, or it can't provide normal deployment for some reason.
  2. Improved security. GitOps can genuinely add to the mix of measures improving the security of your infrastructure. Keep in mind, though, that the lion's share of security issues are "out of GitOps scope" and should be addressed before, or in conjunction with, a GitOps implementation.
  3. You have a large, multi-component platform that is tested and deployed. In this sense, a GitOps infrastructure repository is a good way to make a "stable" version of the platform and operate it as a single entity.
  4. You have processes where the pull model has advantages over the push model (e.g., canary deployments).

So what is GitOps? For those who like precise formulas, GitOps = Continuous Delivery - Continuous Deployment. So it's not a CD concept. It's just (one more) way to organize automatic deployment.

By and large, this methodology brought almost nothing new. The methodology's authors took the ideas of Infrastructure as Code (IaC), added a pull-based approach with a reconciliation loop from the previous generation of configuration management systems (Puppet, Chef), and made it mandatory to keep the configuration in Git.

Indeed, the one new and key concept of GitOps is Git as a single source of truth, but this concept is executed rather poorly and has flaws. In fact, it's an attempt to take the deployment step out of an application delivery pipeline and put it into Git.

One of the key principles of DevOps is that there are no bad actors. There are bad processes. Hence the blameless culture and the understanding that you need to organize processes in such a way that the problem becomes impossible.

But because of the limited scope of the GitOps concept, it has too many gaps and potential problems, too many things left to do "out of GitOps scope". All this leads to a misunderstanding of how to organize processes properly and, consequently, to workarounds, with all their consequences.

To build a good GitOps process, a company needs high IT maturity because the authors of the concept left out too many important things.

So, GitOps does not provide any significant advantage over a well-organized CI Ops process, although it significantly increases the final system's complexity and labor costs. And in my opinion, many GitOps tool authors understand this, implementing image building on top of GitOps, e.g., Gitkube and Jenkins X.

Conclusion

What else are we missing to make us happy? We obviously don't have enough notifications. We have an asynchronous pull model, and in this model, we don't know what exactly the pull executor is doing on its side unless it explicitly notifies us. So let's add the installation and configuration of the notification controller to our checklist.
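A minimal sketch of that wiring with Flux's notification controller (the channel and names are hypothetical):

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta2
kind: Provider
metadata:
  name: slack
  namespace: flux-system
spec:
  type: slack
  channel: deployments        # hypothetical channel
  secretRef:
    name: slack-webhook-url   # the webhook address lives in a Secret
---
apiVersion: notification.toolkit.fluxcd.io/v1beta2
kind: Alert
metadata:
  name: on-call
  namespace: flux-system
spec:
  providerRef:
    name: slack
  eventSeverity: info
  eventSources:
    - kind: HelmRelease
      name: '*'               # report on every HelmRelease in the cluster
```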

So, to work with GitOps, we need to do:

  1. Flux-fleet repository
  2. Bootstrap Flux in a Kubernetes cluster using the flux-fleet repository
  3. Script for building and pushing images to the Docker registry
  4. Infrastructure Git repository
  5. Account for CI system access to the infrastructure GIT repository
  6. Script to generate and push the HelmRelease file
  7. Helm Repository
  8. Account for CI system access to the Helm repository
  9. Script to build and publish the Helm chart
  10. Flux account for the infrastructure repository
  11. Flux account for the Helm chart repository
  12. Flux account for the application repository
  13. Flux account for cluster access for an engineer on the Ops team
  14. GIT repository for "single point of platform configuration"
  15. Flux account to access the GIT repository
  16. Slack/Teams channel for Notification Controller

Most of these items are "out of GitOps scope". And it's not a methodology problem. But as an experienced engineer, I understand that I can't do GitOps without these items.

What about CI Ops? With CI Ops on GitLab, we don't need many extra actions to get a working deployment (a minimal sketch follows the list). We need to:

  1. Make the CI script to build and publish a docker image
  2. Add a Helm release deployment to the script via helm upgrade --install
  3. Create an account for the CI system to access Kubernetes
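A minimal .gitlab-ci.yml sketch covering these items (assuming the cluster access from item 3 is already configured; names and image tags are hypothetical):

```yaml
stages: [build, deploy]

build:
  stage: build
  image: docker:24
  services: [docker:24-dind]
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"

deploy:
  stage: deploy
  image: alpine/helm:3.12.0
  script:
    - helm upgrade --install my-app ./chart --set image.tag="$CI_COMMIT_SHORT_SHA"
  environment:
    name: production
```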

With that, deprived of the illusion of security, we'll be more responsible in setting up access between the CI system and the cluster.

In addition, we get a built-in integrated Docker registry, built-in environment handling, built-in secrets handling, built-in notifications for pipeline stage execution, and a number of other integrations.


Written by chep | Ukrainian DevOps Engineer with 3 years of experience.
Published by HackerNoon on 2023/08/17