Version control lets developers keep the working pieces of a system available. Specifically, it enables critical capabilities such as reproducibility and traceability.
Reproducibility is like a magic wand that helps to create identical systems no matter how complex the system itself is.
Traceability helps developers pick any environment and track its dependencies. It also lets them pick any two versions of an environment and find the differences, if any.
The advantages offered by reproducibility and traceability are already well understood by the developer community.
Until now, not much thought has been given to applying them to the cloud itself, but as the cloud becomes more complex (spanning multiple accounts, multiple regions, and multiple teams), and as teams need to collaborate more closely, we will have to turn to version control.
The cloud is not a homogeneous entity like source code. It can instead be seen as a layered onion, where each layer may have its own versioning scheme, and a vertical slice through the layers needs one of its own.
Version control needs to be applied to the source code that provisions the cloud. The provisioning is typically done through infrastructure as code (IaC) tools such as CloudFormation, Terraform, or Pulumi.
But applying version control only to the IaC source code is not sufficient. IaC is responsible for provisioning the cloud resources, and these cloud resources have their own lifetimes. For example, an EBS volume can outlive the EC2 instance to which it is attached. This attachment is reflected in the resource properties, and which properties change, and how often, depends on the kind of cloud resource.
Thus, at the outset, cloud version control requires us to track at least three major entities: the provisioning code, the cloud resources it provisions, and the resource properties, which may change over time.
Version controlling the provisioning code is the easiest of all. After all, it is a mature art. Keeping these files under Git or a similar tool is all that is needed. Since IaC is source code, it also enjoys other benefits such as IDEs, code reviews, continuous integration, and continuous deployment. IaC allows developers to express the intended state of the cloud environment. However, there is a constant gap between what is desired and what is actually deployed or running in the cloud.
To version control the cloud resources, we need to work with the cloud control plane APIs. These APIs are used to fetch all the available cloud resources and their properties at a given point in time. The version control system would also need to be intelligent enough to mark each resource's lifetime state as created, available, or deleted. We may call this a snapshotting operation of the cloud. The snapshot captures all the cloud resources and their properties available at that time.
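To make the lifetime marking concrete, here is a minimal sketch that compares two snapshots and labels every resource as created, available, or deleted. The snapshot shape (a plain dict keyed by resource ID), the function name, and the sample IDs are all assumptions for illustration. In practice the snapshots would be filled in from the control plane APIs.

```python
from typing import Any, Dict

# Assumed shape: a snapshot maps a resource ID to that resource's properties.
Snapshot = Dict[str, Dict[str, Any]]


def mark_lifetimes(previous: Snapshot, current: Snapshot) -> Dict[str, str]:
    """Label each resource by comparing an older snapshot with a newer one."""
    states: Dict[str, str] = {}
    for resource_id in current:
        # Present now: either it already existed, or it is new.
        states[resource_id] = "available" if resource_id in previous else "created"
    for resource_id in previous:
        # Present before but missing now: treat it as deleted.
        if resource_id not in current:
            states[resource_id] = "deleted"
    return states


# Hypothetical data: the instance is terminated, its volume survives,
# and a new security group shows up.
previous = {
    "i-0abc": {"type": "ec2:instance", "state": "running"},
    "vol-0123": {"type": "ebs:volume", "attached_to": "i-0abc"},
}
current = {
    "vol-0123": {"type": "ebs:volume", "attached_to": None},
    "sg-0456": {"type": "ec2:security-group"},
}

states = mark_lifetimes(previous, current)
print(states)
```

Here the volume comes out as available, the security group as created, and the instance as deleted, which matches the EBS-outlives-EC2 case mentioned earlier.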
While this looks plausible in theory, we need to understand the ground reality. Cloud resources seldom exist in isolation. A complex relationship graph binds these resources together. Thus any foolproof solution needs comprehensive coverage of cloud resources, because a single missing resource breaks the graph. However, this is a huge ask, given the speed at which the cloud vendors are introducing new services and expanding existing ones. At the time of writing, there are more than 250 cloud services available on AWS. In fact, from an IaC standpoint, Terraform has much better coverage than AWS’s own CloudFormation, which is often perceived as lagging behind.
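A small sketch shows why coverage matters: if the inventory misses even one resource type, edges in the relationship graph end up pointing at nothing. The resource IDs, the edge list, and the function name below are hypothetical. This is only an illustration of the idea, not how any particular tool models the graph.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (source resource ID, target resource ID)


def dangling_references(resources: Set[str], edges: Iterable[Edge]) -> List[Edge]:
    """Return the edges whose source or target is missing from the inventory."""
    return [
        (source, target)
        for source, target in edges
        if source not in resources or target not in resources
    ]


# Hypothetical inventory: security groups are not covered by the snapshot,
# so the instance's reference to its security group cannot be resolved.
inventory = {"i-0abc", "vol-0123", "subnet-0789"}
relationships = [
    ("i-0abc", "vol-0123"),
    ("i-0abc", "subnet-0789"),
    ("i-0abc", "sg-0456"),
]

broken = dangling_references(inventory, relationships)
print(broken)
```

The instance-to-security-group edge is reported as dangling, so any analysis that walks the graph from the instance would silently lose that part of the picture.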
So, are the cloud vendors making any effort toward cloud versioning?
Yes, mainly at two levels: versioning individual cloud resources, and introducing services for cloud resource configuration management.
In AWS, we now see versioning introduced at the resource level. For example, ECS task definitions are explicitly versioned. Similarly, launch templates have versions. While these versioning schemes are useful for the individual resources, they are not enough from an overall cloud perspective.
You still need snapshotted information about all cloud resources. We need a holistic approach covering all of the cloud across multiple accounts, multiple regions, and almost all cloud resources.
If you think this looks complex, yes, it is. And we still haven’t discussed drift detection (desired vs. actual) or reverting to a previous state. Those deserve a separate post of their own.
Naturally, once such versioned cloud information is available, are there benefits beyond those discussed earlier? There could be several, such as cloud linting, automatic suggestions for cloud resource drift (desired state vs. actual state), security vulnerability identification (via rules), notifications when resources are spawned accidentally (a common error at early-stage startups), and of course automatic rollback in case of errors. One further interesting use case is cloud visualization, which I believe will become essential in the near future.
Another interesting use case is querying the cloud resources and their relationships. Again, this is a topic that deserves its own dedicated post.
Fortunately, a few services have emerged in this space that can be helpful.
AWS Config is a cloud resource configuration management service from AWS itself. It works on the notion of a recorder, which needs to be configured to record (snapshot) the cloud resources. The recorder can be configured to include all cloud resources or a selected few. The resource properties are also recorded. These recorded configuration snapshots can then be stored in S3 for as long as 7 years (which is the default value). By default, the recorder works only for a single account and region. If you need it to work across multiple accounts and regions, you have to create aggregators explicitly. Sometimes this feels like an unnecessarily complex step.
In true AWS fashion, the cloud resources this service covers are limited: approximately 100 resource types are supported at the time of writing. Another issue with AWS Config is pricing. Each recorded resource configuration costs ~$0.003 per region. For medium to large clouds, this can become very expensive very quickly, especially when something breaks and keeps changing a resource property rapidly. Due to its pay-per-use model, the service pricing is also hard to predict.
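A rough back-of-the-envelope sketch shows how the per-recording price adds up. It assumes the ~$0.003 figure quoted above applies to each recorded configuration change. The resource counts and change rates are made-up inputs, and current AWS pricing should be checked before relying on any of these numbers.

```python
# Price per recorded configuration change, taken from the figure quoted in
# the text. This is an assumption; verify against current AWS pricing.
PRICE_PER_RECORDING_USD = 0.003


def monthly_recording_cost(
    resource_count: int,
    changes_per_resource_per_day: float,
    days: int = 30,
    price_per_recording: float = PRICE_PER_RECORDING_USD,
) -> float:
    """Estimate the monthly recording cost for one region."""
    recordings = resource_count * changes_per_resource_per_day * days
    return recordings * price_per_recording


# Hypothetical steady state: 5,000 resources, each changing twice a day.
steady = monthly_recording_cost(5_000, 2)

# Hypothetical runaway case: one broken resource changing every minute.
runaway = monthly_recording_cost(1, 24 * 60)

print(f"steady state: ~${steady:,.2f} per month")
print(f"one runaway resource: ~${runaway:,.2f} per month")
```

Under these assumptions the steady-state bill is about $900 a month for a single region, and one misbehaving resource adds roughly $130 on its own, which is why rapid property changes are the expensive case.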
Fugue.co is already well known for its open-source tool Regula, and many of you might be familiar with it. You can add cloud accounts to Fugue, which are then scanned (snapshotted). You can then establish a baseline, which serves as a golden snapshot against which drift is determined. Fugue has a working visualizer as well, which can sometimes make cloud management easier. It also supports ~190 cloud resource types, which is far better than AWS Config.
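The baseline idea can be illustrated with a short sketch: keep one golden snapshot and report anything in a later scan that was added, removed, or modified. This is not Fugue's implementation or API. The dict-based snapshot shape, the function name, and the sample resources are assumptions made for the example.

```python
from typing import Any, Dict, List, Optional, Tuple

Snapshot = Dict[str, Dict[str, Any]]
# (resource ID, kind of drift, changed properties as {name: (baseline, actual)})
DriftItem = Tuple[str, str, Optional[Dict[str, Tuple[Any, Any]]]]


def detect_drift(baseline: Snapshot, actual: Snapshot) -> List[DriftItem]:
    """Compare a scan against the golden baseline and list the differences."""
    drift: List[DriftItem] = []
    for resource_id in sorted(baseline.keys() | actual.keys()):
        if resource_id not in actual:
            drift.append((resource_id, "removed", None))
        elif resource_id not in baseline:
            drift.append((resource_id, "added", None))
        else:
            old, new = baseline[resource_id], actual[resource_id]
            changed = {
                name: (old.get(name), new.get(name))
                for name in sorted(old.keys() | new.keys())
                if old.get(name) != new.get(name)
            }
            if changed:
                drift.append((resource_id, "modified", changed))
    return drift


# Hypothetical baseline and later scan.
baseline = {
    "bucket-logs": {"public": False},
    "sg-0456": {"ingress_ports": [443]},
}
actual = {
    "i-0def": {"instance_type": "t3.micro"},
    "sg-0456": {"ingress_ports": [443, 22]},
}

drift = detect_drift(baseline, actual)
for item in drift:
    print(item)
```

The scan reports the missing bucket, the unexpected instance, and the security group whose ingress ports changed. Note that this only says what differs from the baseline, not when each change happened.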
However, Fugue does not maintain cloud resource history, so one cannot see how a resource's properties changed over time, only the drift. It also does not clearly identify the resource lifetime. Fugue seems to be more focused on compliance, and perhaps that explains the price tag it comes with ($1250/month).
CloudYali.io is in an exclusive free preview release, and you need to sign up for an invite. It already supports ~250 cloud resource types. You can add multiple accounts to the service, and it then snapshots each account. All the cloud resources from different accounts and regions can then be seen in a single place, so it can be used as a comprehensive resource inventory.
The UI makes it easy to look for resources of a specific type, and the results can be narrowed further by account, region, or even date range. CloudYali clearly marks the lifetime of each resource, identifying when a specific resource was created or deleted. Interestingly, it is possible to view all the changes to a resource's properties in a single place, which I think is convenient.
Given all these benefits, version-controlling the cloud may become a necessary part of a successful cloud strategy.
Thanks for reading!