Avoid Vaporizing All of Your Cloud Budget

Easy come, easy go.

Here is my advice on how to avoid the most common cloud budgeting mistakes.

#1 Surprise consumption

Clouds operate on a pay-per-use model. That means you are billed, typically monthly, for what you have consumed. A good piece of advice is to keep control of the components of your architecture that can consume resources unchecked, such as S3 storage buckets. Otherwise, you will get a large and inconvenient bill as a surprise gift. Let's jump into the details.

Here are a few common mistakes you should avoid:

You have automatic and regular backups in place for a lot of systems and data. Unfortunately, you forgot to define a policy that discards old backups. While your project is small, this might not be an issue, which is why such a policy is often forgotten. However, once the project grows and you back up 1 TB of data each day, your wallet will begin to suffer because your total backup size becomes enormous.
If your application allows users to upload content, you should also implement upload limits to prevent abuse. Otherwise, malicious users can upload tons of useless data that will be charged to your cloud account.
What applies to the backup example above also applies to logs. All your applications create logs, and storing those logs on an external system (like an S3 bucket) is good practice. You should decide how long you want to keep your logs so that your storage costs stay lean.
All cloud providers have a feature to create snapshots of your instances. Keep in mind that those snapshots keep consuming storage until you delete them, even when the corresponding instance has already been terminated.
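
The retention idea behind the backup, log, and snapshot points can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions: the function name `expired_backups`, the 30-day default, and the in-memory inventory are hypothetical, and in practice you would usually rely on the lifecycle or expiration rules of your storage service instead of your own script.

```python
from datetime import datetime, timedelta


def expired_backups(backups, now, retention_days=30):
    """Return the names of backups older than the retention window.

    `backups` maps a backup name to its creation time. The mapping is a
    hypothetical inventory; a real one would come from your provider's API.
    """
    cutoff = now - timedelta(days=retention_days)
    return sorted(name for name, created in backups.items() if created < cutoff)


# Illustrative data only.
inventory = {
    "db-2024-01-01": datetime(2024, 1, 1),
    "db-2024-02-15": datetime(2024, 2, 15),
    "db-2024-02-29": datetime(2024, 2, 29),
}

# Only the January backup is older than 30 days on 1 March 2024.
print(expired_backups(inventory, now=datetime(2024, 3, 1)))
```

The same check works for log archives and orphaned snapshots: list what exists, compare it against a retention window, and delete what falls outside of it.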

#2 Choosing the most expensive VM by accident

"I don't care what it costs, we need to scale pronto. Get me the largest VM they have."

Your manager yells this hastily through your office door, because the application unexpectedly went viral and the backend is choking on requests.

OK, boss. Depending on which cloud provider you're running on, this 'largest' VM will look like this:

AWS: 448.00 vCPU, 6144.00 GB, 1310 USD per day
GCP: 224.00 vCPU, 896.00 GB, 62 USD per day
Azure: 416.00 vCPU, 5700.00 GB, 1190 USD per day

And if you leave instances of that size running, they will eat up your year's budget in no time.

Prices are correct at the time of writing.
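
To get a feeling for how fast this adds up, here is a small back-of-the-envelope calculation in Python. It simply multiplies the daily prices listed above by 30 and 365 days; the function name `projected_cost` is my own, and the calculation assumes the instance runs around the clock at the listed price.

```python
# Daily prices in USD, taken from the list above.
DAILY_PRICE_USD = {"AWS": 1310, "GCP": 62, "Azure": 1190}


def projected_cost(daily_price_usd, days):
    """Cost of leaving one instance running for `days` full days."""
    return daily_price_usd * days


for provider, price in DAILY_PRICE_USD.items():
    month = projected_cost(price, 30)
    year = projected_cost(price, 365)
    print(f"{provider}: {month:,} USD per 30 days, {year:,} USD per year")
```

For the AWS instance this comes to 39,300 USD for 30 days and 478,150 USD for a full year.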

#3 Getting your admin account credentials leaked

If this catastrophic event happens to you, there are mostly two scenarios for your cloud account. Either someone malicious deletes all your data, instances, and configuration and leaves you with the big task of restoring everything. Or they spawn a large number of instances in your account to mine crypto for themselves. In both scenarios, you will have to foot the bill in some way.

#4 Overcomponentization

The days of the monolithic application architecture are over. Developers everywhere are breaking up monoliths into smaller services (components) to make further development easier and to scale up the parts that have become performance bottlenecks. Microservices have become a huge trend in modern application development.

But where there is light, there is also shadow. Overcomponentization is a bad practice that can occur here. When you split up the monolith into too many components, you end up with a large set of components that each need their own hosting instance in the cloud. This entails more maintenance tasks to keep each operating system up-to-date, more monitoring, and more network components like load balancers.

You can avoid these side effects by using containers and a container manager like Kubernetes instead of VMs. With Kubernetes, you can manage your components and the host VMs they run on more effectively and with less overhead than running each service in its own VM.

#5 A blast from the past - Instances that became unused without anyone noticing

This is kind of a result of #4 Overcomponentization. When your application environment has developed into a large web of services, it's hard to keep a clear, up-to-date overview of the services' utilization and their interdependencies. Over time, some services will be hardly used, and others become completely outdated but stick around: somewhere in this web there might be a service that still depends on them, it's hard to tell, and so you keep them running.

Sounds familiar? Years later, you stumble upon some instances and wonder what they were used for.

It's quite easy to retire a service dependency in code. Just use v2 instead of v1 in the API URL. However, organizations rarely have a decommissioning process that takes care of outdated service instances. And every instance that is online in the cloud costs money and takes a bite out of your cloud budget, no matter whether the instance is used for something meaningful or not.

My recommendation to counter this problem is the following:

Track your services (this can be as simple as a markdown table): list them all, together with their dependencies on other services.
Establish a developer meeting each quarter to review which services are still used and which can be decommissioned.
Be brave and decommission what's no longer needed.
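
If the service list is kept in a machine-readable form, the quarterly review can start from a simple reachability check. The following Python sketch is only an illustration: the service names, the `dependencies` mapping, and the notion of "entry points" (services that users call directly) are my own assumptions. It flags services that no entry point depends on, directly or indirectly; whether they can really be switched off still needs a human decision.

```python
def decommission_candidates(dependencies, entry_points):
    """Return services that cannot be reached from any entry point.

    `dependencies` maps each service to the services it calls.
    """
    seen = set()
    stack = list(entry_points)
    while stack:
        service = stack.pop()
        if service in seen:
            continue
        seen.add(service)
        stack.extend(dependencies.get(service, []))
    return sorted(set(dependencies) - seen)


# Hypothetical inventory: orders-v1 is only used by an old report job.
services = {
    "web": ["orders-v2", "auth"],
    "orders-v2": ["auth"],
    "orders-v1": ["auth"],
    "legacy-report": ["orders-v1"],
    "auth": [],
}

print(decommission_candidates(services, entry_points=["web"]))
# → ['legacy-report', 'orders-v1']
```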

#6 Not actively managing instance sizes

It's quite normal for an application to have ups and downs in its utilization over the course of a year. For example, a webshop would usually see roughly the same utilization all year, but experience strong peaks during special events like Christmas or Black Friday sales.

Every good cloud architect and software developer should put some thought into how the system will be utilized and - generally speaking - what could lead to peaks in the application's utilization. A bad idea is to just pick a large enough instance and let it run. That kind of thinking made sense in the times of traditional data centers, where scaling a server's hardware would take months.

In the cloud, you have faster options to scale. By just picking an instance size and letting it run, you leave a lot of money on the table, because most of the time you are running an underutilized instance - meaning you are paying for resources you are not using. On the other hand, when a utilization peak unexpectedly occurs, you will still be caught out with an instance that is not large enough.

All the major cloud providers have something in their portfolio to address exactly this problem.

In AWS they are called Auto Scaling groups (ASGs), in Azure Virtual Machine Scale Sets, and in GCP managed instance groups (MIGs). They have different names but do the same thing: scale a group of identically configured instances based on an event. The event can be as simple as a schedule or as sophisticated as a metric like CPU utilization, average network usage, or the number of requests. Some cloud providers also allow you to set up custom metrics that fit your application better.

Generally, such automatic scaling groups can be configured effectively with just three parameters: the minimum and the maximum number of instances you allow the group to run at the same time, and the desired number of instances. The desired number is the count the group currently aims for; it moves between the minimum and the maximum as demand changes, and this is what saves your cloud budget. New instances are launched when needed, and old instances are terminated when they are no longer needed. This keeps the number of running instances close to the optimum - optimal for your application and your budget.
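
How the three parameters interact can be sketched in Python. Please read this as a simplified model, not as the algorithm of any particular cloud provider: the proportional rule and the function name `desired_instances` are my own assumptions. The point is only that the desired count follows the load and is always clamped between the minimum and the maximum.

```python
import math


def desired_instances(current, cpu_utilization, target_utilization, minimum, maximum):
    """Scale the current instance count so that average CPU moves toward the target.

    Simplified proportional rule; the result never leaves [minimum, maximum].
    """
    if current < 1 or target_utilization <= 0:
        raise ValueError("current must be >= 1 and target_utilization > 0")
    wanted = math.ceil(current * cpu_utilization / target_utilization)
    return max(minimum, min(maximum, wanted))


# Two instances at 90 % CPU with a 50 % target: scale out to 4.
print(desired_instances(2, 90, 50, minimum=1, maximum=5))   # → 4
# Four instances at 10 % CPU: scale in to the minimum of 1.
print(desired_instances(4, 10, 50, minimum=1, maximum=5))   # → 1
# Four instances at 100 % CPU would need 8, but the maximum caps it at 5.
print(desired_instances(4, 100, 50, minimum=1, maximum=5))  # → 5
```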

#7 Autoscaling typo

This one is half-serious. Imagine you set your auto-scaling group to a minimum of 1 and a maximum of 5, but you didn't notice that you typed 55 instead of 5. And then, indeed, a load peak occurs. However, instead of a Black Friday sales event that generates revenue, it is a DDoS attack. Your 55 instances bravely defend against the attack, but your cloud budget falls victim. You might also overlook the DDoS entirely, because your 55 instances cover for you.
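
A cheap safeguard against this kind of typo is a sanity check that runs before the configuration is applied. The Python sketch below is a hypothetical example of such a check; the function name `validate_scaling_limits` and the cap of 10 instances are assumptions that you would adapt to your own project.

```python
def validate_scaling_limits(minimum, maximum, sanity_cap=10):
    """Reject scaling limits that are inconsistent or suspiciously large."""
    if minimum < 0:
        raise ValueError("minimum must not be negative")
    if maximum < minimum:
        raise ValueError("maximum must not be smaller than minimum")
    if maximum > sanity_cap:
        raise ValueError(
            f"maximum of {maximum} exceeds the sanity cap of {sanity_cap}; "
            "raise the cap deliberately if this is intended"
        )
    return minimum, maximum


print(validate_scaling_limits(1, 5))  # → (1, 5)

try:
    validate_scaling_limits(1, 55)
except ValueError as error:
    print(f"rejected: {error}")
```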

#8 Keeping build servers on hot-standby

If you have to keep build servers (or test servers) ready on hot-standby, it will certainly pay off to monitor the demand. Often you will find that at certain times of the week your developers are not active and those servers sit idle. These could be weekends, early mornings, or late evening hours.

For those times, you should set up a scheduler that uses the API of your cloud provider to start and stop instances automatically. Instances that don't run save you money.
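
The decision logic of such a scheduler is small. Here is a minimal Python sketch, assuming a working window of 07:00 to 20:00 on weekdays; the function name `should_run` and the hours are my own assumptions. The actual start and stop calls are left out on purpose, because they depend on your cloud provider's API.

```python
from datetime import datetime


def should_run(moment, start_hour=7, end_hour=20):
    """Return True if build servers should be up at `moment`.

    Servers run Monday to Friday from start_hour (inclusive) to end_hour (exclusive).
    """
    is_weekday = moment.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start_hour <= moment.hour < end_hour


print(should_run(datetime(2024, 3, 4, 10)))  # Monday morning → True
print(should_run(datetime(2024, 3, 2, 10)))  # Saturday morning → False
print(should_run(datetime(2024, 3, 4, 22)))  # Monday late evening → False
```

A job that runs every few minutes could compare this result with the current state of each instance and start or stop it accordingly.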

These were my 8 tips to protect your cloud budget. Do you agree with this list, or do you think I missed some crucial advice? You can reach me here or check out our website to find the best cloud VM price for your next project.

Ciao!

Price data was pulled from our website at the time of writing.