One of the most common concerns when moving to the cloud is cost. Given that the cloud allows you to turn IT costs from CAPEX (long-term investments, e.g., hardware equipment and software licenses) into OPEX (day-to-day operating expenses), it’s crucial to choose the right services and estimate their costs properly. In this article, we’ll look at the common pitfalls and discuss how you can avoid them to truly benefit from the cloud’s elasticity.
The lift and shift approach means moving an exact copy of your workload to the cloud with as few changes as possible. This pattern can be useful if you want to move to the cloud quickly, but it often leads to suboptimal usage of your resources. AWS acknowledged that this is a difficult problem by creating services that make such migrations easier (CloudEndure Migration and AWS Server Migration Service). Still, for the best possible resource utilization, it’s best to consider rearchitecting your solution for the cloud.
With lift and shift, you are potentially leaving a lot of money on the table in the long term. You would also likely miss out on many benefits your cloud provider can offer. For instance, by choosing the fully managed Amazon Aurora over a traditional Postgres instance, you can gain (among other things) up to three times the throughput of standard PostgreSQL, storage autoscaling, and low-latency read replicas. This may be the reason why Aurora is currently one of the most popular and fastest-growing services on AWS.
It’s difficult to improve something if you don’t have enough data to make an informed decision about it. If you have no way of tracking how your cloud resources perform and what costs they incur, it’s difficult to optimize their utilization.
It’s considered a best practice to tag your resources based on projects or organizational units to correctly allocate costs to the corresponding services.
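Here is a minimal sketch of both steps with boto3 (the tag keys, resource IDs, and dates below are hypothetical, and cost allocation tags must first be activated in the Billing console before Cost Explorer can group by them):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Tag an instance so its costs can later be attributed to a project and team.
ec2.create_tags(
    Resources=["i-0123456789abcdef0"],  # hypothetical instance ID
    Tags=[
        {"Key": "project", "Value": "checkout-api"},
        {"Key": "team", "Value": "payments"},
    ],
)

# Once "project" is activated as a cost allocation tag, Cost Explorer
# can break monthly spend down per project.
ce = boto3.client("ce", region_name="us-east-1")
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2021-06-01", "End": "2021-07-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "project"}],
)
for group in response["ResultsByTime"][0]["Groups"]:
    print(group["Keys"], group["Metrics"]["UnblendedCost"]["Amount"])
```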
Managing cloud architecture is not a one-off process. It’s a continuous practice of monitoring and evaluating what you use, how you use it, and why. Perhaps your original assumptions about the growth of a specific application turned out not to be entirely right, and making a change could significantly lower costs.
For instance, consider an overprovisioned Kubernetes cluster with many more nodes than needed. Perhaps moving to a serverless version (EKS on Fargate) makes more sense in such a scenario.
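As a sketch of what that move can look like, the snippet below creates an EKS Fargate profile with boto3 so that pods in a given namespace run on Fargate instead of dedicated worker nodes (the cluster name, role ARN, and subnet ID are hypothetical):

```python
import boto3

eks = boto3.client("eks", region_name="eu-west-1")

# Route all pods from the "default" namespace to Fargate
# instead of self-managed EC2 worker nodes.
eks.create_fargate_profile(
    fargateProfileName="default-namespace",
    clusterName="demo-cluster",  # hypothetical cluster name
    podExecutionRoleArn="arn:aws:iam::123456789012:role/eks-fargate-pods",
    subnets=["subnet-0a1b2c3d4e5f67890"],  # Fargate requires private subnets
    selectors=[{"namespace": "default"}],
)
```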
Leaving “zombie” resources running unmonitored is not as uncommon as you may think. In a larger organization, it can happen that some projects get abandoned and the corresponding resources remain active due to incomplete handover processes.
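One simple heuristic for surfacing zombie candidates is listing running instances that carry no ownership tag. A sketch with boto3, assuming the hypothetical "project" tag convention from above:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# List running instances that have no "project" tag --
# prime candidates for a zombie-resource review.
paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"] for t in instance.get("Tags", [])}
            if "project" not in tags:
                print(f"Untagged (possible zombie): {instance['InstanceId']}")
```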
As software engineers, we may sometimes be tempted to build our own custom solutions and services for everything. A potentially better approach is to first do proper research into what’s already available. For example, before building a custom job scheduler or message broker, it’s worth checking whether EventBridge or SQS already covers the use case.
In general, if there is a serverless or fully-managed solution, it makes sense to at least consider it before investing too much time and effort into your own solution that you would have to maintain entirely yourself.
When reading Reddit threads or blog posts, I often see engineers who are reluctant to use serverless or container orchestration platforms simply because all they know is EC2 and manually administered servers. They assume it’s all just new technology that will “come and go” and that there is, therefore, no merit in moving to container orchestration platforms, serverless, or other cloud services. This seems to be a close-minded approach. It’s better to challenge our assumptions and judge new technologies with clear facts, costs, and performance benchmarks, rather than by skepticism towards what’s new.
If you created an EC2 instance for every service and tool you manage, you would likely end up with a maintenance nightmare. But if you instead deploy each of your services as a container on a Kubernetes (EKS) or ECS cluster, you can pack many more workloads onto a single server instance thanks to dynamic port mapping and the more compact resource utilization of containers (e.g., shared image layers).
Container orchestration platforms help you balance the load between instances and keep your workloads healthy. They take the capacity guesswork, to some extent, out of the picture: you specify how many container instances should be running at all times, and the control plane ensures that it happens, just as you defined it.
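A minimal sketch of both ideas on ECS with boto3 (the family, image, cluster, and service names are hypothetical; hostPort=0 requests a dynamic host port, which applies to the EC2 launch type with bridge networking):

```python
import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")

# hostPort=0 asks ECS for a dynamic host port, so many copies of the
# same container can be packed onto one EC2 container instance.
ecs.register_task_definition(
    family="checkout-api",
    containerDefinitions=[{
        "name": "checkout-api",
        "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/checkout-api:latest",
        "memory": 512,
        "portMappings": [{"containerPort": 8080, "hostPort": 0}],
    }],
)

# The control plane keeps exactly this many tasks running at all times.
ecs.update_service(
    cluster="demo-cluster",
    service="checkout-api",
    desiredCount=3,
)
```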
If you can easily load balance your workload across many containers or serverless resources, then you no longer have to guess which EC2 or RDS instance size will be appropriate for your use case.
If you only consider the hardware or service costs, you may end up thinking that many resources can be more cost-effective on-prem. But if you add up the costs of additional maintenance, upgrades, and employees managing those servers, that’s an entirely different story.
If you scale your resources purely based on your current situation, you may fail to take into account how your needs may change in the future. What if your business and data grow much faster? What if it turns out to be the opposite? Is your application still easy to change and adapt to unknown future scenarios? And finally, will you be able to find and retain enough employees who can support those needs in the long run?
At the other extreme, if you want to be cautious, you may be tempted to overprovision everything to make sure you are ready for usage spikes. It’s a good strategy provided that you can justify the spikes based on past usage patterns, but a bad one if you are doing it purely on gut feeling.
The cloud enables elasticity in the sense that you can add nodes to your clusters, load balance the workload across more containers, or increase the number of vCPUs or the memory size when you see the need for it. If configured and monitored properly, there is no need to overprovision anything. I’m not saying that right-sizing is easy (far from it), but with good processes and automation in place, it’s doable and can save significant costs, especially when operating numerous resources at scale.
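One piece of that automation is target-tracking autoscaling. Here is a minimal sketch for an ECS service with boto3 (the cluster/service names and the 2–10 task range are hypothetical):

```python
import boto3

scaling = boto3.client("application-autoscaling", region_name="eu-west-1")

# Allow the ECS service to scale between 2 and 10 tasks...
scaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/demo-cluster/checkout-api",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=10,
)

# ...and keep average CPU around 60% instead of overprovisioning for peaks.
scaling.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/demo-cluster/checkout-api",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
    },
)
```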
Overprovisioned prod resources
Sometimes the bottlenecks are not the compute resources, but rather a poorly chosen data store. It’s naturally use-case dependent, but databases often constitute the main bottleneck of a scalable architecture, so it’s worth periodically evaluating whether your data store still matches your access patterns and growth.
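For read-heavy relational workloads, one common mitigation is offloading queries to read replicas. A sketch with boto3 (the instance identifiers and instance class are hypothetical):

```python
import boto3

rds = boto3.client("rds", region_name="eu-west-1")

# Offload read-heavy traffic from the primary instance to a replica.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="orders-db-replica-1",
    SourceDBInstanceIdentifier="orders-db",
    DBInstanceClass="db.r5.large",
)
```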
One possible solution to optimize your cloud resource utilization is to leverage automation. For instance, with Dashbird, you can keep track of your under- and overprovisioned resources and get notified about them. Using the well-architected lens dashboard, we can see that our ECS cluster running on EC2 instances (a non-serverless data plane) had a CPU utilization of over 90% within the last hour.
Well-architected lens dashboard
Then, we can drill down into specific time intervals and inspect further why this spike occurred.
Underprovisioned ECS cluster reaching the CPU capacity limits
At the same time, another containerized service may be overprovisioned, potentially leaving money on the table. Having this information allows you to optimize your resource configuration based on the actual usage patterns.
Overprovisioned ECS service
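If you prefer to pull the underlying numbers yourself, the same utilization data is also available from CloudWatch. A minimal sketch that fetches 24 hours of average CPU utilization for a (hypothetical) ECS service with boto3:

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

# Average CPU utilization of one ECS service over the last 24 hours,
# in 5-minute intervals.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "demo-cluster"},
        {"Name": "ServiceName", "Value": "checkout-api"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=24),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1))
```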
In this article, we investigated common pitfalls when sizing your cloud resources and discussed how to avoid them to truly benefit from the cloud’s elasticity. By making use of container orchestration platforms, serverless and fully-managed solutions, and by continuously monitoring your usage patterns over time, you can optimize your architecture for performance and costs.
Previously published at https://dashbird.io/blog/sizing-cloud-resources-mistakes/