Zombie capacity is any piece of infrastructure that looks like it's doing something but is, in reality, lying unused and should be killed. Zombie capacity can accumulate quickly and can be one of your largest infrastructure debts. The zombies come out of the dark when you get your cloud bill, when your users complain about your system's performance or availability, or when you look at cost or usage metrics that look shameful to you and are hard to defend to your CTO. It all comes down to understanding the dark side of running your applications or microservices on top of cloud infrastructure, which is capacity management.

[Figure: Capacity management is the dark side of cloud-native applications that is usually ignored.]

Capacity management is the science, and the art, of balancing performance, cost, and resources. You would like to get the best possible performance with the least amount of resources. But given the billing complexities of cloud infrastructure, you also want to manage which billing model you are using to maximize your return on investment (ROI).

Take a look at the illustration below to understand the factors impacting capacity management.

[Figure: Capacity management is the science, and art, of balancing performance, cost, and resources.]

- Application/microservices performance is defined as the user experience and overall system responsiveness.
- Resources are CPU, memory, and I/O in the case of VMs, or they can be higher-level PaaS resources, such as managed databases, middleware, etc. In this article I'm focusing on VMs (IaaS only).
- Cost consists of the billing models you use, such as reserved vs. on-demand VMs. It is also characterized by how you group resources. For example, do you use eight cores in one VM, or distribute them over four VMs? This grouping can make a huge difference in your bill.

Why Should I Care?

It depends on who you are and what you do :)

If you are a DevOps or SRE engineer, you want to:

- Save some sleepless nights when PagerDuty alerts go off because your users have a terrible user experience. On our platform, we've observed that poorly distributed resources and old scalability rules are the key causes of more than 25% of live site incidents.
- Avoid stressful monthly cloud bill reviews. A good chunk of engineering time goes to analyzing (and reacting to) cloud bills. You usually get those questions when the bill goes up significantly without clear business justification. For example, if your bill goes up by 30% in one month without adding that many users or features, this is a big deal for the business and the leadership in your company.

If you are a developer, you want to:

- Learn how to write better cloud-native microservices by correlating your code and changes to the user experience. Know whether your deployed microservice is getting better performance for the resources it got. For example, is the new feature or recent bug fix consuming too many resources?
- Understand how your microservice behaves under real workloads. You want to know if your microservice starts to behave unexpectedly at specific workloads or conditions, without doing any explicit instrumentation.

If you are an engineering manager, you want to:

- Run lean and avoid infrastructure debt. Infrastructure debt is similar to your architectural or code debts. You need to make the best decisions to prevent a slowdown in the fullness of time. This kind of debt accumulates when the team keeps allocating the wrong infrastructure under the pressure of moving fast. It becomes harder and harder to keep releasing with decent velocity under the increasing demand to run efficient infrastructure.

Why is Capacity Management a Pain In The Neck?

Capacity management is a moving target. It is impacted by users' workloads, changing application/system architecture, and evolving infrastructure. Multiple persons and roles impact application and infrastructure architecture. They work with different motivations that are sometimes conflicting.

[Figure: Factors impacting capacity management of cloud-native applications and infrastructure: users' workloads, changing application/system architecture, and evolving cloud infrastructure.]

Also, each one of these three factors moves at a different velocity.

- Users' workloads change every few seconds or minutes, which impacts your application's performance and infrastructure utilization.
- Application architecture changes every few months, if not faster, depending on the team's velocity. It impacts the user experience and the use of infrastructure capacity.
- Infrastructure technologies evolve every few months. This impacts the performance of the application, and eventually the user experience. For example, using compute-optimized instances improves the performance of CPU-intensive microservices. Using the right type of disks and network interfaces positively impacts your databases.

What Should I Do?

If you are a DevOps or SRE, you need to focus on the following:

User Experience

Measure, characterize, and link users' workloads to microservices.

- Characterize workloads by measuring their intensity and latency throughout the day.
- Figure out if there are hourly, daily, weekly, or seasonal patterns. Quantify these patterns, i.e., the number of API calls for each feature and the variability of the workload.
- Profile each feature by identifying the impacted microservices and measuring the CPU, memory, and I/O consumed to satisfy each API call (or feature).

Performance and Profile of Microservices

- For each microservice, understand if you are over- or under-budgeting resources. If you don't have a budget, at least create a baseline from the workloads you measured in the previous step.
- Profile different microservices by identifying whether they are CPU, memory, or I/O intensive.

Infrastructure

- Identify zombie VMs. These are VMs that can be killed and have their current workloads moved to other VMs.
Just look at the three common dimensions, CPU, memory, and I/O (mainly network), to identify these underutilized VMs.
- Match service profiles to the right VMs. Running your microservices on a general-purpose compute VM will not save your day. If your services are compute-intensive, you need to run them on compute-optimized instances, such as the C5 family on AWS. For such services, the C5 family will typically give you much higher performance and scalability value for each dollar you pay to AWS.

If you are a developer, you need to focus on:

- Create a baseline for your microservices. How much horsepower (CPU, memory, and I/O) does your microservice need to serve a specific unit of workload per second? For example, how many API requests per second can your microservice serve with 2 CPU cores, 4 GB of memory, and a 10 Gbit/s network? Baseline your microservice at different workloads.
- Track whether the baseline you created changes from one release to another. A common mistake here is not tracking minor releases. Sometimes a minor release introduces bugs that disrupt that baseline significantly. You want to know about that as soon as it happens.

If you are an engineering manager, work with your team on these:

- Figure out the right KPIs (Key Performance Indicators). A single KPI won't give you the full picture of your capacity. Your team should track at least one KPI per capacity dimension. Here are some examples:
  (1) Cost KPIs: cost per user or cost per operation (direct and indirect), or cost per microservice,
  (2) Performance KPIs: API latency (90th, 95th, and 100th percentiles of users),
  (3) Resources KPIs: cost per CPU, effective CPU cost (utilization included), cost per memory GB.

TLDR

- Capacity management is the dark side of cloud-native applications that is usually ignored.
- Capacity management is the science, and art, of balancing performance, cost, and resources.
- You should care about capacity management because it will save you sleepless nights and difficult questions about the cloud provider bill, level up your cloud-native software development skills, and eliminate cloud infrastructure debt that can accumulate very quickly.
- Capacity management is a pain due to the many factors impacting it, namely: users' workloads, changing application/system architecture, and evolving cloud infrastructure.
- Measure, characterize, and link users' workloads to microservices.
- Create a reasonable KPI for each of the factors impacting your capacity management (details above).
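
To make two of the ideas above more concrete, here is a minimal Python sketch of flagging zombie VMs by their CPU, memory, and network utilization, and of computing an effective CPU cost. Everything in it is an assumption for illustration: the `VmUsage` record, the 10% utilization threshold, the sample fleet and prices, and the reading of "effective CPU cost (utilization included)" as cost per core divided by average CPU utilization. In practice the numbers would come from your monitoring and billing data.

```python
from dataclasses import dataclass


@dataclass
class VmUsage:
    """Hypothetical record: average utilization of one VM over a window (0.0 to 1.0)."""
    name: str
    cpu: float
    memory: float
    network: float
    monthly_cost: float  # dollars per month (made-up figures below)
    cores: int


def find_zombies(vms, threshold=0.10):
    """Return VMs whose CPU, memory, and network utilization are all below threshold.

    The 10% default is an illustrative assumption, not a recommended value.
    """
    return [vm for vm in vms if max(vm.cpu, vm.memory, vm.network) < threshold]


def effective_cpu_cost(vm):
    """Cost per core divided by CPU utilization (one possible definition)."""
    if vm.cpu <= 0:
        return float("inf")  # paying for cores that do no work at all
    return vm.monthly_cost / vm.cores / vm.cpu


fleet = [
    VmUsage("api-1", cpu=0.50, memory=0.55, network=0.30, monthly_cost=140.0, cores=4),
    VmUsage("batch-old", cpu=0.02, memory=0.05, network=0.01, monthly_cost=140.0, cores=4),
]

print([vm.name for vm in find_zombies(fleet)])  # ['batch-old']
print(effective_cpu_cost(fleet[0]))             # 70.0
```

In this made-up fleet, `api-1` costs 35 dollars per core on paper but 70 dollars per core once its 50% CPU utilization is taken into account, which is the kind of gap an effective-cost KPI is meant to surface. A VM flagged by `find_zombies` is only a candidate: check what is running on it and where that workload could move before killing it.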