Navigating Cloud Costs: How to Achieve Financial Visibility with Cloud Architecture

Cloud usage is becoming very cost-conscious for a myriad of reasons: global economic pressures, runaway cloud costs, failed migration programs, et cetera. Yet there is still an opportunity for every organization, whether its landing zone design (organizational governance structure) is existing or new, to achieve the cost visibility needed to operationalize Cloud Financial Operations ('FinOps') practices in a way that ensures best-of-breed outcomes for its scenario.

The Magical Cloud
- What does this mean?
- Why does this matter?
Architecting Account Hierarchy for Cost Visibility
- Common governance constructs
- Company A - Environments
- Company B - Business units
- Company C - Workloads & environments
Conclusion

The Magical Cloud

Cloud computing, once marketed as the magical computer tucked away in the sky and ready to solve your organization's most complex problems, has seen sentiment change drastically in the last few years as organizations learned, the hard and fast way, that it isn't that simple.

The lessons range from failed migrations, where chatty, bloated on-premises monoliths were lifted and shifted in a rush to promote 'cloud-first' principles with no supporting skills to refactor them once there, through to poorly planned governance that never enforced the basic guardrails that protect organizations from security events. Then there is the subject of this article: architecting to avoid bill shock by bringing visibility to your organization's cloud spend.

As Peter Drucker put it, "what gets measured, gets improved."

The reality is that the cloud is not in any way similar to on-premises workloads. With millions of SKUs representing every flavor of service, compute, memory, storage, and more available at the click of a button, engineers can single-handedly blow a hole in an organization’s IT operating budget. As Dann Berg puts it in his article, the 'money feedback loop' is significantly longer (if there is one at all) than the feedback loop for a failed technical change by an engineer.

What does this mean?

It means that by the time an organization realizes Intern Engineer A deployed a fleet of u-12tb1.112xlarge AWS EC2 instances, each with a whopping 448 vCPUs and 12TB of memory, for a Proof of Concept web app and then took off on holidays, it is facing a bill of at least 80K USD by end of month (EOM).
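To put a rough number on that scenario, the arithmetic can be sketched as below. This is an illustration only: the hourly rate is an assumed figure of about 109 USD per instance (check current pricing for your region), and the constant and function names are hypothetical.

```python
# Rough bill estimate for instances left running for a month.
# ASSUMPTION: the hourly rate is illustrative, not a quoted AWS price.
ASSUMED_HOURLY_RATE_USD = 109.0  # hypothetical on-demand rate per instance
HOURS_PER_MONTH = 730            # common approximation: 8,760 hours / 12


def estimate_monthly_bill(instance_count: int,
                          hourly_rate_usd: float = ASSUMED_HOURLY_RATE_USD,
                          hours: int = HOURS_PER_MONTH) -> float:
    """Return the cost of running `instance_count` instances for `hours` hours."""
    if instance_count < 0:
        raise ValueError("instance_count must not be negative")
    return instance_count * hourly_rate_usd * hours


if __name__ == "__main__":
    for count in (1, 3):
        print(f"{count} instance(s): ~{estimate_monthly_bill(count):,.0f} USD")
```

Under this assumed rate, a single forgotten instance already lands close to the 80K USD mark, and every additional instance in the fleet multiplies it.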

Now, this scenario is unlikely in that the intern was left to deploy without an approval workflow, but the fact is it can happen, and it does, every day of the week, worldwide. That kind of tin isn't available on-premises for anyone to spin up without jumping through hoops, and certainly not as quickly as the magical cloud enables.

Why does this matter?

It matters because cloud usage isn't showing signs of slowing down. Gartner forecast 20.4% growth in public cloud spending in 2022, with a total of $600B forecast to be spent worldwide by 2023. Yet according to Flexera, 32% of that significant worldwide spend is expected to be wasted in 2022 ($158.3B). Ouch.

Those numbers alone are eye-watering. Contrast them with a global economy trending toward more trying times, and with the 83% of organizations whose top cloud challenge is managing their cloud spend (Flexera), and it is inevitable that executives are already asking, or are about to ask, where can I save money?

Though to save money, to optimize, to operate efficiently, we first must know where spend is occurring across the organization.

So let's examine that.

Architecting Account Hierarchy for Cost Visibility

Whether your organization was an early cloud adopter or is exploring best-of-breed practices to establish new cloud capabilities, the answer to cloud visibility is not just #tagging. Several articles speak of tagging as if it's the panacea for all of an organization's allocation and visibility concerns.

As a cloud architect and FinOps lead working in Professional Services, where I am engaged at all ends of the spectrum across a variety of organizations' cloud journeys, I find that tagging is, in truth, one lever of many. The many articles that treat it as the whole answer don't provide the complete picture, which is why I am writing this article.

It all begins with how your organization has utilized its cloud service provider's (CSP) offerings to design the necessary governance upfront. You see, tagging comes after establishing the organizational structure, control policies, guardrails, etc. that reflect how your organization wants to organize itself. This could be by business units, environments, workloads, et cetera.

There is more than just cost to consider here, and depending on your organization's industry, you may find that governance constructs vary greatly. Below are a number of the common constructs I see, along with their pros and cons, to enable you to make the right decision for your organization.

The intent here is to 'shift left' FinOps capability so that it does not rely on one factor that is commonly just 'tacked on', i.e. tagging, but is instead 'baked in' to the design of your governance structure within your CSP, or CSPs if you are multi-cloud.

To discuss this, we must first clarify some general nomenclature so we are all on the same page. Across CSPs, the terminology used to represent the same logical construct differs, though in deployment we can achieve the same outcome. The below provides a comparison of Amazon Web Services (AWS) and Microsoft Azure (Azure) and how their governance hierarchy terminology differs yet enables the same outcomes.

Note: In all examples below, I will be utilizing AWS terminology, but please refer back to the below for a comparison of terms if I lose you!
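As a minimal sketch of that comparison, the lookup below pairs the AWS terms used in this article with their commonly cited Azure counterparts. Treat the pairings as approximations rather than an official equivalence table, since scopes differ in detail between the two CSPs; the dictionary and function names are hypothetical.

```python
# Approximate AWS -> Azure governance terminology.
# ASSUMPTION: commonly cited pairings, not an official equivalence table.
AWS_TO_AZURE = {
    "Organization root": "Root management group",
    "Organizational Unit (OU)": "Management group",
    "Account": "Subscription",
    "Service Control Policy (SCP)": "Azure Policy",  # closest counterpart, loosely
    "Tag": "Tag",
}


def to_azure_term(aws_term: str) -> str:
    """Look up the approximate Azure counterpart of an AWS governance term."""
    try:
        return AWS_TO_AZURE[aws_term]
    except KeyError:
        raise KeyError(f"No mapping recorded for AWS term: {aws_term!r}") from None


if __name__ == "__main__":
    for aws_term, azure_term in AWS_TO_AZURE.items():
        print(f"{aws_term:30} ~ {azure_term}")
```

So wherever an example below says 'account', an Azure reader can roughly read 'subscription', and wherever it says 'OU', roughly 'management group'.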

Common Governance Constructs

As discussed briefly above, the governance construct your organization deploys, or has deployed, may be based on what was available at the time (early adopter) or on the well-architected best practices that exist now. Either way, understanding how different industries architect their landing zones and governance constructs enables you to make better decisions, and the right trade-offs, to improve cost visibility in your organization now and moving forward.

Note: In the examples below, 'Management Account' is used as a collective term for potential 'other' accounts that are used for common governance, policy enforcement, security, etc. It is not meant in the AWS-specific sense, whereby the 'management account' is the central account used to manage the AWS Organizations hierarchy (formerly known as the master account). These accounts may include, but are not limited to:

Identity Account; centralized management of Active Directories, RBAC, etc.
Connectivity Account; centralized management and configuration of network resources such as AWS Direct Connect
Management Account; centralized management of monitoring, logging, and analytics, which may connect back on-premises to SIEM tooling like Splunk

So please don't get hung up on those management accounts.

Additionally, the examples provided below are by no means exhaustive, but they attempt to provide you with enough information to begin considering your organizational governance requirements. As such, you may need to organize your organizational units and their respective accounts in ways that support the scale, policies, network design, or management boundaries that best serve your organization's operational needs.

After all, one size does not fit all.

Company A - Environments

This organizational structure uses AWS Organizational Units (OUs) and breaks accounts up by environment, with all business units and their respective workloads mixed into each of the 'DEV', 'TEST', and 'PROD' accounts in this example.

This structure has the benefit of minimizing cost as common resources can be shared and logically the account creates a boundary for isolation that separates concerns. So even without tagging, we can attribute workloads to their environment.

However, because the structure is by ‘environment’, it is easy to see that tagging is crucial here if a 'chargeback' cost allocation method, or even a 'showback', is intended for each business unit and its workloads. Security is also a concern here, as the 'PROD' account is a honeypot of production workloads waiting to be exploited if not managed correctly.
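To illustrate why tagging carries the load in this structure, here is a minimal sketch of a showback calculation over one shared account. The line items, the 'business_unit' tag key, and the function name are all hypothetical, and costs are kept in whole cents to avoid floating-point rounding.

```python
from collections import defaultdict

UNTAGGED = "UNTAGGED"  # bucket for spend that carries no business-unit tag


def showback_by_tag(line_items, tag_key="business_unit"):
    """Sum cost (in cents) per value of `tag_key`; untagged spend gets its own bucket."""
    totals = defaultdict(int)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, UNTAGGED)
        totals[owner] += item["cost_cents"]
    return dict(totals)


# Hypothetical line items from a shared 'PROD' account.
prod_items = [
    {"resource": "vm-1", "cost_cents": 12_000, "tags": {"business_unit": "retail"}},
    {"resource": "db-1", "cost_cents": 30_000, "tags": {"business_unit": "finance"}},
    {"resource": "vm-2", "cost_cents": 8_000, "tags": {"business_unit": "retail"}},
    {"resource": "vm-3", "cost_cents": 5_000, "tags": {}},  # no owner recorded
]

if __name__ == "__main__":
    for owner, cents in showback_by_tag(prod_items).items():
        print(f"{owner:10} {cents / 100:>10.2f} USD")
```

The account boundary only tells us the spend is 'PROD'; anything that lands in the untagged bucket cannot be shown or charged back to a business unit at all.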

Company B - Business Units

This organizational structure uses OUs to represent business units, with all workloads of each business unit, and their respective 'environments', managed within it.

This structure has the benefit of minimizing cost as common resources can be shared, and logically the account creates a boundary for isolation that separates concerns similar to Company A. So even without tagging, we can attribute workloads to logical business units.

In an organization where showback or chargeback is used for cost allocation, the specific cost boundaries defined by the account structure may be enough. However, to understand which workload was utilizing a specific resource, and to break down costs further, tagging is required.
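As a rough sketch of that first level of allocation, the roll-up below attributes account-level totals to business units using nothing but the account structure. The account IDs, the mapping, and the function name are hypothetical.

```python
# Hypothetical account -> business-unit mapping, mirroring the OU layout.
ACCOUNT_TO_BUSINESS_UNIT = {
    "111111111111": "retail",
    "222222222222": "retail",
    "333333333333": "finance",
}


def cost_by_business_unit(account_costs_cents, account_to_bu=None):
    """Roll account-level totals (in cents) up to business units, with no tags involved."""
    mapping = ACCOUNT_TO_BUSINESS_UNIT if account_to_bu is None else account_to_bu
    totals = {}
    for account_id, cents in account_costs_cents.items():
        business_unit = mapping.get(account_id, "UNMAPPED")
        totals[business_unit] = totals.get(business_unit, 0) + cents
    return totals


if __name__ == "__main__":
    monthly = {"111111111111": 40_000, "222222222222": 10_000, "333333333333": 25_000}
    print(cost_by_business_unit(monthly))
```

This answers 'which business unit spent it?' for free, but not 'which workload or environment?'; that second question still needs tags inside each account.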

It is important to call out here that this organizational structure does introduce security risk similar to Company A. Because environments are mixed, engineers may, through negligence, inadvertently make a change meant for a 'DEV' environment that instead affects a 'PROD' workload. Strong development practices, CI/CD pipelines, code reviews, et cetera are important here, as is the use of AWS Service Control Policies (SCPs), AWS Control Tower, and AWS Organizations to help manage and enforce policy guardrails.
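As an example of such a guardrail, the sketch below builds an SCP document that denies launching EC2 instance types outside an allow-list, which is the kind of control that would have stopped the intern scenario earlier. It follows the general pattern AWS documents for restricting instance types with the ec2:InstanceType condition key; the allow-list, the Sid, and the function name are hypothetical, and the policy is only built and printed here, not attached to anything.

```python
import json


def build_instance_type_guardrail(allowed_instance_types):
    """Build an SCP document denying EC2 launches of types outside the allow-list."""
    if not allowed_instance_types:
        raise ValueError("at least one allowed instance type is required")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnapprovedInstanceTypes",  # hypothetical statement ID
                "Effect": "Deny",
                "Action": "ec2:RunInstances",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {
                    # Deny applies when the requested type matches none of these.
                    "StringNotEquals": {
                        "ec2:InstanceType": sorted(allowed_instance_types)
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = build_instance_type_guardrail({"t3.micro", "t3.small"})
    print(json.dumps(policy, indent=2))
```

Attached at the OU level, a policy along these lines would apply to every account beneath it, so the guardrail does not depend on each team remembering to configure it.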

Company C - Workload & Environment

This organizational structure is not common for many organizations but is common in high-security organizations where limiting the blast radius is important, such as my clients in the Defence & National Security sector. Organizational units may be used, but if so, their purpose is to implement control policies, as cost granularity is achieved by the very nature of the account structure.

Accounts are broken up by both workload and environment. As such, each workload will have at minimum two accounts, with all 'Non-Production' environments in one account and 'Production' in another. The number of accounts can increase where higher-level non-production environments such as 'Pre-Production' integrate with third parties to test end-to-end configuration and integration, and separation of concerns needs to be maintained.

This structure is undoubtedly the most expensive way to operationalize an organizational structure within a CSP, as no resources are shared. Workloads that communicate with each other as part of an end-to-end system must take additional measures to enable peering, etc. Additionally, costs are duplicated for the organization, as core resources such as AWS WAF are required in every account and are charged accordingly.

Given this structure, workloads have evident operational boundaries and policy enforcement. In turn, costs can be directly attributed to a specific workload and its respective environment. In this scenario, tagging becomes less crucial as costs are granular to the cent, with no shared costs to allocate back. This structure therefore may be enough for some organizations to have confidence in just 'paying the bill' and trusting teams to take action through cloud-native advisor tools.
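A minimal sketch of what that looks like in practice: with one account per workload and environment, a cost view by workload and environment falls straight out of the account-level bill. The account registry, account IDs, and names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    """One account per workload and environment pair (hypothetical registry entry)."""
    account_id: str
    workload: str
    environment: str


def cost_matrix(accounts, account_costs_cents):
    """Return {workload: {environment: cents}} directly from account-level totals."""
    matrix = {}
    for account in accounts:
        cents = account_costs_cents.get(account.account_id, 0)
        environments = matrix.setdefault(account.workload, {})
        environments[account.environment] = environments.get(account.environment, 0) + cents
    return matrix


ACCOUNTS = [
    Account("100000000001", "payments", "non-production"),
    Account("100000000002", "payments", "production"),
    Account("100000000003", "reporting", "non-production"),
    Account("100000000004", "reporting", "production"),
]

if __name__ == "__main__":
    bill = {"100000000001": 9_000, "100000000002": 41_000,
            "100000000003": 2_500, "100000000004": 7_500}
    print(cost_matrix(ACCOUNTS, bill))
```

No tags are consulted anywhere in this sketch; the account boundary alone carries the allocation, which is the trade-off this structure buys at the price of duplicated resources.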

However, when you begin to consider what resources are being used such as Kubernetes Clusters, and getting to the next level of visibility in large workloads with hundreds of VMs potentially, tagging becomes crucial.

As discussed above, it comes down to your organization's imperative. I have spoken with some organizations in the cryptocurrency exchange business that are not concerned at all about their spending and therefore don't care for visibility in a financial context (though they may from an SRE perspective). Others, such as government, are spending public funds and have tighter controls on how that money should be spent.

Conclusion

Understanding the organizational governance structure is the first step toward better cost allocation and visibility. Though tagging plays an incredibly important role in this, it is just a lever, and understanding why things are architected the way they are enables you to pull that lever more effectively, lowering the burden on your engineering teams and increasing the value-add certain tags have. Too many tags without a specific purpose simply create overhead. Don't be that team!

So get intimate and understand the why, and I promise you will achieve better outcomes for your FinOps team and, in turn, your organization.

This article represents my personal opinion only and not that of my employer.

Also published here.