Recently, a team I know deployed their new code to production using all the right tools. Built on a microservice architecture, the project had four services deployed on AWS Lambda, each with its own database, properly scoped IAM policies, everything automated with CloudFormation templates and CI/CD pipelines, and a dedicated platform team to support them. Yet in less than three months, they were struggling with a string of issues: Lambda cold starts, concurrency limits in the account, latency from network calls, and so on. When I thought back on why it happened, it was clear: this isn’t a story about better tooling, the latest technology, or even the cloud team’s expertise - it’s about the need to think about infrastructure during the initial application design, not after.
Over the past decade, cloud adoption has increased drastically across companies of all sizes, because the promise is compelling. Cloud providers market “pay per use” and emphasize ease of use: AWS lists “Easy to Use” as the very first benefit of using AWS, and Google Cloud says it “helps developers build quickly…”. While these claims are true relative to on-premise hosting, the complexity hasn’t disappeared; it has just shifted form.
Setting up and managing cloud infrastructure is complex by design. Cloud providers have abstracted many low-level details, but the complexity has not been erased - it has shifted to engineers in different forms. Engineers today face several key challenges:
Cloud solutions have opened the floodgates with an abundance of choices. Even a seemingly simple decision like choosing a compute service becomes complex because it fundamentally shapes how the whole application is built. For instance, choosing between EC2, containers on ECS/EKS, and Lambda dictates the programming model (long-running processes versus short-lived stateless handlers), how the application scales, and where state can live.
To host even a simple application, we need to orchestrate a bunch of interconnected services. A basic web application on EC2 needs network configuration (VPC, subnets, route tables, ACLs, security groups), IAM roles, auto scaling groups, load balancers, and more. Each of these components needs to work in harmony, and a misconfiguration in any one area can impact the entire system. For instance, a team deployed a few REST APIs as Lambda functions in one account but reused the API Gateway authorizer from a different AWS account. Skipping one step - connecting the VPCs - left the Lambdas in the new account outside of a VPC, so every call between the Lambdas, the gateway authorizer, DynamoDB, and ElastiCache went over the public internet, putting the entire application at a security risk while also increasing request latency.
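A misconfiguration like this is easy to miss by eye but easy to catch programmatically. Here is a minimal audit sketch, assuming boto3 and AWS credentials are already configured, that flags Lambda functions with no VPC attachment:

```python
import boto3

lambda_client = boto3.client("lambda")

# ListFunctions returns each function's configuration, including its
# VpcConfig; a function outside any VPC has no VpcId set.
paginator = lambda_client.get_paginator("list_functions")
for page in paginator.paginate():
    for fn in page["Functions"]:
        if not fn.get("VpcConfig", {}).get("VpcId"):
            print(f"WARNING: {fn['FunctionName']} is not attached to a VPC")
```

A check like this can run in the CI/CD pipeline, so the gap is caught at deploy time rather than discovered through latency graphs.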
Cloud’s promise of abstraction often leads to a false sense of simplicity. In reality, engineers must now understand not just traditional infrastructure concepts but also how cloud providers have wrapped those concepts in layers of web services. For instance, setting up a production-ready RDS instance requires understanding of instance classes, storage types and IOPS, Multi-AZ failover, subnet and parameter groups, backup windows, and encryption settings.
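To make that concrete, here is a hedged sketch of what those decisions look like when actually written down, using boto3; every identifier and value here is a placeholder, not a recommendation:

```python
import boto3

rds = boto3.client("rds")

# Each parameter below encodes a decision the developer has to understand:
rds.create_db_instance(
    DBInstanceIdentifier="orders-db",             # hypothetical name
    Engine="postgres",
    DBInstanceClass="db.m6g.large",               # sizing decision
    AllocatedStorage=100,
    StorageType="gp3",                            # storage/IOPS decision
    MultiAZ=True,                                 # availability decision
    StorageEncrypted=True,                        # security decision
    BackupRetentionPeriod=7,                      # recovery decision
    DBSubnetGroupName="private-db-subnets",       # networking decision
    VpcSecurityGroupIds=["sg-0123456789abcdef0"], # placeholder
    MasterUsername="dbadmin",
    MasterUserPassword="fetch-from-secrets-manager",  # in practice, pull from Secrets Manager
)
```

None of these parameters is exotic, but each one maps to a traditional infrastructure concept (failover, IOPS, network isolation) that the abstraction does not remove.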
Many companies respond to this complexity by creating dedicated cloud platform teams. While these teams excel at providing standardized infrastructure patterns and self-service tools, they cannot abstract away the fundamental need for an application developer to understand the infrastructure implications. For example, a platform team can hand over a ready-made Lambda deployment pipeline, but only the application developer knows whether the workload’s latency requirements can tolerate cold starts, or whether its traffic patterns will collide with the account’s concurrency limits.
The solution is not more tools or another layer of abstraction. Instead, we need to fundamentally shift how we think about infrastructure in the application development process.
Consider AWS Lambda cold starts: rather than treating them as an infrastructure problem to be solved later, teams that understand this limitation upfront make fundamentally different architectural decisions, such as keeping deployment packages and dependency trees small, initializing SDK clients and connections outside the handler, using provisioned concurrency for latency-sensitive paths, or choosing a container-based service where consistently low latency matters more than scale-to-zero.
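As a small illustration of the second point, here is a minimal, hedged sketch of cold-start-aware handler code (the table name is hypothetical):

```python
import boto3

# This block runs once per cold start and is reused across warm
# invocations, so only the first request pays the initialization cost.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("orders")  # placeholder table name

def handler(event, context):
    # Per-invocation work stays small: no client setup, no config loading.
    response = table.get_item(Key={"orderId": event["orderId"]})
    return response.get("Item", {})
```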
This shift in thinking, from “how do we solve cold starts?” to “how do we design a solution with cold starts in mind?”, exemplifies the shift-left in infrastructure thinking.
Infrastructure choices made early in development have lasting implications for operational costs. For example, a steady, high-volume workload can be far cheaper on always-on compute than on a pay-per-invocation model, while a spiky, low-volume workload is exactly the opposite - and the pricing dimensions differ (requests and GB-seconds versus instance hours), so the designs are not interchangeable after the fact.
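A rough back-of-the-envelope comparison shows how early this math can be done. The prices below are illustrative only; check the current AWS pricing pages before relying on them:

```python
# Hypothetical workload: 50M requests/month, 200ms average duration,
# 512MB of memory per invocation.
REQS_PER_MONTH = 50_000_000
AVG_DURATION_S = 0.2
MEMORY_GB = 0.5

# Illustrative unit prices (verify against current AWS pricing).
LAMBDA_PER_REQ = 0.20 / 1_000_000   # $ per request
LAMBDA_PER_GBS = 0.0000166667       # $ per GB-second
EC2_HOURLY = 0.0832                 # e.g. an on-demand t3.large

lambda_cost = (REQS_PER_MONTH * LAMBDA_PER_REQ
               + REQS_PER_MONTH * AVG_DURATION_S * MEMORY_GB * LAMBDA_PER_GBS)
ec2_cost = EC2_HOURLY * 24 * 30 * 2  # two instances behind a load balancer

print(f"Lambda: ${lambda_cost:,.2f}/month")  # ~$93 with these numbers
print(f"EC2:    ${ec2_cost:,.2f}/month")     # ~$120 with these numbers
```

With these assumed numbers serverless wins; multiply the request volume by ten and the comparison flips. The point is not the specific figures but that the comparison can, and should, happen at design time.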
These cost implications aren't just infrastructure details to be worked out later. They should influence core architectural decisions from the start. A system designed with infrastructure costs in mind often looks very different from one where cost optimization is treated as a post-deployment concern.
Infrastructure choices shouldn’t be one-time decisions. Teams should regularly evaluate whether their infrastructure choices remain valid as the application evolves. This includes regular reviews of service limits and scaling patterns, monitoring cost implications, and assessing new service offerings that might better suit the evolved requirements.
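Service limits are the easiest of these to check mechanically. Here is a small sketch of that kind of periodic review, assuming boto3 credentials, comparing the account’s Lambda concurrency limit against the remaining headroom:

```python
import boto3

lambda_client = boto3.client("lambda")
settings = lambda_client.get_account_settings()

limit = settings["AccountLimit"]["ConcurrentExecutions"]
unreserved = settings["AccountLimit"]["UnreservedConcurrentExecutions"]

print(f"Account concurrency limit: {limit}")
print(f"Unreserved headroom:       {unreserved}")

# Hypothetical threshold: alert when less than 20% of the limit is free.
if unreserved < limit * 0.2:
    print("WARNING: reserved concurrency is eating into the account limit")
```

Run on a schedule, a check like this surfaces the concurrency problem from the opening story before it reaches production traffic.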
The instinct to over-engineer for edge cases often adds unnecessary complexity. Teams should simplify the solution in both the application architecture and the infrastructure architecture. For example, to capture change data from a table, we can simply use DynamoDB Streams, or we can over-engineer a separate event pipeline with a batch job that does the same work while complicating the system with unnecessary failure points; a minimal version of the simple option is sketched below.
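Here is a hedged sketch of that simple option: a Lambda subscribed to the table’s stream receives each change as an event record, with no separate pipeline or batch job to operate (the handler logic is illustrative, and NewImage assumes the stream’s view type includes new images):

```python
# DynamoDB Streams invokes this handler with a batch of change records.
def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] == "INSERT":
            # NewImage holds the full item as written.
            print(f"New item: {record['dynamodb']['NewImage']}")
        elif record["eventName"] == "MODIFY":
            print(f"Changed keys: {record['dynamodb']['Keys']}")
        # e.g. update a projection, emit an audit entry, etc.
```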
Rather than trying to keep all options open, teams can embrace opinionated architectural patterns that align with their chosen infrastructure. For example, serverless patterns for event-driven architectures reduce decision fatigue and enforce best practices. While this might feel like infrastructure dictating development models, in reality it enables faster development and sets a clear path for future architectural decisions. With standard patterns like these established, teams building new applications naturally start solving problems in a way that fits the framework and build them the right way from the start, reducing both the complexity of adopting new infrastructure and the maintenance overhead.
Infrastructure knowledge is a must for all engineers, not just the ones on a platform team. For an application developer, choosing infrastructure should be no different from choosing a programming language or a database. Understanding every configuration parameter of a cloud service might seem daunting, but investing time in learning what each service offers, along with its pros and cons, is essential for every engineer to make informed choices and embrace them proactively.
While cloud services have abstracted many low-level details, they remain what their name suggests: infrastructure as a service. The complexity still exists, but we can manage it better by shifting infrastructure considerations upstream in the development process. This is not just about better tools or deeper expertise, but about fundamentally changing the way we think about and approach application development in the cloud era.
The most successful teams aren't those with the best infrastructure tools or the largest platform teams - they're the ones who have learned to think about infrastructure as an integral part of their application design process. This is what I call Infrastructure-Driven Development - where infrastructure considerations shape and guide application architecture from day one, rather than being an afterthought to be dealt with during deployment.