If you're building a high-scale application in the cloud, it's easy to assume you can just add more servers or let the platform sort itself out. However, there are subtle pitfalls that can derail your efforts. In recent years, I have come across a recurring set of often surprising issues with major consequences. In this article, we will walk through the most frequent pitfalls, share some real-world stories, and provide practical suggestions for addressing them.
“Everything fails, all the time.”
—Werner Vogels (CTO, Amazon)
Too Many Requests or LimitExceeded errors, interrupting your service.
Request Quota Increases: Monitor usage in the cloud console (e.g. AWS Service Quotas) and raise limits before a spike in traffic.
Introduce Queues & Caching: Decouple front-end traffic from back-end services with AWS SQS, RabbitMQ, or Redis to absorb surges.
# Example: Serverless Framework snippet for AWS Lambda & SQS
# Smooth out traffic by letting messages queue instead of overwhelming your function.
functions:
  processMessages:
    handler: handler.process
    events:
      - sqs:
          arn: arn:aws:sqs:us-east-1:123456789012:MyQueue
          batchSize: 10
          maximumBatchingWindow: 30
Walmart (2021) encountered throttling on internal APIs during holiday sales. They addressed it by adding caching and queue-based decoupling, which smoothed out spikes. Reference: Walmart Labs Engineering Blog
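To make the idea of queue-based decoupling concrete, here is a minimal in-memory sketch. It is not how any particular company implemented it; the `BufferedQueue` class and the `orderId` field are hypothetical names used only for illustration. A burst of 25 requests is accepted immediately, while the worker pulls at most 10 messages per batch, so the back end never sees more than its batch size at once.

```javascript
// Minimal in-memory sketch of queue-based decoupling.
// BufferedQueue is a hypothetical stand-in for a managed queue such as SQS.
class BufferedQueue {
  constructor(batchSize) {
    this.batchSize = batchSize; // max messages handed to the worker per batch
    this.messages = [];
  }

  // Producers enqueue without waiting for the back end.
  enqueue(message) {
    this.messages.push(message);
  }

  // Remove and return up to batchSize messages, oldest first.
  nextBatch() {
    return this.messages.splice(0, this.batchSize);
  }

  get depth() {
    return this.messages.length;
  }
}

// Simulate a burst of 25 requests arriving at once.
const queue = new BufferedQueue(10);
for (let i = 1; i <= 25; i++) {
  queue.enqueue({ orderId: i });
}

// The worker drains the backlog at its own pace.
const batchSizes = [];
while (queue.depth > 0) {
  batchSizes.push(queue.nextBatch().length);
}
console.log(batchSizes); // [ 10, 10, 5 ]
```

A real queue adds durability, retries, and visibility timeouts; the point here is only that the producer's burst rate and the consumer's processing rate are no longer coupled.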
// Example: Node.js with Redis caching (node-redis v4+)
// Checks Redis first for the data; if absent, queries the DB, then stores the result in Redis.
const redis = require('redis');
const redisClient = redis.createClient({ url: 'redis://<your-redis-endpoint>' });
redisClient.on('error', (err) => console.error('Redis client error:', err));

async function getUserProfile(userId) {
  // node-redis v4+ requires an explicit connection before issuing commands.
  if (!redisClient.isOpen) {
    await redisClient.connect();
  }
  const cacheKey = `user:${userId}`;
  const cachedData = await redisClient.get(cacheKey);
  if (cachedData) {
    return JSON.parse(cachedData);
  }
  // If not cached, fetch from DB (`db` stands in for your own data-access layer):
  const profile = await db.findUserById(userId);
  await redisClient.set(cacheKey, JSON.stringify(profile), { EX: 3600 }); // expires in 1 hour
  return profile;
}
Netflix (2022) scaled from a single relational DB to a NoSQL + Redis architecture to handle massive global traffic. Reference: Netflix Tech Blog
// Example: AWS CDK snippet for a Multi-AZ RDS PostgreSQL instance
// Place this inside a Stack constructor; `vpc` is assumed to be an existing ec2.Vpc.
import * as rds from 'aws-cdk-lib/aws-rds';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
const dbInstance = new rds.DatabaseInstance(this, 'MyPostgres', {
  // An engine version is required; use the one you run in production.
  engine: rds.DatabaseInstanceEngine.postgres({ version: rds.PostgresEngineVersion.VER_15 }),
  vpc,
  multiAz: true, // Deploys in multiple Availability Zones
  allocatedStorage: 100, // GiB
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.BURSTABLE3, ec2.InstanceSize.MEDIUM),
});
Capital One (2021) emphasized multi-region deployments on AWS to avoid reliance on a single region. Reference: AWS re:Invent 2021 Session by Capital One
Centralize Logs & Metrics: Use AWS CloudWatch, Azure Monitor, Datadog, Splunk, or equivalent for a single source of truth.
Enable Distributed Tracing: Implement OpenTelemetry or Jaeger/Zipkin to trace requests across services.
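// Example: Node.js + OpenTelemetry basic setup
// Sends tracing data to the console; swap the exporter to send it to a collector for deeper analysis.
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { ConsoleSpanExporter, SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
// SDK 2.x takes span processors in the constructor; older 1.x releases use provider.addSpanProcessor(...) instead.
const provider = new NodeTracerProvider({
  spanProcessors: [new SimpleSpanProcessor(new ConsoleSpanExporter())],
});
provider.register();
// Spans created through the OpenTelemetry API are now printed to the console.
// To capture inbound requests automatically, also register instrumentations
// such as @opentelemetry/instrumentation-http.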
Honeycomb.io (2022): Places emphasis on real-time, event-based telemetry to detect and resolve anomalies quickly. Reference: Honeycomb Blog
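As a rough illustration of event-based telemetry, the sketch below emits one structured JSON event per unit of work and tags every event with a shared trace ID so that events from different services can be joined later. The `x-trace-id` header, the service names, and the field names are assumptions made for this example (production systems typically rely on the W3C `traceparent` header via OpenTelemetry), and this is not Honeycomb's API.

```javascript
const { randomUUID } = require('crypto');

// Reuse the caller's trace ID if one was sent; otherwise start a new trace.
// 'x-trace-id' is a hypothetical header name chosen for this sketch.
function resolveTraceId(headers) {
  return headers['x-trace-id'] || randomUUID();
}

// Build one structured event; extra fields are merged in as top-level keys.
function buildEvent(traceId, service, name, fields = {}) {
  return {
    timestamp: new Date().toISOString(),
    traceId,
    service,
    name,
    ...fields,
  };
}

// Two services handling the same request share a single trace ID.
const traceId = resolveTraceId({ 'x-trace-id': 'abc123' });
const events = [
  buildEvent(traceId, 'api-gateway', 'request.received', { path: '/checkout' }),
  buildEvent(traceId, 'payments', 'charge.completed', { durationMs: 182 }),
];

// Emit one JSON object per line so a log pipeline can parse and index it.
for (const event of events) {
  console.log(JSON.stringify(event));
}
```

Because every event carries the same `traceId`, a query on that one field reconstructs the request's path across services.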
// Example: k6 load test simulating a ramp-up to 200 virtual users.
// Tailor stages to reflect your typical traffic patterns.
import http from 'k6/http';

export const options = {
  stages: [
    { duration: '1m', target: 50 },
    { duration: '2m', target: 200 },
    { duration: '1m', target: 200 },
  ],
};

export default function () {
  http.get('https://your-api-endpoint.com/');
}
Instagram (2021): Uses frequent load tests and capacity planning to accommodate explosive user growth. Reference: Instagram Engineering Blog
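Load-test results feed directly into capacity planning. One common back-of-the-envelope approach uses Little's Law (requests in flight = arrival rate × average latency) to estimate how many instances a target load needs. The sketch below assumes illustrative numbers and a hypothetical `requiredInstances` helper; the per-instance concurrency and headroom factor should come from your own measurements.

```javascript
// Estimate instance count from peak load using Little's Law: L = lambda * W.
// All parameter names and values are illustrative assumptions.
function requiredInstances({ peakRps, avgLatencySeconds, perInstanceConcurrency, headroom }) {
  // Average number of requests in flight at peak.
  const inFlight = peakRps * avgLatencySeconds;
  // Scale by a safety factor, then divide by what one instance can handle.
  return Math.ceil((inFlight * headroom) / perInstanceConcurrency);
}

const instances = requiredInstances({
  peakRps: 2000,              // measured or forecast peak requests per second
  avgLatencySeconds: 0.25,    // average response time under load
  perInstanceConcurrency: 50, // concurrent requests one instance sustains
  headroom: 1.5,              // 50% buffer for spikes and failures
});

console.log(instances); // 15
```

Here 2000 requests per second at 0.25 s each means 500 requests in flight; with 50% headroom that is 750, which 15 instances at 50 concurrent requests each can carry.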
# Example: AWS CLI command to create a monthly cost budget
# create-budget takes the budget definition as a single --budget structure.
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{"BudgetName": "MyMonthlyLimit", "BudgetLimit": {"Amount": "500", "Unit": "USD"}, "TimeUnit": "MONTHLY", "BudgetType": "COST"}'
Lyft (2021): Reduced AWS spending by optimizing compute usage, shutting down idle resources, and leveraging reserved instances. Reference: Lyft Engineering Blog
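A budget is most useful when you compare it against where spending is heading, not only where it is today. The following sketch projects month-end spend linearly from month-to-date spend and flags a likely overrun. The function names and figures are hypothetical, and a straight-line projection is a simplifying assumption that ignores seasonality and one-off charges.

```javascript
// Linear projection: average daily spend so far, extended to the full month.
function projectMonthEndSpend(spendToDate, dayOfMonth, daysInMonth) {
  return (spendToDate / dayOfMonth) * daysInMonth;
}

// Compare the projection against the budget limit.
function budgetStatus(spendToDate, dayOfMonth, daysInMonth, budgetLimit) {
  const projected = projectMonthEndSpend(spendToDate, dayOfMonth, daysInMonth);
  return { projected, overBudget: projected > budgetLimit };
}

// Illustrative numbers: $300 spent by day 15 of a 30-day month, $500 budget.
const status = budgetStatus(300, 15, 30, 500);
console.log(status); // { projected: 600, overBudget: true }
```

Spending $300 by mid-month looks safe against a $500 limit, yet the projection of $600 shows the budget will be exceeded unless usage changes.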
# Example: AWS CloudFormation snippet for Cross-Region Replication of an S3 bucket
Resources:
  PrimaryBucket:
    Type: AWS::S3::Bucket
    Properties:
      VersioningConfiguration:
        Status: Enabled
      ReplicationConfiguration:
        Role: arn:aws:iam::123456789012:role/S3ReplicationRole
        Rules:
          - Status: Enabled
            Destination:
              Bucket: arn:aws:s3:::my-backup-bucket
Netflix (2023): Uses an active-active multi-region setup, automatically routing traffic to healthy regions during disruptions. Reference: Netflix Tech Blog
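The routing decision behind an active-active setup can be sketched in a few lines. This is a simplified illustration, not Netflix's implementation: the `pickRegion` helper, the region list, and the health flags are assumptions, and in practice this logic usually lives in DNS or a global load balancer driven by health checks rather than in application code.

```javascript
// Return the first healthy region in preference order, or null if none is healthy.
function pickRegion(regions, preferredOrder) {
  for (const name of preferredOrder) {
    const region = regions.find((r) => r.name === name);
    if (region && region.healthy) {
      return name;
    }
  }
  return null;
}

// Hypothetical health snapshot: the primary region is currently failing.
const regions = [
  { name: 'us-east-1', healthy: false },
  { name: 'us-west-2', healthy: true },
  { name: 'eu-west-1', healthy: true },
];

const target = pickRegion(regions, ['us-east-1', 'us-west-2', 'eu-west-1']);
console.log(target); // us-west-2
```

Returning `null` when every region is unhealthy forces the caller to handle a full outage explicitly instead of silently sending traffic to a failing region.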
AWS Well-Architected Framework
“The best way to avoid major failure is to fail often.”
—Netflix Chaos Engineering
By addressing these pitfalls head-on, you’ll be able to maintain reliability while scaling to serve millions of users and keeping your infrastructure lean, responsive, and secure.
What other challenges have you faced when scaling cloud applications? Share your insights in the comments. Happy scaling!
Follow Milav Shah on LinkedIn for more insights.