We were so sure we'd nailed it.
Our team had a Kotlin microservice at DoorDash processing lists of restaurants, orders, drivers, customers, and locations. Cross-referencing them required matching by IDs. The "obvious" way to do it was to convert the lists to HashMaps for O(1) lookups instead of O(n) linear searches.
We deployed it. Latency got worse.
Turns out, these complex domain objects had expensive hashCode() implementations. The CPU cost of computing hashes for map creation exceeded the benefit of faster lookups. With list sizes under 50 items, dumb linear search crushed the "obvious" solution.
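Here's a stripped-down Kotlin sketch of the pattern - illustrative names, not our production code:

```kotlin
// CompositeKey stands in for the complex domain objects we were hashing;
// its generated hashCode() walks every field, including the list.
data class CompositeKey(val regionId: String, val entityId: String, val tags: List<String>)

fun <V> lookupLinear(entries: List<Pair<CompositeKey, V>>, key: CompositeKey): V? =
    entries.firstOrNull { it.first == key }?.second   // O(n), but no hashing at all

fun <V> lookupViaMap(entries: List<Pair<CompositeKey, V>>, key: CompositeKey): V? =
    entries.toMap()[key]                              // O(1) lookup, but toMap() hashes
                                                      // every key before the first lookup

// With fewer than ~50 entries, the up-front hashing cost outweighed the
// cheaper lookups; the plain linear scan won on our real workloads.
```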
That discovery crystallized something we'd been learning the hard way: at scale, theoretical optimizations can sometimes backfire. Big O notation assumes constant-cost operations, but the "constant" can matter more than the complexity class.
After years of optimizing systems serving 50M+ monthly requests, I've learned that sustainable performance improvement requires a fundamentally different mindset than what most guides teach. It's not about knowing individual techniques like caching, compression, or database tuning; it's about orchestrating them into a system that stays fast under real-world chaos.
Why Most Performance Advice Falls Short
Here's the trap: you reduce P50 latency from 200ms to 150ms. Celebration time, right? But if your P99 jumps from 500ms to 2s because of cache stampedes, you've made things worse for the users who matter most - those on the edge of your performance envelope.
Discord learned this by scaling individual servers from tens of thousands to nearly two million concurrent users. Their challenge wasn't slow queries. It was managing quadratic fan-out: with 100,000 people online, every message fans out to 100,000 recipients, so 100,000 people chatting turns into roughly 10 billion notifications to deliver.
The lesson? You can't optimize your way out of a systems problem by tuning individual components in isolation; you have to look at the whole system.
Start With Observability (But Measure What Actually Matters)
"You can't improve what you can't measure." Sure. But what you measure determines whether you'll actually fix the problem or just move it somewhere else.
Layer your instrumentation
Start at the edge and work inward. Capture P50, P90, P99, and P999 latencies from the user's perspective—users complain about the tail, not the average. Google's "The Tail at Scale" paper should be required reading.
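As one option, here's a sketch using Micrometer (my assumption, not something from the original setup - any metrics library with percentile histograms works the same way); the metric name and tag are made up:

```kotlin
import io.micrometer.core.instrument.Timer
import io.micrometer.core.instrument.simple.SimpleMeterRegistry

val registry = SimpleMeterRegistry()

// Publish the tail, not just the mean.
val requestTimer: Timer = Timer.builder("http.server.requests")
    .publishPercentiles(0.5, 0.9, 0.99, 0.999)   // P50, P90, P99, P999
    .tag("endpoint", "/feed")                    // illustrative tag
    .register(registry)

fun <T> timed(work: () -> T): T {
    val sample = Timer.start(registry)
    try {
        return work()
    } finally {
        sample.stop(requestTimer)                // records elapsed time into the histogram
    }
}
```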
Then dig into the client-side breakdown: how much time goes to decoding, deserialization, rendering? Segment by device type, network conditions, app version, and geography. A 200ms response on WiFi is great; on 3G in rural India, it might time out before anything renders.
Within your service, if you have a DAG-style workflow, measure time per node and identify the critical path. One slow database call on that path delays everything downstream. Distributed tracing via OpenTelemetry or Jaeger makes these bottlenecks obvious. And don't neglect system-level metrics - CPU, memory, GC pressure, thread pool saturation, disk I/O, network bandwidth. They're often the canary that warns you before users start complaining.
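Here's a minimal manual-instrumentation sketch against the OpenTelemetry API (fetchOrder and fetchDriverEta are hypothetical stand-ins; in practice most frameworks auto-instrument HTTP and DB calls for you):

```kotlin
import io.opentelemetry.api.GlobalOpenTelemetry

val tracer = GlobalOpenTelemetry.getTracer("order-service")

// Hypothetical stand-ins for the two slowest nodes in the workflow.
fun fetchOrder(orderId: Long) { /* DB call */ }
fun fetchDriverEta(orderId: Long) { /* downstream API call */ }

fun loadOrderPage(orderId: Long) {
    val span = tracer.spanBuilder("loadOrderPage").startSpan()
    val scope = span.makeCurrent()
    try {
        // Spans created inside these calls become children of loadOrderPage,
        // so the slowest node on the critical path stands out in the trace view.
        fetchOrder(orderId)
        fetchDriverEta(orderId)
    } finally {
        scope.close()
        span.end()
    }
}
```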
Set up error budgets alongside latency monitoring. If 99.9% of requests need to be under 200ms, your observability should tell you exactly where that budget is getting burned. Google's SRE book covers this well.
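The budget math itself is back-of-the-envelope; something like this sketch:

```kotlin
// If the SLO is "99.9% of requests under 200ms", the error budget is the 0.1%
// you're allowed to blow. A result of 1.0 means the budget is fully spent.
fun budgetBurned(totalRequests: Long, requestsOver200ms: Long): Double {
    val allowedSlow = totalRequests * 0.001
    return requestsOver200ms / allowedSlow
}

// e.g. 50M requests with 80k of them over 200ms -> 1.6, i.e. 160% of the budget
// burned: time to stop shipping features and find where the latency went.
```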
The Optimization Playbook
Once you can see the problem, here's how to attack it systematically.
Compression: Smaller Bytes, Faster Flights
Enable Brotli compression over Gzip for modern clients. LinkedIn saw ~14% reduction in JSON payload sizes after switching.
But don't blindly compress everything. Compression costs CPU cycles. For tiny payloads, it might not help. Skip already-compressed formats like JPEG, PNG, and MP4. Use static compression for assets (JS, CSS), dynamic for JSON/HTML.
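The decision logic is simple enough to sketch (the thresholds and content-type list below are illustrative; most servers and CDNs can express the same rules in configuration):

```kotlin
val alreadyCompressed = setOf("image/jpeg", "image/png", "video/mp4", "application/zip")

fun shouldCompress(contentType: String, sizeBytes: Int): Boolean =
    sizeBytes > 1_024 &&                   // tiny payloads: the CPU cost isn't worth it
        contentType !in alreadyCompressed  // don't re-compress JPEG/PNG/MP4

fun pickEncoding(acceptEncodingHeader: String): String? = when {
    "br" in acceptEncodingHeader -> "br"     // prefer Brotli for clients that support it
    "gzip" in acceptEncodingHeader -> "gzip"
    else -> null                             // fall back to sending the payload uncompressed
}
```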
Progressive Loading: First Screen First
The first screen users see determines whether they bounce. Load only what's visible; fetch the rest later.
Use cursor-based pagination instead of offset-based - it's more reliable when data changes between requests. Implement skeleton UIs and shimmer loaders. Facebook and YouTube use these because they make waiting feel faster, even when actual latency is the same.
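Here's a self-contained Kotlin sketch of the cursor idea, with an in-memory list standing in for the datastore:

```kotlin
data class Order(val id: Long, val createdAtEpochMs: Long)
data class Page<T>(val items: List<T>, val nextCursor: Long?)

// SQL equivalent: WHERE created_at < :cursor ORDER BY created_at DESC LIMIT :limit
fun fetchOrders(all: List<Order>, cursor: Long?, limit: Int = 20): Page<Order> {
    val page = all
        .sortedByDescending { it.createdAtEpochMs }
        .filter { cursor == null || it.createdAtEpochMs < cursor }
        .take(limit)
    // The cursor is the last item's sort key, so rows inserted or deleted between
    // requests can't shift the window the way an OFFSET would.
    return Page(page, page.lastOrNull()?.createdAtEpochMs)
}
```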
Caching: Powerful but Dangerous
A great cache strategy cuts latency by orders of magnitude. A bad one causes stale data, stampedes, or outages.
Build layers: L1 in-memory (fast but local), L2 distributed via Redis or Memcached, CDN for geographic distribution.
Use advanced patterns: Stale-while-revalidate serves cached data immediately while refreshing in background. Jittered TTLs prevent cache stampedes when many keys expire simultaneously. Choose read-through vs. write-through based on your consistency needs.
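To make that concrete, here's a minimal in-process sketch of stale-while-revalidate with jittered TTLs (the class and names are illustrative; a real L2 like Redis or Memcached adds network hops, but the shape is the same):

```kotlin
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.Executor
import kotlin.random.Random

class SwrCache<K : Any, V : Any>(
    private val baseTtlMs: Long,
    private val refreshExecutor: Executor,   // background pool for refreshes
    private val loader: (K) -> V,            // the slow thing being cached
) {
    private data class Entry<T>(val value: T, val expiresAt: Long)
    private val entries = ConcurrentHashMap<K, Entry<V>>()

    // Jitter spreads expirations out so thousands of keys don't expire (and
    // reload) at the same instant - the classic stampede.
    private fun jitteredTtl() = baseTtlMs + Random.nextLong((baseTtlMs / 5).coerceAtLeast(1))

    fun get(key: K): V {
        val now = System.currentTimeMillis()
        val cached = entries[key] ?: return load(key)     // cold miss: load synchronously
        if (now < cached.expiresAt) return cached.value   // fresh hit
        // Stale: serve the old value immediately, refresh off the request path.
        // (A production version would also dedupe concurrent refreshes per key.)
        refreshExecutor.execute { load(key) }
        return cached.value
    }

    private fun load(key: K): V {
        val value = loader(key)
        entries[key] = Entry(value, System.currentTimeMillis() + jitteredTtl())
        return value
    }
}
```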
Facebook's approach to serving billions of requests is worth studying. AWS also has solid caching best practices documentation.
Code: Hunt the Hot Paths
You'd be surprised - or maybe not - how much inefficiency hides in production code.
Profile hot paths using tools like perf, py-spy, or jvisualvm. Optimize parsing and serialization. Watch for quadratic loops and unnecessary allocations. Consider object pooling for high-churn objects.
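As one example, a minimal object pool looks something like this sketch - again, only worth it when the profiler says allocation is the problem:

```kotlin
import java.util.concurrent.ArrayBlockingQueue

// Minimal pool for high-churn objects (here, 64 KiB buffers).
class Pool<T : Any>(capacity: Int, private val factory: () -> T) {
    private val available = ArrayBlockingQueue<T>(capacity)

    fun acquire(): T = available.poll() ?: factory()   // reuse if possible, else allocate
    fun release(obj: T) { available.offer(obj) }       // silently dropped if the pool is full

    fun <R> use(block: (T) -> R): R {
        val obj = acquire()
        try { return block(obj) } finally { release(obj) }
    }
}

val bufferPool = Pool(capacity = 32) { ByteArray(64 * 1024) }
// bufferPool.use { buf -> /* fill buf and write it out, no per-request allocation */ }
```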
Netflix shaved 200ms off median latency by optimizing their JSON-to-UI rendering pipeline.
And remember my HashMap story from the intro - always profile real workloads. The code that looks optimal on paper might be the code that's killing you.
Parallelization: Do It Safely
Parallelizing independent operations can dramatically reduce latency. The key is avoiding the traps.
Use async/await for I/O-bound operations. Process independent API calls in parallel. Push non-critical work to background jobs.
But watch out: over-parallelization can saturate database connection pools, and missing backpressure controls can overwhelm downstream services. I've seen teams accidentally DDoS their own databases this way.
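With Kotlin coroutines, the safe version is only a few lines; this sketch fans out independent calls but bounds database concurrency with a semaphore (the domain types and calls are hypothetical stubs):

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.sync.Semaphore
import kotlinx.coroutines.sync.withPermit

// Hypothetical domain stubs so the sketch is self-contained.
data class OrderScreen(val order: String, val driver: String, val etaMinutes: Int)
suspend fun fetchOrder(id: Long): String { delay(50); return "order-$id" }
suspend fun fetchDriver(id: Long): String { delay(50); return "driver-$id" }
suspend fun computeEtaFromDb(id: Long): Int { delay(50); return 12 }

// Roughly match this to your DB connection pool so parallel fan-out
// can't starve it (the accidental self-DDoS mentioned above).
val dbPermits = Semaphore(permits = 16)

suspend fun loadOrderScreen(orderId: Long): OrderScreen = coroutineScope {
    // Independent I/O-bound calls run concurrently instead of one after another.
    val order = async { fetchOrder(orderId) }
    val driver = async { fetchDriver(orderId) }
    val eta = async { dbPermits.withPermit { computeEtaFromDb(orderId) } }
    OrderScreen(order.await(), driver.await(), eta.await())
}
```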
Database: The Usual Suspect
Databases are frequently the bottleneck, but solutions often go deeper than adding indexes.
Optimize for reads even if writes get more complex - most consumer-facing systems are read-heavy. Batch database requests to avoid N+1 queries. Use materialized views to precompute expensive results. Choose sharding strategies based on your access patterns: by user ID for balanced load, by geography for latency reduction.
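For the N+1 case specifically, the fix usually has this shape (the repository interface here is hypothetical; the single batched IN query is the point):

```kotlin
data class Order(val id: Long, val restaurantId: Long)
data class Restaurant(val id: Long, val name: String)

interface RestaurantRepo {
    fun findById(id: Long): Restaurant?                   // one round trip per call
    fun findByIds(ids: Set<Long>): Map<Long, Restaurant>  // SELECT ... WHERE id IN (...)
}

// N+1: one query to load the orders, then one more query per order.
fun namesSlow(orders: List<Order>, repo: RestaurantRepo): List<String?> =
    orders.map { repo.findById(it.restaurantId)?.name }

// Batched: a single IN query, then an in-memory join.
fun namesFast(orders: List<Order>, repo: RestaurantRepo): List<String?> {
    val byId = repo.findByIds(orders.map { it.restaurantId }.toSet())
    return orders.map { byId[it.restaurantId]?.name }
}
```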
Instagram scaled feeds to billions of users by sharding Postgres across IDs. For query optimization fundamentals, Use the Index, Luke is excellent, as is PostgreSQL's performance documentation.
Making It Stick
The process that actually works isn't complicated, but it requires discipline. Measure comprehensively across your entire stack first. Then identify the biggest bottleneck using data, not assumptions - your intuition about what's slow is often wrong. Apply targeted optimizations to the highest-impact areas, validate that the change actually improved user experience (not just your benchmarks), and then do it all again. Performance optimization never ends; it's a practice, not a project.
Track technical metrics like P99 latency, error rates, and throughput alongside business metrics like bounce rates, conversion, and engagement. A 100ms improvement in page load can increase conversion by 1-2% for e-commerce. At scale, that's millions in revenue.
The Real Challenge
As Jeff Dean puts it: "If you want your system to be fast, first make it correct. Then profile, measure, and optimize the hot spots."
Whether you're building mobile apps, APIs, or distributed systems, the principles stay the same: measure, experiment, optimize, repeat. The difference at scale is that every optimization needs to work harmoniously with all the others.
That HashMap we thought would make everything faster? It taught us that the most dangerous performance work is the kind where you're so confident you skip the measurement step.
Don't be that engineer. Instrument first. Then optimize what the data tells you to optimize.
Previously at DoorDash, where I led teams optimizing systems serving millions of users. Currently at Anthropic working on AI infrastructure. I write about distributed systems and performance engineering at ujjwalgulecha.com.
