There's a genre of conference talk I've sat through more times than I care to count. Someone who recently joined a startup from a large enterprise goes up on stage and says, with the confidence of a person who has just discovered a secret, that Kubernetes is too complicated. That maybe we should go back to VMs, or use something simpler. The room nods. The tweet thread gets a thousand likes. And then everyone goes back to their clusters and nothing changes, because whatever they're fleeing toward has the same fundamental problem they're fleeing from.
The problem isn't Kubernetes. It's that the organizations running Kubernetes have become genuinely, irreversibly complicated—and they're confusing the instrument measuring the fever with the fever itself.
The Complexity Was Already There
When I first started working with container orchestration in earnest, we were running maybe forty services across three teams. The deployment pipeline was artisanal in the worst sense: hand-tuned scripts, tribal knowledge, a senior engineer who had memorized which Ansible playbook to run in which order. It worked until it didn't—until that engineer left, until a new region was added, until compliance demanded audit trails we didn't have.
We didn't introduce Kubernetes and suddenly have a complicated system. We had a complicated system that we were managing through heroics and institutional memory. Kubernetes made that complexity legible, surfaced it in YAML and resource graphs and scheduling constraints, where before it was buried in the skulls of three overworked SREs. That's not the same as creating complexity. That's documentation that fights back.
The CNCF has said something to this effect, though usually in more diplomatic language. What they're pointing at is real: coordinating distributed workloads across failure domains is a hard problem with a solution space that doesn't compress easily. You cannot wish away the coordination overhead of scheduling ten thousand pods across three hundred nodes in two availability zones while respecting pod affinity rules, resource quotas, and a custom admission controller from your security team. That overhead exists whether Kubernetes expresses it or not. Before Kubernetes, it lived in runbooks nobody read, in production incidents at 3 AM, in the archaeology of understanding why some service only worked when deployed to a specific instance type in us-east-1b.
What Multi-Cloud Actually Did to People
The multi-cloud era made everything worse before—maybe—making some things better, and it did so in ways that are worth examining precisely because they illuminate how organizational decisions manufacture the complexity that then gets blamed on tooling.
The pitch was irresistible. No vendor lock-in. Redundancy against cloud outages. Ability to use best-of-breed services from different providers. In practice, a significant fraction of organizations that went multi-cloud did so not because they needed the resilience but because a VP read a Gartner report and the decision landed on infrastructure teams that had no real voice in making it.
What that meant, concretely: you'd have a networking model in AWS—security groups, VPC peering, IAM roles for service accounts—and an entirely different model in Azure, with its NSGs, VNets, managed identities, and subtly different DNS resolution behavior. The Terraform you wrote for one didn't port cleanly to the other. If you wanted your services to behave identically across both clouds, you were either writing and maintaining two implementations of everything, or you were building an abstraction layer over the two clouds' APIs that would eventually become its own maintenance burden. Neither option is free.
I've seen teams spend six months building a "cloud-agnostic" deployment abstraction only to discover that their primary application depended on a specific AWS service that had no Azure analog. The abstraction didn't protect them; it just deferred the reckoning while adding indirection. Meanwhile, the Helm charts that got written in that period—one version per cloud, slightly different values files, slightly different naming conventions—became a garden of fork drift. The AWS chart got a security fix. Someone forgot to backport it to the Azure chart. Six weeks later, a penetration test finds the vulnerability on the Azure deployment.
This isn't a Kubernetes failure. It's an organizational one. The decision to go multi-cloud without a genuine use case, without the staffing to maintain parity, without enforced standards—that decision is where the technical debt originated. Kubernetes was merely the layer where the debt became visible as inconsistent configurations, mysterious scheduling failures, and support tickets from developers who couldn't understand why their service worked in one environment but not the other.
Where the Actual Fracture Lines Are
If you want to understand why a complex Kubernetes deployment feels like it's fighting you, you have to look at a few specific mechanisms that go wrong in practice.
Admission control without ownership. Kubernetes has a webhook-based admission control system that lets you intercept and modify or reject API requests before they hit etcd. It's powerful. Organizations inevitably add admission controllers for security policies, resource limit enforcement, label requirements, image signing verification. Over time, no single team has a complete picture of what all the admission controllers are doing. A developer tries to deploy a new service and it silently fails because three different admission controllers are each partially rejecting the request, and the error messages compose into something undiagnosable. I've watched a senior engineer spend a full afternoon on a deployment that was failing because a new network policy admission controller was requiring a label that the service's Helm chart didn't set, and the error from the API was cryptic enough that the first two hours were spent looking in entirely the wrong place.
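The "which controllers even fire on this request" question can be answered mechanically. Here's a minimal sketch, assuming simplified ValidatingWebhookConfiguration-shaped dicts; real matching also involves namespaceSelector, objectSelector, and scope, and the webhook names below are hypothetical:

```python
# Determine which admission webhooks would intercept a given API request,
# by checking each webhook's rules (operations, apiGroups, apiVersions,
# resources). Simplified: ignores selectors and scope.

def rule_matches(rule, group, version, resource, operation):
    def hit(values, value):
        return "*" in values or value in values
    return (hit(rule["operations"], operation)
            and hit(rule["apiGroups"], group)
            and hit(rule["apiVersions"], version)
            and hit(rule["resources"], resource))

def webhooks_matching(configs, group, version, resource, operation):
    """Return names of webhooks whose rules cover this API request."""
    return [webhook["name"]
            for config in configs
            for webhook in config["webhooks"]
            if any(rule_matches(r, group, version, resource, operation)
                   for r in webhook.get("rules", []))]

# Hypothetical sample data standing in for two webhook configurations.
configs = [
    {"webhooks": [{
        "name": "require-team-label.example.com",
        "rules": [{"operations": ["CREATE", "UPDATE"],
                   "apiGroups": ["apps"], "apiVersions": ["v1"],
                   "resources": ["deployments"]}],
    }]},
    {"webhooks": [{
        "name": "verify-image-signature.example.com",
        "rules": [{"operations": ["CREATE"],
                   "apiGroups": ["*"], "apiVersions": ["*"],
                   "resources": ["pods"]}],
    }]},
]

print(webhooks_matching(configs, "apps", "v1", "deployments", "CREATE"))
```

Fed with real configurations pulled from the API server, a script like this turns "three controllers are each partially rejecting the request" from an afternoon of guesswork into a lookup.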
RBAC proliferation. Role-based access control in Kubernetes is correct in its model but savage in its operational demands at scale. Every team wants its own namespace with its own service accounts. The ClusterRoles and RoleBindings that accumulate over two or three years in a medium-sized organization are genuinely difficult to audit. Who bound what to what, and why? The original author has moved to another team. The binding was created in response to an incident and never tightened afterward. You can run kubectl get rolebindings --all-namespaces and get hundreds of rows that tell you what exists but tell you almost nothing about whether it should.
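The hundreds-of-rows problem gets more tractable if you flatten bindings into rows you can actually review. A sketch, assuming bindings shaped like the API's RoleBinding objects (all names here are hypothetical, and which roles count as "powerful" is a local judgment):

```python
# Flatten RoleBinding-shaped dicts into auditable (namespace, subject,
# role) rows, flagging bindings to broad built-in roles for review.

POWERFUL = {"cluster-admin", "admin", "edit"}

def flatten_bindings(bindings):
    rows = []
    for b in bindings:
        for subject in b.get("subjects", []):
            rows.append({
                "namespace": b["metadata"].get("namespace", "<cluster>"),
                "subject": f'{subject["kind"]}/{subject["name"]}',
                "role": b["roleRef"]["name"],
                "flag": b["roleRef"]["name"] in POWERFUL,
            })
    return rows

# Hypothetical sample bindings.
bindings = [
    {"metadata": {"namespace": "payments", "name": "ci-deploy"},
     "roleRef": {"kind": "ClusterRole", "name": "edit"},
     "subjects": [{"kind": "ServiceAccount", "name": "ci-bot"}]},
    {"metadata": {"namespace": "payments", "name": "readonly"},
     "roleRef": {"kind": "Role", "name": "viewer"},
     "subjects": [{"kind": "Group", "name": "payments-devs"}]},
]

for row in flatten_bindings(bindings):
    marker = "  <-- review" if row["flag"] else ""
    print(f'{row["namespace"]:<10} {row["subject"]:<28} {row["role"]}{marker}')
```

This doesn't answer the "why" question, but it turns the audit from reading raw YAML into triaging a list, and the flagged rows are where to start asking.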
Etcd as a pressure gauge. Etcd is a distributed key-value store that, under the hood, is doing Raft consensus to maintain consistency. It's not designed to store large objects—requests are capped at 1.5 MB by default—and its write throughput is bounded by fsync latency on the Raft leader's disk, not by how many members you add. I've seen clusters where the CRD sprawl—every operator, every platform tool, every custom resource definition—had pushed the etcd object count into territory where you start watching request latency with a kind of anxious attention. The etcd disk I/O profile of a busy cluster with fifteen operators and hundreds of custom resources is not what the engineers who set up that cluster anticipated when they first provisioned it on a storage class that made sense at the time.
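You don't have to guess where the object pressure is coming from. The kube-apiserver exposes per-resource object counts as the apiserver_storage_objects metric (older releases called it etcd_object_counts). A sketch of ranking them, with a sample scrape standing in for the real /metrics output:

```python
# Rank stored-object counts from kube-apiserver metrics text to find
# which resource types are inflating etcd. The sample text below is a
# stand-in for a real scrape of the /metrics endpoint.

import re

METRIC = re.compile(r'apiserver_storage_objects\{resource="([^"]+)"\}\s+([0-9.e+]+)')

def top_object_counts(metrics_text, n=3):
    counts = {m.group(1): int(float(m.group(2)))
              for m in METRIC.finditer(metrics_text)}
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]

sample = """\
apiserver_storage_objects{resource="events"} 48210
apiserver_storage_objects{resource="pods"} 3120
apiserver_storage_objects{resource="certificaterequests.cert-manager.io"} 19544
apiserver_storage_objects{resource="configmaps"} 7800
"""

for resource, count in top_object_counts(sample):
    print(f"{resource:<45} {count}")
```

In practice the top offenders are often exactly the kind of operator-generated custom resources described above, accumulating because nothing garbage-collects them.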
None of this is Kubernetes being bad software. It's Kubernetes being operated at a scale and complexity it was designed to handle, by organizations that didn't invest commensurately in the people and processes required to operate it well.
The Platform Engineering Answer, and Its Honest Limits
The industry response to all of this has been platform engineering—building internal developer platforms that abstract away the raw Kubernetes surface, exposing "golden paths" that bake in the organizational standards so developers can deploy a service without knowing what an admission controller is.
This is genuinely good. Backstage, Port, and similar tools have given infrastructure teams a way to create opinionated interfaces over their clusters. The best implementations I've seen treat the platform like a product: there's a team responsible for it, there are SLAs, there's a feedback loop with the developers using it, and the abstractions are updated when they stop working. The worst implementations I've seen are thin YAML wrappers that exist mostly as a checkbox in someone's OKRs, that nobody maintains, and that developers route around by talking directly to cluster admins.
The honest limitation of platform engineering is that it doesn't reduce the underlying complexity; it relocates it. Somewhere in your organization, someone still needs to understand Kubernetes deeply—the scheduling model, the networking plugins, the storage classes, the upgrade process. When the abstraction layer breaks, which it will, you need people who can descend past it. Platform teams that are understaffed and under-resourced become bottlenecks. Developers who are insulated from the platform for too long lose the mental models they'd need to help debug it. The abstraction is both the point and the risk.
There's also the question of what "golden path" actually means when you have twenty teams with twenty slightly different requirements. A batch processing team's deployment needs are structurally different from a latency-sensitive API team's. The more universal you make the golden path, the more it approximates the lowest common denominator. The more you accommodate variation, the more the platform begins to resemble what you were trying to abstract away.
What a Careful Builder Would Do Monday Morning
Not reinvent anything. The first instinct when faced with a complicated system—usually at the end of a difficult week—is to want to tear it down and start fresh, with better decisions this time. That instinct is almost always wrong about the magnitude of improvement available and wrong about the cost of the restart.
What I'd actually do: run a namespace audit. Pick one cluster, list every namespace, and for each one try to answer three questions: who owns this, what's the blast radius if it misbehaves, and when was its resource quota last reviewed? The answers are usually surprising. You'll find namespaces nobody admits to owning. You'll find services with no resource limits that are quietly threatening neighbors. You'll find quota configurations that were set in 2021 and have never been updated despite the workload doubling.
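The three questions can be encoded directly. A minimal sketch, assuming teams record ownership and quota review dates in annotations—the annotation keys and the 180-day threshold here are hypothetical conventions, not Kubernetes standards:

```python
# Audit namespace-shaped dicts for the three questions: who owns it,
# does it have a quota at all, and when was the quota last reviewed.

from datetime import date

OWNER_KEY = "example.com/owner"          # hypothetical convention
REVIEWED_KEY = "example.com/quota-reviewed"  # hypothetical convention
MAX_AGE_DAYS = 180

def audit_namespace(ns, today):
    annotations = ns["metadata"].get("annotations", {})
    findings = []
    if OWNER_KEY not in annotations:
        findings.append("no declared owner")
    reviewed = annotations.get(REVIEWED_KEY)
    if reviewed is None:
        findings.append("quota never reviewed")
    elif (today - date.fromisoformat(reviewed)).days > MAX_AGE_DAYS:
        findings.append(f"quota review older than {MAX_AGE_DAYS} days")
    if not ns.get("resource_quotas"):
        findings.append("no ResourceQuota at all")
    return findings

# Hypothetical sample namespaces.
namespaces = [
    {"metadata": {"name": "payments",
                  "annotations": {OWNER_KEY: "team-payments",
                                  REVIEWED_KEY: "2021-03-01"}},
     "resource_quotas": ["payments-quota"]},
    {"metadata": {"name": "legacy-batch", "annotations": {}},
     "resource_quotas": []},
]

today = date(2024, 6, 1)
for ns in namespaces:
    findings = audit_namespace(ns, today)
    print(ns["metadata"]["name"], "->", findings or ["ok"])
```

The point isn't the script; it's that once ownership and review dates are machine-readable, the audit stops being a quarterly heroic effort and becomes something you can run on a schedule.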
Then I'd look at admission controller coverage—not to reduce it, but to map it. Draw the actual graph: which controllers fire on which resource types, in which order, with which rejection criteria. Make it readable. Print it out if you have to. The goal isn't to reduce security surface; it's to make the invisible visible so that when a deployment fails in a confusing way, there's a document someone can consult instead of an hour of kubectl describe archaeology.
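The readable document can be generated rather than hand-drawn. A sketch that dumps every webhook's coverage as one table, assuming configs shaped like the API's ValidatingWebhookConfiguration objects (webhook names are hypothetical; failurePolicy defaults to "Fail" in the v1 API):

```python
# Render every admission webhook's coverage as one sortable table:
# which webhook fires on which resources, for which operations, and
# what happens if the webhook itself is down (failurePolicy).

def coverage_table(configs):
    rows = []
    for config in configs:
        for webhook in config["webhooks"]:
            for rule in webhook.get("rules", []):
                rows.append((
                    webhook["name"],
                    ",".join(rule["resources"]),
                    ",".join(rule["operations"]),
                    webhook.get("failurePolicy", "Fail"),
                ))
    return sorted(rows)

# Hypothetical sample configurations.
configs = [
    {"webhooks": [{"name": "limits.example.com",
                   "failurePolicy": "Ignore",
                   "rules": [{"operations": ["CREATE"],
                              "apiGroups": [""], "apiVersions": ["v1"],
                              "resources": ["pods"]}]}]},
    {"webhooks": [{"name": "labels.example.com",
                   "rules": [{"operations": ["CREATE", "UPDATE"],
                              "apiGroups": ["apps"], "apiVersions": ["v1"],
                              "resources": ["deployments"]}]}]},
]

print(f'{"WEBHOOK":<22} {"RESOURCES":<14} {"OPS":<16} POLICY')
for name, resources, ops, policy in coverage_table(configs):
    print(f"{name:<22} {resources:<14} {ops:<16} {policy}")
```

The failurePolicy column alone is worth the exercise: it tells you which webhooks will silently stop enforcing anything when their backing service is down, and which will block every deployment in the cluster.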
Then, if the organization has more than four or five teams deploying to the same cluster, I'd have a frank conversation about whether the platform team is staffed correctly. Not whether there's a platform team—there probably is—but whether it has the capacity to actually function as a product team rather than an operations team that happens to have "platform" in the job titles. The difference is consequential. Operations teams react; product teams build, maintain, and iterate. You need both modes, but if the platform is going to absorb organizational complexity so developers don't have to, it needs staffing that reflects that ambition.
The Mirror Doesn't Lie
Kubernetes shows you your organization. The complexity in your cluster—the proliferated YAML, the duplicate pipelines, the inconsistent conventions across teams—is a faithful representation of the coordination failures, the accumulated compromises, and the organizational decisions that were made over years and that nobody went back to clean up.
That's not a criticism of Kubernetes. It might be the most useful thing about it.
The alternative is to have that complexity hiding in places you can't see: in tribal knowledge, in runbooks that aren't run, in the learned helplessness of developers who know that something works but can't explain why. At least in a cluster, you can audit it. You can put it in version control. You can write a script that checks whether your actual state matches your desired state. You cannot write that script for the implicit knowledge that lives in your longest-tenured SRE's head.
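That drift-checking script is not hypothetical hand-waving; its core is small. A minimal sketch of desired-vs-actual comparison—real tooling (kubectl diff, GitOps controllers) handles vastly more, but the underlying idea is just a recursive walk:

```python
# Recursively compare a desired spec against observed state and report
# the paths where they diverge.

def drift(desired, actual, path=""):
    diffs = []
    for key in sorted(set(desired) | set(actual)):
        here = f"{path}.{key}" if path else key
        if key not in actual:
            diffs.append(f"{here}: missing from actual")
        elif key not in desired:
            diffs.append(f"{here}: present but not desired")
        elif isinstance(desired[key], dict) and isinstance(actual[key], dict):
            diffs.extend(drift(desired[key], actual[key], here))
        elif desired[key] != actual[key]:
            diffs.append(f"{here}: want {desired[key]!r}, have {actual[key]!r}")
    return diffs

desired = {"spec": {"replicas": 3, "template": {"labels": {"team": "payments"}}}}
actual  = {"spec": {"replicas": 1, "template": {"labels": {}}}}

for line in drift(desired, actual):
    print(line)
```

Thirty lines against a cluster; impossible against the contents of someone's head. That asymmetry is the whole argument.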
So no—don't blame Kubernetes. Build the platform team. Run the audit. Standardize the pipelines. Pick a multi-cloud strategy that reflects actual requirements rather than vendor anxiety. These are organizational problems with organizational solutions, and they were going to require that work whether or not Kubernetes ever existed.
The orchestrator is just honest with you. Whether you find that useful depends entirely on whether you're ready to hear what it's saying.
