Written from the perspective of a senior engineer who has built, debugged, and battle-tested operators in production.

## Table of Contents

- Why Operators Exist
- The Conceptual Foundation: Control Theory
- Kubernetes API Machinery: The Backbone
- Custom Resource Definitions (CRDs)
- The Controller Runtime: Inside the Engine
- Informers, Listers, and the Cache
- The Reconciliation Loop in Depth
- Work Queues and Rate Limiting
- Watches, Events, and Predicates
- Ownership, Finalizers, and Garbage Collection
- Status Subresource and Conditions
- Generation vs ObservedGeneration: A Deep Dive
- Concurrency, MaxConcurrentReconciles, and Cache Scoping
- Leader Election
- Webhooks: Admission and Conversion
- Operator Patterns and Anti-Patterns
- Observability and Debugging
- Production Considerations
- Ready to Build Your Own Operator

## Why Operators Exist

Before we dive into internals, let's get philosophical for a moment. Kubernetes gives you primitives: Pods, Deployments, Services, ConfigMaps. These are general-purpose building blocks.
They're powerful, but they're **dumb** — they don't understand your application's operational semantics.

Consider a PostgreSQL cluster. A skilled DBA knows:

- How to perform a rolling upgrade without downtime
- When and how to promote a standby to primary during a failure
- How to orchestrate backups in a consistent way
- How to resize volumes without data loss

None of this knowledge lives in native Kubernetes. An **Operator** is the mechanism to **codify operational expertise** into software that runs inside your cluster and manages resources on your behalf.

The formal definition: An Operator is a **custom controller** that manages **Custom Resources** to automate complex, stateful application lifecycle management.

## The Conceptual Foundation: Control Theory

Every operator is, at its core, an implementation of a **closed-loop control system** — specifically what control engineers call a **feedback control loop**.

The three core concepts are:

1. **Desired State** — What you declare in your Custom Resource (the `spec` field). This is immutable intent.
2. **Observed State** — What's actually running in the cluster right now (the `status` field plus the state of managed child resources).
3. **Reconciliation** — The act of computing the delta between desired and observed state, then taking actions to close that gap.

Controllers are **implemented on top of event streams** (watch events from the Kubernetes API), but their reconciliation logic is **level-based, not edge-triggered**. The trigger is event-driven; the behavior is not.
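The level-based idea can be simulated without any Kubernetes machinery. Below is a minimal sketch with hypothetical `World` state (not a controller-runtime API): the reconciler never inspects the triggering event, only current state, so calling it once or a hundred times converges to the same result.

```go
package main

import "fmt"

// World simulates cluster state: a desired replica count (spec) and
// the actual replica count (observed).
type World struct {
	DesiredReplicas  int
	ObservedReplicas int
}

// Reconcile is level-based: it ignores whatever event woke it up and
// only compares desired vs observed, taking one corrective step.
// Calling it repeatedly converges regardless of missed events.
func Reconcile(w *World) bool {
	switch {
	case w.ObservedReplicas < w.DesiredReplicas:
		w.ObservedReplicas++ // "create a pod"
		return false         // not yet converged
	case w.ObservedReplicas > w.DesiredReplicas:
		w.ObservedReplicas-- // "delete a pod"
		return false
	default:
		return true // desired == observed: nothing to do (idempotent)
	}
}

func main() {
	w := &World{DesiredReplicas: 3, ObservedReplicas: 0}
	for !Reconcile(w) {
	}
	fmt.Println("converged at", w.ObservedReplicas) // converged at 3
}
```

Note that running `Reconcile` again after convergence changes nothing: that is the idempotency property every real reconciler needs.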
Rather than reacting once to a specific event, the controller always asks "is the world in the state I want?" and drives toward that state regardless of how many events fired. This distinction matters enormously for resilience: if you miss an event, the next reconciliation catches it anyway. Contrast this with a purely edge-triggered system, where a missed event means a missed action — permanently.

## Kubernetes API Machinery: The Backbone

Before building or understanding operators, you need a solid mental model of how the Kubernetes API server works.

Every object in Kubernetes is stored in **etcd** as a versioned, typed resource. The API server exposes these objects via a RESTful interface. Critically, the API server supports a **Watch** mechanism — clients can subscribe to a stream of events for any resource type.

The watch stream delivers three event types: `ADDED`, `MODIFIED`, `DELETED`. These are the raw signals your controller eventually acts on, though — as we'll see — the controller runtime abstracts this considerably.

**Resource Versions** are central to the concurrency model. Every object has a `resourceVersion` field — an opaque string used for optimistic concurrency control. It is derived from etcd's internal revision mechanism, but clients must always treat it as opaque: never parse it, compare it numerically, or make assumptions about its format. When you update an object, you must send the current `resourceVersion` to guarantee a compare-and-swap, preventing lost updates in concurrent environments.

## Custom Resource Definitions (CRDs)

CRDs are how you extend the Kubernetes API. When you apply a CRD, the API server dynamically registers new API endpoints, enables storage in etcd, and starts serving your custom resources as first-class API objects.
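The `resourceVersion` compare-and-swap described above can be sketched with an in-memory stand-in for the API server. The types here are hypothetical, and the integer revision is purely internal to the simulation — real clients must treat `resourceVersion` as opaque:

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// apiServer simulates optimistic concurrency on a single object.
type apiServer struct {
	revision int    // internal counter, like etcd's revision
	data     string // the stored object
}

var errConflict = errors.New("409 Conflict: resourceVersion mismatch")

// Get returns the object and its current (opaque) resourceVersion.
func (s *apiServer) Get() (data, resourceVersion string) {
	return s.data, strconv.Itoa(s.revision)
}

// Update succeeds only if the caller presents the current
// resourceVersion: a compare-and-swap that prevents lost updates.
func (s *apiServer) Update(data, resourceVersion string) error {
	if resourceVersion != strconv.Itoa(s.revision) {
		return errConflict
	}
	s.revision++
	s.data = data
	return nil
}

func main() {
	srv := &apiServer{data: "v0"}
	_, rvA := srv.Get() // client A reads
	_, rvB := srv.Get() // client B reads the same version

	fmt.Println(srv.Update("from-A", rvA)) // <nil>
	fmt.Println(srv.Update("from-B", rvB)) // 409 Conflict: resourceVersion mismatch
}
```

Client B's write is rejected because A's update bumped the revision in between; B must re-read and retry, which is exactly what the conflict-retry pattern later in this article automates.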
A CRD has several important structural components:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.mycompany.io
spec:
  group: mycompany.io
  names:
    kind: Database
    plural: databases
    singular: database
    shortNames: ["db"]
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          # Structural schema for validation
      subresources:
        status: {}  # Enables /status subresource
        scale:      # Optional: enables /scale subresource
          specReplicasPath: .spec.replicas
          statusReplicasPath: .status.replicas
      additionalPrinterColumns:
        - name: Phase
          type: string
          jsonPath: .status.phase
```

The **status subresource** deserves special attention. When enabled, `spec` and `status` become separately updatable — meaning only the controller should write to `status`, and users should only write to `spec`. This enforces a clean separation of intent vs. observation.

**Structural Schema** is mandatory since `apiextensions.k8s.io/v1` (Kubernetes 1.16+). Non-structural schemas are rejected by the API server. The `openAPIV3Schema` field defines the shape of your resource and enables server-side validation — every field must be described. This prevents garbage data from entering your system.
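For concreteness, here is what a minimal structural schema for a Database-style resource might look like. The field names are illustrative, not taken from any real operator; the key point is that every level is explicitly typed:

```yaml
schema:
  openAPIV3Schema:
    type: object
    properties:
      spec:
        type: object
        required: ["replicas"]
        properties:
          replicas:
            type: integer
            minimum: 1
          storageSize:
            type: string
            pattern: "^[0-9]+(Gi|Mi)$"
      status:
        type: object
        properties:
          phase:
            type: string
          observedGeneration:
            type: integer
```

Validation like `minimum` and `pattern` runs server-side on every write, so malformed objects are rejected before your controller ever sees them.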
## The Controller Runtime: Inside the Engine

The `controller-runtime` library (used by both Kubebuilder and Operator SDK) provides the scaffolding that most operators are built on. Let's dissect what it gives you.

**The Manager** is the top-level orchestrator. It:

- Manages a shared **cache** (backed by informers) for all resource types your controllers care about
- Provides a **client** that reads from the local cache and writes directly to the API server
- Runs all controllers in goroutines
- Handles leader election
- Exposes health check and metrics endpoints

**The Cache** is the performance secret. Rather than every reconciliation hitting the API server, reads go to a local in-memory store that is kept in sync via informers. This reduces API server load dramatically and makes your operator fast.

**The Client** has two personalities:

- **Reader** (cache-backed): Fast, eventually consistent. Used for `Get` and `List` operations during reconciliation. If you need strong consistency at a specific checkpoint, you can bypass the cache by constructing an uncached client — but do so sparingly, as it adds latency and API server load.
- **Writer** (direct to API): Used for `Create`, `Update`, `Patch`, `Delete`, and `Status().Update()`. These always go directly to the API server, never through the cache.
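The read/write split has an observable consequence: a write you just made may not be visible to your next cached read. A toy model with hypothetical types (stdlib only, not the controller-runtime client) makes the staleness concrete:

```go
package main

import "fmt"

// store models the API server's authoritative state and a local
// informer-backed cache that lags behind it.
type store struct {
	server map[string]string // authoritative (writes go here)
	cache  map[string]string // eventually consistent (reads come from here)
}

func newStore() *store {
	return &store{server: map[string]string{}, cache: map[string]string{}}
}

// Write goes directly to the API server, never through the cache.
func (s *store) Write(key, val string) { s.server[key] = val }

// Read is served from the cache and may be stale.
func (s *store) Read(key string) (string, bool) {
	v, ok := s.cache[key]
	return v, ok
}

// Sync simulates the informer delivering watch events to the cache.
func (s *store) Sync() {
	for k, v := range s.server {
		s.cache[k] = v
	}
}

func main() {
	s := newStore()
	s.Write("db/replicas", "3")

	if _, ok := s.Read("db/replicas"); !ok {
		fmt.Println("cache miss: write not yet observed") // stale read
	}
	s.Sync()
	v, _ := s.Read("db/replicas")
	fmt.Println("after sync:", v) // after sync: 3
}
```

This is why reconcilers must tolerate reading their own writes late: a freshly created child resource may not appear in the very next `List`, and idempotent logic absorbs that gracefully.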
## Informers, Listers, and the Cache

This is where things get really interesting from an internals perspective. The **Informer** is the heart of the watch machinery.

The **Reflector** does the heavy lifting: it first performs a `List` to establish the initial state, then starts a long-lived `Watch`. If the watch connection drops (network blip, API server restart), the reflector automatically reconnects and re-lists if necessary.

The **DeltaFIFO** queue is a clever data structure that deduplicates events for the same object. If an object is modified 10 times before the controller gets around to processing it, those deltas are collapsed. This is the first layer of the "level-triggered" behavior.

The **Local Cache** (a thread-safe store with indexes) is what `client.Get` and `client.List` read from. It's always slightly behind the API server (eventual consistency), but that's acceptable because your reconciler should be idempotent anyway.

**Listers** are typed wrappers over the cache that let you query by namespace or label selector without hitting the network.

## The Reconciliation Loop in Depth

Here's the full picture of what happens from a watch event to a completed reconciliation. A few nuances that trip people up:

- **The key is a namespace/name pair, not an object.** When your reconciler is called, you only get the namespace and name. You must re-fetch the current state of the object from the cache. Never trust stale data passed in — always re-read at the top of your reconcile function.
- **Reconcile should be idempotent.** It will be called multiple times for the same state. If you create a resource, check if it already exists first. If you apply a configuration, make it declarative. A reconcile that is accidentally destructive when called twice is a ticking time bomb.
- **Errors vs. Requeue.** Returning an `error` causes the item to be requeued with exponential backoff (respecting the rate limiter). Returning `ctrl.Result{Requeue: true}` or `ctrl.Result{RequeueAfter: duration}` requeues without registering an error (no backoff increment). Use the former for actual errors, the latter for polling scenarios.

## Work Queues and Rate Limiting

The work queue deserves its own section because it's where many operator performance issues originate.

The work queue has a built-in **deduplication** guarantee: if the same namespace/name is already in the queue, adding it again is a no-op. This means a burst of 100 events for the same object results in exactly one reconciliation.

The **Processing Set** ensures that while an item is being reconciled, any new events for that same item are queued but not dispatched until the current reconciliation completes. This prevents concurrent reconciliations for the same object.

**Rate limiters** in controller-runtime compose two strategies:

- The `ItemExponentialFailureRateLimiter` tracks per-item failure counts and applies backoff: `base * 2^failures` up to a maximum. This prevents a persistently failing object from hammering the API server.
- The `BucketRateLimiter` is a global token bucket that caps overall reconciliation throughput. This protects the API server from a thundering herd when many objects need reconciliation simultaneously (e.g., after an operator restart).

The default controller-runtime rate limiter combines per-item exponential backoff (base ~5ms, max ~1000s) with a global token bucket (~10 QPS, burst ~100).
These defaults can vary across controller-runtime versions and are not guaranteed API contracts — always verify against your version's source. In high-scale environments, you'll almost certainly want to tune them.

## Watches, Events, and Predicates

A controller needs to know which objects to watch. The `.Watches()` builder in controller-runtime lets you express complex watch topologies.

`EnqueueRequestForOwner` is the most common pattern: when a child resource changes (e.g., a Pod owned by your operator's StatefulSet), find the owner reference chain and enqueue the root owner. This lets the parent controller react to child state changes.

`EnqueueRequestsFromMapFunc` is a powerful escape hatch. Given any object event, you provide a function that maps it to zero or more reconcile requests. Use this for non-ownership relationships — e.g., when a shared Secret changes, requeue all operators that reference it.

**Predicates** filter events before they hit the queue. This is a critical optimization that's often overlooked:

```go
// Only reconcile when spec changes, not on every status update
ctrl.NewControllerManagedBy(mgr).
	For(&myv1.Database{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
	Complete(r)
```

`GenerationChangedPredicate` is particularly valuable — it only triggers reconciliation when `metadata.generation` increments (which only happens on spec changes), ignoring pure status updates. Without this, every status write your controller does triggers another reconciliation, creating a tight loop.

## Ownership, Finalizers, and Garbage Collection

This triad is where operator bugs tend to cluster. Let's be precise.
**Owner References** establish the parent-child relationship for garbage collection: they tell the Kubernetes garbage collector that child objects should be deleted when the parent is deleted. Always set owner references on resources you create — without them, orphaned resources accumulate in the cluster.

```go
ctrl.SetControllerReference(database, statefulSet, r.Scheme)
```

This sets the child's `metadata.ownerReferences` to point to the parent, with `controller: true` and `blockOwnerDeletion: true`.

**Finalizers** are strings in `metadata.finalizers` that prevent an object from being deleted until all finalizers are removed. The deletion flow — what happens step by step when a user deletes an object with a finalizer — is this: Kubernetes sets `metadata.deletionTimestamp` but doesn't remove the object. Your controller must detect this, do cleanup work, remove its finalizer, and then update the object — at which point Kubernetes deletes it.
Common finalizer pattern:

```go
const myFinalizer = "mycompany.io/database-finalizer"

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	db := &myv1.Database{}
	if err := r.Get(ctx, req.NamespacedName, db); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if !db.DeletionTimestamp.IsZero() {
		// Object is being deleted
		if controllerutil.ContainsFinalizer(db, myFinalizer) {
			if err := r.runCleanup(ctx, db); err != nil {
				return ctrl.Result{}, err
			}
			controllerutil.RemoveFinalizer(db, myFinalizer)
			return ctrl.Result{}, r.Update(ctx, db)
		}
		return ctrl.Result{}, nil
	}

	// Add finalizer if not present
	if !controllerutil.ContainsFinalizer(db, myFinalizer) {
		controllerutil.AddFinalizer(db, myFinalizer)
		return ctrl.Result{}, r.Update(ctx, db)
	}

	// Normal reconciliation...
	return ctrl.Result{}, nil
}
```

**A critical warning:** Finalizer logic must be robust and eventually complete. A finalizer that never removes itself will prevent the object from being garbage collected forever. Always provide a way to force-remove the finalizer in operational runbooks.
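The full lifecycle — add finalizer, user deletes, controller cleans up and removes the finalizer, API server deletes — can be traced with plain structs. These are hypothetical types mirroring (not using) the real API:

```go
package main

import "fmt"

// object mimics the metadata fields that drive finalizer handling.
type object struct {
	finalizers        []string
	deletionRequested bool // stands in for metadata.deletionTimestamp
	deleted           bool
}

// requestDelete mimics the API server: with finalizers present, the
// object is only marked for deletion, not removed.
func requestDelete(o *object) {
	o.deletionRequested = true
	if len(o.finalizers) == 0 {
		o.deleted = true
	}
}

func contains(list []string, s string) bool {
	for _, f := range list {
		if f == s {
			return true
		}
	}
	return false
}

func remove(list []string, s string) []string {
	var out []string
	for _, f := range list {
		if f != s {
			out = append(out, f)
		}
	}
	return out
}

const fin = "mycompany.io/database-finalizer"

// reconcile follows the pattern above: ensure the finalizer during
// normal operation; on deletion, clean up, then remove it.
func reconcile(o *object) {
	if o.deletionRequested {
		if contains(o.finalizers, fin) {
			fmt.Println("running cleanup (backups, external resources...)")
			o.finalizers = remove(o.finalizers, fin)
		}
		if len(o.finalizers) == 0 {
			o.deleted = true // API server garbage-collects
		}
		return
	}
	if !contains(o.finalizers, fin) {
		o.finalizers = append(o.finalizers, fin)
	}
}

func main() {
	o := &object{}
	reconcile(o)     // adds the finalizer
	requestDelete(o) // user runs `kubectl delete`; object survives
	fmt.Println("deleted after request:", o.deleted) // deleted after request: false
	reconcile(o) // cleanup runs, finalizer removed, object deleted
	fmt.Println("deleted after reconcile:", o.deleted) // deleted after reconcile: true
}
```

The warning above maps directly onto this model: if `reconcile` panics or never runs after the delete request, `deleted` stays false forever.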
## Status Subresource and Conditions

Your operator's primary communication channel with users (and other systems) is the `status` field. Get this right.

Always use the **Conditions** pattern for status. It's the Kubernetes-idiomatic way to communicate multi-dimensional state. The example below uses condition types modeled after the common Kubernetes Deployment pattern — adapt the types to your domain:

```yaml
status:
  phase: Running
  observedGeneration: 5  # which spec generation this status reflects
  conditions:
    - type: Ready
      status: "True"
      lastTransitionTime: "2024-01-15T10:00:00Z"
      reason: AllReplicasReady
      message: "3/3 replicas are ready"
    - type: Progressing
      status: "False"
      lastTransitionTime: "2024-01-15T10:01:00Z"
      reason: ReplicaSetAvailable
      message: "Rollout complete"
    - type: Available
      status: "True"
      lastTransitionTime: "2024-01-14T08:00:00Z"
      reason: MinimumReplicasAvailable
      message: "Deployment has minimum availability"
```

`observedGeneration` is critical and frequently missed. It tells observers which version of the spec this status corresponds to. Without it, you can't tell if `status.phase: Running` means "running the spec you just applied" or "running an older spec while the new one is being processed."

Always update status with `r.Status().Update(ctx, obj)`, not `r.Update(ctx, obj)`. The status subresource has a separate endpoint and a separate RBAC policy.
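One bookkeeping rule in the conditions pattern is worth internalizing: `lastTransitionTime` moves only when the condition's status actually flips, not on every write. The real helper is `meta.SetStatusCondition` in `k8s.io/apimachinery`; this is a hand-rolled sketch of the same rule with simplified types:

```go
package main

import (
	"fmt"
	"time"
)

type condition struct {
	Type               string
	Status             string // "True", "False", "Unknown"
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// setCondition upserts a condition by Type. LastTransitionTime is
// updated only when Status changes, so observers can tell how long
// the resource has been in its current state.
func setCondition(conds []condition, c condition) []condition {
	for i, existing := range conds {
		if existing.Type != c.Type {
			continue
		}
		if existing.Status == c.Status {
			c.LastTransitionTime = existing.LastTransitionTime // no flip
		} else {
			c.LastTransitionTime = time.Now()
		}
		conds[i] = c
		return conds
	}
	c.LastTransitionTime = time.Now()
	return append(conds, c)
}

func main() {
	var conds []condition
	conds = setCondition(conds, condition{Type: "Ready", Status: "False", Reason: "Provisioning"})
	t0 := conds[0].LastTransitionTime

	// Same status, new message: transition time must not move.
	conds = setCondition(conds, condition{Type: "Ready", Status: "False", Reason: "Provisioning", Message: "1/3 ready"})
	fmt.Println("time unchanged:", conds[0].LastTransitionTime.Equal(t0)) // time unchanged: true

	// Status flips: transition time moves.
	time.Sleep(5 * time.Millisecond)
	conds = setCondition(conds, condition{Type: "Ready", Status: "True", Reason: "AllReplicasReady"})
	fmt.Println("time moved:", !conds[0].LastTransitionTime.Equal(t0)) // time moved: true
}
```

Getting this wrong in either direction breaks consumers: resetting the timestamp on every reconcile makes "how long has this been unhealthy?" unanswerable, while never updating it hides flapping.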
The main update endpoint ignores status changes; the status endpoint ignores spec changes.

## Generation vs ObservedGeneration: A Deep Dive

This is one of the most misunderstood mechanics in operator development, yet it's fundamental to building correct status reporting. Let's be precise.

`metadata.generation` is a monotonically incrementing integer managed entirely by the API server. It increments **only when the spec changes** — status updates, label changes, and annotation changes do not increment it. This is why `GenerationChangedPredicate` works: it filters out the noise.

`status.observedGeneration` is a field your controller writes to `status` after completing a reconciliation. It should be set to the `metadata.generation` value of the object you just reconciled.

The pattern lets any observer — including `kubectl wait`, GitOps controllers, and your own tooling — determine whether the controller has finished processing the latest spec without any out-of-band signaling:

```go
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	db := &myv1.Database{}
	if err := r.Get(ctx, req.NamespacedName, db); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// ... reconcile logic ...

	// At the end: stamp observedGeneration
	db.Status.ObservedGeneration = db.Generation
	db.Status.Phase = "Running"
	return ctrl.Result{}, r.Status().Update(ctx, db)
}
```
Without `observedGeneration`, a `status.phase: Running` is ambiguous — it could mean "running the spec you just applied 30 seconds ago" or "running an old spec that's three versions behind." With it, observers have a precise, reliable signal.

## Concurrency, MaxConcurrentReconciles, and Cache Scoping

By default, controller-runtime runs **one reconciler goroutine per controller**. For many operators this is fine, but for operators managing hundreds or thousands of independent custom resources, this is a significant throughput bottleneck. Enter `MaxConcurrentReconciles`:

```go
ctrl.NewControllerManagedBy(mgr).
	For(&myv1.Database{}).
	WithOptions(controller.Options{
		MaxConcurrentReconciles: 10,
	}).
	Complete(r)
```

This allows up to 10 reconciler goroutines to run in parallel for different objects. A few important points:

- **The work queue guarantees per-object serialization.** Even with `MaxConcurrentReconciles: 10`, the same `namespace/name` key will never be dispatched to two goroutines simultaneously. You get concurrency across different objects, not within a single object's reconciliation chain.
- **Your reconciler must be goroutine-safe.** Any shared state (metrics counters, caches, client connections) must be safe for concurrent access. The controller-runtime client is safe. Custom state you add to the reconciler struct is your responsibility.
- **Rate limiting still applies globally.** High `MaxConcurrentReconciles` combined with a tight rate limiter creates goroutines waiting on the rate limiter. Tune both together.

A good starting heuristic: size `MaxConcurrentReconciles` so that concurrency divided by average reconcile latency comfortably exceeds the rate at which your objects need reconciling. For 1000 objects reconciling in ~500ms each, `MaxConcurrentReconciles: 5` sustains about 10 reconciles per second — a full sweep of every object in under two minutes, which is comfortable headroom for most workloads.

### Cache Scoping for Large Clusters

By default, the controller-runtime cache watches all namespaces. In large multi-tenant clusters this can mean caching thousands of objects your operator doesn't care about. Cache scoping is the solution:

```go
mgr, err := ctrl.NewManager(cfg, ctrl.Options{
	Cache: cache.Options{
		// Only cache objects in specific namespaces
		DefaultNamespaces: map[string]cache.Config{
			"tenant-a": {},
			"tenant-b": {},
		},
	},
})
```

**Field indexing** is another powerful tool.
If your reconciler frequently lists objects filtered by a custom field, add an index to the cache:

```go
// Index Databases by their referenced Secret name
if err := mgr.GetFieldIndexer().IndexField(
	ctx, &myv1.Database{}, ".spec.credentialsSecret",
	func(obj client.Object) []string {
		db := obj.(*myv1.Database)
		return []string{db.Spec.CredentialsSecret}
	},
); err != nil {
	return err
}

// Now you can efficiently list all DBs referencing a secret
dbList := &myv1.DatabaseList{}
r.List(ctx, dbList, client.MatchingFields{".spec.credentialsSecret": secretName})
```

Without an index, this `List` does a full cache scan. With it, it's an O(1) lookup. At scale, this is the difference between a 1ms and a 200ms reconciliation.

### Optimistic Locking and Conflict Retries

API server conflicts (`409 Conflict`) are a normal part of operating at scale. When your reconciler reads an object, modifies it, and writes it back — and something else has modified it in between — you get a conflict.
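Stripped of the client machinery, the recovery is just "re-read, re-apply, try again until the compare-and-swap succeeds." A toy model of that loop (hypothetical types, not the real retry package):

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// obj lives in a fake API server that enforces optimistic locking.
type obj struct {
	rev   int
	phase string
}

var errConflict = errors.New("409 Conflict")

type server struct{ o obj }

func (s *server) get() (obj, string) { return s.o, strconv.Itoa(s.o.rev) }

func (s *server) update(n obj, rv string) error {
	if rv != strconv.Itoa(s.o.rev) {
		return errConflict
	}
	n.rev = s.o.rev + 1
	s.o = n
	return nil
}

// retryOnConflict mirrors the shape of client-go's helper: run fn,
// and if it hit a conflict, run it again (fn re-reads internally),
// up to maxRetries attempts.
func retryOnConflict(maxRetries int, fn func() error) error {
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

func main() {
	s := &server{}
	interfered := false // a concurrent writer wins the race exactly once

	err := retryOnConflict(5, func() error {
		cur, rv := s.get() // always re-read first
		cur.phase = "Running"
		if !interfered {
			interfered = true
			other, orv := s.get()
			s.update(other, orv) // someone else updates between read and write
		}
		return s.update(cur, rv)
	})
	fmt.Println(err, s.o.phase) // <nil> Running
}
```

The essential detail is that the re-read happens inside the retried function: retrying the stale write with the old `resourceVersion` would fail forever.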
The correct response is to re-read and retry:

```go
import "k8s.io/client-go/util/retry"

err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
	// Re-fetch to get the latest resourceVersion
	if err := r.Get(ctx, req.NamespacedName, db); err != nil {
		return err
	}
	// Apply your changes to the freshly-fetched object
	db.Status.Phase = computedPhase
	return r.Status().Update(ctx, db)
})
```

`retry.DefaultRetry` retries up to 5 times with a 10ms base delay (factor 1.0, so the delay is effectively constant, with 0.1 jitter). For status updates this is usually sufficient. For spec updates, prefer server-side apply, which handles conflicts at the field-ownership level rather than requiring a full re-read/retry.

## Leader Election

In production, you run multiple replicas of your operator for high availability. But you don't want multiple replicas simultaneously reconciling the same objects — that leads to conflicts and thrashing. Leader election solves this.

Controller-runtime uses a **Lease** object in the cluster as the distributed lock. The leader holds the lease by periodically renewing it. If the leader fails to renew before the lease expires, another replica acquires it.
Configuration in controller-runtime:

```go
mgr, err := ctrl.NewManager(cfg, ctrl.Options{
	LeaderElection:          true,
	LeaderElectionID:        "my-operator-leader",
	LeaderElectionNamespace: "my-operator-system",
	LeaseDuration:           &leaseDuration, // default 15s
	RenewDeadline:           &renewDeadline, // default 10s
	RetryPeriod:             &retryPeriod,   // default 2s
})
```

**Standby replicas still run the cache** — they maintain informers and local caches, but they don't start the controllers. This means failover is fast (no cold start for the informer sync) because the new leader already has a warm cache.

**Important nuance:** Leader election **reduces** the likelihood of concurrent reconciliations, but it does not eliminate it entirely. During the lease expiry window, a brief overlap is possible where both the old and new leader are active. Controllers must still be written to tolerate conflicts and retries. Never assume strict single-threaded execution at the cluster level — your reconciler must be safe to run concurrently.

**Caution:** Leader election adds latency to recovery. With `LeaseDuration=15s`, a leader failure can cause up to 15 seconds of no-reconciliation. Tune this based on your operator's latency requirements.

## Webhooks: Admission and Conversion

Webhooks are the mechanism to inject logic into the API server's request pipeline.

**Defaulting webhooks** (MutatingAdmissionWebhook) run before storage and let you inject default field values. This is essential for forward compatibility — when you add a new required field to v2 of your CRD, a defaulting webhook can populate it for resources created without it.
**Validating webhooks** (ValidatingAdmissionWebhook) run after mutation and let you reject invalid requests with human-readable error messages. This is where you enforce complex business rules that can't be expressed in OpenAPI schema (cross-field validation, external system checks, etc.).

**Conversion webhooks** are needed when you have multiple active API versions of a CRD. The API server stores objects in one version (the `storage: true` version) but can serve them in other versions. Conversion webhooks handle the transformation between versions.

```go
// controller-runtime webhook setup
func (r *Database) Default() {
	if r.Spec.Replicas == nil {
		defaultReplicas := int32(1)
		r.Spec.Replicas = &defaultReplicas
	}
}

func (r *Database) ValidateCreate() (admission.Warnings, error) {
	if r.Spec.StorageSize.Cmp(minStorage) < 0 {
		return nil, fmt.Errorf("storage size must be at least %s", minStorage.String())
	}
	return nil, nil
}
```

Webhooks require TLS certificates and must be running before the API server can call them. Certificate management is operationally annoying — use cert-manager or controller-runtime's built-in certificate provisioner.

## Operator Patterns and Anti-Patterns

After years of writing and reviewing operators, here's the distilled wisdom.

### Patterns to Follow

**Adopt server-side apply (SSA) whenever possible:** Use server-side apply (`client.Apply`) instead of create-or-update. It's declarative, handles field ownership correctly, and is idempotent by design.
One critical caveat: if you adopt SSA, use it consistently for all managed resources. Mixing `Update` and `Apply` on the same fields causes `managedFields` ownership conflicts that are painful to debug and resolve.

```go
// Instead of the create-or-update dance:
patch := client.Apply
obj.ManagedFields = nil // Let SSA manage this
err = r.Patch(ctx, obj, patch, client.ForceOwnership, client.FieldOwner("my-operator"))
```

**Use Patch over Update:** Always prefer `Patch` (specifically strategic merge patch or JSON patch) over `Update` for status and spec changes. `Update` replaces the entire object and is prone to conflicts; `Patch` is surgical and conflict-resistant.

**Emit Events:** Use the event recorder to emit Kubernetes events for significant state transitions. This gives users visibility via `kubectl describe`:

```go
r.Recorder.Event(db, corev1.EventTypeWarning, "ProvisioningFailed", "Failed to create PVC")
```

**Separate controllers for separate concerns:** Don't build a monolithic reconciler. If your operator manages both the database cluster and its backup schedule, use two controllers with a shared cache.

### Anti-Patterns to Avoid

**Don't store state in the controller process.** Your controller can be restarted, scaled, or fail over at any moment. The only source of truth is the Kubernetes API. If you need to persist computed state, put it in `status` or in a ConfigMap.

**Don't busy-loop with short requeue intervals.** In most cases, sub-10-second polling intervals are unnecessary and wasteful.
Prefer watch-based triggers unless the external system cannot emit events. For fast-moving, short-lived state machines (e.g., managing transient Jobs), shorter intervals may be valid — but they should be the exception, not the default. If you truly need polling, make the interval configurable so it can be tuned per deployment.

**Don't ignore `resourceVersion` conflicts.** A `409 Conflict` from the API server means someone else updated the object between your read and write. The correct response is to re-fetch and retry, not to log and continue.

**Don't call the API server inside tight loops.** Fetching all pods to check readiness in a loop that runs every reconciliation is expensive. Use the cache, or precompute what you need at the start of reconciliation.

**Don't use Update when Patch will do.** Using `r.Update(ctx, obj)` after modifying the spec will overwrite any changes made between your read and your write. Prefer patch operations.

## Observability and Debugging

An operator you can't observe is an operator you can't trust in production.
### Metrics

Controller-runtime exports Prometheus metrics out of the box:

```
# Work queue depth — a leading indicator of reconciliation backlog
workqueue_depth{name="database"} 42

# Reconcile duration histogram — p99 tells you about slow reconciliations
controller_runtime_reconcile_time_seconds_bucket{controller="database", le="0.1"} 1000

# Reconcile errors — should be near zero in steady state
controller_runtime_reconcile_errors_total{controller="database"} 5

# How long processing an item from the work queue takes
workqueue_work_duration_seconds_bucket{name="database"}
```

Always add custom metrics for your domain:

```go
var databasesProvisioning = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "myoperator_databases_provisioning",
    Help: "Number of databases currently in provisioning state",
})
```

### Structured Logging

Use structured logging (the `logr` interface) with consistent fields:

```go
log := log.FromContext(ctx).WithValues(
    "database", req.NamespacedName,
    "generation", db.Generation,
    "phase", db.Status.Phase,
)
log.Info("Starting reconciliation")
```

### Tracing

For complex operators with many API calls, distributed tracing (OpenTelemetry) provides invaluable insight into where time is spent during
reconciliation.

### Common Debugging Commands

```shell
# Watch reconciler output in real time
kubectl logs -n operator-system deploy/my-operator -f | jq '.'

# Inspect the custom resource, including status
kubectl get database mydb -o yaml

# Check events for a custom resource
kubectl describe database mydb

# Force a reconcile by touching an annotation
kubectl annotate database mydb force-reconcile=$(date +%s) --overwrite

# Check the lease for leader election
kubectl get lease -n operator-system
```

## Production Considerations

### Resource Management

Always set resource requests and limits on your operator pod. An operator without limits can starve other workloads during a reconciliation storm.

### RBAC Least Privilege

Your operator's ServiceAccount should only have the permissions it actually needs. A common mistake is granting `cluster-admin` for convenience.
Use the Kubebuilder RBAC markers to generate precise RBAC manifests:

```go
//+kubebuilder:rbac:groups=mycompany.io,resources=databases,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=mycompany.io,resources=databases/status,verbs=get;update;patch
//+kubebuilder:rbac:groups=apps,resources=statefulsets,verbs=get;list;watch;create;update;patch;delete
```

### Graceful Shutdown

Handle `SIGTERM` gracefully. The controller-runtime manager's `Start` function blocks until context cancellation, at which point it stops all controllers and waits for in-flight reconciliations to complete (up to a timeout). Make sure your reconciler respects context cancellation:

```go
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Check context at expensive checkpoints
    select {
    case <-ctx.Done():
        return ctrl.Result{}, ctx.Err()
    default:
    }
    // ... reconcile logic
}
```

### Testing Strategy

Use `envtest` (from controller-runtime) for integration tests. It spins up a real etcd and API server, installs your CRDs, and lets you test full reconciliation loops without a cluster. This is your most valuable testing layer.

### Upgrade Considerations

When upgrading your operator, consider:

- **CRD schema changes:** Adding fields is safe. Removing or renaming fields is breaking. Use conversion webhooks for major schema evolution.
- **Controller logic changes:** New reconciler behavior applied to existing resources — think through the transition. Add a migration annotation or a one-time migration job if needed.
- **State machine transitions:** If you're adding new phases to your state machine, ensure existing resources in "old" phases are handled by the updated controller.

## Conclusion

Kubernetes Operators are one of the most powerful extension mechanisms ever built into a distributed system platform. But that power comes with complexity. The controller runtime, informers, work queues, rate limiters, finalizers, and webhooks form a sophisticated machinery that, once understood, enables you to build remarkably robust automation.

The key mental models to internalize:

- **Level-triggered reconciliation** — always reconcile toward desired state, don't just react to events. This gives you resilience for free.
- **The cache is your friend** — reads from the cache, writes to the API. This is the performance contract the entire system is designed around.
- **Idempotency is not optional** — your reconciler will be called many times for the same state. Design it accordingly from day one.
- **Status is a contract** — `observedGeneration`, conditions with reasons and messages, precise phase transitions. This is how your operator communicates with the world.
The operators you build are, in a very real sense, pieces of software that will run 24/7, autonomously managing production infrastructure. Treat them with the same rigor you'd apply to any production-critical system: test thoroughly, observe everything, and design for failure.

## Ready to Build Your Own Operator?

If you want to go from zero to production-ready Kubernetes operators with hands-on practice, check out the Kubernetes Operators Course — a practical, end-to-end course that walks you through building operators from the basics all the way to production-grade patterns. It's a great companion to the internals covered in this post.

Found a bug or inaccuracy? The beauty of operators — and this blog post — is that there's always room for a reconciliation loop.