There’s a question nobody in the room asks before an AI system goes live. It’s not about accuracy. It’s not about latency or cost. It’s not even about which model to use. It’s this: when this thing does something wrong, and it will, whose job is it to know?
I’ve sat in a lot of AI deployment reviews. The accuracy metrics are always there. The integration diagrams are always there. The rollout plan is always there. The named human who is personally accountable for detecting when the system drifts, and empowered to do something about it before the damage compounds, is seldom there. That gap is not a monitoring problem. It’s not a tooling problem. It’s a decision that never got made.
"Trust" is something you develop over time, as evidence accumulates. Accountability is something you design before you need it. Most AI deployments have one and completely skipped the other.
This is the article I wish had existed when I was building observability infrastructure at Splunk and Cisco. It’s about what I learned watching capable systems fail, not because the technology broke, but because the human structure around the technology was never built.
The trust question is a distraction from the one that actually matters
For the last several years, the dominant AI governance conversation has circled the same questions: Is it accurate enough? Can we trust it? How do we know when it’s wrong? These are legitimate. They’re also, in a precise sense, beside the point. Trust is retrospective. You earn it through repeated performance under real conditions. It accretes over time, or it erodes. It’s not something you can install before a deployment, and it’s not something a governance framework produces.
Accountability is different. Accountability is a decision you make before the system runs. It answers a simpler question: when something goes wrong with this AI system, who is responsible for knowing it happened, what are they supposed to do about it, and do they have what they need to actually do that?
In my experience, most organizations have invested heavily in the trust side (eval frameworks, red teaming, accuracy benchmarks) and almost nothing in the accountability side. The result is a fleet of well-tested AI systems with no organizational immune system to catch their failures in production.
Observation: In every AI deployment post-mortem I’ve been part of, the data needed to catch the problem earlier was present. What was missing was a human whose job it was to act on it.
What ownership means, and why "the data science team" is not an answer
When I ask teams who owns their AI system in production, I get one of three answers. The most common is "the data science team." The second most common is a long pause followed by "well, it’s kind of shared." The third, occasionally, is a specific name: a person, not a team, with a defined mandate. Only the third answer means anything.
Shared accountability is a precise term for no accountability. When a model starts drifting, when outputs start degrading, when the system starts producing recommendations that operations teams are quietly overriding at scale, "the data science team" does not get paged. A person gets paged. And if there is no person, nobody gets paged, and the drift compounds.
I watched this play out in a supply chain deployment I was brought in to review. A forecasting model had been running for three quarters. It performed well at launch. Then a key supplier restructured their lead times. Not dramatically: the kind of shift that registers as a footnote in a vendor review, not a headline.
The model had no mechanism to recognize what changed. The organization had no process to catch that the model’s assumptions had quietly broken. By quarter four, inventory recommendations were systematically biased. Buffer stock piling up in the wrong categories. Actual gaps accumulating elsewhere. The drift went unnoticed for nearly two months. When I asked who owned model performance monitoring, I got the answer I’ve heard in at least half a dozen similar situations:
"The data science team has a dashboard."
There was no review cadence. No escalation threshold. No named person accountable for connecting that dashboard to the operational outcomes it was supposed to be influencing. The model did exactly what it was built to do, but the organizational structure around it failed completely.
"Human in the loop" is often theater. Here’s why.
"Human in the loop" has become the standard answer to every AI governance concern. It sounds reassuring. In practice, it’s frequently a work of fiction. Here is the actual situation the human is in. They’re presented with an AI output they didn’t generate, from a system they didn’t build, trained on data they haven’t reviewed, operating under conditions they may not know have changed. They’re asked, implicitly or explicitly, to evaluate it.
In most cases, they can’t. Not because they lack experience. Because the information required to genuinely evaluate the recommendation isn’t available to them at the moment of decision. So they do what humans in ambiguous, time-pressured situations reliably do: they default to the apparent authority of the system in front of them. They approve it. They bear accountability for the outcome. They were never in a position to actually exercise judgment about the input.
Putting a human in the loop and giving that human what they actually need to exercise genuine judgment are two completely different engineering decisions. Most implementations do the first and never get to the second.
I designed observability systems specifically to close this gap at Splunk and Cisco. The problem is never the absence of data. It’s always the absence of the right data, surfaced to the right person, in a form that makes judgment tractable at the moment it’s needed. That is a design problem. It has a design solution. But you have to recognize it as a design problem first, which means you have to stop treating "human in the loop" as the solution rather than the requirement.
The chaos engineering insight that changed how I think about AI ownership
In 2022, I co-invented a patent at Cisco, US 12242370, “Intent-Based Chaos Level Creation to Variably Test Environments”, that was built around a problem I now see everywhere in AI deployments.
The problem: how do you know when a complex, autonomous production system is deviating from its intended operating behavior, before the damage is already done?
The solution we developed was a “chaos level”, a calibrated, ML-driven measure of how far a system is being pushed away from its intended behavior, derived from real topology telemetry and historical failure data. You can inject failures deliberately, measure deviation from intent, and retrain based on feedback from the production environment. The key principle: you cannot manage deviation from intended behavior unless you have formally defined, in measurable terms, what intended behavior looks like.
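The principle can be made concrete with a toy sketch. This is not the patented algorithm, just a minimal illustration of the core idea: you can only score deviation once intended behavior has been written down in measurable terms. The field names, baseline values, and tolerance bands below are assumptions for illustration.

```python
# Illustrative only: a toy "deviation from intended behavior" score.
# Assumes each telemetry field has a pre-agreed baseline and tolerance band.
from dataclasses import dataclass
from typing import Dict

@dataclass
class IntendedBehavior:
    baseline: Dict[str, float]   # e.g. {"latency_ms": 120.0, "override_rate": 0.04}
    tolerance: Dict[str, float]  # acceptable absolute deviation per field

def chaos_level(telemetry: Dict[str, float], intent: IntendedBehavior) -> float:
    """0.0 means on-intent; 1.0 or above means at least one
    signal is fully outside its tolerance band."""
    deviations = [
        abs(telemetry[k] - intent.baseline[k]) / intent.tolerance[k]
        for k in intent.baseline
    ]
    return max(deviations)

intent = IntendedBehavior(
    baseline={"latency_ms": 120.0, "override_rate": 0.04},
    tolerance={"latency_ms": 60.0, "override_rate": 0.05},
)
# Latency is well within band, but operator overrides have tripled:
print(chaos_level({"latency_ms": 130.0, "override_rate": 0.12}, intent))
```

Notice that the number is meaningless without the `IntendedBehavior` object. That is the point: the spec has to exist before the score can.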
That principle is the missing foundation in most AI governance discussions. Teams define what they want their model to produce. Almost none define, formally and in advance, what “still working correctly” looks like in operational terms, precise enough that deviation is detectable before outcomes provide the evidence.
The table below makes this mapping concrete. Every component of the chaos level system has a direct equivalent in AI production governance. The organizations that have built these equivalents before deployment are the ones whose AI investments are actually compounding.
| Chaos Engineering (Patent) | What It Does | AI Governance Equivalent | What Breaks Without It |
|---|---|---|---|
| Intent-based chaos level | Calibrates how far from intended behavior you’re pushing the system | Operating envelope definition | System acts outside boundaries undetected |
| Topology telemetry | Reads real production state, not assumed state | Behavioral telemetry engine | Drift invisible until outcomes degrade |
| Feedback loop retraining | Adjusts chaos scale based on observed impact | Named owner + review cadence | No mechanism to catch model drift in time |
| Blast radius measurement | Bounds how much damage a failure can cause | Reversibility thresholds | Irreversible actions taken without approval |
| Chaos experiment execution | Controlled failure injection under real conditions | Canary + shadow deployment | Production incidents as the only signal |
| Intended behavior baseline | Formal definition of what correct looks like | Pre-deployment behavior spec | No shared definition of "still working" |
The CFO analogy: what real operational ownership looks like
I find analogies more useful than frameworks when trying to change how people think about a problem. The one that lands most consistently is financial governance.
A CFO doesn’t just have access to the financials. They have a mandate over them. A defined review cadence. Explicit criteria for what “in order” and “out of order” mean operationally, not just statistically. A clear escalation path. And critically: a specific human being who is accountable for certifying that the numbers reflect reality, not just for having access to a dashboard that displays them.
Financial governance didn’t develop because better accounting software was invented. It developed because organizations made a decision (deliberate, structural, made before the first audit) that financial integrity was important enough to assign to a human with real authority and a real mandate. That decision preceded the infrastructure. It had to.
The question for every AI deployment is not "do we have a dashboard?" It’s "is there a named human with a mandate over what that dashboard shows, and are they empowered to act on it?"
Run that checklist against any AI system your organization has deployed in the past 18 months. Most teams will find the dashboard. Almost none will find the mandate.
The three-layer ownership model that actually works in production
After seeing this failure pattern repeat across infrastructure, supply chain, and AI deployments, I’ve settled on a structure that works. It’s not a framework in the consulting sense. It’s three decisions that have to be made before deployment, not after the first incident.
Layer 1: Define the operating envelope before you define the workflow
This is the answer to "what does correct behavior look like?" expressed in operational terms. Not accuracy percentages. The systems the AI is authorized to touch. The actions it can take autonomously versus the ones that require approval. The reversibility threshold below which a human must sign off. Without this layer, you have no baseline to measure deviation against.
Layer 2: Build behavioral telemetry that surfaces signals, not just metrics
The difference between a useful monitoring system and a dashboard nobody reads is whether the signals it surfaces map to operational questions a human can act on. A confidence score dropping from 0.91 to 0.83 is a metric. That same drop, correlated with a 23% increase in the override rate from operations teams, is a signal. The telemetry has to be designed to produce the second, not just collect the first.
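As a sketch of that difference, here is a minimal example of a telemetry layer that emits a signal only when the two metrics move together in a way an owner can act on. The metric names and thresholds are hypothetical, chosen for illustration.

```python
# Hypothetical sketch: a "signal" fires only when raw metrics move together
# in a way that maps to an operational question. Thresholds are examples.
from dataclasses import dataclass
from typing import Optional

@dataclass
class WindowStats:
    mean_confidence: float  # model's average confidence in this window
    override_rate: float    # share of outputs operators overrode

def drift_signal(baseline: WindowStats, current: WindowStats) -> Optional[str]:
    conf_drop = baseline.mean_confidence - current.mean_confidence
    override_jump = (current.override_rate - baseline.override_rate) / max(
        baseline.override_rate, 1e-9
    )
    # Either movement alone is a metric; together they form a reviewable signal.
    if conf_drop > 0.05 and override_jump > 0.20:
        return (f"[DRIFT] confidence down {conf_drop:.2f} while overrides up "
                f"{override_jump:.0%}: operators no longer trust outputs")
    return None

baseline = WindowStats(mean_confidence=0.91, override_rate=0.10)
current = WindowStats(mean_confidence=0.83, override_rate=0.123)
print(drift_signal(baseline, current))
```

The design choice is that `drift_signal` returns a sentence a human can act on, not a number a human has to interpret. That is what makes the owner’s review cadence tractable.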
Layer 3: Assign a named owner with a real mandate, a review cadence, and a working kill switch
"Named" means a person, not a team. "Real mandate" means they have the authority to halt the system, not just file a ticket. "Working kill switch" means it has been tested under realistic conditions, not just documented in a runbook. These are not redundant qualifications. I’ve seen deployments fail on each of them individually.
What this looks like in code: Here’s a minimal Python implementation of the accountability wrapper that sits between your LLM orchestrator and production execution. This is the scaffolding most teams skip, the layer that enforces the operating envelope, generates the behavioral telemetry, and wires in the escalation path.
```python
# accountability_wrapper.py: the ownership layer, condensed
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class OperatingEnvelope:
    allowed_systems: List[str]        # what the agent may touch
    confidence_floor: float = 0.82    # halt below this (drift signal)
    max_actions_per_run: int = 40     # circuit breaker
    approval_threshold: float = 0.93  # irreversible actions need sign-off

class AccountabilityWrapper:
    def __init__(self, agent_fn: Callable, envelope: OperatingEnvelope,
                 owner_email: str):   # a named person, never a team alias
        self.agent_fn = agent_fn
        self.envelope = envelope
        self.owner_email = owner_email
        self.audit_log: List[Dict] = []

    def owner_notify(self, message: str) -> None:
        # Wire this to paging/alerting; it must reach one accountable human.
        print(f"notify {self.owner_email}: {message}")

    def kill_switch(self) -> None:
        # last tested under realistic conditions: <date goes here>
        raise SystemExit("kill switch engaged")

    def run(self, task, context) -> Dict:
        result = self.agent_fn(task, context)
        # Layer 1: boundary enforcement
        if result['system'] not in self.envelope.allowed_systems:
            self.owner_notify('[BREACH] out-of-scope system')
            self.kill_switch()  # tested, not just documented
        # Layer 2: confidence floor (behavioral drift signal)
        if result['confidence'] < self.envelope.confidence_floor:
            self.owner_notify('[DRIFT] confidence below floor, review needed')
            return {'status': 'escalated'}
        # Layer 3: irreversible action gate
        if not result['reversible']:
            self.owner_notify('[APPROVAL REQUIRED] irreversible action')
            return {'status': 'awaiting_approval'}
        self.audit_log.append(result)  # every action logged for the owner's review cadence
        return {'status': 'ok'}
```
Two things worth noting about this code. First, the wrapper is wired to a specific person’s address, not a team alias. That’s intentional. Shared accountability is no accountability. Second, the comment on `kill_switch` is where the date of the last realistic test belongs. If you can’t fill in that date, you don’t have a kill switch. You have a line in a runbook.
Before you deploy: the ownership checklist
I’ve distilled the ownership questions into eight checks. If you can answer all eight before deployment, you have an accountability structure. If you can’t, you have a system that will fail silently until the outcomes force the conversation.
| Ownership Check | Answer | Risk If Skipped |
|---|---|---|
| Is there a named individual, not a team, accountable for this system’s behavior in production? | Yes / No | Nobody gets paged when it drifts |
| Is the operating envelope defined: what systems it can touch, what actions require approval? | Yes / No | Unconstrained action scope |
| Is there a written definition of "still working correctly" in operational terms, agreed before deployment? | Yes / No | No baseline to measure deviation against |
| Is there a review cadence, scheduled, not reactive, for the named owner to check operational signals? | Yes / No | Drift only caught after outcomes degrade |
| Is the escalation path explicit: who decides it’s a problem, who is authorized to act? | Yes / No | Inaction under uncertainty |
| Has the kill switch been tested under realistic conditions? (Not just documented.) | Yes / No | No stopping mechanism when needed |
| At the human handoff point, does the reviewer have what they need to exercise real judgment? | Yes / No | Rubber-stamp approval loop |
| Are irreversible actions gated behind explicit approval, with blast radius limits defined? | Yes / No | Permanent, unrecoverable errors |
The decision that has to come before everything else
I want to end where I started, because I think the framing matters.
The trust question ("can we trust this AI system?") is not a bad question. It’s just not the first question. You cannot answer it in advance, and you cannot answer it for a system that has no one responsible for tracking whether it’s still behaving correctly in the environment it was deployed into.
Ownership is what makes trust possible. A named person, with a defined mandate, who has the authority to act and the information to act intelligently, is the mechanism by which an AI system earns trust over time in a real production environment.
Without that person, you’re not deploying an AI system. You’re deploying a hypothesis that the system will perform as expected indefinitely, with no feedback loop to tell you when it stops.
Most AI deployments will eventually produce a moment where something goes wrong. The organizations that compound value from AI are not the ones where nothing goes wrong. They’re the ones where someone was ready when it did.
Build the ownership structure before you build the system. Define the operating envelope before you define the workflow. Name the person before you write the rollout plan.
That decision has to come first. Everything else is an implementation detail.
