Traditional CI/CD pipelines are built on a comforting lie:
If output == expected, ship it.
AI agents ruin that illusion.
LLM-powered agents don’t produce outputs — they produce distributions. Ask the same question twice and you might get:
- a different phrasing,
- a different reasoning path,
- sometimes a different (but still correct) answer.
So how do you build a green CI/CD pipeline for a system that is probabilistic by design?
This post is about how I’ve learned to test non-deterministic agents without neutering them, lying to myself, or turning CI into a flaky nightmare.
The Core Problem
Traditional tests assume:
- deterministic inputs,
- deterministic outputs,
- binary correctness.
AI agents offer:
- probabilistic outputs,
- semantic correctness,
- creativity as a feature.
If you test them the old way, you’ll end up with:
- constantly red builds,
- disabled tests,
- or worse — tests that always pass and mean nothing.
So we need new testing primitives.
Principle #1: Stop Testing Outputs. Start Testing Behavior.
Instead of asking:
“Did the agent say exactly this?”
Ask:
- Did it answer the question?
- Did it respect constraints?
- Did it avoid hallucinations?
- Did it use the tool correctly?
- Did it terminate when it should?
This mindset shift is everything.
LLM-as-a-Judge: Let AI Test AI
One of the most practical patterns I’ve used is LLM-as-a-Judge.
The idea
- Your production agent uses a smaller or cheaper model (e.g., Llama-3, GPT-4o-mini).
- Your test suite uses a stronger model (e.g., GPT-4o) to grade the output.
This sounds sketchy until you realize:
- humans already do subjective evaluation,
- CI systems just need consistent grading, not perfection.
What the judge checks
- correctness
- completeness
- policy adherence
- reasoning validity (without exposing chain-of-thought)
- hallucination risk
Example: LLM-as-a-Judge in CI
def judge_response(task, agent_output):
rubric = f"""
You are grading an AI agent.
Task:
{task}
Agent Output:
{agent_output}
Score the response from 1 to 5 on:
- Correctness
- Completeness
- Constraint adherence
- Hallucination risk
Return JSON only:
{{
"score": <int>,
"reason": "<brief explanation>"
}}
"""
judge_response = call_llm(
model="gpt-4o",
prompt=rubric
)
return parse_json(judge_response)
Then your CI assertion becomes:
result = judge_response(task, agent_output)
assert result["score"] >= 4
You’re no longer testing exact words.
You’re testingquality thresholds.
Semantic Similarity > String Equality
String equality checks are useless for LLMs.
Instead, test meaning.
What this looks like
- embedding similarity
- entailment checks
- topic coverage
- keyword presence (with flexibility)
Example: Semantic Similarity Assertion
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
def semantic_match(expected, actual, threshold=0.85):
emb_expected = model.encode(expected, convert_to_tensor=True)
emb_actual = model.encode(actual, convert_to_tensor=True)
similarity = util.cos_sim(emb_expected, emb_actual)
return similarity.item() >= threshold
CI test:
assert semantic_match(
expected="User must reset password via email",
actual=agent_output
)
This lets:
- wording vary,
- reasoning differ,
- creativity exist…
…without breaking the build.
The Flaky Test Nightmare (and How to Survive It)
Non-determinism creates flaky tests by default.
If you pretend otherwise, your CI will slowly become ignored.
The real trade-off
- More creativity → more variability
- More constraints → less intelligence
You can’t eliminate this tension.
You can onlymanage it.
Patterns That Actually Work
1. Multiple Runs, Majority Vote
Run the same test 3–5 times.
scores = [judge_response(task, run()).score for _ in range(5)]
assert sum(s >= 4 for s in scores) >= 4
You’re testing stability of behavior, not single outputs.
2. Test Invariants, Not Answers
Examples of invariants:
- The agent must not invent data.
- The agent must call the tool before answering.
- The agent must stop within N steps.
- The agent must refuse unsafe requests.
These are binary, even if outputs aren’t.
3. Tiered CI Gates
Not every test needs to block deploys.
Example:
- Hard gate: policy violations, tool misuse, infinite loops
- Soft gate: quality score dips (alerts, not failures)
- Offline evals: creativity, style, tone
CI doesn’t have to be cruel to be useful.
What This Changed for Me
Before:
- I tried to make AI behave like deterministic software.
- CI felt hostile.
- Tests were either brittle or meaningless.
After:
- CI validates behavioral reliability.
- Failures are informative.
- Green builds actually mean something.
The goal isn’t perfection.
It’sconfidence under uncertainty.
Final Thought
Testing AI agents isn’t about forcing determinism.
It’s about answering one question honestly:
“Did this agent behave acceptably under uncertainty?”
If your CI/CD pipeline can answer that,
you’re already ahead of most teams.
