Building CI/CD Pipelines for Non-Deterministic Agents

Written by nikitakothari | Published 2026/02/17

TL;DR: Traditional CI/CD breaks for probabilistic systems. Use LLM-as-a-Judge to evaluate agent outputs. Replace string equality with semantic assertions. Expect flakiness; manage it with multiple runs and invariants. Test behavior, not exact answers.

Traditional CI/CD pipelines are built on a comforting lie:

If output == expected, ship it.

AI agents ruin that illusion.

LLM-powered agents don’t produce a single fixed output; they produce a distribution over possible outputs. Ask the same question twice and you might get:

  • a different phrasing,
  • a different reasoning path,
  • sometimes a different (but still correct) answer.

So how do you build a green CI/CD pipeline for a system that is probabilistic by design?

This post is about how I’ve learned to test non-deterministic agents without neutering them, lying to myself, or turning CI into a flaky nightmare.


The Core Problem

Traditional tests assume:

  • deterministic inputs,
  • deterministic outputs,
  • binary correctness.

AI agents offer:

  • probabilistic outputs,
  • semantic correctness,
  • creativity as a feature.

If you test them the old way, you’ll end up with:

  • constantly red builds,
  • disabled tests,
  • or worse — tests that always pass and mean nothing.

So we need new testing primitives.


Principle #1: Stop Testing Outputs. Start Testing Behavior.

Instead of asking:

“Did the agent say exactly this?”

Ask:

  • Did it answer the question?
  • Did it respect constraints?
  • Did it avoid hallucinations?
  • Did it use the tool correctly?
  • Did it terminate when it should?

This mindset shift is everything.
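
To make the shift concrete, here is a minimal sketch of a behavior-level test. Everything in it is hypothetical: run_agent, the fields on its result, and the lookup_account tool name are placeholders for whatever your agent framework actually exposes.

def test_password_reset_behavior():
    # Hypothetical helper: runs the agent and returns the answer plus its trace
    result = run_agent("How do I reset my password?")

    # Did it answer the question at all?
    assert result.answer.strip() != ""

    # Did it use the tool correctly?
    assert any(call.tool == "lookup_account" for call in result.tool_calls)

    # Did it terminate when it should?
    assert len(result.steps) <= 10

    # Did it respect constraints? (no internal-only content in a customer-facing answer)
    assert "INTERNAL" not in result.answer

None of these assertions care how the agent phrases its answer; they only care whether it behaved.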


LLM-as-a-Judge: Let AI Test AI

One of the most practical patterns I’ve used is LLM-as-a-Judge.

The idea

  • Your production agent uses a smaller or cheaper model (e.g., Llama-3, GPT-4o-mini).
  • Your test suite uses a stronger model (e.g., GPT-4o) to grade the output.

This sounds sketchy until you realize:

  • humans already do subjective evaluation,
  • CI systems just need consistent grading, not perfection.

What the judge checks

  • correctness
  • completeness
  • policy adherence
  • reasoning validity (without exposing chain-of-thought)
  • hallucination risk

Example: LLM-as-a-Judge in CI

import json

def judge_response(task, agent_output):
    rubric = f"""
    You are grading an AI agent.

    Task:
    {task}

    Agent Output:
    {agent_output}

    Score the response from 1 to 5 on:
    - Correctness
    - Completeness
    - Constraint adherence
    - Hallucination risk

    Return JSON only:
    {{
      "score": <int>,
      "reason": "<brief explanation>"
    }}
    """

    # call_llm is a thin wrapper around your LLM client of choice
    raw_verdict = call_llm(
        model="gpt-4o",
        prompt=rubric
    )

    return json.loads(raw_verdict)

Then your CI assertion becomes:

result = judge_response(task, agent_output)
assert result["score"] >= 4

You’re no longer testing exact words.
You’re testing quality thresholds.


Semantic Similarity > String Equality

String equality checks are useless for LLMs.

Instead, test meaning.

What this looks like

  • embedding similarity
  • entailment checks
  • topic coverage
  • keyword presence (with flexibility)

Example: Semantic Similarity Assertion

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_match(expected, actual, threshold=0.85):
    emb_expected = model.encode(expected, convert_to_tensor=True)
    emb_actual = model.encode(actual, convert_to_tensor=True)
    similarity = util.cos_sim(emb_expected, emb_actual)
    return similarity.item() >= threshold

CI test:

assert semantic_match(
    expected="User must reset password via email",
    actual=agent_output
)

This lets:

  • wording vary,
  • reasoning differ,
  • creativity exist…

…without breaking the build.
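
Entailment checks, from the list above, are a useful complement when the answer must logically support a required claim rather than merely sit near it in embedding space. Here is a sketch using a sentence-transformers NLI cross-encoder; the model name and its contradiction/entailment/neutral label order come from that model's card, so treat both as assumptions to verify before relying on them in CI.

from sentence_transformers import CrossEncoder

# NLI cross-encoder; per the model card, it scores
# [contradiction, entailment, neutral] for each premise/hypothesis pair
nli_model = CrossEncoder("cross-encoder/nli-deberta-v3-base")
NLI_LABELS = ["contradiction", "entailment", "neutral"]

def entails(premise, hypothesis):
    # Does the agent output (premise) entail the required claim (hypothesis)?
    scores = nli_model.predict([(premise, hypothesis)])
    return NLI_LABELS[int(scores.argmax())] == "entailment"

CI test:

assert entails(
    agent_output,
    "The user must reset the password via email"
)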


The Flaky Test Nightmare (and How to Survive It)

Non-determinism creates flaky tests by default.

If you pretend otherwise, your CI will slowly become ignored.

The real trade-off

  • More creativity → more variability
  • More constraints → less intelligence

You can’t eliminate this tension.
You can only manage it.


Patterns That Actually Work

1. Multiple Runs, Majority Vote

Run the same test 3–5 times.

scores = [judge_response(task, run())["score"] for _ in range(5)]
assert sum(s >= 4 for s in scores) >= 4

You’re testing stability of behavior, not single outputs: the threshold tolerates one bad run out of five instead of demanding perfection.


2. Test Invariants, Not Answers

Examples of invariants:

  • The agent must not invent data.
  • The agent must call the tool before answering.
  • The agent must stop within N steps.
  • The agent must refuse unsafe requests.

These are binary, even if outputs aren’t.
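
In code, invariants are plain assertions over the agent's trace. A sketch, again with hypothetical field names (steps, tool_calls, cited_figures) standing in for whatever your framework records:

MAX_STEPS = 10

def check_invariants(trace, source_documents):
    # The agent must stop within N steps.
    assert len(trace.steps) <= MAX_STEPS, "agent did not terminate in time"

    # The agent must call the tool before answering.
    step_types = [step.type for step in trace.steps]
    assert "tool_call" in step_types, "agent never called a tool"
    assert "answer" in step_types, "agent never produced a final answer"
    assert step_types.index("tool_call") < step_types.index("answer"), \
        "agent answered before calling its tool"

    # The agent must not invent data: every figure it cites must appear in a source.
    for figure in trace.cited_figures:
        assert any(figure in doc for doc in source_documents), f"unsupported figure: {figure}"

Each check either passes or fails, no matter how the answer is worded.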


3. Tiered CI Gates

Not every test needs to block deploys.

Example:

  • Hard gate: policy violations, tool misuse, infinite loops
  • Soft gate: quality score dips (alerts, not failures)
  • Offline evals: creativity, style, tone

CI doesn’t have to be cruel to be useful.
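
One lightweight way to wire the tiers into a single suite is to let hard-gate checks raise and soft-gate checks only emit warnings, so CI fails on the former and merely annotates the latter. A sketch (the gate conditions, ALLOWED_TOOLS, and run_agent are illustrative; judge_response is the judge from earlier):

import warnings

ALLOWED_TOOLS = {"lookup_account"}  # hypothetical allow-list

def hard_gate(condition, message):
    # Hard gate: fail the build.
    assert condition, f"HARD GATE: {message}"

def soft_gate(condition, message):
    # Soft gate: surface a warning in the CI log, but do not fail the build.
    if not condition:
        warnings.warn(f"SOFT GATE: {message}")

def test_release_candidate():
    trace = run_agent(task)                      # hypothetical helper
    result = judge_response(task, trace.answer)  # judge from the earlier example

    hard_gate(len(trace.steps) <= 10, "possible infinite loop")
    hard_gate(all(call.tool in ALLOWED_TOOLS for call in trace.tool_calls), "tool misuse")
    soft_gate(result["score"] >= 4, f"quality score dipped to {result['score']}")

Offline evals for creativity, style, and tone can then run outside the deploy path entirely.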


What This Changed for Me

Before:

  • I tried to make AI behave like deterministic software.
  • CI felt hostile.
  • Tests were either brittle or meaningless.

After:

  • CI validates behavioral reliability.
  • Failures are informative.
  • Green builds actually mean something.

The goal isn’t perfection.
It’s confidence under uncertainty.


Final Thought

Testing AI agents isn’t about forcing determinism.

It’s about answering one question honestly:

“Did this agent behave acceptably under uncertainty?”

If your CI/CD pipeline can answer that,
you’re already ahead of most teams.


Written by nikitakothari | I am a Senior Member of Technical Staff at Salesforce, where I build AI-driven enterprise solutions that integrate LLMs.