Why Your LLM Gives Different Answers Every Time
Last month, I was debugging a problem with our invoice processing system, which was returning different amounts for the same PDF file on consecutive calls. The root cause? I had failed to adjust the temperature parameter from its default value. What I learned is that temperature and seed are not just abstract terms. In shipping LLM-based features, they are essential parameters. Let me walk you through what I have learned so far.
Temperature: Your Creativity Dial
Temperature is a slider that ranges from 'boring but reliable' to 'creative but unpredictable.' It is a number (0-2) that determines how adventurous the model will be in selecting its next word.
The Mental Model
For every token an LLM outputs, it's like rolling weighted dice and selecting the outcome from thousands of possibilities. The temperature affects the weights:
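To make the dice metaphor concrete, here's a toy sketch of the math: temperature divides each token's raw score before the softmax turns scores into probabilities. (Real models do this over tens of thousands of tokens; the three-word vocabulary and scores here are invented for illustration.)

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw token scores into sampling probabilities.
    Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [score / temperature for score in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token scores for three candidate words
logits = [2.0, 1.0, 0.1]

cold = softmax_with_temperature(logits, 0.1)  # near-greedy: top token dominates
warm = softmax_with_temperature(logits, 1.0)  # the unmodified distribution
hot = softmax_with_temperature(logits, 2.0)   # flattened: underdogs get a real chance

print(cold[0])  # ≈ 0.99995 — at low temperature, "the accountant" almost always wins
print(hot[0])   # ≈ 0.50 — at high temperature, "the artist" rolls the dice for real
```

At temperature 0 the division blows up, which is why implementations special-case it as pure greedy decoding: always take the top token.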
Temperature = 0.0 (The Accountant)
- Always chooses the statistically most likely outcome
- Same input → same output (predictable and deterministic)
- I use this for: parsing documents, extracting data, crafting SQL queries
Real example: Our compliance report generator uses temp = 0.0 because regulators hate "creative" interpretations
Temperature = 0.7 (The Consultant)
- Trades reliability for some randomness in output
- Output reads as natural and varied while staying on-topic
- I use this for: customer support chatbots, email composition, documentation
Real example: Our internal Q&A chatbot uses temp = 0.7 to sound less robotic while being informative
Temperature = 1.5 (The Artist)
- Loosely follows rules to allow unexpected word combinations
- Can result in pure genius or pure garbage
- I use this for: brainstorming sessions, marketing text generation, naming new products
Real example: We used temp = 1.5 to generate 50 product name ideas and then filtered out the bad ones
Seeing Temperature in Action
Here’s an actual experiment I ran while implementing a feature description generator:
```python
import openai

prompt = "Describe a payment processing API in one sentence:"

# Conservative approach (temp=0.1)
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.1,
)
# Output: "A payment processing API enables secure transaction handling between merchants and payment providers."
# Run it 10 times → same output every time

# Balanced approach (temp=0.8)
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.8,
)
# Output: "This API orchestrates the flow of payment data, ensuring transactions are processed reliably and securely."
# Run it 10 times → 10 slightly different phrasings

# Wild approach (temp=1.8)
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=1.8,
)
# Output: "Money moves through digital veins, authenticated and transformed by elegant code."
# Run it 10 times → some brilliant, some bizarre
```
Seed: The Time Machine for Randomness
And this is where it gets interesting: even with temperature = 0.8, you can get the exact same output on every run. Set the seed parameter, and the model replays the same "random" choices each time.
Why I Use Seeds Religiously Now
After the payment system bug (mentioned above), I started to use seeds everywhere. Here are some reasons why I find them incredibly useful:
- Debugging Production Issues: If a customer ever tells me "the AI gave me weird output," I can reproduce the exact same randomness with the seed value. No more "it works on my machine" scenarios.
- Regression Testing: We use fixed seed values to ensure that changes to our prompts do not inadvertently break functionality. We can easily tell if the output changes because of our prompts or randomness.
- A/B Testing Prompts: When A/B testing two prompts, I use the same seed for both, so any difference in output comes from the prompt change rather than from different random draws.
- Demo Consistency: In client demos, I often use seeds to prevent the demo from suddenly producing an embarrassing result. (Okay, this happened once).
How I Use Seeds in Production
```python
import openai
import hashlib

def analyze_transaction(transaction_id, description):
    # Generate a deterministic seed from the transaction ID.
    # This ensures the same transaction always gets the same analysis.
    seed = int(hashlib.md5(transaction_id.encode()).hexdigest()[:8], 16)
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Categorize this transaction and flag if suspicious."},
            {"role": "user", "content": description},
        ],
        temperature=0.3,  # Low but not zero for slight nuance
        seed=seed,        # Reproducible per transaction
    )
    return response.choices[0].message.content

# Same transaction ID → same seed → same analysis
result1 = analyze_transaction("TXN-12345", "Wire transfer $9,500")
result2 = analyze_transaction("TXN-12345", "Wire transfer $9,500")
assert result1 == result2  # Always passes

# Different transaction → different seed → different random path
result3 = analyze_transaction("TXN-67890", "Wire transfer $9,500")
# result3 might differ from result1 due to the different seed
```
The Temperature-Seed Dance
This is the interaction matrix I maintain on my desk:
| Temperature | Seed | What Happens | When I Use This |
|---|---|---|---|
| 0.0 | Not set | Fully deterministic | Production data extraction |
| 0.0 | Set | Still deterministic (seed ignored) | Same as above; the seed is redundant |
| 0.7 | Not set | Different every time | Exploratory prototyping |
| 0.7 | Set | Reproducible variation | Testing, demos, debugging |
The Gotcha That Cost Me 3 Hours
Seeds aren't magic portals to identical outputs. I learned this the hard way:
```python
# This worked perfectly in dev
response = call_gpt4(prompt, temperature=0.5, seed=42)

# Two weeks later, OpenAI updated GPT-4.
# Same code, same seed, DIFFERENT output.
```
Seeds only offer best-effort reproducibility, and only within:
- The same model version (gpt-4-0613 vs gpt-4-1106 will differ)
- The same infrastructure (OpenAI's API vs Azure OpenAI can differ)
- The same provider (OpenAI vs Anthropic vs local models definitely differ)
My solution: Log model versions alongside seeds in production.
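To make that logging concrete, here's a minimal sketch of the audit record I mean. The response fields (`model`, `system_fingerprint`, `id`) match what the chat completions API returns — `system_fingerprint` changes whenever OpenAI's backend configuration changes, which is exactly the signal you want. The helper function and stubbed response are mine.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm-audit")

def log_llm_call(response, seed, prompt_name):
    """Record everything needed to reproduce (or explain) this call later.
    `response` is the raw response dict from the chat completions API."""
    record = {
        "prompt_name": prompt_name,
        "seed": seed,
        "model": response.get("model"),  # exact snapshot, e.g. "gpt-4-1106-preview"
        "system_fingerprint": response.get("system_fingerprint"),  # backend config marker
        "request_id": response.get("id"),
    }
    logger.info(json.dumps(record))
    return record

# Example with a stubbed response (a real call returns the same fields)
fake_response = {
    "id": "chatcmpl-abc123",
    "model": "gpt-4-1106-preview",
    "system_fingerprint": "fp_44709d6f",
}
record = log_llm_call(fake_response, seed=42, prompt_name="fraud-check")
```

When a weird output comes in, I can diff the logged fingerprint against today's: if it changed, the backend changed, and my seed was never going to save me.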
My Production Playbook
After shipping 3+ LLM features, here are my tested patterns:
Pattern 1: The Compliance Extractor
```python
def extract_contract_terms(contract_pdf):
    """Zero tolerance for variation in legal documents"""
    return openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Extract key terms as JSON: parties, dates, amounts, obligations"},
            {"role": "user", "content": contract_pdf},
        ],
        temperature=0.0,  # Absolute consistency required
        response_format={"type": "json_object"},
    )
```
Pattern 2: The Regression Test
```python
import pytest

def test_fraud_detection_prompt():
    """Ensure prompt changes don't break detection"""
    test_case = "ATM withdrawal $500, then $500, then $500 in 10 minutes"
    response = detect_fraud(
        transaction_pattern=test_case,
        temperature=0.2,
        seed=12345,  # Fixed for test reproducibility
    )
    # If this fails, we changed something important
    assert response["risk_level"] == "high"
    assert "velocity" in response["reasoning"].lower()
```
The Third Dial: Top-P (For the Curious)
Once you understand temperature, there's a second dial: top_p (nucleus sampling). I don't use it often, but here's when it comes into play. Temperature reshapes the probability distribution; top_p truncates it, keeping only the most likely tokens whose probabilities sum to the threshold.
```python
# Temperature alone
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Name a color:"}],
    temperature=0.8,  # Considers all colors, weighted by probability
)
# Might get: blue, red, green, azure, crimson, teal...

# Top-P alone
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Name a color:"}],
    top_p=0.1,  # Only the top 10% of probability mass
)
# Likely gets: blue, red, green (ignores rare colors)
```
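Under the hood, the nucleus cutoff is simple: rank the tokens, keep the smallest set whose cumulative probability reaches top_p, and sample only from that set. A toy sketch (the color probabilities are invented for illustration):

```python
def nucleus_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append(token)
        cumulative += p
        if cumulative >= top_p:
            break  # the nucleus is complete; everything rarer is discarded
    return kept

color_probs = {"blue": 0.40, "red": 0.30, "green": 0.15, "azure": 0.10, "chartreuse": 0.05}

print(nucleus_filter(color_probs, 0.1))  # ['blue'] — only the single most likely color
print(nucleus_filter(color_probs, 0.9))  # ['blue', 'red', 'green', 'azure'] — the rare tail is cut
```

Notice that the cutoff adapts: when the model is confident, the nucleus is tiny; when it's unsure, more candidates survive. That's the appeal over a fixed top-k.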
The Rule: Use temperature first. Use top_p if you need to tweak the "long tail" of possibilities. Tuning both temperature and top_p at once is like using both the gas and the brakes at the same time. It can be done, but it usually isn't necessary.
Mistakes I've Made (So You Don't Have To)
Mistake #1: Believing "More Temperature = Better Creativity"
```python
# My first attempt at a product description generator
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Describe our payment API"}],
    temperature=2.0,  # "Maximum creativity!"
)
# Output: "Behold the quantum entanglement of monetary consciousness!"
# Marketing team: "...what?"
```
The lesson: Temperature > 1.5 often produces nonsense. The sweet spot for creativity is 0.9-1.3.
Mistake #2: Forgetting Seeds Are Ignored at temp=0
```python
# I believed this would let me "version" deterministic results
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
    seed=42,  # Totally ignored!
)
```
The lesson: At temp=0, results are already deterministic, so the seed is unnecessary.
Mistake #3: Blind Faith in Seeds
```python
# This went live to prod
def process_claim(claim_id, claim_text):
    return call_llm(claim_text, temperature=0.5, seed=42)

# Three months later: "Why are all claims receiving the same analysis?"
# Me: "Uh. Same seed for all calls. Whoops."
```
The lesson: A fixed seed means identical inputs always follow the identical random path. Derive the seed from each input (as in the transaction example earlier) if you want per-item reproducibility without every call sharing the same randomness.
The Cheat Sheet I Wish I Had
After shipping several LLM features, my decision tree boils down to this:
- Need the same output every time (extraction, compliance, SQL)? → temperature=0.0
- Need natural-sounding variety (chat, email, docs)? → temperature≈0.7, plus a seed for debuggability
- Need raw ideas to filter by hand (brainstorming, naming)? → temperature 0.9-1.3
- Need reproducible tests or demos? → fix both temperature and seed, and log the model version
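As code, my defaults end up as a small preset table. (The values are lifted from the sections above; the helper and task names are my own shorthand.)

```python
def sampling_preset(task):
    """My go-to temperature/seed defaults per task type."""
    presets = {
        "extraction":   {"temperature": 0.0},                 # compliance docs, parsing, SQL
        "conversation": {"temperature": 0.7},                 # chatbots, email, documentation
        "brainstorm":   {"temperature": 1.2},                 # generate wildly, then filter by hand
        "regression":   {"temperature": 0.2, "seed": 12345},  # fixed seed for reproducible tests
    }
    return presets[task]

print(sampling_preset("extraction"))  # {'temperature': 0.0}
```

Splatting the preset into the API call (`**sampling_preset("regression")`) keeps the decision in one place instead of scattered across every call site.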
Key Takeaways
1. Temperature=0 is underrated. Most production features require consistency, not creativity.
2. Seeds are for debugging, not decoration. Log them with your outputs.
3. Test your assumptions. Seed reproducibility differs by provider and model version.
4. Start conservative. It's easier to raise the temperature than to debug hallucinations.
These two variables transformed my LLM output from unpredictable experiments to trustworthy product features. Master them, and you'll deploy AI with confidence.
How do you manage temperature? I'm particularly interested in hearing about seed logging in production environments. Share your methods/approaches in the comments.
