Why Your LLM Gives Different Answers Every Time
Last month, I was debugging a problem with our invoice processing system, which was returning different amounts for the same PDF file on consecutive calls. The root cause? I had failed to adjust the temperature parameter from its default value. What I learned is that temperature and seed are not just abstract terms. In shipping LLM-based features, they are essential parameters. Let me walk you through what I have learned so far.
Temperature: Your Creativity Dial
Temperature is a slider that ranges from 'boring but reliable' to 'creative but unpredictable.' It is a number (0-2) that determines how adventurous the model will be in selecting its next word.
The Mental Model
For every token an LLM outputs, it's like rolling weighted dice and selecting the outcome from thousands of possibilities. The temperature affects the weights:
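To make the dice metaphor concrete, here's a toy sketch of the math: temperature divides each token's raw score before the softmax turns scores into probabilities. (Real models do this over tens of thousands of tokens; the three-word vocabulary and scores here are invented for illustration.)

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw token scores into sampling probabilities.
    Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [score / temperature for score in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token scores for three candidate words
logits = [2.0, 1.0, 0.1]

cold = softmax_with_temperature(logits, 0.1)  # near-greedy: top token dominates
warm = softmax_with_temperature(logits, 1.0)  # the unmodified distribution
hot = softmax_with_temperature(logits, 2.0)   # flattened: underdogs get a real chance

print(cold[0])  # ≈ 0.99995 — at low temperature, "the accountant" almost always wins
print(hot[0])   # ≈ 0.50 — at high temperature, "the artist" rolls the dice for real
```

At temperature 0 the division blows up, which is why implementations special-case it as pure greedy decoding: always take the top token.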
Temperature = 0.0 (The Accountant)
- Always chooses the statistically most likely outcome
- Same input → same output (predictable and deterministic)
- I use this for: parsing documents, extracting data, crafting SQL queries
Real example: Our compliance report generator uses temp = 0.0 because regulators hate "creative" interpretations
Temperature = 0.7 (The Consultant)
- Trades reliability for some randomness in output
- Output reads as natural and varied while staying on-topic
- I use this for: customer support chatbots, email composition, documentation
Real example: Our internal Q&A chatbot uses temp = 0.7 to sound less robotic while being informative
Temperature = 1.5 (The Artist)
- Loosely follows rules to allow unexpected word combinations
- Can result in pure genius or pure garbage
- I use this for: brainstorming sessions, marketing text generation, naming new products
Real example: We used temp = 1.5 to generate 50 product name ideas and then filtered out the bad ones
Seeing Temperature in Action
Here’s an actual experiment I ran while implementing a feature description generator:
```python
import openai

prompt = "Describe a payment processing API in one sentence:"

# Conservative approach (temp=0.1)
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.1,
)
# Output: "A payment processing API enables secure transaction handling between merchants and payment providers."
# Run it 10 times → same output every time

# Balanced approach (temp=0.8)
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.8,
)
# Output: "This API orchestrates the flow of payment data, ensuring transactions are processed reliably and securely."
# Run it 10 times → 10 slightly different phrasings

# Wild approach (temp=1.8)
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=1.8,
)
# Output: "Money moves through digital veins, authenticated and transformed by elegant code."
# Run it 10 times → some brilliant, some bizarre
```
Seed: The Time Machine for Randomness
And this is where it gets interesting: even with temperature = 0.8, you can get the exact same output on every run. Set the seed parameter, and the model replays the same "random" choices each time.
Why I Use Seeds Religiously Now
After the payment system bug (mentioned above), I started to use seeds everywhere. Here are some reasons why I find them incredibly useful:
- Debugging Production Issues: If a customer ever tells me "the AI gave me weird output," I can reproduce the exact same randomness with the seed value. No more "it works on my machine" scenarios.
- Regression Testing: We use fixed seed values to ensure that changes to our prompts do not inadvertently break functionality. We can easily tell if the output changes because of our prompts or randomness.
- A/B Testing Prompts: When A/B testing two prompts, I use the same seed for both, so any difference in output comes from the prompt change rather than from different random draws.
- Demo Consistency: In client demos, I often use seeds to prevent the demo from suddenly producing an embarrassing result. (Okay, this happened once).
How I Use Seeds in Production
```python
import openai
import hashlib

def analyze_transaction(transaction_id, description):
    # Generate a deterministic seed from the transaction ID.
    # This ensures the same transaction always gets the same analysis.
    seed = int(hashlib.md5(transaction_id.encode()).hexdigest()[:8], 16)
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Categorize this transaction and flag if suspicious."},
            {"role": "user", "content": description},
        ],
        temperature=0.3,  # Low but not zero for slight nuance
        seed=seed,        # Reproducible per transaction
    )
    return response.choices[0].message.content

# Same transaction ID → same seed → same analysis
result1 = analyze_transaction("TXN-12345", "Wire transfer $9,500")
result2 = analyze_transaction("TXN-12345", "Wire transfer $9,500")
assert result1 == result2  # Always passes

# Different transaction → different seed → different random path
result3 = analyze_transaction("TXN-67890", "Wire transfer $9,500")
# result3 might differ from result1 due to the different seed
```
The Temperature-Seed Dance
This is the interaction matrix I maintain on my desk:
| Temperature | Seed | What Happens | When I Use This |
|---|---|---|---|
| 0.0 | Not set | Fully deterministic | Production data extraction |
| 0.0 | Set | Still deterministic (seed ignored) | Same as above; the seed is redundant |
| 0.7 | Not set | Different every time | Exploratory prototyping |
| 0.7 | Set | Reproducible variation | Testing, demos, debugging |
The Gotcha That Cost Me 3 Hours
Seeds aren't magic portals to identical outputs. I learned this the hard way:
```python
# This worked perfectly in dev
response = call_gpt4(prompt, temperature=0.5, seed=42)

# Two weeks later, OpenAI updated GPT-4.
# Same code, same seed, DIFFERENT output.
```
Seeds only offer best-effort reproducibility, and only within:
- The same model version (gpt-4-0613 vs gpt-4-1106 will differ)
- The same infrastructure (OpenAI's API vs Azure OpenAI can differ)
- The same provider (OpenAI vs Anthropic vs local models definitely differ)
My solution: Log model versions alongside seeds in production.
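To make that logging concrete, here's a minimal sketch of the audit record I mean. The response fields (`model`, `system_fingerprint`, `id`) match what the chat completions API returns — `system_fingerprint` changes whenever OpenAI's backend configuration changes, which is exactly the signal you want. The helper function and stubbed response are mine.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm-audit")

def log_llm_call(response, seed, prompt_name):
    """Record everything needed to reproduce (or explain) this call later.
    `response` is the raw response dict from the chat completions API."""
    record = {
        "prompt_name": prompt_name,
        "seed": seed,
        "model": response.get("model"),  # exact snapshot, e.g. "gpt-4-1106-preview"
        "system_fingerprint": response.get("system_fingerprint"),  # backend config marker
        "request_id": response.get("id"),
    }
    logger.info(json.dumps(record))
    return record

# Example with a stubbed response (a real call returns the same fields)
fake_response = {
    "id": "chatcmpl-abc123",
    "model": "gpt-4-1106-preview",
    "system_fingerprint": "fp_44709d6f",
}
record = log_llm_call(fake_response, seed=42, prompt_name="fraud-check")
```

When a weird output comes in, I can diff the logged fingerprint against today's: if it changed, the backend changed, and my seed was never going to save me.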
My Production Playbook
After shipping 3+ LLM features, here are my tested patterns:
Pattern 1: The Compliance Extractor
```python
def extract_contract_terms(contract_pdf):
    """Zero tolerance for variation in legal documents"""
    return openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Extract key terms as JSON: parties, dates, amounts, obligations"},
            {"role": "user", "content": contract_pdf},
        ],
        temperature=0.0,  # Absolute consistency required
        response_format={"type": "json_object"},
    )
```
Pattern 2: The Regression Test
```python
import pytest

def test_fraud_detection_prompt():
    """Ensure prompt changes don't break detection"""
    test_case = "ATM withdrawal $500, then $500, then $500 in 10 minutes"
    response = detect_fraud(
        transaction_pattern=test_case,
        temperature=0.2,
        seed=12345,  # Fixed for test reproducibility
    )
    # If this fails, we changed something important
    assert response["risk_level"] == "high"
    assert "velocity" in response["reasoning"].lower()
```
The Third Dial: Top-P (For the Curious)
Once you understand temperature, there's a second dial: top_p (nucleus sampling). I don't use it often, but here's when it comes into play. Temperature reshapes the probability distribution; top_p truncates it, keeping only the most likely tokens whose probabilities sum to the threshold.
```python
# Temperature alone
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Name a color:"}],
    temperature=0.8,  # Considers all colors, weighted by probability
)
# Might get: blue, red, green, azure, crimson, teal...

# Top-P alone
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Name a color:"}],
    top_p=0.1,  # Only the top 10% of probability mass
)
# Likely gets: blue, red, green (ignores rare colors)
```
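Under the hood, the nucleus cutoff is simple: rank the tokens, keep the smallest set whose cumulative probability reaches top_p, and sample only from that set. A toy sketch (the color probabilities are invented for illustration):

```python
def nucleus_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append(token)
        cumulative += p
        if cumulative >= top_p:
            break  # the nucleus is complete; everything rarer is discarded
    return kept

color_probs = {"blue": 0.40, "red": 0.30, "green": 0.15, "azure": 0.10, "chartreuse": 0.05}

print(nucleus_filter(color_probs, 0.1))  # ['blue'] — only the single most likely color
print(nucleus_filter(color_probs, 0.9))  # ['blue', 'red', 'green', 'azure'] — the rare tail is cut
```

Notice that the cutoff adapts: when the model is confident, the nucleus is tiny; when it's unsure, more candidates survive. That's the appeal over a fixed top-k.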
The Rule: Use temperature first. Use top_p if you need to tweak the "long tail" of possibilities. Tuning both temperature and top_p at once is like using both the gas and the brakes at the same time. It can be done, but it usually isn't necessary.
Mistakes I've Made (So You Don't Have To)
Mistake #1: Believing "More Temperature = Better Creativity"
```python
# My first attempt at a product description generator
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Describe our payment API"}],
    temperature=2.0,  # "Maximum creativity!"
)
# Output: "Behold the quantum entanglement of monetary consciousness!"
# Marketing team: "...what?"
```
The lesson: Temperature > 1.5 often produces nonsense. The sweet spot for creativity is 0.9-1.3.
Mistake #2: Forgetting Seeds Are Ignored at temp=0
```python
# I believed this would let me "version" deterministic results
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
    seed=42,  # Totally ignored!
)
```
The lesson: At temp=0, results are already deterministic, so the seed is unnecessary.
Mistake #3: Blind Faith in Seeds
```python
# This went live to prod
def process_claim(claim_id, claim_text):
    return call_llm(claim_text, temperature=0.5, seed=42)

# Three months later: "Why are all claims receiving the same analysis?"
# Me: "Uh. Same seed for all calls. Whoops."
```
The lesson: A fixed seed means identical inputs always follow the identical random path. Derive the seed from each input (as in the transaction example earlier) if you want per-item reproducibility without every call sharing the same randomness.
The Cheat Sheet I Wish I Had
After shipping several LLM features, my decision tree boils down to this:
- Need the same output every time (extraction, compliance, SQL)? → temperature=0.0
- Need natural-sounding variety (chat, email, docs)? → temperature≈0.7, plus a seed for debuggability
- Need raw ideas to filter by hand (brainstorming, naming)? → temperature 0.9-1.3
- Need reproducible tests or demos? → fix both temperature and seed, and log the model version
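As code, my defaults end up as a small preset table. (The values are lifted from the sections above; the helper and task names are my own shorthand.)

```python
def sampling_preset(task):
    """My go-to temperature/seed defaults per task type."""
    presets = {
        "extraction":   {"temperature": 0.0},                 # compliance docs, parsing, SQL
        "conversation": {"temperature": 0.7},                 # chatbots, email, documentation
        "brainstorm":   {"temperature": 1.2},                 # generate wildly, then filter by hand
        "regression":   {"temperature": 0.2, "seed": 12345},  # fixed seed for reproducible tests
    }
    return presets[task]

print(sampling_preset("extraction"))  # {'temperature': 0.0}
```

Splatting the preset into the API call (`**sampling_preset("regression")`) keeps the decision in one place instead of scattered across every call site.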
Key Takeaways
1. Temperature=0 is underrated. Most production features require consistency, not creativity.
2. Seeds are for debugging, not decoration. Log them with your outputs.
3. Test your assumptions. Seed reproducibility differs by provider and model version.
4. Start conservative. It's easier to raise the temperature than to debug hallucinations.
These two variables transformed my LLM output from unpredictable experiments to trustworthy product features. Master them, and you'll deploy AI with confidence.
How do you manage temperature? I'm particularly interested in hearing about seed logging in production environments. Share your methods/approaches in the comments.
