I Gave 5 Frontier Models the Same Email Thread. Here's What They Missed.

Written by samkayze | Published 2026/03/30
Tech Story Tags: email-parsing-api | ai | llm-context-management | llm-benchmark | ai-engineering | enterprise-ai | natural-language-processing | sales-intelligence

TL;DR: Five frontier models were given the same 31-message email thread and asked what was decided, who owns what, and what changed. None of them got all of it right.

We gave five frontier models the same 31-message email thread, the same prompt, and the same job: tell us what was decided, who owns what, and what changed. None of them got all of it right.

One pulled a pricing decision from a forwarded internal discussion that had been reversed six messages later. One flattened two reply branches into a single story and quietly invented consensus where there was none. And one attributed a task to someone who never said "I'll handle it" because the sentence only appeared in quoted history from an earlier reply. By message #21, that same sentence had been duplicated 12 times across the thread by email clients quoting the full history on every reply.

Across all five models, 3 out of 5 listed a dropped integration as an agreed item. 4 out of 5 misidentified decision-makers. Every model confused "person who talks a lot" with "person who has authority."

GPT-5.4, Claude Sonnet 4.6, Gemini 3 Pro, Grok 4.20 Beta, and Mistral Large 3 are all capable of answering these questions correctly given the right input. Raw email is not the right input, and the structural reasons why are specific enough to document, which is what this post does.

The Test Setup

The thread is a real B2B SaaS deal negotiation (anonymized), spanning 3 weeks with 8 participants across both organizations. It includes a subject line change mid-thread, one participant CC'd halfway through who immediately starts giving opinions without the earlier context, a forwarded internal pricing discussion that accidentally went to the prospect, and three sub-threads that branched off when people replied to different messages in the chain.

When pulled from the Gmail API, this thread produced approximately 47,000 tokens of raw content. After deduplication and quoted text stripping, the actual unique content was about 11,000 tokens. That's a 4.3x bloat factor (the ratio of raw tokens to unique tokens, driven almost entirely by email clients quoting the full history on every reply), which is typical for threads of this length.
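The stripping step can be approximated in a few lines of stdlib Python. This is a heuristic sketch, not the preprocessing used in the test: it drops ">"-prefixed lines and cuts at a Gmail-style attribution line, and real clients vary far more than this.

```python
import re

def strip_quoted(raw: str) -> str:
    """Remove the quoted-history portion of a single email body.

    Heuristics only: drops '>'-prefixed lines and everything after a
    Gmail-style "On <date>, <name> wrote:" attribution line.
    """
    kept = []
    for line in raw.splitlines():
        if line.lstrip().startswith(">"):
            continue  # classic quoted-reply prefix
        if re.match(r"On .+ wrote:\s*$", line.strip()):
            break  # everything below this line is quoted history
        kept.append(line)
    return "\n".join(kept).strip()

msg = """Sounds good, let's lock the POC scope.

On Tue, Mar 10, 2026 at 9:14 AM James Chen wrote:
> The client is open to the custom integration
> if we can deliver it within the POC timeline.
"""

unique = strip_quoted(msg)
bloat = len(msg.split()) / len(unique.split())  # crude word-level bloat factor
```

Run per message before re-joining the thread and the same sentence stops appearing a dozen times in the context window.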

We fed each model the full raw content. Every model can handle the context length.

The question is what they do with it.

Prompt. Identical across all five models, no tools, no web access, temperature 0 where the API allows:

Read this email thread and return: (1) current decisions, (2) open action items with owners, (3) deadlines, (4) what changed during the thread, (5) any risks or contradictions. Use the JSON schema provided.
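The post doesn't reproduce the schema itself. For concreteness, a minimal version of what such a schema might look like (every field name here is my assumption, not necessarily the one used in the test):

```python
# Hypothetical reconstruction of the output schema referenced in the prompt.
SCHEMA = {
    "type": "object",
    "properties": {
        "decisions": {"type": "array", "items": {"type": "string"}},
        "action_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "task": {"type": "string"},
                    "owner": {"type": "string"},
                    "source_message": {"type": "integer"},
                },
                "required": ["task", "owner"],
            },
        },
        "deadlines": {"type": "array", "items": {"type": "string"}},
        "changes": {"type": "array", "items": {"type": "string"}},
        "risks": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["decisions", "action_items", "deadlines", "changes", "risks"],
}
```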

Models. GPT-5.4 (OpenAI API), Claude Sonnet 4.6 (Anthropic API), Gemini 3 Pro (Google Vertex AI), Grok 4.20 Beta (xAI API), Mistral Large 3 (La Plateforme). All tested March 2026. One model per lab.

Disclosure. I work at iGPT, which sells an email preprocessing API. The structured context in the second half of this test was generated by our product. The raw-input failures documented in the first half stand on their own regardless of how you choose to fix them.

The Most Revealing Failure from Each Model

The full scoring tables are at the end. What follows is the single most instructive miss from each model, chosen because it maps to a specific structural property of email that gets destroyed when you dump raw content into a context window.

GPT-5.4: Stale Forwarded Decision

In message #7, an internal pricing discussion was accidentally forwarded to the prospect. That forwarded chain contained an approved 15% discount from three weeks earlier. By message #13, the vendor's finance team had revised the discount to 12%. By message #19, the prospect had explicitly asked for "current" pricing.

GPT-5.4 reported the 15% discount as the agreed pricing. In the flattened text, the forwarded content sits inline with no structural marker distinguishing it from active conversation, and the older figure is stated with more confidence ("approved at 15%" versus "we're revising to 12%") which the model interprets as higher certainty. GPT-5.4 also performed worst on stale-history resistance overall, pulling from forwarded content on two other questions as well.

Failure class: Forwarded-chain staleness. Forwarded content appears inline in raw email with no structural boundary. The model treats old history as current conversation.
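The missing structural boundary can be restored with marker detection before the thread reaches the model. A sketch, assuming a few common client conventions (the marker list is illustrative, not exhaustive):

```python
import re

# Common client markers that open an inline forwarded block. Outlook,
# Apple Mail, and localized clients each use their own variants.
FORWARD_MARKERS = [
    r"-{2,}\s*Forwarded message\s*-{2,}",  # Gmail
    r"Begin forwarded message:",           # Apple Mail
    r"-{2,}\s*Original Message\s*-{2,}",   # Outlook
]
_FORWARD_RE = re.compile("|".join(FORWARD_MARKERS), re.IGNORECASE)

def split_forwarded(body: str) -> tuple[str, str]:
    """Split a message body into (active text, forwarded history).

    Downstream, the forwarded part can be tagged so the model sees it
    as historical context rather than live conversation.
    """
    m = _FORWARD_RE.search(body)
    if not m:
        return body, ""
    return body[:m.start()].rstrip(), body[m.start():]

body = "FYI, see below.\n\n---------- Forwarded message ---------\nApproved at 15%."
active, forwarded = split_forwarded(body)
```

Once the forwarded block is separated, the 15%-vs-12% conflict becomes a labeled old-vs-current comparison instead of two equally weighted statements.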

Claude Sonnet 4.6: Wrong Owner on Turn 18

In message #18, Priya (the vendor's Solution Architect) wrote "I'll send the POC scope document by Friday." Claude Sonnet 4.6 attributed this commitment to James (the Account Executive), who had written the most messages in the thread.

In flattened email text, the pronoun "I" appears dozens of times and refers to different people. Once the From: headers are buried in threading noise, the model uses name frequency as a proxy for speaker identity, and the person who talks the most gets credited with commitments they never made. That said, Sonnet 4.6 was the only model that produced zero hallucinated commitments from quoted text, and it was one of only two that flagged the CFO's silence as a risk signal.

Failure class: Multi-party pronoun ambiguity. "I'll handle it" gets attached to the wrong speaker because the model can't reliably map "I" across flattened turns without per-message participant metadata.
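The fix is to carry per-message sender metadata into the flattened text instead of stripping it. A minimal sketch (the names are the ones from the thread; the label format is my own):

```python
from dataclasses import dataclass

@dataclass
class Message:
    sender: str
    role: str
    body: str

def render_with_speakers(messages: list[Message]) -> str:
    """Re-flatten a thread with explicit per-message speaker labels.

    With every turn tagged, "I" can only bind to the labeled sender of
    that turn, instead of being guessed from name frequency.
    """
    return "\n\n".join(
        f"[msg {i} | from: {m.sender} ({m.role})]\n{m.body}"
        for i, m in enumerate(messages, start=1)
    )

thread = [
    Message("James Chen", "Account Executive", "Great call today, team."),
    Message("Priya", "Solution Architect",
            "I'll send the POC scope document by Friday."),
]
flat = render_with_speakers(thread)
```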

Gemini 3 Pro: Branch Merge Error

Between messages #14 and #20, the thread forked. David Kim (prospect VP Eng) replied to message #14 agreeing to a 30-day POC. Meanwhile, Lisa Park (Procurement, CC'd at message #12) replied to message #11 raising concerns about security certifications and suggesting the POC should wait until a compliance review was complete.

Gemini 3 Pro collapsed both branches into one narrative: "The team agreed to a 30-day POC pending compliance review." David agreed without conditions. Lisa wanted to delay. These are contradictory positions from different branches, and the model merged them into invented consensus. This was Gemini 3 Pro's worst dimension overall: it scored lowest on branch awareness across all five models, and it was the only one to also hallucinate a fifth action item from the dropped integration discussion.

Failure class: Thread fork blindness. The model collapses parallel branches into one linear story because flattened text can't represent non-linear conversation topology.
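The branch topology is recoverable from headers that flattening throws away. A sketch of rebuilding the conversation graph from Message-ID / In-Reply-To pairs (the IDs below are invented stand-ins for the fork described above):

```python
from collections import defaultdict

def build_reply_tree(headers: list[dict]) -> dict:
    """Group messages into branches using Message-ID / In-Reply-To headers.

    Each key is a parent Message-ID (None for thread roots); each value
    is the list of direct replies. Two children under the same parent
    are parallel branches and should be summarized separately.
    """
    children = defaultdict(list)
    ids = {h["Message-ID"] for h in headers}
    for h in headers:
        parent = h.get("In-Reply-To")
        # A dangling In-Reply-To (parent outside the thread) becomes a root.
        children[parent if parent in ids else None].append(h["Message-ID"])
    return dict(children)

headers = [
    {"Message-ID": "<m11>"},
    {"Message-ID": "<m14>", "In-Reply-To": "<m11>"},
    {"Message-ID": "<david>", "In-Reply-To": "<m14>"},  # agrees to POC
    {"Message-ID": "<lisa>", "In-Reply-To": "<m11>"},   # wants to delay
]
tree = build_reply_tree(headers)
# <m11> has two children, so David's and Lisa's replies are parallel
# branches, not one consensus.
```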

Grok 4.20 Beta: Overconfident Contradiction Summary

Grok 4.20 detected more risk signals than any other model. It flagged the CFO's silence, the competing vendor mention, the timeline pressure, and the accidental pricing leak. It was the only model to catch all four. But it described the situation as "increasingly adversarial" and rated the deal as "high risk, likely to stall," which over-indexes on the negatives without weighing the positive momentum.

More interesting was its handling of a cross-thread reference. The prospect's VP of Engineering mentioned in message #22 that "we're also looking at [Competitor]'s approach." Grok treated this as a direct comparison and began contrasting the vendor's capabilities against the competitor, hallucinating specifics about the competitor's offering that weren't in the thread. It filled the gap with plausible-sounding details because the thread referenced but didn't describe the competitor's product.

Failure class: Cross-thread relationship loss. A reference to external context triggers confabulation because the model has no access to the referenced material and fills the gap with fabrication.

Mistral Large 3: Quoted-Text Contamination

In message #9, James wrote to the internal team: "The client is open to the custom integration if we can deliver it within the POC timeline." By message #15, this had been discussed and quietly dropped.

In message #21, David Kim replied to an earlier message. His email client included the full quoted history below his reply, which meant James's message #9 about the integration appeared again as quoted text inside David's message. Mistral Large 3 treated this quoted appearance as a reaffirmation, listing the custom integration as an "active agreed item" and citing message #21 as the source, even though David's actual reply was about scheduling. This was Mistral's most distinctive failure: it was the only model to explicitly cite a quoted-text source as evidence for an active agreement.

Failure class: Quoted-text contamination. The model can't distinguish original statements from quoted copies. In a 20-message thread, every reply includes the full history below it, so the same sentence appears a dozen times.

What All Five Models Struggled With

The model-specific errors differ. The underlying failures don't.

Decision through silence. A custom integration was proposed, discussed for four messages (about 800 words), and then quietly dropped when the conversation moved to pricing. Never explicitly rejected, just abandoned. Three out of five models listed it as agreed. The discussion produces high attention weight because it's lengthy and detailed, and the models default to treating anything discussed at length as active unless they find explicit closure. Absence of closure is not a signal they can detect from raw text.

CFO silence as a risk signal. Rachel Torres was directly asked about pricing in messages #16 and #23. She didn't respond to either. Only Grok 4.20 and Sonnet 4.6 flagged this. The other three couldn't detect it, because in raw email content, silence is invisible. There is no message that says "I am choosing not to respond." Identifying meaningful absence requires understanding participant structure, not just content.

Authority versus verbosity. James the AE wrote the most messages. Four out of five models listed him as a decision-maker. Rachel the CFO wrote one message buried in a forwarded chain. Most models either missed her or couldn't determine her role. Participation frequency is a terrible proxy for organizational authority, and it's the best heuristic available from unstructured text.

Why Email Breaks Frontier Models

Every failure above maps to a structural property of email that gets destroyed when you flatten a thread into text. These aren't edge cases. They're the default state of any business thread longer than about 10 messages.

Quoted reply duplication. Every reply includes the full quoted history below it. The 20th message contains 19 copies of the first. The thread inflates from ~11,000 unique tokens to ~47,000 raw tokens, biasing attention toward earlier messages that appear more frequently.

Forwarded chain collapse. Forwarded content appears as one continuous block with no structural separation from the active thread. A statement from an internal discussion three weeks ago gets treated as current negotiation.

Participant identity loss. Strip From: headers and "I" refers to eight different people across 47,000 tokens of text. Attribution becomes a frequency-based guessing game.

Non-linear conversation topology. Three people replied to different messages, creating parallel sub-conversations. Linear processing treats these as one flow. The In-Reply-To headers encoding the actual conversation graph are exactly what flattening strips out.

Invisible absence. The most important signals were things that didn't happen: the integration dropped without rejection, the CFO who didn't respond. Detecting meaningful absence requires knowing who was asked what and whether they answered. Raw text can only process what's present.
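Given participant structure, silence becomes computable. A coarse sketch, assuming an upstream step has already resolved who each question was addressed to (the hard part in practice):

```python
def unanswered_questions(messages: list[dict]) -> list[tuple[int, str]]:
    """Find direct questions whose addressee never posted afterwards.

    Each message is {"n": int, "from": str, "asks": str | None}, where
    "asks" names the person directly questioned. Checking for *any*
    later message from the target is a crude proxy for an answer.
    """
    missed = []
    for msg in messages:
        target = msg.get("asks")
        if not target:
            continue
        replied = any(
            later["from"] == target
            for later in messages
            if later["n"] > msg["n"]
        )
        if not replied:
            missed.append((msg["n"], target))
    return missed

# The CFO-silence case: Rachel is asked twice and never posts.
thread = [
    {"n": 16, "from": "James Chen", "asks": "Rachel Torres"},
    {"n": 18, "from": "Priya", "asks": None},
    {"n": 23, "from": "David Kim", "asks": "Rachel Torres"},
]
missing = unanswered_questions(thread)
```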

Structure Beats Model Choice

We ran the same questions through iGPT's Context Engine, which performs thread reconstruction, deduplication, participant attribution, and temporal ordering before the content reaches the model. The output includes per-message metadata on who said what, when, who they were replying to, and what changed between messages.

We used the same five models, the same questions, and a different input structure.

| Metric | Raw Email (avg across 5 models) | Structured Context (avg) | Delta |
|---|---|---|---|
| Decision accuracy | ~42% | ~91% | +49 pts |
| Owner attribution | ~48% | ~94% | +46 pts |
| Deadline extraction | ~56% | ~89% | +33 pts |
| Stale-history resistance | ~35% | ~88% | +53 pts |
| Branch awareness | ~30% | ~85% | +55 pts |
| Contradiction detection | ~38% | ~82% | +44 pts |

The composite accuracy improvement averaged 29 percentage points across all five models. But look at the structural metrics: stale-history resistance jumped 53 points. Branch awareness jumped 55. Those are the failures caused by how email is formatted, not by what people write in it.

The spread between models on raw input was about 8 percentage points, best to worst. The spread between raw input and structured input on the same model was 29 points. The preprocessing gap is more than 3x the model gap.

Changing the model moves accuracy a few points. Changing the input moves it by dozens.

You can build this preprocessing layer yourself: thread reconstruction, quoted text detection, signature stripping, MIME parsing, participant resolution, conversation topology mapping. There are open-source libraries that handle parts of this (email-reply-parser, flanker, mailparser). Budget 6 to 12 months if you want the full stack reliable across the range of email clients people actually use.

Or you can use an API that handles it end-to-end. iGPT's Context Engine does this in a single endpoint: raw email goes in, clean thread structure comes out with who said what, when, and what actually changed. That's what we used as the "structured context" input in the test above.

```shell
pip install igptai
```

```python
from igptai import IGPT

igpt = IGPT(api_key="IGPT_API_KEY", user="user_123")

res = igpt.recall.ask(
    input="What decisions were made, who owns what, and what changed?",
    quality="cef-1-reasoning",
    output_format="json"
)
# → structured JSON: decisions, action items with owners,
#   deadlines, risk signals, source citations per claim
```

If you're processing real deal threads through any third-party layer, the compliance question matters. iGPT offers zero data retention for inference and is working toward SOC 2 and GDPR alignment, but whatever you use, verify it meets your security requirements before routing live email through it.

The principle holds regardless of implementation. If you're spending time evaluating which model is "better at email," you're optimizing the wrong variable.

Full Scoring Tables

Decision-Makers

Ground truth: David Kim (VP Eng, technical decision-maker), Rachel Torres (CFO, budget authority, one message in forwarded chain).

| Model | Identified Decision-Makers | Assessment |
|---|---|---|
| GPT-5.4 | David Kim, James Chen, Rachel Torres | Partial. James is the AE, not a decision-maker. |
| Claude Sonnet 4.6 | David Kim, Rachel Torres, Lisa Park | Partial. Lisa was CC'd for compliance review only. |
| Gemini 3 Pro | David Kim, James Chen | Partial. Same James error. Missed Rachel entirely. |
| Grok 4.20 Beta | David Kim, Rachel Torres | Closest. Missed that Rachel's authority is specifically budget. |
| Mistral Large 3 | David Kim, Rachel Torres, James Chen | Partial. Same James misidentification. |

Agreements

Ground truth: (1) 30-day POC with scope doc. (2) Enterprise tier at 12% discount (revised from 15%), pending CFO approval. (3) Dedicated solutions engineer. NOT agreed: custom integration (proposed, discussed, dropped).

| Model | Agreements Found | Critical Error |
|---|---|---|
| GPT-5.4 | POC, 15% discount, custom integration | Wrong discount (stale). Listed dropped integration. |
| Claude Sonnet 4.6 | POC, 12% discount, solutions engineer | Missed CFO approval condition. |
| Gemini 3 Pro | POC pending compliance, 15% discount, integration, solutions engineer | Merged branches. Wrong discount. Listed dropped integration. |
| Grok 4.20 Beta | POC, 12% discount, solutions engineer | Correct items. Over-qualified with risk caveats. |
| Mistral Large 3 | POC, 15% discount, custom integration | Wrong discount. Listed dropped integration citing quoted text. |

Action Items + Attribution

Ground truth: (1) Priya: POC scope doc by Friday. (2) David Kim: provision staging env. (3) James: schedule tech deep-dive. (4) Rachel Torres: final pricing approval (outstanding).

| Model | Items Extracted | Attribution Errors |
|---|---|---|
| GPT-5.4 | All four items | Attributed POC scope doc to James instead of Priya. |
| Claude Sonnet 4.6 | POC scope, staging env, tech deep-dive | Missed Rachel's outstanding approval. Attributed POC doc to James. |
| Gemini 3 Pro | Four items + invented "follow up on integration" | Hallucinated fifth item from dropped discussion. |
| Grok 4.20 Beta | All four items | Correct on 3 of 4. Swapped David/James on staging env. |
| Mistral Large 3 | POC scope, tech deep-dive, pricing approval | Missed staging env. Correct attribution on rest. |

Risk Assessment

Ground truth signals: (1) CFO silence on two direct pricing questions. (2) Accidental pricing leak. (3) Competing vendor mentioned by name. (4) POC timeline overlaps Q4 freeze.

| Model | Risk Level | Signals Caught | What It Missed |
|---|---|---|---|
| GPT-5.4 | Medium | Competitor, timeline | CFO silence, pricing leak |
| Claude Sonnet 4.6 | Medium-high | CFO silence, competitor | Pricing leak, Q4 freeze |
| Gemini 3 Pro | Low-medium | None specific ("positive momentum") | 3 of 4 signals |
| Grok 4.20 Beta | High | All four | Over-indexed. Called it "adversarial." |
| Mistral Large 3 | Medium | Competitor, timeline | CFO silence, pricing leak |

Methodology

Models tested: GPT-5.4 (OpenAI API), Claude Sonnet 4.6 (Anthropic API), Gemini 3 Pro (Google Vertex AI), Grok 4.20 Beta (xAI API), Mistral Large 3 (La Plateforme). All March 2026.

Thread source: Real anonymized B2B SaaS deal negotiation, 31 messages, 8 participants, 3 weeks. PII replaced. Thread structure preserved.

Scope: This is a single-thread evaluation, not a statistical benchmark. The failure classes documented here are structural and reproducible across threads of similar complexity, but the specific accuracy percentages should be read as directional, not definitive. We chose this thread because it contains all five structural patterns (forwarding, branching, CC changes, quoted duplication, implicit abandonment) in one conversation, not because it was optimized for any particular tool.

Evaluation: Manual scoring by two evaluators against ground truth established by someone with full deal context (not the author). Partial credit awarded.

Structured context: iGPT recall/ask endpoint, cef-1-reasoning tier. Same five models, same prompt, same rubric.

Hallucinated commitments from quoted text: GPT-5.4: 1. Sonnet 4.6: 0. Gemini 3 Pro: 2. Grok 4.20: 1. Mistral Large 3: 1.

Model volatility: Model behavior changes with updates, and several models tested were in active iteration at the time (notably Grok 4.20 Beta). The specific numbers in this evaluation are a snapshot. The structural email properties that cause these failures do not change with model updates. If you rerun this test after a model refresh, the numbers will shift but the pattern will hold: raw email input degrades accuracy on thread-dependent questions regardless of model version.


iGPT's Email Intelligence API handles thread reconstruction, sender attribution, temporal ordering, and conversation topology mapping. docs.igpt.ai


Written by samkayze | On the iGPT team, building the future of context engineering
Published by HackerNoon on 2026/03/30