Have you ever spent time building RAG over email, where everything works beautifully on PDFs, wiki pages, and Notion dumps, but the moment you point the same pipeline at someone's inbox, the whole thing collapses in ways that take weeks to diagnose? The failures don't look like failures. They look like bad retrieval, which you keep trying to fix with better chunking strategies, bigger context windows, and more carefully tuned prompts, until you eventually realize the problem was never retrieval at all. The problem is that email threads aren't documents, yet every standard RAG approach treats them as if they were.

This is the article I wish had existed when we started building email infrastructure. The failure modes are specific and enumerable, and once you name them, debugging gets easier, because the fix isn't a better model or a smarter prompt: it's a different architecture.

## The approach that works on everything except email

The instinct is obvious and well-founded. Connect to the Gmail API, pull the messages, chunk them, embed them, and retrieve the top-k relevant chunks when someone asks a question.
```python
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# creds: OAuth2 Credentials obtained elsewhere
service = build("gmail", "v1", credentials=creds)
results = service.users().messages().list(userId="me", maxResults=50).execute()
messages = results.get("messages", [])

raw_emails = []
for msg in messages:
    full = service.users().messages().get(
        userId="me", id=msg["id"], format="full"
    ).execute()
    payload = full.get("payload", {})
    body = get_body_text(payload)  # your helper to walk MIME parts
    raw_emails.append({
        "id": msg["id"],
        "threadId": full.get("threadId"),
        "body": body,
        "headers": {
            h["name"]: h["value"]
            for h in payload.get("headers", [])
        }
    })

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
)
chunks = []
for email in raw_emails:
    splits = splitter.split_text(email["body"])
    for split in splits:
        chunks.append({
            "text": split,
            "metadata": {"thread_id": email["threadId"]}
        })

embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_texts(
    [c["text"] for c in chunks],
    embeddings,
    metadatas=[c["metadata"] for c in chunks]
)

results = vectorstore.similarity_search(
    "Who approved the Q4 budget?", k=5
)
```

This works for static documents because they are self-contained: each chunk carries its own meaning, and the relationships between chunks are mostly semantic. Email lacks both properties, and the ways it breaks are worth walking through in detail, because each one reveals why email is fundamentally different from the data structure RAG was designed for.

### Quoted text duplication

In a 12-message email thread, the Gmail API returns every reply with the full quoted chain below it: message 12 contains messages 1 through 11, message 11 contains 1 through 10, and so on, so the original message appears 12 times across the full thread payload. Here's what the raw text of a single reply actually looks like:

```text
Hey team, approved the revised timeline. Let's move forward.

On Mon, Feb 3, 2026 at 2:14 PM Sarah Chen <sarah@acme.com> wrote:
> Updated the timeline based on David's feedback. New deadline March 15.
>
> On Mon, Feb 3, 2026 at 11:30 AM David Kim <david@acme.com> wrote:
>> The original March 1 deadline won't work. We need two more weeks
>> for compliance review.
>>
>> On Fri, Jan 31, 2026 at 4:45 PM Sarah Chen <sarah@acme.com> wrote:
>>> Here's the proposed timeline for Q1 deliverables...
>>> [14 more lines of original proposal]
>>> ---
>>> Sarah Chen | VP Operations | Acme Consulting
>>> 123 Main St, Suite 400 | New York, NY 10001
>>> Tel: (212) 555-0142 | sarah@acme.com
>>> CONFIDENTIAL: This email and any attachments are intended
>>> solely for the use of the individual or entity to whom
>>> they are addressed...
```

When you chunk this and embed it, the oldest messages dominate the embedding space because they appear in so many chunks. The model has no way to know these are duplicates; it reads them as reinforcement, which inflates confidence in the oldest messages and buries the most recent ones. We had a 25-message reply-all chain where retrieval kept returning the wrong messages because the company footer appeared so many times that it dominated the embedding for the entire thread. Semantically, every chunk looked almost identical because the boilerplate outweighed the actual content.
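The scale of the problem is easy to measure before you ever embed anything. A minimal sketch (the toy reply below is invented for illustration): count what fraction of a message body is quoted material rather than new content.

```python
import re

def quoted_fraction(body: str) -> float:
    """Fraction of non-empty lines that are quoted text or quote headers."""
    lines = [l for l in body.split("\n") if l.strip()]
    if not lines:
        return 0.0
    quoted = [
        l for l in lines
        if l.lstrip().startswith(">") or re.match(r"^On .+ wrote:$", l.strip())
    ]
    return len(quoted) / len(lines)

# Toy reply: one new line, the rest quoted from earlier messages.
reply = "\n".join([
    "Hey team, approved the revised timeline.",
    "On Mon, Feb 3, 2026 at 2:14 PM Sarah Chen wrote:",
    "> Updated the timeline based on David's feedback.",
    ">> The original March 1 deadline won't work.",
])
print(quoted_fraction(reply))  # 0.75
```

In long reply-all chains this number climbs toward 1.0 for every message past the first few, which is exactly the duplication the embedding step then amplifies.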
The fix isn't a simple regex, because people reply inline, edit quotes, and forward with additions mid-quote, and any deduplication logic needs to handle all of these without accidentally stripping content that was intentionally modified inside a quote block.

### Thread structure disappears at embedding time

Email threads have a conversation tree: message 7 might be a reply to message 3, not message 6; someone branches off a subthread; someone else replies to the original; a third person replies-all to a message from two days ago. When you chunk by message and embed, that structure vanishes entirely. Embeddings capture semantic similarity between chunks but know nothing about who said what in response to whom, or in what order the conversation actually progressed.

So when you ask "who approved this decision," retrieval might surface a chunk where someone appears to say "looks good," when in fact they were being quoted by someone disagreeing with them. The embedding matched the words, but the conversational context that gave those words meaning was lost at indexing time.

### CC vs. authorship attribution

This one is subtle and produces confident, wrong outputs. Email headers contain To, CC, and From fields, and if you're dumping raw thread content into the context window, names from these fields blend with names in the message body:

```python
# What your chunker sees after flattening:
"""
From: michael@company.com
To: sarah@acme.com
CC: david@acme.com, jennifer@company.com

Sarah, following up on David's proposal from last week.
Jennifer will handle the compliance review.
"""
```

Your model sees "David" in the CC line and "David's proposal" in the body and has no structural way to distinguish "David was informed" from "David authored this." This is why extraction pipelines end up confidently attributing deliverables to people who never wrote a single reply in the thread: they appeared in CC fields, and their names showed up in forwarded content.

### Forwarded thread forks

Someone forwards a thread to a new group of people, and now you have two conversations that share history but diverged at a specific point. The Gmail API treats these as separate threads with no link between them. Ask "what did the team decide about X," and retrieval might pull from either branch and give you contradictory answers without knowing they're contradictory: both chunks are semantically relevant, both are about the same topic, and the metadata that would tell you the conversation forked doesn't exist in the embedding.

### Signatures and boilerplate at scale

One email's legal disclaimer is noise you can filter with a regex pattern. In a real organization, though, you're dealing with 30+ signature formats, compliance disclaimers in multiple languages, confidentiality notices longer than the actual message content, auto-appended calendar links, and HTML formatting artifacts that survive conversion to text in unpredictable ways. At scale, a meaningful portion of your token budget goes to this noise while the model treats it as content worth reasoning over.

### Cross-thread temporal reasoning

Someone says "let's revisit this next quarter" in a January thread, and in March a different thread references "the timeline we discussed." These use completely different words to refer to the same thing, because the connection between them is temporal and referential rather than semantic, which means standard vector similarity can't find it.
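Connections like that can be found structurally even when they can't be found semantically. A toy sketch, with invented thread records and invented scoring weights: score candidate links between threads by participant overlap and temporal proximity, ignoring the words entirely.

```python
from datetime import datetime

def link_score(thread_a: dict, thread_b: dict) -> float:
    """Heuristic cross-thread link score: participant overlap + time proximity."""
    people_a = set(thread_a["participants"])
    people_b = set(thread_b["participants"])
    overlap = len(people_a & people_b) / len(people_a | people_b)  # Jaccard
    days_apart = abs((thread_a["last_msg"] - thread_b["last_msg"]).days)
    recency = 1 / (1 + days_apart / 30)  # decays over roughly a month
    return 0.7 * overlap + 0.3 * recency  # weights are illustrative, not tuned

jan = {"participants": ["sarah", "david"], "last_msg": datetime(2026, 1, 20)}
mar = {"participants": ["sarah", "david", "noam"], "last_msg": datetime(2026, 3, 5)}
other = {"participants": ["alice"], "last_msg": datetime(2026, 3, 4)}

# The March follow-up with the same people outranks an unrelated recent thread.
print(link_score(jan, mar) > link_score(jan, other))  # True
```

Nothing here looks at message text; that is the point. The January and March threads link because the same people are involved within a plausible time window, which is information embeddings never see.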
This is where the document-vs-conversation distinction matters most: documents are static, but conversations evolve over time, reference previous conversations, and create commitments that only make sense in the context of when they were made.

## Why the usual fixes solve the wrong problem

Better models don't help, because the issues are upstream of reasoning. The model is working correctly on broken input, so a smarter model just gives you more confidently wrong answers with better grammar.

Bigger context windows let you stuff more of the thread in, but you're still stuffing in duplicated quoted text, boilerplate, and structurally flat content. More tokens of garbage are still garbage, and you pay for every token of every duplicated signature block.

Better prompts help marginally: you can instruct the model to "ignore signatures," "only attribute statements to the person who wrote them," and "consider the chronological order." But then you're asking the model to reconstruct structure from unstructured input on every single query, which means you're doing context engineering at the wrong layer, pushing reconstruction logic into the prompt instead of handling it in the pipeline.

## Treating email as graph reconstruction instead of document retrieval

The architecture that actually works starts from a different premise: email threads are conversational graphs, not documents. Each message is a node; replies create edges; participants have roles that change over time; and decisions, commitments, and references create cross-thread edges.
The pipeline has six layers that need to exist between raw email and your model:

```text
┌─────────────────────────────────────────────────────┐
│                  YOUR APPLICATION                   │
│             (agent, workflow, product)              │
├─────────────────────────────────────────────────────┤
│ Layer 6: Hybrid Retrieval                           │
│ semantic search + metadata filters + graph traversal│
├─────────────────────────────────────────────────────┤
│ Layer 5: Cross-Thread Linking                       │
│ participant overlap, topic refs, temporal proximity │
├─────────────────────────────────────────────────────┤
│ Layer 4: Structured Metadata Extraction             │
│ decisions, tasks, owners, deadlines, sentiment      │
├─────────────────────────────────────────────────────┤
│ Layer 3: Participant & Role Tracking                │
│ From vs To vs CC, role changes across thread        │
├─────────────────────────────────────────────────────┤
│ Layer 2: Content Deduplication                      │
│ quoted text removal, inline edit preservation       │
├─────────────────────────────────────────────────────┤
│ Layer 1: Thread Reconstruction                      │
│ In-Reply-To / References headers → conversation tree│
├─────────────────────────────────────────────────────┤
│             RAW EMAIL (Gmail API / IMAP)            │
└─────────────────────────────────────────────────────┘
```

### Layer 1: Thread reconstruction

Map In-Reply-To and References headers to build the actual conversation tree. This gives you the structure that embeddings lose: who replied to whom, where the conversation branched, and what the actual sequence was.

```python
from collections import defaultdict

def build_thread_tree(messages):
    """Build a conversation tree from email headers."""
    by_message_id = {}
    children = defaultdict(list)
    roots = []

    # First pass: index every message, so arrival order doesn't matter.
    for msg in messages:
        msg_id = msg["headers"].get("Message-ID", "")
        by_message_id[msg_id] = msg

    # Second pass: attach each message to its parent, or treat it as a root.
    for msg in messages:
        msg_id = msg["headers"].get("Message-ID", "")
        reply_to = msg["headers"].get("In-Reply-To", "")
        if reply_to and reply_to in by_message_id:
            children[reply_to].append(msg_id)
        else:
            roots.append(msg_id)

    return roots, children, by_message_id
```

(The indexing has to happen before parenting: a single pass misclassifies any reply that appears in the list before its parent as a root.) You want to sync newest-to-oldest, because you need the final state of the conversation first, and when a thread has been forwarded and modified, the most recent version reflects reality.

### Layer 2: Content deduplication

Walk the tree and strip quoted text while preserving inline edits. Someone who replies and modifies a quote is adding new information inside old text, and your deduplication needs to detect this rather than treating all quoted content as noise.
```python
import re

def strip_quoted_text(body, parent_body=None):
    """Remove quoted replies, preserve inline edits."""
    lines = body.split("\n")
    cleaned = []
    in_quote_block = False

    for line in lines:
        if re.match(r'^>\s*', line) or re.match(r'^On .+ wrote:$', line):
            in_quote_block = True
            original_line = re.sub(r'^>\s*', '', line)
            # A quoted line that doesn't appear in the parent was edited inline.
            if parent_body and original_line not in parent_body:
                cleaned.append(f"[INLINE EDIT] {original_line}")
            continue
        if in_quote_block and line.strip() == "":
            in_quote_block = False
            continue
        if not in_quote_block:
            cleaned.append(line)

    return "\n".join(cleaned)
```

This is where things get difficult in production, because email clients format quotes differently: some prefix with >, some indent, some use HTML styling, and some do nothing identifiable at all. People also edit quotes without marking their edits. Any automated approach needs to handle the messy reality of how email clients actually work rather than the clean > prefix you see in standards documents.

### Layer 3: Participant and role extraction

Parse headers to build an explicit participant model, so the model doesn't have to guess at attribution.
```python
def extract_participants(messages):
    """Build a structured participant model from a thread."""
    participants = {}

    def ensure(addr, i):
        if addr not in participants:
            participants[addr] = {
                "first_seen": i,
                "roles": [],
                "message_count": 0
            }
        return participants[addr]

    for i, msg in enumerate(messages):
        sender = msg["headers"].get("From", "")
        # parse_addresses: your helper that splits a header into addresses
        to_list = parse_addresses(msg["headers"].get("To", ""))
        cc_list = parse_addresses(msg["headers"].get("Cc", ""))

        entry = ensure(sender, i)
        entry["roles"].append("author")
        entry["message_count"] += 1

        for addr in to_list:
            ensure(addr, i)["roles"].append("recipient")
        for addr in cc_list:
            ensure(addr, i)["roles"].append("cc")

    return participants
```

Once you have this as structured data, the difference between "David authored this" and "David was CC'd on this" is explicit instead of something the model has to infer from flattened text.

### Layers 4-6: Extraction, cross-thread linking, hybrid retrieval

The remaining layers extract structured metadata (decisions, tasks, owners, deadlines, sentiment), build connections across threads based on participants, topics, and temporal proximity, and combine semantic search with metadata filtering and graph traversal. Pure vector similarity isn't enough for email, because too many important relationships are structural rather than semantic.
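A toy sketch of what the structural half of that retrieval looks like: metadata filters run first, and only the survivors reach semantic ranking. The message records, metadata fields, and the `semantic_rank` stub are invented for illustration; in a real system the stub would be an embedding similarity search.

```python
from datetime import date

messages = [
    {"sender": "sarah@acme.com", "date": date(2026, 4, 2),
     "is_decision": True,  "text": "Approved the Q2 budget at $40k."},
    {"sender": "david@acme.com", "date": date(2026, 4, 1),
     "is_decision": False, "text": "Q2 budget draft attached."},
    {"sender": "sarah@acme.com", "date": date(2026, 3, 28),
     "is_decision": False, "text": "Lunch on Friday?"},
]

def semantic_rank(candidates, query):
    # Stand-in for real embedding similarity: crude keyword overlap.
    words = query.lower().split()
    return sorted(candidates,
                  key=lambda m: -sum(w in m["text"].lower() for w in words))

def hybrid_query(messages, query, sender=None, decisions_only=False):
    """Structural filters run before any semantic matching."""
    pool = [m for m in messages
            if (sender is None or m["sender"] == sender)
            and (not decisions_only or m["is_decision"])]
    return semantic_rank(pool, query)

hits = hybrid_query(messages, "Q2 budget",
                    sender="sarah@acme.com", decisions_only=True)
print(hits[0]["text"])  # Approved the Q2 budget at $40k.
```

The ordering matters: filtering after semantic search means the filters can only discard results, while filtering first guarantees every candidate already satisfies the structural constraints.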
You need to be able to say "find messages from Sarah in threads about the Q2 budget, ordered by date, where a decision was made" and have the retrieval handle the filtering before the semantic matching happens.

## What the difference looks like in practice

Here's what happens when you query email with the standard approach versus structured infrastructure. The contrast shows why the architecture matters more than the model.

Standard RAG query result:

```python
results = vectorstore.similarity_search(
    "What are the open action items?", k=5
)
# Returns: 5 chunks of mixed content
# - 2 chunks dominated by signature blocks
# - 1 chunk from a quoted reply (wrong attribution)
# - 1 relevant chunk buried in noise
# - 1 chunk from an unrelated thread (similar keywords)
```

Structured query through iGPT's API:

```python
from igptai import IGPT

client = IGPT(api_key="your-api-key", user="user-123")

response = client.recall.ask(
    input="What are the open action items from this week?",
    quality="cef-1-normal"
)
```

The response metadata:

```json
{
  "state": {
    "status": "completed",
    "id": "n0a8neqszeo04ruf280v5ug4pkxnv63"
  },
  "context": {
    "quality": "cef-1-normal",
    "indexed": 1,
    "datasources": 2
  },
  "metadata": {
    "sources": 7
  },
  "usage": {
    "input_tokens": 44676,
    "output_tokens": 505,
    "total_tokens": 45181
  }
}
```

The output is structured data with owners, dates, and source attribution across 7 referenced documents. No signature blocks, no duplicated quoted text, no misattributed CC recipients. The six infrastructure layers handled all of that before the model saw anything.

The streaming version shows the pipeline stages as they happen:

```python
for event in client.recall.ask(
    input="Who committed to what in the last 7 days?",
    stream=True,
    quality="cef-1-normal"
):
    if "state" in event:
        print(f"Status: {event['state']['status']}")
    elif "context" in event:
        print(f"Datasources: {event['context']['datasources']}")
    elif "metadata" in event:
        print(f"Sources referenced: {event['metadata']['sources']}")
    elif "delta" in event:
        print(event["delta"]["output"], end="", flush=True)
```

```text
Status: start
Datasources: 2
Status: inference
Sources referenced: 22
Here is a summary of commitments made in the last 7 days...

| Date       | Person      | Commitment                                      |
|------------|-------------|-------------------------------------------------|
| 2026-02-09 | Sivan Kaspi | Proposed new campaign, requested alignment sync |
| 2026-02-10 | Noam Cohen  | Reviewing blog and one-pager, final versions    |
...
Status: completed
```

Twenty-two source documents referenced, synthesized into structured commitments with dates and owners, streaming back through server-sent events. The same query through standard RAG would have returned 5 chunks of flattened text with no structural metadata, and you'd be asking the model to infer who committed to what from raw message bodies.

```javascript
// Same thing in Node.js
import IGPT from "igptai";

const client = new IGPT({ apiKey: "your-api-key", user: "user-123" });

const stream = await client.recall.ask({
  input: "Summarize the key decisions from recent threads",
  stream: true,
  quality: "cef-1-normal"
});

for await (const event of stream) {
  if (event.delta?.output) {
    process.stdout.write(event.delta.output);
  }
}
```

## The full pipeline

```text
┌──────────────────────────────────────────────────────────┐
│                      EMAIL SOURCES                       │
│               Gmail / Outlook / IMAP (OAuth)             │
└────────────────────────┬─────────────────────────────────┘
                         │
                    ┌────▼────┐
                    │  SYNC   │  Connect, authenticate,
                    │         │  pull threads + attachments
                    └────┬────┘
                         │
                    ┌────▼─────────────────────────────┐
                    │ PARSE & STRUCTURE                │
                    │  Thread reconstruction           │
                    │  Quoted text deduplication      │
                    │  Participant role tracking       │
                    │  Attachment OCR + parsing        │
                    │  Signature/boilerplate removal   │
                    │  Multilingual handling           │
                    └────┬─────────────────────────────┘
                         │
                    ┌────▼─────────────────────────────┐
                    │ INDEX                            │
                    │  Semantic vector embeddings      │
                    │  Keyword / full-text index       │
                    │  Context graphs                  │
                    │  Structured metadata             │
                    └────┬─────────────────────────────┘
                         │
                    ┌────▼─────────────────────────────┐
                    │ HYBRID RETRIEVAL + REASONING     │
                    │  Semantic + keyword + filters    │
                    │  + reranking + context shaping   │
                    │  → Structured JSON output        │
                    │    (tasks, owners, decisions,    │
                    │     sentiment, citations)        │
                    └────┬─────────────────────────────┘
                         │
              ┌──────────▼──────────┐
              │  YOUR APPLICATION   │
              │  Agent / Workflow / │
              │  SaaS / Dashboard   │
              └─────────────────────┘
```

Everything from query to structured response happens in a single API call. No chunking strategy to tune, no vector store to manage, no prompt engineering to compensate for missing structure.
## Running it yourself

```bash
pip install igptai
```

```python
from igptai import IGPT

client = IGPT(api_key="your-key", user="your-user-id")

# Connect Gmail via OAuth
auth = client.connectors.authorize(
    service="google",
    scope="email",
    redirect_uri="https://your-app.com/callback"
)
# User completes OAuth flow at auth["url"]

# Check indexing status
datasources = client.datasources.list()

# Query once indexed
response = client.recall.ask(
    input="What decisions were made this week and who owns next steps?"
)
```

If you want to skip the OAuth setup and just see what structured output looks like on real email data, the playground lets you connect your inbox and run queries in about five minutes with no code.

We kept hitting the same wall that everyone building over email hits: the problem isn't the model, it's the infrastructure between raw email and the model. Once you solve the infrastructure, the model works fine, because it's finally getting clean, structured input instead of flattened text with signatures repeated 12 times. Email threads are conversational graphs, and the moment you stop treating them like documents is the moment everything else starts working.

If you're building agents or products over email and hitting these problems, drop a comment or find me on Reddit.