AI agents running long tasks face a critical challenge: they eventually run out of memory. As conversations grow longer, agents forget earlier steps, repeat tasks, contradict previous decisions, and rack up unnecessary API costs. The root cause is simple: Large Language Models operate within fixed context windows. Every message, tool output, and reasoning step consumes tokens. When the window fills up, something has to give.

## How Memory Problems Manifest

Without proper management, agent memory fails in predictable ways:

- **Poisoning** – A wrong fact slips into context early on and keeps getting reused, compounding the original error throughout the workflow.
- **Distraction** – Too much accumulated context makes the model ignore its training knowledge and focus on irrelevant historical details instead.
- **Confusion** – Irrelevant details pile up, leading to off-topic or low-quality answers as the agent loses track of what actually matters.
- **Context Clash** – New information or tools contradict older context still in memory, causing inconsistent reasoning as the agent flip-flops between incompatible approaches.

## Building Long-Running AI Agents

In my work building AI agents for long, stateful workflows, particularly business intelligence systems where business owners use AI to analyze customer sessions, identify patterns in customer needs, extract important updates from weekly messages, understand customer challenges, and discover product demand trends, I've seen memory limits surface quickly and painfully once agents move beyond short, single-turn interactions.

One of the earliest challenges I encountered was context rot. As agents processed hundreds of customer messages across multiple sessions to identify trends and patterns, earlier insights, customer signals, and identified pain points were either forgotten or inconsistently referenced.
This led to agents missing critical patterns, contradicting earlier trend analysis, or failing to connect related customer feedback from different time periods.

Another recurring issue was error amplification. When an incorrect assumption entered the context early, such as misclassifying customer sentiment or misinterpreting a feature request, it persisted across subsequent analysis steps. Without intervention, the agent would confidently build insights upon that flawed premise, leading business owners to make decisions based on inaccurate customer intelligence.

Cost and latency also became practical constraints. Agents analyzing weeks of customer conversations to identify what customers want most, track emerging challenges, or detect product demand patterns accumulated massive token usage. Processing full conversation histories for every business intelligence query increased inference costs dramatically and slowed response times, making the system impractical for business owners who need quick, actionable insights.

## The Solution: Summarization

The breakthrough came with summarization middleware: a way to compress long conversation histories once context fills up, keeping agents sharp, efficient, and affordable. Instead of choosing between losing context or hitting memory limits, summarization automatically condenses older interactions while preserving what matters most.
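The core idea can be sketched in a few lines of plain TypeScript. This is a conceptual illustration, not the middleware's actual implementation; `summarize`, `compressHistory`, and the thresholds are hypothetical names standing in for an LLM call and the real configuration:

```typescript
// Minimal sketch of rolling compression: once history exceeds a budget,
// older messages are replaced by a single summary message while the most
// recent ones are kept verbatim.
type Message = { role: string; content: string };

function summarize(messages: Message[]): Message {
  // Placeholder: a real implementation would call a cheap LLM here.
  return {
    role: "system",
    content: `## Previous conversation summary: ${messages.length} earlier messages condensed`,
  };
}

function compressHistory(
  history: Message[],
  keepRecent: number,   // how many recent messages stay uncompressed
  maxMessages: number,  // budget that triggers compression
): Message[] {
  if (history.length <= maxMessages) return history; // under budget, no-op
  const older = history.slice(0, history.length - keepRecent);
  const recent = history.slice(history.length - keepRecent);
  return [summarize(older), ...recent]; // one summary replaces the older span
}
```

The key property is that the history's size is bounded (one summary plus the retained tail) no matter how many turns the agent runs.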
This approach gives you precise control over:

- **When to trigger** compression (by token count, message count, or context percentage)
- **How much context to keep** uncompressed (recent messages stay in full detail)
- **What to preserve** in summaries (key insights, trends, and decisions rather than verbose message histories)

What became clear is that agent memory needs **compression, not deletion**.

## Insights Matter More Than Messages

A critical discovery from building these business intelligence agents is that they don't need to remember every customer message; they need to remember key trends identified, common customer challenges surfaced, product demands detected, and actionable insights generated. Summarization aligns naturally with this requirement. By compressing earlier analysis cycles into structured summaries, the agent preserves:

- Identified customer trends and patterns
- Common pain points and challenges mentioned
- Product features most requested
- Key sentiment shifts across time periods
- Critical updates and emerging issues

This mirrors how business analysts work: they don't recall every customer conversation verbatim, but they do remember patterns, trends, and actionable insights.

## Why Summarization Works

Summarization middleware keeps agents sharp by maintaining awareness of trends across customer sessions while compressing verbose history.
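One way to make "structured summaries" concrete is a typed record of exactly those fields. The shape below is my own illustration for this use case, not a LangChain type:

```typescript
// Illustrative shape for a structured analysis summary: instead of raw
// message text, the agent carries forward only the distilled signals.
interface AnalysisSummary {
  trends: string[];            // identified customer trends and patterns
  painPoints: string[];        // common challenges mentioned
  requestedFeatures: string[]; // product features most requested
  sentimentShifts: string[];   // key sentiment changes across periods
  emergingIssues: string[];    // critical updates and new issues
}

// Merging a new analysis cycle's summary into the running one,
// de-duplicating so repeated signals don't grow the context again.
function mergeSummaries(base: AnalysisSummary, next: AnalysisSummary): AnalysisSummary {
  const union = (a: string[], b: string[]) => [...new Set([...a, ...b])];
  return {
    trends: union(base.trends, next.trends),
    painPoints: union(base.painPoints, next.painPoints),
    requestedFeatures: union(base.requestedFeatures, next.requestedFeatures),
    sentimentShifts: union(base.sentimentShifts, next.sentimentShifts),
    emergingIssues: union(base.emergingIssues, next.emergingIssues),
  };
}
```

Because each merge de-duplicates, the summary grows with the number of distinct insights rather than the number of messages processed.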
It keeps agents efficient by reducing token usage without losing critical context. And it keeps agents affordable by using economical models for compression while powerful models handle core intelligence work.

The middleware embeds directly into the agent loop, automatically compressing older context when thresholds are reached; no manual intervention required. This proved essential for building agents that deliver accurate, cost-efficient, and trustworthy customer insights over extended time periods.

## What LangChain's Summarization Middleware Gives

- **Automatic summarization when token limits are approached** – The middleware monitors your conversation and triggers compression automatically. No manual intervention needed.
- **Flexible triggers based on tokens, fractions, or message counts** – Configure activation based on absolute token count (e.g., 2,000 tokens), percentage of context capacity (e.g., 80%), or message count (e.g., 20 messages). Combine multiple triggers for fine-grained control.
- **Control over how much recent context is preserved** – Specify retention policies to keep recent messages uncompressed. Preserve the last 1,000 tokens, last 20 messages, or last 20% of context, whatever works for your use case.
- **A separate, cheaper model for summarization to reduce costs** – Use economical models like Claude Haiku or GPT-4o-mini for compression while your primary agent runs on more powerful models. This keeps summarization overhead minimal.

## Configuration Made Simple

The middleware requires minimal setup. You specify:

- **Summarization model**: A fast, economical model for compression
- **Trigger conditions**: When to activate by tokens, context fraction, or message count
- **Retention policy**: How much recent context to preserve uncompressed
- **Summary marker**: A prefix to identify summaries in conversation history

Once configured, the middleware runs automatically throughout agent execution.
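Combining triggers amounts to an OR over simple threshold checks: compression fires as soon as any configured condition is met. The sketch below is my own illustration of those assumed semantics, not the middleware's actual code:

```typescript
// Illustrative trigger check. Firing when ANY condition is met is an
// assumption made for illustration.
type Trigger = { tokens?: number; fraction?: number; messages?: number };

function shouldSummarize(
  usedTokens: number,    // tokens currently in context
  messageCount: number,  // messages currently in history
  contextWindow: number, // model's total context capacity
  triggers: Trigger[],
): boolean {
  return triggers.some((t) =>
    (t.tokens !== undefined && usedTokens >= t.tokens) ||
    (t.fraction !== undefined && usedTokens >= t.fraction * contextWindow) ||
    (t.messages !== undefined && messageCount >= t.messages)
  );
}
```

With triggers like `[{ tokens: 2000 }, { fraction: 0.8 }]` and a large context window, the absolute token threshold is the one that fires first in practice.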
## Implementation Example

### Basic Trigger

```typescript
import { createAgent, summarizationMiddleware } from "langchain";

const agent = createAgent({
  model: "gpt-4o",
  tools: [sessionRetriever, sentimentAnalysisTool, trendAggregator],
  middleware: [
    summarizationMiddleware({
      model: "gpt-4o-mini", // Economical model for background tasks
      trigger: { tokens: 2000 }, // Compresses when context hits 2k tokens
      keep: { messages: 20 }, // Ensures the last 20 messages stay detailed
    }),
  ],
});
```

### Advanced Logic

```typescript
import { createAgent, summarizationMiddleware } from "langchain";

const agent = createAgent({
  model: "gpt-4o",
  tools: [sessionRetriever, sentimentAnalysisTool, trendAggregator],
  middleware: [
    summarizationMiddleware({
      model: "gpt-4o-mini",
      trigger: [{ tokens: 2000 }, { fraction: 0.8 }],
      keep: { tokens: 1000 }, // Keep last 1,000 tokens after summary
      summaryPrefix: "## Previous conversation summary:",
    }),
  ],
});
```

## Perfect For

- **Coding agents** – Agents performing refactoring, debugging, or code review across multiple files accumulate extensive context from file reads, code analysis, and iterative changes. Summarization preserves the thread of decisions without keeping every intermediate step in memory.
- **Customer support assistants** – Long troubleshooting sessions generate substantial history from diagnostic steps, user responses, and solution attempts. Summarization maintains case context across extended interactions without hitting token limits.
- **Research and analysis agents** – Agents synthesizing information from multiple sources or conducting iterative analysis benefit from compressed summaries of earlier findings while maintaining focus on current work.
- **Multi-session workflows** – Any application where users return to continue previous work benefits from summaries that preserve essential context without the full token overhead of past sessions.