AI agents running long tasks face a critical challenge: they eventually run out of memory. As conversations grow longer, agents forget earlier steps, repeat tasks, contradict previous decisions, and rack up unnecessary API costs.
The root cause is simple: Large Language Models operate within fixed memory windows. Every message, tool output, and reasoning step consumes tokens. When the window fills up, something has to give.
How Memory Problems Manifest
Without proper management, agent memory fails in predictable ways:
Poisoning – A wrong fact slips into context early on and keeps getting reused, compounding the original error throughout the workflow.
Distraction – Too much accumulated context makes the model ignore its training knowledge and focus on irrelevant historical details instead.
Confusion – Irrelevant details pile up, leading to off-topic or low-quality answers as the agent loses track of what actually matters.
Context Clash – New information or tools contradict older context still in memory, causing inconsistent reasoning as the agent flip-flops between incompatible approaches.
Building Long-Running AI Agents
I build AI agents for long, stateful workflows, particularly business intelligence systems where business owners use AI to analyze customer sessions, identify patterns in customer needs, extract important updates from weekly messages, understand customer challenges, and discover product demand trends. In that work, I've seen memory limits surface quickly and painfully once agents move beyond short, single-turn interactions.
One of the earliest challenges encountered was context rot. As agents processed hundreds of customer messages across multiple sessions to identify trends and patterns, earlier insights, customer signals, or identified pain points were either forgotten or inconsistently referenced. This led to agents missing critical patterns, contradicting earlier trend analysis, or failing to connect related customer feedback from different time periods.
Another recurring issue was error amplification. When an incorrect assumption entered the context early, such as misclassified customer sentiment or a misinterpreted feature request, it persisted across subsequent analysis steps. Without intervention, the agent would confidently build insights on that flawed premise, leading business owners to make decisions based on inaccurate customer intelligence.
Cost and latency also became practical constraints. Agents analyzing weeks of customer conversations to identify what customers want most, track emerging challenges, or detect product demand patterns accumulated massive token usage. Processing full conversation histories for every business intelligence query increased inference costs dramatically and slowed response times, making the system impractical for business owners who need quick, actionable insights.
The Solution: Summarization
The breakthrough came with summarization middleware: a way to compress long conversation histories once the context fills up, keeping agents sharp, efficient, and affordable. Instead of choosing between losing context and hitting memory limits, summarization automatically condenses older interactions while preserving what matters most.
This approach gives you precise control over:
- When to trigger compression (by token count, message count, or context percentage)
- How much context to keep uncompressed (recent messages stay in full detail)
- What to preserve in summaries (key insights, trends, and decisions rather than verbose message histories)
What became clear is that agent memory needs compression, not deletion.
Insights Matter More Than Messages
A critical discovery from building these business intelligence agents is that they don't need to remember every customer message; they need to remember the key trends identified, common customer challenges surfaced, product demands detected, and actionable insights generated.
Summarization aligns naturally with this requirement. By compressing earlier analysis cycles into structured summaries, the agent preserves:
- Identified customer trends and patterns
- Common pain points and challenges mentioned
- Product features most requested
- Key sentiment shifts across time periods
- Critical updates and emerging issues
This mirrors how business analysts work: they don't recall every customer conversation verbatim, but they do remember patterns, trends, and actionable insights.
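In practice, I steer summaries toward these insights by giving the compression model an analyst-style prompt. The snippet below is a minimal sketch: it assumes the middleware accepts a custom prompt option (written here as summaryPrompt), so treat that parameter name as an assumption to verify against your LangChain version.

import { summarizationMiddleware } from "langchain";

// Sketch only: summaryPrompt is assumed to exist; check your version's options.
const insightFocusedSummarizer = summarizationMiddleware({
  model: "gpt-4o-mini",
  trigger: { tokens: 2000 },
  keep: { messages: 20 },
  summaryPrompt:
    "Summarize the conversation so far as a business analyst would: list identified customer trends, " +
    "recurring pain points, most-requested features, sentiment shifts across time periods, and emerging issues. " +
    "Do not reproduce messages verbatim.",
});

The point of the prompt is to make the summary carry forward analysis, not transcripts, so later reasoning builds on insights rather than raw history.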
Why Summarization Works
Summarization middleware keeps agents sharp by maintaining awareness of trends across customer sessions while compressing verbose history. It keeps agents efficient by reducing token usage without losing critical context. And it keeps agents affordable by using economical models for compression while powerful models handle core intelligence work.
The middleware embeds directly into the agent loop, automatically compressing older context when thresholds are reached, with no manual intervention required. This proved essential for building agents that deliver accurate, cost-efficient, and trustworthy customer insights over extended time periods.
What LangChain's Summarization Middleware Gives You
- Automatic summarization when token limits are approached – The middleware monitors your conversation and triggers compression automatically. No manual intervention needed.
- Flexible triggers based on tokens, fractions, or message counts – Configure activation based on absolute token count (e.g., 2,000 tokens), percentage of context capacity (e.g., 80%), or message count (e.g., 20 messages). Combine multiple triggers for fine-grained control.
- Control over how much recent context is preserved – Specify retention policies to keep recent messages uncompressed. Preserve the last 1,000 tokens, last 20 messages, or last 20% of context, whatever works for your use case.
- A separate, cheaper model for summarization to reduce costs – Use economical models like Claude Haiku or GPT-4o-mini for compression while your primary agent runs on more powerful models. This keeps summarization overhead minimal.
Configuration Made Simple
The middleware requires minimal setup. You specify:
- Summarization model: A fast, economical model for compression
- Trigger conditions: When to activate by tokens, context fraction, or message count
- Retention policy: How much recent context to preserve uncompressed
- Summary marker: A prefix to identify summaries in conversation history
Once configured, the middleware runs automatically throughout agent execution.
Implementation Example:
Basic Trigger
import { createAgent, summarizationMiddleware } from "langchain";

const agent = createAgent({
  model: "gpt-4o",
  tools: [sessionRetriever, sentimentAnalysisTool, trendAggregator],
  middleware: [
    summarizationMiddleware({
      model: "gpt-4o-mini", // Economical model for background tasks
      trigger: { tokens: 2000 }, // Compresses when context hits 2k tokens
      keep: { messages: 20 }, // Ensures the last 20 messages stay detailed
    }),
  ],
});
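Once the agent is created, it is used like any other LangChain agent; the middleware does its work behind the scenes as the thread grows. The invocation below is a usage sketch that assumes the standard messages-in, messages-out interface exposed by createAgent.

// Usage sketch (assumes the standard createAgent invoke interface).
const result = await agent.invoke({
  messages: [
    { role: "user", content: "What product requests came up most often last week?" },
  ],
});

// The latest assistant message carries the answer; older turns may already be summarized.
console.log(result.messages.at(-1)?.content);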
Advanced Logic
import { createAgent, summarizationMiddleware } from "langchain";

const agent = createAgent({
  model: "gpt-4o",
  tools: [sessionRetriever, sentimentAnalysisTool, trendAggregator],
  middleware: [
    summarizationMiddleware({
      model: "gpt-4o-mini",
      trigger: [
        { tokens: 2000 }, // Compress when context reaches 2k tokens...
        { fraction: 0.8 }, // ...or when 80% of the context window is used
      ],
      keep: { tokens: 1000 }, // Keep the last 1,000 tokens after summarizing
      summaryPrefix: "## Previous conversation summary:",
    }),
  ],
});
Perfect For
Coding agents – Agents performing refactoring, debugging, or code review across multiple files accumulate extensive context from file reads, code analysis, and iterative changes. Summarization preserves the thread of decisions without keeping every intermediate step in memory.
Customer support assistants – Long troubleshooting sessions generate substantial history from diagnostic steps, user responses, and solution attempts. Summarization maintains case context across extended interactions without hitting token limits.
Research and analysis agents – Agents synthesizing information from multiple sources or conducting iterative analysis benefit from compressed summaries of earlier findings while maintaining focus on current work.
Multi-session workflows – Any application where users return to continue previous work benefits from summaries that preserve essential context without the full token overhead of past sessions.
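To make the multi-session case concrete, here is an assumption-heavy sketch of pairing the middleware with a persistent thread. It supposes createAgent accepts a LangGraph checkpointer and that a thread_id is passed through the run config, which is how LangGraph-based agents typically persist state; confirm both option names against your installed versions.

import { createAgent, summarizationMiddleware } from "langchain";
import { MemorySaver } from "@langchain/langgraph";

// Sketch: the checkpointer persists the (already compressed) thread between sessions.
const agent = createAgent({
  model: "gpt-4o",
  tools: [],
  middleware: [
    summarizationMiddleware({
      model: "gpt-4o-mini",
      trigger: { fraction: 0.8 },
      keep: { messages: 20 },
    }),
  ],
  checkpointer: new MemorySaver(), // assumed option name for LangGraph persistence
});

// Reusing the same thread_id on a later visit resumes from the compact summary.
await agent.invoke(
  { messages: [{ role: "user", content: "Pick up where we left off on last week's trends." }] },
  { configurable: { thread_id: "owner-42" } }
);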
