Conversation Management
Core Strategies
Sliding Window
Maintains a fixed number of recent messages and drops the oldest when the limit is reached.
When message count exceeds the configured window size, the oldest messages are removed. Incomplete message sequences are cleaned up to keep conversation state valid. Large tool results can optionally be truncated to preserve context space.
Conversations with predictable length where recent context matters most and older turns can be safely discarded.
Simple and predictable, but permanently loses older context. If the agent needs to reference something from early in the conversation, it is gone.
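The drop-and-cleanup step can be sketched in a few lines. This is a minimal illustration, assuming messages are dicts with a "role" key and that an orphaned tool result at the start of the window is what an incomplete sequence looks like (both assumptions, not a fixed API):

```python
def slide_window(messages, max_messages=6):
    """Keep the most recent max_messages, dropping the oldest."""
    if len(messages) <= max_messages:
        return list(messages)
    kept = messages[-max_messages:]
    # If the window now starts with an orphaned tool result (its
    # tool_use partner was dropped), remove it to keep state valid.
    while kept and kept[0]["role"] == "tool_result":
        kept = kept[1:]
    return kept
```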
Summarising
Compresses older messages into summaries rather than discarding them, preserving key information in fewer tokens.
When context reduction is needed, a configurable percentage of older messages are summarised by a secondary model call. Recent messages are preserved in their original form. Tool use and result pairs are kept intact to avoid breaking conversation structure.
Long conversations where historical context remains relevant but must be compressed, such as technical discussions or customer service interactions.
Retains more information than sliding window, but adds latency and cost from the summarisation model call. Summary quality depends on what the summariser considers important.
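A sketch of the compression step, with the secondary model call abstracted behind a `summarise_fn` callable (the name and the tool-pair boundary rule are illustrative assumptions):

```python
def summarise_history(messages, summarise_fn, fraction=0.5):
    """Replace the oldest `fraction` of messages with one summary turn."""
    cut = int(len(messages) * fraction)
    # Never split a tool_use/tool_result pair across the boundary.
    while 0 < cut < len(messages) and messages[cut]["role"] == "tool_result":
        cut += 1
    if cut == 0:
        return list(messages)
    summary = summarise_fn(messages[:cut])  # secondary model call in practice
    return [{"role": "system",
             "content": f"Summary of earlier turns: {summary}"}] + messages[cut:]
```

In a real system `summarise_fn` would send the old messages to a cheaper model; here any function from a message list to a string will do.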
Token Budgeting
Allocates fixed token budgets to different context sections (system prompt, conversation history, current task).
Each section gets a hard ceiling. If conversation history exceeds its budget, it triggers truncation or summarisation only within that section. Other sections remain untouched.
Systems with strict latency or cost constraints where predictable token usage matters.
Rigid boundaries can waste tokens if one section consistently underutilises its budget while another is starved.
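The per-section ceiling can be sketched as below. The token counter defaults to character count purely as a stand-in; a real implementation would plug in the model's tokenizer:

```python
def apply_budgets(sections, budgets, count_tokens=len):
    """Enforce a hard per-section ceiling; sections never borrow from
    each other. count_tokens is a stand-in for a real tokenizer."""
    trimmed = {}
    for name, text in sections.items():
        limit = budgets[name]
        # Truncate from the front so the newest content in the section survives.
        trimmed[name] = text if count_tokens(text) <= limit else text[-limit:]
    return trimmed
```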
Lossy Compression
Keeps all user messages verbatim, compresses only assistant responses.
User messages are preserved exactly, while assistant reasoning and responses are condensed. User intent is always available for re-interpretation.
Support and advisory contexts where faithfully retaining user requests is critical but assistant reasoning can be reconstructed.
Asymmetric compression means total savings are limited. Works best when assistant messages are significantly longer than user messages.
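The asymmetric rule is simple to express. A minimal sketch, with the condensing step left as a pluggable `compress_fn` (in practice a model call or extractive heuristic):

```python
def compress_assistant_turns(messages, compress_fn):
    """User messages pass through verbatim; assistant turns are condensed."""
    out = []
    for m in messages:
        if m["role"] == "assistant":
            out.append({"role": "assistant", "content": compress_fn(m["content"])})
        else:
            out.append(m)  # user intent preserved exactly
    return out
```

A trivial `compress_fn` for illustration could keep just the first sentence, e.g. `lambda s: s.split(". ")[0] + "."`.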
Selective Retention
Hierarchical Summarisation
Maintains multiple summary layers at different levels of detail.
Recent messages stay in full. Older messages get summarised. Very old summaries get summarised again, creating a hierarchy of decreasing detail.
Very long-running sessions where both recent detail and distant context matter, such as multi-day projects or ongoing research.
Retains more signal than single-pass summarisation, but each layer introduces some information loss. Multiple model calls increase cost.
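One way to sketch the tiering, again with the model call stubbed as `summarise_fn` (the tier sizes and fold rule are illustrative choices, not a standard):

```python
def hierarchical_compress(history, summarise_fn, recent_n=4, tier_size=4):
    """Tiers of decreasing detail: verbatim recent turns, first-pass
    summaries, and a coarser second-pass summary over the oldest material."""
    recent = history[-recent_n:]
    older = history[:-recent_n]
    # First pass: summarise older turns in tier-sized chunks.
    summaries = [summarise_fn(older[i:i + tier_size])
                 for i in range(0, len(older), tier_size)]
    # Second pass: fold all but the newest summaries into one coarser summary.
    if len(summaries) > tier_size:
        summaries = [summarise_fn(summaries[:-tier_size])] + summaries[-tier_size:]
    return summaries, recent
```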
Semantic Chunking
Drops messages by relevance instead of age.
Embeddings score each message against the current query. Messages most similar to the active discussion are kept, and tangential exchanges are discarded regardless of when they occurred.
Conversations that jump between topics, where the most relevant context may not be the most recent.
Requires embedding computation for each message. Relevance scoring can miss context that is important for reasons not captured by semantic similarity.
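A toy sketch of relevance-based retention. Bag-of-words vectors stand in for real embeddings here; a production system would call an embedding model instead:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a real model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def keep_relevant(messages, query, k=3):
    """Keep the k messages most similar to the query, preserving order."""
    q = embed(query)
    ranked = sorted(range(len(messages)),
                    key=lambda i: cosine(embed(messages[i]), q), reverse=True)
    return [messages[i] for i in sorted(ranked[:k])]
```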
Pinned Messages
Certain messages are marked as permanent and never dropped or summarised.
User preferences, critical decisions, key facts, and important instructions are flagged as pinned. The sliding window or summariser skips these entries. They persist even as the rest of the window moves forward.
Conversations where specific facts must survive indefinitely, such as user constraints, system boundaries, or agreed-upon decisions.
Pinned messages consume context space permanently. Too many pins can crowd out recent context and defeat the purpose of windowing.
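The skip-the-pins rule layered on a sliding window might look like this (a minimal sketch; the `pinned` flag on each message dict is an assumed convention):

```python
def window_with_pins(messages, max_messages=5):
    """Pinned messages always survive; remaining slots go to the newest turns."""
    pinned_idx = [i for i, m in enumerate(messages) if m.get("pinned")]
    unpinned_idx = [i for i, m in enumerate(messages) if not m.get("pinned")]
    slots = max(max_messages - len(pinned_idx), 0)
    keep = set(pinned_idx) | set(unpinned_idx[-slots:] if slots else [])
    return [messages[i] for i in sorted(keep)]  # original order preserved
```

Note the crowding-out tradeoff is visible in the code: every pin reduces `slots` for recent messages by one.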
Topic Segmentation
Detects topic shifts and summarises completed topics while keeping the current topic in full.
The conversation is monitored for topic boundaries using semantic similarity or explicit markers. When a topic concludes, it is compressed into a summary, while the active topic keeps its full message history.
Structured conversations that move through distinct phases, such as requirements gathering followed by implementation discussion.
Topic detection is imperfect. Misidentified boundaries can split a single topic or merge unrelated ones, degrading summary quality.
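A sketch of the segment-then-compress loop, with boundary detection and summarisation both left pluggable (a marker-based `is_boundary` is used in the example; semantic detection would slot in the same way):

```python
def segment_and_compress(messages, is_boundary, summarise_fn):
    """Summarise each completed topic; the active (last) topic stays verbatim."""
    completed, current = [], []
    for m in messages:
        if is_boundary(m) and current:
            completed.append(summarise_fn(current))  # topic concluded
            current = []
        current.append(m)
    return completed, current
```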
Architectural Approaches
Retrieval Augmented Context
Stores the full conversation in a vector database and retrieves relevant chunks on demand.
Every message is embedded and stored externally. When the agent needs context, it queries the database for messages relevant to the current turn rather than keeping everything in the context window.
Very long conversations or multi-session agents where full history far exceeds any context window.
Adds retrieval latency and embedding cost. Retrieved context may miss important messages that are semantically distant from the current query but still relevant.
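The store-then-retrieve shape can be sketched with an in-memory stand-in for the vector database. Word overlap replaces real embedding similarity here; the class and method names are illustrative:

```python
class ConversationStore:
    """Toy in-memory stand-in for a vector database."""
    def __init__(self):
        self.entries = []

    def add(self, message):
        # Real systems would store an embedding vector; we store a word set.
        self.entries.append((set(message.lower().split()), message))

    def query(self, text, k=2):
        """Return the k stored messages most relevant to the current turn."""
        q = set(text.lower().split())
        ranked = sorted(self.entries, key=lambda e: len(e[0] & q), reverse=True)
        return [msg for _, msg in ranked[:k]]
```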
Checkpoint Summarisation
Creates summaries at natural breakpoints rather than on token overflow.
Task completions, topic changes, or explicit user signals are detected, and a summary is generated at that boundary point. This produces cleaner summaries because the model summarises a completed unit of work rather than an arbitrary slice of tokens.
Task-oriented conversations with clear milestones, such as debugging sessions or step-by-step workflows.
Because summaries are generated only at boundaries, context can temporarily exceed ideal limits between checkpoints.
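The per-turn bookkeeping is a small loop. A minimal sketch, with checkpoint detection and summarisation both stubbed as callables:

```python
def record_turn(live, summaries, message, is_checkpoint, summarise_fn):
    """Append a turn; at a checkpoint, fold the completed unit into a summary."""
    live.append(message)
    if is_checkpoint(message):
        summaries.append(summarise_fn(live))  # summarise a completed unit
        live.clear()
```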
Conversation Forking
Starts a fresh context with a summary when the context overflows, while retaining the ability to query old context.
When the context limit is reached, create a comprehensive summary and begin a new context window. The old context is archived and remains queryable. The agent operates with the summary plus new messages, pulling from the archive when needed.
Open-ended, multi-hour sessions where starting fresh is acceptable as long as nothing is permanently lost.
Two-tier system adds complexity. Summary handoff can lose nuance, and archive queries add latency.
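The two tiers can be sketched as a class holding a live window and an archive. The class name, the message-count limit standing in for a token limit, and the substring archive search are all illustrative simplifications:

```python
class ForkingContext:
    """Two tiers: a live window plus a queryable archive of earlier context."""
    def __init__(self, limit, summarise_fn):
        self.limit, self.summarise_fn = limit, summarise_fn
        self.live, self.archive = [], []

    def add(self, message):
        self.live.append(message)
        if len(self.live) >= self.limit:
            self.archive.extend(self.live)          # old context stays queryable
            handoff = self.summarise_fn(self.live)  # comprehensive summary
            self.live = [{"role": "system", "content": handoff}]

    def search_archive(self, term):
        """Stand-in for a semantic archive query."""
        return [m for m in self.archive if term in m.get("content", "")]
```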
