Conversation Management
Core Strategies
Sliding Window
Maintains a fixed number of recent messages and drops the oldest when the limit is reached.
When message count exceeds the configured window size, the oldest messages are removed. Incomplete message sequences are cleaned up to keep conversation state valid. Large tool results can optionally be truncated to preserve context space.
Conversations with predictable length where recent context matters most and older turns can be safely discarded.
Simple and predictable, but permanently loses older context. If the agent needs to reference something from early in the conversation, it is gone.
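The drop-and-cleanup step can be sketched in a few lines. This is a minimal illustration, assuming messages are dicts with a "role" key and that an orphaned tool result at the start of the window is what an incomplete sequence looks like (both assumptions, not a fixed API):

```python
def slide_window(messages, max_messages=6):
    """Keep the most recent max_messages, dropping the oldest."""
    if len(messages) <= max_messages:
        return list(messages)
    kept = messages[-max_messages:]
    # If the window now starts with an orphaned tool result (its
    # tool_use partner was dropped), remove it to keep state valid.
    while kept and kept[0]["role"] == "tool_result":
        kept = kept[1:]
    return kept
```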
Summarising
Compresses older messages into summaries rather than discarding them, preserving key information in fewer tokens.
When context reduction is needed, a configurable percentage of older messages are summarised by a secondary model call. Recent messages are preserved in their original form. Tool use and result pairs are kept intact to avoid breaking conversation structure.
Long conversations where historical context remains relevant but must be compressed, such as technical discussions or customer service interactions.
Retains more information than sliding window, but adds latency and cost from the summarisation model call. Summary quality depends on what the summariser considers important.
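A sketch of the compression step, with the secondary model call abstracted behind a `summarise_fn` callable (the name and the tool-pair boundary rule are illustrative assumptions):

```python
def summarise_history(messages, summarise_fn, fraction=0.5):
    """Replace the oldest `fraction` of messages with one summary turn."""
    cut = int(len(messages) * fraction)
    # Never split a tool_use/tool_result pair across the boundary.
    while 0 < cut < len(messages) and messages[cut]["role"] == "tool_result":
        cut += 1
    if cut == 0:
        return list(messages)
    summary = summarise_fn(messages[:cut])  # secondary model call in practice
    return [{"role": "system",
             "content": f"Summary of earlier turns: {summary}"}] + messages[cut:]
```

In a real system `summarise_fn` would send the old messages to a cheaper model; here any function from a message list to a string will do.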
Token Budgeting
Allocates fixed token budgets to different context sections (system prompt, conversation history, current task).
Each section gets a hard ceiling. If conversation history exceeds its budget, it triggers truncation or summarisation only within that section. Other sections remain untouched.
Systems with strict latency or cost constraints where predictable token usage matters.
Rigid boundaries can waste tokens if one section consistently underutilises its budget while another is starved.
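The per-section ceiling can be sketched as below. The token counter defaults to character count purely as a stand-in; a real implementation would plug in the model's tokenizer:

```python
def apply_budgets(sections, budgets, count_tokens=len):
    """Enforce a hard per-section ceiling; sections never borrow from
    each other. count_tokens is a stand-in for a real tokenizer."""
    trimmed = {}
    for name, text in sections.items():
        limit = budgets[name]
        # Truncate from the front so the newest content in the section survives.
        trimmed[name] = text if count_tokens(text) <= limit else text[-limit:]
    return trimmed
```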
Lossy Compression
Keeps all user messages verbatim, compresses only assistant responses.
User messages are preserved exactly, while assistant reasoning and responses are condensed. User intent is always available for re-interpretation.
Support and advisory contexts where faithfully retaining user requests is critical but assistant reasoning can be reconstructed.
Asymmetric compression means total savings are limited. Works best when assistant messages are significantly longer than user messages.
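The asymmetric rule is simple to express. A minimal sketch, with the condensing step left as a pluggable `compress_fn` (in practice a model call or extractive heuristic):

```python
def compress_assistant_turns(messages, compress_fn):
    """User messages pass through verbatim; assistant turns are condensed."""
    out = []
    for m in messages:
        if m["role"] == "assistant":
            out.append({"role": "assistant", "content": compress_fn(m["content"])})
        else:
            out.append(m)  # user intent preserved exactly
    return out
```

A trivial `compress_fn` for illustration could keep just the first sentence, e.g. `lambda s: s.split(". ")[0] + "."`.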
Selective Retention
Hierarchical Summarisation
Maintains multiple summary layers at different levels of detail.
Recent messages stay in full. Older messages get summarised. Very old summaries get summarised again, creating a hierarchy of decreasing detail.
Very long-running sessions where both recent detail and distant context matter, such as multi-day projects or ongoing research.
Retains more signal than single-pass summarisation, but each layer introduces some information loss. Multiple model calls increase cost.
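One way to sketch the tiering, again with the model call stubbed as `summarise_fn` (the tier sizes and fold rule are illustrative choices, not a standard):

```python
def hierarchical_compress(history, summarise_fn, recent_n=4, tier_size=4):
    """Tiers of decreasing detail: verbatim recent turns, first-pass
    summaries, and a coarser second-pass summary over the oldest material."""
    recent = history[-recent_n:]
    older = history[:-recent_n]
    # First pass: summarise older turns in tier-sized chunks.
    summaries = [summarise_fn(older[i:i + tier_size])
                 for i in range(0, len(older), tier_size)]
    # Second pass: fold all but the newest summaries into one coarser summary.
    if len(summaries) > tier_size:
        summaries = [summarise_fn(summaries[:-tier_size])] + summaries[-tier_size:]
    return summaries, recent
```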
Semantic Chunking
Drops messages by relevance instead of age.
Embeddings score each message against the current query. Messages most similar to the active discussion are kept, and tangential exchanges are discarded regardless of when they occurred.
Conversations that jump between topics, where the most relevant context may not be the most recent.
Requires embedding computation for each message. Relevance scoring can miss context that is important for reasons not captured by semantic similarity.
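A toy sketch of relevance-based retention. Bag-of-words vectors stand in for real embeddings here; a production system would call an embedding model instead:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a real model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def keep_relevant(messages, query, k=3):
    """Keep the k messages most similar to the query, preserving order."""
    q = embed(query)
    ranked = sorted(range(len(messages)),
                    key=lambda i: cosine(embed(messages[i]), q), reverse=True)
    return [messages[i] for i in sorted(ranked[:k])]
```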
Pinned Messages
Certain messages are marked as permanent and never dropped or summarised.
User preferences, critical decisions, key facts, and important instructions are flagged as pinned. The sliding window or summariser skips these entries. They persist even as the rest of the window moves forward.
Conversations where specific facts must survive indefinitely, such as user constraints, system boundaries, or agreed-upon decisions.
Pinned messages consume context space permanently. Too many pins can crowd out recent context and defeat the purpose of windowing.
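The skip-the-pins rule layered on a sliding window might look like this (a minimal sketch; the `pinned` flag on each message dict is an assumed convention):

```python
def window_with_pins(messages, max_messages=5):
    """Pinned messages always survive; remaining slots go to the newest turns."""
    pinned_idx = [i for i, m in enumerate(messages) if m.get("pinned")]
    unpinned_idx = [i for i, m in enumerate(messages) if not m.get("pinned")]
    slots = max(max_messages - len(pinned_idx), 0)
    keep = set(pinned_idx) | set(unpinned_idx[-slots:] if slots else [])
    return [messages[i] for i in sorted(keep)]  # original order preserved
```

Note the crowding-out tradeoff is visible in the code: every pin reduces `slots` for recent messages by one.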
Topic Segmentation
Detects topic shifts and summarises completed topics while keeping the current topic in full.
The conversation is monitored for topic boundaries using semantic similarity or explicit markers. When a topic concludes, it is compressed into a summary, while the active topic keeps its full message history.
Structured conversations that move through distinct phases, such as requirements gathering followed by implementation discussion.
Topic detection is imperfect. Misidentified boundaries can split a single topic or merge unrelated ones, degrading summary quality.
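A sketch of the segment-then-compress loop, with boundary detection and summarisation both left pluggable (a marker-based `is_boundary` is used in the example; semantic detection would slot in the same way):

```python
def segment_and_compress(messages, is_boundary, summarise_fn):
    """Summarise each completed topic; the active (last) topic stays verbatim."""
    completed, current = [], []
    for m in messages:
        if is_boundary(m) and current:
            completed.append(summarise_fn(current))  # topic concluded
            current = []
        current.append(m)
    return completed, current
```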
Architectural Approaches
Retrieval Augmented Context
Stores the full conversation in a vector database and retrieves relevant chunks on demand.
Every message is embedded and stored externally. When the agent needs context, it queries the database for messages relevant to the current turn rather than keeping everything in the context window.
Very long conversations or multi-session agents where full history far exceeds any context window.
Adds retrieval latency and embedding cost. Retrieved context may miss important messages that are semantically distant from the current query but still relevant.
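The store-then-retrieve shape can be sketched with an in-memory stand-in for the vector database. Word overlap replaces real embedding similarity here; the class and method names are illustrative:

```python
class ConversationStore:
    """Toy in-memory stand-in for a vector database."""
    def __init__(self):
        self.entries = []

    def add(self, message):
        # Real systems would store an embedding vector; we store a word set.
        self.entries.append((set(message.lower().split()), message))

    def query(self, text, k=2):
        """Return the k stored messages most relevant to the current turn."""
        q = set(text.lower().split())
        ranked = sorted(self.entries, key=lambda e: len(e[0] & q), reverse=True)
        return [msg for _, msg in ranked[:k]]
```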
Checkpoint Summarisation
Creates summaries at natural breakpoints rather than on token overflow.
Task completions, topic changes, or explicit user signals are detected, and a summary is generated at that boundary point. This produces cleaner summaries because the model summarises a completed unit of work rather than an arbitrary slice of tokens.
Task-oriented conversations with clear milestones, such as debugging sessions or step-by-step workflows.
Because summaries are generated only at boundaries, context can temporarily exceed ideal limits between checkpoints.
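The per-turn bookkeeping is a small loop. A minimal sketch, with checkpoint detection and summarisation both stubbed as callables:

```python
def record_turn(live, summaries, message, is_checkpoint, summarise_fn):
    """Append a turn; at a checkpoint, fold the completed unit into a summary."""
    live.append(message)
    if is_checkpoint(message):
        summaries.append(summarise_fn(live))  # summarise a completed unit
        live.clear()
```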
Conversation Forking
Starts a fresh context with a summary when the context overflows, while retaining the ability to query old context.
When the context limit is reached, create a comprehensive summary and begin a new context window. The old context is archived and remains queryable. The agent operates with the summary plus new messages, pulling from the archive when needed.
Open-ended, multi-hour sessions where starting fresh is acceptable as long as nothing is permanently lost.
Two-tier system adds complexity. Summary handoff can lose nuance, and archive queries add latency.
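The two tiers can be sketched as a class holding a live window and an archive. The class name, the message-count limit standing in for a token limit, and the substring archive search are all illustrative simplifications:

```python
class ForkingContext:
    """Two tiers: a live window plus a queryable archive of earlier context."""
    def __init__(self, limit, summarise_fn):
        self.limit, self.summarise_fn = limit, summarise_fn
        self.live, self.archive = [], []

    def add(self, message):
        self.live.append(message)
        if len(self.live) >= self.limit:
            self.archive.extend(self.live)          # old context stays queryable
            handoff = self.summarise_fn(self.live)  # comprehensive summary
            self.live = [{"role": "system", "content": handoff}]

    def search_archive(self, term):
        """Stand-in for a semantic archive query."""
        return [m for m in self.archive if term in m.get("content", "")]
```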
